Design of Power Efficient VLSI Arithmetic: Speed and Power Trade-offs

Vojin G. Oklobdzija, Ram Krishnamurthy
Intel AMR / ACSEL Laboratory
Intel Corp. / University of California, Davis
www.ece.ucdavis.edu/acsel

Tutorial Presentation, 16th International Symposium on Computer Arithmetic
Santiago de Compostela, SPAIN
June 18, 2003

Issues to be addressed

• How do we compare different topologies for their efficiency?

• How do we estimate the speed and efficiency of our algorithm?

• What criteria should we use when developing a new algorithm?

• How does power enter into this equation?


Additional Issues

• Determine which topology is the best for a given power or delay budget

• Determine which topology can stretch the furthest in terms of speed or power

Metric


Previously used estimates

Counting the number of gates (logic levels): not accurate

[Figure: 16-bit carry-lookahead adder with inputs a_i, b_i, carries C4, C8, C12, C16 (and C20, C24, C28 for the next group), C_in and C_out]

• Individual adders generating: g_i, p_i, and sum S_i
• Carry-lookahead blocks of 4 bits generating: G_i, P_i, and C_in for the adders
• Carry-lookahead super-blocks of 4-bit blocks generating: G*_i, P*_i, and C_in for the 4-bit blocks
• Group producing final carry C_out and C_16

Critical path delay = 1 (for g_i, p_i) + 2x2 (for G, P) + 3x2 (for C_in) + 1 XOR (for Sum) = approx. 12 units of delay


Critical path in Motorola's 64-bit CLA

Critical path: A, B - G0 - G3:0 - G15:0 - G47:0 - C48 - C60 - C63 - S63

[Figure: Motorola's 64-bit CLA: PG blocks combining bit-level G and P signals into group signals (G3:0, P3:0 through G63:0, P63:0) and a carry block producing C0, C4, C8, C12, C16, C32, C48, C52, C56, C60, C61, C62, C63, C64]


Motorola's 64-bit CLA

Modified PG Block

Intermediate propagate signals Pi:0 are generated to speed-up C3


Fan-In and Fan-Out Dependency (Oklobdzija, Barnes: IBM 1985)


Delay Comparison: Variable Block Adder (Oklobdzija, Barnes: IBM 1985)

Delay Complexity


Design Objective

• Design takes time:
– finding results afterward is not of much value

• There is a disconnect between the measures used in computer arithmetic when developing an algorithm and what is obtained after implementation
– we want estimates that are as close as possible to the measured results

• A simple tool that can evaluate different design trade-offs for a given technology is needed

• The power trade-off is the most important
– speed and power are tradable


Logical Effort Theory

• “Back of the Envelope” complexity: good for estimating speed

• Gate delay = linear function of load
– Slope: logical effort (gate driving characteristics)
– Intercept: parasitic delay (gate internal load)

• “Logical Effort” accuracy is not sufficient
– We needed to extend and refine the method
– However, that becomes more than “Back of the Envelope”

• Logical Effort does not account for possible power-delay trade-offs


Logical Effort Theory

• Excel: a platform of choice (ARITH-16)
– Simple enough
– Can provide computation quickly
– Easy to enter a given design

• Technology characterization is needed:
– This needs to be done only once: available for every design afterwards
– Domino gate = 2 stages of dynamic and static

• Different driving characteristics of these stages

• Multi-output gate (carry-look-ahead, Ling/conditional sum)

• Energy model needs to be included


Energy Motivation

• AGUs: performance and peak-current limiters
• High-activity thermal hotspot
• Goal: high-performance, energy-efficient design

[Figure: processor thermal map; labels: Execution core, Cache, AGU, 120°C; color scale: Temp (°C)]

*courtesy of Intel Corp.


Kogge-Stone Adder

• Critical path = PG + 5 + XOR = 7 gate stages
• Generate, Propagate fanout of 2, 3
• Maximum interconnect spans 16b
• Energy inefficient

[Figure: Kogge-Stone prefix tree over bit positions 0 to 31; rows: PG, carry-merge gates, XOR]


Sparse-tree Adder Architecture

• Generate every 4th carry in parallel
• Side-path: 4-bit conditional sum generator
• 73% fewer carry-merge gates
• Energy-efficient

[Figure: sparse-tree adder over bit positions 0 to 31, producing carries C3, C7, C11, C15, C19, C23, C27]


Kogge-Stone adder (8-stage)

Stage   Logical      Branch       Int. Pitch   Effective Branch    Parasitic
        Effort (G)   Effort (B)   (C)          Effort (B + I·C)    Comp.
PG      0.6          2            1            2.1                 1.3
CM0     1.48         2            2            2.2                 2.5
CM1     0.59         2            4            2.4                 1.6
CM2     1.48         2            8            2.8                 2.5
CM3     0.59         2            16           3.6                 1.6
CM4     1.48         1            0            1.0                 2.5
XOR     1.69         1            0            1.0                 3.0
Inv     1            1            0            1.0                 1.0

Path Branch Effort = ΠBi:    108.92
Path Logical Effort = ΠGi:   1.14
Path Effort:                 124.63
Path Delay (ps):             93.97

Design Parameters

Adder Pitch (um):                    10
Interconnect Cap (fF/um):            0.157
Gate Cap (fF/um):                    1.15
Avg inp. Cap/gate (um):              14
% int. to gate cap/pitch (I):        10%
Inv. L.E.:                           2.24
Parasitic delay:                     3.8

$D = 8\,(GBH)^{1/8} \cdot 2.2 + 3.8\,P$
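The Kogge-Stone numbers on this slide can be cross-checked with a few lines of Python. This is only a sketch under stated assumptions: the ratio I is taken to be the wire capacitance of one bit pitch divided by the average gate input capacitance, the effective branch effort of a stage is taken as B + I·C, and H is set to 1. The names `STAGES`, `wire_to_gate_ratio` and `path_delay_ps` are made up for this example.

```python
from math import prod

# Per-stage values from the Kogge-Stone table:
# (logical effort g, branch effort b, interconnect span c in bit pitches, parasitic p)
STAGES = {
    "PG":  (0.60, 2, 1, 1.3),
    "CM0": (1.48, 2, 2, 2.5),
    "CM1": (0.59, 2, 4, 1.6),
    "CM2": (1.48, 2, 8, 2.5),
    "CM3": (0.59, 2, 16, 1.6),
    "CM4": (1.48, 1, 0, 2.5),
    "XOR": (1.69, 1, 0, 3.0),
    "Inv": (1.00, 1, 0, 1.0),
}

def wire_to_gate_ratio(pitch_um=10.0, c_int=0.157, c_gate=1.15, avg_in_um=14.0):
    """Assumed definition of I: wire cap of one bit pitch over average gate input cap."""
    return (c_int * pitch_um) / (c_gate * avg_in_um)

def path_delay_ps(stages, i_ratio, tau_ps=2.2, p_inv_ps=3.8, h_path=1.0):
    """Return (G, B, F, D) using D = N * F**(1/N) * tau + p_inv * sum(p)."""
    n = len(stages)
    g_path = prod(g for g, _, _, _ in stages.values())
    b_path = prod(b + i_ratio * c for _, b, c, _ in stages.values())
    f_path = g_path * b_path * h_path
    p_total = sum(p for _, _, _, p in stages.values())
    delay = n * f_path ** (1.0 / n) * tau_ps + p_inv_ps * p_total
    return g_path, b_path, f_path, delay

i_ratio = wire_to_gate_ratio()
g_path, b_path, f_path, delay = path_delay_ps(STAGES, i_ratio)
print(f"I = {i_ratio:.4f}, G = {g_path:.2f}, B = {b_path:.2f}, "
      f"F = {f_path:.2f}, D = {delay:.2f} ps")
```

Under these assumptions the path branch effort, path logical effort and path effort reproduce the slide's 108.92, 1.14 and 124.63, and the delay lands within about 1 ps of the slide's 93.97 ps.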


MXA2 – Architecture & Result

• Multiplexer-based
• Generate carries using radix-2 (P,G)
• 4-bit conditional sum selected by carries
• 4-b cell width = 17 µm
• 9-stage critical path
– Per-stage effort = 3.7
– Total effort delay = 33.3
– Total parasitic = 22.5
– Total delay = 55.8

[Figure: 64-bit MXA2 organization with sixteen PG4 blocks and sixteen S4 blocks over bit groups 0..3 through 60..63; PG group circuit (G01, G23, G03, P03, P23) and 4-bit conditional-sum circuit with 1/0 multiplexers producing Sum0 to Sum3 from Cin]


HC2 – Architecture

• Generate even carries using radix-2 (P,G)
• Generate odd carries from even carries
• CMOS adder for sum
• 1-b cell width 4 µm
• 10-stage critical path

[Figure: HC2 gate-level path: (p,g) generation (XOR2, NAND2), carry-merge stages CM1 to CM6 built from NAND2/AOI and NOR2/OAI gates, odd-carry merge CMo (AOI/OAI), and XOR2 sum; even and odd bits shown separately]

[Figure: HC2 prefix tree over bit positions 0 to 63, levels L1 to L6, followed by the odd-carry and sum rows]


HC2 – Circuits & Results

[Figure: static and clocked (CK) circuits for the G and P carry-merge cells, the g/p generation cell (inputs a, b), and the sum cell (inputs P, Cin)]

          Per-Stage Effort   Total Effort Delay   Total Parasitic   Total Delay
Static    2.8τ               28.0τ                34.5τ             62.5τ


KS2 – Architecture & Results

• Generate carries using radix-2 (P,G)
• CMOS adder for sum
• Similar circuits as HC2
• 1-b cell width 4 µm
• 9-stage critical path

          Per-Stage Effort   Total Effort Delay   Total Parasitic   Total Delay
Static    3.0τ               27.0τ                30.6τ             57.6τ
Dynamic   2.11τ              19.0τ                23.6τ             42.6τ

[Figure: KS2 prefix tree over bit positions 0 to 63, levels L1 to L6, followed by the Inv and Sum rows]


KS4 – Architecture

• Generate carries using redundant radix-4 (P,G)
• Dynamic circuit
• 1-b cell width 4 µm
• 6-stage critical path

[Figure: KS4 prefix tree over bit positions 0 to 63; rows: G4/P4, G16/P16, Co, Sum]


KS4 – Circuits & Result

[Figure: clocked (CK) dynamic circuits for G4 and P4 (inputs A0 to A3, B0 to B3), G16 and P16 (inputs g0 to g3, p1 to p3), and the sum stage (signals HS, HSN, STB)]

          Per-Stage Effort   Total Effort Delay   Total Parasitic   Total Delay
Dynamic   2.3τ               13.8τ                16.3τ             30.1τ


CLA4 – Architecture

• Generate carries using radix-4 (P,G,C)
• 1-b cell width 4 µm
• 15-stage critical path

[Figure: (P,G,C) network of PGC blocks over bits b0 to b63, with Cin = C0 and carries C4, C8, ..., C60; G-path and P-path marked]


CLA4 – Circuits & Result

[Figure: clocked (CK) circuits generating G, P, K from A, B (and complements AN, BN), and the sum circuit with inputs Ci, CiN, STB, p, g]


LNG4 – Architecture

• Generate carries using Ling pseudo-carries
• Conditional sums selected by local & long carries
• 1-b cell width 5.1 µm
• 9-stage critical path

[Figure: 4-bit carry block with inputs G0 to G3, P0 to P3, C0; intermediate signals G1:0, G2:0, G3:0 and P1:0, P2:0, P3:0; outputs C1, C2, C3]


LNG4 – Circuits & Result

[Figure: clocked (CK) circuits for G3, G4, P4 (inputs A0 to A3, B0 to B3), the long-carry stage LC (inputs G0, G1, G2, P1, P2), and the conditional-sum selection producing SumH and SumL from LCH, LCL, C1H, C0H, C1L, C0L]



Results from Simulation

• Fairly consistent with logical effort analysis
• Per-stage delay
– 1.4 FO4 (static)
– 0.8 FO4 (dynamic)

Type      Adder   # Stages   LE (FO4)   SPICE (FO4)   Diff (FO4)
Static    KS2     9          11.8       10.9          -0.88
Static    MX2     9          11.4       12.8          1.41
Static    HC2     10         12.8       13.3          0.46
Dynamic   KS4     6          6.2        7.4           1.27
Dynamic   KS2     9          8.7        9.2           0.44
Dynamic   LNG4    9          9.0        9.5           0.51
Dynamic   HC2     10         9.8        9.9           0.08
Dynamic   CLA4    16         11.4       14.2          2.74

[Figure: bar chart of HSPICE delay and difference (FO4), y axis 0 to 16, for KS, CS, HC, KS-4, KS-2, Ling, HC, CLA; difference labels: 2.7, 0.1, 0.5, 0.4, 1.3, 0.5, 1.4, -0.9]


Delay of Representative 64-b Adders

[Figure: bar chart of total delay (FO4), 0 to 12, for MXA2, HC2, KS2, QTA2, KS4, LNG4; series: Static, Dynamic]


What happens when power is considered?

[Figure: energy vs. delay curves for Adder A and Adder B, with points A and B and Region 1 and Region 2 marked]


Energy-Delay Space

[Figure: energy vs. delay curves for different adders; marked: Emin, Dmin, speed barrier, power limit]


Logical Effort


Delay in a Logic Gate

Delay of a logic gate has two components:

d = f + p
(f: effort delay, or stage effort; p: parasitic delay)

f = gh
(g: logical effort; h: electrical effort = Cout/Cin, also called “fanout”)

• Logical effort describes relative ability of gate topology to deliver current (defined to be 1 for an inverter)
• Electrical effort is the ratio of output to input capacitance

*from Mathew Sanu / D. Harris


Logical Effort Parameters: Inverter

• d = gh + p
• Delay increases linearly with fanout
• More complex gates have greater g and p

[Figure: inverter delay (0 to 16) vs. fanout h = Cout/Cin (0 to 6), fitted by d = gh + p with g = 2.2 (logical effort) and p = 3.8 ps (parasitic delay)]
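As a minimal illustration of the linear model d = gh + p, the sketch below evaluates the inverter delay at a fanout of 4 using the slope and intercept quoted on this slide. The function name `gate_delay_ps` is hypothetical, and the slope is assumed to be in ps per unit of fanout.

```python
def gate_delay_ps(g_ps: float, h: float, p_ps: float) -> float:
    """Linear delay model d = g*h + p; g in ps per unit fanout, p in ps."""
    return g_ps * h + p_ps

# Inverter parameters read off the slide (g = 2.2, p = 3.8 ps).
INV_G_PS = 2.2
INV_P_PS = 3.8

# Delay of an inverter driving four copies of itself (fanout-of-4, FO4).
fo4_ps = gate_delay_ps(INV_G_PS, 4, INV_P_PS)
print(f"FO4 inverter delay = {fo4_ps:.1f} ps")
```

With these two parameters the FO4 inverter delay comes out at about 12.6 ps, which is the kind of reference value used when delays are quoted in FO4 units.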



Normalized Logical Effort: Inverter

• Define delay of unloaded inverter = 1
• Define logical effort ‘g’ of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1

[Figure: normalized delay d (1 to 6) vs. fanout h = Cout/Cin (1 to 5), split into parasitic delay and effort delay; inverter: g = 1, p = 1, d = gh + p = h + 1]



Computing Logical Effort

DEF: Logical effort is the ratio of the input capacitance of a gate to the input capacitance of an inverter delivering the same output current

• Measured from delay vs. fanout plots of simulated gates
• Or estimated, counting capacitance in units of transistor W



L.E. for Adder Gates

[Figure: delay (ps), 0 to 35, vs. fanout (0 to 6) for Inverter, Static CM, Dyn PG, Dyn CM, Mux]

• Logical effort parameters obtained from simulation for std cells
• Define logical effort ‘g’ of inverter = 1
• Delay of complex gates can be defined w.r.t. d = 1



Normalized L.E.

• Logical effort & parasitic delay normalized to that of inverter

Gate type     Logical Eff. (g)   Parasitics (Pinv)
Inverter      1                  1
Dyn. Nand     0.6                1.34
Dyn. CM       0.6                1.62
Dyn. CM-4N    1                  3.71
Static CM     1.48               2.53
Mux           1.68               2.93
XOR           1.69               2.97

*from Mathew Sanu


Delay of a string of gates

• Delay of a path: $D = \sum d_i = \sum g_i h_i + \sum p_i$

• $g_i$ and $p_i$ are constants

• To minimize path delay, optimal values of $h_i$ are to be determined

D is minimized when each stage bears the same effort, i.e. $g_i h_i = g_{i+1} h_{i+1}$



Minimizing path delay

• Logical Effort of a string of gates: $G = \prod g_i$

• Path Electrical Effort: $H = \prod h_i = C_{out(path)} / C_{in(path)}$

• Branching Effort: $b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path}$

• Path Branching Effort: $B = \prod b_i$

• Path Effort: $F = GBH$

Delay is minimized when each stage bears the same effort:

$f = g_i h_i = F^{1/N}$

The minimum delay of an N-stage path is: $D = N F^{1/N} + P$

*from Mathew Sanu / D. Harris
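The relations F = GBH, f = F^(1/N) and D = N·F^(1/N) + P translate directly into code. The sketch below is a generic illustration, not the spreadsheet used in the tutorial: the function names and the three-inverter example path are assumptions, and delays are in normalized units. The second helper walks backward from the load to find the input capacitance of each stage, using C_in,i = g_i · b_i · C_out,i / f.

```python
from math import prod

def min_path_delay(g, b, h_path, p):
    """Optimal stage effort and minimum delay of an N-stage path.

    g, b, p: per-stage logical effort, branching effort, parasitic delay.
    h_path:  path electrical effort H = C_out(path) / C_in(path).
    """
    n = len(g)
    path_effort = prod(g) * prod(b) * h_path        # F = G * B * H
    stage_effort = path_effort ** (1.0 / n)         # f = F**(1/N)
    return stage_effort, n * stage_effort + sum(p)  # D = N*f + P

def input_caps(g, b, c_load, stage_effort):
    """Size the path from the output back: C_in,i = g_i * b_i * C_out,i / f."""
    caps = []
    c_next = c_load
    for g_i, b_i in zip(reversed(g), reversed(b)):
        c_next = g_i * b_i * c_next / stage_effort
        caps.append(c_next)
    return caps[::-1]

# Hypothetical example: three inverters driving a load 64x the input capacitance.
g = [1.0, 1.0, 1.0]
b = [1.0, 1.0, 1.0]
p = [1.0, 1.0, 1.0]
f_opt, d_min = min_path_delay(g, b, 64.0, p)
caps = input_caps(g, b, c_load=64.0, stage_effort=f_opt)
print(f"f = {f_opt:.2f}, D = {d_min:.2f}, input caps = {[round(c, 2) for c in caps]}")
```

For this example every stage carries an effort of about 4, the minimum delay is about 15 normalized units, and the computed input capacitance of the first stage returns to 1, which is a useful consistency check on the sizing loop.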


Inclusion of Wire Delay into Logical Effort


Wiring Load

• Wiring in hand analysis
– Only lumped capacitance included

• Wiring in HSPICE
– Short wire: 1-segment π-model RC network
– Long wire: 4-segment π-model RC network
– Using worst-case wire capacitance

• Wire length
– Estimated from most critical 1-bit pitch


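Modeling interconnect cap.

• Include interconnect cap in branching factor

Without interconnect:

$b = (C_{on\text{-}path} + C_{off\text{-}path}) / C_{on\text{-}path} = 2$

With interconnect capacitance $C_{int}$:

$b = (C_{on\text{-}path} + C_{off\text{-}path} + C_{int}) / C_{on\text{-}path} = 2 + C_{int}/C_{on\text{-}path} = 2 + I$

I: % int. cap to gate cap in 1 adder bit pitch

[Figure: PG gate driving two CM0 gates (on-path and off-path) one adder bit pitch apart, shown without and with the wire capacitance Cint]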


Branching

[Figure: input CIN branching into two paths: gates g0, g1 (stage efforts f0, f1) driving COUT1, and gates g2, g3 (stage efforts f2, f3) driving COUT2]

Logical Effort assumes the “branching” factor of this circuit to be 2. This is incorrect and can create inaccuracies.


Correction on Branching

• f0 = f1, f2 = f3
• Td1 = (f0 + f1 + parasitics), Td2 = (f2 + f3 + parasitics)
• Minimum delay occurs when Td1 = Td2


“Real” Branching Calculation

$F_1 = g_0 g_1 C_{out1} / C_{in}$, $F_2 = g_2 g_3 C_{out2} / C_{in}$

$B_1 = (F_1 + F_2) / F_1 = (g_0 g_1 C_{out1} + g_2 g_3 C_{out2}) / (g_0 g_1 C_{out1})$

$B_2 = (F_1 + F_2) / F_2 = (g_0 g_1 C_{out1} + g_2 g_3 C_{out2}) / (g_2 g_3 C_{out2})$

Branching only equals 2 when: $g_0 g_1 C_{out1} = g_2 g_3 C_{out2}$

This explains why we had to resort to Excel!
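A short sketch of the corrected branching calculation follows. It assumes the slide's expressions read B1 = (g0·g1·Cout1 + g2·g3·Cout2) / (g0·g1·Cout1) and B2 = (g0·g1·Cout1 + g2·g3·Cout2) / (g2·g3·Cout2); that reading is a reconstruction, and the function name and the example values are hypothetical.

```python
def real_branching(g0, g1, c_out1, g2, g3, c_out2):
    """Branching factors seen by each of two two-stage branches sharing one input.

    Each branch is weighted by the product of its logical efforts and its load,
    so the branch with the heavier weight takes the larger share of the input
    capacitance when both branches are sized for equal delay.
    """
    w1 = g0 * g1 * c_out1
    w2 = g2 * g3 * c_out2
    return (w1 + w2) / w1, (w1 + w2) / w2

# Symmetric branches: the textbook value of 2 is recovered.
b1_sym, b2_sym = real_branching(1.0, 1.0, 10.0, 1.0, 1.0, 10.0)

# Asymmetric branches (hypothetical numbers): the two factors differ from 2.
b1, b2 = real_branching(1.0, 1.0, 10.0, 1.48, 1.48, 20.0)
print(f"symmetric: {b1_sym:.2f}, {b2_sym:.2f}; asymmetric: {b1:.2f}, {b2:.2f}")
```

A quick sanity check on any pair of results is that 1/B1 + 1/B2 = 1, since the two branches together account for the whole input capacitance.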


Technology Characterization


Characterization Setup

• Logical Effort requirements:
– Equalize input and output transitions.

• Logical Effort is characterized by varying the h (Cout/Cin) of a gate. By using a variable load of inverters, each gate can be characterized over the same range of loads.

• The Logical Effort of each gate is characterized for each input.

• Energy is characterized for each output transition of the gate caused by each input transition,
i.e. for an inverter, energy is measured for tLH and tHL.


LE Characterization Setup for Static Gates

[Figure: chain of gates driven from In, with a variable load on the measured gate; measured quantities: tLH, tHL, average, energy]


LE Characterization Setup for Dynamic Gates

[Figure: chain of gates driven from In, with a variable load on the measured gate; measured quantities: tHL, energy]


LE Table (Static CMOS)

• Technology: P/N Ratio = 2; INV: g = 3.67 ps, p = 4.29 ps
• Measured on worst-case single-input switching

Delay (ps) vs. fan-out:

Fan-out    INV     NAND2   NAND3   NOR2    TGXORi  TGXORs  TGMUXi  TGMUXs  AOI     OAI
2          11.6    16.3    22.2    20.5    34.9    22.3    8.0     26.0    23.2    21.3
3          15.3    20.0    26.6    25.4    42.6    28.2    9.9     33.0    28.5    26.7
4          19.0    24.0    31.2    30.6    50.2    34.2    12.0    39.0    34.1    32.1
6          26.4    32.4    40.6    41.1    64.4    45.7    16.0    53.0    45.3    43.6
8          33.6    40.6    50.0    51.9    79.8    56.5    20.2    68.0    56.7    55.3

g (ps)     3.67    4.08    4.65    5.25    7.43    5.71    2.04    6.97    5.60    5.68
p (ps)     4.29    7.90    12.74   9.77    20.19   11.12   3.85    11.76   11.82   9.69

g (norm)   1.00    1.11    1.27    1.43    2.03    1.56    0.55    1.90    1.52    1.55
p (norm)   1.00    1.84    2.97    2.28    4.71    2.59    0.90    2.74    2.76    2.26
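The g and p rows of this table are the slope and intercept of a straight line through the delay vs. fan-out points. The sketch below recovers them for INV and NAND2 with an ordinary least-squares fit; the fitting method is an assumption (the slides do not say how the line was fitted), and `fit_line` is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Delay (ps) vs. fan-out, copied from the static CMOS table.
FANOUT = [2, 3, 4, 6, 8]
DELAY_PS = {
    "INV":   [11.6, 15.3, 19.0, 26.4, 33.6],
    "NAND2": [16.3, 20.0, 24.0, 32.4, 40.6],
}

fits = {gate: fit_line(FANOUT, delays) for gate, delays in DELAY_PS.items()}
g_inv, p_inv = fits["INV"]
for gate, (g_ps, p_ps) in fits.items():
    # Normalize to the inverter, as in the "(norm)" rows of the table.
    print(f"{gate}: g = {g_ps:.2f} ps, p = {p_ps:.2f} ps, "
          f"g_norm = {g_ps / g_inv:.2f}, p_norm = {p_ps / p_inv:.2f}")
```

The fitted values agree with the tabulated g = 3.67 ps, p = 4.29 ps for the inverter and g = 4.08 ps, p = 7.90 ps for NAND2 to within rounding.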


Static CMOS Gates: Delay Graphs

[Figure: delay (0 to 90) vs. fanout (0 to 9) for INV, NAND2, NAND3, NOR2, AOI, OAI]

[Figure: delay (0 to 90) vs. fanout (0 to 9) for INV, TGXORi, TGXORs, TGMUXi, TGMUXs]


Static Gates: Pull-up Delay Graph

[Figure: pull-up delay (0 to 70) vs. fanout (0 to 9) for INV, NAND2, NAND3, NOR2, AOI, OAI]


LE Table (Dynamic CMOS)

• Technology:
• Minimum-sized keeper included
• Measured on all-input switching of worst path

Delay (ps) vs. fan-out:

Fan-out    DN2     DN3     DN4     Dk1ND2  Dk1NR2  DAOI_A  DOAI_O
2          9.9     12.7    16.0    13.7    10.6    10.1    8.8
3          12.6    14.7    19.1    16.7    13.2    12.1    11.3
4          16.0    18.3    23.2    20.7    16.7    14.7    14.0
6          21.7    24.7    30.2    27.9    23.2    20.0    19.2
8          27.3    31.2    37.8    36.1    29.5    24.8    24.0

g (ps)     2.92    3.15    3.65    3.75    3.19    2.49    2.55
p (ps)     4.04    5.82    8.46    5.76    3.95    4.86    3.75

g (norm)   0.80    0.86    1.00    1.02    0.87    0.68    0.69
p (norm)   0.94    1.36    1.97    1.34    0.92    1.13    0.87
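The normalized rows of the dynamic table appear to be the ps values divided by the static inverter's g = 3.67 ps and p = 4.29 ps from the static CMOS table. That reading is an inference from the numbers rather than something the slides state; the sketch below checks it. The dictionary names are hypothetical.

```python
# Static inverter reference values from the static CMOS LE table.
INV_G_PS = 3.67
INV_P_PS = 4.29

# (g in ps, p in ps) for each dynamic gate, from the dynamic CMOS LE table.
DYNAMIC_PS = {
    "DN2":    (2.92, 4.04),
    "DN3":    (3.15, 5.82),
    "DN4":    (3.65, 8.46),
    "Dk1ND2": (3.75, 5.76),
    "Dk1NR2": (3.19, 3.95),
    "DAOI_A": (2.49, 4.86),
    "DOAI_O": (2.55, 3.75),
}

# Normalized (g, p) values as printed in the table, for comparison.
TABLE_NORM = {
    "DN2":    (0.80, 0.94),
    "DN3":    (0.86, 1.36),
    "DN4":    (1.00, 1.97),
    "Dk1ND2": (1.02, 1.34),
    "Dk1NR2": (0.87, 0.92),
    "DAOI_A": (0.68, 1.13),
    "DOAI_O": (0.69, 0.87),
}

normalized = {
    gate: (g_ps / INV_G_PS, p_ps / INV_P_PS)
    for gate, (g_ps, p_ps) in DYNAMIC_PS.items()
}

# Largest disagreement between the recomputed and the tabulated values.
worst = max(
    max(abs(normalized[gate][0] - g_n), abs(normalized[gate][1] - p_n))
    for gate, (g_n, p_n) in TABLE_NORM.items()
)
print(f"largest deviation from the table: {worst:.3f}")
```

Under this assumption the recomputed values agree with the table to within 0.01, the residual being consistent with the ps values having been rounded before printing.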


Dynamic CMOS: Delay Graphs

[Figure: delay curves (x axis 0 to 10, y axis 0 to 40) for N2, N3, N4, k1ND2, k1NR2, AOI_A, OAI_O]

[Figure: delay curves (x axis 0 to 10, y axis 0 to 40) for G4, P4, C4, STB, Sum]

[Figure: delay curves (x axis 0 to 10, y axis 0 to 50) for LG3, LP4, G4, P4, LC, Lsum]

[Figure: delay curves (x axis 0 to 10, y axis 0 to 50) for KSG4, KSP4, KSG16, KSP16, KSSum]


Energy Calculation

[Figure: energy characterization of an 8X minimal size Dyn-NAND and a 16X minimal size Dyn-NAND]


Energy Calculation: Offset (parasitic + wiring energy) vs. Size (in multiples of the gate size)

[Figure: offset (0 to 60) vs. gate size x (0 to 45) with linear fits for inv, dgck, oai_o, daoi, tgxor, aoi_o, na2s, tgmuxs]

Linear fits shown on the plot:
y = 0.8931x + 4.6411
y = 1.1413x + 10.22
y = 1.6382x + 11.988
y = 0.5538x + 12.338
y = 3.89x + 14.5
y = 1.9595x + 9.621
y = 1.2559x + 6.762
y = 1.0592x + 1.71


Energy Calculation

[Figure: inverter energy [fJ] (0 to 140) as a function of load [u] (12, 18, 24, 36, 48) and size (2.5, 5, 7.5, 10)]


Energy Calculation

INV

      Output Capacitance (u)                  Energy [fJ]
M     1      5      10     15     20          1         5         10        15        20
0     1.12   5.6    11.2   16.8   22.4        2.51E+00  1.26E+01  2.51E+01  3.77E+01  5.02E+01
1     2.24   11.2   22.4   33.6   44.8        3.70E+00  1.85E+01  3.70E+01  5.54E+01  7.39E+01
2     3.36   16.8   33.6   50.4   67.2        4.85E+00  2.42E+01  4.85E+01  7.27E+01  9.70E+01
3     4.48   22.4   44.8   67.2   89.6        6.16E+00  3.08E+01  6.16E+01  9.24E+01  1.23E+02
4     5.6    28     56     84     112         7.45E+00  3.73E+01  7.45E+01  1.12E+02  1.49E+02
5     6.72   33.6   67.2   100.8  134.4       8.74E+00  4.37E+01  8.74E+01  1.31E+02  1.75E+02
6     7.84   39.2   78.4   117.6  156.8       1.02E+01  5.08E+01  1.02E+02  1.52E+02  2.03E+02
7     8.96   44.8   89.6   134.4  179.2       1.15E+01  5.75E+01  1.15E+02  1.72E+02  2.30E+02
8     10.08  50.4   100.8  151.2  201.6       1.27E+01  6.36E+01  1.27E+02  1.91E+02  2.54E+02
9     11.2   56     112    168    224         1.42E+01  7.08E+01  1.42E+02  2.13E+02  2.83E+02
10    12.32  61.6   123.2  184.8  246.4       1.55E+01  7.76E+01  1.55E+02  2.33E+02  3.10E+02
11    13.44  67.2   134.4  201.6  268.8       1.69E+01  8.44E+01  1.69E+02  2.53E+02  3.37E+02
12    14.56  72.8   145.6  218.4  291.2       1.81E+01  9.05E+01  1.81E+02  2.71E+02  3.62E+02
13    15.68  78.4   156.8  235.2  313.6       1.97E+01  9.85E+01  1.97E+02  2.96E+02  3.94E+02
14    16.8   84     168    252    336         2.09E+01  1.04E+02  2.09E+02  3.13E+02  4.18E+02
15    17.92  89.6   179.2  268.8  358.4       2.26E+01  1.13E+02  2.26E+02  3.39E+02  4.52E+02
16    19.04  95.2   190.4  285.6  380.8       2.39E+01  1.20E+02  2.39E+02  3.59E+02  4.79E+02
17    20.16  100.8  201.6  302.4  403.2       2.53E+01  1.27E+02  2.53E+02  3.80E+02  5.06E+02
18    21.28  106.4  212.8  319.2  425.6       2.67E+01  1.34E+02  2.67E+02  4.01E+02  5.34E+02
19    22.4   112    224    336    448         2.81E+01  1.40E+02  2.81E+02  4.21E+02  5.61E+02
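The inverter table has a simple structure that a lookup function can exploit: within each row the output capacitance and the energy both scale linearly with the column heading. The sketch below assumes the headings 1, 5, 10, 15, 20 are gate sizes and that M is the row index of the load step; both are readings of the table, not statements from the slides. `UNIT_ENERGY_FJ` holds the size-1 energy column, and the function names are hypothetical.

```python
# Output capacitance (u) of the size-1 inverter at M = 0, from the table.
UNIT_CAP_U = 1.12

# Energy [fJ] of the size-1 inverter for M = 0..19, from the table.
UNIT_ENERGY_FJ = [
    2.51, 3.70, 4.85, 6.16, 7.45, 8.74, 10.2, 11.5, 12.7, 14.2,
    15.5, 16.9, 18.1, 19.7, 20.9, 22.6, 23.9, 25.3, 26.7, 28.1,
]

def output_cap_u(size: float, m: int) -> float:
    """Output capacitance for a gate of the given size at load step m."""
    return UNIT_CAP_U * (m + 1) * size

def energy_fj(size: float, m: int) -> float:
    """Energy estimate: size-1 energy at load step m, scaled linearly by size."""
    return size * UNIT_ENERGY_FJ[m]

print(f"size 10, M = 4: C = {output_cap_u(10, 4):.1f} u, E = {energy_fj(10, 4):.1f} fJ")
print(f"size 20, M = 19: C = {output_cap_u(20, 19):.1f} u, E = {energy_fj(20, 19):.0f} fJ")
```

Scaling the size-1 column this way reproduces the other energy columns to within about 1%, the difference coming from the three-digit rounding of the tabulated values.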

Multiplier Factor

Energy Factors

1.211300121 7.39E-01

Output Capacitance Factor

NAND-2


Examples


64-Bit Adders

• Han-Carlson (prefix-2, HC2): Static and Dynamic
• Han-Carlson (prefix-2, HC2-2): Dynamic-Static
• Kogge-Stone (prefix-2, KS2): Static and Dynamic
• Kogge-Stone (prefix-2, KS2-2): Dynamic-Static
• Quaternary-Tree (prefix-2, QT2): Static and Dynamic

Included wire delay: $t_{delay} = 0.7\,R_{wire} C_{wire}$

Included wire energy: $E_w = C_{wire} V^2$

Len (um)     10    20    30    40    60    80    120   160   240   320   480
Delay (ps)   0.01  0.04  0.09  0.17  0.38  0.67  1.50  2.67  6.01  10.7  24.1
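Because both R_wire and C_wire grow linearly with length, the wire delay 0.7·R_wire·C_wire grows with the square of the length. The sketch below fits a single coefficient k in delay = k·L² to the table by least squares through the origin. The fitting approach and the names are assumptions made for illustration; the slides give only the formula and the table.

```python
# Wire length (um) and delay (ps), copied from the table.
LENGTH_UM = [10, 20, 30, 40, 60, 80, 120, 160, 240, 320, 480]
DELAY_PS = [0.01, 0.04, 0.09, 0.17, 0.38, 0.67, 1.50, 2.67, 6.01, 10.7, 24.1]

def fit_quadratic_coeff(lengths, delays):
    """Least-squares k for delay = k * L**2 (no constant or linear term)."""
    num = sum(d * l ** 2 for l, d in zip(lengths, delays))
    den = sum(l ** 4 for l in lengths)
    return num / den

def wire_delay_ps(length_um: float, k: float) -> float:
    """Distributed-RC wire delay under the quadratic model."""
    return k * length_um ** 2

k = fit_quadratic_coeff(LENGTH_UM, DELAY_PS)
errors = [abs(wire_delay_ps(l, k) - d) for l, d in zip(LENGTH_UM, DELAY_PS)]
print(f"k = {k:.3e} ps/um^2, worst table error = {max(errors):.3f} ps")
```

A single coefficient of roughly 1.05e-4 ps/um² reproduces every entry of the table to within a few hundredths of a picosecond, which confirms the quadratic dependence implied by the formula.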


Test Setup

[Figure: 64-bit adder with inputs A0 to A63 and outputs S0 to S63; each output drives Cwire (1mm wire)]

H = (Cin + Cwire)/Cin


Energy-Delay Estimates


Adders: Energy

Energy vs. Delay
Cout = 1mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input

[Figure: energy [pJ] (0 to 900) vs. delay [ps] (0 to 300) for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary Dynamic (2-2), Quaternary Static; curve groups labeled Dynamic: KS, HC; Static; Dynamic-Static; QT, KS, HC]


Dynamic-Static Implementation of Carry-Merge Stage

[Figure: regular domino implementation vs. compound-domino implementation of the carry-merge stage; inputs Gi, Gi-1, Pi, Pi-1, Gi-2, Gi-3, Pi-2; clocks Clk and Delayed Clk; labels: Static Gate, inverters to be eliminated]


Energy-Delay comparison of 64-bit KS, HC and QT adders

[Figure: normalized energy (0 to 3) vs. normalized delay (0.9 to 2.1) for QT Static, HC Static, KS Static, QT compound-domino, HC compound-domino, KS compound-domino]


Adders: Critical Path Energy

Critical Path Energy vs. Delay (no internal wire energy)
Cout = 1mm wire (160u gate cap)
For Cin = ~minimum input to 50*minimum input

[Figure: energy [fJ] (0 to 12000) vs. delay [ps] (0 to 300) for HC Dynamic (2-2), KS Dynamic (2-0), HC Dynamic (2-0), KS Dynamic (2-2), KS Static Prefix 2, HC Static Prefix 2, Quaternary (2-2), Quaternary Static (2-2); curve labels: QT dynamic-static, HC dynamic-static, KS dynamic-static, QT static, HC-static, KS-static, HC-dynamic, KS dynamic]


Intel 32-bit Adder 0.13u 1.2V [VLSI-2002]

Comparison with Intel Measured Data

[Figure: energy [fJ] (0 to 50) vs. delay [ps] (0 to 200) for Kogge-Stone (2-0), Quaternary (2-2), Intel Kogge-Stone (2-0), Intel Quaternary (2-2); curve labels: QT, KS, KS estimated, QT estimated]


Energy-Delay comparison of 32-bit QT and KS adders: estimated vs. simulation in 0.10 µm technology

[Figure: energy [pJ] (0 to 60) vs. delay [ps] (90 to 160) for KS [9], QT [9], KS Estimate, QT Estimate; annotations: 55%, 35%]


Est. Results: All Adders w/o Wires

[Figure: estimated energy (J), 0 to 1E-10, vs. delay (FO4), 7 to 15, for sKS, sHC, sQT9, dKS, dHC, dQT9, dQT7, dCLA, dIBM, dLNG]


Est. Results: All Adders w/ Wires

[Figure: estimated energy (J), 0 to 2.0E-10, vs. delay (FO4), 8 to 18, for sKS_LE, sHC_LE, sQT9_LE, dKS_LE, dHC_LE, dQT9_LE, dQT7_LE, dIBM_LE, dLNG_LE]


Conclusion

• Using realistic measures for comparing various designs leads to better design choices

• Power is as important as speed

• Making comparison in Energy-Delay space is necessary:
– power can always be traded for speed and vice versa

• Wire effects are significant

• Leakage currents?
