A 240ps 64b Carry-Lookahead Adder in 90nm CMOS Faezeh Montazeri fmontazeri@ece.ut.ac.ir Advanced...


A 240ps 64b Carry-Lookahead Adder in 90nm CMOS

Faezeh Montazeri
fmontazeri@ece.ut.ac.ir

Advanced VLSI Course Presentation
University of Tehran

December 2006

Based on: A 240ps 64b Carry-Lookahead Adder in 90nm CMOS

Sean Kao, Radu Zlatanovici, Borivoje Nikolić
University of California, Berkeley

What Is an Optimal Adder?

Optimal adder:
• Minimum delay for given energy
• Minimum energy for given delay

[Figure: 64-bit Adders on IEEE Xplore 1995-2005. Normalized energy [r.u.] vs. normalized delay [90nm 1V FO4]; legend: 500 nm, 350 nm, 250 nm, 180 nm, 130 nm, 90 nm] [1]
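The two-sided definition of optimality (minimum delay for a given energy, minimum energy for a given delay) amounts to keeping only the designs that no other design beats on both axes. A minimal sketch of that filter follows; the design names and the (delay, energy) values are hypothetical placeholders, not data from the slides.

```python
def pareto_front(designs):
    """Return the designs that are not dominated in (delay, energy).

    Each design is a (name, delay, energy) tuple; lower is better on both axes.
    """
    front = []
    best_energy = float("inf")
    # Walk from fastest to slowest; a design survives only if it uses
    # strictly less energy than every faster (or equally fast) design.
    for name, delay, energy in sorted(designs, key=lambda d: (d[1], d[2])):
        if energy < best_energy:
            front.append((name, delay, energy))
            best_energy = energy
    return front


# Hypothetical design points: (name, delay in FO4, energy in pJ).
designs = [
    ("A", 8.0, 30.0),
    ("B", 10.0, 18.0),
    ("C", 10.0, 25.0),
    ("D", 12.0, 20.0),
    ("E", 13.0, 12.0),
]
front_names = [name for name, _, _ in pareto_front(designs)]
print(front_names)
```

Designs C and D drop out because B is at least as fast and uses less energy than either of them.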

This Work

Multi-issue 64-bit microprocessor environment:

• Optimize a set of representative 64-bit adders in the energy-delay space
• Analyze the design tradeoffs
• Implement the optimal adder in 1.0V 90nm GP CMOS

Outline

• Energy – delay optimization

• Design tradeoffs for 64-bit adders

• Test chip implementation

• Measured results

• Summary

Energy-Delay Optimization

• Goal: obtain the energy-delay optimal adder
• CAD tool: optimize custom digital circuits in the energy-delay space [3]

[Figure: energy vs. delay curves for a domino CLA adder and a static CLA adder] [1]

Circuit Optimization Framework

[Figure: optimization framework block diagram. Inputs: models, netlist, optimization goal, variables. Optimization core: optimizer (Matlab) and static timer (C++), exchanging design variables and delay/energy. Output: optimal design.] [1]

Adder Optimization Setup

Minimize DELAY subject to maximum ENERGY

• CL = 27 fF
• CIN ≤ 27 fF
• tSLOPE ≤ 100 ps

[Figure: adder block diagram with inputs A, B, Cin; generate subtree, propagate subtree, G64 (carry), sum precompute (S0, S1), MUX and SUM output; critical and non-critical paths marked] [1]
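To illustrate the form of the problem (minimize delay subject to an energy cap), here is a toy sketch that sizes a three-inverter chain by brute-force grid search. It is not the optimization framework used in the work: the logical-effort-style delay model, the switched-capacitance energy proxy, the 3 fF input capacitance and the energy caps are all assumptions chosen for illustration. Only the 27 fF load is taken from the slide.

```python
from itertools import product

P_INV = 1.0  # assumed parasitic delay per inverter, in units of tau


def chain_delay(caps, c_load):
    """Sum of stage efforts plus parasitics for an inverter chain (toy model)."""
    nodes = list(caps) + [c_load]
    return sum(nodes[i + 1] / nodes[i] + P_INV for i in range(len(caps)))


def chain_energy(caps, c_load):
    """Energy proxy: total switched capacitance in fF (the V^2 factor is dropped)."""
    return sum(caps) + c_load


def min_delay_under_energy(c_in, c_load, e_max, step=0.25, c_max=30.0):
    """Grid-search the two free stage sizes; return (delay, caps) or None."""
    grid = [step * k for k in range(1, int(c_max / step) + 1)]
    best = None
    for c1, c2 in product(grid, repeat=2):
        caps = (c_in, c1, c2)
        if chain_energy(caps, c_load) > e_max:
            continue  # violates the energy constraint
        delay = chain_delay(caps, c_load)
        if best is None or delay < best[0]:
            best = (delay, caps)
    return best


# Hypothetical setup: 3 fF first stage driving the 27 fF load.
loose = min_delay_under_energy(3.0, 27.0, e_max=1000.0)  # cap not binding
tight = min_delay_under_energy(3.0, 27.0, e_max=40.0)    # cap binding
print(loose, tight)
```

With a loose cap the search lands near the geometric sizing that logical effort predicts. Tightening the cap forces smaller stages and a longer delay, which is the tradeoff an energy-delay curve traces out.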

CLA: Full Tree Comparison

Radix-2
• 6 stages
• Moderate branching

Radix-4
• 3 stages
• Larger branching

Radix-4 closer to optimum number of stages

[Figure: energy [pJ] vs. delay [FO4]; curves: R2 CLA, R4 CLA] [1]
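The stage counts follow from the tree depth needed to span 64 bits when each stage merges radix-many groups. A short sketch of that arithmetic (the function name is illustrative):

```python
def carry_tree_stages(width, radix):
    """Number of merge stages a full carry tree needs to span `width` bits."""
    stages, span = 0, 1
    while span < width:
        span *= radix   # each stage widens the covered group by the radix
        stages += 1
    return stages


r2_stages = carry_tree_stages(64, 2)
r4_stages = carry_tree_stages(64, 4)
print(r2_stages, r4_stages)
```

This reproduces the 6 stages of the radix-2 tree and the 3 stages of the radix-4 tree.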

CLA vs. Ling

Conventional CLA
• Higher stack in first stage
• Simple sum precompute

\[ G_{3:0} = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0 \]
\[ S_i = a_i \oplus b_i \oplus G_{i-1} \]

Ling CLA [2]
• Lower stack in first stage
• Complex sum precompute
• Higher speed

\[ H_{3:0} = g_3 + g_2 + t_2 g_1 + t_2 t_1 g_0 \]
\[ S_i = \overline{H_{i-1}}\,(a_i \oplus b_i) + H_{i-1}\,\bigl(a_i \oplus b_i \oplus (a_{i-1} + b_{i-1})\bigr) \]

[Figure: energy [pJ] vs. delay [FO4]; curves: R2 Ling, R2 CLA, R4 Ling, R4 CLA] [1]
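The relationship between the conventional group carry G[3:0] and Ling's pseudo-carry H[3:0] can be checked exhaustively for a 4-bit group. The sketch below assumes the usual definitions g_i = a_i AND b_i, p_i = a_i XOR b_i and t_i = a_i OR b_i, which the slide does not spell out, and no carry-in to the group. It verifies that G[3:0] = t_3 · H[3:0] and that the conventional and Ling sum expressions both give the correct sum bit 4.

```python
from itertools import product


def bits_of(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def group_carry_cla(a, b):
    """Conventional group generate G[3:0] for bit lists a, b (index 0 = LSB)."""
    g = [x & y for x, y in zip(a, b)]
    p = [x ^ y for x, y in zip(a, b)]
    return g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])


def group_pseudo_carry_ling(a, b):
    """Ling pseudo-carry H[3:0]; note the shorter first term (no t3 factor)."""
    g = [x & y for x, y in zip(a, b)]
    t = [x | y for x, y in zip(a, b)]
    return g[3] | g[2] | (t[2] & g[1]) | (t[2] & t[1] & g[0])


def check_ling():
    """Exhaustively compare both formulations against integer addition."""
    for a, b in product(range(32), repeat=2):
        ab, bb = bits_of(a, 5), bits_of(b, 5)
        G = group_carry_cla(ab[:4], bb[:4])
        H = group_pseudo_carry_ling(ab[:4], bb[:4])
        t3 = ab[3] | bb[3]
        x4 = ab[4] ^ bb[4]
        carry_ref = ((a & 0xF) + (b & 0xF)) >> 4   # true carry out of bit 3
        s4_ref = ((a + b) >> 4) & 1                # true sum bit 4
        s4_cla = x4 ^ G
        s4_ling = ((1 - H) & x4) | (H & (x4 ^ t3))
        if not (G == carry_ref == (t3 & H) and s4_cla == s4_ref == s4_ling):
            return False
    return True


ling_ok = check_ling()
print(ling_ok)
```

The check makes the tradeoff concrete: H drops one factor from each product term, giving a lower stack in the first stage, and the missing t term moves into the sum precompute.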

Full vs. Sparse Comparison

Sparseness benefits adders with large carry trees

[Figure: Ling CLA carry trees: FULL, SP2, SP4. Energy [pJ] vs. delay [FO4]; curves: R2 FULL, R2 SP2, R2 SP4, R4 FULL, R4 SP2, R4 SP4] [1]

        SP2   SP4
R2      +     +
R4      +     -

Optimal Adder

• Ling’s equations
• Radix-4 sparse-2
• Domino carry tree
• Static sum-precompute
• Delay of fastest adder: 7.3 FO4

[Figure: energy [pJ] vs. delay [FO4]; curves: R2 FULL, R2 SP2, R2 SP4, R4 FULL, R4 SP2, R4 SP4] [1]

Radix-4 Sparse-2 Carry Tree

• Computes every other Ling pseudo-carry: H0, H2, H4 …
• Each output selects two sums

[Figure: carry tree spanning inputs (A0, B0) … (A63, B63) and Cin, with outputs s0 … s63 and Cout. Legend: G/T gates, H/I gates (H4/I4, H16/I16), H gate (H64), SUMSEL MUX.] [1]
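The sparse-2 idea can be sketched behaviorally: only one carry per 2-bit group is computed, and that carry selects between two precomputed 2-bit sums. This sketch is a simplification that uses true carries, derived here by integer arithmetic, in place of the Ling pseudo-carries and the domino tree. The function name and structure are illustrative only.

```python
def sparse2_add(a, b, cin=0, width=64):
    """Behavioral sparse-2 adder: one carry per 2-bit group selects a sum pair.

    Assumes 0 <= a, b < 2**width and cin in (0, 1).
    """
    result = 0
    for lo in range(0, width, 2):
        below = (1 << lo) - 1
        # Carry into bit `lo`; stands in for one output of the sparse carry tree.
        carry = ((a & below) + (b & below) + cin) >> lo
        a2, b2 = (a >> lo) & 3, (b >> lo) & 3
        sum0 = (a2 + b2) & 3       # precomputed for carry-in = 0
        sum1 = (a2 + b2 + 1) & 3   # precomputed for carry-in = 1
        result |= (sum1 if carry else sum0) << lo
    cout = (a + b + cin) >> width
    return result, cout


MASK64 = (1 << 64) - 1
example = sparse2_add(MASK64, 1)
print(example)
```

Only 32 carries are needed for 64 bits, which is why the sparse tree frees layout area that the sum precompute can occupy.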

Adder Core Block Diagram

• Critical paths implemented in clock-delayed domino
• Non-critical paths implemented in static
• At-speed BIST

[Figure: adder core block diagram. Blocks: inputs, TG, H4/I4, H16/I16, H64 (with H64'), sum precompute (S0, S1, buffer), sum select MUX, sum; clock generator producing pc1, pc2, pc3, pc4, psel; scan chains (scan_in), MUX, output FF, comparator, out. Legend: footed domino, footless domino, static CMOS, hard edge.] [1]

Timing Diagram

• 20 ps margin on all edges; adjustable hard edges
• Delay spread places precharge in critical path

[Figure: timing diagram of pc1, pc2, pc3, pc4, psel, H64 and H64' over TCYCLE, with the hard edge marked. Duty cycles listed: 24%, 43%, 53%, 53%, 45%.] [1]

Layout Floorplan

• Bitslice height: 24 metal tracks
• Aligned clock lines
• Sum precompute occupies space freed by sparse carry tree

[Figure: bitslice floorplan, 24 tracks high, with columns aligned to pc1, pc2, pc3, pc4, psel. Cells: TG, H4, I4, H16, I16, H64, J0, J1, K1, XOR2, SUM SELECT. Legend: every bitslice; sparse-2 carry tree; sparse-2 sum precomp.] [1]

90 nm Test Chip

• 90 nm GP 7M 1P
• SVT transistors
• VDD = 1V
• 8 adder cores + test circuitry
• Core 1: this work
• Cores 2-8: supply noise measurements and supply grid experiments [4]
• Adder core size: 417 × 75 µm²

[Figure: die photo, 1.7 mm × 1.6 mm, showing ADDER CORE 1, CORE 2 through CORE 8, TEST IN, TEST OUT and CK GEN] [1]


Chip Packaging

Chip-on-board:
• Bond wires 60% shorter
• Cleaner supply
• 10 ps shorter delays

[Figure labels: Advance Program, Digest] [1]

Measured Results: Delay

CHIP-ON-BOARD:

• VDD = 1 V

– Average: 240 ps

– Fastest: 226 ps

• VDD = 1.3 V

– Average: 180 ps

Davg = 7.5 FO4

[1]
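The measured figures imply an FO4 delay for this process. The short calculation below assumes that Davg = 7.5 FO4 refers to the 240 ps average at VDD = 1 V; the slide does not state this pairing explicitly. The 1.3 V result is left unconverted because the FO4 delay itself changes with supply voltage.

```python
avg_delay_ps = 240.0      # measured average delay at VDD = 1 V
fastest_delay_ps = 226.0  # fastest measured delay at VDD = 1 V
avg_delay_fo4 = 7.5       # average delay expressed in FO4 (assumed to be at 1 V)

# Implied delay of one fanout-of-4 inverter at 1 V.
fo4_ps = avg_delay_ps / avg_delay_fo4

# Fastest measured part expressed in the same FO4 unit.
fastest_delay_fo4 = fastest_delay_ps / fo4_ps

print(f"FO4 = {fo4_ps:.1f} ps, fastest = {fastest_delay_fo4:.2f} FO4")
```

Under that assumption one FO4 is 32 ps, so the fastest measured part runs at roughly 7.1 FO4.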

Measured Results: Power

• VDD = 1V: Pmax = 260 mW
• VDD = 1.3V: Pmax = 606 mW

[Figure: power breakdown: adder core, clk gen, BIST, leakage] [1]

Conclusion

• 90 nm GP 7M 1P

• SVT transistors

• VDD = 1V

• 8 adder cores + test circuitry

• Adder core size: 417 × 75 µm²

Summary

• Ling radix-4 sparse-2 domino carry tree
• 90nm GP CMOS: 240ps, 260mW @1V

[Figure: 64-bit Adders on IEEE Xplore 1995-2005. Normalized energy [r.u.] vs. normalized delay [90nm 1V FO4]; legend: 500 nm, 350 nm, 250 nm, 180 nm, 130 nm, 90 nm, this work] [1]

References

• [1] S. Kao, R. Zlatanovici, B. Nikolić, “A 240ps 64b Carry-Lookahead Adder in 90nm CMOS,” ISSCC 2006, Feb. 2006.

• [2] H. Ling, “High Speed Binary Adder,” IBM J. R&D, vol. 25, no. 3, pp. 156-166, May 1981.

• [3] R. Zlatanovici, B. Nikolić, “Power-Performance Optimization for Custom Digital Circuits,” Proc. PATMOS, pp. 404-414, Sept. 2005.

• [4] V. Abramzon, E. Alon, M. Horowitz, Stanford University.
