Download pdf - Lecture 17: Adders - Bangladesh University of Engineering ...teacher.buet.ac.bd/abmhrashid/Vlsi2AddersMult.pdf · 17: Adders CMOS VLSI DesignCMOS VLSI Design 4th Ed. 2 Outline Single-bit

Lecture 17: Adders

CMOS VLSI DesignCMOS VLSI Design 4th Ed.17: Adders 2

Outline Single-bit Addition Carry-Ripple Adder Carry-Skip Adder Carry-Lookahead Adder Carry-Select Adder Carry-Increment Adder Tree Adder


Single-Bit AdditionHalf Adder Full Adder

A B Cout S0 0 0 00 1 0 11 0 0 11 1 1 0

A B C Cout S0 0 0 0 00 0 1 0 10 1 0 0 10 1 1 1 01 0 0 0 11 0 1 1 01 1 0 1 01 1 1 1 1

A B

S

Cout

A B

C

S

Cout

out

S A BC A B

� out ( , , )

S A B CC MAJ A B C


PGK For a full adder, define what happens to carries

(in terms of A and B)– Generate: Cout = 1 independent of C

• G = A • B– Propagate: Cout = C

• P = A B– Kill: Cout = 0 independent of C

• K = ~A • ~B


Full Adder Design I Brute force implementation from eqns

out ( , , )S A B C

C MAJ A B C

ABC

S

Cout

MAJ

ABC

A

B BB

A

CS

C

CC

B BB

A A

A B

C

B

A

CBA A B C

Cout

C

A

A

BB


Full Adder Design II Factor S in terms of Cout

S = ABC + (A + B + C)(~Cout) Critical path is usually C to Cout in ripple adder

SS

Cout

A

BC

Cout

MINORITY


Layout Clever layout circumvents usual line of diffusion

– Use wide transistors on critical path– Eliminate output inverters


Full Adder Design III Complementary Pass Transistor Logic (CPL)

– Slightly faster, but more area

A

C

S

S

B

B

C

C

C

B

B Cout

Cout

C

C

C

C

B

B

B

B

B

B

B

B

A

A

A


Full Adder Design IV Dual-rail domino

– Very fast, but large and power hungry– Used in very fast multipliers

Cout _h

A_h B_h

C_h

B_h

A_h

Cout _l

A_l B_l

C_l

B_l

A_l

S_hS_l

A_h

B_h B_hB_l

A_l

C_lC_h C_h


Carry Propagate Adders N-bit adder called CPA

– Each sum bit depends on all previous carries– How do we compute all these carries quickly?

+

BN...1AN...1

SN...1

CinCout 11111 1111 +0000 0000

A4...1

carries

B4...1

S4...1

CinCout

00000 1111 +0000 1111

CinCout


Carry-Ripple Adder Simplest design: cascade full adders

– Critical path goes from Cin to Cout

– Design full adder to have fast carry delay

CinCout

B1A1B2A2B3A3B4A4

S1S2S3S4

C1C2C3

CMOS VLSI DesignCMOS VLSI Design 4th Ed.

Inversion Property

A B

S

CoCi FA

A B

S

CoCi FA


Minimize Critical Path by Reducing Inverting Stages

A0 B0

S0

Co,0Ci,0

A1 B1

S1

Co,1

A2 B2

S2

Co,2 Co,3FA’ FA’ FA’ FA’

A3 B3

S3

Odd CellEven Cell

Exploit Inversion Property

Note: need 2 different types of cells


Inversions Critical path passes through majority gate

– Built from minority + inverter– Eliminate inverter and use inverting full adder

Cout Cin

B1A1B2A2B3A3B4A4

S1S2S3S4

C1C2C3


Carry-Select Adder Trick for critical paths dependent on late input X

– Precompute two possible outputs for X = 0, 1– Select proper output when X arrives

Carry-select adder precomputes n-bit sums– For both possible carries into n-bit group

Cin+

A4:1 B4:1

S4:1

C4

+

+

01

A8:5 B8:5

S8:5

C8

+

+

01

A12:9 B12:9

S12:9

C12

+

+

01

A16:13 B16:13

S16:13

Cout

0

1

0

1

0

1


Carry Select Adder


Carry Skip Adder


Carry Skip Adder


Carry Look Ahead Adder



in

in

CBABABABABACBABACCBABAC

))(()()()(

0011001111011111

00000




Carry-Lookahead Adder Carry-lookahead adder computes Gi:0 for many bits

in parallel. Uses higher-valency cells with more than two inputs.

Cin+

S4:1

G4:1P4:1

A4:1 B4:1

+

S8:5

G8:5P8:5

A8:5 B8:5

+

S12:9

G12:9P12:9

A12:9 B12:9

+

S16:13

G16:13P16:13

A16:13 B16:13

C4C8C12Cout


Tree Adder If lookahead is good, lookahead across lookahead!

– Recursive lookahead gives O(log N) delay Many variations on tree adders


Generate / Propagate Equations often factored into G and P Generate and propagate for groups spanning i:j

Base case

Sum:

: : : 1:

: : 1:

i j i k i k k j

i j i k k j

G G P GP P P

�

�

:

:

i i i i i

i i i i i

G G A BP P A B

�

0:00:00inGCP0:00:00inGCP

0:0 0

0:0 0 0inG G C

P P

1:0i i iS P G


PG Logic

S1

B1A1

P1G1

G0:0

S2

B2

P2G2

G1:0

A2

S3

B3A3

P3G3

G2:0

S4

B4

P4G4

G3:0

A4 Cin

G0 P0

1: Bitwise PG logic

2: Group PG logic

3: Sum logicC0C1C2C3

Cout

C4


PG Carry-Ripple Addition:0 1:0 i i i iG G P G �

S1

B1A1

P1G1

G0:0

S2

B2

P2G2

G1:0

A2

S3

B3A3

P3G3

G2:0

S4

B4

P4G4

G3:0

A4 Cin

G0 P0

C0C1C2C3

Cout

C4 4-bit carry-ripple adder using PG logic


Carry-Ripple PG DiagramD

elay

0123456789101112131415

15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

Bit Position

ripple xor( 1)pg AOt t N t t


PG Diagram Notation

i:j

i:j

i:k k-1:j

i:j

i:k k-1:j

i:j

Gi:k

Pk-1:j

Gk-1:j

Gi:j

Pi:j

Pi:k

Gi:k

Gk-1:j

Gi:j Gi:j

Pi:j

Gi:j

Pi:j

Pi:k

Black cell Gray cell Buffer


CLA PG Diagram

012345678910111213141516

15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:016:0


Higher-Valency Cells

i:j

i:k k-1:l l-1:m m-1:j

Gi:k

Gk-1:l

Gl-1:m

Gm-1:j

Gi:j

Pi:j

Pi:k

Pk-1:l

Pl-1:m

Pm-1:j


4-bit Kogge-Stone Example

17: Adders 31


Kogge-Stone

1:02:13:24:35:46:57:68:79:810:911:1012:1113:1214:1315:14

3:04:15:26:37:48:59:610:711:812:913:1014:1115:12

4:05:06:07:08:19:210:311:412:513:614:715:8

2:0

0123456789101112131415

15:014:013:012:011:010:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0


Tree Adder Taxonomy Ideal N-bit tree adder would have

– L = log N logic levels– Fanout never exceeding 2– No more than one wiring track between levels

Describe adder with 3-D taxonomy (l, f, t)– Logic levels: L + l– Fanout: 2f + 1– Wiring tracks: 2t

Known tree adders sit on plane defined byl + f + t = L-1


Taxonomy Revisited

f (Fanout)

t (Wire Tracks)

l (Logic Levels)

0 (2)1 (3)

2 (5)

3 (9)

0 (4)

1 (5)

2 (6)

3 (8)

2 (4)

1 (2)

0 (1)

3 (7)

Kogge-Stone

Sklansky

Brent-Kung

Han-Carlson

Knowles[2,1,1,1]

Knowles[4,2,1,1]

Ladner-Fischer

Han-Carlson

Ladner-Fischer

New(1,1,1)

(c) Kogge-Stone

1:02:13:24:35:46:57:68:79:810:911:1012:1113:1214:1315:14

3:04:15:26:37:48:59:610:711:812:913:1014:1115:12

4:05:06:07:08:19:210:311:412:513:614:715:8

2:0

0123456789101112131415

15:0 14:0 13:0 12:0 11:010:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

(e) Knowles [2,1,1,1]

1:02:13:24:35:46:57:68:79:810:911:1012:1113:1214:1315:14

3:04:15:26:37:48:59:610:711:812:913:1014:1115:12

4:05:06:07:08:19:210:311:412:513:614:715:8

2:0

0123456789101112131415

15:014:013:0 12:011:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

(b) Sklansky

1:0

2:03:0

3:25:47:69:811:1013:1215:14

6:47:410:811:814:1215:12

12:813:814:815:8

0123456789101112131415

15:0 14:013:0 12:011:010:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

1:03:25:47:69:811:1013:12

3:07:411:815:12

5:07:013:815:8

15:14

15:8 13:0 11:0 9:0

0123456789101112131415

15:0 14:0 13:0 12:0 11:0 10:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

(f) Ladner-Fischer

(a) Brent-Kung

1:03:25:47:69:811:1013:1215:14

3:07:411:815:12

7:015:8

11:0

5:09:013:0

0123456789101112131415

15:014:013:0 12:011:010:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

1:03:25:47:69:811:1013:1215:14

3:05:27:49:611:813:1015:12

5:07:09:211:413:615:8

0123456789101112131415

15:014:013:0 12:0 11:010:0 9:0 8:0 7:0 6:0 5:0 4:0 3:0 2:0 1:0 0:0

(d) Han-Carlson


Summary

Architecture Classification Logic Levels

Max Fanout

Tracks Cells

Carry-Ripple N-1 1 1 N

Carry-Skip n=4 N/4 + 5 2 1 1.25N

Carry-Inc. n=4 N/4 + 2 4 1 2N

Brent-Kung (L-1, 0, 0) 2log2N – 1 2 1 2N

Sklansky (0, L-1, 0) log2N N/2 + 1 1 0.5 Nlog2N

Kogge-Stone (0, 0, L-1) log2N 2 N/2 Nlog2N

Adder architectures offer area / power / delay tradeoffs.

Choose the best one for your application.

CMOS VLSI DesignCMOS VLSI Design 4th Ed.18: Datapath Functional Units 36

Multiplication Example:

M x N-bit multiplication– Produce N M-bit partial products– Sum these to produce M+N-bit product

1100 : 1210 0101 : 510 1100 0000 1100 000000111100 : 6010

multipliermultiplicand

partialproducts

product


General Form Multiplicand: Y = (yM-1, yM-2, …, y1, y0) Multiplier: X = (xN-1, xN-2, …, x1, x0)

Product:1 1 1 1

0 0 0 02 2 2

M N N Mj i i j

j i i jj i i j

P y x x y

x0y5 x0y4 x0y3 x0y2 x0y1 x0y0

y5 y4 y3 y2 y1 y0

x5 x4 x3 x2 x1 x0






p0p1p2p3p4p5p6p7p8p9p10p11

multipliermultiplicand

partialproducts

product


The Binary Multiplication

1 0 1 1

1 0 1 0 1 0

0 0 0 0 0 0

1 0 1 0 1 0

1 0 1 0 1 0

1 0 1 0 1 0

1 1 1 0 0 1 1 1 0

+

Partial Products

AND operation


Dot Diagram Each dot represents a bit

partial products

multiplier x

x0

x15


Array Multipliery0y1y2y3

x0

x1

x2

x3

p0p1p2p3p4p5p6p7

B

ASin Cin

SoutCout

BA

CinCout

Sout

Sin

=

CSAArray

CPA

critical path BA

Sout

Cout CinCout

Sout

=Cin

BA


The Array Multiplier

HA FA FA HA

FA FA FA HA

FA FA FA HA

X0X1X2X3 Y1

X0X1X2X3 Y2

X0X1X2X3 Y3

Z1

Z2

Z3Z4Z5Z6

Z0

Z7

X0X1X2X3

Y0

X0X1X2X3


The MxN Array Multiplier— Critical Path

HA FA FA HA

HAFAFAFA

FAFA FA HA

Critical Path 1

Critical Path 2

Critical Path 1 & 2


Rectangular Array Squash array to fit rectangular floorplan

y0y1y2y3

x0

x1

x2

x3

p0

p1

p2

p3

p4p5p6p7


Carry-Save Multiplier

HA HA HA HA

FAFAFAHA

FAHA FA FA

FAHA FA HA

Vector Merging Adder


Shift & Add Multiplication Right shift and add

– Partial product array rows are accumulated from top to bottom on an N-bit adder

– After each addition, right shift (by one bit) the accumulated partial product to align it with the next row to add

– Time for N bits Tserial_mult = O(N Tadder) = O(N2) for a RCA Making it faster

l Use a faster adderl Use higher radix (e.g., base 4) multiplication

- Use multiplier recoding to simplify multiple formation

l Form partial product array in parallel and add it in parallel

Making it smaller (i.e., slower)l Use an array multiplier

- Very regular structure with only short wires to nearest neighbor cells. Thus, very simple and efficient layout in VLSI

- Can be easily and efficiently pipelined


(3,2) Counter (CSA32)

A CSA is effectively a “1”s counter that adds the number of 1’S on the A,B, and C inputs and encode them on the sum and carry output.

A B C Carry Sum No of 1’s

0 0 0 0 0 00 0 1 0 1 10 1 0 0 1 10 1 1 1 0 21 0 0 0 1 11 0 1 1 0 21 1 0 1 0 21 1 1 1 1 3



Instead of using two adders we want to separate adder the three terms are reduced to two and then added.

A= 6 Decimal = 0 0 1 1 0B= 12 Decimal = 0 1 1 0 0A= 24 Decimal = 1 1 0 0 0Sum = 1 0 0 1 0Carry = 0 1 1 0 0Result = 1 0 1 0 1 0

•To avoid delay multiplier perform carry save compression and at the end of the carry save compression the generated terms are added with an adder.



Built out of two (3,2) counters (just FA’s!)– all of the inputs (4 external plus one

internal) have the same weight (i.e., are in the same bit position)

– the internal output is carried to the next higher weight position (indicated by the )(3,2)

(3,2) Note: Two carry outs - one “internal” and one “external”


Tiling (4,2) Counters

Reduces columns four high to columns only two high– Tiles with neighboring (4,2) counters– Internal carry in at same “level” (i.e., bit position

weight) as the internal carry out

(3,2)

(3,2)

(3,2)

(3,2)

(3,2)

(3,2)


Tiling (4,2) Counters

Reduces columns four high to columns only two high– Tiles with neighboring (4,2) counters– Internal carry in at same “level” (i.e., bit position

weight) as the internal carry out

(3,2)

(3,2)

(3,2)

(3,2)

(3,2)

(3,2)


4x4 Partial Product Array Reduction

multiplicandmultiplier

partialproductarray

reduced pp array (to CPA)

double precision product

Fast 4x4 multiplication using (4,2) counters


4x4 Partial Product Array Reduction

multiplicandmultiplier

partialproductarray

reduced pp array (to CPA)

double precision product

Fast 4x4 multiplication using (4,2) counters


8x8 Partial Product Array Reduction‘icand‘ier

partialproductarray

How many (4,2) counters minimum are needed to reduce it to 2 rows?


8x8 Partial Product Array Reduction‘icand‘ier

partialproductarray

reduced partial product array

How many (4,2) countersminimum are needed to reduce it to 2 rows?

Answer: 24


Alternate 8x8 Partial Product Array Reduction‘icand‘ier

partialproductarray

reduced partial product array

More (4,2) counters, so what is the advantage?


Wallace Tree Multiplication

18: Datapath Functional Units 56


Wallace Tree Multiplication

17: Adders 57


Fewer Partial Products Array multiplier requires N partial products If we looked at groups of r bits, we could form N/r

partial products.– Faster and smaller?– Called radix-2r encoding

Ex: r = 2: look at pairs of bits– Form partial products of 0, Y, 2Y, 3Y– First three are easy, but 3Y requires adder


Booth Encoding Instead of 3Y, try –Y, then increment next partial

product to add 4Y Similarly, for 2Y, try –2Y + 4Y in next partial product


Booth Hardware Booth encoder generates control lines for each PP

– Booth selectors choose PP bits