6.1 ALU Blocks and Control 1. Adder 2. Multiplier 3. Datapath Generation Contents

6.1

ALU Blocks and Control

1. Adder

2. Multiplier

3. Datapath Generation

Contents

6.2

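1. Adder

Full Adder

Boolean equation

$\mathrm{CARRY} = A B + B C + C A = A B + C (A + B)$

$\mathrm{SUM} = A B C + A \bar{B} \bar{C} + \bar{A} B \bar{C} + \bar{A} \bar{B} C = A B C + \overline{\mathrm{CARRY}} \, (A + B + C)$

SUM is the odd-parity function of A, B, and C.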

6.3

Which is better?

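Boolean Equation 1 (CARRY is computed first, SUM uses CARRY):

$\mathrm{CARRY} = A B + C (A + B)$

$\mathrm{SUM} = A B C + \overline{\mathrm{CARRY}} \, (A + B + C)$

Boolean Equation 2 (SUM is computed first, CARRY uses SUM):

$\mathrm{SUM} = A B C + A \bar{B} \bar{C} + \bar{A} B \bar{C} + \bar{A} \bar{B} C$

$\mathrm{CARRY} = A B C + \overline{\mathrm{SUM}} \, (A + B + C)$

CARRY evaluation is more urgent because CARRY is on the critical path of a ripple carry adder, so the form that produces CARRY first (Equation 1) is the better choice.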

[Figure: ripple carry adder. Each ADDER stage i takes Ai, Bi, and the carry Ci and produces Si and the next carry; the carry ripples from C0 through stages 0 to n.]

6.4

Alternating Complementary Form

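At odd stages (true inputs A, B, C; complemented outputs):

$\overline{\mathrm{CARRY}} = \overline{A B + C (A + B)}$

$\overline{\mathrm{SUM}} = \overline{A B C + \overline{\mathrm{CARRY}} \, (A + B + C)}$

At even stages (complemented inputs $\bar{A}$, $\bar{B}$, $\bar{C}$; true outputs):

$\mathrm{CARRY} = (A + B)(C + A B)$

$\mathrm{SUM} = (A + B + C)(\overline{\mathrm{CARRY}} + A B C)$

[Figure: one odd-stage cell and one even-stage cell, each with inputs A, B, C and outputs SUM and CARRY in alternating polarity.]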

6.5

Alternating Complementary Form

6.6

Dynamic Serial Adder

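[Figure: dynamic serial adder. Operand bits A ($a_{n-1} \ldots a_0$) and B ($b_{n-1} \ldots b_0$) enter serially, the sum S ($s_{n-1} \ldots s_0$) leaves serially, and CARRY is fed back through a D flip-flop (D, Q, CLOCK) with a clear/set control.]

$\mathrm{CARRY}(t+1) = A(t+1) \, B(t+1) + C(t) \, [A(t+1) + B(t+1)]$

$\mathrm{SUM}(t+1) = A(t+1) \, B(t+1) \, C(t) + \overline{\mathrm{CARRY}(t+1)} \, [A(t+1) + B(t+1) + C(t)]$

Here $C(t)$ is the carry held in the flip-flop from the previous bit time.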

6.7

Dynamic Configuration

[Figure: dynamic configuration of the full adder. A clocked (CK) CARRY GATE on inputs A, B, C and a clocked SUM GATE on inputs A, B, C and the carry, each with an optional precharge device, plus a set/reset (S/R) circuit.]

$\mathrm{CARRY} = A B + C \, [A + B]$

6.8

Full Adder Truth Table

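Row     0 1 2 3 4 5 6 7
A       0 0 0 0 1 1 1 1
B       0 0 1 1 0 0 1 1
C       0 1 0 1 0 1 0 1
CARRY   0 0 0 1 0 1 1 1
SUM     0 1 1 0 1 0 0 1

Rows 0 and 7, 1 and 6, 2 and 5, and 3 and 4 are mutually complementary.

$F_C$: on-terms of CARRY. $F_S$: on-terms of SUM.

Conjugate symmetry: if the inputs are inverted, the outputs are inverted as well.

$\mathrm{SUM} = F_S(A, B, C)$

$\mathrm{CARRY} = F_C(A, B, C)$

$\overline{\mathrm{SUM}} = F_S(\bar{A}, \bar{B}, \bar{C})$

$\overline{\mathrm{CARRY}} = F_C(\bar{A}, \bar{B}, \bar{C})$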
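The conjugate symmetry of the full adder can be checked exhaustively. The following is a minimal sketch, not part of the original slides; the function names `carry`, `full_sum`, and `is_self_dual` are chosen here for illustration only.

```python
from itertools import product

def carry(a, b, c):
    # CARRY = A.B + C.(A + B)
    return (a & b) | (c & (a | b))

def full_sum(a, b, c):
    # SUM = A.B.C + NOT(CARRY).(A + B + C)
    return (a & b & c) | ((1 - carry(a, b, c)) & (a | b | c))

def is_self_dual(f):
    # f(~A, ~B, ~C) must equal ~f(A, B, C) on every row of the truth table
    return all(f(1 - a, 1 - b, 1 - c) == 1 - f(a, b, c)
               for a, b, c in product((0, 1), repeat=3))

# Columns in row order 0..7, with A as the most significant input
carry_column = [carry(a, b, c) for a, b, c in product((0, 1), repeat=3)]
sum_column = [full_sum(a, b, c) for a, b, c in product((0, 1), repeat=3)]
```

Both functions pass `is_self_dual`, which is the property that lets odd and even stages alternate between true and complemented signals.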

6.9

Another Configuration of Carry & Sum Logic

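[Figure: CARRY STAGE and SUM STAGE transistor networks on inputs A, B, C, with the '1' propagate and '1' generate paths marked; the sum stage also takes the carry output.]

$\mathrm{CARRY}(t+1) = F_C(A, B, C) = A B + B C + C A = A B + C (A + B)$

$\mathrm{SUM}(t+1) = F_S(A, B, C) = A B C + A \bar{B} \bar{C} + \bar{A} B \bar{C} + \bar{A} \bar{B} C = A B C + \overline{\mathrm{CARRY}} \, (A + B + C)$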

6.10

Dynamic full adder using np CMOS logic style

6.11

Layout of the dynamic full adder

6.12

Looking at the FA Truth Table

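A       0 0 0 0 1 1 1 1
B       0 0 1 1 0 0 1 1
C       0 1 0 1 0 1 0 1
CARRY   0 0 0 1 0 1 1 1
SUM     0 1 1 0 1 0 0 1

$\mathrm{SUM} = P \bar{C} + \bar{P} C$

$\mathrm{CARRY} = P C + \bar{P} B$, where $P = A \oplus B$

SUM = $C$ when $A \oplus B = 0$
SUM = $\bar{C}$ when $A \oplus B = 1$

CARRY = $A$ (or $B$) when $A \oplus B = 0$
CARRY = $C$ when $A \oplus B = 1$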
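The propagate-based formulation can be compared against ordinary addition of three bits. This is a small sketch under the assumption that P denotes A XOR B; the name `fa_propagate` is illustrative only.

```python
def fa_propagate(a, b, c):
    """Full adder written in terms of the propagate signal P = A xor B."""
    p = a ^ b
    s = (p & (1 - c)) | ((1 - p) & c)    # SUM = P.~C + ~P.C
    cy = (p & c) | ((1 - p) & b)         # CARRY = P.C + ~P.B
    return s, cy

# Collect (inputs, outputs) for all eight rows of the truth table
rows = [((a, b, c), fa_propagate(a, b, c))
        for a in (0, 1) for b in (0, 1) for c in (0, 1)]
```

When P = 0 the two operand bits are equal, so either of them can be passed through as the carry, which is what the multiplexer-style transmission gate circuit exploits.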

6.13

Transmission Gate Implementation

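[Figure: transmission gate full adder. $P = A \oplus B$ is generated first; transmission gates controlled by $P$ and $\bar{P}$ select between $C$ and $\bar{C}$ for SUM and between $C$ and an operand bit for CARRY.]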

6.14

CLA (Carry Lookahead Adder)

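[Figure: four-bit carry lookahead chain with inputs C0, P1 to P4, and G1 to G4 and outputs C1 to C4; each bit cell derives Gn and Pn from An and Bn.]

$C_i = G_i + P_i C_{i-1}$, where $G_i = A_i B_i$ and $P_i = A_i \oplus B_i$

$C_i = G_i + P_i G_{i-1} + P_i P_{i-1} G_{i-2} + \cdots + P_i P_{i-1} \cdots P_2 P_1 C_0$

$S_i = C_{i-1} \oplus P_i$

Available for (# of inputs ≤ 4)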
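The expanded lookahead expression can be evaluated directly in software. The sketch below is illustrative only: it uses 0-based bit indices (bit 0 is the LSB), whereas the slide numbers bits from 1, and the names `lookahead_carries` and `cla_add` are assumptions made here.

```python
def lookahead_carries(g, p, c0):
    """Return [c0, c1, ..., cn] using the fully expanded lookahead form."""
    carries = [c0]
    for i in range(len(g)):
        c = 0
        prod = 1                     # running product P_i P_(i-1) ... P_(j+1)
        for j in range(i, -1, -1):
            c |= g[j] & prod         # term G_j P_(j+1) ... P_i
            prod &= p[j]
        c |= prod & c0               # term P_0 ... P_i C_0
        carries.append(c)
    return carries

def cla_add(a, b, n=4, c0=0):
    """Add two n-bit numbers; returns (sum, carry_out)."""
    abits = [(a >> i) & 1 for i in range(n)]
    bbits = [(b >> i) & 1 for i in range(n)]
    g = [x & y for x, y in zip(abits, bbits)]    # generate
    p = [x ^ y for x, y in zip(abits, bbits)]    # propagate
    c = lookahead_carries(g, p, c0)
    s = sum((p[i] ^ c[i]) << i for i in range(n))
    return s, c[n]
```

Each carry is computed from the G and P signals and the carry-in alone, never from another carry. The widest term grows with the bit position, which is why the scheme is limited to a small number of inputs per block.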

6.15

Carry bypass structure - basic concept

6.16

(N = 16)-bit carry bypass adder (each stage: M bits)

Worst case delay:

$t_p = t_{setup} + M \, t_{carry} + (N/M - 1) \, t_{bypass} + M \, t_{carry} + t_{sum}$

$t_{setup}$ : time to create the G and P signals
$t_{carry}$ : propagation delay through a single bit
$t_{bypass}$ : propagation delay through the MUX
$t_{sum}$ : time to generate the sum
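The worst-case delay expression is easy to tabulate for different stage widths. The sketch below simply evaluates the formula as given on the slide; the unit delay values are placeholders assumed for illustration, not measured numbers.

```python
def bypass_delay(n, m, t_setup=1, t_carry=1, t_bypass=1, t_sum=1):
    """Worst-case delay of an n-bit carry bypass adder with m-bit stages.

    Evaluates tp = tsetup + M*tcarry + (N/M - 1)*tbypass + M*tcarry + tsum.
    All delay arguments default to 1 time unit (an assumption).
    """
    stages = n // m
    return t_setup + m * t_carry + (stages - 1) * t_bypass + m * t_carry + t_sum

# Delay for N = 16 at a few stage widths that divide 16 evenly
table = {m: bypass_delay(16, m) for m in (1, 2, 4, 8, 16)}
```

With equal unit delays the minimum falls at intermediate stage widths: small M means many bypass hops, while large M means long ripple chains in the first and last stages.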

6.17

Combining 4 Domino Carry Lookahead Blocks

Manchester Carry Chain (4-bit)

Limit 4 stages

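In the worst case, there are 6 series transistors to ground.

[Figure: 4-bit Manchester carry chain clocked by CK, with carry-in C0, propagate and generate inputs P1 to P4 and G1 to G4, and carries C1 to C4. Block view: the MANCHESTER CARRY CHAIN sits between the GP block and the sum block, with C0 in and C4 out.]

$C_i = G_i + P_i C_{i-1}$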

6.18

Improving Worst Case Carry Prop. Time

[Figure: Manchester carry chain from C0 to C4, with a second clocked (CK) C0-to-C4 path controlled by P1, P2, P3, P4.]

Faster pass-transistor chain due to lower parasitic capacitance loading.

6.19

Manchester CC Adder Floorplan

Dual CC scheme: one chain for carry propagation, the other for off-loading the first CC from the SUM block.

[Figure: floorplan of four bit slices (BIT 1 to BIT 4). Each slice takes Ai and Bi into a GP block, passes through two MANCHESTER CARRY CHAIN cells, and ends in a SUM GENERATE block producing Si; C0 enters at BIT 1 and C4 leaves at BIT 4.]

6.20

CSA (Carry Select Adder)

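[Figure: carry select adder. Bits 0 ~ 3 (A0 ~ A3, B0 ~ B3, C0) produce S0 ~ S3 and C4. Bits 4 ~ 7 (A4 ~ A7, B4 ~ B7) are added twice, once with carry-in 0, giving $S_4^0$ ~ $S_7^0$ and $C_8^0$, and once with carry-in 1, giving $S_4^1$ ~ $S_7^1$ and $C_8^1$; C4 selects S4 ~ S7 and C8.]

Carry Selection

$C_8 = C_4 C_8^1 + \bar{C}_4 C_8^0$

$\;\;\;\; = C_4 C_8^1 + \bar{C}_4 C_8^0 C_8^1$ (since $C_8^0 \, \overline{C_8^1} = 0$ always)

$\;\;\;\; = C_8^1 (C_4 + \bar{C}_4 C_8^0)$

$\;\;\;\; = C_8^1 (C_4 + C_8^0)$

Realization of MUX with restoring logic

[Figure: C8 generated from C4, $C_8^0$, and $C_8^1$, and C12 generated from C8, $C_{12}^0$, and $C_{12}^1$, using restoring gates.]

Note) Realization of MUX with pass-transistor gates: threshold voltage loss per stage (Vdd, Vdd - Vt, Vdd - 2Vt).

Use restoring logic for the critical path.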
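The simplification of the carry-selection MUX relies on the fact that a block can never produce a carry-out for carry-in 0 without also producing one for carry-in 1. A brute-force check over all 4-bit operand pairs, as a sketch with illustrative function names:

```python
def block_carry_outs(a, b, width=4):
    """Carry-out of a width-bit block for carry-in 0 and carry-in 1."""
    c_out0 = (a + b) >> width
    c_out1 = (a + b + 1) >> width
    return c_out0, c_out1

def mux_select(c4, c8_0, c8_1):
    # Plain 2:1 multiplexer: C8 = C4.C8^1 + ~C4.C8^0
    return (c4 & c8_1) | ((1 - c4) & c8_0)

def simplified_select(c4, c8_0, c8_1):
    # Simplified form: C8 = C8^1.(C4 + C8^0)
    return c8_1 & (c4 | c8_0)

# Every (C8^0, C8^1) pair that can actually occur for a 4-bit block
reachable = {block_carry_outs(a, b) for a in range(16) for b in range(16)}
```

The pair (1, 0) never appears in `reachable`, so the two select functions agree on every case that can occur in the adder even though they differ as general Boolean functions.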

6.21


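For carry propagation, use restoring logic in the alternating pattern.

[Figure: carry select chain. The first block (A0 ~ A3, B0 ~ B3, C0) produces S0 ~ S3 and C4; the following blocks produce $C_8^0$, $C_8^1$ and $C_{12}^0$, $C_{12}^1$, from which C8 and C12 are selected.]

Number of bits for each stage
ex1) 32-bit case : 4, 4, 5, 6, 7, 6 (or 4, 4, 5, 6, 6, 7)
ex2) 64-bit case : 4, 4, 5, 6, 7, 8, 9, 10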

6.22

Minimization of Carry Propagation Path Delay

Carry select scheme (prepare the result for each case, Cin = 1 and Cin = 0)

Simplify the carry selection using the relationship between $C_i^0$ and $C_i^1$

Take complemented carries, alternating between the even and odd stages

Adjust each block size, taking the delay of the carry select logic into account, so that
(carry propagation delay within each block) = (carry propagation delay up to that block)

e.g. for a 32-bit path: 4, 4, 5, 6, 6, 7

6.23

16-bit Linear CSA (Carry Select Adder)

$t_{add} = t_{setup} + M \, t_{carry} + (N/M) \, t_{mux} + t_{sum}$

M : # of bits/stage
N : total # of bits

6.24

Square Root CSA

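$t_{add} = t_{setup} + M \, t_{carry} + \sqrt{2N} \, t_{mux} + t_{sum}$

$N = M + (M+1) + \cdots + (M+P-1) = MP + \dfrac{P(P-1)}{2} = \dfrac{P^2}{2} + P\left(M - \dfrac{1}{2}\right) \approx \dfrac{P^2}{2}$

where P is the number of stages, so $P \approx \sqrt{2N}$.

Assumed: the MUX delay is comparable to the 1-stage carry propagation delay.

[Figure: square root CSA (9 stages); annotation on the final carry: "12 ~ 6 (?) number of clock cycles for this signal to be obtained".]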
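The two delay expressions for the carry select adder can be compared numerically. This sketch evaluates the linear and square-root formulas with every component delay set to one unit, which is an assumption made only for illustration; `bits_covered` is a helper name introduced here.

```python
import math

def linear_csa_delay(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    # tadd = tsetup + M*tcarry + (N/M)*tmux + tsum
    return t_setup + m * t_carry + (n / m) * t_mux + t_sum

def sqrt_csa_delay(n, m, t_setup=1, t_carry=1, t_mux=1, t_sum=1):
    # tadd = tsetup + M*tcarry + sqrt(2N)*tmux + tsum
    return t_setup + m * t_carry + math.sqrt(2 * n) * t_mux + t_sum

def bits_covered(m, p):
    # N = M + (M+1) + ... + (M+P-1) = M*P + P*(P-1)/2
    return m * p + p * (p - 1) // 2
```

For example, `linear_csa_delay(64, 4)` gives 22 units while `sqrt_csa_delay(64, 4)` gives about 17.3, and the gap widens as N grows because the MUX term scales with the square root of N rather than with N.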

6.25

Propagation Delay of Linear and Square Root CSA and linear RCA

6.26

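Carry Skip Adder

A compromise between the ripple carry adder and the CLA adder.

$P_{0,3} = p_0 \, p_1 \, p_2 \, p_3$

$G_{0,3} = g_3 + g_2 p_3 + g_1 p_3 p_2 + g_0 p_3 p_2 p_1$

[Figure: 16-bit carry skip adder built from four 4-bit blocks (a0 b0 through a15 b15) with carries c0, c4, c8, c12, c16 and block signals $P_{4,7}$, $P_{8,11}$, $P_{12,15}$ and $G_{4,7}$, $G_{8,11}$, $G_{12,15}$; the worst case delay path is marked.]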

6.27

The $p_i$ and $g_i$ are computed from $p_i = a_i \oplus b_i$ and $g_i = a_i b_i$.

Initially, c4, c8, and c12 are cleared.

After 4 clock cycles (at T0 + 4Tc), the G values are calculated as the cout of each block assuming ci = 0 (the P values are also calculated by then).

At this time (T0 + 4Tc), the true cout of the first stage, c4, is obtained.

After one, two, and three further clock cycles respectively, assuming the delay of each AOI gate is Tc, the true values of c8, c12, and c16 are obtained.

The sum and cout of the last block are obtained at T0 + 4Tc + 2Tc + 4Tc, which is the worst case delay.

6.28

Comparison of Carry Select & Carry Skip Adder

A 32-bit Carry Select Adder

A 32-bit Carry Skip Adder

RCAAreaArea

kkSpeed

2

delays)r multiplexe where(822

logic-P

22delays)r multiplexe where(12

AreaAreaArea

kkSpeed

RCA

32-bit carry select adder:

Stage #      1  2  3  4  5  6
bits/stage   4  4  5  6  7  6
inc. delay   4  1  1  1  1  1

Total for 32 bits: 9 k2 (k2 = delay due to 1-bit addition or MUX)

32-bit carry skip adder:

Stage #      1  2  3  4  5  6
bits/stage   4  5  6  7  8  2
inc. delay   4  1  1  1  1  2

Total for 32 bits: 10 k2

6.29

Conditional Sum Adder

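[Figure: conditional sum adder. For each bit i (A0 B0, A1 B1, A2 B2) a cell computes both $S_i^1$, $C_{i+1}^1$ (carry-in 1) and $S_i^0$, $C_{i+1}^0$ (carry-in 0). Multiplexers (MPX) select between the two cases; a triple 2-input MUX driven by C1 picks S1, S2, and C3 from the (C1 = 1) and (C1 = 0) candidates. Outputs: S0, S1, S2, C3.]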

6.30

Carry Lookahead Tree Adder

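The previous CLA implementation is not very adequate because of fan-in and fan-out problems and irregularity, despite the small number (5) of logic levels. Make it regular, using $\log_2 n$ logic levels.

[ 1st Part ]

[Figure: first part of the tree for a 4-bit adder. Inputs a3 b3 ... a0 b0 give g3 p3 ... g0 p0, which combine into $G_{2,3}$, $P_{2,3}$ and $G_{0,1}$, $P_{0,1}$, and then into $G_{0,3}$, $P_{0,3}$. Cell types: (ai, bi) to (gi, pi), and ($G_{i,j}$, $P_{i,j}$) with ($G_{j+1,k}$, $P_{j+1,k}$) to ($G_{i,k}$, $P_{i,k}$).]

$g_i = a_i b_i$

$p_i = a_i \oplus b_i$

$G_{i,k} = G_{j+1,k} + P_{j+1,k} \, G_{i,j}$

$P_{i,k} = P_{i,j} \, P_{j+1,k}$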
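The group generate and propagate recurrences of the first part of the tree can be sketched as a pairwise reduction. The function names below are illustrative, and the sketch assumes the number of bits is a power of two and that $p_i$ is the XOR of the operand bits.

```python
def combine(low, high):
    """Merge (G_ij, P_ij) for bits i..j with (G_j+1,k, P_j+1,k) for bits j+1..k."""
    g_low, p_low = low
    g_high, p_high = high
    return (g_high | (p_high & g_low), p_low & p_high)

def bit_gp(a, b, n):
    """Per-bit (g_i, p_i) pairs, least significant bit first."""
    pairs = []
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        pairs.append((ai & bi, ai ^ bi))
    return pairs

def group_gp(pairs):
    """Reduce adjacent pairs level by level; log2(n) levels for n inputs."""
    level = list(pairs)
    while len(level) > 1:
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

The root pair gives the carry-out of the whole word as `G | (P & c0)`, so a 4-bit word needs only two levels of `combine` cells.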

6.31

Carry Lookahead Tree Adder

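[ 2nd Part ]

$C_{j+1} = G_{i,j} + P_{i,j} \, C_i$

$g_i = a_i b_i$

$p_i = a_i \oplus b_i$

$S_i = a_i \oplus b_i \oplus c_i$

[Figure: second part of the tree. Starting from C0, carry cells use ($G_{0,1}$, $P_{0,1}$) to produce C2 and (g0, p0), (g2, p2) to produce C1 and C3. Cell type: ($G_{i,j}$, $P_{i,j}$, $C_i$) to ($C_{j+1}$, $C_i$).]

[ Complete CLA Tree Adder ]

[Figure: complete 4-bit CLA tree adder with inputs a3 b3 ... a0 b0 and C0, internal carries C0 to C3, and outputs S3 S2 S1 S0.]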

6.32

Carry Save Adder

Ripple Carry Adder

Carry Lookahead Adder

CSA (Conditional Sum Adder)


CSA (Carry Skip Adder)

CSA (Carry Save Adder)

Carry Propagate Adder

6.33

Carry Save Adder

Carry Save Adder is used wherever a large number of operands have to be added.

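[Figure: carry save adder. Rows of full adders (F.A) form the CSA stages, each taking ai, bi, ci without horizontal carry propagation, followed by a final CPA (carry propagate adder) row. Accumulator version: a row of F.A cells with Sum F/F and Carry F/F registers, whose inputs are the previous-cycle carry, the previous-cycle sum, and the new operand.]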
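A carry save adder reduces three operands to a sum word and a carry word with no carry propagation between bit positions. A minimal sketch for non-negative integers, with function names chosen here for illustration:

```python
def carry_save_add(x, y, z):
    """3:2 compression: one full adder per bit position, no horizontal carries."""
    s = x ^ y ^ z                               # bitwise sum
    c = ((x & y) | (y & z) | (z & x)) << 1      # bitwise carry, moved up one weight
    return s, c

def add_many(operands):
    """Accumulate operands in carry-save form, then do one carry-propagate add."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s + c                                # final CPA

total = add_many([13, 7, 22, 5, 9])
```

Only the last step pays for a full carry propagation, which is why the structure is attractive when many operands must be summed, as in the accumulation of partial products.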

6.34

2. Multiplier

Add-and-Shift Algorithm

[Figure: the same binary multiplication worked two ways, side by side: the multiplication procedure by the pencil-and-paper method, and the multiplication procedure by the add-and-shift algorithm. The multiplicand and multiplier rows are labeled.]
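The add-and-shift procedure can be sketched with a double-width accumulator: for each multiplier bit, the multiplicand is conditionally added into the upper half and the accumulator is shifted right by one. The code below assumes unsigned n-bit operands; the function name is illustrative.

```python
def add_and_shift(multiplicand, multiplier, n):
    """Unsigned n-bit multiply using n add-and-shift steps."""
    acc = 0                                  # 2n-bit product register
    for i in range(n):
        if (multiplier >> i) & 1:            # examine multiplier bits LSB first
            acc += multiplicand << n         # add into the upper half
        acc >>= 1                            # shift the partial product right
    return acc

product = add_and_shift(13, 11, 4)
```

Unlike the pencil-and-paper method, which shifts each new partial product left, this version keeps the adder aligned with the upper half of the register and moves the accumulated result instead.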

6.35

The Serial-Parallel Multiplier

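If $A = (a_n, a_{n-1}, \ldots, a_0)$ and $B = (b_n, b_{n-1}, \ldots, b_0)$, the product $A \cdot B$ is expressed as

$A \cdot B = A \cdot 2^n b_n + A \cdot 2^{n-1} b_{n-1} + \cdots + A \cdot 2^0 b_0$

[Figure: serial-parallel multiplier built from full adders (F.A) and D flip-flops, with operand bits a3 a2 a1 a0 and b2 b1 b0 and a single serial Output.]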

6.36

4x4 array multiplier

6.37

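$t_{mult} = [(M-1) + (N-1)] \, t_{carry} + (N-1) \, t_{sum} + t_{and}$

Both $t_{carry}$ and $t_{sum}$ are important, so the sum and carry generation times need to be similar.

[Figure: critical path through the array, with dimensions N (4) and M (3) marked.]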

6.38

Carry-save Multiplier(CSM)

Rectangular floorplan of CSM

6.39

The Modified Booth Algorithm (cont’)

Booth Encoder Table

b2k+1   b2k   b2k-1   multiplied by
  0      0      0        0
  0      0      1       + x
  0      1      0       + x
  0      1      1       + 2x
  1      0      0       - 2x
  1      0      1       - x
  1      1      0       - x
  1      1      1        0

[Figure: Booth encoder with inputs $b_{2k-1}$, $b_{2k}$, $b_{2k+1}$ and outputs A, 2A, and negative.]

$A = b_{2k} \oplus b_{2k-1}$

$\mathrm{negative} = b_{2k+1}$

6.40

Booth Multiplication Example

A = 17, X = -9, product = -153

Operation sequence:
Initial 0
Add -A
2-bit shift
Add +2A
2-bit shift
Add -A

[Figure: bit-level trace of the partial product register for each operation above.]

6.41

The Modified Booth Algorithm

Consider a number $B = (b_{n-1}, b_{n-2}, \ldots, b_1, b_0)$ written in 2's complement:

$B = -b_{n-1} 2^{n-1} + \sum_{k=0}^{n-2} b_k 2^k$

B may be rewritten as follows:

$B = \sum_{k=0}^{n/2 - 1} (b_{2k-1} + b_{2k} - 2 b_{2k+1}) \, 2^{2k}$ (assume $b_{-1} = 0$)

Example (n = 6):

$(b_{-1} + b_0 - 2 b_1) \, 2^0 + (b_1 + b_2 - 2 b_3) \, 2^2 + (b_3 + b_4 - 2 b_5) \, 2^4 = b_0 + b_1 2^1 + b_2 2^2 + b_3 2^3 + b_4 2^4 - b_5 2^5$

In this equation, each term in brackets takes a value in the set {-2, -1, 0, 1, 2}.

An n-bit multiplier generates exactly n/2 partial products.
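The recoding can be tried on the 17 x (-9) example from the slides. The sketch below is illustrative (the function names are not from the source) and assumes an even word length n.

```python
def booth_digits(pattern, n):
    """Radix-4 Booth digits of an n-bit two's complement bit pattern.

    pattern is the raw bit pattern as a non-negative integer; n must be even.
    Digit k is b(2k-1) + b(2k) - 2*b(2k+1), with b(-1) = 0.
    """
    def bit(i):
        return 0 if i < 0 else (pattern >> i) & 1
    return [bit(2 * k - 1) + bit(2 * k) - 2 * bit(2 * k + 1)
            for k in range(n // 2)]

def booth_multiply(a, b, n):
    """Multiply a by the n-bit two's complement value b using n/2 partial products."""
    pattern = b & ((1 << n) - 1)
    return sum(d * a * 4 ** k for k, d in enumerate(booth_digits(pattern, n)))

digits = booth_digits(-9 & 0x3F, 6)      # -9 as the 6-bit pattern 110111
```

The digits come out least significant first as -1, +2, -1, matching the operation sequence "add -A, add +2A, add -A" with a 2-bit shift between additions.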

6.42

Parallel Multiplier

A multiplier has two basic operations:

The generation of partial products

The summation of partial products

A parallel multiplier avoids the overhead that is due to the separate controls of these two operations.

The gain in speed is obtained at the expense of extra hardware.

A parallel multiplier can be implemented such that it supports a high rate of pipelining.

6.43

The Braun Multiplier

Partial product array (4 x 4):

                          a3    a2    a1    a0
                     x    b3    b2    b1    b0
                        a3b0  a2b0  a1b0  a0b0
                  a3b1  a2b1  a1b1  a0b1
            a3b2  a2b2  a1b2  a0b2
      a3b3  a2b3  a1b3  a0b3
      P6    P5    P4    P3    P2    P1    P0

A straightforward implementation. Each full adder cell adds:

One bit of the new partial product ( ai . bj )

One bit of the previous partial product

Carry in

In the first four rows there is no horizontal carry propagation (using carry-save adders).

6.44

The Braun Multiplier (cont’)

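[Figure: 4 x 4 Braun multiplier array. Four rows of three full adders (F.A); columns driven by a3 a2 a1 a0, rows by b0 b1 b2 b3, zero carry inputs on the first row and on the last row, product outputs p0 to p7.]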

6.45

Baugh-Wooley Multiplier

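Modified in order to allow multiplication of signed numbers.

Consider two numbers A and B (2's complement numbers):

$A = (a_{n-1} \ldots a_0) = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$

$B = (b_{n-1} \ldots b_0) = -b_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} b_i 2^i$

$a_{n-1}$ : sign bit in 2's complement

$A = \sum_{i=0}^{n-2} a_i 2^i$, when $a_{n-1} = 0$

$A = -2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$, when $a_{n-1} = 1$

The product A.B is

$A \cdot B = a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} - a_{n-1} \sum_{i=0}^{n-2} b_i 2^{n-1+i} - b_{n-1} \sum_{i=0}^{n-2} a_i 2^{n-1+i}$

$= a_{n-1} b_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + a_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{b}_i 2^{n-1+i} \right) + b_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{a}_i 2^{n-1+i} \right)$

$= -2^{2n-1} + (\bar{a}_{n-1} + \bar{b}_{n-1} + a_{n-1} b_{n-1}) \, 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} a_i b_j 2^{i+j} + (a_{n-1} + b_{n-1}) \, 2^{n-1} + \sum_{i=0}^{n-2} b_{n-1} \bar{a}_i 2^{n-1+i} + \sum_{i=0}^{n-2} a_{n-1} \bar{b}_i 2^{n-1+i}$

because:

$-b_{n-1} \sum_{i=0}^{n-2} a_i 2^{n-1+i} = b_{n-1} \left( -2^{2n-2} + 2^{n-1} + \sum_{i=0}^{n-2} \bar{a}_i 2^{n-1+i} \right)$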
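The Baugh-Wooley rearrangement replaces the subtracted partial products with complemented bits plus fixed correction terms, so that every partial product is added. The sketch below evaluates the standard final form of that identity term by term; the function name is illustrative, and the exhaustive comparison against ordinary multiplication is included only as a sanity check.

```python
def baugh_wooley(a, b, n):
    """Evaluate the Baugh-Wooley form of a*b for n-bit two's complement a, b."""
    abit = [(a >> i) & 1 for i in range(n)]      # two's complement bits of a
    bbit = [(b >> i) & 1 for i in range(n)]
    an, bn = abit[n - 1], bbit[n - 1]            # sign bits

    total = -(1 << (2 * n - 1))                                   # -2^(2n-1)
    total += ((1 - an) + (1 - bn) + an * bn) << (2 * n - 2)       # sign-bit term
    total += sum(abit[i] * bbit[j] << (i + j)                     # unsigned core
                 for i in range(n - 1) for j in range(n - 1))
    total += (an + bn) << (n - 1)                                 # correction
    total += sum(bn * (1 - abit[i]) << (i + n - 1) for i in range(n - 1))
    total += sum(an * (1 - bbit[i]) << (i + n - 1) for i in range(n - 1))
    return total
```

Apart from the constant $-2^{2n-1}$, every term is non-negative, so the array can be built from ordinary adder cells, which is the point of the transformation.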

6.46

Baugh-Wooley Multiplier (cont’)

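[Figure: 4 x 4 Baugh-Wooley multiplier array. Rows of full adders (F.A) driven by a3 a2 a1 a0 and b0 b1 b2 b3, with zero carry inputs on the first row, additional F.A cells for the a3 and b3 sign terms and the constant 1, and product outputs p0 to p7.]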

6.47

Wallace Tree Multipliers

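Full adder vs Wallace tree

Useful whenever a large number of operands are to be added.

Completion time in a Braun or Baugh-Wooley multiplier using a ripple carry adder:

Proportional to twice the number of bits, 2n

Using Wallace trees:

Proportional to $\log_2(n)$

[Figure: a full adder takes three inputs of weight $2^0$ and produces outputs of weight $2^1$ and $2^0$; a Wallace tree takes n inputs of weight $2^0$ and produces outputs of weights $2^n, \ldots, 2^1, 2^0$.]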

6.48

Recursive Decomposition of the Multiplication

Partitioning the two operands:

$A = 2^p A_H + A_L$

$B = 2^p B_H + B_L$

$A \cdot B = 2^{2p} A_H B_H + 2^p (A_H B_L + A_L B_H) + A_L B_L$

The four terms ($A_H \cdot B_H$, $A_H \cdot B_L$, $A_L \cdot B_H$, $A_L \cdot B_L$) are computed using 4 p-bit multipliers.

The results are collected through a Wallace tree.
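The decomposition can be sketched for unsigned operands by splitting each operand at bit p. The function name is illustrative, and Python's built-in multiplication stands in for the four p-bit multipliers.

```python
def split_multiply(a, b, p):
    """Multiply two unsigned 2p-bit numbers using four p-bit products."""
    mask = (1 << p) - 1
    a_h, a_l = a >> p, a & mask          # A = 2^p * A_H + A_L
    b_h, b_l = b >> p, b & mask          # B = 2^p * B_H + B_L
    hh = a_h * b_h                       # the four p-bit x p-bit products
    hl = a_h * b_l
    lh = a_l * b_h
    ll = a_l * b_l
    # Align by weight and add (done by a Wallace tree in hardware)
    return (hh << (2 * p)) + ((hl + lh) << p) + ll

result = split_multiply(0xBEEF, 0xCAFE, 8)
```

The same split can be applied again to each p-bit product, which is what makes the decomposition recursive.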

6.49

Recursive Decomposition of the Multiplication

[Figure: operands split as (AH, AL) and (BH, BL). The four partial products AL x BL, AH x BL, AL x BH, and AH x BH are aligned by weight and summed through Wallace tree blocks (4 x W3) and a final adder.]

Aligning the four partial products

6.50

Booth’s Algorithm Array Multiplication

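Another approach to the design of a parallel multiplier for two's complement operands.

The basic cell in row i performs an add, subtract, or transfer-only operation.

CASS (Controlled Add/Subtract/Shift) Cell

[Figure: CASS cell with inputs a, $P_{in}$ (partial product), and $c_{in}$, control lines H and D, and outputs $P_{out}$ and $c_{out}$.]

$P_{out} = P_{in} \oplus (a \cdot H) \oplus (c_{in} \cdot H)$

If H = 0, $P_{out} = P_{in}$ (transfer)

If H = 1, $P_{out} = P_{in} \oplus a \oplus c_{in}$ (sum)

$c_{out} = (P_{in} \oplus D)(a + c_{in}) + a \cdot c_{in}$

If D = 0, $c_{out} = P_{in} (a + c_{in}) + a \cdot c_{in}$ (add)

If D = 1, $c_{out} = \bar{P}_{in} (a + c_{in}) + a \cdot c_{in}$ (subtract)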
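The behavior of a single CASS cell can be checked bit by bit. The sketch below assumes the standard cell equations, with H enabling the operation and D choosing between add and subtract; the function name `cass` is illustrative.

```python
def cass(p_in, a, c_in, h, d):
    """One controlled add/subtract/shift cell; all arguments are 0 or 1."""
    p_out = p_in ^ (a & h) ^ (c_in & h)                 # transfer when h = 0
    c_out = ((p_in ^ d) & (a | c_in)) | (a & c_in)      # carry (d = 0) or borrow (d = 1)
    return p_out, c_out

# Enumerate every input combination with the cell enabled (h = 1)
enabled = {(p, a, c, d): cass(p, a, c, 1, d)
           for p in (0, 1) for a in (0, 1) for c in (0, 1) for d in (0, 1)}
```

With D = 0 the cell is an ordinary full adder; with D = 1 the same gates produce the difference bit and a borrow, so one cell type serves every row of the Booth array.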

6.51

Booth’s Algorithm Array Multiplication (cont’)

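[Figure: Booth array multiplier. Rows of CASS cells with the multiplicand a3 a2 a1 a0 applied across the columns and zero inputs at the top and right edges. Each row has a CTRL block that takes $X_i$ and $X_{i-1}$ (multiplier bits x3, x2, x1, x0, with 0 below x0) and drives the H and D lines of that row. Outputs P6 to P0.]

$H = X_i \oplus X_{i-1}$

$D = X_i$

Xi   Xi-1   H   D   Operation
0    0      0   d   Shift
0    1      1   0   Add
1    0      1   1   Subtract
1    1      0   d   Shift

(d: don't care)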

6.52

Generalized block diagram of an array multiplier

6.53

Q. Why use an array multiplier if it requires as many addition steps?

A1) An array multiplier is a combinational circuit, in which the signals flow without being clocked.

A multi-pass array multiplier normally uses a clock, but the cycle time for passing through k arrays is less than k Tc.

6.54

A2) Some speed-up schemes are possible.

e.g. E/O array, Wallace-tree

Even-Odd Array

6.55

Wallace-tree Multiplier

6.56

6 x 6 Wallace-tree Multiplier Example

$\mathrm{Delay} \propto \log_{3/2} n$ (n : width of the Wallace tree)

e.g. For 32-bit, number of adders necessary for each stage is

32 - 22 - 16 - 12 - 8 - 6 - 4 - 3 - 2

Total delay = 9 x adder delay

6.57

6.58

3. Datapath Generation

Datapath and its elements in bit-slice organization

[Figure: system blocks INPUT-OUTPUT, MEMORY, DATAPATH, and CONTROL.]

6.59

Two layout strategies for bit-slice datapath

6.60

Layout of 4-bit DP using layout strategy II (feedthrough)

6.61

1-D placement vs. 2-D placement

6.62

1-D placement vs. 2-D placement(Cont’)

6.63

Datapath Layout Flow

circuit design

floorplan : block ordering, bus track assignment

schematic drawing : tr. sizing

layout cell drawing : leaf cell layout

layout assemble : leaf cell integration (routing)

DRC / LVS : design rule check, layout vs. schematic

back-annotation : simulation with the exact capacitance

Flow: RTL description -> Floorplan -> Schematic Drawing -> Cell Drawing -> Layout Assemble -> DRC / LVS -> Back-Annotation -> Datapath Layout

6.64

Datapath Design Case (ACCENT HK386)

real mode support of x86 instruction set

enhanced (pipelined) datapath

problems & practices of general DP layout

6.65

Datapath structure

3 major blocks:

ALU, register file (32 bit)

barrel shifter (40 bit)

segment/effective address (32 bit)

[Figure: floorplan with the Segment/EA, ALU, Register File, and Barrel Shifter blocks.]

6.66

Track capacity

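6 vertical wires/track in metal 1

metal 3 reserved for P & G (power and ground) routing

[Figure: cell cross-section showing the VSS and VDD power rails, TRACK(6), the control and clock lines, the N-well and P-well regions, and the metal 1 and metal 2 layers.]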

6.67

Power Grid

From bottom & left (chip edges)

Considering IR drop

[Figure: power grid over the Segment/EA, ALU, RF, and BSH blocks.]

6.68

Cell Structure

Initial cell template decision:

N-well on the left

P-well on the right

data flow vertical

control flow horizontal

Similar cell structure to VTI

Cell width

– 80 for PMOS

– 70 for NMOS

[Figure: cell template with N-well and P-well regions of widths 70 and 80, with dimension marks 25, 10, 35 and 45, 10, 25.]

6.69

Cell Structure

A power line passes through every cell.

power line width : 10 (2 contacts)

power line location : 25 to the inside from the boundary

6.70

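Accent Cell Layout Flow (one student's lament)

Block Spec. -> Schematic -> SPICE

Simulated at first with assumed capacitance values.

Finished transistor sizing quickly.

Reasoned that because the capacitance values were not accurate, optimization was unnecessary and meeting the spec was enough.

Because accurate capacitance is available only after the whole design is assembled, stopped working on it for a long while.

If the layout is changed after assembly, everything has to be assembled again, which is an enormous amount of grunt work.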

6.71

Cell Design(I) Using 45 degree line for cell design

[Figure: cell layout with the data flow and control flow directions marked.]

6.72

Cell Design(II) needless effort to reduce cell size

ugly poly; current crowding

[Figure: cell layout with the data flow direction marked.]

6.73

Critical path used for transistor sizing in relevant datapath element

6.74

•Track assignment needs to be done before the cell layout (not after).

[Figure: assembled cells with the data flow direction marked.]

6.75

The value of grades

College grades and success in society have little correlation, and this is actually not surprising, because the factors behind success in society and the criteria for college grades are often quite different.
