CSE 246: Computer Arithmetic Algorithms and Hardware Design Instructor: Prof. Chung-Kuan Cheng Fall 2006 Lecture 8: Division


Page 1: CSE 246: Computer Arithmetic Algorithms and Hardware Design

CSE 246: Computer Arithmetic Algorithms and Hardware Design

Instructor: Prof. Chung-Kuan Cheng

Fall 2006

Lecture 8: Division

Page 2: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Topics:

  • Radix-4 SRT Division
  • Division by a Constant
  • Division by a Repeated Multiplication

Page 3: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Project Update

  • Come in to speak briefly about the final project.
  • Status update: 2:30 – 3:00 p.m., Tuesday or Thursday.

Page 4: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Radix-4 SRT Division

The recurrence is $4s_{j-1} = q_j d + s_j$, where

  • $q_j$ is in $[-2, 2]$ and $s_{j-1}$ is in $[-hd, +hd]$
  • $h \le 2/3$
  • therefore $s_{j-1}$ is in $[-2d/3, 2d/3]$
  • and $4s_{j-1}$ is in $[-8d/3, 8d/3]$

Multiplying by 4 shifts s to the left by 2 bits.
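As a rough illustration of this recurrence, the Python sketch below runs radix-4 digit selection with exact rational arithmetic. The function names and the nearest-integer selection rule are assumptions made for illustration only; SRT hardware picks $q_j$ from a few leading bits of the remainder and divisor instead of from an exact division.

```python
from fractions import Fraction

def srt_radix4(z, d, steps):
    """Radix-4 recurrence 4*s_{j-1} = q_j*d + s_j with q_j in [-2, 2].

    Assumes d > 0 and |z| <= 2d/3 (the slides' bound with h = 2/3).
    Digit selection here is idealized: nearest integer, clamped to [-2, 2].
    """
    s = Fraction(z)
    d = Fraction(d)
    assert d > 0 and abs(s) <= 2 * d / 3
    digits = []
    for _ in range(steps):
        shifted = 4 * s                       # left shift by 2 bits
        q = max(-2, min(2, round(shifted / d)))
        s = shifted - q * d                   # next partial remainder
        assert abs(s) <= 2 * d / 3            # invariant from the slide
        digits.append(q)
    return digits, s

def digits_to_value(digits):
    """Value of the signed-digit quotient: sum of q_j * 4^-j."""
    return sum(Fraction(q, 4 ** (j + 1)) for j, q in enumerate(digits))

digits, s = srt_radix4(Fraction(3, 10), Fraction(7, 10), 8)
quotient = digits_to_value(digits)
```

After n steps the identity $z = Q d + 4^{-n} s_n$ holds exactly, so the quotient error shrinks by a factor of 4 per step.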

Page 5: CSE 246: Computer Arithmetic Algorithms and Hardware Design

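Radix-4 SRT Division

[Figure: quotient-digit selection regions. Vertical axis: shifted partial remainder $4s_{j-1}$ (binary ticks 0.0 to 11.0). Horizontal axis: divisor d (binary ticks .101, .110, .111, 1.0). Boundary lines: $-2d/3$, $d/3$, $2d/3$, $d$, $4d/3$, $5d/3$, $8d/3$. Regions: $q_j = 0$, $q_j = 1$, $q_j = 2$.]

The overlap regions of $q_j$ denote a choice that still allows the recursion to continue. The gap defines the precision needed for carry-save addition.

Anything above $8d/3$ goes against our assumption and is therefore the infeasible region.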

Page 6: CSE 246: Computer Arithmetic Algorithms and Hardware Design

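Radix-4 SRT Division

The value of $q_j$ determines the range of $4s_{j-1}$ that it governs: since $s_j$ must stay in $[-2d/3, 2d/3]$, digit $q_j$ covers $[(q_j - 2/3)d, (q_j + 2/3)d]$.

For example, for $q_j = 1$:

  • upper bound: $1 + 2/3 = 5/3$
  • lower bound: $1 - 2/3 = 1/3$

The range is $d/3$ to $5d/3$.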
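The same calculation can be repeated for every digit. The sketch below (helper names are hypothetical) lists each digit's range in units of d and the overlap between neighbouring digits, which is where either digit is a valid choice.

```python
from fractions import Fraction

H = Fraction(2, 3)  # redundancy bound h from the slides

def digit_range(q):
    """Range of 4*s_{j-1}, in units of d, for which digit q keeps |s_j| <= h*d."""
    return (q - H, q + H)

def overlap(q_low, q_high):
    """Interval (in units of d) where both digits are valid, or None."""
    lo = digit_range(q_high)[0]
    hi = digit_range(q_low)[1]
    return (lo, hi) if lo <= hi else None

ranges = {q: digit_range(q) for q in (0, 1, 2)}
overlap_0_1 = overlap(0, 1)   # between d/3 and 2d/3
overlap_1_2 = overlap(1, 2)   # between 4d/3 and 5d/3
```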

Page 7: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division by a Constant

  • Multiplication has O(log n) delay, but division is linear in n and therefore much slower.
  • Try to convert division to multiplication.

Property: given an odd number d, there exists an m such that $d \cdot m = 2^n - 1$.

Examples:

  • d = 3, m = 5: $3 \cdot 5 = 2^4 - 1$
  • d = 7, m = 9: $7 \cdot 9 = 2^6 - 1$
  • d = 11, m = 93: $11 \cdot 93 = 2^{10} - 1$
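A brute-force search makes the property concrete. The sketch below (the function name is an assumption) returns the smallest n that works; any multiple of that n also works, which is why the slide can use n = 4 for d = 3 and n = 6 for d = 7.

```python
def find_multiplier(d):
    """Return (m, n) with the smallest n such that d * m == 2**n - 1.

    Only valid for odd positive d; the loop terminates because 2 is
    invertible modulo any odd number.
    """
    if d <= 0 or d % 2 == 0:
        raise ValueError("d must be a positive odd integer")
    n = 1
    while (2 ** n - 1) % d != 0:
        n += 1
    return (2 ** n - 1) // d, n

m11, n11 = find_multiplier(11)   # the slide's d = 11 example
```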

Page 8: CSE 246: Computer Arithmetic Algorithms and Hardware Design

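Division by a Constant

Since $d \cdot m = 2^n - 1$:

$$\frac{1}{d} = \frac{m}{2^n - 1} = \frac{m}{2^n (1 - 2^{-n})} = \frac{m}{2^n}(1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n})\cdots$$

using

$$\frac{1}{1 - r} = 1 + r + r^2 + r^3 + \cdots = (1 + r)(1 + r^2)(1 + r^4)(1 + r^8)\cdots$$

Example: $z/7 = zm/(2^n - 1)$ with m = 9, n = 6:

$$\frac{z}{7} = \frac{9z}{2^6 - 1} = \frac{9z}{2^6 (1 - 2^{-6})} = \frac{9z}{2^6}(1 + 2^{-6})(1 + 2^{-12})(1 + 2^{-24})\cdots$$

Each factor doubles the number of correct bits, so an n-bit result needs about $\log_2(n/6)$ operations (n here is the result width, not the exponent used above).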
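The product form maps onto shift-and-add steps. The sketch below is one possible rendering with exact fractions (the function name and argument layout are assumptions); with three factors and n = 6 the result equals $z/7$ scaled by $1 - 2^{-48}$.

```python
from fractions import Fraction

def divide_by_constant(z, m, n, factors):
    """Approximate z/d, where d*m == 2**n - 1, using `factors` terms of
    (1 + 2^-n)(1 + 2^-2n)(1 + 2^-4n)...

    Each factor is one shift plus one add in hardware.
    """
    r = Fraction(1, 2 ** n)
    q = Fraction(z * m, 2 ** n)   # z*m / 2^n
    for _ in range(factors):
        q += q * r                # multiply by (1 + r)
        r = r * r                 # next factor uses r squared
    return q

approx = divide_by_constant(100, 9, 6, 3)   # approximates 100/7
```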

Page 9: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division by Reciprocation

  • Find 1/d by iteration.
  • Newton-Raphson algorithm: $x_{i+1} = x_i - f(x_i)/f'(x_i)$
  • Set $f(x) = 1/x - d$, with $1/2 \le d < 1$. Then $f'(x) = -1/x^2$.
  • Thus $x_{i+1} = x_i (2 - x_i d)$.
  • Let $e_i = 1/d - x_i$. Then

$$e_{i+1} = \frac{1}{d} - x_{i+1} = \frac{1}{d} - x_i (2 - x_i d) = d\left(\frac{1}{d} - x_i\right)^2 = d e_i^2$$

  • The convergence rate is quadratic.
  • k iterations take 2k multiplications.

Page 10: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division by Reciprocation

Example: z/d = 3/0.7

  • $x_0 = 4(\sqrt{3} - 1) - 2d = 2.9282 - 2d = 1.5282$
  • $e_0 = 1/d - x_0 = 1/0.7 - 1.5282 = -0.0996286$
  • $x_1 = x_0 (2 - x_0 d) = 1.42164$
  • $e_1 = 1/d - x_1 = 1/0.7 - 1.42164 = 0.0069314$
  • $x_2 = x_1 (2 - x_1 d) = 1.4285377$
  • $e_2 = 1/d - x_2 = 1/0.7 - 1.4285377 = 0.0000337$
  • $x_3 = x_2 (2 - x_2 d) = 1.4285715$
  • $e_3 = 1/d - x_3 = 1/0.7 - 1.4285715 \approx -0.0000001$

The convergence rate is quadratic.
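The iteration is short enough to try directly. The following sketch uses ordinary floating point and the starting value from the example; the function name is an assumption, and the printed digits on the slide were produced with rounded intermediates, so the last digits can differ slightly.

```python
import math

def reciprocal_newton(d, iterations):
    """Approximate 1/d for 1/2 <= d < 1 with x_{i+1} = x_i * (2 - x_i * d)."""
    x = 4 * (math.sqrt(3) - 1) - 2 * d   # initial estimate x_0 from the example
    for _ in range(iterations):
        x = x * (2 - x * d)              # two multiplications per iteration
    return x

d = 0.7
x3 = reciprocal_newton(d, 3)
quotient = 3 * x3                        # z/d = z * (1/d) with z = 3
```

Because $e_{i+1} = d e_i^2$, every error after the first iteration is non-negative in exact arithmetic, so the estimates approach $1/d$ from below.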

Page 11: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division by Recursive Multiplication

$$q = \frac{z}{d} = \frac{z}{d} \cdot \frac{x_0}{x_0} \cdot \frac{x_1}{x_1} \cdots \frac{x_{k-1}}{x_{k-1}} \qquad \text{eq(a)}$$

  • Let $1/2 \le d < 1$.
  • Evaluating eq(a) takes 2k multiplications.
  • We also need k operations to find the $x_i$.

Page 12: CSE 246: Computer Arithmetic Algorithms and Hardware Design

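Division by a Repeated Multiplication

$$q = \frac{z}{d} = \frac{z}{d} \cdot \frac{x_0}{x_0} \cdot \frac{x_1}{x_1} \cdots \frac{x_{k-1}}{x_{k-1}}$$

Let $1/2 \le d < 1$. Set $d_0 = d$ and $x_k = 2 - d_k$.

  1. $d_1 = d x_0 = d(2 - d) = 1 - (1 - d)^2$
  2. $d_{k+1} = d_k x_k = d_k (2 - d_k) = 1 - (1 - d_k)^2$
  3. $1 - d_{k+1} = (1 - d_k)^2 = (1 - d)^{2^{k+1}}$

This is quadratic convergence. For k-bit operands (k here is the operand width, not the iteration index), we need $2m - 1$ multiplications and m 2's-complement operations, where $m = \lceil \log_2 k \rceil$, with $\log_2 m$ extra bits for precision.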

Page 13: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division by a Repeated Multiplication

Example: $q = z/d = 3/0.7$, with $d_0 = d = 0.7$, $x_k = 2 - d_k$, $d_{k+1} = d_k x_k$.

  1. $x_0 = 2 - d_0 = 1.3$, $d_1 = d_0 x_0 = 0.7 \times 1.3 = 0.91$
  2. $x_1 = 2 - d_1 = 1.09$, $d_2 = d_1 x_1 = 0.91 \times 1.09 = 0.9919$
  3. $x_2 = 2 - d_2 = 1.0081$, $d_3 = d_2 x_2 = 0.9919 \times 1.0081 = 0.99993439$
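The example tracks only the denominator; the numerator is multiplied by the same factors and converges to the quotient. A minimal sketch with exact fractions (the function name is an assumption) shows both sequences:

```python
from fractions import Fraction

def repeated_multiplication_divide(z, d, steps):
    """Multiply numerator and denominator by x_k = 2 - d_k at each step.

    Assumes 1/2 <= d < 1. The denominator tends to 1, so the numerator
    tends to z/d. Returns (numerator, denominator) after `steps` steps.
    """
    z, d = Fraction(z), Fraction(d)
    for _ in range(steps):
        x = 2 - d            # one 2's-complement style operation
        z, d = z * x, d * x  # two multiplications
    return z, d

q3, d3 = repeated_multiplication_divide(3, Fraction(7, 10), 3)
# d3 is 1 - 0.3**8 exactly; q3 is 3/0.7 scaled by the same factor
```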

Page 14: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division Methods

  • Iteration
  • Memory
  • Arithmetic

Page 15: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division – Iteration effort

Pencil-and-paper method: $A = QB + 2^{-n}R$ with $R < B$.

1 bit of partial quotient per iteration, n iterations.

Example: A = 0.1001, B = 0.1010; Q = A / B.

  • $Q_i$: partial quotient
  • $R_i$: partial remainder
  • $R_i = R_{i-1} - B Q_i$, with $R_0 = A$

  i    Q_i       R_i
  0              0.1001
  1    0.1       0.0100
  2    0.01      0.00011
  3    0.001     0.000001
  4    0.0000    0.000001

Q = 0.1110, and the final remainder is $2^{-4} R$ with R = 0.0100.
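The pencil-and-paper method is ordinary restoring division. The sketch below works on integers that hold the fractional bits of A and B (the function name and integer encoding are assumptions) and produces one quotient bit per iteration.

```python
def long_division(a, b, n):
    """Binary long division of fractions a/2^n and b/2^n, with a < b.

    Returns (q, r) as n-bit integers such that a * 2^n == q * b + r
    and 0 <= r < b, i.e. A = Q*B + 2^-n * R in the slide's notation.
    """
    q, r = 0, a
    for _ in range(n):
        r <<= 1              # bring the next bit position into play
        q <<= 1
        if r >= b:           # trial subtraction succeeds: quotient bit 1
            r -= b
            q |= 1
    return q, r

# A = 0.1001, B = 0.1010 as 4-bit fractions
q, r = long_division(0b1001, 0b1010, 4)
# q == 0b1110 and r == 0b0100
```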

Page 16: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division – Memory effort

  • A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.
  • SRT method: a lookup table stores m-bit partial quotients selected by m bits of the partial remainder and m bits of the divisor.
  • Table size: $2^{2m} \times m$ bits.
  • The SRT method is limited by the memory wall.
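A quick calculation shows how fast the table grows with m. This is only a sizing sketch based on the $2^{2m} \times m$ formula; the helper name is hypothetical.

```python
def table_bits(m):
    """Size in bits of a table indexed by 2m address bits with m-bit entries."""
    return (2 ** (2 * m)) * m

# Doubling m squares the number of entries.
sizes = {m: table_bits(m) for m in (2, 4, 8, 16)}
# m = 8 needs 524288 bits (64 KiB); m = 16 needs 2**36 bits (8 GiB)
```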

Page 17: CSE 246: Computer Arithmetic Algorithms and Hardware Design

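Division – Arithmetic effort

The partial quotient is calculated by arithmetic functions.

Prescaling: with $E \approx 1/d$,

$$\frac{z}{d} = \frac{zE}{dE} = \frac{z'}{d'}, \qquad d' = dE \approx 1$$

Taylor expansion: with $d = d_h + d_l$,

$$Q_i = R_{i-1} E, \qquad E = \frac{1}{d} \approx \frac{1}{d_h} - \frac{d_l}{d_h^2} + \frac{d_l^2}{d_h^3}$$

Series expansion: with $d = 1 - X$,

$$Q_i = R_{i-1} E, \qquad E = \frac{1}{d} = \frac{1}{1 - X} = 1 + X + X^2 + X^3 + \cdots = (1 + X)(1 + X^2)(1 + X^4)\cdots$$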

Page 18: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division – Solution space

Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.

[Figure: solution space spanned by three axes (Iteration Effort, Memory Effort, Arithmetic Effort). Labeled methods: Pencil-and-paper, SRT, Prescaling, Taylor Expansion, Series Expansion. Annotations: Memory Wall, Low area, Low latency, Our target.]

Page 19: CSE 246: Computer Arithmetic Algorithms and Hardware Design

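Division – PST algorithm

  • Utilize the power of series expansion, which needs a good starting point.
  • Prescaling provides a scaled divisor close to 1.
  • A 0-order Taylor expansion iterates to reach the final quotient.

Prescaling:

$$\frac{z}{d} = \frac{zE}{dE} = \frac{z'}{d'}, \qquad d' = dE \approx 1$$

Series expansion with $d' = 1 - X$:

$$E = 1 + X \approx \frac{1}{d'}, \qquad d'E = (1 - X)(1 + X) = 1 - X^2$$

Iteration:

$$Q_i = R_{i-1} E$$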

Page 20: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division – PST algorithm

Algorithm, with $R_0 = z_1$:

  1. $E_0 = \mathrm{Table}(d(m)) \approx 1/d$
  2. $z_1 = z E_0$; $d_1 = d E_0$
  3. $E_1 = (2 - d_1) \approx \mathrm{INV}(d_1(2m))$
  4. $Q_i = R_{i-1} E_1$
  5. $R_i = R_{i-1} - Q_i d_1$
  6. $Q = Q + Q_i$

Example (m = 4):

  z  = 0.1011,0110
  d  = 0.1100,1011
  d(m) = 0.1100
  E0 = 1.0011
  z1 = z E0 = 0.1101,1000,0010
  d1 = d E0 = 0.1111,0001,0001
  E1 = INV(d1(2m)) = 1.0000,1110
  Q1 = z1 E1 = 0.1110,0011
  R1 = z1 − Q1 d1 = 0.0000,0010,0101,1110,1101
  Q2 = R1 E1 = 0.1001,1111
  R2 = R1 − Q2 d1 = 0.0000,0001,1111,1011,0001
  Q  = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11

Q2 and R2 are shown left-aligned; Q2 enters the final sum shifted right by 6 bits.
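The example can be replayed with exact fractions. The sketch below makes several assumptions that the slides do not state explicitly: E0 is taken as a given table output, INV is modelled as $2 - d_1$ truncated to 2m fractional bits, the first remainder is $z_1 - Q_1 d_1$, and the partial quotients are truncated to 8 and then 14 fractional bits, chosen to match the bit strings in the example.

```python
from fractions import Fraction
from math import floor

def truncate(x, bits):
    """Drop everything below 2^-bits (rounds toward minus infinity)."""
    return Fraction(floor(x * 2 ** bits), 2 ** bits)

def pst_divide(z, d, e0, m=4, q_bits=(8, 14)):
    """Prescale by e0, then refine with E1 ~ 2 - d1 (0-order Taylor step).

    q_bits lists the fractional width kept for each partial quotient;
    these widths are an assumption fitted to the slide's example.
    """
    z1, d1 = z * e0, d * e0            # prescaling: d1 is close to 1
    e1 = truncate(2 - d1, 2 * m)       # stand-in for INV(d1(2m))
    q, r = Fraction(0), z1             # R_0 = z1
    for bits in q_bits:
        qi = truncate(r * e1, bits)    # Q_i = R_{i-1} * E1
        r = r - qi * d1                # R_i = R_{i-1} - Q_i * d1
        q += qi
    return q, r

z = Fraction(0b10110110, 2 ** 8)       # 0.1011,0110
d = Fraction(0b11001011, 2 ** 8)       # 0.1100,1011
e0 = Fraction(0b10011, 2 ** 4)         # 1.0011, taken from the slide
q, r = pst_divide(z, d, e0)
```

Under these assumptions the result matches the slide's Q = 0.1110,0101,0111,11, which is within $2^{-12}$ of the exact quotient z/d.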

Page 21: CSE 246: Computer Arithmetic Algorithms and Hardware Design

Division – FPGA Implementation

The PST algorithm is suitable for high-performance division unit design in FPGAs.

  Design                Fmax (period)           ALUTs   Memory bits   DSP blocks   Power (dynamic + static)    Throughput
  IP Core (no DSP)      50.16 MHz (19.935 ns)   1203    84            0            381 mW (52 mW + 329 mW)     50.16 Mdiv/s
  PST (DSP)             72.8 MHz (13.737 ns)    213     768           28           350 mW (23 mW + 327 mW)     24.3 Mdiv/s
  PST (no DSP)          73.20 MHz (13.661 ns)   1437    768           0            378 mW (50 mW + 328 mW)     24.4 Mdiv/s
  PST-pipelined (DSP)   74.15 MHz (13.486 ns)   261     768           40           344 mW (17 mW + 327 mW)     74.15 Mdiv/s
  PSTp (no DSP)         76.05 MHz (13.150 ns)   1940    768           0            359 mW (31 mW + 328 mW)     76.05 Mdiv/s

32-bit division with 5-cycle latency.