CSE 246: Computer Arithmetic Algorithms and Hardware Design
Instructor: Prof. Chung-Kuan Cheng
Fall 2006
Lecture 8: Division
Topics:
Radix-4 SRT Division
Division by a Constant
Division by Repeated Multiplication
Project Update
Come in to speak briefly about the final project.
Status update: 2:30 – 3:00 p.m. Tuesday or Thursday
Radix-4 SRT Division
$4 s_{j-1} = q_j d + s_j$, where $q_j \in [-2, 2]$ and $s_{j-1} \in [-hd, +hd]$
$h \le 2/3$; therefore $s_{j-1} \in [-2d/3, 2d/3]$ and $4 s_{j-1} \in [-8d/3, 8d/3]$
$s$ shifts to the left by 2 bits
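Radix-4 SRT Division
[Figure: plot of $4s_{j-1}$ (vertical axis, binary ticks 0.0 to 11.0) against the divisor $d$ (horizontal axis, binary ticks .101 to 1.00). Lines at $d/3$, $2d/3$, $4d/3$, $5d/3$, and $8d/3$ bound the regions labeled $q_j = 0$, $q_j = 1$, and $q_j = 2$; the line $-2d/3$ is the lower edge of the $q_j = 0$ region.]
The overlap regions between adjacent values of $q_j$ mark where either choice still allows the recursion to continue. The gap (the width of the overlap) defines the precision required for carry-save addition.
Anything above $8d/3$ goes against our assumption on $s_{j-1}$ and is therefore the infeasible region.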
Radix-4 SRT Division
The value of $q_j$ determines the range of $4s_{j-1}$ it governs: $[(q_j - 2/3)d, (q_j + 2/3)d]$.
For example, $q_j = 1$: $1 + 2/3 = 5/3$ and $1 - 2/3 = 1/3$, so the range is $d/3$ to $5d/3$.
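The recurrence and the digit ranges can be checked with a small sketch. This is not the table-driven hardware selection shown on the slides: it assumes exact rational arithmetic and picks the digit nearest to $4s_{j-1}/d$, which always falls inside that digit's range. The function names are hypothetical.

```python
from fractions import Fraction

H = Fraction(2, 3)  # redundancy factor h from the slides


def selection_interval(q):
    """Range of 4*s_{j-1}, in units of d, that keeps |s_j| <= h*d for digit q."""
    return (q - H, q + H)


def radix4_srt_divide(z, d, steps):
    """Radix-4 SRT recurrence with exact comparisons (assumption: no lookup table).

    Requires |z| <= h*d. Returns (quotient, final partial remainder s)
    with z == quotient * d + s / 4**steps.
    """
    z, d = Fraction(z), Fraction(d)
    if abs(z) > H * d:
        raise ValueError("need |z| <= 2d/3")
    s, quotient = z, Fraction(0)
    for j in range(1, steps + 1):
        shifted = 4 * s                          # shift left by 2 bits
        q = max(-2, min(2, round(shifted / d)))  # nearest digit, clamped to [-2, 2]
        s = shifted - q * d                      # s_j = 4*s_{j-1} - q_j*d
        assert abs(s) <= H * d                   # remainder stays in range
        quotient += Fraction(q, 4 ** j)
    return quotient, s


if __name__ == "__main__":
    print(selection_interval(1))
    print(radix4_srt_divide(Fraction(3, 8), Fraction(5, 8), 4))
```

Adjacent ranges overlap (for example $q_j = 1$ and $q_j = 2$ share $4d/3$ to $5d/3$), which is what lets a hardware selector decide from only a few leading bits.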
Division by a Constant
Multiplication is $O(\log n)$ but division is linear, which is much slower.
Try to convert division to multiplication.
Property: given an odd number $d$, $\exists\, m$ such that $d \cdot m = 2^n - 1$.
Ex. $d = 3$, $m = 5$: $3 \cdot 5 = 2^4 - 1$; $d = 7$, $m = 9$: $7 \cdot 9 = 2^6 - 1$; $d = 11$, $m = 93$: $11 \cdot 93 = 2^{10} - 1$
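A brute-force search illustrates the property. This is a minimal sketch with a hypothetical helper name; it lists every exponent up to a bound, so it also shows that the slide's examples for 3 and 7 are not the smallest possible pairs.

```python
def multipliers(d, max_n):
    """All (n, m) with d * m == 2**n - 1 and 1 <= n <= max_n, for odd d."""
    if d % 2 == 0:
        raise ValueError("d must be odd")
    pairs = []
    for n in range(1, max_n + 1):
        if (2 ** n - 1) % d == 0:
            pairs.append((n, (2 ** n - 1) // d))
    return pairs


if __name__ == "__main__":
    for d, max_n in ((3, 4), (7, 6), (11, 10)):
        print(d, multipliers(d, max_n))
```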
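Division by a Constant
$\frac{1}{d} = \frac{m}{2^n - 1} = \frac{m}{2^n (1 - 2^{-n})} = \frac{m}{2^n} (1 + 2^{-n})(1 + 2^{-2n})(1 + 2^{-4n}) \cdots$
using $\frac{1}{1 - r} = 1 + r + r^2 + r^3 + \cdots = (1 + r)(1 + r^2)(1 + r^4)(1 + r^8) \cdots$
Example: $z/7 = zm/(2^n - 1)$ with $m = 9$, $n = 6$
$\frac{z}{7} = \frac{9z}{2^6 (1 - 2^{-6})} \approx \frac{9z}{2^6} (1 + 2^{-6})(1 + 2^{-12})(1 + 2^{-24})$
About $\log_2(n/6)$ shift-and-add operations, where $n$ here is the number of result bits required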
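The accuracy of the truncated product can be checked exactly with rational arithmetic. The sketch below is an illustration with a hypothetical function name; it evaluates the factors rather than modeling the shift-and-add hardware.

```python
from fractions import Fraction


def reciprocal_by_factors(m, n, num_factors):
    """(m / 2**n) * (1 + r)(1 + r**2)(1 + r**4)... with r = 2**-n."""
    r = Fraction(1, 2 ** n)
    approx = Fraction(m, 2 ** n)
    for i in range(num_factors):
        approx *= 1 + r ** (2 ** i)  # each factor is one shift and one add
    return approx


if __name__ == "__main__":
    approx = reciprocal_by_factors(9, 6, 3)
    print(float(Fraction(1, 7) - approx))  # remaining error in 1/7
```

With three factors the product equals $(1/7)(1 - 2^{-48})$, so each extra factor doubles the number of correct bits.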
Division by Reciprocation
Find $1/d$ by iteration.
Newton-Raphson algorithm: $x_{i+1} = x_i - f(x_i)/f'(x_i)$
Set $f(x) = 1/x - d$ $(1/2 \le d < 1)$. We have $f'(x) = -1/x^2$.
Thus $x_{i+1} = x_i (2 - x_i d)$.
Let $e_i = 1/d - x_i$.
We have $e_{i+1} = 1/d - x_{i+1} = 1/d - x_i (2 - x_i d) = d (1/d - x_i)^2 = d e_i^2$.
The convergence rate is quadratic. $k$ iterations take $2k$ multiplications.
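The two algebraic steps that the slide compresses can be written out as follows, using only the definitions of $f$, $x_i$, and $e_i$ given above.

```latex
\begin{aligned}
x_{i+1} &= x_i - \frac{1/x_i - d}{-1/x_i^2}
         = x_i + x_i^2\left(\frac{1}{x_i} - d\right)
         = x_i (2 - x_i d), \\
e_{i+1} &= \frac{1}{d} - x_i (2 - x_i d)
         = \frac{1}{d} - 2 x_i + d x_i^2
         = d\left(\frac{1}{d^2} - \frac{2 x_i}{d} + x_i^2\right)
         = d\left(\frac{1}{d} - x_i\right)^2
         = d\, e_i^2 .
\end{aligned}
```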
Division by Reciprocation
$z/d = 3/0.7$
$x_0 = 4(\sqrt{3} - 1) - 2d = 2.9282 - 2d = 1.5282$
$e_0 = 1/d - x_0 = 1/0.7 - 1.5282 = -0.0996286$
$x_1 = x_0 (2 - x_0 d) = 1.42164$
$e_1 = 1/d - x_1 = 1/0.7 - 1.42164 = 0.0069314$
$x_2 = x_1 (2 - x_1 d) = 1.4285377$
$e_2 = 1/d - x_2 = 1/0.7 - 1.4285377 = 0.0000337$
$x_3 = x_2 (2 - x_2 d) = 1.4285715$
$e_3 = 1/d - x_3 = 1/0.7 - 1.4285715 = -0.000000(1)$
The convergence rate is quadratic.
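The iteration is easy to reproduce in floating point. The sketch below assumes the same starting value as the slide, $x_0 = 4(\sqrt{3} - 1) - 2d$; the function name is hypothetical, and the printed digits may differ from the hand-rounded slide values in the last places.

```python
import math


def reciprocal_newton(d, iterations):
    """Return [x_0, x_1, ..., x_k] approximating 1/d for 1/2 <= d < 1."""
    if not 0.5 <= d < 1:
        raise ValueError("d must satisfy 1/2 <= d < 1")
    x = 4 * (math.sqrt(3) - 1) - 2 * d  # linear initial estimate from the slide
    history = [x]
    for _ in range(iterations):
        x = x * (2 - x * d)             # two multiplications per iteration
        history.append(x)
    return history


if __name__ == "__main__":
    z, d = 3, 0.7
    xs = reciprocal_newton(d, 3)
    for i, x in enumerate(xs):
        print(f"x{i} = {x:.7f}  e{i} = {1 / d - x:+.7f}")
    print("q =", z * xs[-1])
```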
Division by Recursive Multiplication
$q = z/d = (z/d)(x_0/x_0)(x_1/x_1) \cdots (x_{k-1}/x_{k-1})$   eq(a)
Let $1/2 \le d < 1$.
It takes $2k$ multiplications for eq(a).
We also need $k$ operations to find $x_i$.
Division by Repeated Multiplication
$q = z/d = (z/d)(x_0/x_0)(x_1/x_1) \cdots (x_{k-1}/x_{k-1})$
Let $1/2 \le d < 1$. Set $d_0 = d$, $x_k = 2 - d_k$.
1. $d_1 = d x_0 = d(2 - d) = 1 - (1 - d)^2$
2. $d_{k+1} = d_k x_k = d_k (2 - d_k) = 1 - (1 - d_k)^2$
3. $1 - d_{k+1} = (1 - d_k)^2 = (1 - d)^{2^{k+1}}$: quadratic convergence
For $k$-bit operands, we need $2m - 1$ multiplications and $m$ 2's complement operations, where $m = \lceil \log_2 k \rceil$, with $\log_2 m$ extra bits for precision.
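The steps above can be sketched as a short loop. It is an illustration in exact rational arithmetic, not a fixed-width hardware model, and the function name is hypothetical. Each pass multiplies both the running quotient and the running divisor by $x_k = 2 - d_k$.

```python
from fractions import Fraction


def divide_by_repeated_multiplication(z, d, iterations):
    """Return (q_k, d_k) after `iterations` steps; q_k -> z/d as d_k -> 1."""
    z, d = Fraction(z), Fraction(d)
    if not Fraction(1, 2) <= d < 1:
        raise ValueError("d must satisfy 1/2 <= d < 1")
    q, dk = z, d
    for _ in range(iterations):
        xk = 2 - dk   # 2's complement of d_k
        q *= xk       # numerator:   z * x_0 * ... * x_k
        dk *= xk      # denominator: d * x_0 * ... * x_k
        # In hardware the last denominator update can be skipped, which is
        # where the 2m - 1 multiplication count comes from; it is kept here
        # so the caller can inspect how close d_k is to 1.
    return q, dk


if __name__ == "__main__":
    q, dk = divide_by_repeated_multiplication(3, Fraction(7, 10), 3)
    print(float(q), float(dk))
```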
Division by Repeated Multiplication
$q = z/d = 3/0.7 = (z/d)(x_0/x_0)(x_1/x_1) \cdots (x_{k-1}/x_{k-1})$
$d_0 = d = 0.7$, $x_k = 2 - d_k$, $d_{k+1} = d_k x_k$
1. $x_0 = 2 - d_0 = 1.3$, $d_1 = d_0 x_0 = 0.7 \times 1.3 = 0.91$
2. $x_1 = 2 - d_1 = 1.09$, $d_2 = d_1 x_1 = 0.91 \times 1.09 = 0.9919$
3. $x_2 = 2 - d_2 = 1.0081$, $d_3 = d_2 x_2 = 0.9919 \times 1.0081 = 0.9999343$
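The slide stops at the denominator. Carrying the same three factors through the numerator gives the quotient estimate, shown here as a worked check of the numbers above:

```latex
\begin{aligned}
q \approx z\,x_0\,x_1\,x_2 &= 3 \times 1.3 \times 1.09 \times 1.0081 = 4.2854331, \\
\frac{z}{d} &= \frac{3}{0.7} = 4.2857143\ldots, \\
z\,x_0\,x_1\,x_2 &= \frac{z}{d}\, d_3 \quad\text{with}\quad 1 - d_3 = (1 - d)^{8} = 0.3^{8} \approx 6.6 \times 10^{-5}.
\end{aligned}
```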
Division Methods
Iteration
Memory
Arithmetic
Division – Iteration effort
Pencil-and-paper method: $A = QB + 2^{-n}R$ and $R < B$
1 bit of partial quotient per iteration, $n$ iterations
$A = 0.1001$, $B = 0.1010$; $Q = A / B$
$Q_i$: partial quotient
$R_i$: partial remainder
$R_{i+1} = R_i - B Q_i$
$R_0 = A = 0.1001$
$Q_1 = 0.1$, $R_1 = R_0 - B Q_1 = 0.0100$
$Q_2 = 0.01$, $R_2 = R_1 - B Q_2 = 0.00011$
$Q_3 = 0.001$, $R_3 = R_2 - B Q_3 = 0.000001$
$Q_4 = 0.0000$, $R_4 = R_3 = 0.000001$
$Q = Q_1 + Q_2 + Q_3 + Q_4 = 0.1110$, with $2^{-4} R = 0.000001$, so $R = 0.0100 < B$
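A one-bit-per-iteration divider takes only a few lines when the fractions are held as integers. This is a minimal sketch of restoring division under the assumption that $A$ and $B$ are $n$-bit fractions passed as their integer numerators; the function name is hypothetical.

```python
def pencil_and_paper_divide(a, b, n):
    """Divide a / 2**n by b / 2**n, producing n quotient bits.

    Returns (q, r) as integers with a * 2**n == q * b + r and 0 <= r < b,
    i.e. A = Q*B + 2**-n * R for the fractions Q = q / 2**n, R = r / 2**n.
    """
    if not 0 <= a < b:
        raise ValueError("need 0 <= a < b so that the quotient is a fraction")
    q, r = 0, a
    for _ in range(n):
        r <<= 1            # bring down the next bit position
        q <<= 1
        if r >= b:         # trial subtraction succeeds: quotient bit is 1
            r -= b
            q |= 1
    return q, r


if __name__ == "__main__":
    # Operands from the slide: A = 0.1001, B = 0.1010 (binary).
    q, r = pencil_and_paper_divide(0b1001, 0b1010, 4)
    print(f"Q = 0.{q:04b}, R = 0.{r:04b}")
```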
Division – Memory effort
A lookup table is the simplest way to obtain multiple partial quotient bits in each iteration.
SRT method: a lookup table stores $m$-bit partial quotients selected by $m$ bits of the partial remainder and $m$ bits of the divisor.
Table size: $2^{2m} \times m$ bits
The SRT method is limited by the memory wall.
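The memory wall follows directly from the table-size formula. The short sketch below simply evaluates $2^{2m} \times m$ for a few values of $m$; the function name is hypothetical.

```python
def srt_table_bits(m):
    """Bits in a table addressed by 2m bits that stores m-bit partial quotients."""
    return (2 ** (2 * m)) * m


if __name__ == "__main__":
    for m in (2, 4, 8, 16):
        print(f"m = {m:2d}: {srt_table_bits(m):,} bits")
```

Doubling $m$ squares the number of entries, so retiring many quotient bits per iteration by lookup alone quickly becomes impractical.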
Division – Arithmetic effort
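The partial quotient is calculated by arithmetic functions.
Prescaling: $\frac{z}{d} = \frac{zE}{dE} = \frac{z'}{d'}$ with $E \approx 1/d$, so that $d' \approx 1$ and $Q_i \approx R_i$
Taylor expansion: $E \approx \frac{1}{d} = \frac{1}{d_h} - \frac{d_l}{d_h^2} + \frac{d_l^2}{d_h^3} - \cdots$ with $d = d_h + d_l$, and $Q_i = R_i E$
Series expansion: $d = 1 - X$, $E = \frac{1}{d} = 1 + X + X^2 + X^3 + \cdots = (1 + X)(1 + X^2)(1 + X^4) \cdots$, and $Q_i = R_i E$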
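To give a feel for how quickly a truncated Taylor expansion of $1/d$ converges, the sketch below splits a divisor into a high part $d_h$ and a low part $d_l$ and sums the first few terms exactly. The split into 4 high bits and the helper name are assumptions for illustration; the divisor value 0.1100,1011 is borrowed from the PST example in these slides.

```python
from fractions import Fraction


def taylor_reciprocal(d_h, d_l, terms):
    """First `terms` terms of 1/(d_h + d_l) = (1/d_h) * sum((-d_l/d_h)**k)."""
    r = Fraction(d_l) / Fraction(d_h)
    return sum((-r) ** k for k in range(terms)) / Fraction(d_h)


if __name__ == "__main__":
    d_h = Fraction(0b1100, 2 ** 4)   # high 4 bits: 0.1100
    d_l = Fraction(0b1011, 2 ** 8)   # low 4 bits:  0.0000,1011
    d = d_h + d_l
    for terms in (1, 2, 3):
        err = taylor_reciprocal(d_h, d_l, terms) - 1 / d
        print(terms, float(err))
```

The error after $t$ terms is exactly $-(-d_l/d_h)^t / d$, so each added term gains roughly as many bits as $d_h$ has.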
Division – Solution space
Modern FPGAs contain plenty of memory and built-in multipliers, which enable high-performance dividers.
[Figure: solution space with three axes: Iteration Effort, Memory Effort, and Arithmetic Effort. Plotted methods: Pencil-and-paper, SRT, Prescaling, Taylor Expansion, Series Expansion. Annotations: Memory Wall, Low area, Low latency, Our target.]
Division – PST algorithm
Utilize the power of series expansion, which needs a good starting point.
Prescaling provides a scaled divisor close to 1.
0-order Taylor expansion iterates to reach the final quotient.
$\frac{z}{d} = \frac{zE}{dE} = \frac{z'}{d'}$, with $d' = dE \approx 1$
$d' = 1 - X$, $E' = \frac{1}{d'} = \frac{1}{1 - X} = (1 + X)(1 + X^2) \cdots$
$Q_i = R_i E'$
Division – PST algorithm
$E_0 = \mathrm{Table}(d(m)) \approx 1/d$
$z_1 = z E_0$; $d_1 = d E_0$
$E_1 = (2 - d_1) \approx \mathrm{INV}(d_1(2m))$
$Q_i = R_{i-1} E_1$
$R_i = R_{i-1} - Q_i d_1$
$Q = Q + Q_i$
Example:
$z = 0.1011,0110$; $d = 0.1100,1011$
$d(m) = 0.1100$; $E_0 = 1.0011$
$z_1 = z E_0 = 0.1101,1000,0010$; $d_1 = d E_0 = 0.1111,0001,0001$
$E_1 = \mathrm{INV}(d_1(2m)) = 1.0000,1110$
$Q_1 = z_1 E_1 = 0.1110,0011$; $R_1 = z_1 - Q_1 d_1 = 0.0000,0010,0101,1110,1101$
$Q_2 = R_1 E_1 = 0.1001,1111$; $R_2 = R_1 - Q_2 d_1 = 0.0000,0001,1111,1011,0001$ (both shown shifted left by 6 bits)
$Q = 0.1110,0011 + 0.0000,0010,0111,11 = 0.1110,0101,0111,11$
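The worked example can be replayed with exact fractions. The sketch below is a reading of the slide, not the authors' implementation, and rests on three assumptions: $E_0$ is taken directly from the slide rather than from a table, INV is modeled as bitwise inversion of the top $2m$ fraction bits of $d_1$ with a leading 1 restored, and the partial quotients are truncated to 8 and 14 fraction bits to match the digits shown. All names are hypothetical.

```python
import math
from fractions import Fraction


def truncate(x, frac_bits):
    """Keep `frac_bits` fraction bits of a non-negative value, dropping the rest."""
    scale = 2 ** frac_bits
    return Fraction(math.floor(x * scale), scale)


def pst_example(m=4):
    z = Fraction(0b10110110, 2 ** 8)   # 0.1011,0110
    d = Fraction(0b11001011, 2 ** 8)   # 0.1100,1011
    e0 = Fraction(0b10011, 2 ** 4)     # 1.0011, value read off the slide

    z1, d1 = z * e0, d * e0            # prescaling: d1 is close to 1

    width = 2 * m
    top = math.floor(d1 * 2 ** width)  # top 2m fraction bits of d1
    e1 = 1 + Fraction((2 ** width - 1) - top, 2 ** width)  # inverted bits

    q1 = truncate(z1 * e1, 8)          # first partial quotient
    r1 = z1 - q1 * d1
    q2 = truncate(r1 * e1, 14)         # second partial quotient, 6 bits lower
    r2 = r1 - q2 * d1
    return {"z1": z1, "d1": d1, "e1": e1, "q1": q1, "r1": r1,
            "q2": q2, "r2": r2, "q": q1 + q2, "exact": z / d}


if __name__ == "__main__":
    result = pst_example()
    print(float(result["q"]), float(result["exact"]))
```

Under these assumptions the intermediate values agree bit for bit with the slide, once the 6-bit shift of $Q_2$ and $R_2$ is undone.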
Division – FPGA Implementation
The PST algorithm is suitable for high-performance division unit design in FPGAs.
Design | Fmax (Period) | ALUTs | Memory Bits | DSP Blocks | Power Consumption (Dynamic + Static) | Throughput
IP Core (no DSP) | 50.16 MHz (19.935 ns) | 1203 | 84 | 0 | 381 mW (52 mW + 329 mW) | 50.16 Mdiv/s
PST (DSP) | 72.8 MHz (13.737 ns) | 213 | 768 | 28 | 350 mW (23 mW + 327 mW) | 24.3 Mdiv/s
PST (no DSP) | 73.20 MHz (13.661 ns) | 1437 | 768 | 0 | 378 mW (50 mW + 328 mW) | 24.4 Mdiv/s
PST-pipelined (DSP) | 74.15 MHz (13.486 ns) | 261 | 768 | 40 | 344 mW (17 mW + 327 mW) | 74.15 Mdiv/s
PST-pipelined (no DSP) | 76.05 MHz (13.150 ns) | 1940 | 768 | 0 | 359 mW (31 mW + 328 mW) | 76.05 Mdiv/s
32-bit division with 5-cycle latency
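One way to read the throughput column is to divide the clock rate by the throughput, which gives the number of clock cycles between successive results. The sketch below only rearranges the reported numbers; the interpretation as cycles per division is an assumption, not something the slide states.

```python
# (Fmax in MHz, throughput in Mdiv/s), copied from the reported results
RESULTS = {
    "IP Core (no DSP)": (50.16, 50.16),
    "PST (DSP)": (72.8, 24.3),
    "PST (no DSP)": (73.20, 24.4),
    "PST-pipelined (DSP)": (74.15, 74.15),
    "PST-pipelined (no DSP)": (76.05, 76.05),
}


def cycles_per_division(fmax_mhz, throughput_mdiv_s):
    """Clock cycles between successive results at the reported rates."""
    return fmax_mhz / throughput_mdiv_s


if __name__ == "__main__":
    for name, (fmax, tput) in RESULTS.items():
        print(f"{name}: {cycles_per_division(fmax, tput):.2f} cycles per division")
```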