A Fast and Compact Circuit for Integer Square Root Computation Based on Mitchell Logarithmic Method
Joshua Yung Lih Low, Ching Chuen Jong, Jeremy Yung Shern Low, Thian Fatt Tay, and Chip-Hong Chang
School of Electrical and Electronic Engineering, Nanyang Technological University
Nanyang Avenue, Singapore 639798 {josh0013}@e.ntu.edu.sg
Abstract: A novel non-iterative circuit for computing the integer square root based on logarithms is proposed in this paper. Mitchell's methods are used for the logarithmic and antilogarithmic conversions. The proposed method merges the two conversion stages into a single one to achieve better accuracy with a compact architecture. Hence, the circuit size and latency are reduced. Compared to an existing design based on the modified Dijkstra algorithm used in a coherent receiver, the proposed design is either 8 times smaller or 9 times faster for 16-bit integer input.
I. INTRODUCTION
Various algorithms have been proposed to evaluate the square root in floating point. They can be broadly classified into non-iterative and iterative categories. The non-iterative category includes the piecewise table-lookup [1] and the tables-and-additions [2] algorithms. The Newton-Raphson and the Goldschmidt algorithms are two examples in the iterative category. Integer arithmetic is often used in application-specific processors because it simplifies the implementation of arithmetic operations [3]. One instance is the integer square root operation in the coherent receiver [4], which uses the modified Dijkstra algorithm [3]. The modified Dijkstra algorithm is iterative in nature and can be implemented in a looped or a pipelined architecture. The looped architecture suffers from low throughput, while the pipelined architecture has a high area cost.
Logarithms can be used to simplify the computation of arithmetic functions such as multiplication and division. Conventionally, the computation has three stages: logarithmic conversion, operation in the logarithmic domain and antilogarithmic conversion. In this work, we studied the computation of the integer square root using Mitchell's logarithmic algorithms [5-14]. We investigated the 3-stage method and analyzed the errors in the logarithmic and antilogarithmic conversions. Based on the error characteristics, we found that if the 3 stages are merged into a single stage, not only can the errors be reduced but the computation speed can be increased as well. Hence, we propose a novel non-iterative single-stage circuit for integer square root computation based on the Mitchell logarithm. The single-stage circuit requires only one error correction function. As a result, higher computation speed, smaller circuit size and more accurate results are achieved. To the best of our knowledge, this work is the first to investigate a non-iterative algorithm for the integer square root based on logarithmic operations.
The rest of the paper is organized as follows. Section II describes the computation of the integer square root using Mitchell's algorithms. Section III presents the proposed single-stage method and circuit. The performance analysis is given in Section IV. Section V concludes the paper.
II. 3-STAGE MITCHELL-BASED SQUARE ROOT COMPUTATION
Based on logarithmic operations, the square root of a binary number $N$ can be expressed as
$$\sqrt{N} = N^{\frac{1}{2}} = 2^{\frac{1}{2}\log_2(N)}.$$
The conventional computation involves operations in three stages: (1) logarithmic conversion, which computes $\log_2(N)$; (2) computation of the square root in the logarithmic domain, $X = \frac{1}{2}\log_2(N)$; and (3) antilogarithmic conversion of $X$ to the linear domain, i.e. computing $2^X$. Based on Mitchell's methods, the three stages are described below.
A. Stage 1: Logarithmic Conversion
In [5], Mitchell proposed a straight-line approximation of the logarithm of an $n$-bit unsigned binary number $N = \sum_{i=h}^{k} 2^{i} z_i$ in the interval $0 \le N < 2^{k+1}$, where $h \le k$ are integers and $z_i \in \{0, 1\}$. Mathematically, $N$ can be written as
$$N = 2^{k}(1 + m)$$
where $k$ is the leading-one position and $m = \sum_{i=h}^{k-1} 2^{i-k} z_i$ is in the range $[0, 1)$. Applying the base-2 logarithm to $N$,
$$\log_2(N) = k + \log_2(1 + m).$$
The value of $k$ can be obtained easily using a leading-one detector (LOD) and a small look-up table (LUT). $\log_2(N)$ can be found once $\log_2(1 + m)$ is computed. Mitchell approximated $\log_2(1 + m)$ with a straight line $LA(1 + m) = a \times m + b$, where $a = 1$ and $b = 0$. To improve the accuracy, numerous Mitchell-based logarithmic conversion algorithms [5-13] have been proposed. Generally, these algorithms divide the range of $m$, $[0, 1)$, into a number of regions and then apply a piecewise straight-line approximation $LA_{corr,i}$ (a.k.a. error correction) to the regions, as shown below.
$$\log_2(1 + m) \approx LA_{corr,i}(1 + m) = a_i \times m + b_i \tag{1}$$
where $a_i$ and $b_i$ are, respectively, the gradient and the constant in the $i$th region. The Mitchell-based approximation is expressed as
$$\begin{aligned}
\log_2(N) &= k + LA_{corr,i}(1 + m) + \varepsilon_{LA}(m) \\
&= k + a_i \times m + b_i + \varepsilon_{LA}(m)
\end{aligned} \tag{2}$$
where $\varepsilon_{LA}(m) = \log_2(1 + m) - (a_i \times m + b_i)$ is the approximation error.
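As an illustration of Mitchell's original straight-line case ($a = 1$, $b = 0$), the conversion can be sketched in Python as follows. This is a minimal software model, not the proposed circuit: the function names are hypothetical, and `int.bit_length()` is assumed here as a stand-in for the LOD and LUT that produce $k$.

```python
import math


def mitchell_log2(n: int, a: float = 1.0, b: float = 0.0) -> float:
    """Approximate log2(n) as k + (a*m + b) for a positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1            # leading-one position (software stand-in for the LOD)
    m = (n - (1 << k)) / (1 << k)     # fraction in [0, 1), so that n = 2**k * (1 + m)
    return k + a * m + b              # straight-line approximation of log2(1 + m)


def max_log_error(bits: int = 16) -> float:
    """Largest value of log2(n) minus its approximation over all `bits`-bit inputs."""
    return max(math.log2(n) - mitchell_log2(n) for n in range(1, 1 << bits))
```

For example, `mitchell_log2(12)` returns 3.5 because $12 = 2^3(1 + 0.5)$, while the exact logarithm is about 3.585. With $a = 1$ and $b = 0$ the error $\varepsilon_{LA}(m)$ is never negative, which is why the piecewise corrections in (1) are introduced.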
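B. Stage 2: Logarithmic Square Root
The computation of the square root in the logarithmic domain is merely a divide-by-2 (1-bit shift) operation. Dividing (2) by 2,
$$\begin{aligned}
X = \tfrac{1}{2}\log_2(N) &= \tfrac{1}{2}\left[k + a_i \times m + b_i + \varepsilon_{LA}(m)\right] \\
&= \tfrac{1}{2}k_m + \tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m) \\
&= X_{int} + X_{fract}
\end{aligned} \tag{3}$$
where $k_l$ is the least significant bit (LSB) of $k$, $k_m = k - k_l$,
$$X_{int} = \tfrac{1}{2}k_m \tag{4}$$
and
$$X_{fract} = \tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m) \tag{5}$$
are the integer and fractional parts of $X$, respectively.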
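The split in (3) to (5) can be sketched in a few lines of Python. The function name `split_half_log` and its argument `log_frac` (standing for the approximated value of $\log_2(1 + m)$, assumed to lie in $[0, 1)$) are hypothetical names introduced only for this illustration.

```python
def split_half_log(k: int, log_frac: float):
    """Split X = (k + log_frac) / 2 into its integer and fractional parts.

    k        : leading-one position of the input
    log_frac : approximation of log2(1 + m), assumed to lie in [0, 1)
    """
    k_l = k & 1                          # LSB of k
    k_m = k - k_l                        # remaining, even-valued part of k
    x_int = k_m // 2                     # equation (4): exact, no rounding needed
    x_fract = 0.5 * k_l + 0.5 * log_frac # equation (5): stays in [0, 1)
    return k_l, x_int, x_fract
```

Because $k_m$ is even, $X_{int}$ is always an integer, and $X_{fract}$ stays below 1 whenever `log_frac` does, so no carry from the fractional part into the integer part has to be handled.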
C. Stage 3: Antilogarithmic Conversion
Applying the Mitchell straight-line approximation of the antilogarithm to (3), the antilogarithm of $X$ is
$$2^{X} = 2^{X_{int}} \cdot 2^{X_{fract}}.$$
$2^{X_{int}}$ can be realized easily using a shifter. Thus, $2^{X}$ can be obtained once $2^{X_{fract}}$ is computed. In [5], Mitchell approximated $2^{X_{fract}}$ with a straight line $AL(X_{fract}) = 1 + c \times X_{fract} + d$, where $c = 1$ and $d = 0$. Many Mitchell-based antilogarithmic algorithms, e.g. [7], [14], have been proposed to improve the accuracy by further dividing the range of $X_{fract}$, $[0, 1)$, into two or more regions and applying a piecewise straight-line approximation to the regions, as given below.
$$2^{X_{fract}} \approx AL_{corr,j}(X_{fract}) = 1 + c_j \times X_{fract} + d_j \tag{6}$$
where $c_j$ and $d_j$ are the gradient and the constant in the $j$th region. Therefore, the Mitchell-based piecewise straight-line approximation of the antilogarithm is expressed as
$$\begin{aligned}
2^{X} &= 2^{X_{int}} \cdot \left[AL_{corr,j}(X_{fract}) + \varepsilon_{AL}(X_{fract})\right] \\
&= 2^{X_{int}} \cdot \left[1 + c_j \times X_{fract} + d_j + \varepsilon_{AL}(X_{fract})\right]
\end{aligned} \tag{7}$$
where $\varepsilon_{AL}(X_{fract}) = 2^{X_{fract}} - (1 + c_j \times X_{fract} + d_j)$ is the approximation error.
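To make the three stages concrete, the following Python sketch chains them using Mitchell's uncorrected straight lines ($a = 1$, $b = 0$, $c = 1$, $d = 0$). It is a floating-point software model under those assumptions, with hypothetical function names; it does not reproduce any of the corrected converters in [6]-[14].

```python
import math


def sqrt_three_stage(n: int) -> float:
    """Three-stage Mitchell square root with a = 1, b = 0, c = 1, d = 0."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Stage 1: logarithmic conversion, log2(n) ~ k + m
    k = n.bit_length() - 1
    m = (n - (1 << k)) / (1 << k)
    # Stage 2: halve the logarithm and split it as in (3)-(5)
    k_l = k & 1
    x_int = (k - k_l) // 2
    x_fract = 0.5 * k_l + 0.5 * m
    # Stage 3: antilogarithm, 2**x_fract ~ 1 + x_fract, then shift by x_int
    return (1 << x_int) * (1.0 + x_fract)


def max_relative_error(bits: int = 16) -> float:
    """Largest relative error against math.sqrt over all `bits`-bit inputs."""
    return max(abs(sqrt_three_stage(n) / math.sqrt(n) - 1.0)
               for n in range(1, 1 << bits))
```

Without any error correction the result is exact for powers of 4 but overestimates $\sqrt{2}$ as 1.5, a relative error of roughly 6%, which motivates the piecewise corrections in (1) and (6).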
III. PROPOSED SINGLE-STAGE METHOD
As discussed in Section II, the square root of an integer ($h = 0$ in Section II.A) can be computed directly using the Mitchell-based logarithmic and antilogarithmic conversions in 3 stages. In the 3-stage implementation, the antilogarithmic conversion can be performed only after the logarithmic conversion is completed. The total computation time is the sum of the critical path delays of the two conversions. To reduce the total computation time, we propose to merge the error correction circuit of the logarithmic conversion into that of the antilogarithmic conversion and to capitalize on the fact that the logarithmic square root operation is only a simple 1-bit right shift, as described below.
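The three stages are merged into a single stage. Substituting $X_{int}$ from (4) and $X_{fract}$ from (5), (7) becomes
$$\begin{aligned}
2^{X} &= 2^{\frac{1}{2}k_m}\left\{1 + c_j \times \left[\tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m)\right] + d_j + \varepsilon_{AL}(X_{fract})\right\} \\
&= 2^{\frac{1}{2}k_m}\left\{1 + \left(\tfrac{1}{2}a_i \times c_j\right) \times m + \left(\tfrac{1}{2}k_l \times c_j + \tfrac{1}{2}b_i \times c_j + d_j\right) + \left[\tfrac{1}{2}\varepsilon_{LA}(m) \times c_j + \varepsilon_{AL}(X_{fract})\right]\right\} \\
&= 2^{\frac{1}{2}k_m}\left(1 + grad_{i,j} \times m + const_{i,j} + \varepsilon\right)
\end{aligned} \tag{8}$$
where
$$grad_{i,j} = \tfrac{1}{2}a_i \times c_j, \tag{9}$$
$$const_{i,j} = c_j \times \tfrac{1}{2}k_l + \tfrac{1}{2}b_i \times c_j + d_j \tag{10}$$
and $\varepsilon = \tfrac{1}{2}\varepsilon_{LA}(m) \times c_j + \varepsilon_{AL}(X_{fract})$ is the approximation error.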
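The algebra behind (9) and (10) can be checked numerically. The Python sketch below compares the two-step evaluation (equation (5) followed by equation (6), with the error terms dropped) against the merged form in (8). The function names and the coefficient values in the usage note are hypothetical and chosen only for illustration; they are not the coefficients of the proposed design.

```python
def merge_coefficients(a_i, b_i, c_j, d_j, k_l):
    """Combine the log and antilog line coefficients as in (9) and (10)."""
    grad = 0.5 * a_i * c_j                              # equation (9)
    const = c_j * 0.5 * k_l + 0.5 * b_i * c_j + d_j     # equation (10)
    return grad, const


def two_step(m, a_i, b_i, c_j, d_j, k_l):
    """Evaluate (5) and then (6), ignoring the approximation error terms."""
    x_fract = 0.5 * k_l + 0.5 * (a_i * m + b_i)
    return 1.0 + c_j * x_fract + d_j


def one_step(m, a_i, b_i, c_j, d_j, k_l):
    """Evaluate the merged line 1 + grad*m + const from (8)."""
    grad, const = merge_coefficients(a_i, b_i, c_j, d_j, k_l)
    return 1.0 + grad * m + const
```

For instance, with the illustrative values $a_i = 0.875$, $b_i = 0.0625$, $c_j = 0.75$, $d_j = 0.03125$, $k_l = 1$ and $m = 0.3125$, both `two_step` and `one_step` give 1.5322265625, while the merged form needs only one multiplication by $m$ and one constant addition.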
It can be seen from (9) and (10) that the gradients, $a_i$ and $c_j$, and the constants, $b_i$ and $d_j$, from the two conversion stages are now combined in $grad_{i,j}$ and $const_{i,j}$, respectively. As such, only a single error correction circuit is needed, which offers the following two advantages.
1) Two adders for the constants (one in (1) and the other in (6)) are reduced to only one in (8), as $const_{i,j}$ in (10) can be pre-computed.
2) As shown in [6] and [9], the value of the gradient $a_i$ in (1) must contain as few power-of-two terms as possible because $t$ terms require $(t - 1)$ adders. This constraint restricts the choice of the correction function and hence reduces the accuracy in approximating $\log_2(1 + m)$. The same constraint applies to $c_j$ in (6). With the merging of the error correction functions as in (8), only one gradient, $grad_{i,j}$, is subject to the constraint. This advantage translates into fewer adders and higher accuracy of approximation, as shown in Section IV.
To determine the values of $grad_{i,j}$ and $const_{i,j}$, we investigated the relationships between the logarithmic and antilogarithmic conversions and the corresponding approximation errors. Graphically, (8) can be visualized as shown in Fig. 1, where Fig. 1(b) depicts $X_{fract} = 0.5\log_2(1 + m)$ of (5) when $k_l = 0$, and Fig. 1(c) depicts $X_{fract} = 0.5 + 0.5\log_2(1 + m)$ of (5) when $k_l = 1$. Fig. 1(a) is the plot of $2^{X_{fract}}$, rotated by 90° anticlockwise to show the relationships between the logarithmic and the antilogarithmic conversions. We analyzed the errors when the $\log_2(1 + m)$ curve is partitioned into several regions and the curve segment in the $i$th region is approximated by one straight line, $a_i \times m + b_i$, that connects the end points of the curve segment. In other words, a single piecewise linear approximation of $\log_2(1 + m)$ is used to determine $X_{fract}$ in both Fig. 1(b) and (c). We found that the error is a positive "∩"-shaped parabola above the $m$-axis. For example, when the range of $m$ in Fig. 1(b) (and (c)) is divided at $P = (2^{-2} + 2^{-3} + 2^{-4}) = 0.4375$ into two regions ($i = \{1, 2\}$), $[0, P)$ and $[P, 1.0)$, the corresponding error curves are the parabolas in Fig. 2(a). Similar error characteristics are obtained when $m$ is divided into more regions.
For the antilogarithmic conversion, similar error characteristics are also observed if the $2^{X_{fract}}$ curve in the $j$th region is approximated by a straight line, $1 + c_j \times X_{fract} + d_j$, that connects both end points of the curve within the region, as in Fig. 1(a). The parabolic error curves in the antilogarithmic conversion are "∪"-shaped and always negative, as depicted in Fig. 2(b). Note that in each region, the peak value of the error is located almost at the midpoint of the region for both the logarithmic and antilogarithmic conversions. These "∩"- and "∪"-shaped error characteristics suggest that the partition of regions in the logarithmic conversion has to be related to that in the antilogarithmic conversion so that a significant portion of the errors cancel out. From the analysis in Fig. 1, we propose that the number of regions for the antilogarithmic conversion be twice that of the logarithmic conversion, and that the boundary values of the regions in the antilogarithmic conversion be dictated by the boundary values of the regions in the logarithmic conversion, as shown in Fig. 1. For example, the region $j = 1$ in Fig. 1(a) is from 0 to $0.5\log_2(1 + P)$.
With more regions in the antilogarithmic conversion, the maximum of $\varepsilon_{AL}$ is always smaller than the maximum of $\varepsilon_{LA}$, which contributes a large portion of the approximation error. To minimize the maximum of $\varepsilon_{LA}$, the size of each region for the logarithmic conversion is adjusted such that the peak error values of all the regions are close to each other, i.e. the peaks are at similar levels.
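The sign and the peak location of these error curves can be reproduced with a short numerical experiment. The Python sketch below is an assumption-laden illustration: the helper `peak_chord_error` is a hypothetical name, and it simply samples the difference between a curve and the straight line joining its end points on a region.

```python
import math

P = 0.4375  # region boundary for m given in the text


def peak_chord_error(f, lo, hi, steps=2000):
    """Return (error, location) of the largest |f(x) - chord(x)| on [lo, hi].

    The chord is the straight line through (lo, f(lo)) and (hi, f(hi)).
    """
    slope = (f(hi) - f(lo)) / (hi - lo)
    best_err, best_x = 0.0, lo
    for t in range(steps + 1):
        x = lo + (hi - lo) * t / steps
        err = f(x) - (f(lo) + slope * (x - lo))
        if abs(err) > abs(best_err):
            best_err, best_x = err, x
    return best_err, best_x


def log_curve(m):
    return math.log2(1.0 + m)


def antilog_curve(x):
    return 2.0 ** x


# Logarithmic region i = 1, and antilogarithmic region j = 1 derived from it.
log_peak = peak_chord_error(log_curve, 0.0, P)
antilog_peak = peak_chord_error(antilog_curve, 0.0, 0.5 * math.log2(1.0 + P))
```

Under these assumptions the logarithmic error on $[0, P)$ peaks at roughly $+0.024$ near the middle of the region, while the antilogarithmic error on $[0, 0.5\log_2(1 + P))$ peaks at roughly $-0.0045$. The opposite signs come from $\log_2(1 + m)$ being concave and $2^{X_{fract}}$ being convex.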
Based on the above analysis, we divided the range of $m$ into two regions as in Fig. 1(b) and (c), and the range of $X_{fract}$ into four regions as in Fig. 1(a). The curves $\log_2(1 + m)$ and $2^{X_{fract}}$ are then approximated by straight lines, each connecting the two end points of the curve segment in a region. The straight line in each region is then shifted up (down) by half of the maximum of $0.5\varepsilon_{LA}$ (or $\varepsilon_{AL}$) in that region. The infinite-precision values of $a_i$, $b_i$, $c_j$ and $d_j$ of the straight lines are used to determine the ideal values for $grad_{i,j}$ and $const_{i,j}$. Keeping only three power-of-two terms for $grad_{i,j}$ and truncating $const_{i,j}$ to 16 MSBs (most significant bits), the proposed expression for (8) to compute the square root in one stage is
$$2^{X} = 2^{\frac{1}{2}k_m} AL_{i,j}(m), \tag{11}$$
where
$$\begin{aligned}
AL_{1,1}(m) &= 2^{-1}m + (2^{-5} + 2^{-6})\,\bar{m}_{9MSBs} + (2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-9} + 2^{-15} + 2^{-16}), && \text{for } k_l = 0 \text{ and } 0 \le m < 0.4375 \\
AL_{2,2}(m) &= 2^{-2}m + (2^{-3} + 2^{-7})\,m_{9MSBs} + (2^{-5} + 2^{-9} + 2^{-11}), && \text{for } k_l = 0 \text{ and } 0.4375 \le m < 1.0 \\
AL_{1,3}(m) &= 2^{-1}m + (2^{-3} + 2^{-6})\,m_{9MSBs} + (2^{-2} + 2^{-3} + 2^{-5} + 2^{-7} + 2^{-9}), && \text{for } k_l = 1 \text{ and } 0 \le m < 0.4375 \\
AL_{2,4}(m) &= 2^{-1}m + (2^{-5} + 2^{-7})\,m_{9MSBs} + (2^{-2} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-7} + 2^{-10}), && \text{for } k_l = 1 \text{ and } 0.4375 \le m < 1.0
\end{aligned}$$
where $m_{9MSBs}$ is the 9 MSBs of $m$ and $\bar{m}_{9MSBs} = 1 - m_{9MSBs} - 2^{-9}$ is the 1's complement of $m_{9MSBs}$.
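A floating-point sketch of (11) in Python is given below as a rough behavioural model; it is not the fixed-point datapath of the proposed circuit. Three assumptions are made and should be read as such: the leading 1 added to $AL_{i,j}$ models the '1' bit that Section IV.B says is appended to the left of the error correction output; the constant of $AL_{1,1}$ (about 0.955) is interpreted as the 16-bit wrap-around encoding of about $-0.045$, because the $\bar{m}_{9MSBs}$ term already contributes about $+0.047$; and no truncation other than taking the 9 MSBs of $m$ is modelled. All function names are hypothetical.

```python
import math

P = 0.4375  # boundary between the two regions of m


def _pow2_sum(*exponents):
    """Sum of 2**-e over the given exponents, e.g. _pow2_sum(5, 6) = 2**-5 + 2**-6."""
    return sum(2.0 ** -e for e in exponents)


def sqrt_single_stage(n: int) -> float:
    """Behavioural float model of equation (11) for a positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1                  # leading-one position
    k_l = k & 1                             # LSB of k
    k_m = k - k_l                           # even-valued remainder of k
    m = (n - (1 << k)) / (1 << k)           # fraction in [0, 1)
    m9 = math.floor(m * 512) / 512          # 9 MSBs of m
    m9_bar = 1.0 - m9 - 2.0 ** -9           # 1's complement of the 9 MSBs
    if k_l == 0 and m < P:
        # Assumption: the listed constant wraps modulo 1, hence the "- 1.0".
        al = 0.5 * m + _pow2_sum(5, 6) * m9_bar + (_pow2_sum(1, 2, 3, 4, 6, 9, 15, 16) - 1.0)
    elif k_l == 0:
        al = 0.25 * m + _pow2_sum(3, 7) * m9 + _pow2_sum(5, 9, 11)
    elif m < P:
        al = 0.5 * m + _pow2_sum(3, 6) * m9 + _pow2_sum(2, 3, 5, 7, 9)
    else:
        al = 0.5 * m + _pow2_sum(5, 7) * m9 + _pow2_sum(2, 3, 4, 6, 7, 10)
    # Assumption: the appended '1' bit contributes the leading 1.0.
    return 2.0 ** (k_m // 2) * (1.0 + al)


def max_relative_error(bits: int = 16) -> float:
    """Largest relative error against math.sqrt over all `bits`-bit inputs."""
    return max(abs(sqrt_single_stage(n) / math.sqrt(n) - 1.0)
               for n in range(1, 1 << bits))
```

Under these assumptions the largest relative error over all 16-bit inputs stays below 0.35%, which is in line with the 0.34% figure reported in Section IV.A, although the exact value depends on fixed-point details that this model does not capture.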
IV. PERFORMANCE ANALYSIS
A. Accuracy
We have simulated the 3-stage method and the proposed single-stage method in Matlab. For the 3-stage method, we simulated all the combinations of the Mitchell-based logarithmic algorithms [6-9], [11-12] with two antilogarithmic algorithms [7], [14]. 64-bit double-precision arithmetic is used to emulate the real values of $\sqrt{N}$. For integer input with wordlength greater than 16 bits, the combination of the logarithmic conversion in [6] and the antilogarithmic conversion in [14] gives the lowest maximum percentage error of 0.45%. The proposed single-stage method obtains a maximum percentage error of 0.34% (equivalent to 8-bit output precision), an improvement of 24.44%. Further improvement is possible by increasing the number of regions.
B. Hardware Complexity
The proposed method is implemented in an architecture similar to Fig. 2 of [9] and Fig. 3(a) of [14]. For an $n$-bit input, an $n$-bit LOD, an $n$-word × $\lceil\log_2 n\rceil$-bit LUT and an $(n - 1)$-bit logarithmic shifter are used to obtain $k$ and $m$ of (11). Subsequently, the 15 MSBs of $m$ ($(n - 1)$ bits) are input to the error correction circuit implementing $AL_{i,j}$ as shown in Fig. 3, where a 2-stage carry save adder (CSA) tree in the dotted polygon, with each circle representing a 1-bit operand, is used to accumulate the operands before they are summed by a 16-bit carry propagation adder. A '1' bit is appended to the left of the output of the error correction circuit before it is input to an $n$-bit logarithmic shifter, which implements $2^{\frac{1}{2}k_m}$ in (11).

Figure 1. Relationships between logarithmic and antilogarithmic conversions: (a) $2^{X_{fract}}$ (rotated 90° anticlockwise), with regions $j = 1$ to $4$ and the straight lines $1 + c_j \times X_{fract} + d_j$; (b) $0.5 + 0.5\log_2(1 + m)$ and its approximation $0.5 + 0.5(a_i m + b_i)$; (c) $0.5\log_2(1 + m)$ and its approximation $0.5(a_i m + b_i)$. Regions $i = 1$ and $i = 2$ of $m$ are separated at $P$.

Figure 2. Error curves: (a) $0.5\varepsilon_{LA}$ versus $m$ for regions $i = 1, 2$ and (b) $\varepsilon_{AL}$ versus $X_{fract}$ for regions $j = 1$ to $4$ (vertical axes in units of $10^{-3}$).
Table I gives the area and speed comparisons between the proposed circuit in Fig. 3 and the architecture proposed in [4] for realizing the modified Dijkstra algorithm [3]. The basic building block of the architecture in [4] consists mainly of two $n$-bit adders, a comparator and two shifters. We use the unit-gate model in [15], in which a 2-input monotonic gate, such as a NAND gate, has one unit of area and one unit of delay, and a monotonic gate is conservatively assumed to consist of six transistors (T) in a classical CMOS process. A $2^{n_I} \times n_O$ ROM is estimated to have an area of $\left(2^{\lceil n_I/2 \rceil}(\lceil n_I/2 \rceil + 1) + 2^{n_I} n_O + 2^{\lfloor n_I/2 \rfloor} n_O(\lfloor n_I/2 \rfloor + 2) + n_O(2^{\lfloor n_I/2 \rfloor} + 2)\right)$T and a delay of $(1 + \lceil \log_2 n_I \rceil + \lceil n_I/2 \rceil)$ units [16]. Table I shows the estimated area and delay of the basic building block of the architecture in Fig. 15 of [4], the 3-stage design and the proposed single-stage design. The proposed single-stage design is smaller by 1.6% and faster by 15.0% even when it is compared with only the basic building block in [4]. For 16-bit integer input, [4] requires either at least 8×4320T of area (pipelined) or 8×140 units of latency (looped) to achieve 8-bit output precision. In other words, the proposed single-stage method is about 9 times faster than the looped architecture or about 8 times smaller than the pipelined architecture in [4]. Note that iterative methods such as [4] are well known for being area efficient at the expense of large latency. When compared to the 3-stage design, the proposed single-stage design reduces the area cost by 30.1% and the computation delay by 40.8%, as shown in Table I.
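The percentages and ratios quoted above follow directly from the Table I entries, as the short Python check below shows. The dictionary keys and variable names are hypothetical labels introduced for this sketch, and the figures for [4] are lower bounds in the table, so the resulting ratios are lower bounds as well.

```python
ITERATIONS = 8  # factor applied to the basic block of [4] for 16-bit input, as in the text

# Table I entries: area in transistors (T), delay in unit-gate delays.
area_t = {"basic_block_ref4": 4320, "three_stage": 6086, "proposed": 4252}
delay_units = {"basic_block_ref4": 140, "three_stage": 201, "proposed": 119}


def reduction(old: float, new: float) -> float:
    """Fractional saving of `new` relative to `old`."""
    return (old - new) / old


pipelined_area = ITERATIONS * area_t["basic_block_ref4"]      # 34560 T
looped_delay = ITERATIONS * delay_units["basic_block_ref4"]   # 1120 units

area_ratio = pipelined_area / area_t["proposed"]       # about 8.1 times smaller
speed_ratio = looped_delay / delay_units["proposed"]   # about 9.4 times faster

area_saving_vs_3stage = reduction(area_t["three_stage"], area_t["proposed"])          # about 30.1%
delay_saving_vs_3stage = reduction(delay_units["three_stage"], delay_units["proposed"])  # about 40.8%
```

The same `reduction` helper applied to the basic building block alone gives about 1.6% in area and 15.0% in delay.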
V. CONCLUSION
A novel single-stage circuit based on Mitchell's logarithmic algorithms for computing the integer square root was developed. The circuit is compact in size and fast in computation speed. Based on the proposed approach, circuits for computing other arithmetic functions such as multiplication, division, exponentiation, etc., could be developed in the future.
REFERENCES
[1] A. G. M. Strollo, D. De Caro, and N. Petra, "Elementary functions hardware implementation using constrained piecewise-polynomial approximations," IEEE Trans. Comp., vol. 60, pp. 418-432, 2011.
[2] F. de Dinechin and A. Tisserand, "Multipartite table methods," IEEE Trans. Comp., vol. 54, pp. 319-330, 2005.
[3] M. T. Tommiska, "Area-efficient implementation of a fast square root algorithm," in Proc. 3rd IEEE Int. Caracas Conf. Devices, Circuits Syst., pp. S18/1-S18/4, 2000.
[4] V. B. Alluri, J. R. Heath, and M. Lhamon, "A new multichannel, coherent amplitude modulated, time-division multiplexed, software-defined radio receiver architecture, and field-programmable-gate-array technology implementation," IEEE Trans. Signal Processing, vol. 58, pp. 5369-5384, 2010.
[5] J. N. Mitchell, "Computer multiplication and division using binary logarithm," IRE Trans. Comp., vol. EC-11, pp. 512-517, 1962.
[6] M. Combet, H. V. Zonneveld, and L. Verbeek, "Computation of the base two logarithm of binary numbers," IEEE Trans. Electronic Comp., vol. EC-14, pp. 863-867, 1965.
[7] E. L. Hall, D. D. Lynch, and S. J. Dwyer, "Generation of products and quotients using approximate binary logarithms for digital filtering applications," IEEE Trans. Computers, vol. C-19, no. 2, pp. 97-105, 1970.
[8] S. L. SanGregory, C. Brothers, D. Gallagher, and R. Siferd, "A fast, low-power logarithm approximation with CMOS VLSI implementation," in Proc. 42nd Midwest Symp. Circuits and Systems, vol. 1, pp. 388-391, 1999.
[9] K. H. Abed and R. E. Siferd, "CMOS VLSI implementation of a low-power logarithmic converter," IEEE Trans. Comp., vol. 52, pp. 1421-1433, 2003.
[10] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, "A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system," IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373-2381, 2006.
[11] Z. Li, J. An, M. Yang, and J. Yang, "FPGA design and implementation of an improved 32-bit binary logarithm converter," in Proc. 4th Int. Conf. Wireless Communications, Networking and Mobile Computing (WiCOM '08), pp. 1-4, 2008.
[12] T.-B. Juang and S.-H. Chen, "A lower error and ROM-free logarithmic converter for digital signal processing applications," IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 56, no. 12, pp. 931-935, 2009.
[13] D. De Caro, N. Petra, and A. G. M. Strollo, "Efficient logarithmic converters for digital signal processing applications," IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 58, no. 10, pp. 667-671, 2011.
[14] K. H. Abed and R. E. Siferd, "VLSI implementation of a low-power antilogarithmic converter," IEEE Trans. Comp., vol. 52, pp. 1221-1228, 2003.
[15] H. T. Vergos, C. Efstathiou, and D. Nikolos, "Diminished-one modulo $2^n + 1$ adder design," IEEE Trans. Comp., vol. 51, pp. 1389-1399, 2002.
[16] Z. D. Ulman and M. Czyzak, "Highly parallel, fast scaling of numbers in nonredundant residue arithmetic," IEEE Trans. Signal Processing, vol. 46, pp. 487-496, 1998.
Figure 3. Error correction circuit implementing (11). Adder tree legend: $a_i$ = a bit in the word $A$, where $i = \{0, 1, 2, \ldots\}$.
TABLE I. AREA AND SPEED COMPARISONS ($n$ = 16)

Design                            Area (T)               Delay (unit)
Design in Fig. 15 of [4]:
  Basic building block            > 4320                 > 140
  Pipelined architecture          > 8 × 4320 = 34560     > 140
  Looped architecture             > 4320                 > 8 × 140 = 1120
3-stage design: [6] + [14]        6086                   201
Proposed single-stage design      4252                   119