2012 IEEE International Symposium on Circuits and Systems (ISCAS 2012), Seoul, Korea (South), 20-23 May 2012

A Fast and Compact Circuit for Integer Square Root Computation Based on Mitchell Logarithmic Method

Joshua Yung Lih Low, Ching Chuen Jong, Jeremy Yung Shern Low, Thian Fatt Tay, and Chip-Hong Chang

School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798

{josh0013}@e.ntu.edu.sg

Abstract— A novel non-iterative circuit for computing integer square root based on logarithm is proposed in the paper. Mitchell’s methods are used for the logarithmic and antilogarithmic conversions. The proposed method merges two conversion stages into a single one to achieve better accuracy with a compact architecture. Hence, the circuit size and latency are reduced. Compared to an existing design based on the modified Dijkstra algorithm used in a coherent receiver, the proposed design is either 8 times smaller or 9 times faster for 16-bit integer input.

I. INTRODUCTION

Various algorithms have been proposed to evaluate the square root in floating point. They can be broadly classified into non-iterative and iterative categories. The non-iterative category includes the piecewise table-lookup [1] and the tables-and-additions [2] algorithms. On the other hand, the Newton-Raphson and the Goldschmidt algorithms are two examples in the iterative category. Integer numbers are often used in application-specific processors for simpler implementation of arithmetic operations [3]. One instance is the integer square root operation in the coherent receiver [4], which uses the modified Dijkstra algorithm [3]. The modified Dijkstra algorithm is iterative in nature and can be implemented in a looped or a pipelined architecture. The looped architecture suffers from low throughput while the pipelined architecture has a high area cost.

Logarithms can be used to simplify the computation of arithmetic functions such as multiplication and division. Conventionally, the computation has three stages: logarithmic conversion, operation in the logarithmic domain and antilogarithmic conversion. In this work, we studied the computation of integer square root using Mitchell’s logarithmic algorithms [5-14]. We investigated the 3-stage method and analyzed the errors in the logarithmic and antilogarithmic conversions. Based on the error characteristics, we found that if the 3 stages are merged into a single stage, not only can the errors be reduced but the computation speed can be increased as well. Hence, we propose a novel non-iterative single-stage circuit for integer square root computation based on the Mitchell logarithm. The single-stage circuit requires only one error correction function. As a result, higher computation speed, smaller circuit size and more accurate results are achieved. To the best of our knowledge, this work is the first to investigate a non-iterative algorithm for integer square root based on logarithmic operations.

The rest of the paper is organized as follows. Section II describes the computation of integer square root using Mitchell’s algorithms. Section III presents the proposed single-stage method and circuit. The performance analysis is given in Section IV. Section V concludes the paper.

II. 3-STAGE MITCHELL-BASED SQUARE ROOT COMPUTATION

Based on logarithmic operations, the square root of a binary number $N$ can be expressed as

$$\sqrt{N} = N^{1/2} = 2^{\frac{1}{2}\log_2(N)}.$$

The conventional computation involves operations in three stages: (1) logarithmic conversion, $N_l = \log_2(N)$; (2) computation of the square root in the logarithmic domain, $X = \frac{1}{2}N_l$; and (3) antilogarithmic conversion of $X$ to the linear domain, i.e. computing $2^X$. Based on Mitchell’s methods, the three stages are described below.

A. Stage 1: Logarithmic Conversion

In [5], Mitchell proposed a straight line approximation of the logarithm of an $n$-bit unsigned binary number $N = \sum_{i=h}^{k} 2^i z_i$ in the interval $0 \le N < 2^{k+1}$, where $h \le k$ are integers and $z_i \in \{0, 1\}$. Mathematically, $N$ can be written as

$$N = 2^k(1 + m)$$

where $k$ is the leading one position and $m = \sum_{i=h}^{k-1} 2^{i-k} z_i$ is in the range $[0, 1)$. Applying the base-2 logarithm to $N$,

$$N_l = \log_2(N) = k + \log_2(1 + m).$$

The value of $k$ can be obtained easily using a leading-one detector (LOD) and a small look-up table (LUT). $N_l$ can be found after $\log_2(1 + m)$ is computed. Mitchell approximated $\log_2(1 + m)$ with a straight line $LA(1 + m) = a \times m + b$, where $a = 1$ and $b = 0$. To improve the accuracy, numerous Mitchell-based logarithmic conversion algorithms [5-13] have been proposed. Generally, these algorithms divide the range of $m$, $[0, 1)$, into a number of regions and then apply a piecewise straight line approximation $LA_{algo}$ (a.k.a. error correction) on the regions as shown below.

$$\log_2(1 + m) \cong LA_{algo,i}(1 + m) = a_i \times m + b_i \tag{1}$$

where $a_i$ and $b_i$ are, respectively, the gradient and constant in the $i$th region. The Mitchell-based approximation is expressed as

$$N_l = k + LA_{algo,i}(1 + m) + \varepsilon_{LA}(m) = k + a_i \times m + b_i + \varepsilon_{LA}(m) \tag{2}$$

where $\varepsilon_{LA}(m) = \log_2(1 + m) - (a_i \times m + b_i)$ is the approximation error.
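The decomposition and the straight line approximation of Stage 1 can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit: the function names are hypothetical, `int.bit_length` stands in for the LOD, and floating point replaces the fixed-point datapath. With the defaults `a = 1` and `b = 0` it reduces to Mitchell's original approximation.

```python
import math

def mitchell_decompose(n):
    """Split n >= 1 into (k, m) such that n = 2**k * (1 + m) and 0 <= m < 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1           # leading-one position (role of the LOD)
    m = (n - (1 << k)) / (1 << k)    # bits below the leading one, read as a fraction
    return k, m

def mitchell_log2(n, a=1.0, b=0.0):
    """Approximate log2(n) as k + a*m + b (one straight line over the whole range)."""
    k, m = mitchell_decompose(n)
    return k + a * m + b

def log_error(m, a=1.0, b=0.0):
    """Approximation error: log2(1 + m) - (a*m + b)."""
    return math.log2(1 + m) - (a * m + b)

if __name__ == "__main__":
    k, m = mitchell_decompose(12)    # 12 = 2**3 * (1 + 0.5)
    print(k, m, mitchell_log2(12), math.log2(12))
```

Adding `log_error(m)` back to the approximation recovers the exact logarithm, which mirrors the structure of (2).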

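B. Stage 2: Logarithmic Square Root

The computation of the square root in the logarithmic domain is merely a divide-by-2 (1-bit shift) operation. Dividing (2) by 2,

$$
\begin{aligned}
X = \tfrac{1}{2}N_l &= \tfrac{1}{2}\left[k + a_i \times m + b_i + \varepsilon_{LA}(m)\right] \\
&= \tfrac{1}{2}k_m + \tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m) \\
&= X_{int} + X_{fract}
\end{aligned} \tag{3}
$$

where $k_l$ is the least significant bit (LSB) of $k$, $k_m = k - k_l$, and

$$X_{int} = \tfrac{1}{2}k_m \tag{4}$$

and

$$X_{fract} = \tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m) \tag{5}$$

are the integer and fractional parts of $X$ respectively.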
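As a small illustration of (3) to (5), the split of the halved logarithm into its integer and fractional parts can be written as follows. The function name is hypothetical, and `frac_log` stands for whatever approximation of $\log_2(1 + m)$ Stage 1 produced (a value in $[0, 1)$); the error term is left out because it is not known to the circuit.

```python
def split_half_log(k, frac_log):
    """Halve k + frac_log and return (x_int, x_fract) as in (4) and (5)."""
    k_l = k & 1                  # LSB of the leading-one position
    k_m = k - k_l                # k with its LSB cleared, always even
    x_int = k_m >> 1             # (4): a 1-bit right shift
    x_fract = 0.5 * k_l + 0.5 * frac_log   # (5) without the error term
    return x_int, x_fract

if __name__ == "__main__":
    # k = 5 is odd, so half of its LSB moves into the fractional part.
    print(split_half_log(5, 0.5))   # halves 5.5 into 2 + 0.75
    print(split_half_log(4, 0.5))   # halves 4.5 into 2 + 0.25
```

Because $k_m$ is even, $X_{int}$ is always an integer, and $X_{fract}$ stays below 1 as long as `frac_log` does.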

C. Stage 3: Antilogarithmic Conversion

Applying the Mitchell straight line approximation of the antilogarithm to (3), the antilogarithm of $X$ is

$$2^X = 2^{X_{int}} \cdot 2^{X_{fract}}.$$

$2^{X_{int}}$ can be realized easily using a shifter. Thus, $2^X$ can be obtained once $2^{X_{fract}}$ is computed. In [5], Mitchell approximated $2^{X_{fract}}$ with a straight line $AL(X_{fract}) = 1 + e \times X_{fract} + f$, where $e = 1$ and $f = 0$. Many Mitchell-based antilogarithmic algorithms [7], [14] have been proposed to improve the accuracy by further dividing the range of $X_{fract}$, $[0, 1)$, into two or more regions and applying a piecewise straight line approximation on the regions as given below.

$$2^{X_{fract}} \cong AL_{algo,j}(X_{fract}) = 1 + e_j \times X_{fract} + f_j \tag{6}$$

where $e_j$ and $f_j$ are the gradient and constant in the $j$th region. Therefore, the Mitchell-based piecewise straight line approximation of the antilogarithm is expressed as

$$
\begin{aligned}
2^X &= 2^{X_{int}} \cdot \left[AL_{algo,j}(X_{fract}) + \varepsilon_{AL}(X_{fract})\right] \\
&= 2^{X_{int}} \cdot \left[1 + e_j \times X_{fract} + f_j + \varepsilon_{AL}(X_{fract})\right]
\end{aligned} \tag{7}
$$

where $\varepsilon_{AL}(X_{fract}) = 2^{X_{fract}} - (1 + e_j \times X_{fract} + f_j)$ is the approximation error.
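To make the three stages concrete, the sketch below chains them using only Mitchell's original straight lines ($a = 1$, $b = 0$ for the logarithm and $e = 1$, $f = 0$ for the antilogarithm), i.e. without any of the error correction schemes of [6-14]. It is a floating-point illustration with a hypothetical function name, not a model of any of the cited circuits.

```python
import math

def mitchell_sqrt_3stage(n):
    """Uncorrected 3-stage Mitchell square root of a positive integer n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Stage 1: logarithmic conversion, log2(n) ~ k + m
    k = n.bit_length() - 1
    m = (n - (1 << k)) / (1 << k)
    n_l = k + m
    # Stage 2: square root in the logarithmic domain (divide by 2)
    x = 0.5 * n_l
    x_int = math.floor(x)
    x_fract = x - x_int
    # Stage 3: antilogarithmic conversion, 2**x_fract ~ 1 + x_fract
    return (1 << x_int) * (1 + x_fract)

if __name__ == "__main__":
    worst = max(mitchell_sqrt_3stage(n) / math.sqrt(n) - 1 for n in range(1, 1 << 16))
    print(f"worst relative error over 16-bit inputs: {worst:.4%}")
```

Without correction the result never falls below the true root, and the worst case over 16-bit inputs is roughly 6%, which is why the piecewise corrections of (1) and (6) matter.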

III. PROPOSED SINGLE-STAGE METHOD

As discussed in Section II, the computation of the square root of an integer ($h = 0$ in Section II.A) can be implemented directly using the Mitchell-based logarithmic and antilogarithmic conversions in 3 stages. In the 3-stage implementation, the antilogarithmic conversion can be performed only after the logarithmic conversion is completed. The total computation time is the sum of the critical path delays in the two conversions. To reduce the total computation time, we propose to merge the error correction circuits of the logarithmic conversion into the antilogarithmic conversion and capitalize on the fact that the logarithmic square root operation is only a simple 1-bit right shift, as described below.

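Substituting $X_{int}$ from (4) and $X_{fract}$ from (5) into (7) merges the three stages into a single stage:

$$
\begin{aligned}
2^X &= 2^{\frac{1}{2}k_m}\left\{1 + e_j \times \left[\tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m)\right] + f_j + \varepsilon_{AL}(X_{fract})\right\} \\
&= 2^{\frac{1}{2}k_m}\left\{1 + \left(\tfrac{1}{2}a_i \times e_j\right) \times m + \left(\tfrac{1}{2}k_l \times e_j + \tfrac{1}{2}b_i \times e_j + f_j\right) + \left[\tfrac{1}{2}\varepsilon_{LA}(m) \times e_j + \varepsilon_{AL}(X_{fract})\right]\right\} \\
&= 2^{\frac{1}{2}k_m}\left(1 + grad_{i,j} \times m + const_{i,j} + \varepsilon\right)
\end{aligned} \tag{8}
$$

where

$$grad_{i,j} = \tfrac{1}{2}a_i \times e_j, \tag{9}$$

$$const_{i,j} = e_j \times \tfrac{1}{2}k_l + \tfrac{1}{2}b_i \times e_j + f_j \tag{10}$$

and $\varepsilon = \tfrac{1}{2}\varepsilon_{LA}(m) \times e_j + \varepsilon_{AL}(X_{fract})$ is the approximation error.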
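The algebra behind (9) and (10) can be checked numerically with a short sketch. The function names and the coefficient values used in the demonstration are hypothetical; they are chosen only to show that evaluating the two corrections one after the other gives the same value as evaluating the merged line once.

```python
def merge_coefficients(a_i, b_i, e_j, f_j, k_l):
    """Combine both correction lines into one gradient and one constant, per (9) and (10)."""
    grad = 0.5 * a_i * e_j
    const = 0.5 * k_l * e_j + 0.5 * b_i * e_j + f_j
    return grad, const

def two_step(m, a_i, b_i, e_j, f_j, k_l):
    """Evaluate the log line, halve it, then evaluate the antilog line."""
    x_fract = 0.5 * k_l + 0.5 * (a_i * m + b_i)
    return 1 + e_j * x_fract + f_j

def one_step(m, a_i, b_i, e_j, f_j, k_l):
    """Evaluate the merged line 1 + grad*m + const of (8), error term omitted."""
    grad, const = merge_coefficients(a_i, b_i, e_j, f_j, k_l)
    return 1 + grad * m + const

if __name__ == "__main__":
    # Illustrative coefficients only; they are not taken from the paper.
    args = dict(a_i=1.25, b_i=0.0625, e_j=0.75, f_j=0.125, k_l=1)
    print(two_step(0.3, **args), one_step(0.3, **args))
```

Since `const` depends only on the region indices and on $k_l$, it can be tabulated ahead of time, which is the basis of the first advantage listed below.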

It can be seen from (9) and (10) that the gradients, 𝑎𝑖 and 𝑒𝑗 , and the constants, 𝑏𝑖 and 𝑓𝑗 , from the two conversion stages are now combined in 𝑔𝑟𝑎𝑑𝑖,𝑗 and 𝑐𝑜𝑛𝑠𝑡𝑖,𝑗 respectively. As such, only a single error correction circuit is needed, which offers the following two advantages.

1) Two adders for the constants (one in (1) and the other in (6)) are reduced to only one in (8), as 𝑐𝑜𝑛𝑠𝑡𝑖,𝑗 (10) can be pre-computed.

2) As shown in [6] and [9], the values for the gradient $a_i$ in (1) must contain as few power-of-two terms as possible because $t$ terms require $(t - 1)$ adders. This constraint restricts the choice of the correction function and hence reduces the accuracy in approximating $\log_2(1 + m)$. The same constraint applies to $e_j$ in (6). With the merging of the error correction functions as in (8), only one gradient, $grad_{i,j}$, is subject to the constraint. This advantage translates to fewer adders and higher accuracy of approximation, as shown in Section IV.
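The adder count mentioned in the second advantage can be illustrated with a shift-and-add multiplication. The sketch below is a generic software analogy under assumed names, with $m$ held as an unsigned fixed-point integer; it is not a description of the adder tree in the proposed circuit. Each power-of-two term of the gradient costs one right shift, and combining $t$ shifted copies takes $t - 1$ additions.

```python
def shift_add_multiply(m_fixed, shifts):
    """Multiply m_fixed by sum(2**-s for s in shifts) using shifts and additions.

    Returns (product, adders_used). Bits shifted out are dropped, as in truncation.
    """
    if not shifts:
        raise ValueError("at least one power-of-two term is required")
    terms = [m_fixed >> s for s in shifts]   # one shifted copy per term
    total = terms[0]
    adders = 0
    for term in terms[1:]:
        total += term                        # each extra term costs one adder
        adders += 1
    return total, adders

if __name__ == "__main__":
    half_q15 = 1 << 14                       # 0.5 with 15 fractional bits
    product, adders = shift_add_multiply(half_q15, [1, 5, 6])
    print(product / (1 << 15), adders)       # 0.5 * (2**-1 + 2**-5 + 2**-6), 2 adders
```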

To determine the values of $grad_{i,j}$ and $const_{i,j}$, we investigated the relationships between the logarithmic and antilogarithmic conversions and the corresponding approximation errors. Graphically, (8) can be visualized as shown in Fig. 1, where Fig. 1(b) depicts $X_{fract} = 0.5\log_2(1 + m)$ of (5) when $k_l = 0$, and Fig. 1(c) depicts $X_{fract} = 0.5 + 0.5\log_2(1 + m)$ of (5) when $k_l = 1$. Fig. 1(a) is the plot of $2^{X_{fract}}$, rotated by 90° anticlockwise to show the relationships between the logarithmic and the antilogarithmic conversions.

We analyzed the errors when the $\log_2(1 + m)$ curve is partitioned into several regions and the curve segment in the $i$th region is approximated by one straight line, $a_i \times m + b_i$, that connects the end points of the curve segment. In other words, a single piecewise linear approximation of $\log_2(1 + m)$ is used to determine $X_{fract}$ in both Fig. 1(b) and (c). We found that the error is a positive ‘∩’-shaped parabola above the $m$-axis. For example, when the range of $m$ in Fig. 1(b) (and (c)) is divided at $P = (2^{-2} + 2^{-3} + 2^{-4}) = 0.4375$ into two regions ($i = \{1, 2\}$), $[0, P)$ and $[P, 1.0)$, the corresponding error curves are the parabolas in Fig. 2(a). Similar error characteristics are obtained when $m$ is divided into more regions.

For the antilogarithmic conversion, similar error characteristics are observed if the $2^{X_{fract}}$ curve in the $j$th region is approximated by a straight line, $1 + e_j \times X_{fract} + f_j$, that connects both end points of the curve within the region as in Fig. 1(a). The parabolic error curves in the antilogarithmic conversion are ‘∪’-shaped and always negative, as depicted in Fig. 2(b). In each region, the peak value of the error is located almost at the midpoint of the region for both the logarithmic and antilogarithmic conversions.

These ‘∩’- and ‘∪’-shaped error characteristics suggest that the partition of regions in the logarithmic conversion has to be related to that in the antilogarithmic conversion so that a significant portion of the errors cancels out. From the analysis in Fig. 1, we propose that the number of regions for the antilogarithmic conversion be twice that of the logarithmic conversion, with the boundary values of the regions in the antilogarithmic conversion dictated by the boundary values of the regions in the logarithmic conversion, as shown in Fig. 1. For example, the region $j = 1$ in Fig. 1(a) is from 0 to $0.5\log_2(1 + P)$. With more regions in the antilogarithmic conversion, the maximum of $\varepsilon_{AL}$ is always smaller than the maximum of $\varepsilon_{LA}$, which contributes a large portion of the approximation error. To minimize the maximum of $\varepsilon_{LA}$, the size of each region for the logarithmic conversion is adjusted such that the peak error values of all the regions are close to each other, i.e. the peaks are at similar levels.
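The error shapes described above can be reproduced numerically. The sketch below fits a chord through the end points of each region and samples the difference between the curve and the chord. All names are hypothetical. The first antilogarithmic boundary, $0.5\log_2(1 + P)$, is stated in the text; the remaining boundaries (0.5 and $0.5 + 0.5\log_2(1 + P)$) are an assumption inferred from the two cases $k_l = 0$ and $k_l = 1$.

```python
import math

P = 2**-2 + 2**-3 + 2**-4            # 0.4375, region boundary for m

def chord(f, lo, hi):
    """Straight line through (lo, f(lo)) and (hi, f(hi))."""
    slope = (f(hi) - f(lo)) / (hi - lo)
    return lambda x: f(lo) + slope * (x - lo)

def peak_error(f, lo, hi, samples=1001):
    """Return (x, err) where |f(x) - chord(x)| is largest on a uniform grid."""
    line = chord(f, lo, hi)
    xs = [lo + (hi - lo) * t / (samples - 1) for t in range(samples)]
    errs = [f(x) - line(x) for x in xs]
    idx = max(range(samples), key=lambda t: abs(errs[t]))
    return xs[idx], errs[idx]

def log_curve(m):
    return math.log2(1 + m)

def antilog_curve(x):
    return 2.0 ** x

B = 0.5 * math.log2(1 + P)
LOG_REGIONS = [(0.0, P), (P, 1.0)]
ANTILOG_REGIONS = [(0.0, B), (B, 0.5), (0.5, 0.5 + B), (0.5 + B, 1.0)]  # assumed

if __name__ == "__main__":
    for lo, hi in LOG_REGIONS:
        print("log    ", (lo, hi), peak_error(log_curve, lo, hi))
    for lo, hi in ANTILOG_REGIONS:
        print("antilog", (lo, hi), peak_error(antilog_curve, lo, hi))
```

The logarithmic peaks come out positive and the antilogarithmic peaks negative, each close to the middle of its region, which is the behaviour the cancellation argument relies on.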

Based on the above analysis, we divided the range of $m$ into two regions as in Fig. 1(b) and (c), and the range of $X_{fract}$ into four regions as in Fig. 1(a). The curves, $\log_2(1 + m)$ and $2^{X_{fract}}$, are then approximated by straight lines, each connecting the two end points of the curve segment in a region. The straight line in each region is then shifted up (or down) by half of the maximum value of $0.5\varepsilon_{LA}$ (or $\varepsilon_{AL}$) in the region. The infinite-precision values of $a_i$, $b_i$, $e_j$ and $f_j$ of the straight lines are used to determine the ideal values for $grad_{i,j}$ and $const_{i,j}$. Keeping only three power-of-two terms for $grad_{i,j}$ and truncating $const_{i,j}$ to 16 MSBs (most significant bits), the proposed expression for (8) to compute the square root in one stage is

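$$2^X = 2^{\frac{1}{2}k_m} AL_{i,j}(m), \tag{11}$$

where

$$
\begin{aligned}
AL_{1,1}(m) &= 2^{-1}m + (2^{-5} + 2^{-6})\,\overline{m}_{9MSBs} + (2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-9} + 2^{-15} + 2^{-16}), && \text{for } k_l = 0 \text{ and } 0 \le m < 0.4375 \\
AL_{2,2}(m) &= 2^{-2}m + (2^{-3} + 2^{-7})\,m_{9MSBs} + (2^{-5} + 2^{-9} + 2^{-11}), && \text{for } k_l = 0 \text{ and } 0.4375 \le m < 1.0 \\
AL_{1,3}(m) &= 2^{-1}m + (2^{-3} + 2^{-6})\,m_{9MSBs} + (2^{-2} + 2^{-3} + 2^{-5} + 2^{-7} + 2^{-9}), && \text{for } k_l = 1 \text{ and } 0 \le m < 0.4375 \\
AL_{2,4}(m) &= 2^{-1}m + (2^{-5} + 2^{-7})\,m_{9MSBs} + (2^{-2} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-7} + 2^{-10}), && \text{for } k_l = 1 \text{ and } 0.4375 \le m < 1.0
\end{aligned}
$$

where $m_{9MSBs}$ is the 9 MSBs of $m$ and $\overline{m}_{9MSBs} = 1 - m_{9MSBs} - 2^{-9}$ is the 1’s complement of $m_{9MSBs}$.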
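A floating-point sketch of (11) is given below as a rough software check of the four expressions. It is not a bit-accurate model of the circuit, and the function names are hypothetical. One point is an assumption of this sketch: the $AL_{1,1}$ expression evaluates to a value slightly above 1, whereas the other three evaluate to a fraction, so the sketch reads $AL_{1,1}$ as already containing the leading 1 (equivalently, as a fractional sum that wraps around) and adds the leading 1 explicitly only in the other three cases.

```python
import math

P = 0.4375   # region boundary for m

def pow2_sum(*exps):
    """Sum of 2**-e over the given exponents."""
    return sum(2.0 ** -e for e in exps)

def single_stage_sqrt(n):
    """Approximate sqrt(n) for a positive integer n using the expressions of (11)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1
    m = (n - (1 << k)) / (1 << k)
    k_l = k & 1
    k_m = k - k_l
    m9 = math.floor(m * 512) / 512     # 9 MSBs of m (truncated)
    m9_bar = 1 - m9 - 2.0 ** -9        # 1's complement of the 9 MSBs
    if k_l == 0 and m < P:
        # Assumption: this sum already includes the leading 1 (it is about 1.0 to 1.2).
        mantissa = 0.5 * m + pow2_sum(5, 6) * m9_bar + pow2_sum(1, 2, 3, 4, 6, 9, 15, 16)
    elif k_l == 0:
        mantissa = 1 + 0.25 * m + pow2_sum(3, 7) * m9 + pow2_sum(5, 9, 11)
    elif m < P:
        mantissa = 1 + 0.5 * m + pow2_sum(3, 6) * m9 + pow2_sum(2, 3, 5, 7, 9)
    else:
        mantissa = 1 + 0.5 * m + pow2_sum(5, 7) * m9 + pow2_sum(2, 3, 4, 6, 7, 10)
    return 2.0 ** (k_m // 2) * mantissa

if __name__ == "__main__":
    worst = max(abs(single_stage_sqrt(n) / math.sqrt(n) - 1) for n in range(1, 1 << 16))
    print(f"worst relative error over 16-bit inputs: {worst:.3%}")
```

Under that reading, the worst relative error over all 16-bit inputs stays below half a percent in this model, which is in line with the accuracy figure reported in Section IV.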

IV. PERFORMANCE ANALYSIS

A. Accuracy

We have simulated the 3-stage method and the proposed single-stage method in Matlab. For the 3-stage method, we simulated all the combinations of the Mitchell-based logarithmic algorithms [6-9], [11-12] with two antilogarithmic algorithms [7], [14]. 64-bit double-precision arithmetic is used to emulate the real values of $\sqrt{N}$. For integer input with wordlength greater than 16 bits, the combination of the logarithmic conversion in [6] and the antilogarithmic conversion in [14] gives the lowest maximum percentage error of 0.45%. The proposed single-stage method obtains a maximum percentage error of 0.34% (equivalent to 8-bit output precision), an improvement of 24.44%. Further improvement is possible by increasing the number of regions.
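The two derived figures in this paragraph follow from the reported maximum errors by simple arithmetic, sketched below. The variable names are hypothetical, and reading "equivalent precision" as $-\log_2$ of the relative error is an assumption of this sketch.

```python
import math

three_stage_max_err = 0.45    # percent, best 3-stage combination ([6] + [14])
single_stage_max_err = 0.34   # percent, proposed single-stage method

# Relative reduction of the maximum error, in percent.
improvement = (three_stage_max_err - single_stage_max_err) / three_stage_max_err * 100

# Number of bits for which the relative error stays below one unit in the last place.
equivalent_bits = -math.log2(single_stage_max_err / 100)

if __name__ == "__main__":
    print(f"improvement: {improvement:.2f}%")
    print(f"equivalent precision: {equivalent_bits:.1f} bits")
```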

B. Hardware Complexity

The proposed method is implemented in an architecture similar to Fig. 2 of [9] and Fig. 3(a) of [14]. For an $n$-bit input, an $n$-bit LOD, an $n$-word × $\lceil \log_2 n \rceil$-bit LUT and an $(n - 1)$-bit logarithmic shifter are used to obtain $k$ and $m$ of (11). Subsequently, the 15 MSBs of $m$ ($(n - 1)$ bits) are input to the error correction circuit implementing $AL_{i,j}$ as shown in Fig. 3, where a 2-stage carry save adder (CSA) tree in the dotted polygon, with each circle representing a 1-bit operand, is used to accumulate the operands before they are summed by a 16-bit carry propagation adder. A ‘1’ bit is appended to the left of the output of the error correction circuit before it is input to an $n$-bit logarithmic shifter which implements $2^{\frac{1}{2}k_m}$ in (11).

Figure 1. Relationships between logarithmic and antilogarithmic conversions: (a) $2^{X_{fract}}$ (90° anticlockwise rotated); (b) $0.5 + 0.5\log_2(1 + m)$; (c) $0.5\log_2(1 + m)$. [Plot not reproduced. Panels (b) and (c) share the $m$ axis, split at $P$ into regions $i = 1$ and $i = 2$, and show the curves together with the lines $0.5 + 0.5(a_i m + b_i)$ and $0.5(a_i m + b_i)$; panel (a) shows $2^{X_{fract}}$ with the lines $1 + e_j \times X_{fract} + f_j$ over regions $j = 1$ to $4$.]

Figure 2. Error curves: (a) $0.5\varepsilon_{LA}$ and (b) $\varepsilon_{AL}$. [Plot not reproduced. Panel (a) is plotted against $m$ for regions $i = 1, 2$ with a vertical axis in units of $10^{-3}$ and ticks from 0 to 12; panel (b) is plotted against $X_{fract}$ for regions $j = 1$ to $4$ with a vertical axis in units of $10^{-3}$ and ticks down to $-8$.]

Table I gives the area and speed comparisons between the proposed circuit in Fig. 3 and the architecture proposed in [4] for realizing the modified Dijkstra algorithm [3]. The basic building block of the architecture in [4] consists mainly of two $n$-bit adders, a comparator and two shifters. We use the unit-gate model in [15], in which a 2-input monotonic gate, such as a NAND gate, has one unit of area and one unit of delay, and a monotonic gate is conservatively assumed to consist of six transistors (T) in a classical CMOS process. A $2^{W_I} \times W_o$ ROM is estimated to have an area of $\left(2^{\lceil W_I/2 \rceil}(\lceil W_I/2 \rceil + 1) + 2^{W_I}W_o + 2^{\lfloor W_I/2 \rfloor}W_o(\lfloor W_I/2 \rfloor + 2) + W_o(2^{\lfloor W_I/2 \rfloor} + 2)\right)$T and a delay of $(1 + \lceil \log_2 W_I \rceil + \lceil W_I/2 \rceil)$ units [16]. Table I shows the estimated area and delay of the basic building block of the architecture in Fig. 15 of [4], the 3-stage design and the proposed single-stage design. The proposed single-stage design is smaller by 1.6% and faster by 15.0% even when it is compared with only the basic building block in [4]. For 16-bit integer input, [4] requires at least either 8×4320T of area (pipelined) or 8×140 units of latency (looped) to achieve 8-bit output precision. In other words, the proposed single-stage method is about 9 times faster or 8 times smaller than [4]. Note that iterative methods such as [4] are well known for being area-efficient at the expense of large latency. Compared to the 3-stage design, the proposed single-stage design reduces the area cost by 30.1% and the computation delay by 40.8%, as shown in Table I.
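The percentages and ratios quoted in this paragraph can be recomputed from the Table I entries, as in the sketch below. The dictionary keys and function name are hypothetical. The figures for [4] are lower bounds in Table I, so the ratios derived from them are lower bounds as well.

```python
AREA_T = {"ref4_basic_block": 4320, "three_stage": 6086, "single_stage": 4252}
DELAY_UNITS = {"ref4_basic_block": 140, "three_stage": 201, "single_stage": 119}
ITERATIONS = 8   # iterations of [4] needed for 8-bit output precision

def reduction_percent(reference, proposed):
    """Relative saving of `proposed` with respect to `reference`, in percent."""
    return (reference - proposed) / reference * 100

pipelined_area = ITERATIONS * AREA_T["ref4_basic_block"]     # area of the pipelined [4]
looped_delay = ITERATIONS * DELAY_UNITS["ref4_basic_block"]  # latency of the looped [4]
area_ratio = pipelined_area / AREA_T["single_stage"]
speed_ratio = looped_delay / DELAY_UNITS["single_stage"]

if __name__ == "__main__":
    print(f"vs basic block of [4]: area -{reduction_percent(4320, 4252):.1f}%, "
          f"delay -{reduction_percent(140, 119):.1f}%")
    print(f"vs 3-stage design:     area -{reduction_percent(6086, 4252):.1f}%, "
          f"delay -{reduction_percent(201, 119):.1f}%")
    print(f"pipelined [4] is {area_ratio:.1f}x larger, looped [4] is {speed_ratio:.1f}x slower")
```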

V. CONCLUSION

A novel single-stage circuit based on Mitchell’s logarithmic algorithms for computing integer square root was developed. The circuit is compact in size and fast in computation speed. Based on the proposed approach, circuits for computing other arithmetic functions such as multiplication, division, exponential, etc., could be developed in the future.

REFERENCES

[1] A. G. M. Strollo, D. De Caro, and N. Petra, “Elementary functions hardware implementation using constrained piecewise-polynomial approximations,” IEEE Trans. Comp., vol. 60, pp. 418-432, 2011.

[2] F. de Dinechin and A. Tisserand, “Multipartite table methods,” IEEE Trans. Comp., vol. 54, pp. 319-330, 2005.

[3] M. T. Tommiska, “Area-efficient implementation of a fast square root algorithm,” in Proc. 3rd IEEE Int. Caracas Conf. Devices, Circuits Syst., pp. S18/1-S18/4, 2000.

[4] V. B. Alluri, J. R. Heath, and M. Lhamon, “A new multichannel, coherent amplitude modulated, time-division multiplexed, software-defined radio receiver architecture, and field-programmable-gate-array technology implementation,” IEEE Trans. Signal Processing, vol. 58, pp. 5369-5384, 2010.

[5] J. N. Mitchell, “Computer multiplication and division using binary logarithm,” IRE Trans. Comp., vol. EC-11, pp. 512-517, 1962.

[6] M. Combet, H. V. Zonneveld, and L. Verbeek, “Computation of the base two logarithm of binary numbers,” IEEE Trans. Electronic Comp., vol. EC-14, pp. 863-867, 1965.

[7] E. L. Hall, D. D. Lynch, and S. J. Dwyer, “Generation of products and quotients using approximate binary logarithms for digital filtering applications,” IEEE Trans. Computers, vol. C-19, no. 2, pp. 97-105, 1970.

[8] S. L. SanGregory, C. Brothers, D. Gallagher, and R. Siferd, “A fast, low-power logarithm approximation with CMOS VLSI implementation,” in Proc. 42nd Midwest Symp. Circuits and Systems, vol. 1, pp. 388-391, 1999.

[9] K. H. Abed and R. E. Siferd, “CMOS VLSI implementation of a low-power logarithmic converter,” IEEE Trans. Comp., vol. 52, pp. 1421-1433, 2003.

[10] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, “A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system,” IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373-2381, 2006.

[11] Z. Li, J. An, M. Yang, and J. Yang, “FPGA design and implementation of an improved 32-bit binary logarithm converter,” in Proc. 4th Int. Conf. Wireless Communications, Networking and Mobile Computing (WiCOM '08), pp. 1-4, 2008.

[12] T.-B. Juang and S.-H. Chen, “A lower error and ROM-free logarithmic converter for digital signal processing applications,” IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 56, no. 12, pp. 931-935, 2009.

[13] D. De Caro, N. Petra, and A. G. M. Strollo, “Efficient logarithmic converters for digital signal processing applications,” IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 58, no. 10, pp. 667-671, 2011.

[14] K. H. Abed and R. E. Siferd, “VLSI implementation of a low-power antilogarithmic converter,” IEEE Trans. Comp., vol. 52, pp. 1221-1228, 2003.

[15] H. T. Vergos, C. Efstathiou, and D. Nikolos, “Diminished-one modulo 2^n+1 adder design,” IEEE Trans. Comp., vol. 51, pp. 1389-1399, 2002.

[16] Z. D. Ulman, and M. Czyzak, “Highly parallel, fast scaling of numbers in nonredundant residue arithmetic,” IEEE Trans. Signal Processing, vol. 46, pp. 487-496, 1998.

Figure 3. Error correction circuit implementing (11). [Schematic not reproduced. Its labels include the inputs m1 to m15 and kl, the signals r1 to r4, and the adder tree operands R, S, T and U.]

TABLE I. AREA AND SPEED COMPARISONS (n = 16)

Design                                              Area (T)               Delay (unit)
Design in Fig. 15 of [4]: basic building block      > 4320                 > 140
Design in Fig. 15 of [4]: pipelined architecture    > 8 × 4320 = 34560     > 140
Design in Fig. 15 of [4]: looped architecture       > 4320                 > 8 × 140 = 1120
3-stage design: [6] + [14]                          6086                   201
Proposed single-stage design                        4252                   119