Page 1: [IEEE 2012 IEEE International Symposium on Circuits and Systems - ISCAS 2012 - Seoul, Korea (South) (2012.05.20-2012.05.23)] 2012 IEEE International Symposium on Circuits and Systems

A Fast and Compact Circuit for Integer Square Root Computation Based on Mitchell Logarithmic Method

Joshua Yung Lih Low, Ching Chuen Jong, Jeremy Yung Shern Low, Thian Fatt Tay, and Chip-Hong Chang

School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798

{josh0013}@e.ntu.edu.sg

Abstract: A novel non-iterative circuit for computing the integer square root based on logarithms is proposed in this paper. Mitchell's methods are used for the logarithmic and antilogarithmic conversions. The proposed method merges the two conversion stages into a single one to achieve better accuracy with a compact architecture. Hence, the circuit size and latency are reduced. Compared to an existing design based on the modified Dijkstra algorithm used in a coherent receiver, the proposed design is either 8 times smaller or 9 times faster for 16-bit integer input.

I. INTRODUCTION

Various algorithms have been proposed to evaluate the square root in floating point. They can be broadly classified into non-iterative and iterative categories. The non-iterative category includes the piecewise table-lookup [1] and the tables-and-additions [2] algorithms. On the other hand, the Newton-Raphson and the Goldschmidt algorithms are two examples in the iterative category. Integer numbers are often used in application-specific processors for simpler implementation of arithmetic operations [3]. One instance is the integer square root operation in the coherent receiver [4], which uses the modified Dijkstra algorithm [3]. The modified Dijkstra algorithm is iterative in nature and can be implemented in a looped or a pipelined architecture. The looped architecture suffers from low throughput while the pipelined architecture has a high area cost.

Logarithms can be used to simplify the computation of arithmetic functions such as multiplication and division. Conventionally, the computation has three stages: logarithmic conversion, operation in the logarithmic domain and antilogarithmic conversion. In this work, we studied the computation of the integer square root using Mitchell's logarithmic algorithms [5-14]. We investigated the 3-stage method and analyzed the errors in the logarithmic and antilogarithmic conversions. Based on the error characteristics, we found that if the 3 stages are merged into a single stage, not only can the errors be reduced but the computation speed can be increased as well. Hence, we propose a novel non-iterative single-stage circuit for integer square root computation based on the Mitchell logarithm. The single-stage circuit requires only one error correction function. As a result, higher computation speed, smaller circuit size and more accurate results are achieved. To the best of our knowledge, this work is the first to investigate a non-iterative algorithm for the integer square root based on logarithmic operations.

The rest of the paper is organized as follows. Section II describes the computation of the integer square root using Mitchell's algorithms. Section III presents the proposed single-stage method and circuit. The performance analysis is given in Section IV. Section V concludes the paper.

II. 3-STAGE MITCHELL-BASED SQUARE ROOT COMPUTATION

Based on logarithmic operations, the square root of a binary number $N$ can be expressed as

$$\sqrt{N} = N^{\frac{1}{2}} = 2^{\frac{1}{2}\log_2(N)}.$$

The conventional computation involves operations in three stages: (1) logarithmic conversion, $N_l = \log_2(N)$; (2) computation of the square root in the logarithmic domain, $X = \frac{1}{2}N_l$; and (3) antilogarithmic conversion of $X$ to the linear domain, i.e., computing $2^X$. Based on Mitchell's methods, the three stages are described below.

A. Stage 1: Logarithmic Conversion

In [5], Mitchell proposed a straight line approximation of the logarithm of an $n$-bit unsigned binary number $N = \sum_{i=h}^{k} 2^i z_i$ in the interval $0 \le N < 2^{k+1}$, where $h \le k$ are integers and $z_i \in \{0, 1\}$. Mathematically, $N$ can be written as

$$N = 2^k(1 + m)$$

where $k$ is the leading one position and $m = \sum_{i=h}^{k-1} 2^{i-k} z_i$ is in the range $[0, 1)$. Applying the base-2 logarithm to $N$,

$$N_l = \log_2(N) = k + \log_2(1 + m).$$

The value of $k$ can be obtained easily using a leading-one detector (LOD) and a small look-up table (LUT). $N_l$ can be found after $\log_2(1 + m)$ is computed. Mitchell approximated $\log_2(1 + m)$ with a straight line $LA(1 + m) = a \times m + b$, where $a = 1$ and $b = 0$. To improve the accuracy, numerous Mitchell-based logarithmic conversion algorithms [5-13] have been proposed. Generally, these algorithms divide the range of $m$, $[0, 1)$, into a certain number of regions and then apply a piecewise straight line approximation $LA_{algo}$ (a.k.a. error correction) on the regions as shown below.

$$\log_2(1 + m) \cong LA_{algo,i}(1 + m) = a_i \times m + b_i \tag{1}$$

where $a_i$ and $b_i$ are, respectively, the gradient and the constant in the $i$th region. The Mitchell-based approximation is expressed as

$$N_l = k + LA_{algo,i}(1 + m) + \varepsilon_{LA}(m) = k + a_i \times m + b_i + \varepsilon_{LA}(m) \tag{2}$$

where $\varepsilon_{LA}(m) = \log_2(1 + m) - (a_i \times m + b_i)$ is the approximation error.
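The single-region case of Stage 1 ($a = 1$, $b = 0$) can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit: the function names `mitchell_log2` and `log_error` are hypothetical, and `int.bit_length()` stands in for the LOD and LUT.

```python
import math

def mitchell_log2(n: int) -> float:
    """Mitchell's straight-line estimate of log2(n): k + m, i.e. a = 1 and b = 0."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1           # leading-one position (software stand-in for the LOD)
    m = (n - (1 << k)) / (1 << k)    # fraction m in [0, 1), so that n = 2**k * (1 + m)
    return k + m

def log_error(n: int) -> float:
    """Approximation error eps_LA for the single-region case: log2(n) - (k + m)."""
    return math.log2(n) - mitchell_log2(n)

if __name__ == "__main__":
    for n in (8, 12, 1000):
        print(n, mitchell_log2(n), log_error(n))
```

Because $\log_2(1 + m) \ge m$ on $[0, 1)$, the error of this uncorrected line is never negative, which is what the piecewise corrections in (1) are meant to reduce.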

B. Stage 2: Logarithmic Square Root

The computation of the square root in the logarithmic domain is merely a divide-by-2 (1-bit shift) operation. Dividing (2) by 2,

$$\begin{aligned}
X = \tfrac{1}{2}N_l &= \tfrac{1}{2}\left[k + a_i \times m + b_i + \varepsilon_{LA}(m)\right] \\
&= \tfrac{1}{2}k_m + \tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m) \\
&= X_{int} + X_{fract}
\end{aligned} \tag{3}$$

where $k_l$ is the least significant bit (LSB) of $k$, $k_m = k - k_l$, and

$$X_{int} = \tfrac{1}{2}k_m \tag{4}$$

and

$$X_{fract} = \tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m) \tag{5}$$

are the integer and fractional parts of $X$, respectively.

C. Stage 3: Antilogarithmic Conversion

Applying the Mitchell straight line approximation of the antilogarithm to (3), the antilogarithm of $X$ is

$$2^X = 2^{X_{int}} \cdot 2^{X_{fract}}.$$

$2^{X_{int}}$ can be realized easily using a shifter. Thus, $2^X$ can be obtained once $2^{X_{fract}}$ is computed. In [5], Mitchell approximated $2^{X_{fract}}$ with a straight line $AL(X_{fract}) = 1 + e \times X_{fract} + f$, where $e = 1$ and $f = 0$. Many Mitchell-based antilogarithmic algorithms [7], [14] have been proposed to improve the accuracy by further dividing the range of $X_{fract}$, $[0, 1)$, into two or more regions and applying a piecewise straight line approximation on the regions as given below.

$$2^{X_{fract}} \cong AL_{algo,j}(X_{fract}) = 1 + e_j \times X_{fract} + f_j \tag{6}$$

where $e_j$ and $f_j$ are the gradient and the constant in the $j$th region. Therefore, the Mitchell-based piecewise straight line approximation of the antilogarithm is expressed as

$$\begin{aligned}
2^X &= 2^{X_{int}} \cdot \left[AL_{algo,j}(X_{fract}) + \varepsilon_{AL}(X_{fract})\right] \\
&= 2^{X_{int}} \cdot \left[1 + e_j \times X_{fract} + f_j + \varepsilon_{AL}(X_{fract})\right]
\end{aligned} \tag{7}$$

where $\varepsilon_{AL}(X_{fract}) = 2^{X_{fract}} - (1 + e_j \times X_{fract} + f_j)$ is the approximation error.
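The three stages of Section II can be chained in a short Python sketch. It assumes the original, uncorrected Mitchell lines ($a = 1$, $b = 0$, $e = 1$, $f = 0$) rather than any of the improved algorithms in [6-14], and the function name `mitchell_sqrt_3stage` is hypothetical.

```python
import math

def mitchell_sqrt_3stage(n: int) -> float:
    """3-stage square root estimate with the plain Mitchell lines (a = e = 1, b = f = 0)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Stage 1: logarithmic conversion, N_l ~ k + m
    k = n.bit_length() - 1
    m = (n - (1 << k)) / (1 << k)
    # Stage 2: halve N_l and split it into integer and fractional parts, as in (3)-(5)
    k_l = k & 1                       # LSB of k
    k_m = k - k_l                     # even part of k
    x_int = k_m // 2
    x_fract = 0.5 * k_l + 0.5 * m     # always in [0, 1)
    # Stage 3: antilogarithmic conversion, 2**x_fract ~ 1 + x_fract, then shift by x_int
    return (1 << x_int) * (1.0 + x_fract)

if __name__ == "__main__":
    for n in (2, 16, 1000, 65535):
        print(n, mitchell_sqrt_3stage(n), math.sqrt(n))
```

Without error correction this estimate never falls below the true square root and can exceed it by roughly 6% (for example at $N = 2$, where it returns 1.5), which illustrates why both conversions are normally corrected region by region.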

III. PROPOSED SINGLE-STAGE METHOD

As discussed in Section II, the computation of the square root of an integer number ($h = 0$ in Section II.A) can be implemented directly using the Mitchell-based logarithmic and antilogarithmic conversions in 3 stages. In the 3-stage implementation, the antilogarithmic conversion can be performed only after the logarithmic conversion is completed. The total computation time is the sum of the critical path delays in the two conversions. To reduce the total computation time, we propose to merge the error correction circuit of the logarithmic conversion into that of the antilogarithmic conversion and to capitalize on the fact that the logarithmic square root operation is only a simple 1-bit right shift, as described below.

The three stages are merged into a single stage by substituting $X_{int}$ (4) and $X_{fract}$ (5) into (7), which becomes

$$\begin{aligned}
2^X &= 2^{\frac{1}{2}k_m}\left\{1 + e_j \times \left(\tfrac{1}{2}k_l + \tfrac{1}{2}a_i \times m + \tfrac{1}{2}b_i + \tfrac{1}{2}\varepsilon_{LA}(m)\right) + f_j + \varepsilon_{AL}(X_{fract})\right\} \\
&= 2^{\frac{1}{2}k_m}\left\{1 + \left(\tfrac{1}{2}a_i \times e_j\right) \times m + \left(\tfrac{1}{2}k_l \times e_j + \tfrac{1}{2}b_i \times e_j + f_j\right) + \left[\tfrac{1}{2}\varepsilon_{LA}(m) \times e_j + \varepsilon_{AL}(X_{fract})\right]\right\} \\
&= 2^{\frac{1}{2}k_m}\left(1 + \mathit{grad}_{i,j} \times m + \mathit{const}_{i,j} + \varepsilon\right)
\end{aligned} \tag{8}$$

where

$$\mathit{grad}_{i,j} = \tfrac{1}{2}a_i \times e_j, \tag{9}$$

$$\mathit{const}_{i,j} = e_j \times \tfrac{1}{2}k_l + \tfrac{1}{2}b_i \times e_j + f_j \tag{10}$$

and $\varepsilon = \tfrac{1}{2}\varepsilon_{LA}(m) \times e_j + \varepsilon_{AL}(X_{fract})$ is the approximation error.

It can be seen from (9) and (10) that the gradients, $a_i$ and $e_j$, and the constants, $b_i$ and $f_j$, from the two conversion stages are now combined in $\mathit{grad}_{i,j}$ and $\mathit{const}_{i,j}$, respectively. As such, only a single error correction circuit is needed, which offers the following two advantages.

1) The two adders for the constants (one in (1) and the other in (6)) are reduced to only one in (8), as $\mathit{const}_{i,j}$ (10) can be pre-computed.

2) As shown in [6] and [9], the values for the gradient $a_i$ in (1) must contain as few power-of-two terms as possible because $d$ terms require $(d - 1)$ adders. This constraint restricts the choice of the correction function and hence reduces the accuracy in approximating $\log_2(1 + m)$. The same constraint applies to $e_j$ in (6). With the merging of the error correction functions as in (8), only one gradient, $\mathit{grad}_{i,j}$, is subject to the constraint. This advantage translates to fewer adders and higher accuracy of approximation, as shown in Section IV.
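The algebra behind (9) and (10) can be checked numerically. The sketch below evaluates the linear part of the 3-stage path (Stages 1 to 3 without the error terms) and the merged form of (8), and compares them. The coefficient values used in the check are arbitrary placeholders chosen for illustration, not values from any cited algorithm, and all function names are hypothetical.

```python
def three_stage_mantissa(m, k_l, a, b, e, f):
    """1 + e*X_fract + f, with X_fract from (5) and the error terms dropped."""
    x_fract = 0.5 * k_l + 0.5 * (a * m + b)
    return 1.0 + e * x_fract + f

def merged_coefficients(k_l, a, b, e, f):
    """Pre-computable gradient (9) and constant (10) of the merged correction."""
    grad = 0.5 * a * e
    const = e * 0.5 * k_l + 0.5 * b * e + f
    return grad, const

def single_stage_mantissa(m, k_l, a, b, e, f):
    """1 + grad*m + const, the bracketed part of (8) with the error term dropped."""
    grad, const = merged_coefficients(k_l, a, b, e, f)
    return 1.0 + grad * m + const

if __name__ == "__main__":
    # Placeholder coefficients (assumptions for illustration only).
    a, b, e, f = 1.2, 0.01, 0.9, 0.02
    for k_l in (0, 1):
        for step in range(11):
            m = step / 11
            diff = three_stage_mantissa(m, k_l, a, b, e, f) - single_stage_mantissa(m, k_l, a, b, e, f)
            assert abs(diff) < 1e-12
    print("merged and 3-stage linear parts agree")
```

Since `merged_coefficients` does not depend on $m$, one multiplication-free gradient term and one stored constant per region pair $(i, j)$ are enough, which is the point of advantage 1) above.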

To determine the values of $\mathit{grad}_{i,j}$ and $\mathit{const}_{i,j}$, we investigated the relationships between the logarithmic and antilogarithmic conversions and the corresponding approximation errors. Graphically, (8) can be visualized as shown in Fig. 1, where Fig. 1(b) depicts $X_{fract} = 0.5\log_2(1 + m)$ of (5) when $k_l = 0$, and Fig. 1(c) depicts $X_{fract} = 0.5 + 0.5\log_2(1 + m)$ of (5) when $k_l = 1$. Fig. 1(a) is the plot of $2^{X_{fract}}$, rotated by 90° anticlockwise to show the relationships between the logarithmic and the antilogarithmic conversions.

We analyzed the errors when the $\log_2(1 + m)$ curve is partitioned into several regions and the curve segment in the $i$th region is approximated by one straight line, $a_i \times m + b_i$, that connects the end points of the curve segment. In other words, a single piecewise linear approximation of $\log_2(1 + m)$ is used to determine $X_{fract}$ in both Fig. 1(b) and (c). We found that the error is a positive '∩'-shaped parabola above the $m$-axis. For example, when the range of $m$ in Fig. 1(b) (and (c)) is divided at $P = 2^{-2} + 2^{-3} + 2^{-4} = 0.4375$ into two regions ($i \in \{1, 2\}$), $[0, P)$ and $[P, 1.0)$, the corresponding error curves are the parabolas in Fig. 2(a). Similar error characteristics are obtained when $m$ is divided into more regions.

For the antilogarithmic conversion, similar error characteristics are also observed if the $2^{X_{fract}}$ curve in the $j$th region is approximated by a straight line, $1 + e_j \times X_{fract} + f_j$, that connects both end points of the curve within the region as in Fig. 1(a). The parabolic error curves in the antilogarithmic conversion are '∪'-shaped and always negative, as depicted in Fig. 2(b). Note that in each region, the peak value of the error is located almost at the midpoint of the region for both the logarithmic and antilogarithmic conversions.

The above error characteristics of the '∩'- and '∪'-shaped curves suggest that the partition of regions in the logarithmic conversion has to be related to that in the antilogarithmic conversion so that a significant portion of the errors can be cancelled out. From the analysis in Fig. 1, we propose that the number of regions for the antilogarithmic conversion be twice that of the logarithmic conversion, and that the boundary values of the regions in the antilogarithmic conversion be dictated by the boundary values of the regions in the logarithmic conversion, as shown in Fig. 1. For example, the region $j = 1$ in Fig. 1(a) is from 0 to $0.5\log_2(1 + P)$.

With more regions in the antilogarithmic conversion, the maximum of $\varepsilon_{AL}$ is always smaller than the maximum of $\varepsilon_{LA}$, which contributes a large portion of the approximation error. To minimize the maximum of $\varepsilon_{LA}$, the size of each region for the logarithmic conversion is adjusted such that the peak error values of all the regions are close to each other, i.e., the peaks are at similar levels.
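The sign and size of the two error curves can be reproduced with a short numerical sketch. It samples the chord (end-point) approximation error of $\log_2(1 + m)$ on the two regions split at $P$, and of $2^{X_{fract}}$ on four regions. The boundaries of regions $j = 2$ to $4$ are inferred here from the stated rule that they follow the logarithmic boundaries, so they are an assumption; `chord_errors` and the other names are hypothetical.

```python
import math

P = 2**-2 + 2**-3 + 2**-4            # 0.4375, boundary between regions i = 1 and i = 2

def chord_errors(func, lo, hi, samples=1000):
    """Sampled values of func(x) - chord(x) on [lo, hi]; the chord joins the end points."""
    slope = (func(hi) - func(lo)) / (hi - lo)
    errors = []
    for s in range(samples + 1):
        x = lo + (hi - lo) * s / samples
        errors.append(func(x) - (func(lo) + slope * (x - lo)))
    return errors

def log_curve(m):
    return math.log2(1.0 + m)

def antilog_curve(x):
    return 2.0 ** x

log_regions = [(0.0, P), (P, 1.0)]
half_log_p = 0.5 * math.log2(1.0 + P)
# Assumed antilog boundaries: images of m = 0, P, 1 for k_l = 0 and for k_l = 1.
antilog_regions = [(0.0, half_log_p), (half_log_p, 0.5),
                   (0.5, 0.5 + half_log_p), (0.5 + half_log_p, 1.0)]

# Peak of 0.5 * eps_LA per region (positive) and peak of eps_AL per region (negative).
half_log_peaks = [max(0.5 * err for err in chord_errors(log_curve, lo, hi))
                  for lo, hi in log_regions]
antilog_peaks = [min(chord_errors(antilog_curve, lo, hi))
                 for lo, hi in antilog_regions]

if __name__ == "__main__":
    print("peaks of 0.5*eps_LA:", half_log_peaks)
    print("peaks of eps_AL:", antilog_peaks)
```

The opposite signs follow from the shapes of the curves: $\log_2(1 + m)$ is concave, so it lies above its chords, while $2^{X_{fract}}$ is convex and lies below them, which is what allows part of the two errors to cancel in (8).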

Based on the above analysis, we divided the range of $m$ into two regions as in Fig. 1(b) and (c), and the range of $X_{fract}$ into four regions as in Fig. 1(a). The curves, $\log_2(1 + m)$ and $2^{X_{fract}}$, are then approximated by straight lines, each connecting the two end points of the curve segment in a region. The straight line in each region is then shifted up (down) by half of the maximum of $0.5\varepsilon_{LA}$ (or $\varepsilon_{AL}$) of the region. The infinite precision values of $a_i$, $b_i$, $e_j$ and $f_j$ of the straight lines are used to determine the ideal values for $\mathit{grad}_{i,j}$ and $\mathit{const}_{i,j}$. Keeping only three power-of-two terms for $\mathit{grad}_{i,j}$ and truncating $\mathit{const}_{i,j}$ to 16 MSBs (most significant bits), the proposed expression for (8) to compute the square root in one stage is

$$2^X = 2^{\frac{1}{2}k_m} AL_{i,j}(m), \tag{11}$$

where

$$AL_{1,1}(m) = 2^{-1}m + (2^{-5} + 2^{-6})\,\overline{m}_{9MSBs} + (2^{-1} + 2^{-2} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-9} + 2^{-15} + 2^{-16}), \quad \text{for } k_l = 0 \text{ and } 0 \le m < 0.4375$$

$$AL_{2,2}(m) = 2^{-2}m + (2^{-3} + 2^{-7})\,m_{9MSBs} + (2^{-5} + 2^{-9} + 2^{-11}), \quad \text{for } k_l = 0 \text{ and } 0.4375 \le m < 1.0$$

$$AL_{1,3}(m) = 2^{-1}m + (2^{-3} + 2^{-6})\,m_{9MSBs} + (2^{-2} + 2^{-3} + 2^{-5} + 2^{-7} + 2^{-9}), \quad \text{for } k_l = 1 \text{ and } 0 \le m < 0.4375$$

$$AL_{2,4}(m) = 2^{-1}m + (2^{-5} + 2^{-7})\,m_{9MSBs} + (2^{-2} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-7} + 2^{-10}), \quad \text{for } k_l = 1 \text{ and } 0.4375 \le m < 1.0$$

where $m_{9MSBs}$ is the 9 MSBs of $m$ and $\overline{m}_{9MSBs} = 1 - m_{9MSBs} - 2^{-9}$ is the 1's complement of $m_{9MSBs}$.
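A floating-point sketch of (11) is given below as a way to experiment with the four correction functions. It is a software model under several assumptions that are not stated in the text: the fractional output of $AL_{i,j}$ is combined with an implicit leading 1 (following the hardware description of a '1' bit being appended), the integer carry produced by the large constant of $AL_{1,1}$ is discarded, and no 15-bit or 16-bit truncation of the datapath is modelled. The names `pow2_sum`, `al` and `single_stage_sqrt` are hypothetical.

```python
import math

P = 0.4375                                   # region boundary 2^-2 + 2^-3 + 2^-4

def pow2_sum(*exponents):
    """Sum of 2^-e over the given exponents, e.g. pow2_sum(5, 6) = 2^-5 + 2^-6."""
    return sum(2.0 ** -e for e in exponents)

def al(m: float, k_l: int) -> float:
    """Fractional correction AL_{i,j}(m) of (11), selected by k_l and the region of m."""
    m9 = math.floor(m * 512) / 512           # 9 MSBs of m
    m9_bar = 1.0 - m9 - 2.0 ** -9            # 1's complement of the 9 MSBs
    if k_l == 0 and m < P:                   # AL_{1,1}
        total = 0.5 * m + pow2_sum(5, 6) * m9_bar + pow2_sum(1, 2, 3, 4, 6, 9, 15, 16)
        return total - 1.0                   # ASSUMPTION: the integer carry is discarded
    if k_l == 0:                             # AL_{2,2}
        return 0.25 * m + pow2_sum(3, 7) * m9 + pow2_sum(5, 9, 11)
    if m < P:                                # AL_{1,3}
        return 0.5 * m + pow2_sum(3, 6) * m9 + pow2_sum(2, 3, 5, 7, 9)
    return 0.5 * m + pow2_sum(5, 7) * m9 + pow2_sum(2, 3, 4, 6, 7, 10)   # AL_{2,4}

def single_stage_sqrt(n: int) -> float:
    """Square root estimate 2^(k_m/2) * (1 + AL_{i,j}(m)) for a positive integer n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    k = n.bit_length() - 1                   # leading-one position
    m = (n - (1 << k)) / (1 << k)            # fraction in [0, 1)
    k_l = k & 1                              # LSB of k
    k_m = k - k_l                            # even part of k, realized as a shift
    return (1 << (k_m // 2)) * (1.0 + al(m, k_l))

if __name__ == "__main__":
    worst = max(abs(single_stage_sqrt(n) - math.sqrt(n)) / math.sqrt(n)
                for n in range(1, 1 << 16))
    print("largest relative error over 16-bit inputs:", worst)
```

The result is left as a real number so that it can be compared against `math.sqrt`; an integer square root would additionally truncate the shifted output.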

IV. PERFORMANCE ANALYSIS

A. Accuracy

We simulated the 3-stage method and the proposed single-stage method in Matlab. For the 3-stage method, we simulated all the combinations of the Mitchell-based logarithmic algorithms [6-9], [11-12] with two antilogarithmic algorithms [7], [14]. 64-bit double precision is used to emulate the real values of $\sqrt{N}$. For integer input with a wordlength greater than 16 bits, the combination of the logarithmic conversion in [6] and the antilogarithmic conversion in [14] gives the lowest maximum percentage error of 0.45%. The proposed single-stage method obtains a maximum percentage error of 0.34% (equivalent to 8-bit output precision), achieving an improvement of 24.44%. Further improvement is possible by increasing the number of regions.
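As a quick check of the two figures quoted above, the arithmetic can be written out as follows. Reading "8-bit output precision" as the number of bits implied by the maximum relative error is an assumption made here for illustration.

```latex
\frac{0.45\% - 0.34\%}{0.45\%} = \frac{0.11}{0.45} \approx 24.44\%,
\qquad
-\log_2(0.0034) \approx 8.2 \ \text{bits}.
```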

Figure 1. Relationships between logarithmic and antilogarithmic conversions: (a) $2^{X_{fract}}$ (90° anticlockwise rotated); (b) $0.5 + 0.5\log_2(1 + m)$; (c) $0.5\log_2(1 + m)$. Curve labels: $0.5 + 0.5\log_2(1 + m)$ and $0.5 + 0.5(a_i m + b_i)$; $0.5\log_2(1 + m)$ and $0.5(a_i m + b_i)$; $1 + e_j \times X_{fract} + f_j$. Regions $i = 1, 2$ (boundary at $P$) and $j = 1$ to $4$.

Figure 2. Error curves: (a) $0.5\varepsilon_{LA}$ versus $m$ (regions $i = 1, 2$; vertical scale $\times 10^{-3}$) and (b) $\varepsilon_{AL}$ versus $X_{fract}$ (regions $j = 1$ to $4$; vertical scale $\times 10^{-3}$).

B. Hardware Complexity

The proposed method is implemented in an architecture similar to Fig. 2 of [9] and Fig. 3(a) of [14]. For an $n$-bit input, an $n$-bit LOD, an $n$-word × $\lceil \log_2 n \rceil$-bit LUT and an $(n - 1)$-bit logarithmic shifter are used to obtain $k$ and $m$ of (11). Subsequently, the 15 MSBs of $m$ ($(n - 1)$ bits) are input to the error correction circuit implementing $AL_{i,j}$ as shown in Fig. 3, where a 2-stage carry save adder (CSA) tree in the dotted polygon, with each circle representing a 1-bit operand, is used to accumulate the operands before they are summed by a 16-bit carry propagation adder. A '1' bit is appended to the left of the output of the error correction circuit before it is input to an $n$-bit logarithmic shifter, which implements $2^{\frac{1}{2}k_m}$ in (11).

Table I gives the area and speed comparisons between the proposed circuit in Fig. 3 and the architecture proposed in [4] for realizing the modified Dijkstra algorithm [3]. The basic building block of the architecture in [4] consists mainly of two $n$-bit adders, a comparator and two shifters. We use the unit-gate model in [15], in which a 2-input monotonic gate, such as a NAND gate, has one unit of area and one unit of delay, and a monotonic gate is conservatively assumed to consist of six transistors (T) in a classical CMOS process. A $2^{W_I} \times W_o$ ROM is estimated to have an area of $\left(2^{\lceil W_I/2 \rceil}(\lceil W_I/2 \rceil + 1) + 2^{W_I}W_o + 2^{\lfloor W_I/2 \rfloor}W_o(\lfloor W_I/2 \rfloor + 2) + W_o(2^{\lfloor W_I/2 \rfloor} + 2)\right)$ T and a delay of $(1 + \lceil \log_2 W_I \rceil + \lceil W_I/2 \rceil)$ units [16]. Table I shows the estimated area and delay of the basic building block of the architecture in Fig. 15 of [4], the 3-stage design and the proposed single-stage design. The proposed single-stage design is smaller by 1.6% and faster by 15.0% even when it is compared with only the basic building block in [4]. For 16-bit integer input, [4] requires either at least 8×4320 T of area (pipelined) or 8×140 units of latency (looped) to achieve 8-bit output precision. In other words, the proposed single-stage method is about 9 times faster or 8 times smaller when compared to [4]. Note that iterative methods such as [4] are well known for being area-efficient at the expense of high latency. When compared to the 3-stage design, the proposed single-stage design reduces the area cost by 30.1% and the computation delay by 40.8%, as shown in Table I.
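The percentages and ratios quoted above follow directly from the Table I entries, as the short check below illustrates. The constant and function names are hypothetical, and because the table gives the figures for [4] as lower bounds, the ratios against [4] are lower bounds as well.

```python
# (area in transistors T, delay in unit-gate delays), copied from Table I for n = 16.
BASIC_BLOCK = (4320, 140)          # basic building block of [4] (lower bounds)
PIPELINED = (8 * 4320, 140)        # pipelined architecture of [4]
LOOPED = (4320, 8 * 140)           # looped architecture of [4]
THREE_STAGE = (6086, 201)          # 3-stage design: [6] + [14]
PROPOSED = (4252, 119)             # proposed single-stage design

def reduction(old: float, new: float) -> float:
    """Fractional saving of `new` relative to `old`."""
    return (old - new) / old

area_ratio_vs_pipelined = PIPELINED[0] / PROPOSED[0]     # "about 8 times smaller"
delay_ratio_vs_looped = LOOPED[1] / PROPOSED[1]          # "about 9 times faster"

if __name__ == "__main__":
    print("area ratio vs pipelined [4]:", area_ratio_vs_pipelined)
    print("delay ratio vs looped [4]:", delay_ratio_vs_looped)
    print("area saving vs basic block:", reduction(BASIC_BLOCK[0], PROPOSED[0]))
    print("delay saving vs basic block:", reduction(BASIC_BLOCK[1], PROPOSED[1]))
    print("area saving vs 3-stage:", reduction(THREE_STAGE[0], PROPOSED[0]))
    print("delay saving vs 3-stage:", reduction(THREE_STAGE[1], PROPOSED[1]))
```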

V. CONCLUSION

A novel single-stage circuit based on Mitchell's logarithmic algorithms for computing the integer square root was developed. The circuit is compact in size and fast in computation speed. Based on the proposed approach, circuits for computing other arithmetic functions such as multiplication, division, exponential, etc., could be developed in the future.

REFERENCES

[1] A. G. M. Strollo, D. De Caro, and N. Petra, "Elementary functions hardware implementation using constrained piecewise-polynomial approximations," IEEE Trans. Comp., vol. 60, pp. 418-432, 2011.

[2] F. de Dinechin and A. Tisserand, "Multipartite table methods," IEEE Trans. Comp., vol. 54, pp. 319-330, 2005.

[3] M. T. Tommiska, "Area-efficient implementation of a fast square root algorithm," in Proc. 3rd IEEE Int. Caracas Conf. Devices, Circuits Syst., pp. S18/1-S18/4, 2000.

[4] V. B. Alluri, J. R. Heath, and M. Lhamon, "A new multichannel, coherent amplitude modulated, time-division multiplexed, software-defined radio receiver architecture, and field-programmable-gate-array technology implementation," IEEE Trans. Signal Processing, vol. 58, pp. 5369-5384, 2010.

[5] J. N. Mitchell, "Computer multiplication and division using binary logarithm," IRE Trans. Comp., vol. EC-11, pp. 512-517, 1962.

[6] M. Combet, H. V. Zonneveld, and L. Verbeek, "Computation of the base two logarithm of binary numbers," IEEE Trans. Electronic Comp., vol. EC-14, pp. 863-867, 1965.

[7] E. L. Hall, D. D. Lynch, and S. J. Dwyer, "Generation of products and quotients using approximate binary logarithms for digital filtering applications," IEEE Trans. Computers, vol. C-19, no. 2, pp. 97-105, 1970.

[8] S. L. SanGregory, C. Brothers, D. Gallagher, and R. Siferd, "A fast, low-power logarithm approximation with CMOS VLSI implementation," in Proc. 42nd Midwest Symp. Circuits and Systems, vol. 1, pp. 388-391, 1999.

[9] K. H. Abed and R. E. Siferd, "CMOS VLSI implementation of a low-power logarithmic converter," IEEE Trans. Comp., vol. 52, pp. 1421-1433, 2003.

[10] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, "A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system," IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373-2381, 2006.

[11] Z. Li, J. An, M. Yang, and J. Yang, "FPGA design and implementation of an improved 32-bit binary logarithm converter," in Proc. 4th Int. Conf. Wireless Communications, Networking and Mobile Computing (WiCOM '08), pp. 1-4, 2008.

[12] T.-B. Juang and S.-H. Chen, "A lower error and ROM-free logarithmic converter for digital signal processing applications," IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 56, no. 12, pp. 931-935, 2009.

[13] D. De Caro, N. Petra, and A. G. M. Strollo, "Efficient logarithmic converters for digital signal processing applications," IEEE Trans. Circuits and Systems-II: Express Briefs, vol. 58, no. 10, pp. 667-671, 2011.

[14] K. H. Abed and R. E. Siferd, "VLSI implementation of a low-power antilogarithmic converter," IEEE Trans. Comp., vol. 52, pp. 1221-1228, 2003.

[15] H. T. Vergos, C. Efstathiou, and D. Nikolos, "Diminished-one modulo $2^n + 1$ adder design," IEEE Trans. Comp., vol. 51, pp. 1389-1399, 2002.

[16] Z. D. Ulman and M. Czyzak, "Highly parallel, fast scaling of numbers in nonredundant residue arithmetic," IEEE Trans. Signal Processing, vol. 46, pp. 487-496, 1998.

Figure 3. Error correction circuit implementing (11)

TABLE I. AREA AND SPEED COMPARISONS ($n = 16$)

Design                              Area (T)              Delay (unit)
Design in Fig. 15 of [4]:
  Basic building block              > 4320                > 140
  Pipelined architecture            > 8*4320 = 34560      > 140
  Looped architecture               > 4320                > 8*140 = 1120
3-stage design: [6] + [14]          6086                  201
Proposed single-stage design        4252                  119

