

    Hindawi Publishing Corporation, VLSI Design, Volume 2010, Article ID 794891, 19 pages, doi:10.1155/2010/794891

    Review Article

    CORDIC Architectures: A Survey

    B. Lakshmi and A. S. Dhar

    Department of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur, West Bengal 721302, India

    Correspondence should be addressed to B. Lakshmi, [email protected]

    Received 6 October 2009; Accepted 10 January 2010

    Academic Editor: Kiyoung Choi

    Copyright © 2010 B. Lakshmi and A. S. Dhar. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    In the last decade, the CORDIC algorithm has drawn wide attention from academia and industry for various applications such as DSP, biomedical signal processing, software defined radio, neural networks, and MIMO systems, to mention just a few. It is an iterative algorithm, requiring simple shift and addition operations, for hardware realization of basic elementary functions. Since CORDIC is used as a building block in various single chip solutions, the critical aspects to be considered are high speed, low power, and low area, for achieving reasonable overall performance. In this paper, we first classify the CORDIC algorithm based on the number system and discuss its importance in the implementation of the CORDIC algorithm. Then, we present a systematic and comprehensive taxonomy of rotational CORDIC algorithms, which are subsequently discussed in depth. Special attention has been devoted to the higher radix and flat techniques proposed in the literature for reducing the latency. Finally, a detailed comparison of various algorithms is presented, which can provide first-order information to designers looking for either further improvement of performance or selection of rotational CORDIC for a specific application.

    1. Introduction

    Current research in the design of high speed VLSI architectures for real-time digital signal processing (DSP) algorithms has been directed by the advances in VLSI technology, which have provided designers with significant impetus for porting algorithms into architectures. Many of the algorithms used in DSP and matrix arithmetic require elementary functions such as trigonometric, inverse trigonometric, logarithm, exponential, multiplication, and division functions. The commonly used software solutions for the digital implementation of these functions are the table lookup method and polynomial expansions, which require a number of multiplications and additions/subtractions. However, digit-by-digit methods exist for the evaluation of these elementary functions, which compute faster than software solutions.

    Some of the digit-by-digit methods for the computation of the above mentioned elementary functions were described by Henry Briggs in 1624 in "Arithmetica Logarithmica" [1, 2]. These are iterative pseudo division and pseudo multiplication processes, which resemble repeated-addition multiplication and repeated-subtraction division. In 1959, Volder proposed a special purpose digital computing unit known as the COordinate Rotation DIgital Computer (CORDIC), while building a real time navigational computer for use in an aircraft [3, 4]. This algorithm was initially developed for trigonometric functions, which were expressed in terms of basic plane rotations.

    The conventional method of implementation of the 2D vector rotation shown in Figure 1 using the Givens rotation transform is represented by the equations

    $$
    \begin{aligned}
    x_{out} &= x_{in}\cos\theta - y_{in}\sin\theta,\\
    y_{out} &= x_{in}\sin\theta + y_{in}\cos\theta,
    \end{aligned}
    \tag{1}
    $$

    where $(x_{in}, y_{in})$ and $(x_{out}, y_{out})$ are the initial and final coordinates of the vector, respectively. The hardware realization of these equations requires four multiplications, two additions/subtractions, and access to a table stored in memory for the trigonometric coefficients. The CORDIC algorithm computes 2D rotation using iterative equations employing shift and add operations. The versatility of CORDIC was enhanced by developing algorithms on the same basis to convert between binary and binary coded decimal (BCD) number representation by Daggett in 1959 [5]. These iterative methods were described using decimal radix for the design of powerful small machines by Meggitt in 1962 [6]. Subsequently, Walther in 1971 [7, 8] proposed a unified algorithm to compute rotation in circular, linear, and hyperbolic coordinate systems using the same CORDIC algorithm, embedding the coordinate system as a parameter.

    During the last 50 years of the CORDIC algorithm, a wide variety of applications have emerged. The CORDIC algorithm received increased attention after a unified approach was proposed for its implementation [7]. Thereafter, CORDIC based computing has been the choice for scientific calculator applications; the HP-2152A co-processor, the HP-9100 desktop calculator, and the HP-35 calculator are a few such devices based on the CORDIC algorithm [1, 8]. A CORDIC arithmetic processor chip was designed and implemented to perform the various functions possible in the rotation and vectoring modes of circular, linear, and hyperbolic coordinate systems [9]. Since then, the CORDIC technique has been used in many applications [10], such as single chip CORDIC processors for DSP applications [11–15], linear transformations [16–21], digital filters [17, 22–24], and matrix based signal processing algorithms [25, 26]. More recently, the advances in VLSI technology and the advent of EDA tools have extended the application of the CORDIC algorithm to the fields of biomedical signal processing [27], neural networks [28], software defined radio [29], and MIMO systems [30], to mention a few.

    Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to its potential for efficient and low cost implementation of a large class of applications. Several modifications have been proposed in the literature for the CORDIC algorithm during the last two decades to provide high performance and low cost hardware solutions for real time computation of two dimensional vector rotation and transcendental functions.

    A new type of arithmetic operation called fast rotations, or orthonormal $\mu$-rotations over a set of fixed angles, has been proposed [31]. These orthonormal $\mu$-rotations are based on the idea of CORDIC and share the property that performing the rotation requires a minimal number of shift-add operations. These fast rotation methods form a viable low cost alternative to CORDIC arithmetic for certain applications such as FIR filter banks for image processing, the generation of spherical sample rays in 3D graphics, and the computation of eigenvalue decomposition and singular value decomposition.

    We have carried out a critical study of the different architectures proposed in the literature for 2D rotational CORDIC in the circular coordinate system, to initiate work on further latency reduction or throughput improvement. In this paper, we review the architectures proposed for rotational CORDIC. Specifically, we focus on redundant unfolded architectures, employing techniques suitable to increase throughput and reduce latency.

    The rest of the paper is organized as follows. In Section 2, the basics of redundant arithmetic are presented. In Section 3, we present a review of the generalized CORDIC algorithm and the radix-2 and radix-4 CORDIC algorithms. In Section 4, general architectures employed in the literature for the implementation of the CORDIC algorithm are discussed. In Section 5, the complete taxonomy of rotational CORDIC algorithms is presented. Section 6 presents the low latency nonredundant CORDIC algorithm. Sections 7–9 provide different redundant CORDIC algorithms along with the architectures proposed in the literature for the rotational CORDIC, followed by the comparison of different methods in Section 10. Finally, conclusions are presented in Section 11.

    Figure 1: Two dimensional vector rotation (a vector of length $M_{in}$ rotated about the origin $O$ from $(x_{in}, y_{in})$ to $(x_{out}, y_{out})$).

    2. Redundant Arithmetic [32, 33]

    A nonredundant radix-$\rho$ number system has the digit set $\{0, 1, \ldots, \rho - 1\}$, and all numbers can be uniquely represented. To avoid carry propagation delay in addition, a redundant binary number system is employed. The two common redundant number systems employed in CORDIC arithmetic are the signed-digit (SD) [34–37] and the carry-save (CS) [38] number systems. In an SD number system for radix $\rho$, the numbers are represented with the digit set $\{-\beta, -\beta + 1, \ldots, -1, 0, +1, \ldots, \alpha\}$, where $\alpha \ge 1$, $\beta \ge 1$, and neither exceeds $\rho - 1$. For a symmetric digit set, $\alpha = \beta$, and each digit $s$ of the SD number system is represented as $(s^{+}, s^{-})$ by $(p, n)$ encoding such that $s^{+} - s^{-} = s$. In the radix-2 SD number system, numbers are represented with the digits $\{-1, 0, 1\}$. In the carry-save number system, numbers are represented with the digit set $\{0, 1, 2\}$. It may be observed that, in both the SD and CS number systems, each number can be represented in multiple ways. The redundancy in the SD and CS number representations limits the carry propagation from each stage to its immediate more significant bit position only. In both the SD and CS adders, all sum bits are generated with a delay of two full adders, independent of the word length. Hence, the application of redundant arithmetic can accelerate additions/subtractions due to carry-free or limited carry propagation.
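    As a small illustration of the carry-save idea, the following Python sketch reduces three operands to a sum word and a carry word using only per-bit logic, so no carry ripples across the word until one final conventional addition. The function names `carry_save_add` and `accumulate` are illustrative assumptions, not taken from the surveyed papers, and Python integers stand in for fixed-width registers.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair."""
    partial_sum = a ^ b ^ c                       # per-bit sum, no carry chain
    carry = ((a & b) | (b & c) | (a & c)) << 1    # per-bit majority, weight 2
    return partial_sum, carry


def accumulate(values):
    """Add a list of values while keeping the running total in carry-save form."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c   # the only carry-propagate addition, done once at the end


s, c = carry_save_add(13, 7, 9)
print(s, c, s + c)                      # the pair (s, c) represents 13 + 7 + 9
print(accumulate([13, 7, 9, 250, 1]))
```

    The pair `(s, c)` is one of several redundant encodings of the same value, which is the property the text attributes to both the SD and CS systems.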

    3. CORDIC Algorithm

    The CORDIC algorithm involves rotation of a vector $v$ on the XY-plane in circular, linear, and hyperbolic coordinate systems, depending on the function to be evaluated. Trajectories for the vector $v_i$ for successive CORDIC iterations are shown in Figure 2. This is an iterative convergence algorithm that performs a rotation iteratively using a series of specific incremental rotation angles selected so that each iteration is performed by shift and add operations. The norm of a vector in these coordinate systems is defined as $\sqrt{x^2 + m y^2}$, where $m \in \{1, 0, -1\}$ represents a circular, linear, or hyperbolic coordinate system, respectively. The norm preserving rotation trajectory is a circle defined by $x^2 + y^2 = 1$ in the circular coordinate system. Similarly, the norm preserving rotation trajectories in the hyperbolic and linear coordinate systems are defined by $x^2 - y^2 = 1$ and $x = 1$, respectively. The CORDIC method can be employed in two different modes, namely, the rotation mode and the vectoring mode. The rotation mode is used to perform the general rotation by a given angle $\theta$. The vectoring mode computes the unknown angle of a vector by performing a finite number of microrotations.

    3.1. Generalized CORDIC Algorithm. The generalized equations of the CORDIC algorithm for an iteration can be written as [7]

    $$
    \begin{aligned}
    x_{i+1} &= x_i - m\,\sigma_i\, y_i\, \rho^{-S_{m,i}},\\
    y_{i+1} &= \sigma_i\, x_i\, \rho^{-S_{m,i}} + y_i,\\
    z_{i+1} &= z_i - \sigma_i\, \alpha_{m,i},
    \end{aligned}
    \tag{2}
    $$

    where $\sigma_i$ represents either the clockwise or the counter clockwise direction of rotation, $\rho$ represents the radix of the number system, $m$ steers the choice of circular ($m = 1$), linear ($m = 0$), or hyperbolic ($m = -1$) coordinate systems, $S_{m,i}$ is the nondecreasing integer shift sequence, and $\alpha_{m,i}$ is the elementary rotation angle. The latter directly depends on $S_{m,i}$ through the relation

    $$
    \alpha_{m,i} = \frac{1}{\sqrt{m}} \tan^{-1}\!\left(\sqrt{m}\, \rho^{-S_{m,i}}\right). \tag{3}
    $$
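    Relation (3) takes a different closed form in each coordinate system: an arctangent for $m = 1$, the limit $\rho^{-S_{m,i}}$ for $m = 0$, and an inverse hyperbolic tangent for $m = -1$ (since $\sqrt{-1}$ turns $\tan^{-1}$ into $\tanh^{-1}$). A minimal Python sketch of this evaluation follows; the function name and argument names are illustrative assumptions.

```python
import math


def elementary_angle(m, shift, radix=2):
    """Evaluate alpha_{m,i} of relation (3) for a given shift S_{m,i}."""
    t = float(radix) ** (-shift)
    if m == 1:
        return math.atan(t)       # circular system
    if m == 0:
        return t                  # linear system: limit of (3) as m -> 0
    if m == -1:
        return math.atanh(t)      # hyperbolic system, requires t < 1
    raise ValueError("m must be 1, 0, or -1")


for m in (1, 0, -1):
    print(m, [round(elementary_angle(m, s), 6) for s in (1, 2, 3)])
```

    Note that the hyperbolic case is undefined for a zero shift with this formula, which is one reason the shift sequence depends on the coordinate system.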

    The shift sequence $S_{m,i}$ depends on the coordinate system and the radix of the number system. $S_{m,i}$ affects the convergence of the algorithm, and the number of iterations $n$ affects the accuracy of the final result. A detailed discussion on these is presented later. The value of $\sigma_i$ depends on the radix of the number system and is determined by the following equation, assuming that the vector is either in the first or in the fourth quadrant:

    $$
    \sigma_i =
    \begin{cases}
    \operatorname{sign}(z_i), & \text{for rotation mode},\\
    -\operatorname{sign}(y_i), & \text{for vectoring mode},
    \end{cases}
    \tag{4}
    $$

    where $z$ and $y$ are the steering variables in rotation and vectoring mode, respectively. The required microrotations are not perfect and increase the length of the vector. In order to maintain a constant vector length, the obtained results have to be divided by the scale factor

    $$
    K = \prod_i k_i, \qquad k_i = \sqrt{1 + m\,\sigma_i^2\, \rho^{-2 S_{m,i}}}, \tag{5}
    $$

    where $k_i$ denotes the elementary scaling factor of the $i$th iteration, and $K$ is the resultant scaling factor after $n$ iterations. The computation of the scale factor and its compensation increase the computational overhead and hardware, depending on the number system employed in the CORDIC arithmetic.

    With the appropriate initial values of x, y, and z, bothrotation and vectoring modes can be used to computecommonly used elementary functions [39] given in Table 1.
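    To make one row of Table 1 concrete, the sketch below applies the generalized iteration (2) with $m = 0$ in rotation mode, which leaves $x$ unchanged and accumulates $y_n = y_{in} + x_{in} z$, that is, a multiplication built from shifts and adds. It is a floating-point illustration only: the function name, the radix-2 shift sequence starting at $i = 0$, and the iteration count are assumptions of this sketch.

```python
def cordic_linear_rotation(x_in, y_in, z_in, n=32):
    """Linear-system (m = 0) rotation mode: returns (x_n, y_n, z_n)."""
    x, y, z = x_in, y_in, z_in
    for i in range(n):
        sigma = 1.0 if z >= 0 else -1.0   # rotation mode steers by sign(z)
        step = 2.0 ** (-i)                # a right shift by i bits in hardware
        y = y + sigma * x * step          # x itself is untouched because m = 0
        z = z - sigma * step              # elementary angle is 2**(-i) for m = 0
    return x, y, z


x_n, y_n, z_n = cordic_linear_rotation(0.75, 0.0, 1.3)
print(y_n, 0.75 * 1.3)   # y_n approximates the product x_in * z_in
```

    With this shift sequence the sketch only converges for $|z_{in}| \le 2$, so larger multipliers would need to be pre-scaled.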

    3.2. CORDIC Algorithm for Circular Coordinate System. We present in this section the detailed description of 2D plane rotation in the circular coordinate system, since this is used in many applications. The CORDIC algorithm calculates trigonometric functions, the rotation of a vector, and the angle of a vector by realizing two dimensional vector rotation in the circular coordinate system. Figure 3 shows the rotation of a vector with length $M_{in}$ by a sequence of microrotations through the elementary angles $\alpha_i$. Equation (2) represents the iterative rotation by an angle $\alpha_i$ in the circular coordinate system for $m = 1$ and is given by

    $$
    \begin{aligned}
    x_{i+1} &= x_i - \sigma_i\, y_i\, \rho^{-i},\\
    y_{i+1} &= \sigma_i\, x_i\, \rho^{-i} + y_i,\\
    z_{i+1} &= z_i - \sigma_i\, \alpha_i.
    \end{aligned}
    \tag{6}
    $$

    The values of $\alpha_i$ are chosen such that $\tan(\alpha_i) = \rho^{-i}$, and the multiplication by the tangent term is reduced to a simple shift operation. It may be observed that the norm of the vector in the $(i + 1)$th iteration is extended compared to that in the $i$th iteration, that is, $M_{i+1} = M_i \sqrt{1 + \tan^2 \alpha_i}$. The increase in magnitude of the vector in every iteration depends on the radix of the number system and the number of iterations, and is represented by the scale factor $K$. The direction of iterative rotation is determined using $z_i$ or $y_i$, depending on rotation mode or vectoring mode, respectively. The number of microrotations to be performed in both modes depends on the desired computing accuracy and can be constant for a particular computer of finite word length. The number of microrotations in turn decides the number of elementary angles. The iterative equations of the CORDIC algorithm for radix-2 and radix-4 number systems are presented in the following sections.

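    Table 1: Realization of some functions using CORDIC Algorithm.

    | $m$ | Mode | Initialization | Output |
    |---|---|---|---|
    | 1 (Circular) | Rotation | $x_0 = x_{in}$, $y_0 = y_{in}$, $z_0 = \theta$ | $x_n = K_m (x_{in}\cos\theta - y_{in}\sin\theta)$, $y_n = K_m (y_{in}\cos\theta + x_{in}\sin\theta)$, $z_n = 0$ |
    | 1 (Circular) | Rotation | $x_0 = 1/K_m$, $y_0 = 0$, $z_0 = \theta$ | $x_n = \cos\theta$, $y_n = \sin\theta$, $z_n = 0$ |
    | 1 (Circular) | Rotation | $x_0 = 1$, $y_0 = a$, $z_0 = \pi/2$ | $x_n = \sqrt{1 + a^2}$, $y_n = \sin\theta$, $z_n = 0$ |
    | 1 (Circular) | Vectoring | $x_0 = x_{in}$, $y_0 = y_{in}$, $z_0 = 0$ | $x_n = K_m \operatorname{sign}(x_0)\,(x_{in}^2 + y_{in}^2)^{1/2}$, $y_n = 0$, $z_n = \tan^{-1}(y_{in}/x_{in})$ |
    | 0 (Linear) | Rotation | $x_0 = x_{in}$, $y_0 = y_{in}$, $z_0 = z$ | $x_n = x_{in}$, $y_n = y_{in} + x_{in} z$, $z_n = 0$ |
    | 0 (Linear) | Vectoring | $x_0 = x_{in}$, $y_0 = y_{in}$, $z_0 = z$ | $x_n = x_{in}$, $y_n = 0$, $z_n = z + y_{in}/x_{in}$ |
    | −1 (Hyperbolic) | Rotation | $x_0 = x_{in}$, $y_0 = y_{in}$, $z_0 = \theta$ | $x_n = K_m (x_{in}\cosh\theta + y_{in}\sinh\theta)$, $y_n = K_m (y_{in}\cosh\theta + x_{in}\sinh\theta)$, $z_n = 0$ |
    | −1 (Hyperbolic) | Rotation | $x_0 = 1/K_m$, $y_0 = 0$, $z_0 = \theta$ | $x_n = \cosh\theta$, $y_n = \sinh\theta$, $z_n = 0$ |
    | −1 (Hyperbolic) | Rotation | $x_0 = a$, $y_0 = a$, $z_0 = \theta$ | $x_n = a e^{\theta}$, $y_n = a e^{\theta}$, $z_n = 0$ |
    | −1 (Hyperbolic) | Vectoring | $x_0 = x_{in}$, $y_0 = y_{in}$, $z_0 = \theta$ | $x_n = K_m \operatorname{sign}(x_0)\,(x_{in}^2 - y_{in}^2)^{1/2}$, $y_n = 0$, $z_n = \theta + \tanh^{-1}(y_{in}/x_{in})$ |
    | −1 (Hyperbolic) | Vectoring | $x_0 = a$, $y_0 = 1$ | $x_n = \sqrt{a^2 - 1}$, $y_n = 0$, $z_n = \coth^{-1} a$ |
    | −1 (Hyperbolic) | Vectoring | $x_0 = a + 1$, $y_0 = a - 1$ | $x_n = 2\sqrt{a}$, $y_n = 0$, $z_n = 0.5 \ln(a)$ |
    | −1 (Hyperbolic) | Vectoring | $x_0 = a + b$, $y_0 = a - b$ | $x_n = 2\sqrt{ab}$, $y_n = 0$, $z_n = 0.5 \ln(a/b)$ |

    Figure 2: Rotation in various coordinate systems. (a) Circular system, with trajectory $x^2 + y^2 = 1$. (b) Linear system, with trajectory $x = 1$. (c) Hyperbolic system, with trajectory $x^2 - y^2 = 1$ and asymptotes $y = x$ and $y = -x$. Each panel shows the successive vectors $v_0, v_1, v_2, \ldots$, with $v(i) = [x(i)\;\, y(i)]^T$.

    Figure 3: CORDIC algorithm based 2D vector rotation. (a) Rotation of the vector of length $M_{in}$ from $(x_{in}, y_{in})$ to $(x_{out}, y_{out})$. (b) Decomposition of the rotation through $\theta$ into microrotations through the elementary angles $\alpha_i$. (c) A single microrotation, which extends the vector length from $M_i$ to $M_i \sqrt{1 + \tan^2 \alpha_i}$.

    3.2.1. Rotation Mode. In rotation mode, the input angle $\theta$ is decomposed using a finite number of elementary angles [3]

    $$
    \theta = \sigma_0 \alpha_0 + \sigma_1 \alpha_1 + \cdots + \sigma_{n-1} \alpha_{n-1}, \tag{7}
    $$

    where $n$ indicates the number of microrotations, $\alpha_i$ is the elementary angle for the $i$th iteration, and $\sigma_i$ is the direction of the $i$th microrotation. In rotation mode, $z$ is the angle accumulator, and $z_0$ is initialized with the input rotation angle. The direction of the vector in every iteration must be determined to reduce the magnitude of the residual angle in the angle accumulator. Therefore, the direction of rotation in any iteration is determined using the sign of the residual angle obtained in the previous iteration. The coordinates of a vector obtained after $n$ microrotations are

    $$
    \begin{aligned}
    x_n &= K (x_{in}\cos\theta - y_{in}\sin\theta),\\
    y_n &= K (x_{in}\sin\theta + y_{in}\cos\theta),\\
    z_n &\to 0.
    \end{aligned}
    \tag{8}
    $$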
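    The decomposition (7) can be traced numerically. The Python sketch below assumes radix 2 with $\alpha_i = \tan^{-1}(2^{-i})$ and picks each $\sigma_i$ from the sign of the residual angle, as described for rotation mode; the function name and the choice of 16 iterations are illustrative assumptions.

```python
import math


def decompose_angle(theta, n=16):
    """Return the directions sigma_i and the residual angle z_n."""
    z = theta                     # angle accumulator, z_0 = theta
    sigmas = []
    for i in range(n):
        sigma = 1 if z >= 0 else -1
        z -= sigma * math.atan(2.0 ** (-i))
        sigmas.append(sigma)
    return sigmas, z


theta = math.radians(30.0)
sigmas, residual = decompose_angle(theta)
approx = sum(s * math.atan(2.0 ** (-i)) for i, s in enumerate(sigmas))
print(sigmas)
print(theta, approx, residual)    # approx + residual reproduces theta
```

    The residual printed at the end is the part of $\theta$ that the finite set of elementary angles cannot express.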

    3.2.2. Vectoring Mode. In vectoring mode, the unknown angle $\theta$ of a vector is determined by performing a finite number of microrotations satisfying the relation [3]

    $$
    -\theta = \sigma_0 \alpha_0 + \sigma_1 \alpha_1 + \cdots + \sigma_{n-1} \alpha_{n-1}. \tag{9}
    $$

    The vectoring mode rotates the input vector through a predetermined set of $n$ elementary angles so as to reduce the $y$ coordinate of the final vector to zero as closely as possible. Therefore, the direction of rotation in every iteration must be determined based on the sign of the residual $y$ coordinate obtained in the previous iteration. The coordinates obtained in vectoring mode after $n$ iterations are given by

    $$
    \begin{aligned}
    x_n &= K \sqrt{x_{in}^2 + y_{in}^2},\\
    y_n &\to 0,\\
    z_n &= \tan^{-1}\!\left(\frac{y_{in}}{x_{in}}\right).
    \end{aligned}
    \tag{10}
    $$
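    A floating-point sketch of vectoring mode follows, assuming radix 2, $z_0 = 0$, and an input vector in the first or fourth quadrant ($x_{in} > 0$). The direction is taken as $\sigma_i = -\operatorname{sign}(y_i)$, and the magnitude is recovered by dividing $x_n$ by the accumulated gain. The function name and the iteration count are assumptions made for illustration.

```python
import math


def cordic_vectoring(x_in, y_in, n=24):
    """Return (magnitude, residual y, angle) of the vector (x_in, y_in)."""
    x, y, z = x_in, y_in, 0.0
    gain = 1.0
    for i in range(n):
        sigma = -1.0 if y >= 0 else 1.0          # drive y towards zero
        t = 2.0 ** (-i)
        x, y = x - sigma * t * y, y + sigma * t * x
        z -= sigma * math.atan(t)                # accumulates the vector angle
        gain *= math.sqrt(1.0 + t * t)           # elementary scale factor k_i
    return x / gain, y, z


magnitude, y_residual, angle = cordic_vectoring(3.0, 4.0)
print(magnitude, y_residual, angle, math.atan2(4.0, 3.0))
```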

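    3.2.3. Radix-2 CORDIC Algorithm. The iteration equations of the radix-2 CORDIC algorithm [7] in rotation mode of the circular coordinate system at the $(i + 1)$th step are obtained by using $\rho = 2$ in (6) and are given by

    $$
    \begin{aligned}
    x_{i+1} &= x_i - \sigma_i\, 2^{-i}\, y_i,\\
    y_{i+1} &= \sigma_i\, 2^{-i}\, x_i + y_i,\\
    z_{i+1} &= z_i - \sigma_i\, \alpha_i,
    \end{aligned}
    \tag{11}
    $$

    where $\alpha_i = \tan^{-1}(2^{-i})$ and

    $$
    \sigma_i =
    \begin{cases}
    -1, & \text{for } z_i < 0,\\
    1, & \text{otherwise}.
    \end{cases}
    \tag{12}
    $$

    In order to maintain a constant vector length, the obtained results have to be compensated for the scale factor $K$ given by

    $$
    K = \prod_{i=0}^{n-1} \sqrt{1 + 2^{-2i}}. \tag{13}
    $$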
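    The radix-2 iteration (11) and (12) maps directly onto integer shifts and additions. The sketch below is one possible fixed-point rendering in Python: the word format `FRAC_BITS`, the iteration count `N_ITER`, and the function name are assumptions of this example, and the scale factor (13) is handled by starting from $x_0 = 1/K$ rather than by a final multiplication.

```python
import math

FRAC_BITS = 28   # assumed format: real value = integer / 2**FRAC_BITS
N_ITER = 24      # assumed number of microrotations

# Elementary angles tan^-1(2^-i), quantized to the fixed-point format.
ANGLES = [round(math.atan(2.0 ** (-i)) * (1 << FRAC_BITS)) for i in range(N_ITER)]

# Scale factor of equation (13).
K = 1.0
for i in range(N_ITER):
    K *= math.sqrt(1.0 + 2.0 ** (-2 * i))
INV_K = round((1 << FRAC_BITS) / K)


def cordic_sin_cos(theta):
    """Return (cos(theta), sin(theta)) for theta within the convergence range."""
    x, y = INV_K, 0                          # pre-scaled start vector (1/K, 0)
    z = round(theta * (1 << FRAC_BITS))
    for i in range(N_ITER):
        if z < 0:                            # sigma_i = -1
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
        else:                                # sigma_i = +1
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
    scale = float(1 << FRAC_BITS)
    return x / scale, y / scale


c, s = cordic_sin_cos(math.radians(30.0))
print(K, c, s)
```

    Python's `>>` on negative integers is an arithmetic shift, which matches the sign-extending shifter a hardware datapath would use.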

    For radix-2 CORDIC, $K \approx 1.65$. The major drawback of the conventional CORDIC algorithm is its relatively high latency and low throughput, due to the sequential nature of the iteration process with carry propagate addition and variable shifting in every iteration. To overcome these drawbacks, pipelined implementations have been proposed [40, 41]. However, the carry propagate addition remained a bottleneck for further throughput improvement. Two major methodologies have been employed in the literature to increase the speed of CORDIC implementation. One reduces the delay of each iteration by adopting redundant arithmetic in radix-2 CORDIC [42] to eliminate carry propagate addition. The other technique involves reducing the number of iterations by increasing the radix employed for the implementation of the CORDIC algorithm [43].

    The redundant radix-2 CORDIC [42] is obtained by employing redundant arithmetic. The directions of rotation $\sigma_i$ are selected from the set $\{-1, 0, 1\}$, in contrast to $\{-1, 1\}$ employed in the conventional CORDIC. These $\sigma_i$ values are computed by evaluating a few most significant digits of $z_i$, since the determination of the sign of a redundant number takes a long time. This redundant CORDIC algorithm performs no rotation extension for $\sigma_i = 0$, which affects the value of the scaling factor $K$ and makes it data-dependent. Therefore, $K$ has to be calculated for each microrotation. This calculation and correction increase the computation time and hardware.

    3.2.4. Redundant Radix-4 CORDIC Algorithm. As mentioned above, the speed of CORDIC algorithm implementation can be improved by reducing the number of iterations. The iteration equations for the radix-4 CORDIC algorithm in rotation mode are derived at the $(i + 1)$th step by using $\rho = 4$ in (6) and are given by

    $$
    \begin{aligned}
    x_{i+1} &= x_i - \sigma_i\, 4^{-i}\, y_i,\\
    y_{i+1} &= \sigma_i\, 4^{-i}\, x_i + y_i,\\
    w_{i+1} &= w_i - \tan^{-1}\!\left(\sigma_i\, 4^{-i}\right),
    \end{aligned}
    \tag{14}
    $$

    where $\sigma_i \in \{-2, -1, 0, 1, 2\}$. The final $x$ and $y$ coordinates are scaled by

    $$
    K = \prod_{i \ge 0} k_i = \prod_{i \ge 0} \left(1 + \sigma_i^2\, 4^{-2i}\right)^{1/2}. \tag{15}
    $$

    Here, the scale factor $K$ depends on the values of $\sigma_i$ and hence has to be computed in every iteration. The range of $K$ is $(1, 2.52)$ for radix-4 CORDIC. In this CORDIC, the direction of rotation is computed based on the estimated value of $w_i$ [43]. The $w$ path involves the computation of the estimated $w_i$ and the evaluation of a selection function to determine $\sigma_i$, resulting in an increase of the iteration delay compared to that of radix-2. However, the number of iterations required for radix-2 CORDIC can be halved by employing the radix-4 CORDIC algorithm.
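    The data dependence of the radix-4 scale factor and the quoted range of $K$ can be checked numerically from (15). The sketch below evaluates $K$ for the all-zero and all-$\pm 2$ digit sequences, which bound the range, and for one arbitrary mixed sequence; the function name and the sample sequences are illustrative assumptions.

```python
import math


def radix4_scale_factor(sigmas):
    """Evaluate equation (15) for a given sequence of sigma_i digits."""
    k = 1.0
    for i, sigma in enumerate(sigmas):
        if sigma not in (-2, -1, 0, 1, 2):
            raise ValueError("sigma_i must lie in {-2, -1, 0, 1, 2}")
        k *= math.sqrt(1.0 + sigma * sigma * 4.0 ** (-2 * i))
    return k


n = 8
k_min = radix4_scale_factor([0] * n)                      # no rotation extension
k_max = radix4_scale_factor([2] * n)                      # largest extension
k_mixed = radix4_scale_factor([1, -2, 0, 2, -1, 1, 0, 2]) # arbitrary example
print(k_min, k_mixed, k_max)
```

    Because `k_mixed` changes with the digit sequence, a design based on (15) cannot use a single precomputed constant the way nonredundant radix-2 CORDIC can.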

    Scale factor computation and compensation, together with the convergence and accuracy aspects of the CORDIC algorithm, are presented in the following sections.

    3.2.5. Scale Factor Computation. The CORDIC rotation steps change the length of the vector in every iteration, resulting in the distortion of the norm of the vector shown in Figure 3; this distortion is given by (5). In nonredundant radix-2 CORDIC, $K$ is constant since $\sigma_i = \pm 1$. However, $K$ is no longer constant for nonredundant radices higher than 2 or for redundant number systems. For radix-2, the scale factor needs to be computed for only $n/2$ iterations, as $k_i = \sqrt{1 + 2^{-2i}}$ becomes unity within $n$-bit precision for $i > n/2 + 1$. In redundant radix-4 CORDIC [43], the scale factor (15) is not constant. In addition, it is sufficient to compute $K$ for $n/4$ iterations, as $k_i = \sqrt{1 + 4^{-2i}}$ becomes unity thereafter.
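    The claim that only the first $n/2$ (radix-2) or $n/4$ (radix-4) elementary factors matter can be checked by counting the iterations whose $k_i$ differs from 1 by at least one unit in the last place. That threshold, $2^{-n}$, is an assumption of this sketch, as are the function and parameter names.

```python
import math


def significant_stages(n_bits, radix=2):
    """Count iterations whose k_i exceeds 1 by at least 2**(-n_bits)."""
    lsb = 2.0 ** (-n_bits)
    count = 0
    for i in range(n_bits):
        k_i = math.sqrt(1.0 + float(radix) ** (-2 * i))
        if k_i - 1.0 >= lsb:
            count += 1
    return count


for n_bits in (16, 32):
    print(n_bits, significant_stages(n_bits, 2), significant_stages(n_bits, 4))
```

    Since $\sqrt{1 + \epsilon} \approx 1 + \epsilon/2$ for small $\epsilon$, the factor for radix $\rho$ drops below the threshold roughly once $\rho^{-2i}/2 < 2^{-n}$.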

    3.2.6. Scale Factor Compensation. The scale factor compensation technique involves scaling of the final coordinates $(x_n, y_n)$ with $1/K$. The most direct method for the scaling operation is the multiplication of $(x_n, y_n)$ by $1/K$ using the CORDIC module in linear mode [7]. However, this method requires $n$ shift and add operations, which are comparable to the computational effort of the CORDIC algorithm itself. Since $K^{-1}$ is constant for radix-2, the computational overhead can be reduced by using a canonical signed digit (CSD) recoded multiplier. On average, the number of nonzero digits can be reduced to $n/3$ using the CSD representation [32], and hence the effort for multiplication using a CSD recoded multiplier is approximately one third of that required using a conventional multiplier. Further, scaling can also be implemented using a Wallace tree by fully parallelizing the multiplication, which is preferred for applications aiming for low latency at the expense of more silicon area [44].
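    The effect of CSD recoding on the constant $1/K$ can be illustrated with a short Python sketch. It recodes the quantized constant into digits from $\{-1, 0, 1\}$ with no two adjacent nonzero digits, then multiplies by one shift-and-add (or subtract) per nonzero digit. The 16-bit quantization, the helper names, and the sample operand are assumptions of this example rather than details from the cited work.

```python
def to_csd(value):
    """Canonical signed digit recoding of a non-negative integer, LSB first."""
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # +1 when value % 4 == 1, -1 when value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def multiply_by_csd(x, digits, frac_bits):
    """Multiply x by the recoded constant using shifts and adds only."""
    acc = 0
    for position, digit in enumerate(digits):
        if digit:
            acc += digit * (x << position)
    return acc >> frac_bits


FRAC_BITS = 16
inv_k = round(0.6072529350088813 * (1 << FRAC_BITS))   # 1/K for radix-2 CORDIC
csd = to_csd(inv_k)
print(bin(inv_k).count("1"), sum(1 for d in csd if d))  # nonzero bits vs CSD digits
print(multiply_by_csd(1000, csd, FRAC_BITS))            # about 1000 / K
```

    The count of nonzero CSD digits sets the number of adders or subtractors needed, which is where the saving over a plain binary multiplier comes from.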

    Scaling may be done by extending the sequence of CORDIC iterations [9, 16, 17] to avoid the additional hardware required in the direct method. A comparison of several scale factor compensation techniques proposed in the literature, along with two additional methods, the additive and multiplicative decomposition approaches, for radix-2 CORDIC is presented in [44]. It is observed from the presented results that the additive technique offers a low latency solution and the multiplicative technique offers an area economical solution for applications of CORDIC employing array and pipelined architectures. An algorithm is proposed in [45] to perform scale factor compensation in parallel with the CORDIC rotation using nonredundant and redundant arithmetic, thereby eliminating the final multiplication [3] or additional scaling iterations [9, 16, 17].

    3.2.7. Convergence. The CORDIC algorithm involves the rotation of a vector to reduce the $z$ or $y$ coordinate of the final vector as closely as possible to zero for rotation or vectoring mode, respectively. The maximum value of the rotation angle by which the vector can be rotated depends on the shift sequence [7]. The expected results of the CORDIC algorithm can be obtained if the $z$ or $y$ coordinate is driven sufficiently close to zero. In addition, it can be guaranteed to drive $z$ or $y$ to zero if the initial values of a vector $(x_{in}, y_{in}, z_{in})$ or $(x_{in}, y_{in})$ lie within the permissible range. These ranges define the domain of convergence of the CORDIC algorithm.

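    For $n$-bit precision, the given rotation angle $\theta$ can be decomposed as

    $$
    \theta = \sum_{i=0}^{n-1} \sigma_i \alpha_i + \delta, \tag{16}
    $$

    where $\delta$ is an angle approximation error such that $|\delta| < \alpha_{n-1}$ and is negligible in practical computation [7]. This angle approximation error in rotation and vectoring mode can be computed as

    $$
    \begin{aligned}
    \delta(\text{rotation}) &= \tan^{-1}(z_n),\\
    \delta(\text{vectoring}) &= \tan^{-1}\!\left(\frac{y_n}{x_n}\right).
    \end{aligned}
    \tag{17}
    $$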

    The magnitude of the elementary angle for the given shift sequence may be predetermined using

    $$
    \alpha_i = \tan^{-1} \rho^{-S_{m,i}}, \tag{18}
    $$

    where $\rho$ is the radix of the number system. The direction of rotation $\sigma_i$ must be selected to drive $z$ or $y$ towards zero for rotation or vectoring, respectively. The range of $\sigma_i$ depends on the radix and digit set being used for the number system. Since the number of iterations and the elementary angles to be traversed by the vector during these iterations are predetermined, the range of $\theta$ for which the CORDIC algorithm can be used, called the domain of convergence, is given by [7]

    $$
    |\theta| \le \sum_{i=0}^{n-1} \alpha_i + \alpha_{n-1}. \tag{19}
    $$

    The convergence range of the CORDIC algorithm can be defined for rotation mode as

    $$
    |z_{in}| \le \sum_{i=0}^{n-1} \alpha_i + \alpha_{n-1} \tag{20}
    $$

    and for vectoring mode as

    $$
    \left|\tan^{-1}\!\left(\frac{y_{in}}{x_{in}}\right)\right| \le \sum_{i=0}^{n-1} \alpha_i + \alpha_{n-1}. \tag{21}
    $$

    The expected final results cannot be obtained if the given initial values $x_{in}$, $y_{in}$, and $z_{in}$ do not satisfy these convergence conditions. The range of convergence of the CORDIC algorithm can be extended from $\pm\pi/2$ to $\pm\pi$ using preprocessing techniques [7, 27, 46].
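    For radix 2 with $\alpha_i = \tan^{-1}(2^{-i})$, the bound in (19) evaluates to roughly 1.74 rad, which covers $\pm\pi/2$. The Python sketch below computes that bound and shows one simple preprocessing step that folds an angle in $[-\pi, \pi]$ into $[-\pi/2, \pi/2]$ by an exact quarter-turn of the input vector. This folding is a commonly used choice assumed here for illustration; it is not necessarily the technique of the cited references, and the function names are hypothetical.

```python
import math


def convergence_limit(n):
    """Right-hand side of (19) for radix 2: sum of alpha_i plus alpha_{n-1}."""
    alphas = [math.atan(2.0 ** (-i)) for i in range(n)]
    return sum(alphas) + alphas[-1]


def fold_into_range(x, y, theta):
    """Pre-rotate by +/- pi/2 so the remaining angle lies in [-pi/2, pi/2].

    Assumes -pi <= theta <= pi; the quarter-turn needs no multiplications.
    """
    if theta > math.pi / 2:
        return -y, x, theta - math.pi / 2     # exact rotation by +pi/2
    if theta < -math.pi / 2:
        return y, -x, theta + math.pi / 2     # exact rotation by -pi/2
    return x, y, theta


print(convergence_limit(16), math.pi / 2)
print(fold_into_range(1.0, 0.0, 2.0))
```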

    3.3. Accuracy. The accuracy of the CORDIC algorithm is affected by two primary sources of error, namely, angle approximation and rounding error. The error bounds for these two sources of error are derived by performing a detailed numerical analysis of the CORDIC algorithm [47]. The approximation error and the rounding error so derived are combined to yield the overall quantization error in the CORDIC computation. The overall quantization error can be kept within the required range by providing an additional $\log_2 n$ guard bits in the implementation of the CORDIC algorithm [7].

    3.3.1. Angle Approximation Error. Theoretically, the rotation angle $\theta$ is decomposed into an infinite number of elementary angles, as shown in Figure 3. For practical implementation, a finite number of microrotations $n$ is considered. Hence, the input rotation angle $\theta$ can only be approximated, resulting in an angle approximation error

    $$
    |\delta| < \alpha_{n-1}, \tag{22}
    $$

    where $\delta$ is the residual angle after $n$ microrotations and $\alpha_{n-1}$ is the last elementary angle. Hence, the accuracy of the output of the $n$th iteration is principally limited by the magnitude of the last rotation angle.

    Figure 4: Taxonomy of CORDIC architectures. CORDIC architectures divide into folded (bit serial, word serial) and unfolded (pipelined, parallel).

    3.3.2. Rounding Error. The second type of error, called rounding error, is due to the truncation of the CORDIC internal variables by the finite length of the storage elements. In addition, scale factor compensation also contributes to this error. In a binary code, the truncation of intermediate results after every iteration introduces a maximum rounding error of $\log_2 n$ bits. To achieve a final accuracy of 1 bit in $n$ bits, an additional $\log_2 n$ guard bits must be provided in the implementation of this algorithm [7].
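    As a quick sizing aid, the guard-bit rule stated above can be written as a one-line calculation. The sketch rounds $\log_2 n$ up to a whole number of bits, which is an assumption of this example, and the function name is hypothetical.

```python
import math


def internal_word_length(n_bits, n_iterations):
    """Datapath width: output precision plus ceil(log2(n)) guard bits."""
    guard_bits = math.ceil(math.log2(n_iterations))
    return n_bits + guard_bits


for n in (16, 24, 32):
    print(n, internal_word_length(n, n))
```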

    4. CORDIC Architectures

    In this section, a few architectures for mapping the CORDIC algorithm into hardware are presented. In general, the architectures can be broadly classified as folded and unfolded, as shown in Figure 4, based upon the realization of the three iterative equations (6). Folded architectures are obtained by duplicating each of the difference equations of the CORDIC algorithm into hardware and time multiplexing all the iterations into a single functional unit. Folding provides a means for trading area for time in signal processing architectures. The folded architectures can be categorized into bit-serial and word-serial architectures, depending on whether the functional unit implements the logic for one bit or one word of each iteration of the CORDIC algorithm.

    The CORDIC algorithm has traditionally been implemented using a bit serial architecture with all iterations executed in the same hardware [3]. This slows down the computational device and hence is not suitable for high speed implementation. The word serial architecture [7, 48] is an iterative CORDIC architecture obtained by realizing the iteration equations (6). In this architecture, the shifters are modified in each iteration to cause the desired shift for the iteration. The appropriate elementary angles $\alpha_i$ are accessed from a lookup table. The most dominating speed factors during the iterations of the word serial architecture are carry/borrow propagate addition/subtraction and variable shifting operations, rendering the conventional CORDIC [7] implementation slow for high speed applications. These drawbacks were overcome by unfolding the iteration process [41, 48], so that each of the processing elements always performs the same iteration, as shown in Figure 5. The main advantage of the unfolded pipelined architecture compared to the folded architecture is high throughput, due to hardwired shifts rather than time and area consuming barrel shifters, and the elimination of the ROM. It may be noted that the pipelined architecture offers throughput improvement by a factor of $n$ for $n$-bit precision at the expense of increasing the hardware by a factor less than $n$.
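The word serial data flow described above can be sketched in software. This is only an illustration under stated assumptions: a 16-bit fixed-point fraction (`FRAC_BITS` is a hypothetical parameter), the conventional radix-2 recurrences with $\sigma_i \in \{-1, 1\}$, a Python list standing in for the lookup table of elementary angles, and a floating-point division standing in for scale factor compensation.

```python
import math

FRAC_BITS = 16                      # assumed fixed-point fraction width
ONE = 1 << FRAC_BITS

def to_fixed(v):
    return int(round(v * ONE))

def cordic_rotate(x_in, y_in, theta, n=16):
    # Lookup table of elementary angles alpha_i = atan(2^-i), as in a ROM.
    alpha = [to_fixed(math.atan(2.0 ** -i)) for i in range(n)]
    # Constant scale factor of n microrotations with sigma_i in {-1, +1}.
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n))
    x, y, z = to_fixed(x_in), to_fixed(y_in), to_fixed(theta)
    for i in range(n):              # one functional unit reused n times
        sigma = 1 if z >= 0 else -1
        x, y, z = (x - sigma * (y >> i),   # variable shift (barrel shifter)
                   y + sigma * (x >> i),
                   z - sigma * alpha[i])
    return x / ONE / gain, y / ONE / gain

c, s = cordic_rotate(1.0, 0.0, 0.6)
print(c, s, math.cos(0.6), math.sin(0.6))
```

In an unfolded pipeline the loop body would instead be replicated $n$ times, so each stage has a fixed shift `i` that can be hardwired and its own constant `alpha[i]`, which is why the barrel shifter and the ROM disappear.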

    5. CORDIC Taxonomy

    The implementation of the CORDIC algorithm has evolved over the years to suit varying requirements of applications, from conventional nonredundant to redundant nature. The unfolded implementation with redundant arithmetic initiated the efforts to address high latency in conventional CORDIC. Subsequently, several modifications have been proposed for the redundant CORDIC algorithm to achieve reduction in iteration delay, latency, area, and power. The evolution of the unfolded rotational CORDIC algorithms is shown in Figure 6. As this taxonomy is fairly rich, the remainder of the review presents the taxonomy in a top-down approach.

    CORDIC is broadly classified as nonredundant CORDIC and redundant CORDIC based on the number system being employed. The major drawback of the conventional CORDIC algorithm [3, 7] was low throughput and high latency due to the carry propagate adder used for the implementation of the iterative equations. This contradicted the simplicity and novelty of the CORDIC algorithm, attracting the attention of several researchers to devise methods to increase the speed of execution. The obvious solution is to reduce the time for each iteration, or the number of iterations, or both. Redundant arithmetic has been employed to reduce the time for each iteration of the conventional CORDIC. In the following sections, we analyze and present the features of different pipelined and nonpipelined unfolded implementations of the rotational CORDIC.

    6. Low Latency Nonredundant Radix-2 CORDIC [49]

    A significant improvement for the conventional rotational CORDIC algorithm in the circular coordinate system is proposed in [50], employing a linear approximation to the rotation when the remaining angle is small. This remaining angle is chosen such that a first order Taylor series approximation of $\sin r$ and $\cos r$, calling $r$ the remaining angle, may be employed as $\sin r \approx r$ and $\cos r \approx 1$. The architecture for the implementation of this algorithm using nonredundant arithmetic is presented in [49]. The iteration equations of this algorithm for the first $n/2 + 1$ microrotations are the same as those for the conventional CORDIC algorithm (11). The $\sigma_i$ values for the first $n/3$ iterations are determined iteratively using the sign of the angle accumulator $z_i$. The rotation directions from iteration $n/3 + 1$ onwards can be generated in parallel, since the conventional circular arc tangent radix values approach the radix-2 coefficients progressively for increasing values of the CORDIC iteration index, as evident from the expression

    $$\lim_{k \to \infty} \frac{\tan\left(2^{-k}\right)}{2^{-k}} = 1. \tag{23}$$

    For the range of iterations $(n/3 + 1) \le i \le (n/2 + 1)$, all $\sigma_i$ values are determined from the recoded representation of the remaining angle $z_{(n/3+1)}$. These $\sigma_i$ values are used to obtain $z_{(n/2+1)}$ from $z_{(n/3+1)}$. For $i > (n/2 + 1)$, the CORDIC microrotations are replaced by a single rotation using the remaining angle $z_{(n/2+1)}$. Thus, (11) is modified as

    $$\begin{aligned} x_f &= x_{(n/2+2)} = k_{(n/2+1)}\left(x_{(n/2+1)} - r\,y_{(n/2+1)}\right), \\ y_f &= y_{(n/2+2)} = k_{(n/2+1)}\left(r\,x_{(n/2+1)} + y_{(n/2+1)}\right), \end{aligned} \tag{24}$$

    where $r = z_{(n/2+1)}$, $k_{(n/2+1)}$ is the scale factor in the $(n/2+1)$th iteration, and $(x_f, y_f)$ are the scaled final coordinates.
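A floating-point sketch of this idea follows. It is an illustration only, assuming that the first $n/2 + 1$ microrotations are all generated sequentially (the parallel recoding of directions is not modelled) and that the scale factor compensation is applied as one multiplication by the reciprocal of the accumulated gain. The function name `low_latency_rotate` is hypothetical.

```python
import math

def low_latency_rotate(theta, n=16):
    m = n // 2 + 1                       # microrotations done conventionally
    x, y, z = 1.0, 0.0, theta
    gain = 1.0
    for i in range(m):
        sigma = 1 if z >= 0 else -1
        x, y = x - sigma * y * 2.0 ** -i, y + sigma * x * 2.0 ** -i
        z -= sigma * math.atan(2.0 ** -i)
        gain *= math.sqrt(1.0 + 4.0 ** -i)
    r = z                                # remaining angle, at most atan(2^-(m-1))
    k = 1.0 / gain                       # assumed form of the compensation
    x_f = k * (x - r * y)                # single rotation: cos r ~ 1, sin r ~ r
    y_f = k * (r * x + y)
    return x_f, y_f

print(low_latency_rotate(0.6), (math.cos(0.6), math.sin(0.6)))
```

With $n = 16$ the remaining angle is at most about $2^{-8}$, so the neglected term $r^2/2$ of $\cos r$ is about $2^{-17}$, which is below the target precision.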

    Scale Factor. The low latency nonredundant radix-2 CORDIC algorithm achieves a constant scale factor, since $\sigma_i \in \{-1, 1\}$, and performs the scale factor compensation concurrently with the computation of the x and y coordinates, using two multipliers in parallel [49]. This is in contrast to the two series multiplications required in the algorithm of [50].

    7. Constant Scale Factor Redundant Radix-2 CORDIC

    Redundant radix-2 CORDIC methods can be classified as variable and constant scale factor methods, based on the dependence of the scale factor on the input angle. In redundant radix-2 CORDIC [42], $\sigma_i \in \{-1, 0, 1\}$ and hence the scale factor $K$ is data-dependent. Therefore, $K$ has to be calculated for each microrotation. This calculation and correction increase the computation time and hardware. Several redundant CORDIC algorithms with constant scale factor are available in the literature [51–53] to address the data dependency of the scale factor, as shown in Figure 7. In these methods, the iterative rotations of a point around the origin on the XY-plane are considered (see Figure 1). The direction of each rotation depends on the sign of the steering variable $z_i$, which represents the remaining angle of rotation. Since the computation of the sign of a redundant number requires more time, an estimated value of $z_i$ ($\hat{z}_i$) is used to determine the direction of rotation. The estimated value is computed based on the value of the three most significant digits of $z_i$. A constant scale factor is achieved by restricting $\sigma_i$ to the set $\{-1, 1\}$, thus facilitating a faster implementation. The constant scale factor methods can be classified, based on the arithmetic employed, as redundant radix-2 CORDIC with signed digit arithmetic and with carry save arithmetic (see Figure 7).

    Figure 5: Unfolded pipelined CORDIC architecture.

    Scale Factor. The scale factor need not be computed for the implementation of any of the constant scale factor techniques discussed in this section. In these methods, no specific scale factor compensation technique is considered. It may be noted that a specific compensation technique can be chosen depending on the application.

    7.1. Constant Scale Factor Redundant CORDIC Using SD Arithmetic. The redundant radix-2 CORDIC using SD arithmetic can be further classified based on the technique employed to achieve a constant scale factor (see Figure 7). These methods are implemented using the basic CORDIC iteration recurrences (11) with the necessary transformations.

    7.1.1. Double Rotation Method [51]. The double rotation method performs two rotation-extensions for each elementary angle during the first $n/2$ iterations for $n$-bit precision, to achieve a constant scale factor independent of the operand. One rotation extension is performed for every elementary angle for iterations greater than $n/2$. A negative rotation is performed by two negative subrotations, and a positive rotation by two positive subrotations. A nonrotation is performed by one negative and one positive subrotation. Hence, 50% additional iterations are required compared to the redundant CORDIC [42].
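Why the three cases give the same scale factor can be seen with exact rational arithmetic. The sketch below is illustrative and not taken from [51]: it assumes a generic rotation-extension with tangent `t` (here $t = 2^{-3}$, chosen arbitrarily) and checks that the squared gain of a double rotation is $(1 + t^2)^2$ for $\sigma \in \{-1, 0, +1\}$.

```python
from fractions import Fraction

def subrotation(x, y, direction, t):
    """One rotation-extension by atan(t); direction is +1 or -1."""
    return x - direction * t * y, y + direction * t * x

def double_rotation(x, y, sigma, t):
    # sigma = +1: two positive subrotations, sigma = -1: two negative,
    # sigma = 0: one positive followed by one negative subrotation.
    first, second = {+1: (+1, +1), -1: (-1, -1), 0: (+1, -1)}[sigma]
    x, y = subrotation(x, y, first, t)
    return subrotation(x, y, second, t)

t = Fraction(1, 8)
x0, y0 = Fraction(3), Fraction(4)
for sigma in (-1, 0, +1):
    x, y = double_rotation(x0, y0, sigma, t)
    gain_sq = (x * x + y * y) / (x0 * x0 + y0 * y0)
    print(sigma, gain_sq)
```

The nonrotation leaves the direction of the vector unchanged but still extends it by $1 + t^2$, which is what makes the overall scale factor independent of the digit values.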

    7.1.2. Correcting Rotation Method [51]. This is another method proposed to achieve a constant scale factor for the computation of sine and cosine functions. This method avoids the rotation corresponding to $\sigma_i = 0$ and performs one rotation extension in every iteration depending on $z_i$. Further, extra rotation extensions are performed at fixed intervals to correct the error introduced by avoiding $\sigma_i = 0$ and to assure convergence. If $b$ fractional bits are used to estimate $z_i$, the interval between correcting iterations should be less than or equal to $(b - 2)$ [54]. This method also requires 50% additional iterations, if three or four most significant digits are used for sign estimation. The increase in latency of rotational CORDIC due to the double rotation and correcting iteration methods is reduced using the branching algorithm [52].

    7.1.3. Branching Method [52]. This method implements the CORDIC algorithm using SD arithmetic, restricting the direction of rotations $\sigma_i$ to $\pm 1$, without the need for extra rotations. This requires two modules in parallel to perform two conventional CORDIC iterations, such that the correct result is retained at the end of each iteration. The two modules perform the rotation in the same direction if the sign of the corresponding $z_i$ can be determined. Otherwise, branching is performed by making one CORDIC module ($z^+$) perform the rotation with $\sigma_i = +1$ and the other module ($z^-$) perform the rotation with $\sigma_i = -1$. The direction of the subsequent rotation is decided by the sign of the $z_i$ module whose value is smaller. In every iteration $i$, the angle accumulator ($z^+$ or $z^-$) computes the remaining angle ($z_i^+$ or $z_i^-$) to determine the direction of rotation for the next iteration. The direction of rotation is determined by examining a window of three digits of $z_i^+$ or $z_i^-$.

    Figure 6: Taxonomy of CORDIC algorithms.

    Figure 7: Taxonomy of constant scale factor redundant radix-2 CORDIC methods.

    The disadvantage of the branching method is the necessity of performing two conventional CORDIC iterations in parallel, which requires almost twofold effort in terms of implementation complexity. In addition, one of the modules is not utilized when branching does not take place. However, this method offers a faster implementation than the double and correcting rotation methods [51], since it does not require additional iterations to achieve a constant scale factor.

    7.1.4. Double Step Branching Method [53]. The performance of the branching algorithm is enhanced by the double step branching method to improve the utilization of hardware. This method involves determining two distinct $\sigma_i$ values in each step with some additional hardware compared to the branching method, where the two modules do different computations only when branching takes place. The double step branching method determines the two directions of rotation by examining the six most significant digits to do a double step. These six digits are divided into two subgroups of three digits each, and each subgroup is handled in parallel to generate the required $\sigma_i$ using zeroing modules (z path). Although the double stepping method introduces a small hardware overhead compared to the branching method, it is better than the latter since it increases the utilization of the x/y rotator modules.

    7.2. Constant Scale Factor Redundant CORDIC Using CS Arithmetic. It is worth discussing here one more classification related to constant scale factor redundant radix-2 CORDIC (see Figure 7). The implementation of redundant CORDIC with constant scale factor using signed digit arithmetic results in an increase in the chip area [51–53] and latency [51] by at least 50% compared to redundant radix-2 CORDIC [42]. The low latency CORDIC algorithm [55] and the differential CORDIC algorithm [56, 57] with constant scale factor using CS arithmetic have been proposed to reduce this overhead, the details of which are discussed below.

    7.2.1. Low Latency Redundant CORDIC [55]. This algorithm is proposed to reduce the latency of redundant CORDIC [51] by subdividing the $n$ iterations into different groups and using different techniques for each of these groups. For all the iterations, if $\sigma_i = \pm 1$, the conventional iteration equations (11) are used. This method avoids $\sigma_i = 0$ for iterations $0 \le i \le (n - 3)/4$ and employs the correcting rotation method [51]. For iterations $(n - 3)/4 < i \le (n + 1)/2$, $\sigma_i = 0$ is considered a valid choice. Since for this group of iterations $k_i = \sqrt{1 + 2^{-2i}} = 1 + 2^{-2i-1}$ holds within $n$-bit precision, the vector is not rotated for $\sigma_i = 0$. However, the length of the vector is increased by the scale factor for that iteration, as the final coordinates are scaled assuming a constant scale factor. For the iterations $i > (n + 1)/2$, no correcting factor is required as the scale factor becomes unity.
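The two thresholds quoted above can be checked numerically. This is a small sketch, assuming that "within $n$-bit precision" means an absolute error below $2^{-n}$; the function name `check_groups` is illustrative.

```python
import math

def check_groups(n):
    eps = 2.0 ** -n                      # assumed meaning of n-bit precision
    # k_i = sqrt(1 + 2^-2i) ~ 1 + 2^-(2i+1) for i > (n - 3) / 4
    linear_ok = all(
        abs(math.sqrt(1 + 4.0 ** -i) - (1 + 2.0 ** (-2 * i - 1))) < eps
        for i in range(n) if i > (n - 3) / 4)
    # k_i ~ 1 for i > (n + 1) / 2
    unity_ok = all(
        abs(math.sqrt(1 + 4.0 ** -i) - 1) < eps
        for i in range(n) if i > (n + 1) / 2)
    return linear_ok, unity_ok

for n in (16, 24, 32):
    print(n, check_groups(n))
```

The first threshold follows from the next Taylor term of the square root, which is about $2^{-4i-3}$ and drops below $2^{-n}$ once $i > (n - 3)/4$.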

    7.2.2. DCORDIC [56]. In the sign estimation methods [51–53], half of the computational effort in the x/y/z data paths of rotational CORDIC is required to allow for the correction of possible errors, as the sign estimation is not entirely perfect. This problem is reduced by the high speed bit-level pipelining technique with CS arithmetic proposed in [57]. This algorithm involves the transformation of the conventional CORDIC iteration equations (11) into partially fixed iteration equations, given by

    $$\begin{aligned} |z_{i+1}| &= \bigl||z_i| - \alpha_i\bigr|, \\ x_{i+1} &= x_i - \operatorname{sign}(z_i)\,2^{-i} y_i, \\ y_{i+1} &= \operatorname{sign}(z_i)\,2^{-i} x_i + y_i. \end{aligned} \tag{25}$$

    It is clear from these expressions that the computation of x and y requires the actual sign of $z_i$, while the angle accumulator requires only the absolute value of $z_i$. The actual sign of $z_i$ ($\sigma_i$) can be determined by taking into account the initial sign of $z_0$ and providing information about sign changes during the absolute value computation of $z_i$. Similarly, all $\sigma_i$ values are computed recursively. Later, this technique was implemented with SD arithmetic and proposed as the Differential CORDIC (DCORDIC) algorithm [56]. Since the sign calculation of the steering variable ($z_i$) during the absolute value computation takes a long time, a most significant digit first absolute value technique is employed. This technique replaces the word level sign dependence by a bit level dependence, reducing the overall computation time. A bit level pipelined architecture is proposed to implement these transformed iteration sequences, thus allowing high operational speed.

    Figure 8: Classification of CORDIC algorithms based on the radix (redundant CORDIC: radix 2 (1987), radix 2-4 (1996), radix 4 (1993, 1997)).
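The sign recovery described above can be illustrated at word level. The sketch below assumes floating-point arithmetic and only tracks the z path; it does not model the most significant digit first computation or the bit level pipelining. The function names are illustrative.

```python
import math

def conventional_directions(theta, n):
    z, out = theta, []
    for i in range(n):
        sigma = 1 if z >= 0 else -1
        out.append(sigma)
        z -= sigma * math.atan(2.0 ** -i)
    return out

def differential_directions(theta, n):
    sign = 1 if theta >= 0 else -1        # initial sign of z_0
    mag = abs(theta)                      # |z_0|
    out = []
    for i in range(n):
        out.append(sign)                  # sigma_i = sign(z_i)
        diff = mag - math.atan(2.0 ** -i)
        if diff < 0:                      # sign change detected
            sign = -sign
        mag = abs(diff)                   # |z_{i+1}| = ||z_i| - alpha_i|
    return out

print(differential_directions(0.3, 8))
```

Both functions produce the same direction sequence as long as no intermediate angle is exactly zero, which is the degenerate case where the sign convention matters.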

    8. Higher Radix Redundant CORDIC

    As mentioned earlier, throughput and latency are important performance attributes in CORDIC based systems. The various radix-2 CORDIC algorithms presented so far may be used to reduce the iteration delay, thereby improving the throughput, with constant scale factor. Higher radix CORDIC algorithms using SD arithmetic [54, 58] and CS arithmetic [43, 59] are proposed to address latency reduction. This is possible since a higher radix representation reduces the number of iterations. The classification of redundant CORDIC algorithms proposed in the literature, based on the radix of the number system, is shown in Figure 8. The application of radix-4 rotations in the CORDIC algorithm was initially proposed in [54] to accelerate the radix-2 algorithm.

    The scale factor need not be computed for the constant scale factor algorithms to be discussed in this section. Since no specific scale factor compensation technique is considered for these methods, a compensation technique can be chosen depending on the application.

    8.1. Pipelined Radix-4 CORDIC [58]. The generalized CORDIC algorithm for any radix in three coordinate systems, and its implementation in the rotation mode of the circular coordinate system using a radix-4 pipelined CORDIC processor, is presented in [58]. This algorithm performs two successive radix-2 microrotations with the same microrotation angle using the iteration equations

    $$\begin{aligned} x_{i+1} &= x_i - \left(\sigma_{i,1} + \sigma_{i,2}\right) 4^{-i} y_i - \sigma_{i,1}\sigma_{i,2}\, 4^{-2i} x_i, \\ y_{i+1} &= \left(\sigma_{i,1} + \sigma_{i,2}\right) 4^{-i} x_i + y_i - \sigma_{i,1}\sigma_{i,2}\, 4^{-2i} y_i, \\ z_{i+1} &= z_i - \left(\sigma_{i,1} + \sigma_{i,2}\right) \alpha_i, \end{aligned} \tag{26}$$

    where $\sigma_{i,1}$ and $\sigma_{i,2}$ are two redundant radix-2 coefficients that decompose the radix-4 coefficient $\sigma_i \in \{-2, -1, 0, +1, +2\}$, satisfying the relation $\sigma_i = \sigma_{i,1} + \sigma_{i,2}$. The value of $\alpha_i$ is selected as $\alpha_0 = 2^{-1}$ and $\alpha_i = 4^{-i}$ for $1 \le i \le n - 1$. The selection function for $\sigma_i$ is determined using the five most significant digits of the z-coordinate, ensuring the convergence of this algorithm. This algorithm is designed using SD arithmetic and requires two adders/subtractors for each stage of the x/y data path, in contrast to the one adder/subtractor required in radix-2 CORDIC [42], for $i < n/4$. However, the number of additions required is reduced during the last $n/4$ stages.
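That the x/y recurrences in (26) equal two successive radix-2 microrotations with the same tangent $4^{-i}$ can be confirmed with exact rational arithmetic. This is a verification sketch only; the starting coordinates are arbitrary and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def radix2_step(x, y, sigma, t):
    """One radix-2 microrotation with tangent t and sigma in {-1, 0, +1}."""
    return x - sigma * t * y, y + sigma * t * x

def radix4_step(x, y, s1, s2, i):
    """x/y part of (26) with sigma_i = s1 + s2."""
    t = Fraction(1, 4 ** i)
    x_next = x - (s1 + s2) * t * y - s1 * s2 * t * t * x
    y_next = (s1 + s2) * t * x + y - s1 * s2 * t * t * y
    return x_next, y_next

x0, y0 = Fraction(5, 7), Fraction(-2, 3)
for i, s1, s2 in product(range(4), (-1, 0, 1), (-1, 0, 1)):
    t = Fraction(1, 4 ** i)
    expected = radix2_step(*radix2_step(x0, y0, s1, t), s2, t)
    assert radix4_step(x0, y0, s1, s2, i) == expected
print("radix-4 step matches two radix-2 microrotations")
```

The product term $\sigma_{i,1}\sigma_{i,2}4^{-2i}$ is the reason a stage needs a second adder/subtractor, and it falls below the working precision in the later stages.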

    Scale Factor Computation. The scale factor $K$ in the radix-4 CORDIC algorithm is variable, since $\sigma_i$ takes values from the digit set $\{-2, -1, 0, +1, +2\}$. $K$ is computed in each iteration using a combinational circuit that realizes the expression

    $$K = \prod_{i=0}^{n/2-1} k_i = \prod_{i=0}^{n/2-1} \left(1 + \sigma_{i,1}^2\, 4^{-2i}\right)^{1/2} \left(1 + \sigma_{i,2}^2\, 4^{-2i}\right)^{1/2}. \tag{27}$$

    8.2. Redundant Radix 2-4 CORDIC [59]. The number of rotations in a redundant radix-2 CORDIC rotation unit is reduced by about 25% by expressing the direction of rotations in radix-2 and radix-4 [54]. This algorithm employs different modified CORDIC algorithms using CS arithmetic for different subsets of iterations. For the iterations $1 \le i \le (n/2 + 1)$, radix-2 microrotations are used, and for $i > (n/2 + 1)$ the redundant radix-4 CORDIC algorithm is employed, thus halving the number of these later iterations. A unified architecture is proposed for the implementation of this algorithm to operate in the rotation/vectoring mode of the circular and hyperbolic coordinate systems.

    Scale Factor Computation. This algorithm achieves a constant scale factor, since the rotation corresponding to $\sigma_i = 0$ is avoided for $i \le n/2 + 1$. For $i > n/2 + 1$, the scale factor need not be computed, as $k_i = \sqrt{1 + 4^{-2i}} \approx 1$.

    8.3. Radix-4 CORDIC [43]. A redundant radix-4 CORDIC algorithm is proposed using CS arithmetic, to reduce the latency compared to redundant radix-2 CORDIC [42]. This algorithm (14) computes the $\sigma_i$ values using two different techniques. For the microrotations in the range $0 \le i < (n/6)$, $\sigma_i$ is determined sequentially using the angle accumulator. For the microrotations in the range $i \ge (n/6)$, the $\sigma_i$ values are predicted from the remaining angle after the first $n/6$ microrotations [60]. Thus, the complexity of the w path is $n/6$, compared to $n$ in the other architectures [42–53] presented in the previous sections. For the range $0 \le i < (n/6)$, the microrotations are pipelined in two stages to increase the throughput. A 32-bit pipelined architecture is proposed for the implementation of the radix-4 CORDIC algorithm using CS arithmetic.

    Scale Factor Computation. The possible scale factors are precomputed and stored in a ROM. The number of possible scale factors for $\sigma_i^2 \in \{0, 1, 4\}$ is $3^{n/4+1}$. The size of the ROM and its access time increase with $n$. Hence, the scale factors for some iterations are stored in the ROM and these values are used to compute the scale factor for the remaining iterations with combinational logic. This is designed by realizing the first few terms of the Taylor series expansion of the scale factor. For this redundant radix-4 implementation, the number of iterations is reduced at the expense of adding hardware for computing the scale factor.

    Figure 9: Taxonomy of direction prediction based CORDIC algorithms.

    9. Parallel CORDIC Algorithms

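    The CORDIC algorithms discussed so far have represented $\theta$ using a set of elementary angles $\alpha_i$ called the arc tangent radix set [3]

    $$\theta = \sigma_0\alpha_0 + \sigma_1\alpha_1 + \cdots + \sigma_{n-1}\alpha_{n-1}, \tag{28}$$

    where $\alpha_i = \tan^{-1}(2^{-i})$ and $\sigma_i \in \{-1, 1\}$, satisfying the convergence theorem [7]

    $$\alpha_i - \sum_{j=i+1}^{n-1} \alpha_j < \alpha_{n-1} \tag{29}$$

    in contrast to the representation using a normal radix

    $$\theta = \sigma_0 2^{0} + \sigma_1 2^{-1} + \cdots + \sigma_{n-1} 2^{-n+1}. \tag{30}$$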

    The direction of rotation $\sigma_i$ for the $i$th iteration is determined after computing the $(i - 1)$ iterations sequentially. It is evident from this sequential dependence of the radix system that the speed of the CORDIC algorithm can be improved by avoiding the sequential behavior in the computation of the $\sigma_i$ values or the x/y coordinates. The various redundant CORDIC algorithms proposed in the literature employing either one or both of these techniques are shown in Figure 9 and are discussed in the following sections.
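The convergence condition on the arc tangent radix set can be checked directly. This is a minimal numerical sketch in floating point, assuming $\alpha_i = \tan^{-1}(2^{-i})$ and testing the inequality for $i = 0, \ldots, n-2$ (for $i = n-1$ the sum is empty, so only equality holds). The function name is illustrative.

```python
import math

def satisfies_convergence(n):
    alpha = [math.atan(2.0 ** -i) for i in range(n)]
    # alpha_i minus the sum of all later angles must stay below alpha_{n-1}
    return all(alpha[i] - sum(alpha[i + 1:]) < alpha[n - 1]
               for i in range(n - 1))

print(satisfies_convergence(16))
```

The condition holds because $\tan^{-1}(2x) < 2\tan^{-1}(x)$ for $x > 0$, so each elementary angle is smaller than twice its successor.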

    9.1. Low Latency Radix-2 CORDIC [55]. The low latency parallel radix-2 CORDIC architecture presented for the rotation mode [55] predicts the $\sigma_i$ values by eliminating the sequential dependency of the z path. In order to minimize the prediction error, directions are predicted for a group of iterations at a time rather than for all iterations together. This architecture does not allow rotation for index $i = 0$. Hence, the convergence range of this architecture is less than $(-\pi/2, +\pi/2)$. On the other hand, the requirement of redundant to binary conversions of intermediate results in the z path restricts the pipelined implementation of this architecture. In order to reduce the latency of this parallelizing scheme further, a termination algorithm and a Booth encoding method have been proposed.

    9.2. P-CORDIC [61]. The sequential procedure in the computation of the direction of rotations of the CORDIC algorithm is eliminated by the P-CORDIC algorithm, while maintaining a constant scale factor. This algorithm precomputes the direction of microrotations before the actual CORDIC rotation starts iteratively in the x/y path. This is obtained by deriving a relation between the constructed binary representation of the direction of rotations $d$ and the rotation angle $\theta$ [40, 62], given by

    $$d = 0.5\,\theta + 0.5\,c_1 + \operatorname{sign}(\theta)\,\epsilon_0 + \delta, \tag{31}$$

    where $c_1 = 2 - \sum_{i=0}^{\infty}\left(2^{-i} - \tan^{-1}(2^{-i})\right)$, $\delta = \sum_{i=1}^{n/3}(\sigma_i\epsilon_i)$, $\epsilon_0 = 1 - \tan^{-1}(1)$, and $\epsilon_i = 2^{-i} - \tan^{-1}(2^{-i})$. Here, $\delta$ is computed using the partial offset $\epsilon_i$ and the corresponding direction bit $\sigma_i$ for the first $n/3$ iterations, since the value of $\epsilon_i$ decreases by a factor of 8 beyond $n/3$ iterations. The directions of rotation for any input angle in binary form are obtained by realizing this expression, taking the variable offset $\delta$ from a ROM. The unfolded architecture proposed for the implementation of this algorithm eliminates the z path and reduces the area of the implementation. This architecture achieves latency and hardware reduction over the radix-2 unfolded parallel architecture [55].

    Scale Factor. The scale factor in the implementation of the P-CORDIC algorithm remains constant, as $\sigma_i \in \{-1, 1\}$ is generated for the implementation of the x/y path. The scale factor compensation is implemented using the constant factor multiplication technique discussed in Section 3.2.6.

    9.3. Hybrid CORDIC Algorithm. For an $n$-bit fixed point CORDIC processor in the circular coordinate system, nearly $n/3$ iterations must be computed sequentially. This is true for the generation of both direction and rotation without affecting accuracy [60]. The subsequent rotation directions for the last $2n/3$ iterations can be generated in parallel, since the conventional circular arc tangent radix (ATR) values approach the radix-2 coefficients progressively with increasing iteration index, that is,

    $$\lim_{k \to +\infty} \frac{\tan 2^{-k}}{2^{-k}} = 1. \tag{32}$$

    This behavior is exploited by the hybrid CORDIC algorithms to speed up the conventional CORDIC rotator. This algorithm involves partitioning $\theta$ into $\theta_H$ and $\theta_L$. The rotations by $\theta_H$ are performed as in the conventional CORDIC algorithm, and the iterations related to $\theta_L$ can be simplified as in the linear coordinate system. This algorithm led to the development of several parallel CORDIC algorithms [63–65]. These can be categorized broadly as mixed-hybrid CORDIC and partitioned-hybrid CORDIC algorithms. In mixed-hybrid CORDIC algorithms [65], the input angle $\theta$ and the initial coordinates $(x_{in}, y_{in})$ are used to compute the rotations for the first $n/3$ iterations as in the conventional CORDIC. The remaining angle after these first $n/3$ iterations is used for computing the directions for the last $2n/3$ iterations. The implementation is designed to keep the fast timing characteristics of redundant arithmetic in the x/y path of the CORDIC processing. In the partitioned-hybrid CORDIC [63, 64], the first $n/3$ directions of rotation are generated using the first $n/3$ bits of $\theta$, and the last $2n/3$ directions of rotation are predicted using the $2n/3$ least significant bits of $\theta$.
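The "nearly $n/3$" boundary can be made concrete with a short numerical check. This is an illustrative sketch, assuming the criterion that the difference between $2^{-i}$ and $\tan^{-1}(2^{-i})$ must fall below $2^{-n}$. Since that difference is close to $2^{-3i}/3$, the boundary is expected near $(n - \log_2 3)/3$. The function name is hypothetical.

```python
import math

def first_parallel_index(n):
    """Smallest i with 2^-i - atan(2^-i) < 2^-n."""
    i = 0
    while 2.0 ** -i - math.atan(2.0 ** -i) >= 2.0 ** -n:
        i += 1
    return i

for n in (16, 24, 32):
    print(n, first_parallel_index(n), math.ceil((n - math.log2(3)) / 3))
```

For these word lengths the measured index and the closed-form estimate coincide, which is consistent with roughly one third of the iterations being sequential.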

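    9.3.1. Flat CORDIC [63]. The flat CORDIC algorithm is proposed to eliminate the iterative nature of the x/y path in order to reduce the total computation time. This algorithm transforms the x/y recurrences (11) of the conventional CORDIC into a parallelized version by successive substitution, to express the final vectors in terms of the initial vectors, resulting in a single equation for $n$-bit precision. The expressions for the final coordinates of a 16-bit sine/cosine generator are

    $$\begin{aligned} x_{16} = \bigl[\,1 &- \left(\sigma_1\sigma_2 2^{-1}2^{-2} + \sigma_1\sigma_3 2^{-1}2^{-3} + \cdots + \sigma_2\sigma_3 2^{-2}2^{-3} + \cdots + \sigma_9\sigma_{10} 2^{-9}2^{-10}\right) \\ &+ \left(\sigma_1\sigma_2\sigma_3\sigma_4 2^{-1}2^{-2}2^{-3}2^{-4} + \cdots + \sigma_2\sigma_3\sigma_4\sigma_5 2^{-2}2^{-3}2^{-4}2^{-5} + \cdots + \sigma_3\sigma_4\sigma_6\sigma_7 2^{-3}2^{-4}2^{-6}2^{-7}\right) + EC_X \bigr], \\ y_{16} = \bigl[\,&\sigma_1 2^{-1} + \sigma_2 2^{-2} + \cdots + \sigma_{16} 2^{-16} \\ &- \left(\sigma_1\sigma_2\sigma_3 2^{-1}2^{-2}2^{-3} + \cdots + \sigma_5\sigma_7\sigma_8 2^{-5}2^{-7}2^{-8}\right) \\ &+ \left(\sigma_1\sigma_2\sigma_3\sigma_4\sigma_5 2^{-1}2^{-2}2^{-3}2^{-4}2^{-5} + \cdots + \sigma_2\sigma_3\sigma_4\sigma_5\sigma_6 2^{-2}2^{-3}2^{-4}2^{-5}2^{-6}\right) + EC_Y \bigr], \end{aligned} \tag{33}$$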

    where $EC_X$ and $EC_Y$ are the error compensation factors in $x_{16}$ and $y_{16}$, respectively. $x_{in}$ and $y_{in}$ are initialized with $1/K$ and 0, respectively. The 16 sign digits ($\sigma_1, \sigma_2, \ldots, \sigma_{15}, \sigma_{16}$) for 16-bit precision represent the polarity of the 16 microrotations required to achieve the target angle. These equations demonstrate the complete parallelization of the conventional CORDIC algorithm. This technique precomputes $\sigma_i$, which takes values from the set $\{-1, 1\}$ to achieve a constant scale factor. The $\sigma_i$ values for the first $n/3$ iterations are precomputed employing a technique called the Split Decomposition Algorithm (SDA), which limits the input angle range to $(0, \pi/4)$ [66]. The last $2n/3$ $\sigma_i$ values are predicted from the remaining angle after $n/3$ iterations. The internal word length of the architecture proposed for this technique is taken as $(n + \log_2 n)$ for $n$-bit external accuracy [47]. It may be noted that the complete parallelization of the x/y iterations leads to an exponential increase in the number of terms to be flattened, affecting the circuit complexity. In addition, the implementation of flat CORDIC needs complex combinational hardware blocks with poor scalability.
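The structure of the flattened expressions, and the growth in the number of terms, can be reproduced with a short exact computation. This sketch is an illustration and not the architecture of [63]: it assumes unscaled initial coordinates $(1, 0)$, microrotation indices starting at 1, no truncation of small terms, and no error compensation factors. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import prod

def iterative_xy(sigmas):
    """Conventional recurrences, one microrotation per loop pass."""
    x, y = Fraction(1), Fraction(0)
    for i, s in enumerate(sigmas, start=1):
        t = Fraction(1, 2 ** i)
        x, y = x - s * t * y, y + s * t * x
    return x, y

def flat_xy(sigmas):
    """Successive substitution: sums of products of sigma_i * 2^-i."""
    terms = [s * Fraction(1, 2 ** i) for i, s in enumerate(sigmas, start=1)]
    x = y = Fraction(0)
    for r in range(len(terms) + 1):
        total = sum((prod(c) for c in combinations(terms, r)), Fraction(0))
        sign = (-1) ** (r // 2)          # +, +, -, -, +, + ... for r = 0, 1, 2, ...
        if r % 2 == 0:
            x += sign * total            # 1 - pairs + quadruples - ...
        else:
            y += sign * total            # singles - triples + quintuples - ...
    return x, y

sigmas = [1, -1, 1, 1, -1, -1, 1, -1]
assert iterative_xy(sigmas) == flat_xy(sigmas)
print(flat_xy(sigmas))
```

Without truncation, $n$ microrotations expand into $2^n$ product terms, which illustrates the exponential growth noted above. A practical design keeps only the terms that are significant at the target precision.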

    Scale Factor. The scale factor in the implementation of the flat CORDIC algorithm is kept constant, since $\sigma_i \in \{-1, 1\}$. The scale factor compensation is implemented using a multiplier designed with a CS adder tree.

    9.3.2. Para-CORDIC [64]. The Para-CORDIC parallelizes the generation of the direction of rotations from the binary value of the input angle $\theta$ by employing binary to bipolar representation (BBR) and microrotation angle recoding (MAR) techniques. This algorithm computes the x/y coordinates iteratively while eliminating the iterative z path completely. The input angle $\theta$ is divided into the higher part $\theta_H$ and the lower part $\theta_L$. The two's complement binary representation of the input angle is

    $$\theta = (-d_0) + \sum_{i=1}^{l-1} d_i 2^{-i} + \sum_{i=l}^{n} d_i 2^{-i}, \tag{34}$$

    where $d_i \in \{0, 1\}$ and $l = \lceil (n - \log_2 3)/3 \rceil$. The $(l - 1)$ bits of the input angle are converted into BBR, and the MAR technique is employed to determine the directions of rotation $\sigma_1$ to $\sigma_{l-1}$. Since $\tan^{-1} 2^{-i} \ne 2^{-i}$, this method performs additional microrotations for every iteration depending on each positional binary weight $2^{-i}$ for $i = 1, 2, \ldots, l - 1$. The remaining angle after the first $(l - 1)$ rotations is added to $\theta_L$. The values of $\sigma_l$ to $\sigma_{n+1}$ are obtained from the BBR of the corrected $\theta_L$. This method eliminates the ROM for storing the predetermined directions of rotation. However, it requires additional x/y stages for the repetition of certain microrotations and an array of adders to compute the corrected $\theta_L$.
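The binary to bipolar step can be illustrated with the identity $d = (1 + r)/2$ between a bit $d \in \{0, 1\}$ and a bipolar digit $r \in \{-1, +1\}$. The sketch below is an assumption-laden illustration: the exact digit weights and constant offset used in [64] are not given in this survey, so the form here is simply what follows from that identity, and the function names are hypothetical.

```python
from fractions import Fraction

def binary_to_bipolar(bits):
    """bits[k] is d_{k+1} in {0, 1}; returns r_{k+1} in {-1, +1}."""
    return [2 * d - 1 for d in bits]

def binary_value(bits):
    # sum of d_i * 2^-i for i = 1..m
    return sum(Fraction(d, 2 ** (k + 1)) for k, d in enumerate(bits))

def bipolar_value(digits):
    # sum of r_i * 2^-(i+1) plus the constant offset 2^-1 - 2^-(m+1)
    m = len(digits)
    weighted = sum(Fraction(r, 2 ** (k + 2)) for k, r in enumerate(digits))
    return weighted + Fraction(1, 2) - Fraction(1, 2 ** (m + 1))

bits = [1, 0, 1, 1, 0]
digits = binary_to_bipolar(bits)
print(digits, binary_value(bits), bipolar_value(digits))
```

Every bipolar digit is nonzero, so each one can serve directly as a rotation direction in $\{-1, +1\}$, which is what allows all directions to be read off the input bits at once.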

    9.3.3. Semi-Flat CORDIC [65]. The iterative nature of the implementation of the conventional CORDIC algorithm is partially eliminated by the semi-flat algorithm. This is designed for the semi-parallelization of the x/y/z recurrences, to improve the speed of a rotational unfolded CORDIC without increasing the area requirements. The internal precision is taken higher than the required external precision in order to reduce the quantization error encountered in the CORDIC algorithm, as discussed in Section 3.2.6. For the first $\lambda$ bits of $\sigma_i$, the x/y recurrences are computed iteratively using the double rotation method [51], resulting in $x_{\lambda-1}/y_{\lambda-1}$. Then, $x_{n-1}/y_{n-1}$ can be expressed in terms of these $x_{\lambda-1}/y_{\lambda-1}$, if all $\sigma_i$ values are predicted. The $\sigma_i$ values for $(n_{int}/3 - \lambda)$ bits ($n_{int}$ = internal precision) of the input angle are precomputed and stored in a ROM, which is addressed by $(n_{int}/3 - \lambda)$ bits of the input angle. The remaining $(2n_{int}/3)$ $\sigma_i$ values are predicted from the rotation angle [60]. It may be noted that neither a description nor a reference is provided for the split decomposition method employed to precompute the $(n_{int}/3)$ $\sigma_i$ values.

    The computation time and area of the chip are affected by the choice of $\lambda$, which is clear from the simulation results presented in [65]. It is observed from these simulation results that the best trade-off is obtained with $\lambda = 6$ and $\lambda = 8$ for a 16-bit CORDIC (internal precision 22 bits) and a 32-bit CORDIC (internal precision 39 bits), respectively. After $\lambda$ iterations, all the terms of $(x_n/y_n)$ are added using a Wallace tree, flattening the x/y path. However, this architecture has poor scalability.

    Scale Factor. This algorithm achieves a constant scale factor, since $\sigma_i$ takes values from the set $\{-1, 1\}$.

    10. Comparison

    We have presented a latency estimate comparison of the unfolded architectures available in the literature for 2D rotational CORDIC in Table 2. Latency is defined as the sum of the delays for the computation of the redundant x/y coordinates, scale factor compensation, and redundant to binary conversion of the final x/y coordinates. The design details of the scale factor compensation and redundant to binary conversion stages are not made available in the literature for all the architectures discussed in the previous sections (Sections 6–9). Hence, we have compared all the CORDIC algorithms with respect to the latency required for the rotation computation, excluding the scale factor compensation and redundant to binary conversion stages. All the architectures presented in this table are implemented using redundant arithmetic, except the conventional CORDIC [3] and the low latency nonredundant CORDIC [49].

    The nonpipelined and pipelined implementations of the conventional radix-2 CORDIC algorithm [3, 7] require $n$ iterations to compute the x/y coordinates iteratively. The iteration delay depends on the fast carry propagate adder, which is the bottleneck to increasing throughput and reducing latency.

    The application of redundant arithmetic [42] to the conventional CORDIC allows $\sigma_i$ to take values from the set $\{-1, 0, 1\}$ instead of the set $\{-1, 1\}$. The $\sigma_i$ values are computed iteratively, and the choice $\sigma_i = 0$ makes the usually constant scale factor variable. The variable scale factor increases the area and delay needed for scale factor computation. The latency of this implementation is $n\,t_{stage}$, where $t_{stage}$ is the iteration stage delay in terms of the full adder delay $t_{FA}$.
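The effect of $\sigma_i = 0$ on the scale factor can be checked numerically. The sketch below assumes the usual radix-2 gain $K = \prod_i \sqrt{1 + \sigma_i^2 2^{-2i}}$ with the index starting at 0; the function name `scale_factor` is hypothetical.

```python
import math


def scale_factor(sigmas):
    """Gain accumulated by a sequence of radix-2 microrotations.

    Assumes microrotation i multiplies the vector length by
    sqrt(1 + sigma_i**2 * 2**(-2*i)), with i starting at 0.
    """
    k = 1.0
    for i, sigma in enumerate(sigmas):
        k *= math.sqrt(1.0 + (sigma * sigma) * 4.0 ** -i)
    return k
```

With $\sigma_i \in \{-1, 1\}$ every direction sequence of a given length yields the same value, so K can be precomputed. Once zeros are allowed, K depends on which microrotations were skipped and has to be computed alongside the rotation.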

    The double rotation and correcting rotation redundant CORDIC methods using SD arithmetic are proposed in [51] to reduce the cost of the scale factor computation. The nonpipelined and pipelined implementations of these methods require a latency of $1.5n\,t_{stage}$ to compute the final x/y coordinates iteratively. These methods achieve a constant scale factor, at the cost of a 50% increase in latency compared to [42].
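One way to see why a double rotation keeps the scale factor constant is sketched below. It rests on an assumption about the general idea only (each step is carried out as two half-size microrotations, in the same direction for $\sigma_i = \pm 1$ and in opposite directions for $\sigma_i = 0$) and is not the exact recurrence of [51]; the names `double_rotation_step` and `step_gain` are hypothetical.

```python
import math


def double_rotation_step(x, y, sigma, i):
    """Apply step i as two microrotations of size 2**-(i+1).

    sigma = +1 or -1: both microrotations go in the direction of sigma.
    sigma = 0: one positive and one negative microrotation, so the
    net angle is zero but the vector is still scaled.
    """
    a = 2.0 ** -(i + 1)
    directions = (sigma, sigma) if sigma != 0 else (1, -1)
    for d in directions:
        x, y = x - d * a * y, y + d * a * x
    return x, y


def step_gain(sigma, i):
    """Length of the unit vector (1, 0) after one double-rotation step."""
    x, y = double_rotation_step(1.0, 0.0, sigma, i)
    return math.hypot(x, y)
```

Under this assumption the gain of step i is $1 + 2^{-2(i+1)}$ for all three values of $\sigma_i$, which is why no variable scale factor has to be tracked; the price is the extra microrotation work per step.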

    The low latency CORDIC algorithm [55] reduces the latency to $((9n - 3)/8)\,t_{stage}$, compared with $1.5n\,t_{stage}$ in [51]. This algorithm computes the direction of rotations and the x/y coordinates iteratively. In addition, a nonpipelined architecture using a prediction technique is proposed in the same paper. The latency of this architecture is $(n + \log_3 n - 1)\,t_{stage}$.

    The branching algorithm [52], which uses signed digit arithmetic, is proposed to achieve a constant scale factor. The latency of the nonpipelined and pipelined implementations of this algorithm is $n\,t_{stage}$. This reduces the latency for computing the final x/y coordinates iteratively from the $1.5n\,t_{stage}$ of [51] to $n\,t_{stage}$, a reduction of one third. However, it requires double the hardware, as two sets of x/y/z modules are employed.

    The direction of rotations computed using the sign estimation methods [51, 52, 55] may not be accurate, so half of the computational effort is required for correction. The DCORDIC algorithm [56] is proposed to determine the direction of rotations iteratively using the sign of a steering variable. However, this method requires an initial latency of $n\,t_{FA}$ before the CORDIC rotation starts, to obtain the first direction of rotation. The signs for the remaining iterations are obtained with one full adder delay each, using a bit-level pipelined architecture with n stages. This implementation requires a latency of $n\,t_{stage} + (n + 1)\,t_{FA}$ to compute the final x/y coordinates iteratively. In addition, this method requires 2.5n initial register rows for skewing of the input data.

    All the methods presented so far reduce the latency by decreasing the iteration delay using redundant arithmetic. Since latency can also be reduced by reducing the number of iterations, this motivated the implementation of a radix-4 pipelined CORDIC processor [58], which results in a latency of $(3n/4 + 1)\,t_{stage}$.

    The mixed radix CORDIC algorithm [59] uses radix-2 and radix-4 rotations to design a pipelined processor that operates in the rotation and vectoring modes of the circular and hyperbolic coordinate systems. This pipelined architecture requires $(3n/4 + 1)$ stages with three different stage delays ($t_{stage}$): $31\,t_{NAND}$ for $1 \le i < n/4$, $34\,t_{NAND}$ for $n/4 \le i \le (n/2 + 1)$, and $36\,t_{NAND}$ for $i > (n/2 + 1)$. The stage delay of this architecture is larger because it is designed for several modes of operation.
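The three stage delays quoted above can be summed to estimate the total rotation latency in NAND-gate delays. The sketch below assumes that the stage index i runs from 1 to $3n/4 + 1$ and that n is a multiple of 4; neither assumption is stated explicitly in the text, and the function name is hypothetical.

```python
def mixed_radix_latency_nand(n):
    """Total latency, in NAND-gate delays, of the mixed radix pipeline.

    Assumes stages are indexed i = 1 .. 3n/4 + 1 and that n is a
    multiple of 4, with the per-stage delays quoted in the text.
    """
    stages = 3 * n // 4 + 1
    total = 0
    for i in range(1, stages + 1):
        if i < n / 4:
            total += 31          # 1 <= i < n/4
        elif i <= n / 2 + 1:
            total += 34          # n/4 <= i <= n/2 + 1
        else:
            total += 36          # i > n/2 + 1
    return total
```

Under these assumptions a 16-bit design has 13 stages (3 at 31, 6 at 34, and 4 at 36 NAND delays).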

    The advantage of applying radix-4 rotations in all iteration stages is exploited in [43] with fewer adders than in [58]. For the microrotations in the range $0 \le i < (n/6)$, the pipelined architecture proposed for this algorithm determines the $\sigma_i$ values sequentially using the angle accumulator. For the microrotations in the range $i \ge (n/6)$, the $\sigma_i$ values are determined from the remaining angle after n/6 iterations. The latency of this architecture to compute the final x/y coordinates iteratively is $(2n/3 + 2)\,t_{stage}$.
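The stage-count expressions quoted so far in this section can be evaluated side by side for a given word length. The sketch below is only a convenience for reading the formulas: the function name and dictionary labels are hypothetical, the values are in units of $t_{stage}$, and $t_{stage}$ itself differs between architectures, so the numbers are not directly comparable delays.

```python
import math


def latency_in_stages(n):
    """Latency expressions from the text, in units of t_stage."""
    return {
        "redundant [42]": n,
        "double/correcting rotation [51]": 1.5 * n,
        "low latency [55]": (9 * n - 3) / 8,
        "low latency [55], prediction-based nonpipelined": n + math.log(n, 3) - 1,
        "branching [52]": n,
        "radix-4 [58]": 3 * n / 4 + 1,
        "radix-4 [43]": 2 * n / 3 + 2,
    }


if __name__ == "__main__":
    for name, value in latency_in_stages(16).items():
        print(f"{name:50s} {value:7.3f}")
```

For n = 16 this gives, for example, 24 stage delays for [51] against 13 for the radix-4 processor of [58], which illustrates the gain obtained from reducing the number of iterations.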

    In [61], the P-CORDIC algorithm is proposed to eliminate the z path completely, using a linear relation between the rotation angle and the corresponding directions of all microrotations in rotation mode. This algorithm computes the x/y coordinates iteratively. The latency of the nonpipelined architecture proposed to implement this algorithm for n-bit precision is $(n/12 + \log_2 n + 1.75 + 2n)\,t_{FA}$.

    Table 2: Comparison of various rotational CORDIC architectures (SD: signed digit, CS: carry save).

    | Method (year) | Radix | Arithmetic | Latency ($t_{FA}$) | Iterative x/y path | Iterative z path | Scale factor K |
    |---|---|---|---|---|---|---|
    | Nonredundant (1959) [3] | 2 | 2's complement | $n^2$ (nonpipelined and pipelined) | | | Constant |
    | Redundant (1987) [42] | 2 | CS | $n\,t_{stage}$ (nonpipelined and pipelined) | | | Variable |
    | Double rotation / correcting rotation (1991) [51] | 2 | SD | $1.5n\,t_{stage}$ (nonpipelined and pipelined) | | | Constant |
    | Low latency (1992) [55] | 2 | CS | $(n + \log_3 n - 1)\,t_{stage}$ (nonpipelined); $((9n - 3)/8)\,t_{stage}$ (pipelined) | | | Constant |
    | Branching (1993) [52] | 2 | SD | $n\,t_{stage}$ (nonpipelined and pipelined) | | | Constant |
    | DCORDIC (1996) [56] | 2 | SD/CS | $n\,t_{stage} + n + 1$ | | | Constant |
    | Radix-4 (1993) [58] | 4 | SD | $(3n/4 + 1)\,t_{stage}$ | | | Variable |
    | Radix 2-4 (1996) [59] | 2-4 | CS | $(3n/4 + 1)\,t_{stage}$ | | | Constant |
    | Radix-4 (1997) [43] | 4 | CS | $(2n/3 + 1)\,t_{stage}$ | | n/6 | Variable |
    | P-CORDIC (2002) [61] | 2 | SD | $1.7n + 1.25 + \log_2 n$ | | | Constant |
    | Flat CORDIC (2002) [63] | 2 | SD | 34 for 16-bit, 50 for 32-bit | combinational | | Constant |
    | Para-CORDIC (2004) [64] | 2 | CS | $2(s(n) + n/2 - l + 2) + \log_{1.5} n + 2$ | | | Constant |
    | Semi-flat (2006) [65] | 2 | SD | 33 for 16-bit | $\lambda$/combinational | | Constant |
    | Nonredundant low-latency (2008) [49] | 2 | 2's complement | $(n/2 + 2)\,t_{adder} + t_{multiplier}$ | $(n/2 + 1)$/multiplier | n/3 | Constant |

    The flat CORDIC algorithm [63] eliminates the iterative nature of the x/y/z path at the cost of scalability. This algorithm transforms the x/y recurrences (11) of the conventional CORDIC into a parallelized version by successive substitution, expressing the final vectors in terms of the initial vectors and resulting in a single equation for n-bit precision. The directions of rotation are precomputed before the computation of the x/y coordinates is initiated. The final x/y coordinates are computed using combinational blocks, with a latency of $34\,t_{FA}$ for 16-bit and $50\,t_{FA}$ for 32-bit precision. The expressions for the x and y variables need to be rederived, and the combinational building blocks redesigned, whenever the precision changes.
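The successive substitution used by the flat technique can be illustrated with a small Python sketch. It assumes the standard radix-2 recurrences $x_{i+1} = x_i - \sigma_i 2^{-i} y_i$ and $y_{i+1} = y_i + \sigma_i 2^{-i} x_i$ with known directions $\sigma_i$, and shows that the iterated result equals a single linear expression in the initial coordinates. The function names `iterate` and `flatten` are hypothetical, and the sketch does not reproduce the term grouping or Wallace-tree summation of [63].

```python
def iterate(x, y, sigmas):
    """Apply the radix-2 x/y recurrences one microrotation at a time."""
    for i, sigma in enumerate(sigmas):
        t = sigma * 2.0 ** -i
        x, y = x - t * y, y + t * x
    return x, y


def flatten(sigmas):
    """Collapse all microrotations into one 2x2 matrix (a, b, c, d).

    The final coordinates are then x_n = a*x0 + b*y0 and
    y_n = c*x0 + d*y0, i.e. a single expression in the initial vector.
    """
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for i, sigma in enumerate(sigmas):
        t = sigma * 2.0 ** -i
        # Left-multiply the accumulated matrix by [[1, -t], [t, 1]].
        a, b, c, d = a - t * c, b - t * d, c + t * a, d + t * b
    return a, b, c, d
```

Because the coefficients depend on every $\sigma_i$ and on the number of microrotations, the flattened expression has to be rederived when the precision changes, which is the scalability cost noted above.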

    In [64], the Para-CORDIC algorithm is proposed to precompute the direction of rotations without using a ROM, while eliminating the iterative z path completely. In contrast to the ROM employed in [61, 63, 65], this method uses additional x/y stages that repeat certain microrotations in order to predict the direction of rotations. The latency of Para-CORDIC is $(2(s(n) + n/2 - l + 2) + \log_{1.5} n + 2)\,t_{FA}$, where $l = (n - \log_2 3)/3$ and $s(n)$ represents the total number of microrotations required in MAR recoding of $(l - 1)$ bits of the input angle. The values of $s(n)$ for 16-, 32-, and 64-bit precision are 5, 18, and 52, respectively.

    The semi-flat technique is proposed in [65] to partially eliminate the iterative nature of the x/y/z paths for the $(n - \lambda)$ iterations ($\lambda = 6$ for a 16-bit CORDIC and $\lambda = 8$ for a 32-bit CORDIC). The latency of the nonpipelined implementation of this algorithm is $33\,t_{FA}$ for 16-bit and $49\,t_{FA}$ for 32-bit precision. This architecture is combinational after $\lambda$ iterations and has poor scalability.

    In [49], the x/y coordinates are computed iteratively for the first $(n/2 + 1)$ iterations using $(n/2 + 1)$ fast adders. These values are used to compute the final x/y coordinates using two multipliers in parallel and one adder, resulting in a latency of $((n/2 + 2)\,t_{adder} + t_{multiplier})$. The $\sigma_i$ values for the first $(n/3 + 1)$ iterations are determined iteratively using the sign of the angle accumulator $z_i$. For the range $(n/3 + 1) < i \le (n/2 + 1)$, the rotation directions are generated in parallel.

    11. Conclusions

    In this paper, we have surveyed the algorithms for unfolded implementation of 2D rotational CORDIC. Special attention has been devoted to the systematic and comprehensive classification of the solutions proposed in the literature. In addition to the pipelined implementation of the nonredundant radix-2 CORDIC algorithm, which has received wide attention in the past, we have discussed the importance of redundant and higher radix algorithms. We have also stressed the importance of prediction algorithms to precompute the directions of rotations and of parallelization of the x/y path. It is worth noting that the algorithms considered should not be treated as mutually exclusive alternatives; rather, they should be integrated depending on the design constraints of a specific application.

    We can draw final conclusions about the different algorithms for achieving an efficient implementation of an application-specific rotational CORDIC algorithm. When redundant arithmetic is applied to the pipelined implementation of the conventional radix-2 CORDIC algorithm, the area is doubled while the adder delay of each stage is reduced from $(\log_2 n)\,t_{FA}$ to $2\,t_{FA}$. Similarly, the hardware and iteration delay of the redundant radix-2 CORDIC can be reduced by employing a prediction technique for the precomputation of the direction of rotations. Further latency reduction can be achieved by integrating the prediction technique with redundant radix-4 arithmetic, trading area for the variable scale factor computation. Another important observation is that solutions with full parallelization of the x/y path affect the modularity and regularity of the architecture, leading to poorly scalable implementations. Finally, we conclude that a solution that allows the design of a scalable architecture, applying prediction and x/y path parallelization techniques to the redundant CORDIC algorithm, can achieve both latency reduction and throughput improvement.

    References

    [1] D. S. Cochran, “Algorithms and accuracy in the HP-35,” Hewlett-Packard Journal, vol. 23, no. 10, 1972.

    [2] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser, Boston, Mass, USA, 2004.

    [3] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Transactions on Electronic Computers, vol. 8, no. 3, pp. 330–334, 1959.

    [4] J. E. Volder, “The birth of CORDIC,” Journal of VLSI Signal Processing, vol. 25, no. 2, pp. 101–105, 2000.

    [5] D. H. Daggett, “Decimal-binary conversions in CORDIC,” IRE Transactions on Electronic Computers, vol. 8, pp. 335–339, 1959.

    [6] J. E. Meggitt, “Pseudo division and pseudo multiplication processes,” IBM Journal, vol. 6, no. 2, pp. 210–226, 1962.

    [7] J. S. Walther, “A unified algorithm for elementary functions,” in Proceedings of the AFIPS Spring Joint Computer Conference, pp. 379–385, May 1971.

    [8] J. S. Walther, “The story of Unified CORDIC,” Journal of VLSI Signal Processing, vol. 25, no. 2, pp. 107–112, 2000.

    [9] G. L. Haviland and A. A. Tuszynski, “A CORDIC arithmetic processor chip,” IEEE Journal of Solid-State Circuits, vol. 15, no. 1, pp. 4–15, 1980.

    [10] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal processing,” IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, 1992.

    [11] A. A. J. de Lange, A. J. van der Hoeven, E. F. Deprettere, and J. Bu, “Optimal floating-point pipeline CMOS CORDIC processor,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS ’88), vol. 3, pp. 2043–2047, June 1988.

    [12] A. A. J. de Lange, A. J. van der Hoeven, E. F. Deprettere, and P. Dewilde, “An application specific IC for digital signal processing: the floating point pipeline CORDIC processor,” in Proceedings of the European Conference on ASIC Design (ASIC ’90), pp. 62–67, May 1990.

    [13] D. E. Metafas and C. E. Goutis, “A DSP processor with a powerful set of elementary arithmetic operations based on cordic and CCM algorithms,” Microprocessing and Microprogramming, vol. 30, no. 1–5, pp. 51–57, 1990.

    [14] D. Timmermann, H. Hahn, B. J. Hosticka, and G. Schmidt, “A programmable CORDIC chip for digital signal processing applications,” IEEE Journal of Solid-State Circuits, vol. 26, no. 9, pp. 1317–1321, 1991.

    [15] A. A. J. de Lange and E. F. Deprettere, “Design and implementation of a floating-point quasi-systolic general purpose CORDIC rotator for high-rate parallel data and signal processing,” in Proceedings of the 10th IEEE Symposium on Computer Arithmetic, pp. 272–281, June 1991.

    [16] A. M. Despain, “Fourier transform computers using CORDIC iterations,” IEEE Transactions on Computers, vol. 23, no. 10, pp. 993–1001, 1974.

    [17] H. M. Ahmed, J.-M. Delosme, and M. Morf, “Highly concurrent computing structures for matrix arithmetic and signal processing,” Computer, vol. 15, no. 1, pp. 65–82, 1982.

    [18] Y. H. Hu and S. Naganathan, “A novel implementation of a chirp Z-transform using a CORDIC processor,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 38, no. 2, pp. 352–354, 1990.

    [19] A. S. Dhar and S. Banerjee, “An array architecture for fast computation of discrete Hartley transform,” IEEE Transactions on Circuits and Systems, vol. 38, no. 9, pp. 1095–1098, 1991.

    [20] K. Maharatna, A. S. Dhar, and S. Banerjee, “A VLSI array architecture for realization of DFT, DHT, DCT and DST,” Signal Processing, vol. 81, no. 9, pp. 1813–1822, 2001.

    [21] K. C. Ray and A. S. Dhar, “CORDIC-based unified VLSI architecture for implementing window functions for real time spectral analysis,” IEE Proceedings: Circuits, Devices and Systems, vol. 153, no. 6, pp. 539–544, 2006.

    [22] S. K. Rao and T. Kailath, “Orthogonal digital filters for VLSI implementation,” IEEE Transactions on Circuits and Systems, vol. 31, no. 11, pp. 933–945, 1984.

    [23] P. P. Vaidyanathan, “A unified approach to orthogonal digital filters and wave digital filters based on LBR two pair extraction,” IEEE Transactions on Circuits and Systems, vol. 32, no. 7, pp. 673–686, 1985.

    [24] Y. H. Hu and H. E. Liao, “CALF: a CORDIC adaptive lattice filter,” IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 990–993, 1992.

    [25] J. R. Cavallaro and F. T. Luk, “CORDIC arithmetic for an SVD processor,” Journal of Parallel and Distributed Computing, vol. 5, no. 3, pp. 271–290, 1988.

    [26] J. A. Lee and T. Lang, “SVD by constant-factor-redundant-CORDIC,” in Proceedings of the 10th IEEE Symposium on Computer Arithmetic, pp. 264–271, June 1991.

    [27] A. Banerjee, A. S. Dhar, and S. Banerjee, “FPGA realization of a CORDIC based FFT processor for biomedical signal processing,” Microprocessors and Microsystems, vol. 25, no. 3, pp. 131–142, 2001.

    [28] A. Meyer-Bäse, R. Watzel, U. Meyer-Bäse, and S. Foo, “A parallel CORDIC architecture dedicated to compute the Gaussian potential function in neural networks,” Engineering Applications of Artificial Intelligence, vol. 16, no. 7-8, pp. 595–605, 2003.

    [29] C. Y. Kang and E. E. Swartzlander Jr., “Digit-pipelined direct digital frequency synthesis based on differential CORDIC,” IEEE Transactions on Circuits and Systems I, vol. 53, no. 5, pp. 1035–1044, 2006.

    [30] H. Wang, P. Leray, and J. Palicot, “Reconfigurable architecture for MIMO systems based on CORDIC operators,” Comptes Rendus Physique, vol. 7, no. 7, pp. 735–750, 2006.

    [31] G. J. Hekstra and E. F. A. Deprettere, “Fast rotations: low-cost arithmetic methods for orthonormal rotation,” in Proceedings of the 13th IEEE Symposium on Computer Arithmetic, pp. 116–125, October 1997.

    [32] K. Hwang, Computer Arithmetic: Principles, Architecture and Design, John Wiley & Sons, New York, NY, USA, 1979.

    [33] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, New York, NY, USA, 1999.

    [34] A. Avizienis, “Signed-digit number representation for fast parallel arithmetic,” IRE Transactions on Electronic Computers, vol. 10, pp. 389–400, 1961.

    [35] D. E. Atkins, “Introduction to the role of redundancy in computer arithmetic,” IEEE Computer Magazine, vol. 8, no. 6, pp. 74–77, 1975.

    [36] B. Parhami, “Carry-free addition of recoded binary signed-digit numbers,” IEEE Transactions on Computers, vol. 37, no. 11, pp. 1470–1476, 1988.

    [37] B. Parhami, “Generalized signed-digit number systems: a unifying framework for redundant number representations,” IEEE Transactions on Computers, vol. 39, no. 1, pp. 89–98, 1990.

    [38] T. G. Noll, “Carry-save arithmetic for high-speed digital signal processing,” in Proceedings of the IEEE International Symposium on Circuits and Systems, vol. 2, pp. 982–986, May 1990.

    [39] R. Andraka, “A survey of CORDIC algorithms for FPGA based computers,” in Proceedings of the 6th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’98), pp. 191–200, February 1998.

    [40] P. W. Baker, “Suggestion for a fast binary sine/cosine generator,” IEEE Transactions on Computers, vol. 25, no. 11, pp. 1134–1136, 1976.

    [41] Y. H. Hu, “Pipelined CORDIC architecture for the implementation of rotational based algorithm,” in Proceedings of the International Symposium on VLSI Technology, Systems and Applications, p. 259, May 1985.

    [42] M. D. Ercegovac and T. Lang, “Fast cosine/sine implementation using on-line CORDIC,” in Proceedings of the 21st Asilomar Conference on Signals, Systems, and Computers, 1987.

    [43] E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, “High performance rotation architectures based on the Radix-4 CORDIC algorithm,” IEEE Transactions on Computers, vol. 46, no. 8, pp. 855–870, 1997.

    [44] D. Timmermann, H. Hahn, B. J. Hosticka, and B. Rix, “A new addition scheme and fast scaling factor compensation methods for CORDIC algorithms,” The VLSI Journal on Integration, vol. 11, no. 1, pp. 85–100, 1991.

    [45] J. Villalba, J. A. Hidalgo, E. L. Zapata, E. Antelo, and J. D. Bruguera, “CORDIC architectures with parallel compensation of the scale factor,” in Proceedings of the International Conference on Application Specific Array Processors, pp. 258–269, Strasbourg, France, July 1995.

    [46] X. Hu, R. G. Harber, and S. C. Bass, “Expanding the range of convergence of the CORDIC algorithm,” IEEE Transactions on Computers, vol. 40, no. 1, pp. 13–21, 1991.

    [47] Y. H. Hu, “The quantization effects of the CORDIC algorithm,” IEEE Transactions on Signal Processing, vol. 40, no. 4, pp. 834–844, 1992.

    [48] M. D. Ercegovac and T. Lang, Digital Arithmetic, Elsevier, Amsterdam, The Netherlands, 2004.

    [49] E. Antelo, J. Villalba, and E. L. Zapata, “A low-latency pipelined 2D and 3D CORDIC processors,” IEEE Transactions on Computers, vol. 57, no. 3, pp. 404–417, 2008.

    [50] H. M. Ahmed, “Efficient elementary function generation with multipliers,” in Proceedings of the 9th Symposium on Computer Arithmetic, pp. 52–59, September 1989.

    [51] N. Takagi, T. Asada, and S. Yajima, “Redundant CORDIC methods with a constant scale factor for sine and cosine computation,” IEEE Transactions on Computers, vol. 40, no. 9, pp. 989–995, 1991.

    [52] J. Duprat and J.-M. Muller, “The CORDIC algorithm: new results for fast VLSI implementation,” IEEE Transactions on Computers, vol. 42, no. 2, pp. 168–178, 1993.

    [53] D. S. Phatak, “Double step branching CORDIC: a new algorithm for fast sine and cosine generation,” IEEE Transactions on Computers, vol. 47, no. 5, pp. 587–602, 1998.

    [54] J. A. Lee and T. Lang, “Constant-factor redundant CORDIC for angle calculation and rotation,” IEEE Transactions on Computers, vol. 41, no. 8, pp. 1016–1025, 1992.

    [55] D. Timmermann, H. Hahn, and B. J. Hosticka, “Low latency time CORDIC algorithms,” IEEE Transactions on Computers, vol. 41, no. 8, pp. 1010–1015, 1992.

    [56] H. Dawid and H. Meyr, “The differential CORDIC algorithm: constant scale factor redundant implementation without correcting iterations,” IEEE Transactions on Computers, vol. 45, no. 3, pp. 307–318, 1996.

    [57] H. Dawid and H. Meyr, “High speed bit-level pipelined architectures for redundant CORDIC implementation,” in Proceedings of the International Conference on Application, pp. 358–372, 1992.

    [58] J. D. Bruguera, E. Antelo, and E. L. Zapata, “Design of a pipelined Radix 4 CORDIC processor,” Parallel Computing, vol. 19, no. 7, pp. 729–744, 1993.

    [59] E. Antelo, J. D. Bruguera, and E. L. Zapata, “Unified mixed Radix 2-4 redundant CORDIC processor,” IEEE Transactions on Computers, vol. 45, no. 9, pp. 1068–1073, 1996.

    [60] S. Wang, V. Piuri, and E. E. Swartzlander Jr., “Hybrid CORDIC algorithms,” IEEE Transactions on Computers, vol. 46, no. 11, pp. 1202–1207, 1997.

    [61] M. Kuhlmann and K. K. Parhi, “P-CORDIC: a precomputation based rotation CORDIC algorithm,” EURASIP Journal on Applied Signal Processing, vol. 2002, no. 9, pp. 936–943, 2002.

    [62] M. Kuhlmann and K. K. Parhi, “A high-speed CORDIC algorithm and architecture for DSP applications,” in Proceedings of the IEEE Workshop on Signal Processing Systems (SiPS ’99), pp. 732–741, October 1999.

    [63] B. Gisuthan and T. Srikanthan, “Pipelining flat CORDIC based trigonometric function generators,” Microelectronics Journal, vol. 33, no. 1-2, pp. 77–89, 2002.

    [64] T.-B. Juang, S.-F. Hsiao, and M.-Y. Tsai, “Para-CORDIC: parallel CORDIC rotation algorithm,” IEEE Transactions on Circuits and Systems I, vol. 51, no. 8, pp. 1515–1524, 2004.

    [65] H. S. Kebbati, J. Ph. Blonde, and F. Braun, “A new semi-flat architecture for high speed and reduced area CORDIC chip,” Microelectronics Journal, vol. 37, no. 2, pp. 181–187, 2006.

    [66] T. Srikanthan and B. Gisuthan, “A novel technique for eliminating iterative based computation of polarity of micro-rotations in CORDIC based sine-cosine generators,” Microprocessors and Microsystems, vol. 26, no. 5, pp. 243–252, 2002.
