

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 56, NO. 9, SEPTEMBER 2009 1893

50 Years of CORDIC: Algorithms, Architectures, and Applications

Pramod K. Meher, Senior Member, IEEE, Javier Valls, Member, IEEE, Tso-Bing Juang, Member, IEEE, K. Sridharan, Senior Member, IEEE, and Koushik Maharatna, Member, IEEE

Abstract—The year 2009 marks 50 years since the invention of CORDIC (COordinate Rotation DIgital Computer) by Jack E. Volder. The beauty of CORDIC lies in the fact that, using only simple shift-add operations, it can perform several computing tasks such as the calculation of trigonometric, hyperbolic and logarithmic functions, real and complex multiplications, division, square-root, solution of linear systems, eigenvalue estimation, singular value decomposition, QR factorization and many others. As a consequence, CORDIC has been utilized for applications in diverse areas such as signal and image processing, communication systems, robotics and 3-D graphics, apart from general scientific and technical computation. In this article, we present a brief overview of the key developments in CORDIC algorithms and architectures along with their potential and upcoming applications.

Index Terms—Arithmetic circuits, CORDIC, CORDIC algorithms, digital signal processing chip, VLSI.

I. INTRODUCTION

COORDINATE Rotation DIgital Computer is abbreviated as CORDIC. The key concept of CORDIC arithmetic is based on the simple and ancient principles of two-dimensional geometry, but the iterative formulation of a computational algorithm for its implementation was first described in 1959 by Jack E. Volder [1], [2] for the computation of trigonometric functions, multiplication and division. This year therefore marks the completion of 50 years of the CORDIC algorithm. Not only has a wide variety of applications of CORDIC emerged in the last 50 years, but a lot of progress has also been made in algorithm design and in the development of architectures for high-performance and low-cost hardware solutions of those applications. CORDIC-based computing received increased attention in 1971, when John Walther [3], [4] showed that, by varying a few simple parameters, it could be used as a single algorithm for unified implementation of a wide range of elementary transcendental functions involving logarithms, exponentials, and square roots along with those suggested by Volder [1]. Around the same time, Cochran [5] benchmarked various algorithms and showed that the CORDIC technique is a better choice for scientific calculator applications.

Manuscript received August 22, 2008; revised November 26, 2008 and April 10, 2009. First published June 19, 2009; current version published September 02, 2009. This paper was recommended by Associate Editor V. Öwall.

P. K. Meher is with the Department of Communication Systems, Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]).

J. Valls is with Instituto de Telecomunicaciones y Aplicaciones Multimedia, Universidad Politécnica de Valencia, 46730 Grao de Gandia, Spain (e-mail: [email protected]).

T.-B. Juang is with the Department of Computer Science and Information Engineering, National Pingtung Institute of Commerce, Pingtung City, Taiwan 900 (e-mail: [email protected]).

K. Sridharan is with the Department of Electrical Engineering, Indian Institute of Technology Madras, Chennai 600036, India (e-mail: [email protected]).

K. Maharatna is with the School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, U.K. (e-mail: [email protected]).

Digital Object Identifier 10.1109/TCSI.2009.2025803

The popularity of CORDIC grew considerably thereafter, primarily due to its potential for efficient and low-cost implementation of a large class of applications, which include: the generation of trigonometric, logarithmic and transcendental elementary functions; complex number multiplication, eigenvalue computation, matrix inversion, solution of linear systems and singular value decomposition (SVD) for signal processing, image processing, and general scientific computation. Some other popular and upcoming applications are:

1) direct frequency synthesis, digital modulation and coding for speech/music synthesis and communication;

2) direct and inverse kinematics computation for robot manipulation;

3) planar and three-dimensional vector rotation for graphics and animation.

Although CORDIC may not be the fastest technique to perform these operations, it is attractive due to the simplicity of its hardware implementation, since the same iterative algorithm could be used for all these applications using the basic shift-add operations of the form $a \pm b \cdot 2^{-i}$.

Keeping the requirements and constraints of different application environments in view, CORDIC algorithms and architectures have been developed to achieve a high throughput rate and to reduce hardware complexity as well as the latency of implementation. Typical approaches to reduced-complexity implementation focus on minimizing the complexity of the scaling operation and of the barrel-shifter in the CORDIC engine. Latency of implementation is an inherent drawback of the conventional CORDIC algorithm. Angle recoding schemes, mixed-grain rotation and higher-radix CORDIC have been developed for reduced-latency realization. Parallel and pipelined CORDIC have been suggested for high-throughput computation. The objective of this article is not to present a detailed survey of the developments of algorithms, architectures and applications of CORDIC, which would require a few doctoral and masters level dissertations. Rather, we aim at providing the key developments in algorithms and architectures along with an overview of the major application areas and upcoming applications. We shall, however, discuss here the basic principles of CORDIC operations for the benefit of general readers.

1549-8328/$26.00 © 2009 IEEE

Authorized licensed use limited to: UNIVERSITY OF SOUTHAMPTON. Downloaded on September 15, 2009 at 13:19 from IEEE Xplore. Restrictions apply.

The remainder of this paper is organized as follows. In Section II, we discuss the principles of CORDIC operation, covering the elementary ideas from coordinate transformation to rotation mode and vectoring mode operations, followed by the design of the basic CORDIC cell and multidimensional CORDIC. The key developments in CORDIC algorithms and architectures are discussed in Section III, which covers the algorithms and architectures pertaining to higher-radix CORDIC, angle recoding, coarse-fine hybrid micro-rotations, redundant number representation, differential CORDIC, and pipelined implementation. In Section IV, we discuss the scaling and accuracy aspects, including the scaling techniques, scaling-free CORDIC, quantization and area-delay-accuracy trade-off. The applications of CORDIC to scientific computations, signal processing, communications, robotics and graphics are discussed briefly in Section V. The conclusion, along with future research directions, is presented in Section VI.

II. BASIC CORDIC TECHNIQUES

In this section, we discuss the basic principle underlying CORDIC-based computation, and present its iterative algorithm for different operating modes and planar coordinate systems. At the end of this section, we discuss the extension of two-dimensional rotation to a multidimensional formulation.

A. The CORDIC Algorithm

As shown in Fig. 1, the rotation of a two-dimensional vector $p_0 = [x_0\ \ y_0]^T$ through an angle $\theta$, to obtain a rotated vector $p_n = [x_n\ \ y_n]^T$, could be performed by the matrix product $p_n = R\,p_0$, where $R$ is the rotation matrix:

$$R = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \tag{1}$$

By factoring out the cosine term in (1), the rotation matrix $R$ can be rewritten as

$$R = \left[(1 + \tan^2\theta)^{-1/2}\right] \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \tag{2}$$

and can be interpreted as a product of a scale-factor $K = (1 + \tan^2\theta)^{-1/2} = \cos\theta$ with a pseudorotation matrix $R_c$, given by

$$R_c = \begin{bmatrix} 1 & -\tan\theta \\ \tan\theta & 1 \end{bmatrix} \tag{3}$$

The pseudorotation operation rotates the vector $p_0$ by an angle $\theta$ and changes its magnitude by a factor $1/K = 1/\cos\theta$, to produce a pseudo-rotated vector $p'_n = R_c\,p_0$.

To achieve simplicity of hardware realization of the rotation, the key ideas used in CORDIC arithmetic are (i) to decompose the rotations into a sequence of elementary rotations through predefined angles that could be implemented with minimum hardware cost; and (ii) to avoid scaling, which might involve arithmetic operations such as square-root and division. The second idea is based on the fact that the scale-factor contains only the magnitude information but no information about the angle of rotation.

Fig. 1. Rotation of vector on a two-dimensional plane.

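1) Iterative Decomposition of Angle of Rotation: The CORDIC algorithm performs the rotation iteratively by breaking down the angle of rotation into a set of small predefined angles¹, $\alpha_i = \arctan(2^{-i})$, so that $\tan\alpha_i = 2^{-i}$ could be implemented in hardware by shifting through $i$ bit locations. Instead of performing the rotation directly through an angle $\theta$, CORDIC performs it by a certain number of microrotations through angles $\alpha_i$, where

$$\theta = \sum_{i=0}^{n-1} \sigma_i \alpha_i, \quad \text{and} \quad \sigma_i = \pm 1 \tag{4}$$

that satisfies the CORDIC convergence theorem [3]: $\alpha_i - \sum_{j=i+1}^{n-1} \alpha_j < \alpha_{n-1}$ for all $i = 0, 1, \ldots, n-2$. But the decomposition according to (4) could be used only for $-1.74329 \le \theta \le 1.74329$ (called the "convergence range"), since $\sum_{i=0}^{\infty} \alpha_i \approx 1.74329$. Therefore, the angular decomposition of (4) is applicable for angles in the first and fourth quadrants. To obtain an on-the-fly decomposition of angles into the discrete base $\alpha_i$, one may otherwise use the nonrestoring decomposition [6]

$$\omega_0 = \theta \quad \text{and} \quad \omega_{i+1} = \omega_i - \sigma_i \alpha_i \tag{5}$$

with $\sigma_i = 1$ if $\omega_i \ge 0$ and $\sigma_i = -1$ otherwise, where the rotation matrix for the $i$th iteration corresponding to the selected angle $\alpha_i$ is given by

$$R(i) = K_i \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \tag{6}$$

$K_i = 1/\sqrt{1 + 2^{-2i}}$ being the scale-factor, and the pseudorotation matrix

$$R_c(i) = \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \tag{7}$$

Note that the pseudo-rotation matrix $R_c(i)$ for the $i$th iteration alters the magnitude of the rotated vector by a factor $1/K_i = \sqrt{1 + 2^{-2i}}$ during the $i$th microrotation, which is independent of the value of $\sigma_i$ (direction of microrotation) used in the angle decomposition.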
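The nonrestoring decomposition of (5) can be sketched in a few lines of Python. This is only an illustrative floating-point model, assuming $\omega_0 = \theta$ and $\alpha_i = \arctan(2^{-i})$; the function name `decompose_angle` is hypothetical and not part of the paper.

```python
import math

def decompose_angle(theta, n):
    """Nonrestoring decomposition of theta over alpha_i = atan(2**-i), i = 0..n-1.

    Returns the directions sigma_i (each +1 or -1) and the final residual angle.
    """
    omega = theta                       # assumed start: omega_0 = theta
    sigmas = []
    for i in range(n):
        sigma = 1 if omega >= 0 else -1  # direction from the sign of the residual
        omega -= sigma * math.atan(2.0 ** -i)
        sigmas.append(sigma)
    return sigmas, omega

sigmas, residual = decompose_angle(0.6, 16)
# The signed sum of elementary angles approximates the target angle.
approx = sum(s * math.atan(2.0 ** -i) for i, s in enumerate(sigmas))
print(sigmas[:4], approx, residual)
```

For an angle inside the convergence range, the residual after $n$ steps should be no larger than the smallest elementary angle $\alpha_{n-1}$.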

¹ All angles are measured in radians unless otherwise stated.


Fig. 2. Hardware implementation of a CORDIC iteration.

2) Avoidance of Scaling: The other simplification performed by Volder's algorithm [1] is to remove the scale-factor $K_i = 1/\sqrt{1 + 2^{-2i}}$ from (6). The removal of scaling from the iterative microrotations leads to a pseudo-rotated vector $p'_n = p_n / K$ instead of the desired rotated vector $p_n$, where the scale-factor $K$ is given by

$$K = \prod_{i=0}^{n-1} K_i = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2i}}} \tag{8}$$

Since the scale-factor of the microrotations does not depend on the direction of the microrotations, and the partial products decrease monotonically, the final scale-factor converges to $K \approx 0.60725$ (that is, $1/K \approx 1.6467605$). Therefore, instead of scaling during each microrotation, the magnitude of the final output could be scaled by $K$. The basic CORDIC iterations are thus obtained by applying the pseudo-rotation to the vector $p_i = [x_i\ \ y_i]^T$ to have $p_{i+1} = R_c(i)\,p_i$, together with the nonrestoring decomposition of the selected angles $\alpha_i$, as follows:

$$\begin{aligned} x_{i+1} &= x_i - \sigma_i\, 2^{-i}\, y_i \\ y_{i+1} &= y_i + \sigma_i\, 2^{-i}\, x_i \\ \omega_{i+1} &= \omega_i - \sigma_i\, \alpha_i \end{aligned} \tag{9}$$

CORDIC iterations of (9) could be used in two operating modes, namely the rotation mode (RM) and the vectoring mode (VM), which differ basically in how the directions of the microrotations are chosen. In the rotation mode, a vector $p_0$ is rotated by an angle $\theta$ to obtain a new vector $p'_n$. In this mode, the direction of each microrotation $\sigma_i$ is determined by the sign of $\omega_i$: if $\omega_i$ is positive, then $\sigma_i = 1$, otherwise $\sigma_i = -1$. In the vectoring mode, the vector $p_0$ is rotated towards the $x$-axis so that the $y$-component approaches zero. The sum of all angles of microrotations (output angle $\omega_n$) is equal to the angle of rotation of vector $p_0$, while output $x'_n$ corresponds to its magnitude. In this operating mode, the decision about the direction of the microrotation depends on the sign of $y_i$: if it is positive then $\sigma_i = -1$, otherwise $\sigma_i = 1$. CORDIC iterations are easily implemented in both software and hardware. Fig. 2 shows the basic hardware stage for a single CORDIC iteration. After each iteration the number of shifts is incremented by a pair of barrel-shifters. To have an $n$-bit output precision, $(n+1)$ CORDIC iterations are needed. Note that the shifting could be implemented by a simple selection operation in serial architectures like the one proposed in the original work, while in fully parallel CORDIC architectures the shift operations could be hardwired, so that no barrel-shifters are involved.
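The two operating modes can be illustrated with a short floating-point model of the shift-add iterations. This is a sketch under stated assumptions rather than a hardware description: multiplication by `2.0 ** -i` stands in for an arithmetic right shift, the scale-factor is compensated once at the end, and the function name `cordic_circular` and its `mode` argument are hypothetical.

```python
import math

def cordic_circular(x, y, omega, n, mode):
    """Run n circular CORDIC iterations and return the scale-compensated result.

    mode == "rotation":  sigma follows the sign of the residual angle omega.
    mode == "vectoring": sigma is chosen to drive y toward zero.
    """
    gain = 1.0                                  # accumulated magnitude growth
    for i in range(n):
        if mode == "rotation":
            sigma = 1 if omega >= 0 else -1
        else:
            sigma = -1 if y >= 0 else 1
        t = 2.0 ** -i                           # models a shift by i bits
        x, y = x - sigma * t * y, y + sigma * t * x
        omega -= sigma * math.atan(t)
        gain *= math.sqrt(1.0 + t * t)
    return x / gain, y / gain, omega

# Rotation mode: rotate (1, 0) through 0.5 rad to obtain (cos 0.5, sin 0.5).
c, s, _ = cordic_circular(1.0, 0.0, 0.5, 24, "rotation")
# Vectoring mode: magnitude and phase of the vector (3, 4).
mag, y_res, phase = cordic_circular(3.0, 4.0, 0.0, 24, "vectoring")
print(c, s, mag, phase)
```

In vectoring mode the angle register starts at zero and accumulates the phase of the input vector, which must lie within the convergence range.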

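Finally, to overcome the problem of the limited convergence range and to extend the CORDIC rotations to the complete range of $\pm\pi$, an extra iteration is required to be performed. This new iteration, shown in (10), is required as an initial rotation through $\pm\pi/2$:

$$\begin{aligned} x_0 &= -\sigma_{-1}\, y_{-1} \\ y_0 &= \sigma_{-1}\, x_{-1} \\ \omega_0 &= \omega_{-1} - \sigma_{-1}\, \alpha_{-1}, \quad \text{where } \alpha_{-1} = \pi/2 \end{aligned} \tag{10}$$

TABLE I: GENERALIZED CORDIC ALGORITHM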
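A minimal sketch of the initial rotation through $\pm\pi/2$ follows, assuming rotation mode so that the direction is taken from the sign of the input angle. The helper name `pre_rotate` is hypothetical. The step only swaps and negates coordinates, so it needs no shift or multiplication.

```python
import math

def pre_rotate(x, y, omega):
    """Initial rotation through +pi/2 or -pi/2, chosen by the sign of omega."""
    sigma = 1 if omega >= 0 else -1
    return -sigma * y, sigma * x, omega - sigma * math.pi / 2

# An angle of 2.5 rad lies outside the convergence range of about 1.7433 rad.
x0, y0, w0 = pre_rotate(1.0, 0.0, 2.5)
# The residual angle w0 is now small enough for the regular iterations.
print(x0, y0, w0)
```

For any input angle in $[-\pi, \pi]$ the residual lies in $[-\pi/2, \pi/2]$, which is inside the convergence range of the regular microrotations.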

B. Generalization of the CORDIC Algorithm

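In 1971, Walther found how CORDIC iterations could be modified to compute hyperbolic functions [3] and reformulated the CORDIC algorithm into a generalized and unified form which is suitable to perform rotations in circular, hyperbolic and linear coordinate systems. The unified formulation includes a new variable $m$, which is assigned different values for different coordinate systems. The generalized CORDIC is formulated as follows:

$$\begin{aligned} x_{i+1} &= x_i - m\, \sigma_i\, 2^{-i}\, y_i \\ y_{i+1} &= y_i + \sigma_i\, 2^{-i}\, x_i \\ \omega_{i+1} &= \omega_i - \sigma_i\, \alpha_i \end{aligned} \tag{11}$$

where

$$\sigma_i = \begin{cases} \mathrm{sign}(\omega_i), & \text{for rotation mode} \\ -\mathrm{sign}(y_i), & \text{for vectoring mode} \end{cases}$$

For $m = 1$, $0$ or $-1$, and $\alpha_i = \tan^{-1}(2^{-i})$, $2^{-i}$ or $\tanh^{-1}(2^{-i})$, the algorithm given by (11) works in circular, linear or hyperbolic coordinate systems, respectively. Table I summarizes the operations that can be performed in rotation and vectoring modes² in each of these coordinate systems. The convergence ranges of linear and hyperbolic CORDIC are obtained, as in the case of the circular coordinate system, by the sum of all $\alpha_i$, given by $C = \sum_i \alpha_i$ (with $i \ge 1$ in the hyperbolic case). The hyperbolic CORDIC requires the iterations for $i = 4, 13, 40, \ldots$ to be executed twice to ensure convergence. Consequently, these repetitions must be considered while computing the scale-factor $K_h = \prod_i \sqrt{1 - 2^{-2i}}$, which converges to 0.8281.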
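The unified iteration can be modelled as follows. This is a floating-point sketch under several assumptions: the circular and linear modes start at $i = 0$, the hyperbolic mode starts at $i = 1$ and repeats $i = 4, 13, 40, \ldots$, and the gain is divided out at the end. All function names are hypothetical.

```python
import math

def shift_sequence(m, n):
    """Iteration indices; hyperbolic mode (m = -1) starts at 1 and repeats 4, 13, 40, ..."""
    if m != -1:
        return list(range(n))
    seq, repeat, i = [], 4, 1
    while len(seq) < n:
        seq.append(i)
        if i == repeat and len(seq) < n:
            seq.append(i)                 # repeated iteration for convergence
            repeat = 3 * repeat + 1
        i += 1
    return seq

def elementary_angle(m, i):
    t = 2.0 ** -i
    return math.atan(t) if m == 1 else (t if m == 0 else math.atanh(t))

def generalized_cordic(x, y, omega, m, n, mode="rotation"):
    """Unified CORDIC for m = 1 (circular), 0 (linear), -1 (hyperbolic)."""
    gain = 1.0
    for i in shift_sequence(m, n):
        if mode == "rotation":
            sigma = 1 if omega >= 0 else -1
        else:
            sigma = -1 if y >= 0 else 1
        t = 2.0 ** -i
        x, y = x - m * sigma * t * y, y + sigma * t * x
        omega -= sigma * elementary_angle(m, i)
        gain *= math.sqrt(1.0 + m * t * t)
    return x / gain, y / gain, omega

cosh_z, sinh_z, _ = generalized_cordic(1.0, 0.0, 0.8, -1, 32)    # hyperbolic rotation
_, product, _ = generalized_cordic(3.0, 0.0, 0.5, 0, 32)         # linear rotation: 3 * 0.5
_, _, quotient = generalized_cordic(4.0, 2.0, 0.0, 0, 32, "vectoring")  # linear vectoring: 2 / 4
print(cosh_z, sinh_z, product, quotient)
```

With the repeated indices included, the product of $\sqrt{1 - 2^{-2i}}$ over the hyperbolic sequence evaluates to about 0.8281, in agreement with the value quoted in the text.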

² In the rotation mode, the components of a vector resulting from the rotation of a vector through a given angle are derived, while in the vectoring mode the magnitude as well as the phase angle of a vector are estimated from the component values. The rotation and vectoring modes are also known as the vector rotation mode and the angle accumulation mode, respectively.


C. Multidimensional CORDIC

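The CORDIC algorithm was extended to higher dimensions using simple Householder reflection [7]. The Householder reflection matrix is defined as

$$H_m = I_m - 2\,\frac{u\,u^T}{u^T u} \tag{12}$$

where $u$ is an $m$-dimensional vector and $I_m$ is the $m \times m$ identity matrix. The product $H_m v$ reflects the $m$-dimensional vector $v$ with respect to the hyperplane with normal $u$ that passes through the origin. Basically, the Householder-based CORDIC performs the vectoring operation of an $m$-dimensional vector to one of the axes.

For the sake of clarity, we consider here the case of a 3-D vector $p_0 = [x_0\ \ y_0\ \ z_0]^T$ projected on to the $x$-axis in the Euclidean space. The rotation matrix for the 3-D case, corresponding to the $i$th iteration, $R_{H3}(i)$, is given by the product of two simple Householder reflections as

$$R_{H3}(i) = \left[I_3 - 2\,\frac{e_1 e_1^T}{e_1^T e_1}\right] \left[I_3 - 2\,\frac{u_i u_i^T}{u_i^T u_i}\right] \tag{13}$$

where $e_1 = [1\ \ 0\ \ 0]^T$, and $u_i = [1\ \ \sigma_{yi} t_i\ \ \sigma_{zi} t_i]^T$ with $t_i = \tan\alpha_i = 2^{-i}$, and $\sigma_{yi} = \mathrm{sign}(x_i y_i)$ and $\sigma_{zi} = \mathrm{sign}(x_i z_i)$ being the directions of microrotations.

One can write the $i$th rotation matrix in terms of the pseudo-rotation matrix as $R_{H3}(i) = K_{Hi}\, R_{HC3}(i)$, where $K_{Hi} = 1/(1 + 2^{-2i+1})$ is the scale-factor and $R_{HC3}(i)$ is the pseudo-rotation matrix, which could be expressed as a function of the shifting and decision variables as

$$R_{HC3}(i) = \begin{bmatrix} 1 - 2^{-2i+1} & \sigma_{yi}\, 2^{-i+1} & \sigma_{zi}\, 2^{-i+1} \\ -\sigma_{yi}\, 2^{-i+1} & 1 & -\sigma_{yi}\sigma_{zi}\, 2^{-2i+1} \\ -\sigma_{zi}\, 2^{-i+1} & -\sigma_{yi}\sigma_{zi}\, 2^{-2i+1} & 1 \end{bmatrix} \tag{14}$$

Therefore, the $i$th iteration of the 3-D Householder CORDIC rotation results in $p_{i+1} = R_{HC3}(i)\, p_i$, and the vector $p_0$ is projected on to the $x$-axis, such that after $n$ iterations $x_n$ gives the length of the vector scaled by $1/K_H$, where $K_H = \prod_i K_{Hi}$, with $n$-bit precision [8].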
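The relation between the product of two Householder reflections and a shift-add pseudo-rotation can be checked numerically. The sketch below assumes $u_i = [1,\ \sigma_y 2^{-i},\ \sigma_z 2^{-i}]^T$ and $e_1 = [1, 0, 0]^T$. The shift-add matrix in `pseudo_rotation_3d` is derived here from that product, so its presentation may differ from the form printed in the paper. It does not model the iteration schedule or convergence behaviour analysed in [7], [8]. All function names are hypothetical.

```python
def householder(u):
    """Return I - 2 u u^T / (u^T u) as a nested list."""
    n = len(u)
    uu = sum(c * c for c in u)
    return [[(1.0 if r == c else 0.0) - 2.0 * u[r] * u[c] / uu
             for c in range(n)] for r in range(n)]

def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def pseudo_rotation_3d(i, sy, sz):
    """Shift-add form of the i-th 3-D pseudo-rotation and its scale-factor."""
    two_t = 2.0 ** (-i + 1)          # 2 * t_i, a shift by i - 1 bits
    two_t2 = 2.0 ** (-2 * i + 1)     # 2 * t_i**2, a shift by 2i - 1 bits
    r = [[1.0 - two_t2, sy * two_t, sz * two_t],
         [-sy * two_t, 1.0, -sy * sz * two_t2],
         [-sz * two_t, -sy * sz * two_t2, 1.0]]
    return r, 1.0 / (1.0 + two_t2)

def max_mismatch(i, sy, sz):
    """Largest entry-wise gap between the reflection product and K * pseudo-rotation."""
    t = 2.0 ** -i
    exact = matmul(householder([1.0, 0.0, 0.0]), householder([1.0, sy * t, sz * t]))
    r, k = pseudo_rotation_3d(i, sy, sz)
    return max(abs(exact[a][b] - k * r[a][b]) for a in range(3) for b in range(3))

print(max_mismatch(3, 1, -1))
```

Every entry of the pseudo-rotation matrix is either 1, a power of two, or a difference of the two, so each row can be evaluated with shifts and additions only.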

III. ADVANCED CORDIC ALGORITHMS AND ARCHITECTURES

CORDIC computation is inherently sequential due to two main bottlenecks: 1) the micro-rotation for any iteration is performed on the intermediate vector computed by the previous iteration and 2) the $(i+1)$th iteration could be started only after the completion of the $i$th iteration, since the value of $\sigma_{i+1}$, which is required to start the $(i+1)$th iteration, could be known only after the completion of the $i$th iteration. To alleviate the second bottleneck, some attempts have been made to evaluate the $\sigma_i$ values corresponding to small micro-rotation angles [9], [10]. However, the CORDIC iterations still could not be performed in parallel due to the first bottleneck. A partial parallelization has been realized in [11] by combining a pair of conventional CORDIC iterations into a single merged iteration, which provides better area-delay efficiency. But the accuracy is slightly affected by such merging, and the approach cannot be extended to a higher number of conventional CORDIC iterations since the induced error becomes unacceptable [11]. Parallel realization of CORDIC iterations to handle the first bottleneck by direct unfolding of micro-rotations is possible, but that would increase the computational complexity and degrade the advantage of simplicity of the CORDIC algorithm [12], [13]. Although no popular architectures are known to us for fully parallel implementation of CORDIC, different forms of pipelined implementation of CORDIC have been proposed for improving the computational throughput [14].

Since the CORDIC algorithm exhibits linear-rate convergence, it requires $(n+1)$ iterations to have $n$-bit precision of the output. The overall latency of the computation thus amounts to the product of the word-length and the CORDIC iteration period. The speed of CORDIC operations is therefore constrained either by the precision requirement (iteration count) or by the duration of the clock period. The duration of the clock period, on the other hand, mainly depends on the large carry propagation time for the addition/subtraction during each micro-rotation. It is a straightforward choice to use fast adders for reducing the iteration period at the expense of large silicon area. Use of carry-save adders is a good option to reduce the iteration period and overall latency [15]. Timmermann and others have suggested a method of truncation of the CORDIC algorithm after $(n-1)/2$ iterations (for $n$-bit precision), where the last iteration performs a single rotation for implementing the remaining angle. It lowers the latency but involves one multiplication or division, respectively, in the rotation or vectoring mode [9].

To handle latency bottlenecks, various techniques have been developed and reported in the literature. Most of the well-known algorithms could be grouped under high-radix CORDIC, the angle-recoding method, the hybrid micro-rotation scheme, redundant CORDIC and differential CORDIC, which we discuss briefly in the following subsections.

A. Higher Radix CORDIC Algorithm

The radix-4 CORDIC algorithm [16] is given by

$$\begin{aligned} x_{i+1} &= x_i - \sigma_i\, 4^{-i}\, y_i \\ y_{i+1} &= y_i + \sigma_i\, 4^{-i}\, x_i \\ \omega_{i+1} &= \omega_i - \alpha_i[\sigma_i] \end{aligned} \tag{15}$$

where $\sigma_i \in \{-2, -1, 0, 1, 2\}$ and the elementary angles $\alpha_i[\sigma_i] = \tan^{-1}(\sigma_i 4^{-i})$. The scale-factor for the $i$th iteration is $K_i = 1/\sqrt{1 + \sigma_i^2 4^{-2i}}$. In order to preserve the norm of the vector, the output of the micro-rotations is required to be scaled by a factor

$$K = \prod_i \frac{1}{\sqrt{1 + \sigma_i^2\, 4^{-2i}}} \tag{16}$$

To have $n$-bit output precision, the radix-4 CORDIC algorithm requires $n/2$ micro-rotations, which is half that of the radix-2 algorithm. However, it requires more computation time for each iteration and involves more hardware compared to the radix-2 CORDIC to select the value of $\sigma_i$ out of five different possibilities. Moreover, the scale-factor, given by (16), also varies with the rotation angle, since it depends on $\sigma_i$, which could have any of the five different values. Some techniques have therefore been suggested for scale-factor compensation through iterative shift-add operations [16], [17]. A high-radix CORDIC algorithm in vectoring mode is also suggested in [18], which can be used for reduced-latency operation at the cost of larger tables for storing the elementary angles and pre-scaling factors than the radix-2 and radix-4 implementations.
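The dependence of the radix-4 scale-factor on the selected digits can be seen in a small numerical model. This sketch applies micro-rotations for a digit sequence that is supplied by the caller; it does not implement the digit-selection logic of [16]. The function name `radix4_rotate` and the example digit sequences are hypothetical.

```python
import math

def radix4_rotate(x, y, digits):
    """Apply radix-4 micro-rotations for digits sigma_i in {-2, -1, 0, 1, 2}.

    Returns the scale-compensated vector, the total rotation angle and the
    scale-factor, which depends on the digits actually used.
    """
    angle, scale = 0.0, 1.0
    for i, sigma in enumerate(digits):
        t = sigma * 4.0 ** -i            # sigma * 4**-i: shifts plus an optional doubling
        x, y = x - t * y, y + t * x
        angle += math.atan(t)
        scale /= math.sqrt(1.0 + t * t)
    return scale * x, scale * y, angle, scale

xr, yr, angle, scale = radix4_rotate(1.0, 0.0, [1, -2, 0, 2])
# Two different digit sequences lead to two different scale-factors.
scale_a = radix4_rotate(1.0, 0.0, [1, 0, 0, 0])[3]
scale_b = radix4_rotate(1.0, 0.0, [0, 1, 0, 0])[3]
print(xr, yr, angle, scale, scale_a, scale_b)
```

Because a zero digit contributes no magnitude growth while a digit of $\pm 2$ contributes more than $\pm 1$, the compensation factor cannot be a fixed constant as it is in radix-2 CORDIC.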

B. Angle Recoding (AR) Methods

The purpose of angle recoding (AR) is to reduce the number of CORDIC iterations by encoding the angle of rotation as a linear combination of a set of selected elementary angles of micro-rotations. AR methods are well-suited for many signal processing and image processing applications where the rotation angle is known a priori, such as when performing discrete orthogonal transforms like the discrete Fourier transform (DFT), the discrete cosine transform (DCT), etc.

1) Elementary-Angle-Set Recoding: In the conventional CORDIC, any given rotation angle is expressed as a linear combination of $n$ values of elementary angles that belong to the set $S = \{\sigma \tan^{-1} 2^{-r} : \sigma \in \{-1, 1\},\ r \in \{0, 1, \ldots, n-1\}\}$ in order to obtain an $n$-bit value as $\theta = \sum_{i=0}^{n-1} \sigma_i \tan^{-1} 2^{-i}$. However, in AR methods, this constraint is relaxed by adding zeros to the linear combination to obtain the desired angle using relatively fewer terms of the form $\sigma \tan^{-1} 2^{-r}$ for $\sigma \in \{-1, 0, 1\}$. The elementary-angle-set (EAS) used by the AR scheme is given by $S_{EAS} = \{\sigma \tan^{-1} 2^{-r} : \sigma \in \{-1, 0, 1\},\ r \in \{0, 1, \ldots, n-1\}\}$. One of the simplest forms of the angle recoding method, based on the greedy algorithm proposed by Hu and Naganathan [19], tries to represent the remaining angle using the closest elementary angle $\pm\tan^{-1} 2^{-i}$. The angle recoding algorithm of [19] is briefly stated in Table II. Using this recoding scheme, the total number of iterations could be reduced by at least 50% while keeping the same $n$-bit accuracy. A similar method of angle recoding in vectoring mode, called backward angle recoding, is suggested in [20].
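The greedy idea described above can be sketched as follows. This is an illustration rather than a transcription of Table II. It assumes elementary angles $\tan^{-1} 2^{-i}$ for $i = 0, \ldots, n-1$, a target angle of moderate size, and termination once the residual falls below the smallest elementary angle. The function name and the iteration cap are hypothetical.

```python
import math

def greedy_angle_recoding(theta, n):
    """Repeatedly subtract the elementary angle closest to the remaining angle.

    Returns a list of (sigma, i) terms and the final residual angle.
    """
    alphas = [math.atan(2.0 ** -i) for i in range(n)]
    residual, terms = theta, []
    # Stop when the residual is below the smallest elementary angle.
    # The cap of n terms is only a safety guard for this sketch.
    while abs(residual) >= alphas[-1] and len(terms) < n:
        i = min(range(n), key=lambda j: abs(abs(residual) - alphas[j]))
        sigma = 1 if residual >= 0 else -1
        terms.append((sigma, i))
        residual -= sigma * alphas[i]
    return terms, residual

terms, residual = greedy_angle_recoding(0.3, 16)
recon = sum(s * math.atan(2.0 ** -i) for s, i in terms)
print(len(terms), terms, residual)
```

Elementary angles that do not appear in the returned list correspond to the zero coefficients of the recoded representation, which is where the saving in iterations comes from.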

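2) Extended Elementary-Angle-Set Recoding: Wu et al. [21] have suggested an AR scheme based on an extended elementary-angle-set (EEAS), which provides a more flexible way of decomposing the target rotation angle. In the EEAS approach, the elementary-angle set $S_{EAS}$ is extended further to $S_{EEAS} = \{\tan^{-1}(\sigma_1 2^{-r_1} + \sigma_2 2^{-r_2}) : \sigma_1, \sigma_2 \in \{-1, 0, 1\}$ and $r_1, r_2 \in \{0, 1, \ldots, n-1\}\}$. EEAS has better recoding efficiency in terms of the number of iterations and can yield better error performance than the AR scheme based on EAS. The pseudo-rotation for the $i$th micro-rotation based on the EEAS scheme is given by

$$\begin{aligned} x_{i+1} &= x_i - \left[\sigma_1(i)\, 2^{-r_1(i)} + \sigma_2(i)\, 2^{-r_2(i)}\right] y_i \\ y_{i+1} &= y_i + \left[\sigma_1(i)\, 2^{-r_1(i)} + \sigma_2(i)\, 2^{-r_2(i)}\right] x_i \end{aligned} \tag{17}$$

The pseudo-rotated vector $[x_{R_m}\ \ y_{R_m}]^T$, obtained after $R_m$ (the required number of micro-rotations) iterations according to (17), needs to be scaled by a factor $K = \prod_i K_i$, where $K_i = \left[1 + \left(\sigma_1(i)\, 2^{-r_1(i)} + \sigma_2(i)\, 2^{-r_2(i)}\right)^2\right]^{-1/2}$, to produce the rotated vector. For reducing the scaling approximation and for a more flexible implementation of scaling, similar to the EEAS scheme for the micro-rotation phase, a method has also been suggested in [21], as given below

$$\begin{aligned} \tilde{x}_{i+1} &= \tilde{x}_i + \left[k_1(i)\, 2^{-s_1(i)} + k_2(i)\, 2^{-s_2(i)}\right] \tilde{x}_i \\ \tilde{y}_{i+1} &= \tilde{y}_i + \left[k_1(i)\, 2^{-s_1(i)} + k_2(i)\, 2^{-s_2(i)}\right] \tilde{y}_i \end{aligned} \tag{18}$$

where $\tilde{x}_0 = x_{R_m}$ and $\tilde{y}_0 = y_{R_m}$, with $k_1, k_2 \in \{-1, 0, 1\}$ and $s_1, s_2 \in \{0, 1, \ldots, n-1\}$.

The iterations for the micro-rotation phase as well as the scaling phase could be implemented in the same architecture to reduce the hardware cost, as shown in Fig. 3.

TABLE II: ANGLE RECODING ALGORITHM

Fig. 3. EEAS-based CORDIC architecture. BS represents the Barrel Shifter, and C denotes the control signals for the micro-rotations.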

3) Parallel Angle Recoding: The AR methods [19], [21] could be used to reduce the number of iterations by more than 50% when the angle of rotation is known in advance. However, for unknown rotation angles, their hardware implementation involves more cycle time than the conventional implementation, which results in a reduction in the overall efficacy of the algorithm. To reduce the cycle time of CORDIC iterations in such cases, a parallel angle selection scheme is suggested in [22], which can be used in conjunction with the AR method to gain the advantages of the reduction in iteration count without further increase in the cycle time. The parallel AR scheme in [22] is based on dynamic angle selection, where the elementary angles $\alpha_j$ can be tested in parallel and the direction for the micro-rotations can be determined quickly to minimize the iteration period. During each iteration, the residual angle $\omega_i$ is passed to a set of adder-subtractor units that compute the difference $\Delta_j$ between $|\omega_i|$ and $\alpha_j$ for each elementary angle $\alpha_j$ in parallel, and the differences $\Delta_j$ for all $j$ are then fed to a binary-tree like structure to compare them against each other to find the smallest difference. The $\alpha_j$ corresponding to the smallest difference is used as the angle of micro-rotation. The architecture for parallel angle recoding of [22] is shown in Fig. 4.

The parallel AR reduces the overall latency at the cost of high hardware-complexity of the add/subtract-compare unit. For actual implementation, it is required to find a space-time trade-off and to look at the relative performance in comparison with other approaches as well. The AR schemes based on EAS and EEAS, however, are useful for those cases where the angle of rotation is known in advance.

Fig. 4. Architecture for parallel angle recoding.

C. Hybrid or Coarse-Fine Rotation CORDIC

Based on the radix-2 decomposition, any rotation angle $\theta$ with $n$-bit precision could be expressed as a linear combination of angles from the set of negative powers of two $\{2^{-i}\}$, given by $\theta = \sum_i b_i 2^{-i}$, where $b_i \in \{0, 1\}$ explicitly specifies whether there is need of a micro-rotation or not. But radix-2 decomposition is not used in the conventional CORDIC because that would not lead to simplicity of hardware realization. Instead, arctangents of the corresponding values of the radix-2 based set are used as the elementary-angle-set with a view to implementing the CORDIC operations only by shift-add operations. The key idea underlying the coarse-fine angular decomposition is that for the fine values of $i$ (i.e., when $i > \lceil n/3 \rceil - 1$), $\tan^{-1}(2^{-i})$ could be replaced by $2^{-i}$ in the radix-set for expansion of $\theta$, since $\tan^{-1}(2^{-i}) \approx 2^{-i}$ when $i$ is sufficiently large.

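1) Coarse–Fine Angular Decomposition: In the coarse-fine angular decomposition, the elementary-angle-set contains the arctangents of powers of two for the more-significant part, while the less-significant part contains the power-of-two values themselves, such that the radix-set is given by $S = S_1 \cup S_2$, where $S_1 = \{\tan^{-1}(2^{-i}) : i \in \{1, 2, \ldots, p-1\}\}$ and $S_2 = \{2^{-j} : j \in \{p, p+1, \ldots, n-1\}\}$, and $p$ is assumed to be sufficiently large such that $\tan^{-1}(2^{-j}) \approx 2^{-j}$ [10]. For the hybrid decomposition scheme, the rotation angle could be partitioned into two terms expressed as

$$\theta = \theta_M + \theta_L \tag{19}$$

where $\theta_M$ and $\theta_L$ are said to be the coarse and fine subangles, respectively, given by

$$\theta_M = \sum_{i=1}^{p-1} \sigma_i \tan^{-1}(2^{-i}), \quad \sigma_i \in \{1, -1\} \tag{20a}$$

$$\theta_L = \sum_{j=p}^{n-1} d_j 2^{-j}, \quad d_j \in \{0, 1\} \tag{20b}$$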

Fig. 5. Architecture for a Hybrid CORDIC algorithm [10].

A combination of coarse and fine micro-rotations is used in hybrid CORDIC operations in two cascaded stages. Coarse rotations are performed in stage-1 to have an intermediate vector

(21)

and fine rotations are performed on the output of stage-1 to obtain the rotated output

(22)

2) Implementation of Hybrid CORDIC: To derive the efficiency of hybrid CORDIC, the coarse and fine rotations are performed by separate circuits as shown in Fig. 5. The coarse rotation phase is performed by CORDIC processor-I and the fine rotation phase is performed by CORDIC processor-II.

To have fast implementation, processor-I performs a pair of ROM look-up operations followed by addition to realize the rotation through the coarse subangle. Since the fine subangle could be expressed as a linear combination of angles of small enough magnitude $2^{-j}$, where $\tan(2^{-j}) \approx 2^{-j}$, the computation of the fine rotation phase can be realized by a sequence of shift-and-add operations. For implementation of the fine rotation phase, no computations are involved to decide the direction of micro-rotation, since the need of a micro-rotation is explicit in the radix-2 representation of the fine subangle. The radix-2 representation could also be recoded to express the fine subangle with digits in $\{1, -1\}$, as shown in [9]. Since the directions of micro-rotations are explicit in such a representation, it would be possible to implement the fine rotation phase in parallel for low-latency realization.

The hybrid decomposition could be used for reducing the latency by ROM-based realization of the coarse operation. It can also be used for reducing the hardware complexity of the fine rotation phase, since there is no need to find the direction of micro-rotation. Several options are, however, possible for the implementation of these two stages. A form of hybrid CORDIC is suggested in [23] for very-high-precision CORDIC rotation, where the ROM size is reduced. The coarse rotations could be implemented as conventional CORDIC through shift-add operations of micro-rotations if the latency is tolerable.
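A minimal software sketch of the two-stage idea follows. It is an illustration under stated assumptions, not the architecture of [10]: the names `split_angle` and `hybrid_rotate`, the 16-bit angle format and the split point `P = 6` are chosen here for the example, and `math.cos`/`math.sin` merely stand in for the ROM contents of the coarse stage.

```python
import math

N_BITS = 16   # assumed angle precision (weights 2^-1 .. 2^-(N_BITS-1))
P = 6         # assumed split: weights 2^-1..2^-(P-1) coarse, 2^-P.. fine

def split_angle(theta):
    """Quantize theta (radians, 0 <= theta < 1) and split it into
    a coarse value and the integer holding the fine bits."""
    q = int(round(theta * 2 ** (N_BITS - 1)))
    fine_bits = q & ((1 << (N_BITS - P)) - 1)
    coarse = (q - fine_bits) / 2 ** (N_BITS - 1)
    return coarse, fine_bits

def hybrid_rotate(x, y, theta):
    coarse, fine_bits = split_angle(theta)
    # Stage 1: coarse rotation; cos/sin model the ROM look-up.
    c, s = math.cos(coarse), math.sin(coarse)
    x, y = x * c - y * s, x * s + y * c
    # Stage 2: fine rotation; one shift-add micro-rotation per set bit,
    # with no sign decision because the bit itself says "rotate or not".
    for j in range(P, N_BITS):
        if (fine_bits >> (N_BITS - 1 - j)) & 1:
            x, y = x - y * 2.0 ** -j, y + x * 2.0 ** -j
    return x, y

x_out, y_out = hybrid_rotate(1.0, 0.0, 0.7)
```

The fine stage uses $2^{-j}$ in place of $\tan(2^{-j})$ and leaves its small gain uncompensated, so the result agrees with the exact rotation only to within the accuracy implied by the chosen split point.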

3) Shift-Add Implementation of Coarse Rotation: Using the symmetry properties of the sine and cosine functions in different quadrants, the rotation through any arbitrary angle could be mapped from the full range $[0, 2\pi]$ to the first half of the first quadrant $[0, \pi/4]$. The coarse-fine partition could be applied thereafter for reducing the number of micro-rotations necessary for fine rotations. To implement the coarse rotations through shift-add operations, the coarse subangle $\theta_M$ is represented in [24] and [25] in terms of elementary rotations of the form $\tan^{-1}(2^{-i})$ as

$$\theta_M = \sum_{i} b_i \tan^{-1}(2^{-i}) + \delta_M \tag{23}$$

where $b_i \in \{0, 1\}$ and $\delta_M$ is a correction term. Using (23) on (19), one can find $\theta = \sum_{i} b_i \tan^{-1}(2^{-i}) + \tilde{\theta}_L$, where

$$\tilde{\theta}_L = \theta_L + \delta_M \tag{24}$$

Fig. 6. Shift-add architecture for a Hybrid CORDIC algorithm.

It is shown [25] that, based on the above decompositions using radix-2 representation, both coarse and fine rotations could be implemented by a sequence of shift-and-add operations in CORDIC iterations without a ROM lookup table or the real multiplication operation. One such implementation is shown in Fig. 6. Processor-I performs CORDIC operations like that of conventional CORDIC for nearly the first one-third of the iterations, and the residual angle as well as the intermediate rotated vector are passed to processor-II. Processor-II can perform the fine rotation in one of the possible ways, as in the case of the circuit of Fig. 5.

The coarse-fine rotation approach in some modified forms has been applied for reduced-latency implementation of sine and cosine generation [24]–[28], high-speed and high-precision rotation [24], [26], and conversion of rectangular to polar coordinates and vice versa [29], [30].

4) Parallel CORDIC Based on Coarse-Fine Decomposition: In [31], the authors have proposed two angle recoding techniques for parallel detection of the direction of micro-rotations, namely the binary to bipolar recoding (BBR) and micro-rotation angle recoding (MAR), to be used for the coarse part of the input angle. BBR is used to obtain the polarity of each bit in the radix-2 representation of the angle to determine the rotation direction. MAR is used to decompose each positional binary weight into a linear combination of arctangent terms. It is further shown in [32] that the rotation direction can be decided once the input angle is known, to enable parallel computation of the micro-rotations. Although the CORDIC rotation can be executed in parallel according to [32], the method for decomposition of each positional binary weight produces many extra stages of micro-rotation, especially when the bit-width of the input angle increases. A more efficient recoding scheme has been proposed in [33] for the reduction of the number of micro-rotations to be employed in parallel CORDIC rotations.

D. Redundant-Number-Based CORDIC Implementation

Addition/subtraction operations are faster in the redundant number system since, unlike the binary system, it does not involve carry propagation. The use of a redundant number system is therefore another way to speed up the CORDIC iterations. A CORDIC implementation based on the redundant number system, called redundant CORDIC, was proposed by Ercegovac and Lang and applied to matrix triangularization and singular value decomposition [34]. Rotation-mode redundant CORDIC has been found to result in fast implementation of sinusoidal function generation, unitary matrix transformation, angle calculation and rotation [34]–[38]. Although redundant CORDIC can achieve a fast carry-free computation, the direction of the micro-rotation (the sign factor $\sigma_i$) cannot be determined directly, unlike the case of the conventional CORDIC, since the redundant number system allows a choice $\sigma_i = 0$ along with the conventional choices $1$ and $-1$, such that $\sigma_i \in \{-1, 0, 1\}$. Therefore, it requires a different formulation for selection of $\sigma_i$, which is different for binary signed-digit representation and carry-save implementation. In radix-2 signed-digit representation it is shown, under the assumption stated in [6], that

ififif

(25)

where is the value of truncated after the first fractionaldigit. Similarly for carry-save implementation, it is

ififif

(26)

It can be noted from (25) and (26) that in some of the iterations no rotations are performed, so that the scale-factor becomes a variable which depends on the angle of rotation. Since the redundant CORDIC of [34] uses a non-constant scale-factor, Takagi et al. [35] have proposed the double-rotation method and the correcting-rotation method to keep the value of the scale-factor constant. In the double-rotation method, two micro-rotations are performed in each iteration, such that when $\sigma_i = 0$, one positive and one negative micro-rotation are performed, and when $\sigma_i = 1$ or $-1$, respectively, two positive or two negative micro-rotations are performed. The scale-factor is retained constant in this case since the number of micro-rotations is fixed for any rotation angle, but it doubles the iteration count. The correcting-rotation method examines the sign of an estimate of the residual angle, constituted by some of its most significant digits, and takes $\sigma_i$ to be $1$ or $-1$ accordingly, never $0$. It is shown that the error occurring in this algorithm could be corrected by repeating the iterations at regular intervals determined by the size of this estimate. The branching CORDIC was proposed in [36] for fast on-line implementation of redundant CORDIC with a constant scale factor. The main drawback of this method, however, is its necessity of performing two conventional CORDIC iterations in parallel, which consumes more silicon area than classical methods [39]. The work proposed in [34] has also been extended to the vectoring mode [37], and correcting operations are included further to keep the scaling factor constant so as to eliminate the hardware for scaling.

Fig. 7. Pipelined architecture for conventional CORDIC.

E. Pipelined CORDIC Architecture

Since the CORDIC iterations are identical, it is very convenient to map them into pipelined architectures. The main emphasis in efficient pipelined implementation lies with the minimization of the critical path. The earliest pipelined architecture that we find was suggested by Deprettere, Dewilde and Udo in 1984 [14]. Pipelined CORDIC circuits have been used thereafter for high-throughput implementation of sinusoidal wave generation, fixed and adaptive filters, discrete orthogonal transforms and other signal processing applications [40]–[44]. A generic architecture of a pipelined CORDIC circuit is shown in Fig. 7. It consists of $n$ stages of CORDIC units, where each of the pipelined stages consists of a basic CORDIC engine of the kind shown in Fig. 2. Since the number of shifts to be performed by the shifters at different stages is fixed (shift-operation through $i$-bit positions is performed at the $i$th stage), in case of pipelined CORDIC the shift operations could be hardwired with adders, and therefore shifters are eliminated in the pipelined implementation. The critical-path of pipelined CORDIC thus amounts to the time required by the add/subtract operations in each of the stages. When three adders are used in each stage as shown in Fig. 7, the critical-path amounts to $T_{ADD} + T_{MUX} + T_{2C}$, where $T_{ADD}$, $T_{MUX}$ and $T_{2C}$ are the time required for addition, 2:1 multiplexing and 2's complement operation, respectively. For known and constant angle rotations the sign of micro-rotations could be predetermined, and the need of multiplexing could be avoided for reducing the critical-path. The latency of computation thus depends primarily on the time required for an addition. Since there is very little room for reducing the critical path in the pipelined implementation of conventional CORDIC, digit-on-line pipelined CORDIC circuits based on the differential CORDIC (D-CORDIC) algorithm have been suggested to achieve higher throughput and lower pipeline latency.

TABLE III
DIFFERENTIAL CORDIC ALGORITHM
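The pipelined organization of the conventional CORDIC described in Section III-E can be modelled at the clock-cycle level as follows. This is a hedged sketch: the 16-stage depth, the 16 fractional bits, and the names `stage` and `pipeline` are assumptions for illustration, and the model ignores the adder-level critical path entirely. Each stage uses a constant shift, which is what allows the shifters to be hardwired.

```python
import math

N_STAGES = 16
FRAC = 16                      # assumed number of fractional bits
ONE = 1 << FRAC
ATAN = [round(math.atan(2.0 ** -i) * ONE) for i in range(N_STAGES)]
# Pre-scaled start value so that the outputs are cos and sin directly.
K_INV = round(ONE / math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(N_STAGES)))

def stage(i, x, y, z):
    """Stage i: the shift amount i is a constant (hardwired in hardware)."""
    if z >= 0:
        return x - (y >> i), y + (x >> i), z - ATAN[i]
    return x + (y >> i), y - (x >> i), z + ATAN[i]

def pipeline(angles):
    """One angle enters per clock; each result leaves N_STAGES clocks later."""
    regs = [None] * N_STAGES               # registers at the stage outputs
    results = []
    for angle in list(angles) + [None] * N_STAGES:   # bubbles flush the pipe
        if regs[-1] is not None:
            results.append(regs[-1])
        for i in range(N_STAGES - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else stage(i, *regs[i - 1])
        regs[0] = None if angle is None else stage(0, K_INV, 0, round(angle * ONE))
    return [(x / ONE, y / ONE) for x, y, _ in results]

outputs = pipeline([0.0, 0.5, -1.0])
```

After the initial latency of `N_STAGES` clocks, one (cos, sin) pair is produced per clock, which is the throughput benefit of pipelining.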

F. Differential CORDIC Algorithm

D-CORDIC algorithm is equivalent to the usual CORDIC in terms of accuracy as well as convergence, but it provides faster and more efficient redundant number-based implementation of both rotation mode and vectoring mode CORDIC. It introduces some temporary variables $\hat{g}_i$ corresponding to the CORDIC variables $g_i$, generically defined as

$$\hat{g}_{i+1} = \mathrm{sign}(g_i) \cdot g_{i+1} \tag{27}$$

which implies that $|\hat{g}_{i+1}| = |g_{i+1}|$ and $\mathrm{sign}(g_{i+1}) = \mathrm{sign}(g_i) \cdot \mathrm{sign}(\hat{g}_{i+1})$. The signs of $\hat{g}_i$ are, therefore, considered as being differentially encoded signs of $g_i$ in the differential CORDIC algorithm [45]. The rotation and vectoring mode D-CORDIC algorithms are outlined in Table III.

D-CORDIC algorithm is suitable for efficient pipelined implementation, which is utilized by Ercegovac and Lang [34] using on-line arithmetic based on the redundant number system. Since the output data in the redundant on-line arithmetic can be available in the most-significant-digit-first (MSD-first) fashion, the successive iterations could be implemented by a set of cascaded stages, where processing time between the successive stages is overlapped with a single-digit time-skew, which results in a significant reduction in overall latency of computation. Moreover, in some redundant number representations, the absolute value and sign of the output are easily determined, e.g., in binary signed-digit (BSD) representation, the sign of a number corresponds to the sign of the first nonzero MSD, and negation of the number can be performed just by flipping the signs of the nonzero digits. A two-dimensional systolic D-CORDIC architecture is derived in [46], where phase accumulation is performed for direct digital frequency synthesis in the digit-level pipelining framework.
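The differential encoding of signs can be checked numerically for the rotation mode. The sketch below assumes the differential variable is defined as $\hat{z}_{i+1} = \mathrm{sign}(z_i)\, z_{i+1}$, so that $\hat{z}_{i+1} = |z_i| - \alpha_i$; the function names and the 24-iteration depth are illustrative choices, and plain floating point is used instead of redundant arithmetic.

```python
import math

ALPHA = [math.atan(2.0 ** -i) for i in range(24)]

def sign(v):
    return 1 if v >= 0 else -1

def conventional_signs(z0):
    """Rotation directions of conventional CORDIC: sigma_i = sign(z_i)."""
    z, out = z0, []
    for a in ALPHA:
        s = sign(z)
        out.append(s)
        z -= s * a
    return out

def differential_signs(z0):
    """Same directions, obtained by tracking only |z| and decoding the
    sign differentially from the sign of z_hat."""
    s, mag, out = sign(z0), abs(z0), []
    for a in ALPHA:
        out.append(s)
        z_hat = mag - a        # z_hat_{i+1} = |z_i| - alpha_i
        s *= sign(z_hat)       # sign(z_{i+1}) = sign(z_i) * sign(z_hat_{i+1})
        mag = abs(z_hat)
    return out

same = conventional_signs(0.3) == differential_signs(0.3)
```

The two sequences coincide except in the corner case where a residual is exactly zero, where the sign convention for zero matters.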

IV. SCALING, QUANTIZATION AND ACCURACY ISSUES

As discussed in Section II-A, scaling is a necessary operation associated with the implementation of the CORDIC algorithm. Scaling in CORDIC could be of two types: 1) constant factor scaling and 2) variable factor scaling. In case of variable factor scaling, the scale-factor changes with the rotation angle. It arises mainly because some of the iterations of conventional CORDIC are ignored (and that varies with the angle of rotation), as in the case of higher-radix CORDIC and most of the optimized CORDIC algorithms. The techniques for scaling compensation for each such algorithm have been studied extensively for minimizing the scaling overhead. In case of conventional CORDIC, as given by (8), after a sufficiently large number of iterations the scale-factor converges to approximately 1.6467603, which leads to constant factor scaling since the scale factor remains the same for all angles of rotation. Constant factor scaling could be efficiently implemented in a dedicated scaling unit designed by the canonical signed digit (CSD)-based technique [47] and the common sub-expression elimination (CSE) approach [48], [49]. When the sum of the outputs of more than one independent CORDIC operation is to be evaluated, one can perform only one scaling of the output sum [50] in the case of constant factor scaling. In the following subsections, we briefly discuss some interesting developments on implementation of on-line scaling and realization of scaling-free CORDIC. Besides, we outline here the sources of error that may arise in a CORDIC design and their impact on implementation.
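As a rough illustration of constant factor scaling, the sketch below evaluates the limiting gain of conventional CORDIC and multiplies by its reciprocal using a canonical signed-digit recoding, so that only shifts and add/subtract operations are needed. The helper names, the 16-bit coefficient width, and the use of the non-adjacent form as the CSD recoding are assumptions for this example, not the designs of [47]–[49].

```python
import math

def cordic_gain(n_iter):
    """Product of sqrt(1 + 2^(-2i)) over n_iter conventional micro-rotations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(n_iter))

def csd_digits(value):
    """Signed digits (non-adjacent form) of a non-negative integer, LSB first."""
    digits = []
    while value:
        if value & 1:
            d = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

def scale_by_constant(x, digits, frac_bits):
    """Multiply integer x by the recoded constant with shifts and add/subtract."""
    acc = 0
    for k, d in enumerate(digits):
        if d > 0:
            acc += x << k
        elif d < 0:
            acc -= x << k
    return acc >> frac_bits

FRAC = 16
coeff = round((1 << FRAC) / cordic_gain(32))   # 1/gain as a 16-bit fraction
digits = csd_digits(coeff)
scaled = scale_by_constant(1000, digits, FRAC)
```

The recoding guarantees that no two adjacent digits are nonzero, which bounds the number of add/subtract operations in the scaling unit.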

A. Implementation of Mixed-Scaling-Rotation

Dewilde et al. [51] have suggested on-line scaling, where shift-add operations for scaling and micro-rotations are interleaved in the same circuit. This approach has been used in [52] and improved further in [53]. In the mixed-scaling-rotation (MSR) approach, pioneered by Wu et al. [54]–[56], the micro-rotation and scaling phases are merged into a unified vector rotational model to minimize the overhead of the scaling operation. The MSR-CORDIC can be applied to DSP applications in which the rotation angles are usually known a priori, e.g., the twiddle factors in the fast Fourier transform (FFT) and kernel components in other sinusoidal transforms. It is shown in [55] that the MSR technique can significantly reduce the total iteration count so as to improve the speed performance, and enhance the signal-to-quantization-noise ratio (SQNR) performance by controlling the internal dynamic range. The MSR-CORDIC scheme has been applied to a variable-length FFT processor design [29], and found to result in significant hardware reduction in the implementation of twiddle-factor multiplications. Although the interleaved scaling and MSR-CORDIC provide hardware reduction, they also lead to a reduction of throughput. For high-throughput implementation, one should implement the micro-rotations and scaling in two separate pipelined stages.

B. Low-Complexity Scaling

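When the elementary angles pertaining to a rotation are “sufficiently small”, defined by $\sin \alpha_i \approx \alpha_i = 2^{-i}$, and the rotations are only in one direction, the CORDIC rotation is given by the representation [57]

$$\begin{bmatrix} x_{i+1} \\ y_{i+1} \end{bmatrix} = \begin{bmatrix} 1 - 2^{-(2i+1)} & 2^{-i} \\ -2^{-i} & 1 - 2^{-(2i+1)} \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix} \tag{28}$$

and $z_{i+1} = z_i - 2^{-i}$ (considering clockwise micro-rotations only), where $x_i$ and $y_i$ are the components of the vector after the $i$th micro-rotation, $L$ is the input wordlength and $i \ge p = \lfloor (L - 2.585)/3 \rfloor$. The formulation of (28) performs the “actual” rotation where the norm of the vector is preserved at every micro-rotation.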

However, the problem with this formulation is that the overall range of angles for which it can be used is very small, because, for 16-bit wordlength, the largest such angle is $2^{-4}$ rad, which obviously is quite small compared to the entire coordinate space. To overcome this problem, argument reduction is performed through “domain folding” [58] by mapping the target rotation-angles into the range $[0, \pi/8]$. Besides, the elementary rotations are carried out in an adaptive manner to enhance the rate of convergence so as to force the approximation error of the final angle below a specified limit [59]. But the domain-folding, in some cases, involves a rotation through $\pi/4$, which demands a scaling by a factor of $1/\sqrt{2}$. Besides, the target range $[0, \pi/8]$ is still much larger than the range of convergence of the scaling-free realization. The formulation of (28), therefore, could be effectively used when a rotation through $\pi/4$ is not required and the angles of rotation could be folded to the required range. Generalized algorithms, and their corresponding architectures to perform the scale-factor compensation in parallel with the CORDIC iterations, for both rotation and vectoring modes are proposed in [60], where the compensation overhead is reduced to a couple of iterations. It is shown in [61] that, since the scale-factor is known in advance, one can perform minimal recoding of the bits of the scaling-factor and implement the multiplication thereafter by a Wallace tree. It is a good solution for low-latency scaling, particularly for pipelined CORDIC architectures.
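The behaviour of scaling-free micro-rotations can be checked with a short numerical sketch. It assumes the second-order approximations $\cos(2^{-i}) \approx 1 - 2^{-(2i+1)}$ and $\sin(2^{-i}) \approx 2^{-i}$ for each clockwise micro-rotation; the function name, the chosen shift list and the use of floating point are illustrative assumptions rather than the circuit of [57].

```python
import math

def scaling_free_rotate(x, y, shifts):
    """Apply a clockwise micro-rotation through 2^-i radians for each i."""
    for i in shifts:
        # cos(a) ~ 1 - a^2/2 = 1 - 2^-(2i+1) and sin(a) ~ a = 2^-i,
        # so every term is a shifted copy of x or y.
        x, y = (x - x * 2.0 ** -(2 * i + 1) + y * 2.0 ** -i,
                y - y * 2.0 ** -(2 * i + 1) - x * 2.0 ** -i)
    return x, y

L = 16
p = math.floor((L - 2.585) / 3)          # smallest admissible shift for L bits
shifts = [4, 6, 9]                       # all shifts satisfy i >= p
angle = sum(2.0 ** -i for i in shifts)   # total clockwise rotation angle
x_r, y_r = scaling_free_rotate(1.0, 0.0, shifts)
norm = math.hypot(x_r, y_r)
```

For these small angles the norm stays within a few parts per million of unity, so no separate scaling stage is needed; the price is the narrow angular range noted in the text.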

C. Quantization and Numerical Accuracy

Errors in CORDIC are mainly of two types: 1) the angle approximation error, which originates from quantization of the rotation angle represented by a linear combination of a finite number of elementary angles, and 2) the error due to the finite wordlength of the datapath, resulting in rounding/truncation of the output, which increases cumulatively through the successive iterations of micro-rotations. A third source of error that also comes into the picture results from the scaling of pseudo-rotated outputs. The scaling error is, however, also due to the use of finite wordlength in the scaling circuitry and is predominantly a rounding/truncation error. A detailed discussion on rounding error due to fixed and floating point implementations is available in [62]. In his earlier work, Walther [3] concluded that the errors in the CORDIC output are bounded, and $\log_2 n$ extra bits are required in the datapaths to take care of the errors. Hu [62] has provided more precise error bounds due to the angle approximation error for different CORDIC modes, for fixed-point as well as floating-point implementations. The error bound resulting from fixed-point representation of arctangents is further analyzed by Kota and Cavallaro [63], and its impact on practical implementation has been discussed.

D. Area-Delay-Accuracy Trade-Off

Area, accuracy and latency of the CORDIC algorithm depend mainly on the iteration count and its implementation. To achieve $n$-bit accuracy, if fixed-point arithmetic is applied, the wordlength of the $x$ and $y$ data-path is $(n + 2 + \log_2 n)$ and for the computation of the angle $\theta$, it is $(n + \log_2 n)$ [45], [63]. The hardware requirement therefore increases accordingly with the desired accuracy. Floating-point implementation naturally gives higher accuracy than its fixed-point counterpart, but at the cost of more complex hardware. To minimize the angle approximation error, the smallest elementary angle needs to be as small as possible [62]. This consequently demands a larger number of right-shifts and more hardware for the barrel-shifters and adders. Besides, to have better angle approximation, more iterations are required, which increases the latency. The additional accuracy resulting from floating-point implementation or better angle approximation may not, however, be necessary in many applications. Thus, there is a need for trade-off between hardware cost, latency and numerical accuracy subject to a particular application. Therefore, the designer has to check how much numerical accuracy is needed along with the area and speed constraints for the particular application, and can accordingly decide on fixed or floating-point implementation and set the wordlength and the optimal number of iterations.

TABLE IV
COMPUTATIONS USING CORDIC ALGORITHM IN DIFFERENT CONFIGURATIONS

V. APPLICATIONS OF CORDIC

CORDIC technique is basically applied for rotation of a vector in circular, hyperbolic or linear coordinate systems, which in turn could also be used for generation of sinusoidal waveforms, multiplication and division operations, and evaluation of the angle of rotation, trigonometric functions, logarithms, exponentials and square roots [6], [64], [65]. Table IV shows some elementary functions and operations that can be directly implemented by CORDIC. The table also indicates whether the coordinate system is circular (CC), linear (LC), or hyperbolic (HC), whether the CORDIC operates in rotation mode (RM) or vectoring mode (VM), the initialization of the CORDIC, and the necessary pre- or postprocessing step to perform the operation. The scale factors are, however, omitted in Table IV for simplicity of presentation. In this Section, we discuss how CORDIC is used for some basic matrix problems like QR decomposition and singular-value decomposition. Moreover, we make a brief presentation on the applications of CORDIC to signal and image processing, digital communication, robotics and 3-D graphics.

A. Matrix Computation

1) QR Decomposition: QR decomposition of a matrix can be performed through Givens rotation [66] that selectively introduces zeros into the matrix. Givens rotation is an orthogonal transformation of the form

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} r \\ 0 \end{bmatrix} \tag{29}$$

where $c = a/\sqrt{a^2 + b^2}$ and $s = b/\sqrt{a^2 + b^2}$, so that $r = \sqrt{a^2 + b^2}$. The QR decomposition requires two types of iterative operations to obtain an upper-triangular matrix using orthogonal transformations. Those are: (i) to calculate the Givens rotation angle, and (ii) to apply the calculated angle of rotation to the rest of the rows. Circular coordinate CORDIC is a good choice to implement both these operations, where the first operation is performed by a VM CORDIC and the second one is performed by an RM CORDIC. The CORDIC-based QR decomposition can be implemented in VLSI with suitable area-time trade-off using a systolic triangular array, a linear array or a single CORDIC processor that is reconfigurable for rotation and vectoring modes of operation. A detailed explanation of these architectures is available in [64], [67].
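The two operations can be sketched in floating point as follows. This is a simplified model under stated assumptions, not any of the array architectures of [64], [67]: the vectoring pass records its micro-rotation directions and the rotation pass replays them on the remaining columns (a common way to avoid computing the angle explicitly), and the names `vectoring`, `micro_rotations` and `qr_r_factor` and the 40-iteration depth are chosen for this example.

```python
import math

N_ITER = 40
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))

def micro_rotations(x, y, flip, dirs):
    """Rotation pass: replay recorded directions on (x, y), then scale."""
    if flip:
        x, y = -x, -y
    for i, d in enumerate(dirs):
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return x / GAIN, y / GAIN

def vectoring(x, y):
    """Vectoring pass: find the directions that drive y to zero."""
    flip = x < 0                  # pre-rotate by pi into the right half-plane
    if flip:
        x, y = -x, -y
    dirs = []
    for i in range(N_ITER):
        d = -1 if y > 0 else 1
        dirs.append(d)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
    return flip, dirs

def qr_r_factor(a):
    """Return the upper-triangular factor R obtained by Givens rotations."""
    r = [row[:] for row in a]
    n_rows, n_cols = len(r), len(r[0])
    for j in range(min(n_rows - 1, n_cols)):
        for i in range(j + 1, n_rows):
            flip, dirs = vectoring(r[j][j], r[i][j])
            for k in range(j, n_cols):
                r[j][k], r[i][k] = micro_rotations(r[j][k], r[i][k], flip, dirs)
    return r

A = [[4.0, 1.0, 2.0], [3.0, 5.0, 1.0], [-2.0, 2.0, 6.0]]
R = qr_r_factor(A)
```

Since every step is an orthogonal transformation, $R^T R$ equals $A^T A$ up to rounding, which gives a simple consistency check.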

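2) Singular Value Decomposition and Eigenvalue Estimation: Singular value decomposition of a matrix $A$ is given by $A = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is a diagonal matrix of singular values. For CORDIC-based implementation of SVD, it is decomposed into $2 \times 2$ SVD problems, and solved iteratively. To solve each $2 \times 2$ SVD problem, two-sided Givens rotation is applied to each of the $2 \times 2$ matrices to nullify the off-diagonal elements, as described in the following:

$$\begin{bmatrix} \cos\theta_l & \sin\theta_l \\ -\sin\theta_l & \cos\theta_l \end{bmatrix}^T M \begin{bmatrix} \cos\theta_r & \sin\theta_r \\ -\sin\theta_r & \cos\theta_r \end{bmatrix} = \begin{bmatrix} \psi_1 & 0 \\ 0 & \psi_2 \end{bmatrix} \tag{30}$$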

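where $M$ is a $2 \times 2$ input matrix to be decomposed, and $\theta_l$ and $\theta_r$ are, respectively, the left and right rotation angles, calculated from the elements of $M$ using the following two relations:

$$\theta_r + \theta_l = \tan^{-1}\!\left(\frac{c + b}{d - a}\right), \qquad \theta_r - \theta_l = \tan^{-1}\!\left(\frac{c - b}{d + a}\right), \qquad \text{for } M = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \tag{31}$$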

CORDIC-based architectures for SVD using this method were developed by Cavallaro and Luk [68]. A simplified design of the array processor for the particular case of a symmetric input matrix, in which the left and right rotation angles are equal, was developed further by Delosme [69] for the symmetric eigenvalue problem. In a relatively recent paper [70], Liu et al. have proposed an application-specific instruction set processor (ASIP) for the real-time implementation of QR decomposition and SVD, where circular coordinate CORDIC is used for efficient implementation of both these functions.
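The two-sided rotation for a single $2 \times 2$ block can be verified numerically. The sketch assumes the rotation convention $R(\theta) = [[\cos\theta, \sin\theta], [-\sin\theta, \cos\theta]]$ and the sum and difference relations for the left and right angles; `math.atan2`, `math.cos` and `math.sin` stand in for the vectoring-mode and rotation-mode CORDIC units, and the helper names are illustrative.

```python
import math

def rot(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def svd_angles(m):
    """Return (theta_l, theta_r) from the sum and difference relations."""
    (a, b), (c, d) = m
    theta_sum = math.atan2(c + b, d - a)    # theta_r + theta_l
    theta_diff = math.atan2(c - b, d + a)   # theta_r - theta_l
    return (theta_sum - theta_diff) / 2, (theta_sum + theta_diff) / 2

m = [[1.0, 2.0], [3.0, 4.0]]
theta_l, theta_r = svd_angles(m)
diag = matmul(matmul(transpose(rot(theta_l)), m), rot(theta_r))
```

The diagonal entries obtained this way may be negative; their magnitudes are the singular values of the block.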

B. Signal Processing and Image Processing Applications

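CORDIC techniques have a wide range of DSP applications including fixed/adaptive filtering [8], and the computation of discrete sinusoidal transforms such as the DFT [50], [52], [71], [72], discrete Hartley transform (DHT) [53], [73], [74], discrete cosine transform (DCT) [75]–[78], discrete sine transform (DST) [76]–[78] and chirp $z$-transform (CZT) [79]. The DFT, DHT, and DCT [80] of an $N$-point input sequence $x(n)$ for $n = 0, 1, \ldots, N-1$, in general, are given by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, C(k, n) \quad \text{for } k = 0, 1, \ldots, N-1 \tag{32}$$

where the transform kernel matrix is defined as

$$C(k, n) = \begin{cases} e^{-j 2\pi kn/N} & \text{for DFT} \\ \cos(2\pi kn/N) + \sin(2\pi kn/N) & \text{for DHT} \\ \cos\left(\pi (2n+1) k / 2N\right) & \text{for DCT} \end{cases}$$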

The input sequence for the DFT is, in general, complex, and the computation of (32) can be partitioned into blocks, each of which is a plane rotation of the real and imaginary parts of an input sample through the kernel angle, which is in the same form as the output of RM-CORDIC. In the case of the DHT, the computation can similarly be transformed into computations of the same rotation form, to be implemented efficiently by RM-CORDIC units. These features of the DFT and DHT are used to design parallel and pipelined architectures for the computation of these two transforms [50], [52], [53], [71]–[74]. It is shown [76], [77] that, by simple input-output modification, one can transform the DCT and DST kernels into the DHT form to compute them by rotation mode CORDIC. Similarly, in [79], the CZT is represented by a DFT-like kernel through simple pre-processing and post-processing operations, and implemented through CORDIC rotations. The CORDIC technique has also been used in many image processing operations like spatial domain image enhancement for contrast stretching, logarithmic transformation and power-law transformation, image rotation, and Hough transform for line detection [81], [82]. CORDIC implementations of some of these applications are discussed in [83], [84]. Several other signal processing applications are discussed in detail in [64], which we do not intend to repeat here.

Fig. 8. CORDIC-based direct digital synthesizer.

Fig. 9. A generic scheme to use RM CORDIC for digital modulation. The two CORDIC inputs are, respectively, the in-phase and quadrature signals to be modulated.

C. Applications to Communication

CORDIC algorithm can be used for efficient implementation of various functional modules in a digital communication system [85]. Most applications of CORDIC in communications use the circular coordinate system in one or both CORDIC operating modes. The RM-CORDIC is mainly used to generate mixed signals, while the VM-CORDIC is mainly used to estimate phase and frequency parameters. We briefly outline here some of the important communication applications.

1) Direct Digital Synthesis: Direct digital synthesis is the process of generating sinusoidal waveforms directly in the digital domain. A direct digital synthesizer (DDS) (as shown in Fig. 8) consists of a phase accumulator and a phase-to-waveform converter [86], [87]. In every cycle, the phase-generation circuit increments the phase by an amount set by the normalized carrier frequency and feeds the phase information to the phase-to-waveform converter. The phase-to-waveform converter could be realized by an RM-CORDIC [88], [89], as shown in Fig. 8. The cosine and sine waveforms are obtained, respectively, from the two CORDIC outputs.
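A behavioural sketch of such a synthesizer is given below. It assumes that the accumulator adds $2\pi f$ per cycle for a normalized frequency $f < 0.5$ and wraps the phase into $[-\pi, \pi)$; the function names `cordic_cos_sin` and `dds`, the 24-iteration depth and the floating-point arithmetic are illustrative assumptions, not the designs of [88], [89].

```python
import math

N_ITER = 24
ATAN = [math.atan(2.0 ** -i) for i in range(N_ITER)]
K_INV = 1.0 / math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(N_ITER))

def cordic_cos_sin(theta):
    """Rotation-mode CORDIC for theta in [-pi, pi); returns (cos, sin)."""
    x, y = K_INV, 0.0
    # Fold into [-pi/2, pi/2], where the micro-rotations converge.
    if theta > math.pi / 2:
        theta -= math.pi
        x = -x
    elif theta < -math.pi / 2:
        theta += math.pi
        x = -x
    for i, a in enumerate(ATAN):
        d = 1 if theta >= 0 else -1
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        theta -= d * a
    return x, y

def dds(freq, n_samples):
    """Phase accumulator followed by the CORDIC phase-to-waveform converter."""
    phase, out = 0.0, []
    for _ in range(n_samples):
        out.append(cordic_cos_sin(phase))
        phase += 2.0 * math.pi * freq      # assumed phase increment per cycle
        if phase >= math.pi:               # wrap into [-pi, pi)
            phase -= 2.0 * math.pi
    return out

samples = dds(0.05, 40)
```

With `freq = 0.05` the output completes one period every 20 samples; changing the frequency only changes the accumulator increment, not the converter.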

2) Analog and Digital Modulation: A generic scheme to use CORDIC in RM for digital modulation is shown in Fig. 9, where the phase-generation unit of Fig. 8 is changed to generate the phase from the normalized carrier frequency, the normalized modulating frequency and the phase of the modulating component. By suitable selection of these parameters and the CORDIC inputs, the generic scheme of Fig. 9 could be used for digital realization of analog amplitude modulation (AM), phase modulation (PM), and frequency modulation (FM), as well as digital modulators, e.g., amplitude-shift keying (ASK), phase-shift keying (PSK), and frequency-shift keying (FSK). It could also be used for the up/down converters for quadrature-amplitude modulators (QAM) and full mixers for complex signals, or phase and frequency corrector circuits for synchronization [85].

3) Other Communication Applications: By operating the CORDIC in vectoring mode, one can compute the magnitude and the angle of an input vector. The magnitude computation can be used for envelope detection in an AM receiver or to detect an FSK signal if it is placed after mark or space filters [90]. The angle computation in VM CORDIC, on the other hand, can be used to detect FM and FSK signals and to estimate phase and frequency parameters [91]. A single VM-CORDIC can be used to perform these computations for the implementation of a slicer for a high-order constellation like the 32-APSK used in DVB-S2.

CORDIC circuits operating in both modes are also required in digital receivers for the synchronization stage, to perform a phase or frequency estimation followed by a correction stage. This can be done by using two different CORDIC units, to meet the high speed requirement in a Costas loop for phase recovery in QAM modulation [92], [93]. On the other hand, a burst-based communication system that needs a preamble for synchronization purposes, e.g., in the case of IEEE 802.11a WLAN-OFDM receivers, can use a single CORDIC unit configurable for both operating modes, since the estimation and correction are not performed simultaneously [94], [95]. Apart from these, the CORDIC-based QR decomposition has been used in multi-input-multi-output (MIMO) systems to implement V-BLAST detectors [96]–[98], and to implement a recursive-least-square (RLS) adaptive antenna beamformer [67], [99], [100].

D. Applications of CORDIC to Robotics and Graphics

Two of the key problems where CORDIC provides area- and power-efficient solutions are: 1) direct kinematics and 2) inverse kinematics of serial robot manipulators. How CORDIC is applied in these applications is discussed below.

1) Direct Kinematics Solution (DKS) for Serial Robot Manipulators: A robot manipulator consists of a sequence of links, connected typically by either revolute or prismatic joints. For an $n$-degrees-of-freedom manipulator, there are $n$ joint-link pairs, with link 0 being the supporting base and the last link being attached to a tool. The joints and links are numbered outwardly from the base. The coordinates of the points on the $i$th link change successively for $i = 1, 2, \ldots, n$ due to successive rotations and translations of the links. The translation operations are realized by simple additions of coordinate values, while the new coordinates of any point due to rotation are computed by RM-CORDIC circuits.
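The split between additions (translations) and RM-CORDIC operations (rotations) can be sketched for a simplified case. The Python fragment below assumes a planar arm with revolute joints only, which is far simpler than a general serial manipulator; the function names `rotate` and `tool_position`, the link lengths and the joint angles are hypothetical, and the arithmetic is a floating-point emulation of shift-add hardware.

```python
import math

N_ITER = 20  # assumed number of micro-rotations
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def rotate(x, y, theta):
    """Rotation-mode CORDIC: rotate (x, y) by theta (|theta| < ~1.74 rad)."""
    z = theta
    for i in range(N_ITER):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x / GAIN, y / GAIN


def tool_position(link_lengths, joint_angles):
    """Tool-tip position of a planar revolute arm, expressed in the base frame.

    Works from the last link back to the base: each step translates along
    the link (one addition) and then applies the joint rotation (one CORDIC).
    """
    x, y = 0.0, 0.0
    for a, theta in zip(reversed(link_lengths), reversed(joint_angles)):
        x += a                      # translation: simple addition
        x, y = rotate(x, y, theta)  # rotation: RM-CORDIC
    return x, y


# Hypothetical two-link arm.
tip_x, tip_y = tool_position([1.0, 0.5], [0.5, -0.3])
```

For a spatial manipulator the same pattern would apply per rotation axis, at the cost of more CORDIC operations per link.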

2) Inverse Kinematics for Robot Manipulators: The inverse kinematics problem involves determination of joint variables for a desired position and orientation for the tool. The CORDIC approach is valuable to find the inverse kinematic solution when a closed form solution is possible (when, in particular, the desired tool tip position is within the robot’s work envelope and when joint angle limits are not violated). The authors in [101] present a maximum pipelined CORDIC-based architecture for efficient computation of the inverse kinematics solution. It is also shown [101], [102] that up to 25 CORDIC processors are required for the computation of the entire inverse kinematics solution for a six-link PUMA-type robotic arm. Apart from implementation of rotation operations, CORDIC is used in the evaluation of trigonometric functions and square root expressions involved in the inverse kinematics problems [103].
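To make the role of the trigonometric and square-root evaluations concrete, the sketch below solves the closed-form inverse kinematics of a planar two-link arm, using circular vectoring for the arctangent and hyperbolic vectoring for the square root. This is an assumption-laden illustration, not the architecture of [101]–[103] and not a six-link PUMA solution: all function names, the iteration count and the arm geometry are hypothetical, the few products are done in ordinary floating point for brevity, and only one of the two elbow configurations is returned.

```python
import math

N_ITER = 24  # assumed number of micro-rotations


def hyperbolic_shifts(n):
    """Shift sequence 1,2,3,4,4,5,...,13,13,... required for convergence."""
    seq, i, repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat:          # iterations 4, 13, 40, ... are repeated
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return seq[:n]


CIRC_SHIFTS = list(range(N_ITER))
HYP_SHIFTS = hyperbolic_shifts(N_ITER)


def vectoring(x, y, shifts, hyperbolic=False):
    """Drive y to zero (x > 0 assumed); return (gain-corrected x, angle)."""
    z, gain = 0.0, 1.0
    for i in shifts:
        t = 2.0 ** -i            # emulates a right shift by i bits
        d = -1.0 if y > 0 else 1.0
        if hyperbolic:
            x, y = x + d * y * t, y + d * x * t
            z -= d * math.atanh(t)
            gain *= math.sqrt(1.0 - t * t)
        else:
            x, y = x - d * y * t, y + d * x * t
            z -= d * math.atan(t)
            gain *= math.sqrt(1.0 + t * t)
    return x / gain, z


def cordic_atan2(y, x):
    """Four-quadrant arctangent: 90-degree pre-rotation, then vectoring."""
    offset = 0.0
    if x < 0:
        if y >= 0:
            x, y, offset = y, -x, math.pi / 2
        else:
            x, y, offset = -y, x, -math.pi / 2
    _, z = vectoring(x, y, CIRC_SHIFTS)
    return z + offset


def cordic_sqrt(w):
    """sqrt(w) from sqrt((w + 1/4)^2 - (w - 1/4)^2); w roughly in [0.03, 2.3]."""
    r, _ = vectoring(w + 0.25, w - 0.25, HYP_SHIFTS, hyperbolic=True)
    return r


def two_link_ik(px, py, a1, a2):
    """Joint angles of a planar two-link arm (branch with sin(theta2) > 0)."""
    c2 = (px * px + py * py - a1 * a1 - a2 * a2) / (2.0 * a1 * a2)
    s2 = cordic_sqrt(1.0 - c2 * c2)
    theta2 = cordic_atan2(s2, c2)
    theta1 = cordic_atan2(py, px) - cordic_atan2(a2 * s2, a1 + a2 * c2)
    return theta1, theta2
```

A call such as `two_link_ik(px, py, 1.0, 0.8)` is meaningful only when the target lies inside the work envelope, which mirrors the closed-form condition stated above; the limited convergence range of the hyperbolic iterations additionally restricts the admissible elbow angles in this sketch.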

3) CORDIC for Other Robotics Applications: CORDIC has also been applied to robot control [104], [105], where CORDIC circuits serve as the functional units of a programmable CPU co-processor. Another application of CORDIC is the kinematics of redundant manipulators [106]. It is shown in [106] that the inverse kinematics in this case can be implemented efficiently in parallel by computing the pseudo-inverse through singular value decomposition. Collision detection is another area where CORDIC has been applied to robotics [107]. A CORDIC-based highly parallel solution for collision detection between a robot manipulator and multiple obstacles in the workspace is suggested in [107]. The collision detection problem is formulated as one that involves a number of coordinate transformations. CORDIC-based processing elements are used to efficiently perform the coordinate transformations by shift-add operations.

4) CORDIC for 3-D Graphics: Graphics operations such as 3-D vector rotation, lighting and vector interpolation are computation-intensive and geometric in nature. The CORDIC architecture is therefore a natural candidate for cost-effective implementation of these geometric computations. A systematic formulation to represent 3-D computer graphics operations in terms of CORDIC-type primitives is provided in [108]. An efficient stream processor based on CORDIC-type modules to implement the graphics operations is also suggested in [108]. 3-D vector interpolation is another important function in graphics, which is required for good-quality shading [109] in graphics rendering. It is shown that the variable-precision capability of the CORDIC engine can be utilized to realize a power-aware implementation of the 3-D vector interpolator [110].
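The variable-precision property mentioned above can be illustrated with a small experiment: the same rotation is evaluated with different numbers of micro-rotations, so that accuracy is traded against the amount of work. The Python sketch below rotates a 3-D vector about the z axis only; it is not the interpolator of [110], and the names `rotate_z` and `rotation_error`, the test vector and the iteration counts are hypothetical.

```python
import math


def rotate_z(v, theta, n_iter):
    """Rotate v = (x, y, z) about the z axis using n_iter CORDIC iterations.

    Fewer iterations mean less work (and, in hardware, less switching
    activity) at the price of a larger residual angle error.
    Assumes |theta| below about 1.74 rad.
    """
    x, y, z_coord = v
    angle, gain = theta, 1.0
    for i in range(n_iter):
        t = 2.0 ** -i                      # emulates a right shift by i bits
        d = 1.0 if angle >= 0 else -1.0
        x, y = x - d * y * t, y + d * x * t
        angle -= d * math.atan(t)
        gain *= math.sqrt(1.0 + t * t)     # scale factor for this depth
    return (x / gain, y / gain, z_coord)   # z is unchanged by this rotation


def rotation_error(v, theta, n_iter):
    """Distance between the CORDIC result and the exact rotation."""
    x, y, _ = rotate_z(v, theta, n_iter)
    ex = v[0] * math.cos(theta) - v[1] * math.sin(theta)
    ey = v[0] * math.sin(theta) + v[1] * math.cos(theta)
    return math.hypot(x - ex, y - ey)


# Hypothetical vector, evaluated at a coarse and a fine precision setting.
vec = (1.0, 2.0, 3.0)
coarse_error = rotation_error(vec, 0.7, 8)
fine_error = rotation_error(vec, 0.7, 20)
```

The residual angle after n iterations is bounded by arctan(2^-(n-1)), so each additional iteration contributes roughly one more bit of angular accuracy; a power-aware design can stop early when the application tolerates a coarser result.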

VI. CONCLUSION

The beauty of CORDIC is its potential for a unified solution to a large set of computational tasks involving the evaluation of trigonometric and transcendental functions, the calculation of multiplication, division, square root and logarithm, the solution of linear systems, QR decomposition, SVD, etc. Moreover, CORDIC is implemented by simple hardware through repeated shift-add operations. These features of CORDIC have made it an attractive choice for a wide variety of applications. In the last fifty years, several algorithms and architectures have been developed to speed up CORDIC by reducing its iteration count and through pipelined implementation. Moreover, its applications in several diverse areas, including signal processing, image processing, communication, robotics and graphics, apart from general scientific and technical computations, have been explored. Latency of computation, however, continues to be the major drawback of the CORDIC algorithm, since we do not have efficient algorithms for its parallel implementation. On the other hand, CORDIC is inherently suitable for pipelined designs, due to its iterative behavior and its small cycle time compared with conventional arithmetic. For high-throughput applications, efficient pipelined architectures with multiple CORDIC units could be developed to take advantage of the pipelineability of CORDIC, because digital hardware is getting cheaper with progressive device scaling. Research on fast implementation of the shift-accumulation operation, exploration of new number systems for CORDIC, and optimization of CORDIC for constant rotation have scope for further reduction of its latency. Another way to use CORDIC efficiently is to transform the computational algorithm into independent segments and to implement the individual segments by different CORDIC processors. With enhancement of its throughput and reduction of latency, it is expected that CORDIC would be useful for many high-speed and real-time applications. The area-delay-accuracy trade-off for different advanced algorithms may be investigated in detail and compared in future work.

REFERENCES

[1] J. E. Volder, “The CORDIC trigonometric computing technique,” IRE Trans. Electron. Computers, vol. EC-8, pp. 330–334, Sept. 1959.

[2] J. E. Volder, “The birth of CORDIC,” J. VLSI Signal Process., vol. 25, pp. 101–105, 2000.

[3] J. S. Walther, “A unified algorithm for elementary functions,” in Proc. 38th Spring Joint Computer Conf., Atlantic City, NJ, 1971, pp. 379–385.

[4] J. S. Walther, “The story of unified CORDIC,” J. VLSI Signal Process., vol. 25, no. 2, pp. 107–112, June 2000.

[5] D. S. Cochran, “Algorithms and accuracy in the HP-35,” Hewlett-Packard J., pp. 1–11, Jun. 1972.

[6] J.-M. Muller, Elementary Functions: Algorithms and Implementation. Boston, MA: Birkhauser Boston, 2006.

[7] S.-F. Hsiao and J.-M. Delosme, “Householder CORDIC algorithms,” IEEE Trans. Computers, vol. 44, no. 8, pp. 990–1001, Aug. 1995.

[8] E. Antelo, J. Villalba, and E. L. Zapata, “A low-latency pipelined 2D and 3D CORDIC processors,” IEEE Trans. Computers, vol. 57, no. 3, pp. 404–417, Mar. 2008.

[9] D. Timmermann, H. Hahn, and B. J. Hosticka, “Low latency time CORDIC algorithms,” IEEE Trans. Computers, vol. 41, no. 8, pp. 1010–1015, Aug. 1992.

[10] S. Wang, V. Piuri, and E. E. Swartzlander, Jr., “Hybrid CORDIC algorithms,” IEEE Trans. Computers, vol. 46, no. 11, pp. 1202–1207, Nov. 1997.

[11] S. Wang and E. E. Swartzlander, Jr., “Merged CORDIC algorithm,” in IEEE Int. Symp. on Circuits Syst. (ISCAS’95), 1995, vol. 3, pp. 1988–1991.

[12] B. Gisuthan and T. Srikanthan, “Pipelining flat CORDIC based trigonometric function generators,” Microelectron. J., vol. 33, pp. 77–89, 2002.

[13] S. Suchitra, S. Sukthankar, T. Srikanthan, and C. T. Clarke, “Elimination of sign precomputation in flat CORDIC,” in IEEE Int. Symp. on Circuits Syst., ISCAS’05, May 2005, vol. 4, pp. 3319–3322.

[14] E. Deprettere, P. Dewilde, and R. Udo, “Pipelined CORDIC architectures for fast VLSI filtering and array processing,” in IEEE Int. Conf. on Acoust., Speech, Signal Process., ICASSP’84, Mar. 1984, vol. 9, pp. 250–253.

[15] H. Kunemund, S. Soldner, S. Wohlleben, and T. Noll, “CORDIC processor with carry save architecture,” in Proc. ESSCIRC 90, Sept. 1990, pp. 193–196.

[16] E. Antelo, J. Villalba, J. D. Bruguera, and E. L. Zapata, “High performance rotation architectures based on the radix-4 CORDIC algorithm,” IEEE Trans. Computers, vol. 46, no. 8, pp. 855–870, Aug. 1997.

[17] P. R. Rao and I. Chakrabarti, “High-performance compensation technique for the radix-4 CORDIC algorithm,” Proc. IEE Comput. and Digital Techn., vol. 149, no. 5, pp. 219–228, Sep. 2002.

[18] E. Antelo, T. Lang, and J. D. Bruguera, “Very-high radix circular CORDIC: Vectoring and unified rotation/vectoring,” IEEE Trans. Computers, vol. 49, no. 7, pp. 727–739, July 2000.

[19] Y. H. Hu and S. Naganathan, “An angle recoding method for CORDIC algorithm implementation,” IEEE Trans. Comput., vol. 42, no. 1, pp. 99–102, Jan. 1993.

[20] Y. H. Hu and H. H. M. Chern, “A novel implementation of CORDIC algorithm using backward angle recoding (BAR),” IEEE Trans. Comput., vol. 45, no. 12, pp. 1370–1378, Dec. 1996.

[21] C.-S. Wu, A.-Y. Wu, and C.-H. Lin, “A high-performance/low-latency vector rotational CORDIC architecture based on extended elementary angle set and trellis-based searching schemes,” IEEE Trans. Circuits Syst. II: Anal. Digital Signal Process., vol. 50, no. 9, pp. 589–601, Sep. 2003.

[22] T. K. Rodrigues and E. E. Swartzlander, “Adaptive CORDIC: Using parallel angle recoding to accelerate CORDIC rotations,” in 40th Asilomar Conf. on Signals, Syst. and Computers, ACSSC’06, Oct.–Nov. 2006, pp. 323–327.

[23] M. Kuhlmann and K. K. Parhi, “P-CORDIC: A precomputation based rotation CORDIC algorithm,” EURASIP J. Appl. Signal Process., vol. 2002, no. 9, pp. 936–943, 2002.

[24] D. Fu and A. N. Willson, Jr., “A high-speed processor for digital sine/cosine generation and angle rotation,” in Conf. Rec. 32nd Asilomar Conf. on Signals, Syst. & Computers, Nov. 1998, vol. 1, pp. 177–181.

[25] C.-Y. Chen and W.-C. Liu, “Architecture for CORDIC algorithm realization without ROM lookup tables,” in Proc. 2003 Int. Symp. on Circuits Syst., ISCAS’03, May 2003, vol. 4, pp. 544–547.

[26] D. Fu and A. N. Willson, Jr., “A two-stage angle-rotation architecture and its error analysis for efficient digital mixer implementation,” IEEE Trans. Circuits Syst. I: Reg. Papers, vol. 53, no. 3, pp. 604–614, Mar. 2006.

[27] S. Ravichandran and V. Asari, “Implementation of unidirectional CORDIC algorithm using precomputed rotation bits,” in 45th Midwest Symp. on Circuits Syst., MWSCAS 2002, Aug. 2002, vol. 3, pp. 453–456.

[28] C.-Y. Chen and C.-Y. Lin, “High-resolution architecture for CORDIC algorithm realization,” in Proc. Int. Conf. on Commun., Circuits Syst., ICCCS’06, June 2006, vol. 1, pp. 579–582.

[29] D. D. Hwang, D. Fu, and A. N. Willson, Jr., “A 400-MHz processor for the conversion of rectangular to polar coordinates in 0.25-μm CMOS,” IEEE J. Solid-State Circuits, vol. 38, no. 10, pp. 1771–1775, Oct. 2003.

[30] S.-W. Lee, K.-S. Kwon, and I.-C. Park, “Pipelined cartesian-to-polar coordinate conversion based on SRT division,” IEEE Trans. Circuits Syst. II: Express Briefs, vol. 54, no. 8, pp. 680–684, Aug. 2007.

[31] S.-F. Hsiao, Y.-H. Hu, and T.-B. Juang, “A memory-efficient and high-speed sine/cosine generator based on parallel CORDIC rotations,” IEEE Signal Process. Lett., vol. 11, no. 2, 2004.

[32] T.-B. Juang, S.-F. Hsiao, and M.-Y. Tsai, “Para-CORDIC: Parallel CORDIC rotation algorithm,” IEEE Trans. Circuits Syst. I: Regular Papers, vol. 51, no. 8, 2004.

[33] T.-B. Juang, “Area/delay efficient recoding methods for parallel CORDIC rotations,” in IEEE Asia Pacific Conf. on Circuits Syst., APCCAS’06, Dec. 2006, pp. 1539–1542.

[34] M. D. Ercegovac and T. Lang, “Redundant and on-line CORDIC: Application to matrix triangularization and SVD,” IEEE Trans. Computers, vol. 39, no. 6, pp. 725–740, June 1990.

[35] N. Takagi, T. Asada, and S. Yajima, “Redundant CORDIC methods with a constant scale factor for sine and cosine computation,” IEEE Trans. Computers, vol. 40, no. 9, pp. 989–995, Sept. 1991.

[36] J. Duprat and J.-M. Muller, “The CORDIC algorithm: New results for fast VLSI implementation,” IEEE Trans. Computers, vol. 42, no. 2, pp. 168–178, Feb. 1993.

[37] J.-A. Lee and T. Lang, “Constant-factor redundant CORDIC for angle calculation and rotation,” IEEE Trans. Computers, vol. 41, no. 8, pp. 1016–1025, Aug. 1992.

[38] N. D. Hemkumar and J. R. Cavallaro, “Redundant and on-line CORDIC for unitary transformations,” IEEE Trans. Computers, vol. 43, no. 8, pp. 941–954, Aug. 1994.

[39] J. Valls, M. Kuhlmann, and K. K. Parhi, “Evaluation of CORDIC algorithms for FPGA design,” J. VLSI Signal Process. Syst., vol. 32, no. 3, pp. 207–222, Nov. 2002.

[40] D. E. Metafas and C. E. Goutis, “A floating point pipeline CORDIC processor with extended operation set,” in IEEE Int. Symp. on Circuits Syst., ISCAS’91, June 1991, vol. 5, pp. 3066–3069.

[41] Z. Feng and P. Kornerup, “High speed DCT/IDCT using a pipelined CORDIC algorithm,” in 12th Symp. on Computer Arithmetic, July 1995, pp. 180–187.

[42] M. Jun, K. K. Parhi, G. J. Hekstra, and E. F. Deprettere, “Efficient implementations of pipelined CORDIC based IIR digital filters using fast orthonormal μ-rotations,” IEEE Trans. Signal Process., vol. 48, no. 9, 2000.

[43] M. Chakraborty, A. S. Dhar, and M. H. Lee, “A trigonometric formulation of the LMS algorithm for realization on pipelined CORDIC,” IEEE Trans. Circuits Syst. II, Express Briefs, vol. 52, no. 9, 2005.

[44] E. I. Garcia, R. Cumplido, and M. Arias, “Pipelined CORDIC design on FPGA for a digital sine and cosine waves generator,” in Int. Conf. on Electr. Electron. Eng., ICEEE’06, Sept. 2006, pp. 1–4.

[45] H. Dawid and H. Meyr, “The differential CORDIC algorithm: Constant scale factor redundant implementation without correcting iterations,” IEEE Trans. Computers, vol. 45, no. 3, pp. 307–318, Mar. 1996.

[46] C. Y. Kang and E. E. Swartzlander, Jr., “Digit-pipelined direct digital frequency synthesis based on differential CORDIC,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 5, pp. 1035–1044, May 2006.

[47] R. I. Hartley, “Subexpression sharing in filters using canonic signed digit multipliers,” IEEE Trans. Circuits Syst. II: Analog Digital Signal Process., vol. 43, no. 10, pp. 677–688, Oct. 1996.

[48] O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and L. Wanhammar, “Simplified design of constant coefficient multipliers,” J. Circuits, Syst., Signal Process., vol. 25, no. 2, pp. 225–251, Apr. 2006.

[49] G. Gilbert, D. Al-Khalili, and C. Rozon, “Optimized distributed processing of scaling factor in CORDIC,” in 3rd Int. IEEE-NEWCAS Conf., June 2005, pp. 35–38.

[50] A. M. Despain, “Fourier transform computers using CORDIC iterations,” IEEE Trans. Computers, vol. C-23, no. 10, pp. 993–1001, Oct. 1974.

[51] P. Dewilde, E. F. Deprettere, and R. Nouta, “Parallel and pipelined VLSI implementation of signal processing algorithms,” in VLSI and Modern Signal Processing, S. Y. Kung, H. J. Whitehouse, and T. Kailath, Eds. Englewood Cliffs, NJ: Prentice-Hall, 1995.

[52] K. J. Jones, “High-throughput, reduced hardware systolic solution to prime factor discrete Fourier transform algorithm,” Proc. IEE Computers and Digital Techn., vol. 137, no. 3, pp. 191–196, May 1990.

[53] P. K. Meher, J. K. Satapathy, and G. Panda, “Efficient systolic solution for a new prime factor discrete Hartley transform algorithm,” Proc. IEE Circuits, Devices & Syst., vol. 140, no. 2, pp. 135–139, Apr. 1993.

[54] Z.-X. Lin and A.-Y. Wu, “Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for scaling-free high-performance rotational operations,” in IEEE Int. Conf. on Acoust., Speech, Signal Process., ICASSP’03, Apr. 2003, vol. 2, pp. 653–656.

[55] C.-H. Lin and A.-Y. Wu, “Mixed-scaling-rotation CORDIC (MSR-CORDIC) algorithm and architecture for high-performance vector rotational DSP applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 11, pp. 2385–2396, Nov. 2005.

[56] C.-L. Yu, T.-H. Yu, and A.-Y. Wu, “On the fixed-point properties of mixed-scaling-rotation CORDIC algorithm,” in IEEE Workshop on Signal Process. Syst., Oct. 2007, pp. 430–435.

[57] A. S. Dhar and S. Banerjee, “An array architecture for fast computation of discrete Hartley transform,” IEEE Trans. Circuits Syst., vol. 38, no. 9, pp. 1095–1098, Sep. 1991.

[58] K. Maharatna, A. Troya, S. Banerjee, and E. Grass, “New virtually scaling free adaptive CORDIC rotator,” Proc. IEE Computers and Digital Techn., vol. 151, no. 6, pp. 448–456, Nov. 2004.

[59] K. Maharatna, S. Banerjee, E. Grass, M. Krstic, and A. Troya, “Modified virtually scaling free adaptive CORDIC rotator algorithm and architecture,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 11, pp. 1463–1474, Nov. 2005.

[60] J. Villalba, T. Lang, and E. Zapata, “Parallel compensation of scale factor for the CORDIC algorithm,” J. VLSI Signal Process., vol. 19, no. 3, pp. 227–241, Aug. 1998.

[61] D. Timmermann, H. Hahn, B. J. Hosticka, and B. Rix, “A new addition scheme and fast scaling factor compensation methods for CORDIC algorithms,” Integration, the VLSI J., vol. 11, no. 1, pp. 85–100, Mar. 1991.

[62] Y. H. Hu, “The quantization effects of the CORDIC algorithm,” IEEE Trans. Signal Process., vol. 40, no. 4, pp. 834–844, Apr. 1992.

[63] K. Kota and J. R. Cavallaro, “Numerical accuracy and hardware tradeoffs for CORDIC arithmetic for special-purpose processors,” IEEE Trans. Computers, vol. 42, no. 7, pp. 769–779, July 1993.

[64] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal processing,” IEEE Signal Process. Mag., vol. 9, no. 3, pp. 16–35, July 1992.

[65] F. Angarita, A. Perez-Pascual, T. Sansaloni, and J. Valls, “Efficient FPGA implementation of CORDIC algorithm for circular and linear coordinates,” in Int. Conf. on Field Programmable Logic and Appl., Aug. 2005, pp. 535–538.

[66] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.

[67] G. Lightbody, R. Woods, and R. Walke, “Design of a parameterizable silicon intellectual property core for QR-based RLS filtering,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 4, pp. 659–678, Aug. 2003.

[68] J. R. Cavallaro and F. T. Luk, “CORDIC arithmetic for a SVD processor,” J. Parallel and Distributed Computing, vol. 5, pp. 271–290, 1988.

[69] J. M. Delosme, “A processor for two-dimensional symmetric eigenvalue and singular value arrays,” in IEEE 21st Asilomar Conf. on Circuits, Syst., and Computers, Nov. 1987, pp. 217–221.

[70] Z. Liu, K. Dickson, and J. V. McCanny, “Application-specific instruction set processor for SoC implementation of modern signal processing algorithms,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 4, pp. 755–765, Apr. 2005.

[71] K. J. Jones, “2D systolic solution to discrete Fourier transform,” Proc. IEE Computers and Digital Techn., vol. 136, no. 3, pp. 211–216, May 1989.

[72] T.-Y. Sung, “Memory-efficient and high-speed split-radix FFT/IFFT processor based on pipelined CORDIC rotations,” Proc. IEE Vision, Image Signal Process., vol. 153, no. 4, pp. 405–410, Aug. 2006.

[73] L.-W. Chang and S.-W. Lee, “Systolic arrays for the discrete Hartley transform,” IEEE Trans. Signal Process., vol. 39, no. 11, pp. 2411–2418, Nov. 1991.

[74] P. K. Meher and G. Panda, “Novel recursive algorithm and highly compact semisystolic architecture for high throughput computation of 2-D DHT,” Electron. Lett., vol. 29, no. 10, pp. 883–885, May 1993.

[75] W.-H. Chen, C. H. Smith, and S. C. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Trans. Commun., vol. 25, no. 9, pp. 1004–1009, Sep. 1977.

[76] B. Das and S. Banerjee, “Unified CORDIC-based chip to realise DFT/DHT/DCT/DST,” Proc. IEE Computers and Digital Techniques, vol. 149, no. 4, pp. 121–127, July 2002.

[77] J.-H. Hsiao, L.-G. Chen, T.-D. Chiueh, and C.-T. Chen, “High throughput CORDIC-based systolic array design for the discrete cosine transform,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 3, pp. 218–225, June 1995.

[78] D. C. Kar and V. V. B. Rao, “A CORDIC-based unified systolic architecture for sliding window applications of discrete transforms,” IEEE Trans. Signal Process., vol. 44, no. 2, pp. 441–444, Feb. 1996.

[79] Y. H. Hu and S. Naganathan, “A novel implementation of chirp Z-transform using a CORDIC processor,” IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 2, pp. 352–354, Feb. 1990.

[80] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms and Applications. Upper Saddle River, NJ: Prentice-Hall, 1996.

[81] R. C. Gonzalez, Digital Image Processing, 3rd ed. Upper Saddle River, NJ: Prentice Hall, 2008.

[82] N. Guil, J. Villalba, and E. L. Zapata, “A fast Hough transform for segment detection,” IEEE Trans. Image Process., vol. 4, no. 11, pp. 1541–1548, Nov. 1995.

[83] S. M. Bhandarkar and H. Yu, “VLSI implementation of real-time image rotation,” in Int. Conf. on Image Process., Sept. 1996, vol. 2, pp. 1015–1018.

[84] S. Sathyanarayana, S. R. Kumar, and S. Thambipillai, “Unified CORDIC based processor for image processing,” in 15th Int. Conf. on Digital Signal Process., July 2007, pp. 343–346.

[85] J. Valls, T. Sansaloni, A. Perez-Pascual, V. Torres, and V. Almenar, “The use of CORDIC in software defined radios: A tutorial,” IEEE Commun. Mag., vol. 44, no. 9, 2006.

[86] L. Cordesses, “Direct digital synthesis: A tool for periodic wave generation (part 1),” IEEE Signal Process. Mag., vol. 21, no. 4, 2004.

[87] J. Vankka, Digital Synthesizers and Transmitters for Software Radio. Dordrecht, Netherlands: Springer, 2005.

[88] J. Vankka, “Methods of mapping from phase to sine amplitude in direct digital synthesis,” in 50th IEEE International Frequency Control Symposium, Jun. 1996, pp. 942–950.

[89] F. Cardells-Tormo and J. Valls-Coquillat, “Optimisation of direct digital frequency synthesisers based on CORDIC,” Electron. Lett., vol. 37, no. 21, 2001.

[90] M. E. Frerking, Digital Signal Processing in Communication Systems. New York: Van Nostrand Reinhold, 1994.

[91] H. Meyr, M. Moeneclaey, and S. A. Fechtel, Digital Communication Receivers: Synchronization, Channel Estimation, and Signal Processing. New York: Wiley, 1998.

[92] F. Cardells, J. Valls, V. Almenar, and V. Torres, “Efficient FPGA-based QPSK demodulation loops: Application to the DVB standard,” Lecture Notes in Computer Sci., vol. 2438, pp. 102–111, 2002.

[93] C. Dick, F. Harris, and M. Rice, “FPGA implementation of carrier synchronization for QAM receivers,” J. VLSI Signal Process., vol. 36, pp. 57–71, 2004.

[94] M. J. Canet, F. Vicedo, V. Almenar, and J. Valls, “FPGA implementation of an IF transceiver for OFDM-based WLAN,” in IEEE Workshop on Signal Process. Syst., SIPS’04, 2004, pp. 227–232.

[95] A. Troya, K. Maharatna, M. Krstic, E. Grass, U. Jagdhold, and R. Kraemer, “Low-power VLSI implementation of the inner receiver for OFDM-based WLAN systems,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 55, no. 2, pp. 672–686, Mar. 2008.

[96] Z. Guo and P. Nilsson, “A VLSI architecture of the square root algorithm for V-BLAST detection,” J. VLSI Signal Process. Syst., vol. 44, no. 3, pp. 219–230, Sept. 2006.

[97] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, “Analysis and implementation of multiple-input, multiple-output VBLAST receiver from area and power efficiency perspective,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 11, pp. 1281–1286, Nov. 2006.

[98] F. Sobhanmanesh and S. Nooshabadi, “Parametric minimum hardware QR-factoriser architecture for V-BLAST detection,” Proc. IEE Circuits, Devices and Syst., vol. 153, no. 5, pp. 433–441, Oct. 2006.

[99] C. M. Rader, “VLSI systolic arrays for adaptive nulling [radar],” IEEE Signal Process. Mag., vol. 13, no. 4, pp. 29–49, July 1996.

[100] R. Hamill, J. V. McCanny, and R. L. Walke, “Online CORDIC algorithm and VLSI architecture for implementing QR-array processors,” IEEE Trans. Signal Process., vol. 48, no. 2, pp. 592–598, Feb. 2000.

[101] C. Lee and P. Chang, “A maximum pipelined CORDIC architecture for inverse kinematic position computation,” IEEE Trans. Robot. Autom., vol. RA-3, no. 5, pp. 445–458, 1987.

[102] R. Harber, J. Li, X. Hu, and S. Bass, “The application of bit-serial CORDIC computational units to the design of inverse kinematics processors,” in Proc. IEEE Int. Conf. on Robot. and Autom., 1988, pp. 1152–1157.

[103] C. Krieger and B. Hosticka, “Inverse kinematics computations with modified CORDIC iterations,” Proc. IEE Computers and Digital Techn., vol. 143, pp. 87–92, Jan. 1996.

[104] Y. Wang and S. Butner, “A new architecture for robot control,” in Proc. IEEE Int. Conf. on Robot. Autom., 1987, pp. 664–670.

[105] Y. Wang and S. Butner, “RIPS: A platform for experimental real-time sensory-based robot control,” IEEE Trans. Syst., Man, Cybern., vol. 19, pp. 853–860, 1989.

[106] I. Walker and J. Cavallaro, “Parallel VLSI architectures for real-time kinematics of redundant robots,” in Proc. IEEE Int. Conf. on Robot. Autom., 1993, pp. 870–877.

[107] M. Kameyama, T. Amada, and T. Higuchi, “Highly parallel collision detection processor for intelligent robots,” IEEE J. Solid-State Circuits, vol. 27, pp. 300–306, 1992.

[108] T. Lang and E. Antelo, “High-throughput CORDIC-based geometry operations for 3D computer graphics,” IEEE Trans. Computers, vol. 54, no. 3, pp. 347–361, Mar. 2005.

[109] B. Phong, “Illumination for computer generated pictures,” Commun. ACM, pp. 311–317, June 1975.

[110] J. Euh, J. Chittamuru, and W. Burleson, “CORDIC vector interpolator for power-aware 3D computer graphics,” in IEEE Workshop on Signal Process. Syst., SIPS’02, Oct. 2002, pp. 240–245.

Pramod Kumar Meher (SM’03) received the M.Sc. and M.Phil. degrees in physics, in 1978 and 1981, respectively, and the Ph.D. degree in science, in 1996, all from Sambalpur University, Sambalpur, India.

Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty member with the School of Computer Engineering, Nanyang Technological University, Singapore. From 1997 to 2002, he was a Professor of Computer Application with Utkal University, Bhubaneswar, India, and from 1993 to 1997, he was a Reader in electronics with Berhampur University, Berhampur, India. His research interests mainly include the design and optimization of dedicated and reconfigurable architectures for computation-intensive algorithms, arithmetic circuits, and algorithm-architecture co-design for digital signal processing and communication.

Dr. Meher is a Fellow of IET (UK) and of IETE (India). He is currently serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and The Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for the year 1999.

Javier Valls (M’02) received the telecommunication engineering degree from the Universidad Politecnica de Cataluna, Spain, and the Ph.D. degree in telecommunication engineering from the Universidad Politecnica de Valencia, Spain, in 1993 and 1999, respectively.

He has been with the Department of Electronics at Universidad Politecnica de Valencia, Valencia, Spain, since 1993, where he is currently an Associate Professor. His current research interests include the design of FPGA-based systems, computer arithmetic, VLSI signal processing, and digital communications.

Tso-Bing Juang (M’99) received the Ph.D. degree in computer science and engineering from National Sun Yat-sen University, Kaohsiung, Taiwan, in 2004. Since 2006, he has been with the Department of Computer Science and Information Engineering, National Pingtung Institute of Commerce, Taiwan, and is currently an Assistant Professor. His research interests include computer arithmetic and VLSI systems.

Dr. Juang received the Best Thesis Award from Taiwan Xerox’s foundation in 1995. Currently he serves as a reviewer for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, IEEE TRANSACTIONS ON COMPUTERS, and IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS.

K. Sridharan (S’84-M’96-SM’01) received the Ph.D. degree from Rensselaer Polytechnic Institute, Troy, NY, in 1995.

He was an Assistant Professor at the Indian Institute of Technology (IIT) Guwahati, India, from 1996 to 2001. Since June 2001, he has been with IIT Madras, Chennai, India, where he is presently an Associate Professor. He was a visiting staff member at NTU, Singapore, in 2000–2001 and 2006–2008. He has supervised three Ph.D. candidates, and holds one joint patent. He has published more than 60 papers in various international journals and conferences.

Dr. Sridharan received the Computer Engineering Division Prize for a paper published in the Journal of I.E. (India) in 2002. He was also a joint recipient of the IEEE Vincent Bendix Award.

Koushik Maharatna (M’02) received the M.Sc. degree in electronic science from Calcutta University, in 1995 and the Ph.D. degree from Jadavpur University, Calcutta, India, in 2002.

From 2000 to 2003, he was a Research Scientist with IHP, Frankfurt (Oder), Germany, where he was mainly involved in the design of a single-chip modem for the IEEE 802.11a standard. He has been with the Electronics Systems and Devices Group at the School of Electronics and Computer Science of the University of Southampton, U.K., as a Senior Lecturer since 2006. His research interests include development of VLSI architectures for DSP and communication applications, computer arithmetic, low-power digital design, analog signal processing, and CNN.

Dr. Maharatna has served as session chair, reviewer and review committee management member in several IEEE conferences and journals. He is currently a member of the Engineering and Physical Sciences Research Council (EPSRC) college in the U.K. He is also a member of the VLSI System Application (VSA) Technical Committee of the IEEE Circuits and Systems Society.
