Upload
phungque
View
248
Download
0
Embed Size (px)
Citation preview
Research ArticleHigh Level Synthesis FPGA Implementation ofthe Jacobi Algorithm to Solve the Eigen Problem
Ignacio Bravo1 Ceacutesar Vaacutezquez1 Alfredo Gardel1 Joseacute L Laacutezaro1 and Esther Palomar2
1Electronics Department University of Alcala Alcala de Henares 28805 Madrid Spain2School of Computing Telecommunications and Networks Birmingham City University Millennium Point Birmingham B4 7XG UK
Correspondence should be addressed to Ignacio Bravo ibravodepecauahes
Received 2 December 2014 Revised 19 January 2015 Accepted 19 January 2015
Academic Editor Jose R C Piqueira
Copyright copy 2015 Ignacio Bravo et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
We present a hardware implementation of the Jacobi algorithm to compute the eigenvalue decomposition (EVD)The computationof eigenvalues and eigenvectors has many applications where real time processing is required and thus hardware implementationsare often mandatory Some of these implementations have been carried out with field programmable gate array (FPGA) devicesusing low level register transfer level (RTL) languages In the present study we used the Xilinx Vivado HLS tool to develop a highlevel synthesis (HLS) design and evaluated different hardware architectures After analyzing the design for different input matrixsizes and various hardware configurations we compared it with the results of other studies reported in the literature concludingthat although resource usage may be higher when HLS tools are used the design performance is equal to or better than low levelhardware designs
1 Introduction
Eigenvalue calculation is a problem that arises in many fieldsof science and engineering such as computer vision [1]business and finance [2] power electronics [3] and wirelesssensor networks [4] In these applications eigenvalues arenormally used in a decision process so real time performanceis often required
Since eigenvalue computation is usually based on opera-tions such as matrix multiplication andmatrix inversion anda large amount of data must be handled it is computationallyexpensive and can represent a potential bottleneck for the restof the algorithm
The algorithms involved in eigenvalue and eigenvectorcomputations are generally iterative and present strong datadependencies between iterations however many similaroperations can be carried out in parallel in the same iterationsuch as multiply and accumulate in matrix products
Although this is not important when working withgeneral purpose processors or more specialized digital signalprocessors (DSPs) some devices leverage this feature byallowing the execution of repetitive operations in parallel
For example graphic processing units (GPUs) are usefulwhen a large number of multiply and accumulate operationsare involved since they are designed to work in graphicsprocessing which is based on vector and matrix operations
For similar reasons field programmable gate arrays(FPGAs) display a remarkable capacity to carry out repetitiveoperations in parallel
There are many applications where matrices are real andsymmetric with sizes not exceeding 20 times 20 [5 6] In thissituation EVD computation is easier and there are manyalgorithms that take advantage of this
Some of the applications where the symmetric eigenvalueproblem arises include the principal component analysis(PCA) statistical technique used for dimension reductionin data analysis [1 7] and the multiple signal classification(MUSIC) algorithm [8] which uses a covariance matrix(which is real and symmetric by definition) In these studiesFPGA devices were employed to solve the problem using aspecialized hardware module to compute the EVD
As mentioned previously most of the methods used tosolve the eigen problem are iterative Depending on howthe operations are performed on the initial matrix and
Hindawi Publishing CorporationMathematical Problems in EngineeringVolume 2015 Article ID 870569 11 pageshttpdxdoiorg1011552015870569
2 Mathematical Problems in Engineering
the results obtained a distinction can be made betweenpurely iterative methods (Jacobi power iterations) [9] andalgorithms that perform some kind of transformation of theinitial matrix (Lanczos-bisection householder-QR) [10 11]A distinction can also be made between algorithms thatoutput all the eigenvalues (Jacobi QR) and those that onlycompute extremal eigenvalues or the one closest to a givenvalue (power iterations Lanczos)
In the studies cited above the main method employedfor eigenvalue and eigenvector computation was the Jacobialgorithm since its characteristics render it highly suitablefor a parallel implementation The Jacobi algorithm can beimplemented in such a way that there are almost no datadependencies between operations in the same iteration and itis possible to use systolic architectures as has been proposedby Brent et al [12] In addition all the operations involvedcan be carried out by using the CORDIC algorithm [13] ashas been shown by Cavallaro and Luk in [14] and this canbe implemented simply with add and shift operations whichonly use a small amount of hardware resources compared tomultiplication and division operations Another advantage ofthe Jacobi method is that its round-off error is very stable [9]
However there are other methods for computingextremal eigenvalues that can take advantage of thecomputing power provided by FPGAs such as the QRalgorithm or the Lanczos [15] method since some of theiroperations can be carried out in parallel In this study we alsopresent an FPGA floating point implementation of the QRalgorithm In terms of a practical FPGA implementation it isa challenging task to make efficient use of the availableresources while at the same time meeting real timerequirements This is mainly because the analysis ofalgorithm data dependencies and their parallelization isextremely difficult
Furthermore FPGA implementations are usually carriedout in register transfer level (RTL) languages where there is avery low level of abstraction and it is therefore necessary tomanage resources and timing carefully This also implies thatwhen some architectural decisions are made it is difficult togo back without recoding most of the system Neverthelessthis kind of system can be very efficient in terms of the totalamount of resources used
When an algorithm with relevant mathematical work-load such as the ones mentioned before has to be imple-mented in an FPGA a great part of the design processconsist in the partition of the system in small units toachieve the required grade of concurrency Design time canbe shortened by using a high level synthesis (HLS) tool Thistool attempts to raise the abstraction level allowing algorithmspecification by means of a high level programming languagesuch as C or C++ Thanks to this the designer can focuson the algorithm itself while the tool makes the software tohardware conversion guided by some user directives whichrange from resource usage to concurrency grade Some ofthe tools that have become popular in the recent years areImpulse C [16] Catapult C by Calypto [17] and Vivado HLSby Xilinx [18]
Hardware implementations of the Jacobi algorithm arevery popular because its characteristics permit different
architectural designs depending on the required perfor-mance In the literature there are systolic implementationswhere speed is themost important requirement [19 20] serialdesigns where low resource consumption is selected [1 21]and a meet in the middle approach (semiparallel) where bothcriteria are balanced [8] In this study the Jacobi algorithmwas implemented using the Xilinx Vivado HLS tool toexplore different hardware implementations and comparethem to select the most efficient one in terms of executiontime and FPGA resources used
In this study the Jacobi algorithmwas implemented usingthe Xilinx Vivado HLS tool to explore different hardwareimplementations and compare them to select the mostefficient one in terms of execution time and FPGA resourcesused
To gain a better understanding of the results obtained wealso carried out a floating point implementation of the QRalgorithm using Vivado HLS The results obtained were thencompared to similar systems described in the literature
The rest of the paper is organized as follows In Section 2the mathematical background behind the Jacobi and QRalgorithms is presented together with some details relevantto implementation The proposed implementations and itsparameters are discussed in Sections 3 and 4 respectivelyFinally the system performance is compared with otherdesign approaches in Section 5 and we present our conclu-sions in Section 6
2 Mathematical Background
The eigenvalue problem [22] is one of the main questions innumerical algebra and due to themany applications in whichit is useful it has attracted much research attention in recentyears [23ndash25] The formulation of the problem is simple (1)but it can be approached from very different angles
119860V = 120582V (1)where V and 120582 are an eigenvalue and an eigenvector respec-tively of a matrix 119860 isin R119899times119899 Although there is an analyticalsolution it involves the calculation of polynomial roots andtherefore it is only suitable for low order (119899) matrices (119899 lt 4)When the problem has a higher order numerical algorithmsmust be used In this section the mathematical backgroundof Jacobi and QR algorithms is presented when they areused to compute the eigenvalues from real and symmetricmatrices
21 The Jacobi Algorithm The Jacobi algorithm [26] wasfirst proposed in 1846 but did not become popular until itscomputational implementation was explored The singularvalue decomposition (SVD) of a matrix 119860 isin R119899times119899 is givenby
119860 = 119880119863119881119879
(2)where 119880 and 119881 are orthogonal matrices and 119863 is a diagonalmatrix storing the singular values of119860 on its diagonal For thecase where 119860 is real and symmetric we can rewrite (2) as
119860 = 119881119863119881119879
(3)
Mathematical Problems in Engineering 3
The idea behind the Jacobi method is to perform aseries of similarity transformations (ie before and aftermultiplication with an orthogonal matrix) of 119860 to renderit more diagonal in each iteration This can be seen as aniterative expression
119860(119896+1)
= 119881(119896)
119860(119896)
119881(119896)119879
(4)
where 119896 represents the current iteration and 119896+1 the next oneThe sequence of transformation matrices is chosen in such away that the doublemultiplication renders the updated119860(119896+1)
a little more diagonal than its predecessor 119860(119896) A popularchoice for 119881(119896) is the Givens rotation matrix (5) Consider
119866 (119894 119895 120572) =
[[[[[[[[[[[[
[
1 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 0
d
0 sdot sdot sdot 119862120572sdot sdot sdot 119878120572sdot sdot sdot 0
d
0 sdot sdot sdot minus119878120572sdot sdot sdot 119862
120572sdot sdot sdot 0
d
0 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 1
]]]]]]]]]]]]
]
(5)
where 119862120572and 119878
120572are the sine and cosine of 120572 respectively
Parallelismof the algorithm resides in this choice If the119881119860119881119879product is studied it can easily be seen that only four elementsof 119860(119896) are modified so multiple similarity transformationscan be carried at the same time
Eigenvectors of 119860 are obtained by accumulating therotation matrices used in eigenvalue computation as shownin the following expression
119881 =
ℎ
prod
119896=1
119881(119896)
where 119881(1) = 119868 (6)
To take full advantage of this Brent andLuk [19] proposeda systolic architecture in which the initial matrix 119860
119899times119899is
mapped onto a set of 2times2matrices such as the one shown inexpression (7) This leaves a subset of 2 times 2 diagonalizationproblems which can easily be solved in hardware makinguse of theCORDIC (COordinate RotationDIgital Computer)algorithm [27] as shown in [14]
119860119897119898= [1198862119897minus12119898minus1
1198862119897minus12119898
11988621198972119898minus1
11988621198972119898
] (7)
Since 1198992 diagonalization problems now have to besolved 1198992 rotation angles (120572
119897) are also required Given that
119860 is symmetric the angle is selected from the submatriceswhere 119897 = 119898 according to expression (8) so 119899 elements(where 119899 is the order of119860) can be eliminated in each iterationAs will now be shown angle calculation can also be carriedout by the CORDIC algorithm making an efficient use ofresources in the implementation
tan (2120572) =1198862119897minus12119898
11988621198972119898
minus 1198862119897minus12119898minus1
(8)
After each update of 119860(119896+1) it is necessary to bring newnonzero elements near the main diagonal so that they can beeliminated in the next iteration repeating the entire processuntil 119860(119896+1) = 119863 To this end several rows and columns areinterchanged as it will be explained in Section 3
As mentioned before the CORDIC algorithm [13 27]can be used to solve the double rotation (4) on everysubmatrix119860
119897119898and to calculate the rotation angles One great
advantage of CORDIC is that it can be expressed by meansof shift and add operations which are easy to implement inreconfigurable hardware It also facilitates working with fixedpoint codification where the word length is not fixed to 32 or64 bits as occurswith the floating point standard thusmakingmore efficient use of FPGA resources For example in thepresent study the entire system was fixed to a word lengthof 18 bits since this is the entry word length of Xilinx FPGAmultipliers
The fundamental idea behind CORDIC is to carry out aseries of rotations (called microrotations) on an input vectorto obtain another vector as outputThis is performed in threedifferent coordinate systems (linear circular and hyperbolic)and it is possible to obtain various basic functions suchas multiplication trigonometric operations logarithms andexponentials [28] in the components of the output vectorgiven some initial input vector conditions [27]Microrotationvalues are selected as shown in the following expression
tan (120572119894) = 120575119894 where 120575
119894= 2minus119894
(9)
The choice of 120575119894as an inverse power of 2 is what will allow us
to implement the algorithm using shifts since multiplicationor division for a power of 2 is equivalent to a shift to the leftor the right respectively in binary codification
In the circular coordinate system it is possible to obtainthe two operations required for the Jacobi algorithm Givensrotation and angle calculation (arctangent) Since CORDICperforms a series of rotations on a vector it can be seen as aniterative expression (11) Consider
V119894+1= [
1 minus1205901198942minus119894
1205901198942minus119894
1] V119894 V = [119909
119910] (10)
119911119894+1= 119911119894minus 120590119894120572119894 (11)
where V is a vector determined by its vertical and horizontalcomponents (119909 and 119910) and 119911 is a variable that keeps trackof the total rotated angle In each iteration the direction ofthe rotation 120590
119894is determined by the operation that is to be
performed In the particular case of the circular coordinatesystem the two algorithm operation modes are as follows
(i) Rotation mode the algorithm takes a vector V0=
(1199090 1199100) and an angle 119911
0and outputs a new vector
V = (119909 119910) which corresponds to V0rotated 119911 radians
To do so 120590119894is selected in such a way that 119911 approaches
zero in each iteration When 119911 cong 0 it can be assumedthat the starting vector has been fully rotated a 119911 angle
(ii) Translation mode algorithm input is again a vectorV0= (1199090 1199100) whose phase angV
0we want to compute
4 Mathematical Problems in Engineering
To do so the vector is rotated until its 119910 componentis zero selecting 120590
119894accordingly Since 119911 accumulates
microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0
If this is expressed mathematically in rotation mode 120590119894=
sign(119911119894) the relation between the input and the output is as
follows
[119909
119910] = 119870
1[cos (119911
0) sin (119911
0)
minus sin (1199110) cos (119911
0)] [1199090
1199100
] (12)
On the other hand if the algorithm works in vectoringmode 120590
119894must be computed as 120590
119894= minus sign(119910
119894) and the relation
between the input and the output is now as shown in thefollowing expression
119909 = 1198701radic1199092
0+ 1199102
0
119911 = 1199110+ atan
1199100
1199090
(13)
As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870
1 This is due to the
nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870
1 In addition 119870
1(14) is known in
advance and only varies with the number of microrotationsperformed Since 119870
1is fixed 1119870
1can be precomputed and
stored in a memory element to correct the output data
1198701=
119899minus1
prod
119894=0
radic1 + 1205902
119894sdot 2minus2119894 (14)
22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS
In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression
119879 =
[[[
[
119886111988720 0
1198872119886211988730
0 119887311988631198874
0 0 11988741198864
]]]
]
(15)
To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887
2 119887
119899entries are zeroed When for example
1198872= 0 the row and the column involved can be removed
where 1198861is an eigenvalue of 119879
To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows
119879(119896)
= 119876(119896)
119877(119896)
(16)
where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as
119879(119896+1)
= 119877(119896)
119876(119896)
(17)
As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that
119879(119896)
minus 120590119868 = 119876(119896)
119877(119896)
119879(119896+1)
= 119877(119896)
119876(119896)
+ 120590119868
(18)
After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886
119899 This is called the Wilkinson shift
[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887
119894element
119864 = [119886119899minus1
119887119899
119887119899119886119899
] (19)
3 Implementation Details
As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications
Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms
This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following
(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible
(ii) Unroll which will implement each iteration as aseparate hardware instance
(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories
As mentioned before we can also apply resource con-straints to the designs limiting the number of instances
Mathematical Problems in Engineering 5
of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later
Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier
When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized
Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)
119895) As previouslymentioned
this was achieved using the CORDIC algorithm in vectoringmode
Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation
Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices
This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible
Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices
Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system
All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total
number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values
Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum
clock frequency) are the parameters that we attempt tooptimize
(i) the latency (119879119871) which specifies how much time will
take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the
throughput or the number of clock cycles betweentwo valid sets of results after the latency period
As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important
(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations
(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible
(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture
Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture
It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm
The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both
6 Mathematical Problems in Engineering
S A
V Intimesnk = 1
120572(k)j = 05tans(k)2jminus12j
s(k)2j2j minus s(k)2jminus12jminus1
Sk+1 Sk+1aux Vk+1 Vk+1aux
(k+1) = V(k)ij middot R(120572j)
i = 1 n2 j = 1 n2 j = i n2i = 1 n2
(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)
k = h
(a)
(b)
(c2)
(d)
(c1)
Result rearrange
No
[ ]
Yes
A Rntimesnisin
Vaux 119894119895 Saux 119894119895
larrminus
larrminus
larrminus larrminus
rarr
rarrVlarrminus
1205821 120582n diag(S)
lceilrceil lceilrceil
larrminus
1205821 120582n
j = 1 n2
Figure 1 Jacobi algorithm functional diagram
semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained
First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2
An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded
This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously
indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation
From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules
The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879
119871gt II This means that the first valid result will be
obtained after 119879119871clock cycles but the next results will be
obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown
in Figure 3 This time only the latency is shown since there
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
2 Mathematical Problems in Engineering
the results obtained a distinction can be made betweenpurely iterative methods (Jacobi power iterations) [9] andalgorithms that perform some kind of transformation of theinitial matrix (Lanczos-bisection householder-QR) [10 11]A distinction can also be made between algorithms thatoutput all the eigenvalues (Jacobi QR) and those that onlycompute extremal eigenvalues or the one closest to a givenvalue (power iterations Lanczos)
In the studies cited above the main method employedfor eigenvalue and eigenvector computation was the Jacobialgorithm since its characteristics render it highly suitablefor a parallel implementation The Jacobi algorithm can beimplemented in such a way that there are almost no datadependencies between operations in the same iteration and itis possible to use systolic architectures as has been proposedby Brent et al [12] In addition all the operations involvedcan be carried out by using the CORDIC algorithm [13] ashas been shown by Cavallaro and Luk in [14] and this canbe implemented simply with add and shift operations whichonly use a small amount of hardware resources compared tomultiplication and division operations Another advantage ofthe Jacobi method is that its round-off error is very stable [9]
However there are other methods for computingextremal eigenvalues that can take advantage of thecomputing power provided by FPGAs such as the QRalgorithm or the Lanczos [15] method since some of theiroperations can be carried out in parallel In this study we alsopresent an FPGA floating point implementation of the QRalgorithm In terms of a practical FPGA implementation it isa challenging task to make efficient use of the availableresources while at the same time meeting real timerequirements This is mainly because the analysis ofalgorithm data dependencies and their parallelization isextremely difficult
Furthermore FPGA implementations are usually carriedout in register transfer level (RTL) languages where there is avery low level of abstraction and it is therefore necessary tomanage resources and timing carefully This also implies thatwhen some architectural decisions are made it is difficult togo back without recoding most of the system Neverthelessthis kind of system can be very efficient in terms of the totalamount of resources used
When an algorithm with relevant mathematical work-load such as the ones mentioned before has to be imple-mented in an FPGA a great part of the design processconsist in the partition of the system in small units toachieve the required grade of concurrency Design time canbe shortened by using a high level synthesis (HLS) tool Thistool attempts to raise the abstraction level allowing algorithmspecification by means of a high level programming languagesuch as C or C++ Thanks to this the designer can focuson the algorithm itself while the tool makes the software tohardware conversion guided by some user directives whichrange from resource usage to concurrency grade Some ofthe tools that have become popular in the recent years areImpulse C [16] Catapult C by Calypto [17] and Vivado HLSby Xilinx [18]
Hardware implementations of the Jacobi algorithm arevery popular because its characteristics permit different
architectural designs depending on the required perfor-mance In the literature there are systolic implementationswhere speed is themost important requirement [19 20] serialdesigns where low resource consumption is selected [1 21]and a meet in the middle approach (semiparallel) where bothcriteria are balanced [8] In this study the Jacobi algorithmwas implemented using the Xilinx Vivado HLS tool toexplore different hardware implementations and comparethem to select the most efficient one in terms of executiontime and FPGA resources used
In this study the Jacobi algorithmwas implemented usingthe Xilinx Vivado HLS tool to explore different hardwareimplementations and compare them to select the mostefficient one in terms of execution time and FPGA resourcesused
To gain a better understanding of the results obtained wealso carried out a floating point implementation of the QRalgorithm using Vivado HLS The results obtained were thencompared to similar systems described in the literature
The rest of the paper is organized as follows In Section 2the mathematical background behind the Jacobi and QRalgorithms is presented together with some details relevantto implementation The proposed implementations and itsparameters are discussed in Sections 3 and 4 respectivelyFinally the system performance is compared with otherdesign approaches in Section 5 and we present our conclu-sions in Section 6
2 Mathematical Background
The eigenvalue problem [22] is one of the main questions innumerical algebra and due to themany applications in whichit is useful it has attracted much research attention in recentyears [23ndash25] The formulation of the problem is simple (1)but it can be approached from very different angles
119860V = 120582V (1)where V and 120582 are an eigenvalue and an eigenvector respec-tively of a matrix 119860 isin R119899times119899 Although there is an analyticalsolution it involves the calculation of polynomial roots andtherefore it is only suitable for low order (119899) matrices (119899 lt 4)When the problem has a higher order numerical algorithmsmust be used In this section the mathematical backgroundof Jacobi and QR algorithms is presented when they areused to compute the eigenvalues from real and symmetricmatrices
21 The Jacobi Algorithm The Jacobi algorithm [26] wasfirst proposed in 1846 but did not become popular until itscomputational implementation was explored The singularvalue decomposition (SVD) of a matrix 119860 isin R119899times119899 is givenby
119860 = 119880119863119881119879
(2)where 119880 and 119881 are orthogonal matrices and 119863 is a diagonalmatrix storing the singular values of119860 on its diagonal For thecase where 119860 is real and symmetric we can rewrite (2) as
119860 = 119881119863119881119879
(3)
Mathematical Problems in Engineering 3
The idea behind the Jacobi method is to perform aseries of similarity transformations (ie before and aftermultiplication with an orthogonal matrix) of 119860 to renderit more diagonal in each iteration This can be seen as aniterative expression
119860(119896+1)
= 119881(119896)
119860(119896)
119881(119896)119879
(4)
where 119896 represents the current iteration and 119896+1 the next oneThe sequence of transformation matrices is chosen in such away that the doublemultiplication renders the updated119860(119896+1)
a little more diagonal than its predecessor 119860(119896) A popularchoice for 119881(119896) is the Givens rotation matrix (5) Consider
119866 (119894 119895 120572) =
[[[[[[[[[[[[
[
1 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 0
d
0 sdot sdot sdot 119862120572sdot sdot sdot 119878120572sdot sdot sdot 0
d
0 sdot sdot sdot minus119878120572sdot sdot sdot 119862
120572sdot sdot sdot 0
d
0 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 1
]]]]]]]]]]]]
]
(5)
where 119862120572and 119878
120572are the sine and cosine of 120572 respectively
Parallelismof the algorithm resides in this choice If the119881119860119881119879product is studied it can easily be seen that only four elementsof 119860(119896) are modified so multiple similarity transformationscan be carried at the same time
Eigenvectors of 119860 are obtained by accumulating therotation matrices used in eigenvalue computation as shownin the following expression
119881 =
ℎ
prod
119896=1
119881(119896)
where 119881(1) = 119868 (6)
To take full advantage of this Brent andLuk [19] proposeda systolic architecture in which the initial matrix 119860
119899times119899is
mapped onto a set of 2times2matrices such as the one shown inexpression (7) This leaves a subset of 2 times 2 diagonalizationproblems which can easily be solved in hardware makinguse of theCORDIC (COordinate RotationDIgital Computer)algorithm [27] as shown in [14]
119860119897119898= [1198862119897minus12119898minus1
1198862119897minus12119898
11988621198972119898minus1
11988621198972119898
] (7)
Since 1198992 diagonalization problems now have to besolved 1198992 rotation angles (120572
119897) are also required Given that
119860 is symmetric the angle is selected from the submatriceswhere 119897 = 119898 according to expression (8) so 119899 elements(where 119899 is the order of119860) can be eliminated in each iterationAs will now be shown angle calculation can also be carriedout by the CORDIC algorithm making an efficient use ofresources in the implementation
tan (2120572) =1198862119897minus12119898
11988621198972119898
minus 1198862119897minus12119898minus1
(8)
After each update of 119860(119896+1) it is necessary to bring newnonzero elements near the main diagonal so that they can beeliminated in the next iteration repeating the entire processuntil 119860(119896+1) = 119863 To this end several rows and columns areinterchanged as it will be explained in Section 3
As mentioned before the CORDIC algorithm [13 27]can be used to solve the double rotation (4) on everysubmatrix119860
119897119898and to calculate the rotation angles One great
advantage of CORDIC is that it can be expressed by meansof shift and add operations which are easy to implement inreconfigurable hardware It also facilitates working with fixedpoint codification where the word length is not fixed to 32 or64 bits as occurswith the floating point standard thusmakingmore efficient use of FPGA resources For example in thepresent study the entire system was fixed to a word lengthof 18 bits since this is the entry word length of Xilinx FPGAmultipliers
The fundamental idea behind CORDIC is to carry out aseries of rotations (called microrotations) on an input vectorto obtain another vector as outputThis is performed in threedifferent coordinate systems (linear circular and hyperbolic)and it is possible to obtain various basic functions suchas multiplication trigonometric operations logarithms andexponentials [28] in the components of the output vectorgiven some initial input vector conditions [27]Microrotationvalues are selected as shown in the following expression
tan (120572119894) = 120575119894 where 120575
119894= 2minus119894
(9)
The choice of 120575119894as an inverse power of 2 is what will allow us
to implement the algorithm using shifts since multiplicationor division for a power of 2 is equivalent to a shift to the leftor the right respectively in binary codification
In the circular coordinate system it is possible to obtainthe two operations required for the Jacobi algorithm Givensrotation and angle calculation (arctangent) Since CORDICperforms a series of rotations on a vector it can be seen as aniterative expression (11) Consider
V119894+1= [
1 minus1205901198942minus119894
1205901198942minus119894
1] V119894 V = [119909
119910] (10)
119911119894+1= 119911119894minus 120590119894120572119894 (11)
where V is a vector determined by its vertical and horizontalcomponents (119909 and 119910) and 119911 is a variable that keeps trackof the total rotated angle In each iteration the direction ofthe rotation 120590
119894is determined by the operation that is to be
performed In the particular case of the circular coordinatesystem the two algorithm operation modes are as follows
(i) Rotation mode the algorithm takes a vector V0=
(1199090 1199100) and an angle 119911
0and outputs a new vector
V = (119909 119910) which corresponds to V0rotated 119911 radians
To do so 120590119894is selected in such a way that 119911 approaches
zero in each iteration When 119911 cong 0 it can be assumedthat the starting vector has been fully rotated a 119911 angle
(ii) Translation mode algorithm input is again a vectorV0= (1199090 1199100) whose phase angV
0we want to compute
4 Mathematical Problems in Engineering
To do so the vector is rotated until its 119910 componentis zero selecting 120590
119894accordingly Since 119911 accumulates
microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0
If this is expressed mathematically in rotation mode 120590119894=
sign(119911119894) the relation between the input and the output is as
follows
[119909
119910] = 119870
1[cos (119911
0) sin (119911
0)
minus sin (1199110) cos (119911
0)] [1199090
1199100
] (12)
On the other hand if the algorithm works in vectoringmode 120590
119894must be computed as 120590
119894= minus sign(119910
119894) and the relation
between the input and the output is now as shown in thefollowing expression
119909 = 1198701radic1199092
0+ 1199102
0
119911 = 1199110+ atan
1199100
1199090
(13)
As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870
1 This is due to the
nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870
1 In addition 119870
1(14) is known in
advance and only varies with the number of microrotationsperformed Since 119870
1is fixed 1119870
1can be precomputed and
stored in a memory element to correct the output data
1198701=
119899minus1
prod
119894=0
radic1 + 1205902
119894sdot 2minus2119894 (14)
22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS
In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression
119879 =
[[[
[
119886111988720 0
1198872119886211988730
0 119887311988631198874
0 0 11988741198864
]]]
]
(15)
To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887
2 119887
119899entries are zeroed When for example
1198872= 0 the row and the column involved can be removed
where 1198861is an eigenvalue of 119879
To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows
119879(119896)
= 119876(119896)
119877(119896)
(16)
where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as
119879(119896+1)
= 119877(119896)
119876(119896)
(17)
As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that
119879(119896)
minus 120590119868 = 119876(119896)
119877(119896)
119879(119896+1)
= 119877(119896)
119876(119896)
+ 120590119868
(18)
After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886
119899 This is called the Wilkinson shift
[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887
119894element
119864 = [119886119899minus1
119887119899
119887119899119886119899
] (19)
3 Implementation Details
As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications
Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms
This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following
(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible
(ii) Unroll which will implement each iteration as aseparate hardware instance
(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories
As mentioned before we can also apply resource con-straints to the designs limiting the number of instances
Mathematical Problems in Engineering 5
of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later
Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier
When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized
Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)
119895) As previouslymentioned
this was achieved using the CORDIC algorithm in vectoringmode
Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation
Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices
This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible
Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices
Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system
All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total
number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values
Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum
clock frequency) are the parameters that we attempt tooptimize
(i) the latency (119879119871) which specifies how much time will
take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the
throughput or the number of clock cycles betweentwo valid sets of results after the latency period
As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important
(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations
(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible
(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture
Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture
It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm
The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both
6 Mathematical Problems in Engineering
S A
V Intimesnk = 1
120572(k)j = 05tans(k)2jminus12j
s(k)2j2j minus s(k)2jminus12jminus1
Sk+1 Sk+1aux Vk+1 Vk+1aux
(k+1) = V(k)ij middot R(120572j)
i = 1 n2 j = 1 n2 j = i n2i = 1 n2
(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)
k = h
(a)
(b)
(c2)
(d)
(c1)
Result rearrange
No
[ ]
Yes
A Rntimesnisin
Vaux 119894119895 Saux 119894119895
larrminus
larrminus
larrminus larrminus
rarr
rarrVlarrminus
1205821 120582n diag(S)
lceilrceil lceilrceil
larrminus
1205821 120582n
j = 1 n2
Figure 1 Jacobi algorithm functional diagram
semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained
First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2
An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded
This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously
indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation
From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules
The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879
119871gt II This means that the first valid result will be
obtained after 119879119871clock cycles but the next results will be
obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown
in Figure 3 This time only the latency is shown since there
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 3
The idea behind the Jacobi method is to perform aseries of similarity transformations (ie before and aftermultiplication with an orthogonal matrix) of 119860 to renderit more diagonal in each iteration This can be seen as aniterative expression
119860(119896+1)
= 119881(119896)
119860(119896)
119881(119896)119879
(4)
where 119896 represents the current iteration and 119896+1 the next oneThe sequence of transformation matrices is chosen in such away that the doublemultiplication renders the updated119860(119896+1)
a little more diagonal than its predecessor 119860(119896) A popularchoice for 119881(119896) is the Givens rotation matrix (5) Consider
119866 (119894 119895 120572) =
[[[[[[[[[[[[
[
1 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 0
d
0 sdot sdot sdot 119862120572sdot sdot sdot 119878120572sdot sdot sdot 0
d
0 sdot sdot sdot minus119878120572sdot sdot sdot 119862
120572sdot sdot sdot 0
d
0 sdot sdot sdot 0 sdot sdot sdot 0 sdot sdot sdot 1
]]]]]]]]]]]]
]
(5)
where 119862120572and 119878
120572are the sine and cosine of 120572 respectively
Parallelismof the algorithm resides in this choice If the119881119860119881119879product is studied it can easily be seen that only four elementsof 119860(119896) are modified so multiple similarity transformationscan be carried at the same time
Eigenvectors of 119860 are obtained by accumulating therotation matrices used in eigenvalue computation as shownin the following expression
119881 =
ℎ
prod
119896=1
119881(119896)
where 119881(1) = 119868 (6)
To take full advantage of this Brent andLuk [19] proposeda systolic architecture in which the initial matrix 119860
119899times119899is
mapped onto a set of 2times2matrices such as the one shown inexpression (7) This leaves a subset of 2 times 2 diagonalizationproblems which can easily be solved in hardware makinguse of theCORDIC (COordinate RotationDIgital Computer)algorithm [27] as shown in [14]
119860119897119898= [1198862119897minus12119898minus1
1198862119897minus12119898
11988621198972119898minus1
11988621198972119898
] (7)
Since 1198992 diagonalization problems now have to besolved 1198992 rotation angles (120572
119897) are also required Given that
119860 is symmetric the angle is selected from the submatriceswhere 119897 = 119898 according to expression (8) so 119899 elements(where 119899 is the order of119860) can be eliminated in each iterationAs will now be shown angle calculation can also be carriedout by the CORDIC algorithm making an efficient use ofresources in the implementation
tan (2120572) =1198862119897minus12119898
11988621198972119898
minus 1198862119897minus12119898minus1
(8)
After each update of 119860(119896+1) it is necessary to bring newnonzero elements near the main diagonal so that they can beeliminated in the next iteration repeating the entire processuntil 119860(119896+1) = 119863 To this end several rows and columns areinterchanged as it will be explained in Section 3
As mentioned before the CORDIC algorithm [13 27]can be used to solve the double rotation (4) on everysubmatrix119860
119897119898and to calculate the rotation angles One great
advantage of CORDIC is that it can be expressed by meansof shift and add operations which are easy to implement inreconfigurable hardware It also facilitates working with fixedpoint codification where the word length is not fixed to 32 or64 bits as occurswith the floating point standard thusmakingmore efficient use of FPGA resources For example in thepresent study the entire system was fixed to a word lengthof 18 bits since this is the entry word length of Xilinx FPGAmultipliers
The fundamental idea behind CORDIC is to carry out aseries of rotations (called microrotations) on an input vectorto obtain another vector as outputThis is performed in threedifferent coordinate systems (linear circular and hyperbolic)and it is possible to obtain various basic functions suchas multiplication trigonometric operations logarithms andexponentials [28] in the components of the output vectorgiven some initial input vector conditions [27]Microrotationvalues are selected as shown in the following expression
tan (120572119894) = 120575119894 where 120575
119894= 2minus119894
(9)
The choice of 120575119894as an inverse power of 2 is what will allow us
to implement the algorithm using shifts since multiplicationor division for a power of 2 is equivalent to a shift to the leftor the right respectively in binary codification
In the circular coordinate system it is possible to obtainthe two operations required for the Jacobi algorithm Givensrotation and angle calculation (arctangent) Since CORDICperforms a series of rotations on a vector it can be seen as aniterative expression (11) Consider
V119894+1= [
1 minus1205901198942minus119894
1205901198942minus119894
1] V119894 V = [119909
119910] (10)
119911119894+1= 119911119894minus 120590119894120572119894 (11)
where V is a vector determined by its vertical and horizontalcomponents (119909 and 119910) and 119911 is a variable that keeps trackof the total rotated angle In each iteration the direction ofthe rotation 120590
119894is determined by the operation that is to be
performed In the particular case of the circular coordinatesystem the two algorithm operation modes are as follows
(i) Rotation mode the algorithm takes a vector V0=
(1199090 1199100) and an angle 119911
0and outputs a new vector
V = (119909 119910) which corresponds to V0rotated 119911 radians
To do so 120590119894is selected in such a way that 119911 approaches
zero in each iteration When 119911 cong 0 it can be assumedthat the starting vector has been fully rotated a 119911 angle
(ii) Translation mode algorithm input is again a vectorV0= (1199090 1199100) whose phase angV
0we want to compute
4 Mathematical Problems in Engineering
To do so the vector is rotated until its 119910 componentis zero selecting 120590
119894accordingly Since 119911 accumulates
microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0
If this is expressed mathematically in rotation mode 120590119894=
sign(119911119894) the relation between the input and the output is as
follows
[119909
119910] = 119870
1[cos (119911
0) sin (119911
0)
minus sin (1199110) cos (119911
0)] [1199090
1199100
] (12)
On the other hand if the algorithm works in vectoringmode 120590
119894must be computed as 120590
119894= minus sign(119910
119894) and the relation
between the input and the output is now as shown in thefollowing expression
119909 = 1198701radic1199092
0+ 1199102
0
119911 = 1199110+ atan
1199100
1199090
(13)
As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870
1 This is due to the
nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870
1 In addition 119870
1(14) is known in
advance and only varies with the number of microrotationsperformed Since 119870
1is fixed 1119870
1can be precomputed and
stored in a memory element to correct the output data
1198701=
119899minus1
prod
119894=0
radic1 + 1205902
119894sdot 2minus2119894 (14)
22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS
In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression
119879 =
[[[
[
119886111988720 0
1198872119886211988730
0 119887311988631198874
0 0 11988741198864
]]]
]
(15)
To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887
2 119887
119899entries are zeroed When for example
1198872= 0 the row and the column involved can be removed
where 1198861is an eigenvalue of 119879
To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows
119879(119896)
= 119876(119896)
119877(119896)
(16)
where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as
119879(119896+1)
= 119877(119896)
119876(119896)
(17)
As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that
119879(119896)
minus 120590119868 = 119876(119896)
119877(119896)
119879(119896+1)
= 119877(119896)
119876(119896)
+ 120590119868
(18)
After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886
119899 This is called the Wilkinson shift
[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887
119894element
119864 = [119886119899minus1
119887119899
119887119899119886119899
] (19)
3 Implementation Details
As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications
Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms
This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following
(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible
(ii) Unroll which will implement each iteration as aseparate hardware instance
(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories
As mentioned before we can also apply resource con-straints to the designs limiting the number of instances
Mathematical Problems in Engineering 5
of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later
Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier
When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized
Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)
119895) As previouslymentioned
this was achieved using the CORDIC algorithm in vectoringmode
Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation
Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices
This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible
Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices
Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system
All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total
number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values
Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum
clock frequency) are the parameters that we attempt tooptimize
(i) the latency (119879119871) which specifies how much time will
take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the
throughput or the number of clock cycles betweentwo valid sets of results after the latency period
As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important
(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations
(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible
(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture
Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture
It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm
The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both
6 Mathematical Problems in Engineering
S A
V Intimesnk = 1
120572(k)j = 05tans(k)2jminus12j
s(k)2j2j minus s(k)2jminus12jminus1
Sk+1 Sk+1aux Vk+1 Vk+1aux
(k+1) = V(k)ij middot R(120572j)
i = 1 n2 j = 1 n2 j = i n2i = 1 n2
(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)
k = h
(a)
(b)
(c2)
(d)
(c1)
Result rearrange
No
[ ]
Yes
A Rntimesnisin
Vaux 119894119895 Saux 119894119895
larrminus
larrminus
larrminus larrminus
rarr
rarrVlarrminus
1205821 120582n diag(S)
lceilrceil lceilrceil
larrminus
1205821 120582n
j = 1 n2
Figure 1 Jacobi algorithm functional diagram
semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained
First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2
An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded
This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously
indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation
From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules
The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879
119871gt II This means that the first valid result will be
obtained after 119879119871clock cycles but the next results will be
obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown
in Figure 3 This time only the latency is shown since there
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
4 Mathematical Problems in Engineering
To do so the vector is rotated until its 119910 componentis zero selecting 120590
119894accordingly Since 119911 accumulates
microrotation angles when 119910 cong 0 the variable 119911will correspond to the initial phase of the vector Inaddition since 119910 = 0 119909 will store the absolute valueof V0
If this is expressed mathematically in rotation mode 120590119894=
sign(119911119894) the relation between the input and the output is as
follows
[119909
119910] = 119870
1[cos (119911
0) sin (119911
0)
minus sin (1199110) cos (119911
0)] [1199090
1199100
] (12)
On the other hand if the algorithm works in vectoringmode 120590
119894must be computed as 120590
119894= minus sign(119910
119894) and the relation
between the input and the output is now as shown in thefollowing expression
119909 = 1198701radic1199092
0+ 1199102
0
119911 = 1199110+ atan
1199100
1199090
(13)
As can be observed in expressions (12) and (13) theresult is scaled by a constant value of 119870
1 This is due to the
nonorthogonality of the microrotations which modifies theabsolute value of the vector To solve this results obtainedcan be scaled by 1119870
1 In addition 119870
1(14) is known in
advance and only varies with the number of microrotationsperformed Since 119870
1is fixed 1119870
1can be precomputed and
stored in a memory element to correct the output data
1198701=
119899minus1
prod
119894=0
radic1 + 1205902
119894sdot 2minus2119894 (14)
22 The QR Algorithm As indicated in Section 1 to showthe effectiveness of the Jacobi algorithm over other meth-ods when it is implemented in reconfigurable logic a QRalgorithm implementation was also carried out using VivadoHLS
In the QR algorithm another condition must be addedto our initial matrix which was real and symmetric now itmust also be a tridiagonal matrix like the one presented inthe following expression
119879 =
[[[
[
119886111988720 0
1198872119886211988730
0 119887311988631198874
0 0 11988741198864
]]]
]
(15)
To obtain the eigenvalues of119879 a reduction strategy is usedwhere the 119887
2 119887
119899entries are zeroed When for example
1198872= 0 the row and the column involved can be removed
where 1198861is an eigenvalue of 119879
To accomplish this in the QR algorithm a sequence ofmatrices similar to119879 (119879(1) 119879(119899)) is produced [29] To startwith the initial matrix 119879(119896) is factored as follows
119879(119896)
= 119876(119896)
119877(119896)
(16)
where 119876(119896) is an orthogonal matrix and 119877(119896) is an uppertriangular matrix In a similar way 119879(119896+1) is defined as
119879(119896+1)
= 119877(119896)
119876(119896)
(17)
As presented thus far the convergence of the method isslow To speed it up a shift can be performed on the originalmatrix for a 120590 quantity close to the eigenvalue we are lookingfor Thus we now have a new sequence of matrices such that
119879(119896)
minus 120590119868 = 119876(119896)
119877(119896)
119879(119896+1)
= 119877(119896)
119876(119896)
+ 120590119868
(18)
After each shift the eigenvalues of 119879 will be shifted by thesame 120590 amount and it is easy to recover the original valuesby accumulating the shifts after each iteration To obtainan approximated value of 120590 for each iteration a popularchoice is to calculate the eigenvalues of the matrix 119864 (19) andtake the one closest to 119886
119899 This is called the Wilkinson shift
[26] Finally 119876(119896) and 119877(119896) are chosen to be Givens rotationmatrices to eliminate the appropriate 119887
119894element
119864 = [119886119899minus1
119887119899
119887119899119886119899
] (19)
3 Implementation Details
As discussed in Section 1 the implementation tool used inthis study was the Xilinx Vivado HLS [18] which allowsthe use of a high level programming language (C C++ orSystem C) instead of a hardware description language such asVHDL or Verilog Although it presents some drawbacks thismeans of implementing the algorithms enabled us to test verydifferent hardware configurations of the systemwith very fewdesign modifications
Due to its iterative nature the Jacobi method can easilybe expressed as a series of loops which is the main elementused in Vivado HLS to implement the algorithms
This can be done via pragma directives or tcl scriptinggiving each part of the design different attributes that deter-mine the degree of concurrency or the desired resource con-straints Some important operations that can be performedon design loops include the following
(i) Pipelinewhichwill try to achieve parallelism betweenthe operations performed in the same iterationPipeline operation can also be specified throughoutthe entire design so that all internal loops are unrolledmaking parallelism between two system iterationspossible
(ii) Unroll which will implement each iteration as aseparate hardware instance
(iii) Dataflow which will try to achieve parallelismbetween different loops that are data dependentimplementing communication channels such as ping-pong memories
As mentioned before we can also apply resource con-straints to the designs limiting the number of instances
Mathematical Problems in Engineering 5
of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later
Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier
When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized
Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)
119895) As previouslymentioned
this was achieved using the CORDIC algorithm in vectoringmode
Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation
Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices
This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible
Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices
Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system
All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total
number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values
Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum
clock frequency) are the parameters that we attempt tooptimize
(i) the latency (119879119871) which specifies how much time will
take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the
throughput or the number of clock cycles betweentwo valid sets of results after the latency period
As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important
(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations
(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible
(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture
Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture
It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm
The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both
6 Mathematical Problems in Engineering
S A
V Intimesnk = 1
120572(k)j = 05tans(k)2jminus12j
s(k)2j2j minus s(k)2jminus12jminus1
Sk+1 Sk+1aux Vk+1 Vk+1aux
(k+1) = V(k)ij middot R(120572j)
i = 1 n2 j = 1 n2 j = i n2i = 1 n2
(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)
k = h
(a)
(b)
(c2)
(d)
(c1)
Result rearrange
No
[ ]
Yes
A Rntimesnisin
Vaux 119894119895 Saux 119894119895
larrminus
larrminus
larrminus larrminus
rarr
rarrVlarrminus
1205821 120582n diag(S)
lceilrceil lceilrceil
larrminus
1205821 120582n
j = 1 n2
Figure 1 Jacobi algorithm functional diagram
semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained
First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2
An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded
This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously
indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation
From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules
The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879
119871gt II This means that the first valid result will be
obtained after 119879119871clock cycles but the next results will be
obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown
in Figure 3 This time only the latency is shown since there
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 5
of a particular hardware element The elements that canbe limited range from operators (+lowast) to specific resourcessuch as DSP48E and BRAM memory blocks User definedelements such as a function called multiple times canalso be constrained and this is very important for designoptimization as will be shown later
Figure 1 shows a functional diagram of the Jacobi algo-rithm Since the C programming language was used eachtask was implemented as a for loop and labeled to makeoptimization easier
When the design starts its operation (a) the input matrix119860 isin R119899times119899 is buffered in a memory element 119878 In addition tocompute eigenvectors 119881(1) is loaded with an identity matrix(119868 isin R119899times119899) After the initial operations there are three tasks(a b and c) to be performed ℎ times in the external loopwhere ℎ corresponds to the total number of Jacobi iterations(elimination of the 119899 elements adjacent to the main diagonal)that the algorithmwill execute and is selected in suchway thateigenvector error is minimized
Thefirst task thatmust be done in each Jacobi iteration (b)is to calculate rotation angles (120572(119896)
119895) As previouslymentioned
this was achieved using the CORDIC algorithm in vectoringmode
Thenext tasks to be performed (c1 and c2) are eigenvalueand eigenvector rotations which consist of pre- and post-multiplication with the Givens rotation matrix Since thereare no data dependencies between these two tasks they areplaced in different loops and thus parallelization is possiblePre- or postmultiplication with a Givens rotation matrix canbe expressed as two vector rotations so it can be performedwith two rotation mode CORDIC algorithm executionsThisleaves us with six CORDIC executions four of them whichcorrespond to eigenvalue calculation and the other two toeigenvector computation
Thefinal operation to be performed before the next Jacobiiteration (d) is a series of row and column permutations withthe purpose of introducing new nonzero elements into thediagonal submatrices
This element rearrangement was implemented asdescribed by Brent and Luk [19] but with somemodificationsIn their study they proposed a systolic architecturewhere each submatrix corresponded to an independenthardware unit called a processor and all processors wereinterconnected so that data interchanges were possible
Since our design did not have a defined hardware archi-tecture at this point element interchangewas substitutedwiththe corresponding row and column permutations in order tofulfill the requirement of introducing new nonzero elementsinto the diagonal submatrices
Finally when all the Jacobi rotations are completedeigenvalues and eigenvectors are output as arrays which canbe implemented as BRAM ports or AXI buses so that thedesign can be integrated in different kinds of system
All tasks were implemented in the C programminglanguage using a fixed point data type following the XQNquantification scheme where a number is represented by 119883bits of integer part and 119873 bits of fractional part with a total
number of bits (word length) of WL = 119883 + 119873 + 1 for signedvalues and WL = 119883 + 119873 for unsigned values
Starting from the initial C prototype different hardwarearchitectures can be considered depending on the designrequirements Mainly resource usage (LUT Flip-Flops mul-tipliers memory elements) and timing performance (latency(119879119871) throughput or initiation interval (II) and maximum
clock frequency) are the parameters that we attempt tooptimize
(i) the latency (119879119871) which specifies how much time will
take until the system generate the first valid result(ii) the initiation interval (II) which corresponds to the
throughput or the number of clock cycles betweentwo valid sets of results after the latency period
As a result of the Vivado HLS optimization directivesthree architectures can be achieved depending on whichparameters are considered more important
(i) Serial architecture this hardware scheme pursuesminimal resource usage at the expense of a higherlatency To reach this optimization level a pipelinedirective is specified on the external loop of thedesign which performs the Jacobi iterations
(ii) Parallel architecture in this kind of system the maingoal is to achieve the best performance in termsof execution time (latency and throughput) at theexpense of resource usage To achieve this pipelineis applied to the entire design which means thatthe external loop will be unrolled and parallelismbetween two algorithm executions will be possible
(iii) Semiparallel architecture although timing is stillimportant in this scheme resource usage is alsoconsidered and attempts are made to minimize itwhile affecting execution time as little as possible Inour particular case this was achieved by limiting thenumber of CORDIC instances of the design in theparallel architecture
Although all three alternatives were analyzed the fullparallel architecture was found to use a massive amount ofinternal FPGA resources greatly exceeding the capacity ofthe device This was mainly due to an excessive amountof CORDIC units in the rotation mode which led to 55hardware instances of it in the case of an 8 times 8 inputmatrix Consequently we ruled out the possibility of parallelarchitecture
It is important to clarify the difference between semi-parallel and serial architecture In both cases we alwaysattempted to maximize concurrency between operationssuch as arithmetic operations andmemory accesses howeverwith parallel architecture it was also possible to achieveconcurrence between two full executions of the algorithm
The excessive amount of CORDIC units implemented bythe Vivado HLS synthesizer was taken as a starting pointfor optimization of the design Since Vivado HLS allowedus to limit the number of elements of a particular hardwareresource or entity the number of CORDIC instances in both
6 Mathematical Problems in Engineering
S A
V Intimesnk = 1
120572(k)j = 05tans(k)2jminus12j
s(k)2j2j minus s(k)2jminus12jminus1
Sk+1 Sk+1aux Vk+1 Vk+1aux
(k+1) = V(k)ij middot R(120572j)
i = 1 n2 j = 1 n2 j = i n2i = 1 n2
(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)
k = h
(a)
(b)
(c2)
(d)
(c1)
Result rearrange
No
[ ]
Yes
A Rntimesnisin
Vaux 119894119895 Saux 119894119895
larrminus
larrminus
larrminus larrminus
rarr
rarrVlarrminus
1205821 120582n diag(S)
lceilrceil lceilrceil
larrminus
1205821 120582n
j = 1 n2
Figure 1 Jacobi algorithm functional diagram
semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained
First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2
An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded
This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously
indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation
From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules
The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879
119871gt II This means that the first valid result will be
obtained after 119879119871clock cycles but the next results will be
obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown
in Figure 3 This time only the latency is shown since there
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
6 Mathematical Problems in Engineering
S A
V Intimesnk = 1
120572(k)j = 05tans(k)2jminus12j
s(k)2j2j minus s(k)2jminus12jminus1
Sk+1 Sk+1aux Vk+1 Vk+1aux
(k+1) = V(k)ij middot R(120572j)
i = 1 n2 j = 1 n2 j = i n2i = 1 n2
(k+1) = R(120572i)Tmiddot S(k)ij middot R(120572j)
k = h
(a)
(b)
(c2)
(d)
(c1)
Result rearrange
No
[ ]
Yes
A Rntimesnisin
Vaux 119894119895 Saux 119894119895
larrminus
larrminus
larrminus larrminus
rarr
rarrVlarrminus
1205821 120582n diag(S)
lceilrceil lceilrceil
larrminus
1205821 120582n
j = 1 n2
Figure 1 Jacobi algorithm functional diagram
semiparallel and serial architecture was limited to differentvalues allowing Vivado HLS to redesign the RTL implemen-tation and subsequently analyzing the design performanceobtained
First we analyze the case of the semiparallel architectureSince 8 times 8 is a common size in the literature for input matri-ces the design was optimized for this matrix size extendingthe conclusions reached to other matrix sizes The resultsobtained (resource consumption and timing performance)are shown in Figure 2
An analysis of timing performance indicated thatalthough latency did not vary very much the initiationinterval (II) was considerably reduced when the number ofrotation mode CORDIC instances was increased from oneto four slowly decreasing as more CORDIC modules wereadded
This result was mainly due to data dependencies betweenoperations performed in the Jacobi algorithm As previously
indicated four rotations must be performed for eigenvaluecalculation and the result of the first two is required toperform the last two whereas two independent rotationsmust be performed for eigenvector calculation
From this it can be concluded that four CORDICmodules will greatly increase the performance of the systemWhen timing performance is related to resource usage it isclear that it increased linearly with CORDIC module usageand the slope variation was due to the different control logicimplemented by Vivado with an odd and an even number ofCORDIC modules
The last conclusion that can be reached from Figure 2 isthat there was parallelism between two algorithm iterationssince 119879
119871gt II This means that the first valid result will be
obtained after 119879119871clock cycles but the next results will be
obtained more rapidly after every II clock cyclesThe results obtained for the serial architecture are shown
in Figure 3 This time only the latency is shown since there
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 7
1 2 3 4 5 6 7 8 90
20406080
100
Disc
rete
elem
ents
()
Resource usage
LUTFF
1 2 3 4 5 6 7 8 90
500
1000
1500
2000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time (clock cycles)
IITL
nmiddotT
CLK
Figure 2 Semiparallel architecture implementation results of theJacobi algorithm for 8 times 8matrices
was no parallelism between two algorithm executions andthey only differed in one clock cycle that the system neededto start the next execution (II = 119879
119871+ 1)
As indicated earlier the execution time between two validresult sets is the same as the latency period (119879
119871) and in
the semiparallel design alternative it evolved similarly to theinitiation interval
For similar reasons there was a greater improvement inlatency in the first fourCORDICmodule additions which didnot decrease at all after sixThe increase in resource usage wasalso linear now with a constant slope although it used fewerresources
4 Assessment of Design Parameters
41 Optimal Number of Jacobi Iterations To evaluate thequality of the results obtained using the Jacobi method weconducted an experimental error analysis to determine howmany Jacobi iterations were required to obtain satisfactoryresults
The word length (WL) selected for the fixed point cod-ification scheme presented in Section 2 was 18 bits with afractional part of 15 bits (2Q15) since Xilinx FPGAmultiplierentry is fixed to this value and the purpose of using theCORDIC algorithm was to minimize resource usage Inaddition as will now be shown this codification providedsufficient precision to avoid overflow and growing roundingerrors
Simulating the system for multiple 8 times 8 real and sym-metric random matrices we calculated the eigenvalue and
1 2 3 4 5 6 7 8 90
1020304050
Disc
rete
elem
ents
Resource usage
1 2 3 4 5 6 7 8 91500
2000
2500
3000
Rotation CORDIC modules
Rotation CORDIC modules
Execution time
LUTFF
TL
nmiddotT
CLK
Figure 3 Serial architecture implementation results of the Jacobialgorithm for 8 times 8matrices
21 215 22 225 23 235 24 245 25 255 260
1020304050
Max
imum
erro
r (
)
Iterations
21 215 22 225 23 235 24 245 25 255 260
02040608
1
Qua
drat
ic m
ean
erro
r (
)
Iterations
Eigenvector component error (rarr
rarr
)1205821 120582m
Figure 4 Eigenvectors maximum and quadratic component error
eigenvector maximum (119890MAX) and quadratic (119890SQRT) errorcomparing FPGA Jacobi fixed point results with MATLABfloating point implementation (eig)
When the results were analyzed we found that the eigen-vector error stabilized later than the eigenvalue error (moreiterations were required) Consequently Figure 4 shows
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
8 Mathematical Problems in Engineering
the eigenvector component error calculated according to thefollowing expressions
119890 =
1003816100381610038161003816119881HLS1003816100381610038161003816 minus1003816100381610038161003816119881MATLAB
10038161003816100381610038161003816100381610038161003816119881MATLAB
1003816100381610038161003816
sdot 100
119890MAX = max (10038161003816100381610038161198901003816100381610038161003816)
119890SQRT =1
1198962
radic
119896minus1
sum
119894=0
1198902
(20)
where 119896 represents the total number of samples The errorobtained was plotted as a function of the number of Jacobiiterations (ie the number of rotations performed for eachsubmatrix) and was used to determine the optimal numberof iterations (ℎ) that the algorithm must perform to obtainsufficiently accurate results
Our analysis of 8 times 8 matrices indicated that ℎ = 26
was the best number of iterations for both eigenvectors andeigenvalues in order to obtain a sufficiently accurate solutionsince the error was constant from this value on
After performing the same tests for other input matrixsizes it was concluded that in general expression (21) canbe used to determine the optimal number of iterations to beperformed for an 119899 times 119899matrix
ℎ = 9 + 119899 log 119899 (21)
42 SerialParallel Jacobi Analysis Since an expression wasavailable for the optimal number of Jacobi iterations (ℎ) wealso conducted a study of the design performance dependingon initial matrix size
With the parallel architecture we found that althoughthe Vivado HLS generated an RTL design for matrices withan order 119899 gt 10 the Vivado suite synthesizer was unableto implement it successfully However the serial architecturedesign performed well for higher order matrices (20 times 20)
Based on the parametric study performed in Section 3we decided that a good trade-off between execution time andresource usagewould be to use fourCORDICmodules whichwould perform WL minus 1 iterations in both parallel and serialarchitectures where WL was the data word length
Implementation results for the serial architecture areshown in Figure 5 for the Xilinx Artix 7 series (XC7A100T-1CSG324C) device
Turning first to resource usage it can be seen that thisdid not not increase as fast as execution time since whenmatrix size is increased with a fixed amount of CORDICunits only storage elements and control logic will be addedConsequently it becomes evident that in terms of resourcesused the design has an 119874(1198992) complexity
Regarding execution time (119879exe) the Vivado HLS enabledus to examine the sequence of operations performed by thedesign in each clock cycle An analysis of this informationindicated that the design latency which corresponds toexecution time in the serial architecture follows the followingexpression
119879119871= 119879load + 119899
2
sdot ℎ (22)
8 9 10 11 12 13 14 15 1605
1
15
2Used resources
FF an
d LU
T (
)
8 9 10 11 12 13 14 15 160
50
100
150
FFLUT
Input matrix size (n)
Input matrix size (n)
Execution time with 100MHz
times104
Tex
e(120583
s)
fCLK =
Figure 5 Serial Jacobi design performance with matrix orderincrease
where 119899 is the size of the processed matrix and ℎ is thenumber of Jacobi iterations performed following expression(21) Meanwhile 119879load (23) corresponds to the time requiredto read sufficient data from an external memory to start thefirst Jacobi iteration and 1198992 is the amount of clock cyclesrequired to perform a Jacobi iteration
119879load =1198992
2+119899
2+ 4 (23)
A similar study was performed for different numbersof rotation CORDIC units which revealed that the time toperform a Jacobi iteration varied However it remained afunction of 1198992 so system latency (119879
119871) was still related to the
size of the inputmatrix as shown in the following expression
119879119871= 119879load + 119879119869 sdot ℎ (24)
where 119879119869= 119891(119899
2
) is the time required to perform a Jacobiiteration
Regarding the maximum size (119899) of the input matrix weassumed that given the serial nature of the system resourceusage would be the limit however we found that for 119899 gt 20the Vivado HLS took a very long time to guess the controllogic and implementation eventually failed Despite thisthe design has been coded in such a way that input matrixsize and the codification scheme can be modified simply bychanging a few constants which is a vast improvement overthe classic RTL designs
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 9
Design comparison
11000 23000 34000 45000LUT
8500
17000
26000
34000
FF
400080001200016000
4000
8000
12000
16000
Semiparallel JacobiSerial JacobiFloating point QRMUSIC (Xu et al 2008)VHDL (Bravo et al 2008)Vivado HLS library
TL(nTCLK)
II(nT
CLK
)
Figure 6 Comparative between different design alternatives
5 Comparison with Other Proposals
The results obtained were also compared with those reportedin other studies of 8 times 8 matrices since these have receivedmuch research attention
Figure 6 shows the implementation results for the twoarchitectures proposed here (serial and parallel) a Jacobiserial architecture carried out in VHDL [1] a Jacobi semi-parallel architecture from the multiple signal classification(MUSIC) algorithm [8] and the Vivado HLS linear algebralibrary svd implementation
The criteria selected to compare the different designswere as follows FPGA resource consumption in terms of thediscrete elements FF and LUT computation time (latency andinitiation interval) and maximum clock frequency (119891CLK)
The VHDL serial Jacobi algorithm features a fully cus-tomized system using two ad hoc CORDIC modules one ofwhich works in rotation and vectoring mode while the otheris optimized to work only in rotation mode
The Jacobi module used in theMUSIC algorithm featuresa semiparallel architecture based on the Brent and Lukproposal [12] and also uses the CORDIC algorithm for thecalculations However it does not implement the full systolicarray only making use of a subset of this to save resources
Turning first to the two Jacobi algorithm implementationsproposed here it is evident that the best performance interms of speed was achieved by the semiparallel architecturewhereas the serial architecture only used half of the resourcesrequired to implement the parallel one
Vivado HLS library
Clock frequency (MHz)
VHDL (Bravo et al
2008)QR
Parallel Jacobi
Serial Jacobi
0 50 100 150 200 250 300
Figure 7 Maximum clock frequency for all the designs when inputmatrix size is fixed to 8 times 8
The other Vivado HLS implementation considered wasthe svd function from the Vivado HLS library which canonly be implemented in single precision floating point for-mat Resource usage is not dissimilar to our Vivado HLSimplementations however it is slower and makes use of 58DSP48E blocks
The difference arises from the use of floating point coreswhich are implemented with DSP units while the fixedpoint CORDIC approach employed in the present study onlyrequired two DSP48E multipliers per unit
As previously mentioned a floating point QR algorithmwas implemented in the same FPGAdevice also usingVivadoHLS The main purpose of this was to determine whetherother algorithms presented greater computational efficiencythan the Jacobi method Although we used different codingsystems in the Jacobi and QR algorithms respectively ananalysis of both implementations revealed that QR was lesssuitable for a hardware implementation since the operationsperformed can vary between different iterations
Nevertheless we found that although the performanceof the algorithm in terms of execution time was poor it didnot use a large amount of FPGA resources and it thereforerepresents an alternative to consider when floating pointcodification is required
Looking next at the eigen solver implemented as part ofthe MUSIC algorithm it can be seen that it did not performvery well On paper it is similar to the semiparallel architec-ture presented here but the control logic was designed byhand This shows that HLS performs better than traditionalhardware design flows when control logic becomes toocomplicated since it is implemented automatically
Our final comparison was with the VHDL serial architec-ture in which only two CORDIC modules were used Sinceeverything in this design was customized for the applicationtiming performance was similar to HLS designs but it usedfar fewer resources
As regards the maximum clock frequency at which thedesigns can work Figure 7 shows a comparison of all thealternatives discussed
On the one hand we have VivadoHLS designs where thedifference between floating point and fixed point is evidencedby the resultsWhile floating point designs can easily go above200MHz fixed point designs stay around 120MHz
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
10 Mathematical Problems in Engineering
On the other hand we find theVHDLdesign In this casethe performance of the design proposed in [1] was similar tothe floating point designs since it was hand-coded at a verylow optimization level
6 Conclusions
In this paper we have described a high level synthesisimplementation of the Jacobi algorithm for eigenvalue andeigenvector calculations and this has been compared tosimilar or related systems reported in the literature
The main purpose of using a HLS tool was to try toachieve a similar performance to that of a customRTL systemcoded in for example VHDL in a fraction of the design timerequired
In recent years the amount of resources available inFPGA devices has grown exponentially and thus rapidity isbecoming more important than efficiency when implement-ing computational algorithms
Furthermore HLS tools have evolved making it possibleto achieve good systems that are easy to integrate into largerdesigns as hardware coprocessors together with a CPU or asa processing block in regular RTL designs
As final paper conclusion we can say that the use of highlevel implementation tools reduce the design time In mostcases the designs are initially coded simulated and validatedwith high level languages where there exist standard librariesand reference code available Manual RTL coding is notusually an easy step because there is not a straight way totranslate the functional algorithm to RTL description Inaddition the required level of RTL expertise is very high
Another inconvenience for RTL-based designs appearswhen new changes have to be included in the algorithmTheRTL structure is not very flexible and any modification to thealgorithm requires a lot of time (simulation and recoding)Nevertheless RTL designs generate hardware reconfigurableapproaches with less resources and higher clock frequencythan high level hardware designs (see Figures 6 and 7)
The developed algorithm in this work was initially madewith a traditional RTL flow [1] As a result we can comparethe results of a complex algorithm such as Jacobi based onRTL orHLS flows According to Figures 6 and 7 the achievedresults for the HLS methodology compared with RTL-baseddesign are favorably from different points of view
We have shown that the design process varies from thetraditional RTL flow since it is possible to iterate on the samecode to obtain different hardware architectures and performparametric simulations of the hardware in order to obtain thebest results
Therefore the designer should balance what are the priorrequirements in the system clock frequencyresources ordesign time and choose the methodology (RTLHLS) thatbest fit for that particular case
Conflict of Interests
The authors declare that there is no conflict of interestsregarding the publication of this paper
Acknowledgment
This work has been supported by the Spanish Ministry ofEconomy and Competitiveness through the ALCOR Project(DPI2013-47347-C2-1-R)
References
[1] I Bravo M Mazo J L Lazaro P Jimenez A Gardel and MMarron ldquoNovel HW architecture based on FPGAs oriented tosolve the eigen problemrdquo IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems vol 16 no 12 pp 1722ndash1725 2008
[2] H Gao and R Ma ldquoOn a system modelling a population withtwo age groupsrdquoAbstract andApplied Analysis vol 2014 ArticleID 920348 5 pages 2014
[3] X Tang W Deng and Z Qi ldquoInvestigation of the dynamicstability of microgridrdquo IEEE Transactions on Power Systems vol29 no 2 pp 698ndash706 2014
[4] S Wu J Zhang Y Hou and X Bai ldquoConvergence of gossipalgorithms for consensus in wireless sensor networks withintermittent links andmobile nodesrdquoMathematical Problems inEngineering vol 2014 Article ID 836584 18 pages 2014
[5] J A Jimenez M Mazo J Urena et al ldquoUsing PCA in time-of-flight vectors for reflector recognition and 3-D localizationrdquoIEEE Transactions on Robotics vol 21 no 5 pp 909ndash924 2005
[6] W Haiqing S Zhihuan and L Ping ldquoImproved PCA withoptimized sensor locations for process monitoring and faultdiagnosisrdquo inProceedings of the 39th IEEEConfernce onDecisionand Control vol 5 pp 4353ndash4358 December 2000
[7] A A S Ali A Amira F Bensaali andM Benammar ldquoPCA IP-core for gas applications on the heterogenous zynq platformrdquoin Proceedings of the 25th International Conference onMicroelec-tronics (ICM rsquo13) pp 1ndash4 December 2013
[8] D Xu Z Liu X Qi Y Xu and Y Zeng ldquoA FPGA-basedimplementation of MUSIC for centrosymmetric circular arrayrdquoin Proceedings of the 9th International Conference on SignalProcessing (ICSP rsquo08) pp 490ndash493 October 2008
[9] J HWilkinsonTheAlgebraic Eigenvalue Problem Monographson Numerical Analysis Clarendon Press New York NY USA1988
[10] S Sundar and B K Bhagavan ldquoGeneralized eigenvalue prob-lems Lanczos algorithm with a recursive partitioning methodrdquoComputers ampMathematics with Applications vol 39 no 7-8 pp211ndash224 2000
[11] D P OrsquoLeary and P Whitman ldquoParallel QR factorization byhouseholder and modified Gram-Schmidt algorithmsrdquo ParallelComputing vol 16 no 1 pp 99ndash112 1990
[12] R P Brent F T Luk and C van Loan ldquoComputation ofthe generalized singular value decomposition using mesh-connected processorsrdquo Journal of VLSI and Computer Systemsvol 1 no 3 pp 242ndash270 1983
[13] J E Volder ldquoThe CORDIC trigonometric computing tech-niquerdquo IRE Transactions on Electronic Computers vol EC-8 no3 pp 330ndash334 1959
[14] J R Cavallaro and F T Luk ldquoCORDIC arithmetic for an SVDprocessorrdquo Journal of Parallel and Distributed Computing vol 5no 3 pp 271ndash290 1988
[15] G Lucius F le Roy D Aulagnier and S Azou ldquoAn algorithmfor extremal eigenvectors computation of hermitian matricesand its FPGA implementationrdquo in Proceedings of the IEEE56th International Midwest Symposium on Circuits and Systems(MWSCAS rsquo13) pp 1407ndash1410 August 2013
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 11
[16] Impulse C httpwwwimpulseacceleratedcomtoolshtml[17] Catapult c httpcalyptocomenproductscatapultoverview[18] Vivadohlsxilinx web page httpwwwxilinxcomproducts
designtoolsvivadointegrationesl-designhtm[19] R P Brent and F T Luk ldquoThe solution of singular-value
and symmetric eigenvalue problems on multiprocessor arraysrdquoSociety for Industrial and Applied Mathematics vol 6 no 1 pp69ndash84 1985
[20] C-C Sun J Gotze andG E Jan ldquoParallel Jacobi EVDmethodson integrated circuitsrdquoVLSIDesign vol 2014 Article ID 5961039 pages 2014
[21] Y Liu C-S Bouganis and P Y K Cheung ldquoHardware architec-tures for eigenvalue computation of real symmetric matricesrdquoIET Computers and Digital Techniques vol 3 no 1 pp 72ndash842009
[22] W Ford Numerical Linear Algebra with Applications UsingMATLAB Academic Press New York NY USA 1st edition2014
[23] S W Kang and S N Atluri ldquoApplication of nondimensionaldynamic influence function method for eigenmode analysisof two-dimensional acoustic cavitiesrdquo Advances in MechanicalEngineering vol 2014 Article ID 363570 9 pages 2014
[24] A Bernal R Miro D Ginestar and G Verdu ldquoResolution ofthe generalized eigenvalue problem in the neutron diffusionequation discretized by the finite volumemethodrdquoAbstract andApplied Analysis vol 2014 Article ID 913043 15 pages 2014
[25] W Gafsi F Najar S Choura and S El-Borgi ldquoConfinement ofvibrations in variable-geometry nonlinear flexible beamrdquo Shockand Vibration vol 2014 Article ID 687340 7 pages 2014
[26] G H Golub and C F Van Loan Matrix Computations JohnsHopkins Studies in the Mathematical Sciences Johns HopkinsUniversity Press Baltimore Md USA 4th edition 2013
[27] J S Walther A Unified Algorithm for Elementary Functions1972
[28] B Lakshmi and A S Dhar ldquoCORDIC architectures a surveyrdquoVLSI Design vol 2010 Article ID 794891 19 pages 2010
[29] J Faires and R L Burden Numerical Methods Brooks ColePublishing Company 2012
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of