
ECE 734 VLSI Array Structures for Digital

Signal Processing

Project Report

Title:

A Recursive Method for the Solution of the Linear

Least Squares Formulation – algorithm and

performance using PLX instruction set

Authors:

Claus Benjaminsen

Shyam Bharat

Page 2: homepages.cae.wisc.eduhomepages.cae.wisc.edu/~ece734/project/s06/Shyam_Claus... · Web viewKalavai J. Raghunath, Keshab K. Parhi; Fixed and Floating Point Error Analysis of QRD-RLS

Table of Contents

Introduction
Algorithm
Theoretical Analysis
Numerical Analysis in Matlab
Implementation Using the PLX instruction set
Implementation of different dimensional multiplications using the PLX instruction set
Problems with the PLX instruction set
Conclusion
The PLX assembly code implementation
References

Introduction

The least squares method of parameter estimation seeks to minimize the squared

error between the observed data sequence and the assumed signal model, which is some

function of the parameter to be estimated. In particular, the linear least squares

formulation represents the case where the signal model is a linear function of the

parameter to be estimated. The salient feature of the least squares method is that no

probabilistic assumptions are made about the data – only a signal model is assumed. This

method is generally used in situations where a precise statistical characterization of the

data is unknown.

The linear least squares estimation (LLSE) method ultimately boils down to

solving a set of linear equations. As will be seen from the equations that follow, the

solution to the LLSE involves the computation of the inverse of a matrix. This matrix

happens to be the autocorrelation matrix of the assumed signal model matrix. Now, the

dimensions of this autocorrelation matrix are P x P, where P is the number of unknown parameters to be estimated. This computation of the matrix inverse is straightforward for small values of P, but for large values of P it becomes increasingly computationally

intensive. In digital signal processing (DSP) applications, the computation is generally

required to be done in real-time for a continuous succession of samples. In other words,

for DSP applications one matrix inversion is required for every new input vector. Since it

is very time consuming to calculate the inverse of a matrix, it is essential to find an

alternate way of solving the system of linear equations to yield the unknown parameters.

This problem is a fairly well-researched one and there exists an alternate method of

solving for the unknown parameters. This alternate method basically involves updating a

so-called weight vector in time, based on its value at the previous time instant. The

meaning of this statement will become clearer with the aid of equations presented below.


The function to be minimized with respect to the parameter vector at the present instant is given by:

J(w(n)) = Σ_(k=1..n) λ^(n−k) |e_n(k)|² ,  0 < λ < 1

where e_n(k) = d(k) − w^H(n)u(k).

In matrix form, this function can be written as:

J(w(n)) = [d − u^H w(n)]^H Λ [d − u^H w(n)]

where Λ = diag{λ^(n−1), λ^(n−2), … , λ, 1}.

The solution to this equation is given by:

w(n) = (u Λ u^H)^(−1) u Λ d*

or w(n) = Φ(n)^(−1) θ(n)

where:

Φ(n) = u Λ u^H = Σ_(k=1..n) λ^(n−k) u(k)u^H(k)
θ(n) = u Λ d* = Σ_(k=1..n) λ^(n−k) u(k)d*(k)

As can be observed from the above equations, the path to the solution involves the

computation of the inverse of a matrix, for every value of ‘n’. The method of recursive

least squares seeks to circumvent this computationally intensive step by instead

calculating w(n) from w(n-1), its value at the previous time instant.


Φ(n) and θ(n) can be rewritten in recursive form as follows:

Φ(n) = λΦ(n−1) + u(n)u^H(n)
θ(n) = λθ(n−1) + u(n)d*(n)

The matrix inversion lemma,

(A + BCD)^(−1) = A^(−1) − A^(−1)B(C^(−1) + DA^(−1)B)^(−1)DA^(−1),

is applied to Φ(n), with A = λΦ(n−1), B = u(n), C = 1 and D = u^H(n). Therefore,

Φ^(−1)(n) = λ^(−1)Φ^(−1)(n−1) − [λ^(−2)Φ^(−1)(n−1)u(n)u^H(n)Φ^(−1)(n−1)] / [1 + λ^(−1)u^H(n)Φ^(−1)(n−1)u(n)]

At this point, the following 2 new terms are defined:

P(n) = Φ^(−1)(n), where P(n) has dimensions M x M

k(n) = [λ^(−1)P(n−1)u(n)] / [1 + λ^(−1)u^H(n)P(n−1)u(n)], where k(n) is an M x 1 gain vector

It can be seen that k(n) = P(n)u(n).

If we proceed with the math, we ultimately arrive at the following recursion formula for w(n):

w(n) = w(n−1) + k(n)[d*(n) − u^H(n)w(n−1)]

The last term (in square brackets) is denoted α*(n), where α(n) = d(n) − w^H(n−1)u(n) is called the 'innovation' or 'a priori estimation error'. This differs from e(n), which is the 'a posteriori estimation error'. To update w(n−1) to w(n), we need k(n) and α(n).


Algorithm

Based on the above formulation, shown below is an algorithm for solving the least squares equations recursively:

Initialization: P(0) = δ^(−1)I, where δ is small and positive
                w(0) = 0

For n = 1, 2, 3, …
    x(n) = λ^(−1)P(n−1)u(n)
    k(n) = [1 + u^H(n)x(n)]^(−1) x(n)
    α(n) = d(n) − w^H(n−1)u(n)
    w(n) = w(n−1) + k(n)α*(n)
    P(n) = λ^(−1)P(n−1) − k(n)x^H(n)
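To make the recursion concrete, here is a minimal NumPy sketch of the initialization and one update step, specialized to the real-valued case with scalar d analyzed in the rest of this report. The function names and interface are ours, not part of the PLX implementation.

# Minimal NumPy sketch of the RLS recursion above (real-valued, scalar d).
import numpy as np

def rls_init(K, delta):
    """P(0) = delta^-1 * I with delta small and positive; w(0) = 0."""
    return np.eye(K) / delta, np.zeros(K)

def rls_update(P, w, u, d, lam):
    """One iteration of the recursion for time step n."""
    x = P @ u / lam                  # x(n) = lambda^-1 P(n-1) u(n)
    k = x / (1.0 + u @ x)            # k(n) = [1 + u^T(n) x(n)]^-1 x(n)
    alpha = d - w @ u                # alpha(n) = d(n) - w^T(n-1) u(n)
    w = w + k * alpha                # w(n) = w(n-1) + k(n) alpha(n)
    P = P / lam - np.outer(k, x)     # P(n) = lambda^-1 P(n-1) - k(n) x^T(n)
    return P, w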


Theoretical Analysis

The first view into the algorithm concerns the sizes of the different quantities. One finds that choosing the sizes of the input vector u and of the target values d determines the sizes of all the other quantities in the algorithm. Therefore, if u is (K x 1) and d is (N x 1), the quantities in the algorithm have the following dimensions:

Quantity:   u        d        P        x        k        w        α
Dimension:  (K x 1)  (N x 1)  (K x K)  (K x 1)  (K x 1)  (K x N)  (N x 1)

Considering the size of the project, we make the following simplifications to the general formulation of the algorithm. First, we assume only real quantities, which implies that both the input values u and the output values d are real. Second, we limit d to be a scalar, that is N = 1, and the dimensions of the quantities are then:

Quantity:   u        d        P        x        k        w        α
Dimension:  (K x 1)  (1 x 1)  (K x K)  (K x 1)  (K x 1)  (K x 1)  (1 x 1)

Now the computations required by the algorithm, in terms of vector and matrix algebra, can be investigated.

The first line requires P multiplied by the constant λ^(−1) (K² multiplications) and a matrix-vector product (K² multiplications and K(K−1) additions). This could be done faster by carrying out the matrix-vector product first and then the multiplication by the constant; but since the same constant-matrix product λ^(−1)P(n−1) is also required to calculate P(n), it should be done first and stored, so it can be reused in the calculation of P(n).

The second line requires a vector inner product (K multiplications and K−1 additions), then a scalar addition of 1 (1 addition), a division (taking the inverse) of a scalar (1 division), and finally a scalar multiplied by a vector (K multiplications).


In the third line there is a vector inner product (K multiplications) and a scalar subtraction (1 subtraction).

The fourth line has a vector multiplied by a scalar (K multiplications) and a vector addition (K additions).

Finally, the fifth line has the same product λ^(−1)P(n−1) as the first line, so this doesn't need to be calculated again. It then has a vector outer product (K² multiplications) and finally a matrix addition (K² additions).

This gives a total number of:

3K² + 3K multiplications
K(K−1) + (K−1) + 1 + K + K² = 2K² + K additions
1 division
1 subtraction

per iteration.

If we don't consider the division, a general-purpose microprocessor with a separate multiplier will need 5K² + 4K + 1 operations to execute one iteration of the algorithm, not counting the operations needed to move, load and store data. With the PLX instruction set it is possible to carry out 2 multiplications and up to 4 additions in parallel, which has the potential of reducing the number of operations per iteration to 2K² + 1.75K + 1, as evaluated for our case below.
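As a concrete data point, the following snippet evaluates these operation counts for the K = 4 case used in our implementation (formulas taken from the analysis above; data movement is ignored, as noted).

# Operation counts per RLS iteration for K = 4 (division not counted).
K = 4
muls = 3 * K**2 + 3 * K                  # 60 multiplications
adds = 2 * K**2 + K                      # 36 additions
serial_ops = 5 * K**2 + 4 * K + 1        # 97 ops on a scalar processor
plx_ops = 2 * K**2 + 1.75 * K + 1        # 40 ops with PLX sub-word parallelism
print(muls, adds, serial_ops, plx_ops)   # 60 36 97 40.0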

The algorithm can be drawn as a data flow graph (DFG), see Figure 1.


[Figure 1: data-flow graph of the RLS algorithm. Nodes A–O perform the multiplications, additions, transposes and the scalar inversion (1/x) of the recursion, operating on u(n), x(n), k(n), α(n), d(n), w(n) and P(n), with Z^(−1) delay elements on P(n) and w(n).]

Figure 1 The DFG of the RLS algorithm

The DFG has 6 loops:

1. A – B – A; t1 = tmsm + tmma
2. A – C – D – E – B – A; t2 = tmsm + tmvm + tvt + tvom + tmma
3. A – C – J – E – B – A; t3 = tmsm + tmvm + tvsm + tvom + tmma
4. A – C – G – H – I – J – E – B – A; t4 = tmsm + tmvm + tvim + tssa + tssd + tvsm + tvom + tmma
5. K – L – M – O – N – K; t5 = tvim + tssa + tvsm + tvva + tvt
6. O – O; t6 = tvva

where the time required for each operation is defined as:

tmsm – Matrix – scalar multiplication

tmma – Matrix addition

tmvm – Matrix – Vector multiplication

tvt – Vector transpose

tvom – Vector outer product


tvsm – Vector – scalar multiplication

tvim – Vector inner product

tssa – Scalar addition

tssd – Scalar division

tvva – Vector – vector addition

All loops contain exactly one delay element, and therefore the longest loop computation time will be the iteration bound. First it is noted that tvt = 0, as there is in general no need to perform a vector transpose in hardware, where column vectors and row vectors are not distinguished, only the computations they are involved in. Secondly, many of the nodes are shared between loops, and comparing the loops shows that only loop 4 and loop 5 can have the longest computation time. Cancelling the computations common to these two loops, loop 4 is the longer one if

tmsm + tmvm + tssd + tvom + tmma > tvva,

which is a reasonable assumption. Therefore loop 4 determines the iteration bound, which is

T = tmsm + tmvm + tvim + tssa + tssd + tvsm + tvom + tmma

The critical path is contained in loop 4 and is equal to the iteration bound T; therefore no speed improvement can be achieved by retiming. Also, since the iteration bound can be achieved directly by using enough hardware with the implementation in the DFG above, there is nothing to gain from unfolding either. Since our implementation focuses on the PLX instruction set, it is clear that to get anywhere close to the iteration bound, several PLX processors need to be interconnected, so the different loops can be executed in parallel. To look into that a little further, we investigate the inter- and intra-iteration dependence structure of the algorithm. This is done by drawing a dependence graph (DG) of the variables in the algorithm, as in Figure 2.


[Figure 2: dependence graph relating x(n−1), α(n−1), w(n−1), P(n−1) and k(n−1) of one iteration to x(n), α(n), w(n), P(n) and k(n) of the next.]

Figure 2 The dependence graph of the RLS algorithm

From the figure it is seen that there is a lot of inter-iteration dependence; that is, few of the calculations of a new iteration can be started before all the calculations of the old iteration are completed. If the algorithm is calculated in place, meaning new values of the different quantities are stored over old values, then once P(n−1) has been calculated, only the quantity x(n) of the next iteration can be calculated before the calculation of w(n−1) is completed. That k(n) can't be calculated earlier also follows from an output dependence on k: because the calculation of w(n−1) depends on k(n−1), that calculation needs to be completed before the new value k(n) can be stored on top of the old one. Once w(n−1) has been calculated, only α(n) of the next iteration can be calculated before P(n−1) is also available. If the algorithm is not calculated in place, that is, old values are not overwritten by new values but are instead stored into arrays, the output dependence on k is removed. It is then seen from the graph above that the left branch (x, k and P) can be executed independently of the right branch (α and w). The right branch has an input dependence on k and therefore depends on the left branch being calculated before it can be completed. It is therefore possible to make an implementation


with two PLX processors, where only the value of k is communicated between the two processors. Connecting this with the DFG, one processor will compute A, B, C, E, G, H, I and J, and the other processor will compute K, L, M and O, ignoring the transpose operations. This configuration will therefore reach the iteration bound, and no further speed-up can be achieved by using extra PLX processors.

Looking into the intra-iteration dependence, it is seen that k has an input dependence on x, P depends on k and x, and w depends on k and α. This gives the following possible execution orders within each iteration:

1. x(n) → k(n) → P(n) → α(n) → w(n)
2. x(n) → k(n) → α(n) → P(n) → w(n)
3. x(n) → k(n) → α(n) → w(n) → P(n)
4. x(n) → α(n) → k(n) → P(n) → w(n)
5. x(n) → α(n) → k(n) → w(n) → P(n)
6. α(n) → x(n) → k(n) → P(n) → w(n)
7. α(n) → x(n) → k(n) → w(n) → P(n)

There are even more possibilities if the execution of consecutive iterations is interleaved; one such possibility is

8. α(n−1) → w(n−1) → P(n−1) → x(n) → k(n)

This just requires that x(1) and k(1) are calculated as initial values before the execution of the loop is started.

This shows that there are several possible ways to order the calculations of the different variables in the implementation, and these options can be explored to find the ordering which gives the most efficient implementation, for instance by minimizing the number of move operations.


Numerical Analysis in Matlab

We used the concept of ‘system identification’ to determine whether the results of the

RLS algorithm are accurate. Essentially, this boils down to the choice of the input signal

(u) and the observation vector (d). Initially, 4 weights are defined. The observation vector is chosen to be the inner product of the 4 weights with the input signal, with random noise added to this product. With this a priori information, the RLS algorithm is coded in an

m-file and executed. Based on this simple experiment, it is seen from the figure below

that the weights tend to converge to their a priori defined values after just a few tens of

iterations. Of course, the convergence curve depends on the initial values ‘delta’ and

‘lambda’ (both of which are positive and less than 1). In general, higher values of

‘lambda’ tend to eliminate most of the ‘hunting’ of the weight estimates about their actual

values, while lower values of ‘lambda’ lead to a significant amount of fluctuation of the

weights about their true value. ‘delta’ affects the convergence pattern in a different

manner. Higher values of ‘delta’ lead to slower convergence, i.e. the algorithm takes a

larger number of iterations to attain the true value. Lower values of ‘delta’, on the other

hand, lead to faster convergence. Thus, higher values of ‘lambda’ and lower values of

‘delta’ are better suited to the RLS algorithm. The aim of this analysis was to verify the workability of the RLS algorithm using known inputs and outputs, so that the estimated weights can be compared with their known true values. This was essential in order to implement the algorithm using the PLX instruction set; a sketch of the experiment is given below.
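For illustration, the following NumPy sketch reproduces this system-identification experiment, reusing rls_init and rls_update from the sketch in the Algorithm section. The weight values match the figures; the input distribution and noise level are our own assumptions, not taken from the original m-file.

# System-identification experiment: recover four known weights with RLS.
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([3.2, 1.1, -2.1, 0.7])
P, w = rls_init(K=4, delta=0.004)   # delta^-1 = 250, as in the PLX code
lam = 1.0                           # lambda^-1 = 1.0, as in the PLX code

for n in range(250):
    u = rng.standard_normal(4)                       # input vector
    d = w_true @ u + 0.01 * rng.standard_normal()    # noisy observation
    P, w = rls_update(P, w, u, d, lam)

print(np.round(w, 2))   # close to w_true after a few tens of iterations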


[Figure 3: plot of the four weight values versus iteration number (0–250), 64-bit floating point; w1 = 3.2, w2 = 1.1, w3 = −2.1, w4 = 0.7.]

Figure 3 The convergence pattern of the weights used in the RLS algorithm

The graph in figure 3 shows the convergence of the weights when the RLS algorithm is executed in Matlab and, by default, double-precision floating-point values (64 bits) are used in the calculations. PLX doesn't support floating point, and to make use of the sub-word parallelism inherent in the PLX architecture, we need to reduce the number of bits so that more than one number can be stored in each 64-bit PLX register. To get the full benefit of the parallelism, 4 numbers should be stored in each register, allowing each number to be represented with 16 bits. In order to test the RLS algorithm with this reduced precision, we use the fixed point package in Matlab to simulate the execution of the algorithm using 16-bit fixed-point values. The option we now have to explore is how many bits to use for the integer part and how many for the fractional part. Since the values in the algorithm can be both positive and negative, we need to represent them in two's complement, and hence one bit is used as the sign bit. This means we can choose from 0 to 15 fractional bits, and we use the simulations to find the best choice; a sketch of the quantization involved is shown below.
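The following sketch shows the kind of quantization this simulates: a value becomes a 16-bit two's complement number with a chosen number of fractional bits, wrapping around on overflow (mirroring the wrap-around arithmetic of the PLX implementation). It is our own stand-in for the Matlab fixed point package, not the code we actually ran.

# Quantize a real value to 16-bit two's complement with `frac` fractional
# bits; overflow wraps around, as in the PLX implementation.
def quantize16(value, frac):
    scaled = int(round(value * (1 << frac)))
    wrapped = ((scaled + (1 << 15)) & 0xFFFF) - (1 << 15)  # wrap into int16 range
    return wrapped / (1 << frac)

print(quantize16(3.2, 10))   # ~3.2002 -- representable, 5 integer bits remain
print(quantize16(3.2, 14))   # ~-0.8   -- wraps: too few integer bits remain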


In figure 4 the convergence patterns of the weights for 6 and 8 fractional bits are shown.

The patterns for 10 and 11 fractional bits are shown in figure 5.

[Figure 4: two plots of weight value versus number of iterations (0–250); left panel "Fractional bits: 6", right panel "Fractional bits: 8"; w1 = 3.2, w2 = 1.1, w3 = −2.1, w4 = 0.7.]

Figure 4 The convergence patterns for the weights using 16 bit fixed point values. In the graph on the left 6 fractional bits are used, and on the right 8 fractional bits are used.

[Figure 5: two plots of weight value versus number of iterations (0–250); left panel "Fractional bits: 10", right panel "Fractional bits: 11"; w1 = 3.2, w2 = 1.1, w3 = −2.1, w4 = 0.7.]

Figure 5 The convergence patterns for the weights using 16 bit fixed point values. In the graph on the left 10 fractional bits are used, and on the right 11 fractional bits are used.

From the figures it is clearly seen that as the number of fractional bits is increased, the convergence becomes much more stable and the number of big fluctuations decreases. There is a limit, though: as the number of fractional bits is increased, the number of integer bits is decreased, and at some point there are too few integer bits to represent the values accurately. When this happens the algorithm becomes unstable and the weights do not converge, as shown in figure 5 on the right. The simulations therefore show that 10 fractional bits give the best performance with respect to convergence and stability.


Implementation Using the PLX instruction set

The above Matlab simulations showed that 16 bits are enough to represent the values in the RLS algorithm while still maintaining good performance. The number of fractional bits should be 10, which actually doesn't affect the implementation very much; it only determines the format in which the input values, the constants and the output values are stored. The PLX assembly code for the RLS algorithm can be found at the end of this report.

Implementation of different dimensional multiplications using the PLX instruction set

The RLS algorithm involves the computation of multiplications and additions, and even a scalar division. Of the multiplications, there are some vector-scalar products, some vector-vector products and some matrix-vector products. The results of such multiplications can be vectors, scalars or matrices, depending on the combination and orientation used. The PLX architecture provides thirty-two 64-bit registers, each of which can store four sub-words of 16-bit word length. Hence, for ease of implementation, we have defined our vectors to be sized 4 x 1 and our matrices to be sized 4 x 4. Also, each element of a vector or matrix, as well as each scalar, is defined to be 16 bits in size. As a result, we define a convention wherein a vector is stored entirely in 1 register, with each 16-bit sub-word of the register carrying an element of the vector. Storing a matrix is similar: the first row of the matrix is stored like a vector (in 1 register), while the other 3 rows are stored in 3 other registers. A scalar is a little different: since only 16 bits are required, scalars are stored in the least significant 16-bit sub-word of the appropriate register. They can then be replicated in the other sub-words, depending on the requirement of the computation. Described below are examples of certain common multiplication operations, all of which appear in the RLS algorithm that is implemented using PLX instructions; the register-packing convention itself is illustrated first.
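To illustrate the convention, here is a small Python sketch that packs four 16-bit two's complement values into one 64-bit integer standing in for a PLX register. We assume element 0 goes into the least significant sub-word; the actual register layout may differ.

# Pack four 16-bit two's complement values into one 64-bit "register".
def pack4(vec):
    reg = 0
    for i, v in enumerate(vec):
        reg |= (v & 0xFFFF) << (16 * i)   # element i -> sub-word i
    return reg

def unpack4(reg):
    out = []
    for i in range(4):
        sub = (reg >> (16 * i)) & 0xFFFF
        out.append(sub - 0x10000 if sub & 0x8000 else sub)  # sign-extend
    return out

r = pack4([1, -2, 3, -4])
print(hex(r), unpack4(r))   # 0xfffc0003fffe0001 [1, -2, 3, -4]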


Matrix-Vector product: Initially, each row of the matrix P (P is a 4x4 matrix) is stored in a register, so each element has 16 bits allocated for its storage. Each row of P is then multiplied with the column vector 'u' using the instructions 'pmul.odd' and 'pmul.even'. This is done because, in the present form of sub-word parallelism available with the 'pmul.2' instruction, only odd- or even-indexed sub-words can be multiplied simultaneously, and each result is stored as a 32-bit number. Subsequently, though, we limit the multiplication result to the lower 16 bits in order to maintain a standard limit on the space allocated to each variable; thus, wrap-around arithmetic is used here. After the requisite multiplications, the 'mix.2.r' instruction is used 4 times to concatenate the appropriate results together. After these instructions are executed, we have 4 registers; the content of each register represents four 16-bit words, each of which is the product of an element of P and an element of 'u'. Now the contents of each register need to be added up. This is not possible while the 16-bit data are in the same register, so they are moved to corresponding positions in different registers using the 'check.4' and 'excheck.4' instructions. Once this is accomplished, the final vector resulting from the matrix-vector product is obtained using 3 successive 'padd.2' operations. A plain-integer sketch of this wrap-around product is given below.
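The NumPy sketch below models this procedure: each product keeps only its low 16 bits (wrap-around), and the per-row partial products are then summed. It models the arithmetic only, not the register-level data movement.

# Matrix-vector product with 16-bit wrap-around arithmetic, as described.
import numpy as np

def wrap16(a):
    """Keep the low 16 bits of each element, read as two's complement."""
    return (np.asarray(a, dtype=np.int64) + 2**15) % 2**16 - 2**15

def matvec16(P, u):
    prods = wrap16(P.astype(np.int64) * u.astype(np.int64))  # pmul, keep low 16 bits
    return wrap16(prods.sum(axis=1))                          # padd.2 accumulation

P = np.arange(16, dtype=np.int16).reshape(4, 4)
u = np.array([1, 2, 3, 4], dtype=np.int16)
print(matvec16(P, u))   # [ 20  60 100 140] -- equals P @ u while nothing overflows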

Vector-vector inner product: An inner product of two vectors is essentially the transpose (or Hermitian, i.e. the complex-conjugate transpose) of one vector multiplied by the other, and it results in a scalar. The instructions 'pmul.odd' and 'pmul.even' are used to generate the 4 multiplication results. These 2 'result' registers are then added together using the 'padd.2' instruction. Using the same argument of retaining the lower 16 bits of each multiplication result, all that remains is to add the least significant and second-most significant words of the result of the 'padd.2' instruction. This is done by first using the 'excheck.4' instruction and then the 'padd.2' instruction again. The final result is stored in the least significant word of the register.

Vector-vector outer product: A vector-vector outer product results in a matrix; for example, a Kx1 vector multiplied by a 1xK vector results in a KxK matrix. Here, we have to multiply a column vector with the Hermitian of another column vector. Since all our data are real, the Hermitian merely translates into the transpose operation. Each element of the matrix resulting from the outer product is the product of an element of one vector and an element of the other. For example, the first element of the first column vector is multiplied with each element of the transposed vector to give the first row of the result matrix. Similarly, the 2nd element of the first column vector is multiplied with each element of the transposed vector to give the 2nd row of the result matrix. This process continues until all rows of the result matrix are formed. In terms of PLX implementation, this is done by initially replicating the first element of the first column vector into all 64 bits of a register using the 'permset.2' instruction. Then the requisite multiplications are performed using the 'pmul.odd' and 'pmul.even' instructions. Now the 'mix.2.r' instruction is used to arrange the elements of the first row of the matrix as 16-bit sub-words of the register in which it is stored. This process is then repeated for each of the other elements of the first column vector, to generate the remaining rows of the result matrix, as sketched below.
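Below is the same replicate-and-multiply scheme in NumPy form: element i of k is broadcast across a row (the 'permset.2' step) and multiplied element-wise by x to form row i of the outer product, again with 16-bit wrap-around. This is an illustration of the scheme, not the PLX data path itself.

# Outer product k x^T built row by row, mirroring the permset.2/pmul scheme.
import numpy as np

def wrap16(a):
    return (np.asarray(a, dtype=np.int64) + 2**15) % 2**16 - 2**15

def outer16(k, x):
    rows = [wrap16(np.full(4, ki, dtype=np.int64) * x.astype(np.int64))  # permset.2 + pmul
            for ki in k]
    return np.stack(rows)                                                # one register per row

k = np.array([1, 2, 3, 4], dtype=np.int16)
x = np.array([5, 6, 7, 8], dtype=np.int16)
print(outer16(k, x))   # same as np.outer(k, x) while nothing overflows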

Vector-scalar multiplication: This results in another vector, and the operation is a subset of the vector-vector outer product described above. The scalar is loaded into all 64 bits of a register using the 'permset.2' instruction. The 'pmul.odd' and 'pmul.even' instructions are then employed, followed by the 'mix.2.r' instruction, to obtain a register whose 64 bits are filled with four 16-bit values, each of which represents an element of the result vector.

Scalar division: Division is not supported in the PLX instruction set, and hence we had to develop a division algorithm. This is done using repeated shift and subtract operations performed in a loop. The denominator is first shifted 15 places to the left, after which the loop is started. The denominator, if greater than zero, is compared with the numerator, and if the numerator is the larger of the two, the denominator is subtracted from it. The difference is stored as the new numerator, a 1 is shifted into the result register from the right, and the denominator is shifted one place to the right. A new iteration then starts by again comparing the denominator with the numerator. If the denominator is the larger, a 0 is shifted into the result register from the right, the denominator is shifted one place to the right, and a new iteration is started. This continues until the denominator has been shifted 31 times, after which the division is done and the result is obtained.
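A Python sketch of this shift-and-subtract (restoring) division follows. Under these assumptions, for positive integer operands the returned quotient equals floor(num * 2^15 / denom); the caller interprets the fixed-point scaling.

# Shift-and-subtract division: 31 compare/subtract/shift steps develop the
# quotient bit by bit, as in the Division subroutine of the PLX code.
def divide_shift_sub(num, denom):
    denom <<= 15                   # pre-shift the denominator 15 places left
    quotient = 0
    for _ in range(31):
        quotient <<= 1             # next quotient bit enters from the right
        if denom > 0 and num >= denom:
            num -= denom           # subtract and record a 1 bit
            quotient |= 1
        denom >>= 1                # shift the denominator one place right
    return quotient

print(divide_shift_sub(3, 2))   # 49152 == (3 * 2**15) // 2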

Problems with the PLX instruction set

During our work on implementing our algorithm using the PLX instruction set, we came across a problem with the compare instruction cmp. It turns out that the PLX simulator has an error, and hence in some cases the cmp instruction gives wrong results. An example of this is shown in figure 6.

Figure 6 A screen dump showing the execution of the PLX simulator. As highlighted, the cmp.ge instruction first yields a wrong result; one instruction later, the same instruction suddenly gives the right result.

The result of the first compare instruction is wrong, but when the exact same instruction is executed one instruction later, it suddenly gives the right result. We reported the problem and were sent another simulator, and we initially thought the problem was fixed, because it didn't give the same errors as the first simulator.


Figure 7 The second simulator also has problems with the compare instruction. It doesn't give the same errors as the first simulator, but a different error, as shown by the small oval in the lower left.

Unfortunately, it has a different error, also involving the compare instruction. An example is shown in figure 7, where the cmp.gt should evaluate to true but returns false.

We have therefore not been able to test our implementation of the algorithm, and cannot verify whether it works correctly.


Conclusion

The RLS algorithm is widely used in adaptive beamforming, tracking and other filtering applications. This project has investigated the algorithm with the goal of implementing it using the PLX instruction set, to take advantage of the sub-word parallelism in the PLX architecture. A successful implementation in PLX would make it possible to use the RLS algorithm in practical applications at very low cost, because it only requires a PLX processor and not the development of dedicated hardware, which can be very costly to manufacture.

The main performance criteria connected with the implementation, namely speed, stability and convergence, have been studied. From the analysis performed it was concluded that 16 bits are enough to store the variables in the algorithm, and with 10 of these as fractional bits the algorithm is stable and good convergence is achieved.

The functionality of the implementation in PLX has not been verified, because errors were found in the PLX simulator, and hence the results of compare instructions are unreliable. Commenting on the speed of the implementation anyway, a count of instructions in the assembly file shows that there are 78 instructions in the main loop. This doesn't include the division subroutine, which takes a varying number of instructions depending on the operands. As currently implemented using shift and subtract, it takes a long time to execute, and it would therefore be very beneficial to add a separate division module to the PLX processor. In that case the 78 instructions would be a good estimate of how fast this algorithm can be run on a PLX processor. With a processor speed of 100 MHz and one instruction completed per cycle, this gives an iteration frequency of about 1.28 MHz, which in many cases will be fast enough for real-time applications.
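A one-line check of that estimate, under the stated single-cycle-per-instruction assumption:

# Iteration rate estimate: 100 MHz clock, 78 instructions per iteration.
print(100e6 / 78)   # ~1.28e6 iterations per second, i.e. about 1.28 MHz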


The PLX assembly code implementation

// all values are 16 bits: 8 integer and 8 fractional

// constants
#define lambda1 0x0100   // 1.0, represents lambda inverse
#define delta1  0xFA00   // 250.0, represents delta inverse

#define u      R1
#define P1     R2
#define P2     R3
#define P3     R4
#define P4     R5
#define x      R6
#define k      R7
#define d      R8
#define alpha  R9
#define w      R10

#define lambda R11
#define count  R27
#define num    R28
#define denom  R29
#define divres R30

main proc
// Initialization
    loadi.z.0 w,0x0000

    loadi.z.3 P1,delta1
    loadi.z.2 P2,delta1
    loadi.z.1 P3,delta1
    loadi.z.0 P4,delta1

    loadi.z.3 R12,lambda1
    permset.2 lambda,R12,3333

// Load data u and d
Loop:
    loadi.z.0 u,0x0010
    loadi.z.1 u,0x0020
    loadi.z.2 u,0x0030
    loadi.z.3 u,0x0040

    loadi.z.0 d,0x0001

// Calculate lambda*P
    pmul.even R12,lambda,P1
    pmul.odd  R13,lambda,P1
    mix.2.r   P1,R13,R12

    pmul.even R12,lambda,P2
    pmul.odd  R13,lambda,P2
    mix.2.r   P2,R13,R12

    pmul.even R12,lambda,P3
    pmul.odd  R13,lambda,P3
    mix.2.r   P3,R13,R12

    pmul.even R12,lambda,P4
    pmul.odd  R13,lambda,P4
    mix.2.r   P4,R13,R12

// Calculate x = lambda*P*u
    pmul.even R12,P1,u
    pmul.odd  R13,P1,u   // row #1 of P times u

    pmul.even R14,P2,u
    pmul.odd  R15,P2,u   // row #2 of P times u

    pmul.even R16,P3,u
    pmul.odd  R17,P3,u   // row #3 of P times u

    pmul.even R18,P4,u
    pmul.odd  R19,P4,u   // row #4 of P times u

    mix.2.r R12,R12,R14
    mix.2.r R13,R13,R15
    mix.2.r R14,R16,R18
    mix.2.r R15,R17,R19

    check.4   R16,R12,R14
    check.4   R17,R13,R15
    excheck.4 R18,R14,R12
    excheck.4 R19,R15,R13

// R16,R17,R18,R19 contain the sub-words of x, which are to be added
    padd.2 R16,R16,R17
    padd.2 R17,R18,R19
    padd.2 x,R16,R17     // x = lambda*P*u

// Calculate the term to be divided by, i.e. (1 + u.x)
    pmul.even  R12,x,u
    pmul.odd   R13,x,u
    padd.2     R14,R12,R13
    excheck.4  R15,R14,R0
    paddincr.2 denom,R14,R15
// The rightmost 16 bits of 'denom' contain the scalar,
// while the first 48 bits from the left are 'don't care'

    call Division

// Calculate k = (divres) times (x)
    permset.2 R12,divres,3333
    pmul.odd  R13,R12,x
    pmul.even R14,R12,x
    mix.2.r   k,R13,R14

// Calculate Hermitian of 'w' times 'u', which is a scalar
// (used in the calculation of alpha)
    pmul.even R12,w,u
    pmul.odd  R13,w,u
    padd.2    R14,R12,R13
    excheck.4 R15,R14,R0
    padd.2    R15,R14,R15
// The rightmost 16 bits of 'R15' contain the required scalar,
// while the first 48 bits from the left are 'don't care'

// Calculate alpha
    psub.2 alpha,d,R15
// assuming that the scalar 'd' is stored
// in the rightmost 16 bits of the register 'd' (R8)

// Calculate k times alpha
    permset.2 R12,alpha,0000
// here we are replicating the 16-bit alpha into all 64 bits of R12,
// assuming alpha to be in the rightmost 16 bits; if it is in the
// leftmost 16 bits, 0000 should be replaced with 3333
    pmul.odd  R13,R12,k
    pmul.even R14,R12,k
    mix.2.r   R15,R13,R14
// 'k times alpha' is stored in R15

// Calculate w(n) = w(n-1) + 'k times alpha'
    padd.2 w,w,R15
// the 2nd argument w is w(n-1), which is added to R15, and the result
// is stored in the first argument w, which is now w(n)

// Calculate k times 'Hermitian of x' ... 'Hermitian of x' is simply the
// transpose of x, since x is real-valued
    permset.2 R12,k,3333
    pmul.odd  R13,R12,x
    pmul.even R14,R12,x
    mix.2.r   R15,R13,R14
// R15 is the first row of 'k times Hermitian of x'

    permset.2 R12,k,2222
    pmul.odd  R13,R12,x
    pmul.even R14,R12,x
    mix.2.r   R16,R13,R14
// R16 is the second row of 'k times Hermitian of x'

    permset.2 R12,k,1111
    pmul.odd  R13,R12,x
    pmul.even R14,R12,x
    mix.2.r   R17,R13,R14
// R17 is the third row of 'k times Hermitian of x'

    permset.2 R12,k,0000
    pmul.odd  R13,R12,x
    pmul.even R14,R12,x
    mix.2.r   R18,R13,R14
// R18 is the fourth row of 'k times Hermitian of x'

// Calculate P and store it in P1,P2,P3 and P4 (rows 1 to 4, respectively, of P)
    psub.2 P1,P1,R15
    psub.2 P2,P2,R16
    psub.2 P3,P3,R17
    psub.2 P4,P4,R18

    jmp Loop

stop: trap 0FFFFh

// Shift-and-subtract division subroutine. Note that P1-P4 in the cmp and
// predicated instructions below are predicate registers, distinct from the
// P1-P4 register macros defined above.
Division:
    loadi.z.0 num,0x0001    // numerator
    loadi.z.0 count,0x001F  // load counter
    loadi.z.0 R26,0x0001

    slli denom,denom,15

compare:
    cmp.ge num,denom,P1,P2
    P1 cmp.gt denom,R0,P1,P2
    P1 jmp sub

    slli divres,divres,1              // shift a 0 into the result
shift:
    psub count,count,R26
    cmp.eq count,R0,P3,P4
    P3 jmp divdone

    srli denom,denom,1
    jmp compare

sub:
    pshiftadd.1.l divres,divres,R26   // shift result left, bringing a 1 in
    psub.2.s num,num,denom
    jmp shift

divdone:
    ret R31


References

Simon Haykin, Adaptive Filter Theory, Prentice Hall, 2002.

Kalavai J. Raghunath and Keshab K. Parhi, "Fixed and Floating Point Error Analysis of QRD-RLS and STAR-RLS Adaptive Filters," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP-94), 19-22 April 1994, vol. III, pp. III-81 - III-84.

Minglu Jin, "Partial updating RLS algorithm," Proc. 7th Int. Conf. on Signal Processing (ICSP '04), 31 Aug.-4 Sept. 2004, vol. 1, pp. 392-395.

K. J. R. Liu and An-Yeu Wu, "Algorithms and architectures for split recursive least squares," Proc. IEEE Workshop on VLSI Signal Processing VII, 26-28 Oct. 1994, pp. 460-469.
