A New Multiplier Structure for Digital Signal Processing. Approved for public release; distribution unlimited.


REPORT DOCUMENTATION PAGE

Title: A New Multiplier Structure for Digital Signal Processing

Author: Dr. Fred J. Taylor

Performing organization: University of Cincinnati, Department of Electrical and Computer Engineering, Cincinnati, OH 45221

Program element, project, task area and work unit numbers: 61102F, 2304/A6

Controlling office: Air Force Office of Scientific Research/NM, Bolling AFB, DC 20332

Security class (of this report): UNCLASSIFIED

Distribution statement (of this report): Approved for public release; distribution unlimited.

Abstract: The charge of this AFOSR grant was the study of a new class of memory intensive digital arithmetic units based on modular algebra. The new class of arithmetic units, developed under this grant, operate at very high speeds, admit VLSI and bit-slice realizations, and can be integrated into digital signal processing systems.


FINAL REPORT

AFOSR Grant F49620-79-C-0066

Title: A New Multiplier Structure For Digital Signal Processing

Principal Investigator: Dr. Fred J. Taylor

Department of Electrical and Computer Engineering

University of Cincinnati

Cincinnati, Ohio 45221

Grant Monitor: Dr. J. Brain

AFOSR/NM

Bolling Air Force Base, DC

Washington, DC 20332

NOTICE OF TRANSMITTAL TO DDC

This technical report has been reviewed and is approved for public release IAW AFR 190-12 (7b). Distribution is unlimited.

A. D. Blose

Technical Information Officer


TABLE OF CONTENTS

PART I Introduction

Residue Arithmetic

RNS Capabilities

Large Moduli AU

Modulo p Adder

VLSI-RNS Multipliers

VLSI-RNS Multiplier Structure

RNS to Decimal Conversion

Overflow Tolerant RNS Multipliers

Error Analysis

PART II System Design in the RNS

M-S Residue Architecture (MSR)

Finite Wordlength Effects

M-K RNS Filter Architecture

RNS Filter Design

MK Architecture

JKL-RNS Architecture

Distributed RNS Filter

PART III Applications

PART IV Summary

References

Appendix A

Appendix B

Appendix C


INTRODUCTION

The charge of AFOSR Grant F49620-79-C-0066 was the study of a new class of memory intensive digital arithmetic units based on modular algebra. The new class of arithmetic units, developed under this grant, operate at very high speeds, admit VLSI and bit-slice realizations, and can be integrated into digital signal processing systems.

Numerous authors have demonstrated the potential of residue arithmetic for realizing high speed signal processing and computational systems.[1-6] These methods are memory intensive in that table lookup operations are used to perform modular arithmetic. However, there is a possible flaw in this contemporary residue arithmetic work, and it is our dependence on high speed memory. Admittedly, memory is becoming available with higher densities and access speeds. However, such memories place a non-trivial power demand on the system and are very expensive. For example, Intel's HMOS 1K x 4 memories, having access times of 55, 70, and 80 ns, cost on the order of $82, $76, and $62 per copy. INMOS 16K (4K x 4) static RAM is available at 35 ns, and Fairchild markets a 1K ECL (high power dissipation) RAM at higher per unit costs.[i] Our research has determined that by using 4K x 1 HMOS memories, an equivalent 12 x 12 RNS multiplier could be configured which has a pipelined throughput of 35 ns, but it would require 9.9 watts of active power and 1.65 watts standby. Furthermore, the cost per modulus would exceed $1,000. Therefore, high performance residue based signal processing systems may carry a high price tag as well. It may therefore be wise to rethink our dependence on memory intensive arithmetic.

Footnote [i]: This condition will be strongly influenced by the results of DOD's VHSIC Program.


It would seem advantageous to architect future residue arithmetic based systems on those technologies which will provide the highest performance in terms of the following parameters:

1. speed

2. cost

3. power dissipation

4. packaging

5. availability

6. system compatibility

The last two parameters are unfortunately often neglected in exploratory research efforts. It would reflect poor engineering practice to develop a technology dependent theory which is incompatible with its electronic environment. The technology which seems to yield the greatest promise is VLSI.[ii] High performance arithmetic units are already available in VLSI. For example, the TRW VLSI carry-save 2's complement multiplier line breaks down as follows:

TABLE 1

UNIT        SIZE      PINS   SPEED (ns)   POWER (watts)

MPY-8HJ-1   8 x 8     40     45           1.5

MPY-12HJ    12 x 12   64     80           2.7

MPY-16HJ    16 x 16   64     100          4.0

MPY-24HJ    24 x 24   64     200          5.0

By comparison, the 12 x 12 35 ns RNS multiplier is more than twice as fast as the 12 x 12 VLSI unit but consumes more than 3.5 times the power. However, the above VLSI multiplier units are designed to work in 2's complement and therefore do not support residue arithmetic directly. Since these basic fixed point 2's complement VLSI multipliers offer outstanding performance for the price, it is desirable to integrate them into a residue number system (RNS) structure.

Footnote [ii]: Small Scale Integration (SSI): 50 gates; MSI: 50-100 gates; VLSI: 4000 gates/chip.

RESIDUE ARITHMETIC

Before a meaningful dialog on residue arithmetic units and systems can be established, the fundamental properties of this numbering system should be reviewed.

The residue number system (RNS) is a mature mathematical study. A serious study of the RNS was offered by Gauss in the 19th century. In 1968 Szabo and Tanaka published the book "Residue Arithmetic and Its Applications to Computer Technology".[7] Due to the technological limitations of the period, the book did not receive widespread recognition and was soon out of print. However, due to the recent availability of high density, high performance Read Only Memory (ROM) and Random Access Memory (RAM), the RNS is being reinvestigated for application in digital filter design, implementation of fast transforms, convolution, and optical computation.

Let $P = (p_1, p_2, \ldots, p_L)$ be a set of relatively prime integers, and let $x$ be any integer in $[0, M-1]$ (called the dynamic range), where $M = \prod_{i=1}^{L} p_i$. Then by the Euclidean algorithm, there exist $k_i, x_i \in I$ (integers), with $0 \le x_i < p_i$, such that

$$x = k_i p_i + x_i, \qquad i = 1, 2, \ldots, L \tag{1}$$

The quantity $x_i$ is called the ith residue of $x$, and is usually denoted as $|x|_{p_i}$ or $x \bmod p_i$.


It is easy to show that $x$ and $M+x$ have the same residue representation. Only if $x \in [0, M-1]$ can $x$ be uniquely determined by the L-tuple $(x_1, x_2, \ldots, x_L)$. In this case denote $x = (x_1, x_2, \ldots, x_L)$.

A signed encoding scheme can also be used. In this case, the dynamic range is $[-(M-1)/2, (M-1)/2]$, with a negative number $-|x|$ coded as $M - |x|$. There is a trivial isomorphism which maps $[0, M-1]$ onto $[-(M-1)/2, (M-1)/2]$. This second coding scheme has the advantage that sign detection is not required during arithmetic operations, and the sign of the result will take care of itself provided that no overflow (out of dynamic range) has occurred.
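The two encodings above can be illustrated with a minimal Python sketch. The moduli set (3, 5, 7) and the function names are assumptions chosen only for illustration (an odd M keeps the signed range symmetric); they do not come from the report.

```python
from math import prod

MODULI = (3, 5, 7)   # hypothetical pairwise relatively prime moduli
M = prod(MODULI)     # dynamic range M = 105


def to_residues(x, moduli=MODULI):
    """Return the L-tuple (x mod p_1, ..., x mod p_L)."""
    return tuple(x % p for p in moduli)


def encode_signed(x, moduli=MODULI):
    """Encode x in [-(M-1)/2, (M-1)/2]; a negative -|x| is coded as M - |x|."""
    m = prod(moduli)
    half = (m - 1) // 2
    if not -half <= x <= half:
        raise OverflowError("value outside the signed dynamic range")
    return to_residues(x % m, moduli)


print(to_residues(23))        # residues of 23
print(to_residues(23 + M))    # same tuple: x and M + x are indistinguishable
print(encode_signed(-4))      # same tuple as to_residues(M - 4)
```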

The following are some identities which will be used later. The proofs are straightforward and will not be presented.

$$|x \pm y|_p = \big|\, |x|_p \pm |y|_p \,\big|_p \tag{2}$$

$$|xy|_p = \big|\, |x|_p \, |y|_p \,\big|_p \tag{3}$$

$$|-x|_p = |p - x|_p \tag{4}$$

Let $Z_p$ be the set of integers $x$ such that $0 \le x < p$ (i.e., the residue classes). It is well known that $Z_p$ is a commutative ring under addition and multiplication modulo $p$. For any integer $x \in Z_p$, the inverse of $x$ is the integer $y \in Z_p$ such that $|xy|_p = 1$. It is also known that if $x$ is relatively prime to $p$, then $x$ has a unique inverse, denoted $x^{-1}[p]$. For example, in $Z_6$, $5^{-1}[6] = 5$.
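As a quick check of the inverse definition, the sketch below uses Python's built-in three-argument `pow` (Python 3.8 or later) to compute the inverse of x modulo p. The helper name `mod_inverse` is an assumption for illustration.

```python
from math import gcd


def mod_inverse(x, p):
    """Return y in Z_p with (x * y) mod p == 1; requires gcd(x, p) == 1."""
    if gcd(x, p) != 1:
        raise ValueError("x has no inverse: it is not relatively prime to p")
    return pow(x, -1, p)


print(mod_inverse(5, 6))   # the example from the text
```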

Arithmetic operations in the RNS are defined in a straightforward manner. Let $x, y \in Z$, $x, y \in [0, M-1]$, with $x = (x_1, x_2, \ldots, x_L)$ and $y = (y_1, y_2, \ldots, y_L)$. Then $z = x \circ y = (z_1, z_2, \ldots, z_L)$, where $z_i = (x_i \circ y_i) \bmod p_i$ for $i = 1, 2, \ldots, L$, and "$\circ$" denotes the operation $\times$, $+$ or $-$.

It is clear that the sub-operations within the individual moduli are independent of one another. That is, no carry information is necessary between moduli. The arithmetic is also exact and therefore free of round-off error. The result $z$ is exact if $0 \le z \le M-1$; however, if $x \circ y \ge M$ (overflow), then the answer will be incorrect. Hence it is critical to know beforehand that the result will not exceed the finite RNS dynamic range. Division in the RNS is known to be difficult. Therefore the RNS is considered to be best applied to systems where division is not a dominant operation.
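The carry-free, digit-by-digit arithmetic and the overflow hazard described above can be sketched as follows. The moduli (3, 5, 7) and the helper names are illustrative assumptions, not taken from the report.

```python
import operator
from math import prod

MODULI = (3, 5, 7)   # hypothetical moduli set, M = 105
M = prod(MODULI)


def residues(x):
    return tuple(x % p for p in MODULI)


def rns_op(op, xr, yr):
    """Apply op (+, - or *) independently in each modulus; no carries cross moduli."""
    return tuple(op(a, b) % p for a, b, p in zip(xr, yr, MODULI))


# 9 * 11 = 99 lies inside [0, M-1], so the residue result is exact.
print(rns_op(operator.mul, residues(9), residues(11)) == residues(99))

# 12 * 11 = 132 exceeds M - 1: the result silently wraps to 132 - 105 = 27.
print(rns_op(operator.mul, residues(12), residues(11)) == residues(27))
```

Nothing in the residue digits flags the second case, which is why the dynamic range has to be budgeted in advance.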

Another RNS induced scheme is called the mixed-radix numbering system (MRNS). Given the moduli set $P = (p_1, p_2, \ldots, p_L)$, any integer $x \in [0, M-1]$ can be expressed uniquely as

$$x = R_1 + R_2 p_1 + R_3 p_1 p_2 + \cdots + R_L (p_1 p_2 \cdots p_{L-1}) \tag{5}$$

Let $q_1 = 1$ and $q_i = \prod_{j=1}^{i-1} p_j$; then eq. (5) can be written as

$$x = R_1 q_1 + R_2 q_2 + R_3 q_3 + \cdots + R_L q_L \tag{6}$$

or equivalently, $x$ can be represented uniquely by the L-tuple $x = \langle R_1, R_2, \ldots, R_L \rangle$. The $R_i$'s are called the mixed radix digits, with $0 \le R_i \le p_i - 1$. The mixed-radix number system is a weighted number system. Therefore carries between digits are necessary in arithmetic operations. A property of a weighted number system is that magnitude comparison is trivial.

It is often necessary to compute the M.R. digits $\langle R_1, R_2, \ldots, R_L \rangle$ from a given set of residue digits $(x_1, x_2, \ldots, x_L)$. Here

$$(x_1, x_2, \ldots, x_L) \leftrightarrow x = R_1 + R_2 p_1 + \cdots + R_L \prod_{i=1}^{L-1} p_i \tag{7}$$

Hence,

$$|x|_{p_1} = x_1 = \Big| R_1 + R_2 p_1 + \cdots + R_L \prod_{i=1}^{L-1} p_i \Big|_{p_1} = R_1 \tag{8}$$

After subtracting $R_1$ from both sides of eq. (7), one obtains

$$x - R_1 = R_2 p_1 + \cdots + R_L \prod_{i=1}^{L-1} p_i \tag{9}$$

$$|x - R_1|_{p_2} = \Big| R_2 p_1 + \cdots + R_L \prod_{i=1}^{L-1} p_i \Big|_{p_2} = |R_2 p_1|_{p_2}$$

Upon multiplying both sides by $p_1^{-1}[p_2]$, which exists by the relatively prime property of $p_1$ and $p_2$, one obtains

$$\big| (x - R_1)\, p_1^{-1}[p_2] \big|_{p_2} = \big| R_2\, p_1\, p_1^{-1}[p_2] \big|_{p_2} = R_2$$

This process can be carried out successively until all $R_i$'s are obtained. Actually, the iterative process can be realized in parallel form due to the independence of the residues. An algorithm found in Szabo and Tanaka [7] can be used.
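The successive subtract-and-multiply-by-inverse procedure can be sketched in Python as below. This is a plain sequential rendering under assumed names (`mixed_radix_digits`, `from_mixed_radix`) and an assumed moduli set; it is not the parallel algorithm of Szabo and Tanaka.

```python
def mixed_radix_digits(residue_digits, moduli):
    """Convert residues (x_1..x_L) to mixed radix digits <R_1..R_L>."""
    work = list(residue_digits)
    digits = []
    for i, p_i in enumerate(moduli):
        r_i = work[i]            # current leading residue is the next M.R. digit
        digits.append(r_i)
        for j in range(i + 1, len(moduli)):
            p_j = moduli[j]
            # subtract R_i, then multiply by the inverse of p_i modulo p_j
            work[j] = ((work[j] - r_i) * pow(p_i, -1, p_j)) % p_j
    return digits


def from_mixed_radix(digits, moduli):
    """Evaluate eq. (5): x = R_1 + R_2 p_1 + R_3 p_1 p_2 + ..."""
    x, weight = 0, 1
    for r, p in zip(digits, moduli):
        x += r * weight
        weight *= p
    return x


moduli = (3, 5, 7)
digits = mixed_radix_digits((2, 3, 2), moduli)   # residues of 23
print(digits, from_mixed_radix(digits, moduli))
```

The inner loop touches each remaining modulus independently, which is the property that allows a parallel hardware realization.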

It was noted that division is difficult in the RNS. However, in the case that the divisor is a fixed constant $c$ (where $c$ is relatively prime to $p_i$, $i = 1, \ldots, L$), there is known to exist some simplification of the scaling task. The scaling operation is formally defined as follows: Given $P = (p_1, p_2, \ldots, p_L)$ and $x = (x_1, x_2, \ldots, x_L)$, what is the residue representation of $[x/c]$? (Here $[\cdot]$ denotes the rounding to an integer operation; with $0 \le |x|_c < c$ in eq. (10) this is rounding down.) From the Euclidean algorithm, namely

$$x = \left[\frac{x}{c}\right] c + |x|_c \tag{10}$$

it follows that

$$\left[\frac{x}{c}\right] = \frac{x - |x|_c}{c} \tag{11}$$

which is of integer value. The residue representation of this integer is given in terms of a "scaling kernel" satisfying

$$\left| \left[\frac{x}{c}\right] \right|_{p_i} = \Big| \big( x_i - \big|\, |x|_c \,\big|_{p_i} \big)\, c^{-1}[p_i] \Big|_{p_i} \tag{12}$$

Thus, if $|x|_c$ is known, then the residue representation of $[x/c]$ can be obtained using one subtraction and one multiplication. Since $c$ and $p_i$ are relatively prime, $c^{-1}[p_i]$ exists with respect to $p_i$. Usually $|x|_c$ will not be given and has to be found by a base extension algorithm.
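A minimal sketch of the scaling kernel follows. The function name, the moduli (3, 5, 7) and the constant c = 4 are assumptions for illustration, and the value of x mod c is simply passed in by the caller; in hardware it would come from a base extension step, which is not modeled here.

```python
def scale_by_constant(x_residues, x_mod_c, c, moduli):
    """Residues of (x - (x mod c)) / c: one subtraction and one multiplication
    by the inverse of c in each modulus. Requires gcd(c, p_i) == 1 for all i."""
    return tuple(((x_i - x_mod_c) * pow(c, -1, p)) % p
                 for x_i, p in zip(x_residues, moduli))


moduli = (3, 5, 7)
x, c = 99, 4
scaled = scale_by_constant(tuple(x % p for p in moduli), x % c, c, moduli)
print(scaled, tuple((x // c) % p for p in moduli))   # both describe 24
```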

The integer value of a residue representation can also be obtained through the use of the Chinese Remainder Theorem (CRT). Given $P = (p_1, p_2, \ldots, p_L)$, where $p_i$, $p_j$ are relatively prime for $i \ne j$, the CRT states:

Theorem (Chinese Remainder Theorem)

$$|x|_M = \Big| \sum_{i=1}^{L} m_i \big| x_i\, m_i^{-1}[p_i] \big|_{p_i} \Big|_M \tag{13}$$

where

$$m_i = \frac{M}{p_i} = \prod_{j=1,\, j \ne i}^{L} p_j \quad \text{and} \quad \big| m_i\, m_i^{-1}[p_i] \big|_{p_i} = 1.$$

Proof: Since a residue number represents an integer uniquely in the dynamic range $[0, M-1]$, it is enough to show that the right-hand side of eq. (13) has residues $(x_1, x_2, \ldots, x_L)$. Every term with $i \ne j$ contains the factor $p_j$ in $m_i$ and therefore vanishes modulo $p_j$, so that

$$\Big| \sum_{i=1}^{L} m_i \big| x_i\, m_i^{-1}[p_i] \big|_{p_i} \Big|_{p_j} = \Big| m_j \big| x_j\, m_j^{-1}[p_j] \big|_{p_j} \Big|_{p_j} = |x_j|_{p_j} = x_j, \qquad j = 1, \ldots, L.$$

The claim follows.

Notice that the left-hand side of eq. (13) is in the form of $|x|_M$. That is, the resulting integer will be unique if $0 \le x < M$.
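Eq. (13) translates almost directly into code. The following Python sketch (function name and test moduli are illustrative assumptions) decodes a residue tuple back to the integer in [0, M-1].

```python
from math import prod


def crt_decode(residue_digits, moduli):
    """Recover x in [0, M-1] from its residues using the CRT sum of eq. (13)."""
    big_m = prod(moduli)
    total = 0
    for x_i, p_i in zip(residue_digits, moduli):
        m_i = big_m // p_i                       # m_i = M / p_i
        total += m_i * ((x_i * pow(m_i, -1, p_i)) % p_i)
    return total % big_m                         # the outer reduction modulo M


print(crt_decode((2, 3, 2), (3, 5, 7)))   # residues of 23
```

The final reduction modulo M is the step that makes this decoding awkward in hardware, since M is large compared with the individual moduli.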

There is yet another method which may be used to decode a residue tuple. This method has been independently reported by Jenkins [8] and Jullien.[9] Starting from the residue representation $(x_1, x_2, \ldots, x_L)$, the mixed radix representation $\langle R_1, R_2, \ldots, R_L \rangle$ is obtained through an M.R. conversion. Then eq. (5) is used to reconstruct $x$. This method is called M.R. reconstruction.

RNS CAPABILITIES

Interest in the RNS is due to its ability to perform high-speed arithmetic. Speed is achieved through the use of a high degree of parallelism and an absence of carry information requirements. Recall that the arithmetic composition of two integers, say $x = (x_1, x_2, \ldots, x_L)$ and $y = (y_1, y_2, \ldots, y_L)$, given by $x \circ y$ (where $\circ$ denotes addition, subtraction, or multiplication), satisfies $x \circ y = (x_1 \circ y_1, \ldots, x_L \circ y_L)$. It can be seen that each residue digit, namely $x_i \circ y_i$, can be computed independently of all others (i.e., no carry information is required). In practice, the mapping of $x_i$ and $y_i$ into $x_i \circ y_i$ is accomplished using table lookups, where the table resides in randomly accessed read-only memory. Typical high-speed memory modules, which are currently available, are:

TABLE 2

Device    Type   Technology   Configuration   Access Speed

10149     ROM    ECL          256x4           20 ns

SN54S     ROM    TTL          1024x4          35 ns

2147H-1   RAM    HMOS         4096x1          30 ns

2125H-1   RAM    HMOS         1024x1          20 ns

2167      RAM    HMOS         16384x1         45 ns

IMS1400   RAM    MOS          16384x4         30 ns

The product of two residues modulo $p_i$, $p_i \le 2^n$, can be precomputed and stored in a $2^m \times n$-bit memory unit, where $m = 2n$. Using a large existing high-speed memory (4Kx1 at 30 ns), residues having up to six bit integer values can be used (e.g., $P = \{64, 63, \ldots\}$). Thus, fixed-point multipliers having a dynamic range of $[-M/2, M/2)$ can be architected which have execution rates in the low nanoseconds.
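The direct table-lookup multiplier can be modeled in a few lines. This is only a software sketch of the addressing scheme, with assumed names and p = 63, n = 6; the two n-bit residues are concatenated into a 2n-bit address into a 2^(2n)-word table.

```python
def build_product_table(p, n):
    """Precompute (a * b) mod p for all residues, addressed by the pair (a, b)."""
    size = 1 << n
    table = [0] * (size * size)          # 2**(2n) words
    for a in range(p):
        for b in range(p):
            table[(a << n) | b] = (a * b) % p
    return table


def lookup_multiply(table, a, b, n):
    """One memory access replaces the multiply-and-reduce operation."""
    return table[(a << n) | b]


table = build_product_table(63, 6)       # 6-bit residues -> 4096-word table
print(len(table), lookup_multiply(table, 62, 62, 6))
```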

The disadvantages of the residue number system are manifold. Since the RNS possesses no most significant digit, decimal to residue conversion, division, magnitude comparison, and arithmetic shift operations are cumbersome and should be avoided. Register overflow, due to the finite dynamic range, imposes a severe constraint on RNS operations. Unlike weighted numbers (decimal, binary, etc.), where rounding or truncating least significant digits can control overflow, such is not the case in the RNS. Since there is an absence of least significant digits, the more general and inefficient operation known as scaling must be used. Since scaling is a form of division, its use should be discouraged. To gain insight into this problem, consider the inner product of two 31-dimensional real vectors $x$ and $y$ whose entries are encoded as residue digits with respect to $P = \{32, 31, 29, 27\}$. Without scaling, the worst-case value of the entries of $x$ and $y$ would be limited to $V^{.5}$, where $V = M/31 = 25056$ (with $M = 32 \cdot 31 \cdot 29 \cdot 27 = 776{,}736$). Therefore, to ensure that no worst case overflow can occur, a 7.3-bit (i.e., $V^{.5} = 158 \approx 2^{7.3}$) dynamic range limitation must be imposed on $x$ and $y$. With scaling, larger input ranges can be used at the expense of statistical accuracy in the output space (analogous to roundoff errors).
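The numbers in the inner-product example can be reproduced with the short calculation below. Note that the quoted figure 25056 equals M/31 for this moduli set; the variable names are illustrative assumptions.

```python
from math import isqrt, log2, prod

moduli = (32, 31, 29, 27)
M = prod(moduli)                 # full dynamic range
terms = 31                       # length of the inner product

V = M // terms                   # largest allowed magnitude of a single product
entry_bound = isqrt(V)           # largest allowed magnitude of a single entry
bits = log2(entry_bound)

print(M, V, entry_bound, round(bits, 1))
# Worst case check: 31 products of 158 * 158 still fit below M.
print(terms * entry_bound * entry_bound < M)
```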

Due to the dynamic range limitation of RNS systems, one is generally forced to accept one of the following two overflow prevention strategies.

1. Increase the dynamic range to a sufficiently large value by adding more moduli to P, or

2. Make scaling a more efficient operation.

The first option represents a brute force attack on the problem. Such an approach will increase the cost and complexity metrics of a filter. In addition, the moduli set P must be tailored to each individual filter. The other approach appears to be the most popular at this time. Szabo and Tanaka, and others, have concentrated on scaling efficiency through the choice of the three-tuple moduli set $P = \{2^n - 1, 2^n, 2^n + 1\}$. This moduli set has the ability to efficiently scale a residue number by any one of the chosen moduli. However, there is an intrinsic limitation plaguing this method, and it is its dynamic range. Using a large high-speed memory unit, say 4Kx1, the input addressing space is limited to $2^{12}$. This means that a modulus $p_i$ is technically limited to $p_i \le 2^6$ (i.e., $x_i \cdot y_i < 2^{12}$). Therefore, the dynamic range of any modular operation is given by $M = (2^n - 1)(2^n)(2^n + 1) \approx 2^{3n} = 2^{18}$. In many applications, an 18-bit resolution is insufficient.

LARGE MODULI AU

It is desirable to keep the previously discussed three moduli structure for purposes of potential scaling needs. However, in order to overcome the existing disadvantage of this system, that of dynamic range, a new architecture is called for. Since it is unrealistic to assume that substantially larger density high-speed memories will continue to become available, it is incumbent that a more memory efficient residue arithmetic unit be designed. An efficient algorithm, which is ideally suited for this application, is known as the quarter-square multiplier.[10-12]

$$\langle xy \rangle_p = \langle \phi(s^+) - \phi(s^-) \rangle_p \tag{14}$$

where $\phi(s) = \langle s^2 \rangle_p$ with $s^+ = (x+y)/2$ and $s^- = (x-y)/2$.
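The algebraic identity behind eq. (14), namely xy = ((x+y)/2)^2 - ((x-y)/2)^2, can be checked with exact rational arithmetic before any modular reduction is applied. The sketch below uses Python's `fractions.Fraction`; the function name is an assumption for illustration.

```python
from fractions import Fraction


def quarter_square_product(x, y):
    """Return s_plus**2 - s_minus**2 with s = (x +/- y) / 2, as an exact rational."""
    s_plus = Fraction(x + y, 2)
    s_minus = Fraction(x - y, 2)
    return s_plus ** 2 - s_minus ** 2


print(quarter_square_product(7, 5))    # both halves are integers here
print(quarter_square_product(6, 3))    # both halves carry a .25 that cancels
```

The two squared terms depend on one argument each, so each can be read from a table addressed by a single sum or difference rather than by an operand pair.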

The quarter-square multiplier has been studied by J.M. Pollard (1976) in a Galois field. Questions of hardware implementation were not considered and, due to the Galois field limitation, only prime moduli could be considered. H. Nussbaumer (1976) studied the quarter-square multiplier over real fields for use in ROM intensive digital filters. Soderstrand and Fields (1977) made brief reference to this multiplier for residue arithmetic but offered no satisfactory hardware realization. Our research has produced a practical residue arithmetic quarter-square modular multiplier in commercially available hardware.


A problem that would seem to plague the quarter-square multiplier is the need to realize the division by two of the sums and differences. In general, the existence of an $N^{-1}$, such that $\langle N N^{-1} \rangle_p = 1$, can only be guaranteed if $N$ is relatively prime to $p$. Since one of the chosen moduli is $p = 2^n$, the multiplicative inverse of 2 cannot be guaranteed to exist. Therefore, the quarter-square multiplier cannot be directly interpreted as the equation $\langle \langle 1/4 \rangle_{p_i} \langle (x+y)^2 - (x-y)^2 \rangle_{p_i} \rangle_{p_i}$. The potential problem of dividing the sum or difference, found in equation 14, by 2 will be explicitly and efficiently treated for the first time later in this paper.

For a $2^m$ word memory unit, the direct product architecture (i.e., $xy$) would limit the maximal modulus to be bounded by $2^n$, $n = m/2$. In fact, this claim can be extended to the case where $p = 2^n + 1$ through use of the following modification. Observe that if $x_i = 0$, then it automatically follows that $\langle x_i y_i \rangle_{p_i} = 0$. Therefore, if $x_i = 0$ (which is a detectable condition in that the (n+1)st bit and the remaining n-bit block are both zero), the output register would be automatically cleared. Therefore, the lookup table need not be accessed for this case. Instead, the all zero n-bit portion of the table address, allocated to $x_i$, can be used to represent $x_i = 2^n$, whose binary word is a one in the (n+1)st bit followed by n zeros. Here, the table would be programmed to map $y_i$ into $\langle 2^n y_i \rangle_{p_i}$ using only a $2^m$ word memory.
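The address-reuse trick for p = 2^n + 1 can be modeled as follows. The text describes the treatment of x_i; applying the same treatment to y_i so that both address fields stay n bits wide is an assumption of this sketch, as are the function names.

```python
def build_table(n):
    """Product table for p = 2**n + 1 using two n-bit address fields (2**(2n) words)."""
    p = (1 << n) + 1
    size = 1 << n

    def operand(field):
        # An all-zero address field is reinterpreted as the operand 2**n.
        return field if field != 0 else size

    return [[(operand(a) * operand(b)) % p for b in range(size)]
            for a in range(size)]


def multiply(x, y, n, table):
    """Multiply residues modulo 2**n + 1 without widening the table address."""
    if x == 0 or y == 0:
        return 0                      # zero detect: clear the output register
    mask = (1 << n) - 1               # keep only the low n bits of each operand
    return table[x & mask][y & mask]


n = 4
table = build_table(n)
print(multiply(16, 16, n, table), (16 * 16) % 17)
```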

The memory requirements associated with the quarter-square multiplier are substantially less than those of direct mechanizations. First, it should be apparent that the integers $s^+$ and $s^-$, found in equation 14, are bounded from above by $2^{n+1}$. Therefore, only an (n+1)-bit table addressing space is required to realize $\phi(s)$, versus the 2n-bit space needed for direct architectures. It would appear, however, that there is an exception to this rule, since one of the moduli chosen is $p = 2^n + 1$. Here the maximal value of $s^+$ (or $s^-$) is $2^{n+1}$, which would technically require an (n+2)-bit address. However, by using the protocol found in Figure 2, which is an adaptation of the network found in Figure 1, the table size can be reduced to $2^{n+1}$ words for all moduli. Here, the overflow bit serves to differentiate $s^+ = 0$ from $2^{n+1}$.

The quarter-square architecture is abstracted in Figure 3. It uses a $2^{n+1}$ word high-speed memory for modular arithmetic lookup operations. Using, for example, the previously referenced 4K, 30 ns device, moduli having an 11-bit dynamic range (vs. 6-bit in the direct form) can be mechanized. This would yield a three-moduli dynamic range on the order of $2^{3(11)} \approx 8.6 \times 10^9$. That is, without an increase in memory size (and therefore access time), the dynamic range of the quarter-square multiplier is $2^{33}/2^{18} = 2^{15}$ times larger than that obtainable through direct means! This large increase in dynamic range makes the RNS a viable alternative to traditional filter design methods. Both improved precision and throughput (through the reduction or absence of traditional scaling operations) can be achieved.
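The memory arithmetic above can be restated as a small calculation. The helper names are assumptions for illustration; the only inputs are the 12 address bits of a 4K-word memory and the three-moduli structure.

```python
def direct_modulus_bits(address_bits):
    """Direct product table: the address holds both operands, so n = m / 2."""
    return address_bits // 2


def quarter_square_modulus_bits(address_bits):
    """Quarter-square table: the address holds one sum or difference, so n = m - 1."""
    return address_bits - 1


m = 12                                         # 4K words
n_direct = direct_modulus_bits(m)              # 6-bit moduli
n_qs = quarter_square_modulus_bits(m)          # 11-bit moduli
gain = 2 ** (3 * n_qs) // 2 ** (3 * n_direct)  # three moduli in each case

print(n_direct, n_qs, 2 ** (3 * n_qs), gain == 2 ** 15)
```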

Several versions of the multiplier algorithm can be considered. They are summarized in Figure 4. The first, called the sequential form, would have an estimated throughput rate of 240 ns based on a 60 ns lookahead adder and memory having an access time of 30 ns with a cycle time of 60 ns. The second architecture, called the parallel form, would run at a 180 ns rate. The parallel architecture is preferred because of its higher speed and simpler control. A 60 ns pipelined execution rate can be purchased at a small hardware cost.

[Figure 1: RAM or ROM lookup table, z = xy mod p, with CLEAR control.]

[Figure 2: caption illegible; modulo p.]

[Figure 3: not recoverable.]

[Figure 4: multiplier architectures; timing labels 60 ns and 180 ns.]

Upon closer investigation of the table lookup data base, a potential nuisance can be found. It can be exemplified by observing that if $x - y$ is odd and $p = 32$, then $\phi(s^-) = \langle (x-y)^2/4 \rangle_{32}$ has the form X.25. Therefore, it would appear that two additional fractional bits need to be added to the table's output wordlength. However, this is not the case, as suggested by the following theorem:

Theorem: Let $\|v\|$ denote the integer value of $v$. Then $z = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$. That is, only the integer value of $\phi$ need be used and the fractional bits of $\phi(s^\pm)$ can be ignored.

Proof: For $x$, $y$ and $k$ integers, one may define two rational numbers, namely $(x+y)/2 = v + k/2$ and $(x-y)/2 = q + k/2$, where $k = 0$ or 1 (the same $k$ serves both because $x+y$ and $x-y$ have the same parity). Then

$$z = \langle \langle (x+y)^2/4 \rangle_p - \langle (x-y)^2/4 \rangle_p \rangle_p = \langle \langle v^2 + kv + k^2/4 \rangle_p - \langle q^2 + kq + k^2/4 \rangle_p \rangle_p = \langle \langle v^2 + kv \rangle_p + k^2/4 - \langle q^2 + kq \rangle_p - k^2/4 \rangle_p = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p.$$

As a result, the parallel architecture is equivalent to that shown in Figure 5. Furthermore, by deriving the above theorem over a rational field, and showing that the results pertain to the integers, several classical problems are overcome:

1. The quarter-squared multiplier is not restricted to the

Galois fields suggested by Pollard.

2. The question of the existence of the multiplicative

inverse of 4 is now moot.
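The theorem can be exercised with a small table-driven sketch. Here the table is addressed by the integer sum or difference t = x ± y and stores the integer part of t²/4 reduced modulo p. The names are assumptions, and the table carries one entry more than the 2^(n+1) words quoted in the text because the overflow-bit protocol of Figure 2 is not modeled.

```python
def build_quarter_square_table(p, n):
    """Entry t holds floor(t*t / 4) mod p for t = 0 .. 2**(n+1)."""
    return [(t * t // 4) % p for t in range((1 << (n + 1)) + 1)]


def qs_multiply(table, x, y, p):
    """z = (table[x + y] - table[|x - y|]) mod p; no division by 2 or 4 is needed."""
    return (table[x + y] - table[abs(x - y)]) % p


n = 5
for p in (2 ** n - 1, 2 ** n, 2 ** n + 1):
    table = build_quarter_square_table(p, n)
    ok = all(qs_multiply(table, x, y, p) == (x * y) % p
             for x in range(p) for y in range(p))
    print(p, ok)
```

The even modulus 2^n is handled exactly like the odd ones, which illustrates why the existence of an inverse of 4 no longer matters.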

MODULO p ADDER

The quarter-square multiplier requires that a modulo p adder be used to combine the two component parts of the solution (namely $\phi(s^+)$ and $\phi(s^-)$). Modulo p adders pose an interesting design problem. Unless a fast modulo p adder can be fabricated, the overhead associated with addition will offset any gain in throughput achieved through table lookups. For the moduli chosen, $2^n - 1$, $2^n$, and $2^n + 1$, only the modulo $2^n$ adder can be realized directly (an n-bit adder with ignored overflow). It would, however, be desirable to use an n-bit adder to realize the modulo $2^n - 1$ and $2^n + 1$ adders as well. For the purpose of clarity, let $s$ be defined to be the sum of $\phi(s^+)$ and $\phi(s^-)$. The following observation then follows:

[Figure 5: parallel quarter-square multiplier architecture; drawing not recoverable.]

TABLE 3

                                      Modulo 2^N adder                Modulo p_i adder   Example: N = 3
Case  Integer range of s              s mod 2^N   OVF bit   p_i       s mod p_i          s    s mod p_i

1     s = 0                           0           0         2^N - 1   0                  0    0

2     1 <= s <= 2^N - 2               s           0         2^N - 1   s                  4    4

3     s = 2^N - 1                     s           0         2^N - 1   0                  7    0

4     s = 2^N                         0           1         2^N - 1   s - 2^N + 1        8    1

5     2^N + 1 <= s <= 2^(N+1) - 4     s - 2^N     1         2^N - 1   s - 2^N + 1        10   3

6     s = 0                           0           0         2^N       0                  0    0

7     1 <= s <= 2^N - 1               s           0         2^N       s                  4    4

8     s = 2^N                         0           1         2^N       0                  8    0

9     2^N + 1 <= s <= 2^(N+1) - 2     s - 2^N     1         2^N       s - 2^N            10   2

10    s = 0                           0           0         2^N + 1   0                  0    0

11    1 <= s <= 2^N - 1               s           0         2^N + 1   s                  4    4

12    s = 2^N                         0           1         2^N + 1   s                  8    8

13    2^N + 1 <= s <= 2^(N+1) - 1     s - 2^N     1         2^N + 1   s - 2^N - 1        10   1

14    s = 2^(N+1)                     0           0         2^N + 1   s - 2^N - 1        16   7   (special case)

Using n-bit AND gates to sense the zero condition of $\langle s \rangle_{2^N}$, the overflow bit OVF, and the sign bits of $\phi(s^+)$ and $\phi(s^-)$, combinational logic can be defined which will map $\langle s \rangle_{2^N}$ into $\langle s \rangle_{p_i}$. It can be noted from the data found in Table 3 that the mapping requirements are:

1. for $p = 2^N - 1$, map $s$ to $s$ or $s - 2^N + 1 = \langle s \rangle_{2^N} + 1$.

2. for $p = 2^N$, map $s$ to $s$ or $s - 2^N = \langle s \rangle_{2^N}$.

3. for $p = 2^N + 1$, map $s$ to $s$ or $s - 2^N - 1 = \langle s \rangle_{2^N} - 1$.
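The three mappings can be written out as a behavioral model of an n-bit adder followed by correction logic. This sketch follows the case structure of the adder table (zero detect, OVF bit, and the special case in which both operands equal 2^N); it is not a gate-level description, and the function name is an assumption.

```python
def mod_add_via_nbit_adder(a, b, n, p):
    """Add residues a, b modulo p in {2**n - 1, 2**n, 2**n + 1} from n-bit adder outputs."""
    full = a + b
    two_n = 1 << n
    low = full & (two_n - 1)          # n-bit adder output, s mod 2**n
    ovf = (full >> n) & 1             # overflow (carry-out) bit

    if p == two_n:                    # mapping 2: the adder output is the answer
        return low
    if p == two_n - 1:                # mapping 1: unchanged, cleared, or plus one
        if ovf:
            return low + 1
        return 0 if low == p else low
    if p == two_n + 1:                # mapping 3: unchanged or minus one
        if full == 2 * two_n:         # special case: both operands equal 2**n
            return two_n - 1
        if ovf:
            return two_n if low == 0 else low - 1
        return low
    raise ValueError("p must be 2**n - 1, 2**n or 2**n + 1")


print(mod_add_via_nbit_adder(5, 5, 3, 7))    # s = 10 -> 3
print(mod_add_via_nbit_adder(8, 8, 3, 9))    # s = 16 -> 7 (special case)
```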

Mapping two is trivially satisfied with an n-bit adder. The other two mappings require that the adder output remain unchanged or be decremented or incremented by unity. There are several ways to approach this problem. Bioul, Davio, and Quisquater have presented an unorthodox architecture for a modulo $(2^n - 1)$ adder using two-input gates. Modulo $(2^n + 1)$ adders can also be realized through the use of end-around-carries. However, compared to modulo $2^n$ addition, this approach would almost double the addition delay. This extended delay problem can be overcome through added complexity (i.e., time multiplexing two end-around-carry adders). Mappings one and three can be efficiently realized in the manner suggested by the example found in Appendix A. The functional operation of adding one (mapping 1) or subtracting one (mapping 3) from the output of an n-bit adder is performed by a PLA. The PLA will provide an overlay mask which accomplishes the required task. The derivation and utility of the mask can be understood in the context of the following example.

Example: Suppose s is an 11-bit word having a decimal value of slO = 92 or

s2=O000101lO0. If slo-I = 91 or (slo-l)2+ 00001011 lll; is desired, one notes

that only the 3-LSB's of s2 need be altered. In general, for n=12, only the

i = following i3 distinct birary masks are required to for.n (slo-1)2.

MSB        Pattern        LSB
X X X X X X X X X X X X
X X X X X X X X X X X 0
X X X X X X X X X X 0 1
          . . .
0 1 1 1 1 1 1 1 1 1 1 1

Notation:
X        - leave corresponding bit location of s2 unchanged
1 (or 0) - change corresponding bit location of s2 to 1 (or 0)

Table II. MASK
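The decrement masks can be generated and applied in a few lines. The sketch below assumes the mask notation of Table II (X leaves a bit alone, 0 or 1 overwrites it); the function names `decrement_pattern` and `apply_pattern` are hypothetical, and a string stands in for the PLA output and FET switches.

```python
def decrement_pattern(s, n):
    """Return the n-character overlay mask that turns s into s - 1.

    Only the lowest set bit and the zeros below it change: the set bit
    becomes 0 and every bit below it becomes 1.
    """
    if not 0 < s < (1 << n):
        raise ValueError("s must satisfy 0 < s < 2**n")
    t = (s & -s).bit_length() - 1      # index of the lowest set bit
    return "X" * (n - t - 1) + "0" + "1" * t


def apply_pattern(s, pattern):
    """Overlay the mask on s: 'X' keeps the bit, '0'/'1' overwrite it."""
    bits = format(s, "0{}b".format(len(pattern)))
    out = "".join(b if m == "X" else m for b, m in zip(bits, pattern))
    return int(out, 2)
```

For the example in the text, `decrement_pattern(92, 11)` yields a mask that alters only the three LSBs. For n = 12 the function produces 12 distinct overwrite masks; together with the all-X (no change) mask this gives the 13 masks quoted above.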

Suppose the modulus $p = 2^n+1$, n = 12, is to be implemented. By using two commercially available 16x9 PLAs in parallel, the 12-bit output of an n-bit adder (shown as $\langle s\rangle_{2^n}$ in Table I) and the four previously specified control bits can be converted to a 13-bit mask. The mask would transform the output of a high-speed n-bit adder to $s$ or $s-2^n-1$, depending on the state of the 4 control bits. Based on a 25-ns 12-bit Schottky lookahead adder, a 20-ns 16x9 PLA, and 10-ns FET mask switches (see the notation comments of Table II), a 65-ns modulo p adder, for $p = 2^n-1$, $2^n$, and $2^n+1$, can be realized. The presence of a 65-ns modulo p adder will now allow a 140-ns large moduli residue multiplier based on 35-ns 4Kx1 HMOS memory units (see Figure 5). For a moduli set $\{2^{12}-1, 2^{12}, 2^{12}+1\}$, a fixed-point multiplier having an output dynamic range of $2^{36}-2^{12}$ can thus be fabricated having a word rate of 7.143 M multiplications per second. This compares favorably with new 16x16 VLSI multipliers. Using a pipelined architecture, which requires the insertion of the storage registers found in Figure 5, a very impressive throughput figure of 28.5 M multiplications per second can be achieved. It is important, and fortunate, to realize that the Intel HMOS memory unit used in this analysis has a cycle time equal to the access time. If, as is often found in practice, a memory unit has a cycle time approximately twice the access time, then the pipeline delay would increase from 35 ns to 70 ns.

VLSI RNS MULTIPLIERS

As previously noted, the 16-bit three-moduli 35 ns pipelined multiplier is more than three times as fast as the VLSI unit but consumes more than four times the power and is significantly more complex. However, the above VLSI multiplier units are designed to work in 2's complement and therefore do not support residue arithmetic directly. In this paper, the algebraic elegance and speed of the RNS are combined with the technological advantages of VLSI to achieve a high-performance modular multiplier.

Since the RNS is an exact numbering system, the nesting of modular arithmetic operations can result in register overflow. Register overflow occurs when the result of an arithmetic operation exceeds the admissible dynamic range M. For a set of relatively prime moduli $P=\{p_1,\ldots,p_L\}$, $M=\prod p_i$, $i=1,2,\ldots,L$. Overflow prevention in the RNS is accomplished through the use of a relatively inefficient operation referred to as scaling. This can be mechanized using the mixed-radix conversion algorithm or the Chinese Remainder Theorem.[5] To ensure that there will be some degree of efficiency in the scaling operation, the moduli set must be carefully chosen.[6] A particularly useful moduli set is $P=\{2^n-1, 2^n, 2^n+1\}$. Based on this choice of moduli, a VLSI based residue multiplier can be realized in commercially available hardware.
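The overflow behaviour described above can be made concrete with a small sketch. It assumes n = 4, so that P = {15, 16, 17}; the helper names `to_rns` and `rns_mul` are hypothetical and chosen only for illustration.

```python
from math import prod


def to_rns(x, moduli):
    """Encode the integer x as a tuple of residues, one per modulus."""
    return tuple(x % p for p in moduli)


def rns_mul(xr, yr, moduli):
    """Multiply two RNS numbers digit by digit; no carries cross digits."""
    return tuple((a * b) % p for a, b, p in zip(xr, yr, moduli))


n = 4
MODULI = ((1 << n) - 1, 1 << n, (1 << n) + 1)   # {15, 16, 17}
M = prod(MODULI)                                 # dynamic range

# In range: 60 * 60 is below M, so the residue product is exact.
in_range = rns_mul(to_rns(60, MODULI), to_rns(60, MODULI), MODULI)

# Overflow: 70 * 70 exceeds M, so the result aliases to (70 * 70) mod M.
wrapped = rns_mul(to_rns(70, MODULI), to_rns(70, MODULI), MODULI)
```

Nothing in the residue digits of `wrapped` signals that the dynamic range was exceeded, which is why scaling or a suitably large M is required.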

VLSI-RNS MULTIPLIER STRUCTURE

This structure will be presented as three special cases.


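Moduli p=2^n

For the purpose of discussion, consider $p=2^n$ to be a modulus and $x$, $x \in Z_p$, to be the composite number

$\langle x\rangle_p = x = 2^m x_{HI} + x_{LO}$;  $x_k = x_{HI}$ or $x_{LO}$, $0 \le x_k \le 2^m-1$    (16)

where $m=n/2$. Here $x_{HI}$ and $x_{LO}$ are m-bit positive integers and $Z_p$ is the residue class of integers modulo p. For a y having the same format, it follows that $z=\langle xy\rangle_p$ is given by

$z = \langle xy\rangle_p = \langle 2^n a + b + 2^m(c+d)\rangle_p$    (17)

where:

$a = x_{HI}y_{HI}$;  $0 \le a \le (2^m-1)^2 < 2^n$

$b = x_{LO}y_{LO}$;  $0 \le b \le (2^m-1)^2 < 2^n$

$c = x_{HI}y_{LO}$;  $0 \le c \le (2^m-1)^2 < 2^n$

$d = x_{LO}y_{HI}$;  $0 \le d \le (2^m-1)^2 < 2^n$

$V = c+d$;  $0 \le V \le 2(2^m-1)^2$

Under the hypothesis that $p=2^n$, and noting $2^m = 2^n/2^m$, z computes to be

$z = \langle \langle 2^n a\rangle_{2^n} + \langle b\rangle_{2^n} + \langle 2^n (c+d)/2^m\rangle_{2^n} \rangle_{2^n}$    (19)

in which the first term, $\langle 2^n a\rangle_{2^n}$, is zero.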

The last term in equation 19 may seem to pose a potential hardware realization difficulty. However, this need not be the case in light of the following interpretation. Suppose that the (n+1)-bit binary representation of the positive integer $V=(c+d)$ has the form xx...x (x=0 or 1). Then $V/2^m$ can be formed by simply defining the binary point to precede the mth LSB. That is, $V/2^m = IV + .XV$ where IV is the integer part of $V/2^m$ and .XV the fractional part. Thus $\langle 2^n V/2^m\rangle_{2^n} = \langle 2^n(IV+.XV)\rangle_{2^n} = \langle 2^n IV\rangle_{2^n} + \langle 2^n(.XV)\rangle_{2^n}$. Computing $\langle 2^n(.XV)\rangle_{2^n}$ could promise to be an inefficient operation if conventional digital methods are used. However, this need not be the case since XV is known to be an m-bit word where m is n/2. For example, if a 24-bit modulus is desired (which represents a substantial improvement over the 5-bit moduli typically found in the literature), then m=12, and a 4K x 1 high speed (35 ns) memory can be used to implement the mapping $\langle 2^n(.XV)\rangle_{2^n}$ as a table lookup operation. The partial product terms could then be combined by a modulo $2^n$ adder to form $z=\langle xy\rangle_p$.
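A minimal sketch of the p = 2^n case follows. It assumes the m-bit fraction field XV is used directly as the table address; the names `build_table` and `mul_mod_2n` are hypothetical, and a Python list stands in for the high-speed memory.

```python
def build_table(n):
    """Table for <2**n * (.XV)>_{2**n}, addressed by the m-bit field XV.

    Since .XV = XV / 2**m and m = n/2, each entry is simply XV * 2**m.
    """
    m = n // 2
    return [(xv << m) % (1 << n) for xv in range(1 << m)]


def mul_mod_2n(x, y, n, table):
    """Form <x*y>_{2**n} from m-bit partial products and one lookup."""
    m = n // 2
    lo_mask = (1 << m) - 1
    x_hi, x_lo = x >> m, x & lo_mask
    y_hi, y_lo = y >> m, y & lo_mask
    b = x_lo * y_lo                    # the term 2**n * a vanishes mod 2**n
    v = x_hi * y_lo + x_lo * y_hi      # V = c + d
    return (b + table[v & lo_mask]) % (1 << n)
```

For a 24-bit modulus the table has 4096 entries, which matches the 4K memory model used in the text.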

Moduli p=2^n-1

Equation 16 can be rewritten in terms of the following set of relationships:

i.  $2^n = (2^n-1)+1$

ii. $2^m = 2^n/2^m = (2^n-1)/2^m + 1/2^m$    (20)

with

$z = \langle ((2^n-1)a + a) + b + ((2^n-1)(V/2^m) + V/2^m)\rangle_p$    (21)

From the previous analysis, one notes that $\langle a\rangle_p = a$, $\langle b\rangle_p = b$ and $V/2^m = IV + .XV$, with

$\langle (2^n-1)V/2^m\rangle_p = \langle (2^n-1)(IV+.XV)\rangle_p = \langle (2^n-1)(.XV)\rangle_p$.    (22)

Using lookup operations and a $2^m$ word memory unit, the modular mapping can again be implemented directly. The term $V/2^m$ is, as previously stated, simply a reassignment of the binary point of V. Again the partial product terms would be recombined using a modulo p adder.

Moduli p=2^n+1

This case requires special attention since it is not completely analogous to the previous cases considered. In particular, not all the residues in the residue class $Z_p$ can be encoded into an n-bit word and represented as $x = 2^m x_{HI} + x_{LO}$. In other words, the admissible residue $x=2^n$ does not conform to the accepted data format. However, $x=2^n$ is an easily detected case since it is represented by x = 1000...0 (i.e., test the MSB for 1 and the n LSBs for 0's). If x is detected to have a value of $2^n$ then only the following events are admissible:

TABLE 4

x      y      $z=\langle xy\rangle_p$, $p=2^n+1$                                                          example, n=6
$2^n$  y      $\langle 2^n y\rangle_{2^n+1} = \langle (2^n+1)y - y\rangle_{2^n+1} = \langle -y\rangle_{2^n+1}$    $\langle 64(y=5)\rangle_{65} = 65-5 = 60$
$2^n$  $2^n$  $\langle 2^{2n}\rangle_{2^n+1} = \langle (2^n+1)^2 - 2(2^n+1) + 1\rangle_{2^n+1} = 1$              $\langle 4096\rangle_{65} = 1$

These two possible events can be separately programmed without reducing throughput. That is, upon receipt of x (or y) = $2^n$, the output will be immediately set to $\langle -y\rangle_p$ (or 1).

An architecture capable of realizing the proposed large moduli multiplier in VLSI is suggested in Figure 5. This system is composed of four commercially available VLSI multipliers, one custom VLSI quad modulo p adder, and memory units for table lookup use. More will be said on the structure of the modulo p adder in the next section. For values of n=24 or 16 bits, and based on commercial multiplier specifications, a three moduli multiplier system can be built having a dynamic range on the order of 72 to 48 bits. Furthermore, based on these parameters, a multiplication can be partitioned into four 100 ns operations. This translates into a real-time throughput of 2.5 M multiplications per second for a serial realization or 10 M multiplications per second when pipelined (a most impressive 72 to 48-bit multiplication rate). The multiplier suggested in Figure 5 performs the following operations.

TABLE 5

Operation                               p=2^n   p=2^n-1                 p=2^n+1                 Level   Remarks
S1: $\langle 2^n x_{HI}y_{HI}\rangle_p$  0       a                       -a                      1       VLSI multiplier
S2: $\langle x_{LO}y_{LO}\rangle_p$      b       b                       b                       1       VLSI multiplier
T1: $x_{HI}y_{LO}$                       c       c                       c                       1       VLSI multiplier
T2: $x_{LO}y_{HI}$                       d       d                       d                       1       VLSI multiplier
U:  $\langle S1+S2\rangle_p$             b       $\langle a+b\rangle_p$  $\langle b-a\rangle_p$  2       mod p adder
V:  c+d                                  IV      IV                      IV                      2       adder-shift register
V':                                      0       $+V/2^m$                $-V/2^m$                2       adder-shift register
XV:                                      .XV     .XV                     .XV                     2       adder-shift register
W1: $\langle U+V'\rangle_p$              w1      w1                      w1                      3       mod p adder
W2: $\langle p(.XV)\rangle_p$            w2      w2                      w2                      3       table lookup
Z:  $\langle w1+w2\rangle_p$             z       z                       z                       4       mod p adder

Example: n=6, m=n/2=3.

Let x = 18 = 2(8)+2, i.e., 2:2 (HI:LO), and y = 31 = 3(8)+7, i.e., 3:7 (HI:LO). Then

p = 2^6 = 64:  $\langle xy\rangle_{64} = \langle 558\rangle_{64} = 46$
p = 63:        $\langle xy\rangle_{63} = \langle 558\rangle_{63} = 54$
p = 65:        $\langle xy\rangle_{65} = \langle 558\rangle_{65} = 38$

a=6, b=14, c=14, d=6

TABLE 6

Operation         p=64                                  p=63                                    p=65
S1: a             $\langle 64(6)\rangle_{64} = 0$       $\langle 63(6)+6\rangle_{63} = 6$       $\langle 65(6)-6\rangle_{65} = -6$
S2: b             $\langle 14\rangle_{64} = 14$         $\langle 14\rangle_{63} = 14$           $\langle 14\rangle_{65} = 14$
T1: c             14                                    14                                      14
T2: d             6                                     6                                       6
U:  u             14                                    20                                      8
V:  v             20 -> 0010100                         20 -> 0010100                           20 -> 0010100
V':               set to 0                              20/8 -> 0010.100                        -20/8 -> -0010.100
.XV               .100                                  .100                                    .100
W1: w1            14 -> 0001110                         22.5 -> 0010110.100                     5.5 -> 0000101.100
W2: w2 (lookup)   32 -> 0100000                         31.5 -> 0011111.100                     32.5 -> 0100000.100
Z:  z             14+32=46                              22.5+31.5=54                            5.5+32.5=38
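The operation sequence of Table 5 can be checked numerically against the worked values of Table 6. The sketch below is an assumption-laden software model, not the hardware of Figure 5: the name `split_multiply` and the parameter `kind` (0, -1 or +1 for p = 2^n, 2^n-1, 2^n+1) are hypothetical, exact rational arithmetic from Python's `fractions` module stands in for the fixed binary point, and the table lookup W2 is computed on the fly.

```python
from fractions import Fraction


def split_multiply(x, y, n, kind):
    """Return <x*y>_p for p = 2**n + kind, following the rows of Table 5."""
    m = n // 2                          # n is assumed even
    p = (1 << n) + kind
    if kind == 1:
        # Table 4 special case: the residue 2**n does not fit in n bits.
        if x == 1 << n:
            return (-y) % p
        if y == 1 << n:
            return (-x) % p
    lo_mask = (1 << m) - 1
    x_hi, x_lo = x >> m, x & lo_mask
    y_hi, y_lo = y >> m, y & lo_mask
    a, b = x_hi * y_hi, x_lo * y_lo     # level 1: four m-bit multipliers
    c, d = x_hi * y_lo, x_lo * y_hi
    v = c + d
    s1 = -kind * a                      # S1: 0, a or -a
    u = (s1 + b) % p                    # U
    v_prime = Fraction(-kind * v, 1 << m)   # V': 0, +V/2**m or -V/2**m
    xv = Fraction(v & lo_mask, 1 << m)      # .XV, the fractional part
    w1 = (u + v_prime) % p              # W1
    w2 = (p * xv) % p                   # W2: the table-lookup term
    z = (w1 + w2) % p                   # Z
    assert z.denominator == 1           # the fractional parts cancel
    return int(z)
```

With x = 18, y = 31 and n = 6 the three calls with `kind` equal to 0, -1 and +1 reproduce the bottom row of Table 6.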

RNS TO DECIMAL CONVERSION

One of the major disadvantages of the residue numbering system (RNS) is its inability to efficiently perform magnitude comparison. Magnitude comparisons are critical to general purpose RNS operations since they are generic to the management of dynamic range (register) overflow and conditional branching. Unlike weighted numbering systems, where overflow can be efficiently handled by comparing data fields starting at the most significant digit, RNS procedures are complex and time consuming.[4] Various versions of these RNS-to-decimal routines have been published which make use of modular table look-up operations and distributed (bit-slice) arithmetic.[13,14] However, the methods reported in the literature require a significant hardware investment and consume a disproportionate amount of run-time compared to other RNS computational operations (viz: addition, subtraction, and multiplication). In the AFOSR program, a new three-moduli RNS-to-decimal converter has been developed which is significantly more efficient than existing techniques.

With respect to the moduli set $P = \{p_1,p_2,p_3\}$ there exist, for $0 \le x < M$, three unique mixed radix conversion (MRC) digits $x \overset{MRC}{\longleftrightarrow} (\bar{x}_1,\bar{x}_2,\bar{x}_3)$ such that

$x = \bar{x}_1 + \bar{x}_2 p_2 + \bar{x}_3 p_2 p_3$    (23)

where $x \overset{RNS}{\longleftrightarrow} (x_1,x_2,x_3)$ with

$\bar{x}_1 = x_2$

$\bar{x}_2 = \langle p_2^{-1}[p_3] \cdot \langle x_3-x_2\rangle_{p_3}\rangle_{p_3}$

$\bar{x}_3 = \langle p_3^{-1}[p_1] \cdot \langle p_2^{-1}[p_1] \cdot \langle x_1-x_2\rangle_{p_1} - \bar{x}_2\rangle_{p_1}\rangle_{p_1}$    (24)

More specifically, for the choice of moduli $P=\{2^n-1, 2^n, 2^n+1\}$, it follows that

$p_2^{-1}[p_1] = 1$;  $p_2^{-1}[p_3] = 2^n$;  $p_3^{-1}[p_1] = 2^{n-1}$    (25)

Upon substituting these multiplicative inverses into equation 24, one obtains

$\bar{x}_2 = \langle 2^n \cdot \langle x_3-x_2\rangle_{2^n+1}\rangle_{2^n+1} = \langle -(x_3-x_2)\rangle_{2^n+1} = \langle x_2-x_3\rangle_{2^n+1}$

$\bar{x}_3 = \langle 2^{n-1} \cdot \langle \langle x_1-x_2\rangle_{2^n-1} - \bar{x}_2\rangle_{2^n-1}\rangle_{2^n-1}$    (26)

Functionally, it can be seen that

$\bar{x}_1 = f_1(x_2)$;  $\bar{x}_2 = f_2(x_2,x_3)$;  $\bar{x}_3 = f_3(x_1,x_2,x_3)$    (27)

The MRC algorithm can of course be realized by using sequential methods. Here, nested modulo $p_i$ adders and $p_i^{-1}[p_j]$ multipliers would be used to compute $\bar{x}_1, \bar{x}_2, \bar{x}_3$. The three-tuple of mixed radix digits would be used to compute x (equation 23) using these multiplications and additions. The disadvantage of the direct approach is execution speed due to the sequential complexity of the algorithm. Throughput improvements and a reduction in complexity can be achieved by using memory based table lookup operations to replace some arithmetic. If high speed is to be achieved, high-speed memory units must be employed. Such memories have a fairly restrictive input addressing space (5 to 12 bits). If mapping $f_3$ is to be realized by presenting all three residues to a 4K-35ns RAM or ROM, then n=4 and $M<2^{12}$.
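The sequential mixed radix conversion can be sketched for the moduli set {2^n-1, 2^n, 2^n+1} as follows. The digit ordering (radix p2 first, then p3) and the inverses 1, 2^n and 2^(n-1) are taken from the text; the function names `mrc_digits` and `mrc_to_int` are hypothetical.

```python
def mrc_digits(x1, x2, x3, n):
    """Mixed radix digits for p1 = 2**n - 1, p2 = 2**n, p3 = 2**n + 1.

    The weighted form is x = d1 + d2*p2 + d3*p2*p3.
    """
    p1, p3 = (1 << n) - 1, (1 << n) + 1
    d1 = x2
    # p2**-1 mod p3 is 2**n, i.e. -1, so d2 = <x2 - x3>_{p3}.
    d2 = (x2 - x3) % p3
    # p2**-1 mod p1 is 1 and p3**-1 mod p1 is 2**(n-1).
    d3 = ((1 << (n - 1)) * ((x1 - x2) % p1 - d2)) % p1
    return d1, d2, d3


def mrc_to_int(digits, n):
    """Recombine the mixed radix digits into the weighted integer."""
    d1, d2, d3 = digits
    p2, p3 = 1 << n, (1 << n) + 1
    return d1 + d2 * p2 + d3 * p2 * p3
```

Each digit depends on the previous one, which is the sequential dependency the text identifies as the speed limitation of the direct approach.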

Consider again the three moduli case where $P=\{2^n+1, 2^n, 2^n-1\}$, which specifies an RNS dynamic range $M=p_1p_2p_3$. Based on a 4K-word high-speed memory model, the previous medium moduli RNS-to-decimal converter was practically limited to a size of six bits per modulus (i.e., $M \approx 2^{18}$). The method presented in this research targets a 12-bit modulus for practical use (i.e., $M \approx 2^{36}$). The developed large moduli scheme can be easily motivated by the data found in Figures 6a and 6b plus Table 7. The data found in these figures and tables are based on the moduli set P={5,4,3} and M=60. The first three entries found in Table 7, namely $x_3$, $x_2$ and $x_1$, are the residue digits of x for x monotonically increasing over [0,59]. The fourth and fifth entries, namely J1 and J3, are hybrid parameters. Since $|p_2-p_1| = |p_3-p_2| = 1$, the values of J1 and J3 will increase by unity (in a modulo $p_i$ sense) for monotonically increasing values of x. The important observation is that J1 and J3 naturally decompose into a system of cyclic patterns which shall be denoted $S_{12}$ and $S_{32}$ over a subcover of M, say $S_2$. More specifically,

$S_{12}$ = three sets of five subsets of four elements each and $O(S_{12})=60$

$S_{32}$ = five sets of three subsets of four elements each and $O(S_{32})=60$

$S_2 = \{kp_2 \mid 0 \le k < p_1p_3\}$

In general, for $P=\{2^n+1, 2^n, 2^n-1\} = \{p_1,p_2,p_3\}$:

$S_{12}$ = $p_3$ sets of $p_1$ subsets of $p_2$ elements each and $O(S_{12}) = M = \prod p_i$

$S_{32}$ = $p_1$ sets of $p_3$ subsets of $p_2$ elements each and $O(S_{32}) = M = \prod p_i$

Using more traditional algebraic terminology, $S_{12}$, $S_{32}$ and $S_2$ are ideals in the ring of integers modulo M (i.e., $Z_M$). It is well known that in general the mapping

$x \overset{RNS}{\longleftrightarrow} (x_1,x_2,\ldots,x_L)$, $x_i = \langle x\rangle_{p_i}$    (28)

is an onto homomorphism with kernel $\cap I_j$. For $I_j=\{kp_j\}$ (as is the case here), $\cap I_j = 0$ and Maher has shown the mapping to be isomorphic.[16] It should also be apparent that, due to the cyclic nature of J1 and J3, any $\tilde{x}$ belonging to the subcover $S_2$ has the block representation

$\tilde{x} = \{(p_1 I_1 + J_1) \cdot p_2;\ 0 \le I_1 < p_3;\ 0 \le J_1 < p_1\}$    (29)

or

$\tilde{x} = \{(p_3 I_3 + J_3) \cdot p_2;\ 0 \le I_3 < p_1;\ 0 \le J_3 < p_3\}$    (30)

where $I_1$, $I_3$, $J_1$, and $J_3$ are integers. Equating equations 29 and 30, one obtains

$p_3 I_3 - p_1 I_1 = (J_1 - J_3)$    (31)

which is of the form ax+by=c. Equation 31 possesses a very important property which will now be derived.

Lemma 1:[17] If a, b, c are integers and at least one of a, b is nonzero, set d=gcd(a,b); then a solution to the Diophantine equation

ax + by = c    (32)

exists for integer values of x and y if and only if d | c.

Lemma 2: If b is relatively prime to a, then the congruence $bx \equiv c \pmod{a}$ has an integral solution x. Any two solutions $x_1$ and $x_2$ are congruent modulo a.

From these two lemmas, the following theorem can be stated:

Theorem: Given the Diophantine equation 31,

$p_3 I_3 - p_1 I_1 = (J_1 - J_3) = c$ (see Figure 3b)    (33)

the solution two-tuple $(I_1,I_3)$ is unique.

The proof is straightforward and is based on the fact that $p_1$ and $p_3$ are relatively prime, $I_1 \in [0,p_3-1]$, and $I_3 \in [0,p_1-1]$. Therefore, by specifying c, the block indices $(I_1,I_3)$ can be uniquely determined. Observe that $\tilde{x} \in S_2$ can be derived from knowledge of the two-tuple $(I_1,I_3)$. However, $(I_1,I_3)$ is uniquely determined by $c=J_1-J_3$. Therefore, upon presenting an (n+1)-bit word c to an (n+1)-bit high-speed RAM or ROM, the precomputed value of $S_1=p_2p_1I_1$ or $S_3=p_2p_3I_3$ can be outputted. The corresponding value of $\tilde{x} \in S_2$ can be realized by adding the integer $p_2J_1$ to $S_1$ or $p_2J_3$ to $S_3$. Lastly, if $x \in [0,M)$, one only needs to add $x_2$ to $\tilde{x}$. The decimal value x can therefore be computed in the composite form

$x = x_2 + p_2J_1 + I_1p_1p_2$    (34)

where, due to uniqueness, the mixed radix digits are $(x_2, J_1, I_1)$.

In general, for $P=\{2^n+1, 2^n, 2^n-1\}$, the routine would proceed as follows:

1. Accept $x \overset{RNS}{\longleftrightarrow} (x_1,x_2,x_3)$

2. Form $J_1 = \langle x_2-x_1\rangle_{p_1}$ and $J_3 = \langle x_3-x_2\rangle_{p_3}$

3. Form $J_1-J_3 = c$

4. Map $\phi(c) = p_1p_2I_1 = S_1$ or $p_3p_2I_3 = S_3$

5. Form $\tilde{x} = p_2J_1+S_1$ or $\tilde{x} = p_2J_3+S_3$; $\tilde{x} \in S_2$

6. Form $x = \tilde{x}+x_2$

These steps are numerically exemplified in Table 7 and diagrammed in Figure 7 for the {5,4,3} system.
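The six steps can be sketched as below. This is a software model under assumptions, not the architecture of Figure 7: the names `build_phi` and `rns_to_int` are hypothetical, a dictionary keyed by the signed value c stands in for the RAM or ROM, and the table is filled by brute-force enumeration of the block index k rather than by solving equation 31 directly.

```python
def build_phi(n):
    """Precompute phi(c) = p1*p2*I1 for every admissible c = J1 - J3."""
    p1, p2, p3 = (1 << n) + 1, 1 << n, (1 << n) - 1
    phi = {}
    for k in range(p1 * p3):            # k indexes the multiples of p2
        j1, j3 = k % p1, k % p3
        s1 = p1 * p2 * (k // p1)        # S1 = p1 * p2 * I1
        # Uniqueness (the Theorem): c alone must fix the block index I1.
        assert phi.setdefault(j1 - j3, s1) == s1
    return phi


def rns_to_int(x1, x2, x3, n, phi):
    """Steps 1 to 6 for P = {2**n + 1, 2**n, 2**n - 1} = {p1, p2, p3}."""
    p1, p2, p3 = (1 << n) + 1, 1 << n, (1 << n) - 1
    j1 = (x2 - x1) % p1                 # step 2
    j3 = (x3 - x2) % p3
    c = j1 - j3                         # step 3
    s1 = phi[c]                         # step 4: table lookup
    x_tilde = p2 * j1 + s1              # step 5
    return x_tilde + x2                 # step 6
```

For the {5,4,3} system (n = 2) the table holds one entry for each c in [-2, 4]. Note that the final additions are ordinary additions, not modulo M additions.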

Compared to conventional RNS-to-decimal conversion algorithms, the

derived algorithm possesses the following attributes:

1. no modulo M addition required as in the case of CRT or MRC methods

2. practical realization of very large moduli RNS systems

3. simple architecture and reduced complexity.

[Figure 7. General architecture]

Additional refinements in the proposed method can also be obtained. First, observe that c, the hybrid parameter which defines the argument of the mapping $\phi$ (item 4), is a signed integer such that $c \in [-(2^n-2), 2^n]$. Technically, to do the mapping $\phi(c)$, an (n+1)-bit high-speed RAM or ROM would be needed. This suggests that the largest admissible modulus is 11 bits (using a 4K-memory model). Furthermore, since $c_{max} = 2^n$, it would appear as though the output register for the signed adder found at step 3 would have (n+2) bits if a standard binary weighted code is to be used (e.g., 2's complement). This problem can be overcome through the following modifications.

1. Using an (n+1)-bit (at least) sign-magnitude adder, the sum $c=J_1-J_3$ can be represented as an (n+1)-bit word having the format MSB:xx...x:LSB. The sum c can be partitioned into two sets Y and Z given by:

   $x \in Y$ if MSB of x = "0"

   $x \in Z$ if MSB of x = "1"

   More specifically, Y is a set of $2^n$ elements determined by $Y = \{y \mid y = x$ for $x \in [0, 2^n-1]\}$. Also, Z is a set of $2^n-1$ elements determined by $Z = \{z \mid z=x$ if $x \in [-(2^n-2), -1]$, $z=0$ if $x=2^n\}$. It can be seen that the sets Y and Z are defined by the magnitude digits of the signed magnitude value of c, with the membership to Y and Z determined by the MSB (sign-bit location) of c. The importance of this partition is that two $2^n$ word tables can be used to map c into $\phi(c)$. The device select line would be tied to the MSB of c as suggested in Figure 8.

2. Another efficiency can be realized by using data packing. More specifically, the term $p_2J_1+x_2$, for the considered choice of moduli, can be rewritten to read $2^nJ_1+x_2$. Since $0 \le x_2 < 2^n$ and $0 \le J_1 \le 2^n$, the term $2^nJ_1+x_2$ can be directly and uniquely encoded into a (2n+1)-bit register. This is suggested in Figure 8.

3. The proposed architecture, as in the direct realization of the mixed radix conversion, requires modulo $p_i$ adders for $p_i = 2^n-1$ or $2^n+1$. Several such modulo adders have been reported in the open literature. A very efficient 40 nsec modulo $2^n+1$ and $2^n-1$ adder, for n<12, has been reported in reference 18.
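The first two modifications can be illustrated with a short sketch. It rests on one reading of the partition, stated here as an assumption: the value c = 2^n shares the sign-magnitude code 1:00...0 (sign bit set, zero magnitude), so it lands at address 0 of the second table. The names `table_address`, `pack` and `unpack` are hypothetical.

```python
def table_address(c, n):
    """Split c = J1 - J3 into (select, address) for two 2**n-word tables.

    select is the MSB (sign-bit position) of the (n+1)-bit word and
    address is its n-bit magnitude field.
    """
    if not -((1 << n) - 2) <= c <= (1 << n):
        raise ValueError("c is outside [-(2**n - 2), 2**n]")
    select = 1 if (c < 0 or c == (1 << n)) else 0
    address = abs(c) & ((1 << n) - 1)
    return select, address


def pack(j1, x2, n):
    """Encode 2**n * J1 + x2 by placing x2 in the n low-order bits."""
    return (j1 << n) | x2


def unpack(word, n):
    """Recover (J1, x2) from the packed (2n+1)-bit word."""
    return word >> n, word & ((1 << n) - 1)
```

Because x2 occupies exactly n bits, the multiplication by p2 = 2^n and the addition of x2 reduce to wiring the two fields side by side.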

OVERFLOW TOLERANT RNS MULTIPLIERS

In order to extend the dynamic range of the autoscale multiplier to a more useful size (say 12 to 16 bits), based on a 4K memory model, data compression will be required. A suitable compression algorithm, based on the quarter-square algorithm, has been reported in an earlier section. Furthermore, the theoretical foundation of a compression scheme has been motivated in the previous section. Here, data compression will be studied in the context of the popular three-moduli system $P=\{2^n+1, 2^n, 2^n-1\}$ such that $M = p_1p_2p_3 = 2^{3n}-2^n$. Any integer over [0,M) has the unique RNS representation $x \overset{RNS}{\longleftrightarrow} (x_1,x_2,x_3)$. Consider now a subcover of the range [0,M) generated by all numbers $\tilde{x}$ having an RNS representation $\tilde{x} \overset{RNS}{\longleftrightarrow} (\tilde{x}_1, 0, \tilde{x}_3)$. Obviously $\tilde{x}$ is defined over a subcover of [0,M), say $S_2$, where $S_2 = \{kp_2 \mid 0 \le k < p_1p_3\}$. The utility of this operation is that of data compression. More specifically, only 2n bits of data are needed to uniquely quantify $\tilde{x}$ (i.e., $\tilde{x} \leftrightarrow (\tilde{x}_1,\tilde{x}_3)$) versus 3n bits for x (i.e., $x \leftrightarrow (x_1,x_2,x_3)$). The digits of $\tilde{x}$ can conveniently be defined to be:

$\tilde{x}_1 = x_1 - x_2$

$\tilde{x}_2 = x_2 - x_2 = 0$    (35)

$\tilde{x}_3 = x_3 - x_2$

with each difference reduced modulo its own $p_i$.

As a point of interest, this is also an operation found in the residue to mixed radix conversion algorithm used to determine the mixed radix coefficients of the weighted representation

$x = \bar{x}_1 + \bar{x}_2 p_2 + \bar{x}_3 p_2 p_3$    (36)

[Figure 8. Refined architecture]

ERROR ANALYSIS

The mean and error variance for the extended range autoscale multiplier are a function of the chosen moduli set. Even the simplest analysis becomes burdened with nested sums and binomial coefficients. Instead, the error statistics of the multiplier were studied using numerical simulation. A general purpose FORTRAN program, written on a PDP 11/60 under RSX-11, is reported in Figure 9. In Figure 10, the product of x=16 and $y \in [0,29]$, for P={3,4,5}, is reported. The parameter Z1 is the autoscaled product over $S_2$, Z2 is the theoretical autoscaled product, with the last column exhibiting the error. The test software operates in either a deterministic or statistical mode. In either mode, the user specifies the choice of moduli (i.e., $P=\{p_1,p_2,p_3\}$) and the number of fractional bits used to define the table lookup data. That is, the output wordlength is given by $[\log_2 M]$ + number of fractional bits. In the deterministic mode, all possible values of x and y over [0,N) are tested. However, if N is large, long execution delays can result. To overcome this problem, a statistical approach may be used where the integer values of x and y are randomly chosen from a uniformly distributed process over [0,N). The test is repeated M times and the statistics analyzed. The software presents to the user both error mean and error standard deviation. For example, for P={7,8,9} and zero fractional bits of accuracy, the deterministic error mean and standard deviation were determined to be $\bar{e}=-.00011476$ and $\sigma_e=.00379230$. In the statistical mode, the results were $\bar{e}=-.00023643$ and $\sigma_e=.00396498$, which can be seen to be in close agreement.

[Figure 9. Analysis source code]

[Figure 10. Multiplier operation (columns: X, Y, Z, Z, ERROR)]

Table 8 and Figure 11 summarize the results of several experiments. They are:

1. Deterministic for P={3,4,5}

2. Statistical for P={7,8,9}

3. Statistical for P={15,16,17}

4. Statistical for P={31,32,33}

for various choices of fractional-bit accuracy (denoted NN). The error standard deviation data has been interpreted in graphical form in Figure 11 and compared to the usual theoretical model given by $\sigma_e^2 = Q^2/12$ or $\sigma_e = Q/\sqrt{12}$. Here Q is the quantization step size which, over $S_2$, is given by $Q = 1/(p_1p_3)$.

The data is shown to be in close agreement with the theoretical model. Lastly, it can be observed that the multiplier's performance is more-or-less invariant to the number of fractional bits used to generate the tables.
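The theoretical model can be compared against a simple Monte Carlo experiment. The sketch below is not the FORTRAN program of Figure 9 and does not model the autoscale multiplier itself: it assumes a plain rounding quantizer with step Q applied to uniformly distributed values, and the names `theoretical_sigma` and `simulated_sigma`, the trial count and the seed are all illustrative choices.

```python
import math
import random


def theoretical_sigma(p1, p3):
    """Standard deviation Q / sqrt(12) for step size Q = 1 / (p1 * p3)."""
    q = 1.0 / (p1 * p3)
    return q / math.sqrt(12.0)


def simulated_sigma(q, trials=20000, seed=0):
    """Estimate the error standard deviation of a rounding quantizer."""
    rng = random.Random(seed)           # fixed seed for repeatability
    errors = []
    for _ in range(trials):
        value = rng.random()
        errors.append(round(value / q) * q - value)
    mean = sum(errors) / trials
    variance = sum((e - mean) ** 2 for e in errors) / trials
    return math.sqrt(variance)
```

For P={7,8,9} the step size is Q = 1/63, and `theoretical_sigma(7, 9)` is of the same order as the measured standard deviations quoted above.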

[Table 8. Experimentation summary: error mean and standard deviation for each moduli choice and each value of NN]

[Figure 11. Error analysis: error standard deviation versus number of fractional bits, compared with theory, for M=15x16x17 and M=31x32x33]

PART II SYSTEM DESIGN IN THE RNS

Under the AFOSR grant, new residue arithmetic units were conceptualized, researched, and tested. Key breakthroughs were an efficient RNS to decimal converter and an autoscale multiplier. In this section, these building blocks will be put to use in designing high speed digital systems.

The utility of the RNS in digital filtering was forwarded by Jenkins and Leon [1] through their work in non-recursive filtering (FIR: finite impulse response). In this case the problem of register overflow in the RNS was overcome through the use of a norm argument. Given a FIR satisfying $y(n) = \sum a_i x(n-i)$, $i=1,\ldots,N$, with $|x(n)| \le 1$, it follows that $|y(n)| \le \sum |a_i| = V$ over all i. In order to ensure that dynamic range overflow will not occur, the RNS dynamic range $M=\prod p_i$ would be chosen so that M>V. However, the design of recursive filters (IIR: infinite impulse response) is significantly more complex. Soderstrand [4] approaches the problem through base-extension methods. Other authors have used the Chinese Remainder Theorem (CRT) or mixed radix conversion algorithm to control dynamic range. This has been strongly criticized because of the overhead normally associated with these operations. The RNS concepts developed in Part I overcome many of these objections.
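The FIR overflow bound can be checked with a few lines. The sketch assumes integer filter coefficients and inputs bounded by a hypothetical parameter `x_max`; the function names `worst_case_output` and `range_is_sufficient` are illustrative, and the test M > V is applied exactly as stated in the text.

```python
from math import prod


def worst_case_output(coeffs, x_max=1):
    """Bound V on |y(n)| for y(n) = sum a_i x(n-i) with |x(n)| <= x_max."""
    return x_max * sum(abs(a) for a in coeffs)


def range_is_sufficient(coeffs, moduli, x_max=1):
    """Check the condition M > V for the dynamic range M = prod(p_i)."""
    return prod(moduli) > worst_case_output(coeffs, x_max)
```

For example, the coefficients (3, -5, 7, 2) give V = 17, which fits within the dynamic range M = 60 of the moduli set {3, 4, 5}.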


The classic digital filter architecture, often referred to as the Jackson-Kaiser-McDonald (JKM) filter, realizes a filter in terms of general purpose multipliers, adders, and shift-registers. In the mid-1970's, several new memory-intensive linear shift-invariant digital filter architectures were introduced. First, the distributed filter [Peled-Liu], or PL filter, was introduced in 1974.[10] Next, the Monkewich-Steenaart, or M-S, filter was reported in 1975.[20] All three architectures are summarized in Figure 12. Compared to conventional architectures, this class of memory intensive filters offers the potential for high throughput. Execution speed is achieved by replacing the relatively slow process of general digital multiplication with table lookup scaling operations. Jenkins and Leon, in 1977, studied a memory intensive filter architecture based on residue [modular] arithmetic.[1] By exploiting the parallel nature of the residue numbering system, and using table lookup operations to mechanize modular arithmetic, ultra-high speed digital JKM filters were realized (see Figure 13). In most reported cases, a residue arithmetic filter is organized into a decimal to residue encoder stage, arithmetic-filter section, and residue to decimal converter stage.[4,6] In this work, the fundamental structure of the residue arithmetic-filter section will be developed and new results presented.

One of the principal limitations to the residue concept is its intolerance of register overflow. This is a consequence of finite ring theory. Specifically, for a set of relatively prime moduli, say $P=\{p_1,p_2,\ldots,p_L\}$, the residue representation for a signed integer $i$ is given by $i\leftrightarrow(i_1,i_2,\ldots,i_L)$, where

$$i_j=\begin{cases} i \bmod p_j & \text{if } i\ge 0\\ (M-|i|) \bmod p_j & \text{if } i<0\end{cases}$$

and $0\le i_j<p_j$, $M=\prod_k p_k$, $k=1,2,\ldots,L$. The integer $i$ will have a unique representation if and only if $-M/2\le i<M/2$.

[Figure 12: filter architectures: direct (JKM), PL (coefficient RAM/ROM, shift register, accumulator), and lookup (shift register, accumulator), each mapping x(n) to y(n).]

[Figure 13: FIR RNS filter architecture: input x(n), decimal-to-RNS converter, parallel RNS paths each with coefficient RAM/ROM, RNS ALU, and accumulator.]
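The signed residue mapping and its uniqueness range can be illustrated with a minimal Python sketch. The function names `rns_encode` and `rns_decode` are hypothetical, the moduli (3, 4, 5) are borrowed from the example used later in this report, and the decoder uses a textbook Chinese Remainder Theorem reconstruction, which is an assumption about how one might check the mapping rather than the hardware method of this report.

```python
from math import prod

def rns_encode(i, moduli):
    """Map a signed integer to residues; negatives use (M - |i|) mod p_j."""
    M = prod(moduli)
    value = i if i >= 0 else M - abs(i)
    return tuple(value % p for p in moduli)

def rns_decode(residues, moduli):
    """Textbook CRT decode, folded into the signed range [-M/2, M/2)."""
    M = prod(moduli)
    total = 0
    for r, p in zip(residues, moduli):
        m_hat = M // p
        # pow(m_hat, -1, p) is the multiplicative inverse of m_hat modulo p
        total += r * m_hat * pow(m_hat, -1, p)
    total %= M
    return total - M if total >= (M + 1) // 2 else total

moduli = (3, 4, 5)                                   # M = 60
inside = rns_decode(rns_encode(-7, moduli), moduli)  # within [-30, 30)
outside = rns_decode(rns_encode(31, moduli), moduli) # 31 overflows the range
```

A value inside the range round-trips unchanged, while a value outside it aliases onto another integer in the range, which is the overflow intolerance described above.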

In order to insure the satisfactory performance of a residue arithmetic filter, dynamic range overflow cannot be tolerated. If, for example, a shift-invariant filter of the form $y(k+1)=\sum a_i y(k-i)+\sum b_i x(k-i)$ is considered, the $\ell_\infty$ bound on $y(j)$ (ie: $\max|y(j)|$ for all $j$) must be less than $M/2$, otherwise uniqueness cannot be guaranteed. As a result, the JKM residue arithmetic filter suffered from a severe dynamic range restriction. For example, in order for $z=ax+by$ to be correctly represented in a residue system, $z$ must be bounded by $M$. For $0\le a<A$, $0\le b<B$, $0\le x<X$, $0\le y<Y$, then $AX+BY<M$. If $A$, $B$, $X$, $Y$ are on the order of r-bits of precision, then $M$ must be on the order of $2r+1$ bits in range. However, this is not the only constraint. If highspeed RAM or ROM is to be used to perform the algebra (in a lookup sense), then for n=14, a table addressing space of 29 bits would be required. Suppose, for the sake of discussion, $M\approx 33$ bits and $r\approx 16$ bits. Using a 16K highspeed memory unit (30-50 nsec) as a model,[1] the maximum value of a modulus $p_i$ is $2^7=128$. In order to achieve the 33-bit system dynamic range (ie: $M$), at least 5 (ie: $\lceil 33/7\rceil$) moduli, on the order of 7 bits each, must be used. This means that five parallel paths, complete in memory and logic, must be configured and integrated into a complete system.

Footnote 1: 16Kx1 units: INTEL 2167: access time=40 ns, enable time=40 ns, cycle time=40 ns, active power=500 mW, standby power=75 mW; INMOS 1400: access time=30 ns, enable time=35 ns, cycle time=30 ns, active power=375 mW, standby power=35 mW.


M-S RESIDUE ARCHITECTURE (MSR)

The algebraic operations found in a linear shift invariant filter are

data delays, additions, and scaling (in lieu of general multiplication).

Replacing general multipliers with residue scalars has been proposed by

several authors.[1,2,14,18,19] A residue multiplier would present a 2r-bit two-tuple $(a_i,x_i)$ to a $2^{2r}$-word memory unit and respond with the precomputed value of $(a_i x_i) \bmod p_i$. Using the same $2^{2r}$-word memory unit, a scalar would accept a 2r-bit representation of $x_i$. The table would respond with the precomputed value of $(a_i x_i) \bmod p_i$ where $a_i$ is known a priori. For example, using three 16K NMOS 30 nsec static RAMs and three moduli of the form $P=\{2^n-1, 2^n, 2^n+1\}$, a scalar having a dynamic range on the order of 42 bits can be realized. Using such scalars, the M-S filter architecture found in Figure 12 can be realized.
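The table-lookup scalar described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `build_scaler_table` and `scale` are hypothetical, a small moduli set with n=4 stands in for the 14-bit case, and a Python list stands in for each RAM/ROM.

```python
def build_scaler_table(a, p):
    """Precompute (a * x) mod p for every residue x in [0, p)."""
    return [(a * x) % p for x in range(p)]

def scale(x_residues, tables):
    """Scale each residue digit with one lookup per modulus, no multiplier."""
    return tuple(table[x] for x, table in zip(x_residues, tables))

n = 4
moduli = (2**n - 1, 2**n, 2**n + 1)   # (15, 16, 17), assumed for illustration
a = 11                                # coefficient known a priori
tables = [build_scaler_table(a, p) for p in moduli]

x = 37
x_rns = tuple(x % p for p in moduli)
y_rns = scale(x_rns, tables)          # residues of a * x, since a * x < M
```

Each modulus needs only one table per fixed coefficient, which is why the scalar uses far less address space than a general two-operand residue multiplier.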


FINITE WORDLENGTH EFFECTS

Generally, a digital filter is a finite precision approximation to some user defined discrete filter defined over a real coefficient field. The errors, due to finite wordlength effects, are well documented. It is generally assumed that the expected truncation error mean and variance, per multiplier, are given by $Q/2$ and $Q^2/12$ respectively ($Q$ represents the quantization level). However, a residue arithmetic filter must be defined over a ring of integers. Real numbers cannot be tolerated. For example, suppose $a=3.251$ and $x=10$, then $ax=32.51$. Rounding this result would yield an estimated product of 33. In a residue sense, with respect to a moduli set $P=(3,4,5)$; $M=60$, $x \leftrightarrow (x_1,x_2,x_3)=(1,2,0)$, one may make two sets of calculations, namely (i) and (ii).

i) $a=3.251$, $ax=32.51$

$\langle a x_1\rangle_3=\langle 3.251\rangle_3=0.251$; $[0.251]=0$

$\langle a x_2\rangle_4=\langle 6.502\rangle_4=2.502$; $[2.502]=3$

$\langle a x_3\rangle_5=\langle 0\rangle_5=0$; $[0]=0$

ii) $[a]=3$, $ax=30$

$\langle a x_1\rangle_3=\langle 3\rangle_3=0$

$\langle a x_2\rangle_4=\langle 6\rangle_4=2$

$\langle a x_3\rangle_5=\langle 0\rangle_5=0$

The calculations found in column i use the decimal value of $a$ in forming the product $(ax)$ modulo $p_i$. The resulting products are then rounded. The final residue digits are (0,3,0), which is equivalent to a decimal value of 15. In column ii, the integer value of $a$ is used to form the product $ax$ in the usual residue arithmetic sense. The result is seen to produce the correct truncated value of the product, namely $(0,2,0) \leftrightarrow 30$. Therefore, since all filter parameters are to be integer valued over [0,M], traditional finite wordlength error modeling and analysis techniques apply.
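The two calculations can be reproduced with a short Python sketch. The helper names `crt` and `round_half_up` are hypothetical, and the decoder is a textbook Chinese Remainder Theorem reconstruction assumed here only to turn the residue digits back into decimal values.

```python
from math import floor, prod

MODULI = (3, 4, 5)

def crt(residues, moduli=MODULI):
    """Textbook CRT reconstruction into [0, M)."""
    M = prod(moduli)
    return sum(r * (M // p) * pow(M // p, -1, p)
               for r, p in zip(residues, moduli)) % M

def round_half_up(v):
    return floor(v + 0.5)

a, x = 3.251, 10
x_rns = tuple(x % p for p in MODULI)   # (1, 2, 0)

# Column (i): real-valued coefficient, each residue product rounded afterwards.
col_i = tuple(round_half_up((a * xj) % p) % p for xj, p in zip(x_rns, MODULI))

# Column (ii): integer-valued coefficient, ordinary residue arithmetic.
a_int = round_half_up(a)
col_ii = tuple((a_int * xj) % p for xj, p in zip(x_rns, MODULI))

value_i, value_ii = crt(col_i), crt(col_ii)
```

Column (i) decodes to 15, far from the true product, while column (ii) decodes to 30, which is why the coefficients must be quantized to integers before entering the residue domain.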


If a large dynamic range is required, in limited RNS hardware, magnitude scaling is required. A similar strategy is used in designing filters using weighted fixed point arithmetic where rounding or truncation is used to control the growth of dynamic range. In a RNS system, the problem is compounded by the fact that the magnitude of a number must be known, if it is to be scaled, and magnitude determination in the RNS is difficult. That is, in order to support scaling in the residue number system, some sort of residue-to-decimal conversion is required. Most existing residue scaling routines make use of base extension or mixed radix conversion schemes. ['] It has been shown in reference [15] that in a realistic RNS system, a ten to twenty-fold increase in computational overhead can be expected if scaling is present. As a result, the overall throughput of an IIR-RNS filter would be compromised. In order to achieve high data rates, over realistic dynamic ranges, in limited hardware, a new low-overhead RNS arithmetic unit must be developed. In the next section, such a unit will be presented.

M-K RNS FILTER ARCHITECTURE

In order to insure the uniqueness of a modular product of two numbers of dynamic range $V$, the modular dynamic range must bound $V^2$, ie: $V^2<M$. This can be achieved through the use of a newly developed auto-scale policy. The auto-scale arithmetic units will be shown to support memory intensive filter architectures. Assume that there is a practical memory wordsize constraint. For high-speed (approximately 30 ns) applications, memory size is presently limited to 4 to 16K words (ie: 12 to 14 bits).

For reasons that will become self-evident later in this section, the three moduli system, given by $P=\{p_1=2^n-1,\ p_2=2^n,\ p_3=2^n+1\}$, will be considered.


Using such a moduli choice, signed integers $x\in[\lfloor -M/2\rfloor, \lfloor M/2\rfloor]$ are uniquely represented by the three-tuple $(x_1,x_2,x_3)$ where $x_i=x \bmod p_i$. In order to scale $x$, using memory table lookup operations, the magnitude of $x$ must be known. That is, the RNS three-tuple $(x_1,x_2,x_3)$ must be simultaneously presented to a memory module which is programmed to output $(cx) \bmod p_j$ where $cx\in[\lfloor -M/2\rfloor, \lfloor M/2\rfloor]$. For high-speed application, the limiting 16K ($2^{14}$) memory units require $\prod p_i\approx 2^{3n}\le 2^{14}$. That is, the design would be constrained to consider moduli on the order of 4 bits each. Also, the dynamic range is limited to 14 bits. Referring to Figure 7, it can be observed that an integer $\hat{x}\in[\lfloor -M/2\rfloor, \lfloor M/2\rfloor]$, satisfying $\hat{x}=kp_2$, $0\le k\le p_1p_3-1$, has the unique RNS description

$$\hat{x}\leftrightarrow\big((x_1-x_2) \bmod p_1,\ (x_2-x_2) \bmod p_2,\ (x_3-x_2) \bmod p_3\big) \qquad (37)$$

$$=\big((x_1-x_2) \bmod p_1,\ 0,\ (x_3-x_2) \bmod p_3\big)$$

$$=(\hat{x}_1,\ 0,\ \hat{x}_3)$$

Observe that $\hat{x}$ is defined over a subcover of $[\lfloor -M/2\rfloor, \lfloor M/2\rfloor]$, and it can be uniquely represented as the two-tuple $(\hat{x}_1,\hat{x}_3)$. The two-tuple approximation of $x$, namely $\hat{x}$, establishes a memory size constraint given by $2^{2n}\le 2^{14}$ or $n\le 7$. Now 7-bit (vs. 4-bit) moduli are admissible, with the dynamic range extended to 3n or 21 bits. The memory units can be programmed to output $[c\hat{x}] \bmod p_i$ where $c$ is a user specified constant. Overflow prevention can be achieved by introducing a scale factor $K$ so that $[c\hat{x}/K]=[c'\hat{x}]$ will not exceed the permitted RNS dynamic range. For the application under study (viz: IIR-RNS filtering), it shall be assumed that all system variables and


constants belong to the integer range $[-M/2, M/2)$ where $M=p_1p_2p_3$. Therefore, the scale factor $K$ needed to insure the absence of arithmetic overflow is $K=M/2$.

The error statistics associated with each auto-scaled multiplication can be shown to be bounded in mean by $p_2c/2M$ and in variance by $\sigma^2=p_2^2/36M$. This has been experimentally verified. For example, for $P=(15,16,17)$ and $(255,256,257)$, the error variance for the integer valued product $y=cx$, for $c\in[0,M]$ given and $x\in[0,M]$ randomly chosen, is plotted in Figures 14 and 15. The error is defined to be $e=(cx/M-[c\hat{x}/M])$ where $\hat{x}=kp_2$, $k\in[0,p_1p_3-1]$.
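The reduction from a three-tuple to a two-tuple can be checked numerically with a minimal Python sketch. It assumes n=3, so P=(7,8,9), restricts x to non-negative values for simplicity, and uses the hypothetical helper names `residues` and `x_hat_residues`.

```python
def residues(x, moduli):
    return tuple(x % p for p in moduli)

def x_hat_residues(x_rns, moduli):
    """Residues of x_hat = x - x2 (a multiple of p2), formed digit by digit."""
    p1, _p2, p3 = moduli
    x1, x2, x3 = x_rns
    return ((x1 - x2) % p1, 0, (x3 - x2) % p3)

n = 3
P = (2**n - 1, 2**n, 2**n + 1)   # (7, 8, 9)
M = P[0] * P[1] * P[2]           # 504

# Every multiple of p2 below M maps to a distinct two-tuple (x_hat_1, x_hat_3).
two_tuples = {x_hat_residues(residues(k * P[1], P), P)[::2]
              for k in range(P[0] * P[2])}

x = 100
approx = x_hat_residues(residues(x, P), P)   # residues of 96 = 100 - (100 mod 8)
```

Because the middle digit is always zero, only the two outer digits need to address the scaling table, which is what relaxes the memory constraint from $2^{3n}$ to $2^{2n}$ words.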

A M-S recursive RNS filter can be architected using the auto-scale arithmetic unit. A dedicated auto-scale unit must be configured for each unique filter coefficient. Each unit, in the three-moduli case, would require three memory devices. For example, a 9 coefficient 21-bit resolution filter, based on 4Kx1 RAM/ROM, would require

N = 9 (coefficients) x 3 (moduli) x 6 (wordlength per modulus) = 162 (16Kx1) memory units.

A detailed description of an autoscaled arithmetic unit is found in Figure 16.

The modulo $2^n-1$ and $2^n+1$ adders can be realized in the manner suggested by Taylor.[19] This architecture uses PLA's to augment a conventional n-bit integer adder. Other realizations have been reported in the open literature.[13]

Footnote i: Unlike a 2's complement system, where partial sum overflow can be tolerated if the complete sum of products is bounded, each partial sum must be bounded to [-M/2, M/2) in the RNS.

[Figure: error variance plot; axis values not recoverable.]

[Figure 16: three moduli auto-scale arithmetic unit architecture, showing the MOD P2 ADDER and the RAM/ROM memory mapping.]

For example, a modulo $2^n-1$ adder can be realized in a simple straightforward manner. Consider the term $(x_1-x_2)$ and $x_2\in Z_{2^n}$. For $(x_1-x_2)$ positive, $(x_1-x_2) \bmod (2^n-1)=(x_1-x_2)$, but for $(x_1-x_2)$ negative, $(x_1-x_2) \bmod (2^n-1)=|x_1-x_2|^c$, where $|x_1-x_2|^c$ denotes the complement of the binary representation of $|x_1-x_2|$. For example, 5 mod 7
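The complement rule for a negative difference can be verified with a small Python sketch. The function name `sub_mod_2n_minus_1` is hypothetical, and the operand ranges ($x_1$ a residue modulo $2^n-1$, $x_2$ a residue modulo $2^n$) are an assumption taken from the moduli set used in this section.

```python
def sub_mod_2n_minus_1(x1, x2, n):
    """(x1 - x2) mod (2**n - 1) using the n-bit complement for negative results.

    Assumes 0 <= x1 <= 2**n - 2 and 0 <= x2 <= 2**n - 1.
    """
    mask = (1 << n) - 1
    d = x1 - x2
    if d >= 0:
        return d                  # already a valid residue
    return ~(-d) & mask           # bitwise complement of |d| in n bits

n = 3
example = sub_mod_2n_minus_1(1, 6, n)   # |1 - 6| = 5 = 101b, complement 010b
```

The complement works because $2^n-1-|d|$ is exactly the bitwise inverse of $|d|$ in n bits, so no corrective addition is needed.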

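If a real-time, or pipelined, architecture is required, then it is desirable to design modular adders which have identical propagation delay. Using the PLA-supported architecture, modulo $2^n\pm 1$ adders can be realized in commercially available hardware, for n<12, having a 30 nsec delay.

A second arithmetic operation found in the three moduli system, namely the computation of $v=(2^{n-1}\Delta) \bmod (2^n-1)$, where $\Delta=\{(x_1-x_2) \bmod (2^n-1)\}-\{(x_2-x_3) \bmod (2^n+1)\}$, can be simply computed. It is directly verifiable that $\Delta\in[-2^n, 2^n-2]$. Consider

$$\Delta=\begin{cases} 2\Delta_1+\Delta_0 & \text{if } \Delta\ge 0\\ -(2\Delta_1+\Delta_0) & \text{if } \Delta<0\end{cases} \qquad (38)$$

where $\Delta_0\in Z_2$ and $\Delta_1\in Z_{2^{n-1}}$. For $\Delta\ge 0$, $v$ can be computed using the following scheme: the $(n-1)$-bit field $\Delta_1$ and the 1-bit field $\Delta_0$ of $\Delta=[\Delta_1\,|\,\Delta_0]$ are exchanged to give $v=[\Delta_0\,|\,\Delta_1]$.

Example: $\Delta=6$, $n=3$, $(4\cdot 6) \bmod 7=3$.

For $\Delta<0$, $v$ is given by [c denotes complement]

[Diagram: bit mapping for negative $\Delta$, with phantom sign-bit.]

Example: $\Delta=-6$, $n=3$, $(-6\cdot 4) \bmod 7=4$.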


As a result, $v$ can be computed with negligible overhead and hardware.

    Δ    (2^(n-1) Δ) mod (2^n - 1)    Δ (binary)    Δ' (binary)    Δ' (decimal)
    6    <24>_7 = 3                   110           011            3
    5    <20>_7 = 6                   101           110            6
    4    <16>_7 = 2                   100           010            2
    3    <12>_7 = 5                   011           101            5
    2    <8>_7 = 1                    010           001            1
    1    <4>_7 = 4                    001           100            4
    0    <0>_7 = 0                    000           000            0

The discussed permutation, from a modulo $2^n-1$ adder to a buffer register, can be realized by a hardware mapping. Here the LSB of the adder is connected to the MSB of the buffer. The other (n-1) bits are mapped to the buffer with a shift of one bit location.
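The wiring permutation amounts to a one-bit right rotation, which the following Python sketch checks against direct modular multiplication. The names `rotate_right` and `scale_by_2_pow_n_minus_1` are hypothetical, and the negative case is handled by complementing $|\Delta|$ before rotating, which is an assumed reading of the complement scheme restricted to $|\Delta|\le 2^n-1$.

```python
def rotate_right(value, n):
    """Move the LSB to the MSB and shift the other n-1 bits down by one."""
    return (value >> 1) | ((value & 1) << (n - 1))

def scale_by_2_pow_n_minus_1(delta, n):
    """v = (2**(n-1) * delta) mod (2**n - 1) for -(2**n - 1) <= delta <= 2**n - 2."""
    mask = (1 << n) - 1
    if delta >= 0:
        return rotate_right(delta, n)
    # A negative delta is first replaced by the n-bit complement of |delta|.
    return rotate_right(~(-delta) & mask, n)

n = 3
table = [(d, scale_by_2_pow_n_minus_1(d, n)) for d in range(6, -1, -1)]
negative_example = scale_by_2_pow_n_minus_1(-6, n)
```

The rotation is exact because $2^n\equiv 1 \pmod{2^n-1}$, so multiplying by $2^{n-1}$ only relocates bits and never requires an arithmetic unit.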

RNS FILTER DESIGN

MK Architecture

In a M-K architecture, each filter coefficient $c_i$ is realized with a dedicated RNS table lookup unit. Based on the three moduli MRC algorithm and a given $c_i$, a MRC multiplier unit similar to the one detailed in Figure _ would have to be configured. Here, for convenience, the multiplier $2^{n-1}$


is imbedded into a lookup table. Each unit would consist of nine $2^n\times n$-bit memory units. It must be stated that in order to use a $2^n\times n$ memory in a modulo $(2^n+1)$ operation, some form of data compression is required. The simplest compression routine would differentiate between the two extreme numbers in $Z_{2^n+1}$, namely 0 and $2^n$. Those two numbers have a (n+1)-bit (ie: common n-bit data bus plus 1-bit control line) representation of 00...00 and 10...00 respectively. By ANDing the n LSB's and sensing the MSB, the two conditions can be easily identified. For $x=0$, it follows that $[cx/M]=0$ and the output registers would be zeroed without any memory (table lookup) action. This means that one of the $2^n$ memory addresses, namely $x=0$, is superfluous and can be assigned to $x=2^n$.

It follows directly from the MRC representation that

$$\left[\frac{c_i x}{M}\right] \bmod p_j=\left(\left[\frac{X_1 c_i}{M}\right] \bmod p_j+\cdots+\left[\frac{X_3 c_i p_1 p_3}{M}\right] \bmod p_j\right) \bmod p_j \qquad (39)$$

That is, the outputs of modular tables (viz: $[X_1c_i/M] \bmod p_j,\ldots,[X_3c_ip_1p_3/M] \bmod p_j$) must be recombined, in an additive modulo $p_j$ sense.

From a design standpoint, it is desired to configure a system which has minimum complexity. The 3! = 6 possible permutations of a three moduli set are summarized in Table 9 in general and for the specific example x=100 for P=(7,8,9). A key feature of the general architecture is the set of propagation delay paths $d_t$ (total delay), $d_1$, and $d_2$ (see Figure 17). For sequential operation, the MRC digits will be available for use after $t_d$ seconds. The length of delay is due primarily to the nesting of four multipliers. In addition, $t_d$ is a function of the multiplication philosophy used (bit-serial, lookup, general purpose, etc.).

[Figure 17: MRC diagram and timing, showing the RNS inputs, the MRC digit paths, and the sequential and pipelined delays $d$, $d_1$, $d_2$.]

If high throughput and low complexity are desired, $t_d$ should be minimized. One could also consider a pipelined architecture of depth two having an effective throughput rate of $t_2$ seconds per MRC. The design of an efficient pipelined MRC processor is premised on the condition that $t_2$ is small and $t_1$ and $t_2$ are comparable. Referring to the data summarized in Table 9, it would appear as though the first ordering admits the best design. Therefore, this ordering will be used as the model throughout this section. Based on this model

$$X_1=x_2$$

$$X_2=(x_2-x_3) \bmod (2^n+1) \qquad (40)$$

$$X_3=\big(2^{n-1}\{(x_1-x_2) \bmod (2^n-1)-(x_2-x_3) \bmod (2^n+1)\}\big) \bmod (2^n-1)$$

The $2^n+1$ adders found in this architecture have been previously discussed.
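The three MRC digits can be checked numerically with a short Python sketch. The function names `mrc_digits` and `mrc_value` are hypothetical, and the digit weights $1$, $p_2$, $p_2p_3$ used for reconstruction are an assumption inferred from the digit formulas (moduli taken in the order $2^n$, $2^n+1$, $2^n-1$), not something stated explicitly in the text.

```python
def mrc_digits(x1, x2, x3, n):
    """Mixed radix digits from residues modulo (2**n - 1, 2**n, 2**n + 1)."""
    p1, p3 = 2**n - 1, 2**n + 1
    X1 = x2
    X2 = (x2 - x3) % p3
    X3 = (2**(n - 1) * ((x1 - x2) % p1 - X2)) % p1
    return X1, X2, X3

def mrc_value(digits, n):
    """Assumed weighting: x = X1 + X2 * p2 + X3 * p2 * p3."""
    p2, p3 = 2**n, 2**n + 1
    X1, X2, X3 = digits
    return X1 + X2 * p2 + X3 * p2 * p3

n = 3                                         # P = (7, 8, 9), M = 504
digits = mrc_digits(100 % 7, 100 % 8, 100 % 9, n)
value = mrc_value(digits, n)
```

For x=100 and P=(7,8,9) the digits come out as (4, 3, 1), and 4 + 3(8) + 1(72) recovers 100. The only multiplication is by $2^{n-1}$, which the preceding discussion reduces to a bit rotation.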

In a M-K architecture, each filter coefficient $c_i$ is realized with a dedicated RNS table lookup unit. Based on the three moduli MRC algorithm and a given $c_i$, a MRC multiplier-scalar can be configured as suggested in Figure 18. Here, for convenience, the multiplier $2^{n-1}$ is imbedded into a lookup table. Each unit would consist of nine $2^n\times n$-bit memory units. It must be stated that in order to use a $2^n\times n$ memory in a modulo $(2^n+1)$ operation, some form of data compression is required. The simplest compression routine would differentiate between the two extreme numbers in $Z_{2^n+1}$, namely 0 and $2^n$. Those two numbers have a (n+1)-bit (ie: common n-bit data bus plus 1-bit control line) representation of 00...00 and 10...00 respectively. By ANDing the n LSB's and sensing the MSB, the two conditions can be easily identified.

[Figure 18: filter architecture; the inputs $(x_1-x_2) \bmod (2^n-1)$, $x_2$, and $(x_2-x_3) \bmod (2^n+1)$ feed the cx modules.]

For $x=0$, it follows that $[cx/M]=0$ and the output registers would be zeroed without any memory (table lookup) action. This means that one of the $2^n$ memory addresses, namely $x=0$, is superfluous and can be assigned to $x=2^n$.

It follows directly from the MRC representation that

$$\left[\frac{c_i x}{M}\right] \bmod p_j=\left(\left[\frac{X_1 c_i}{M}\right] \bmod p_j+\cdots+\left[\frac{X_3 c_i p_1 p_3}{M}\right] \bmod p_j\right) \bmod p_j$$

That is, the outputs of modular tables (viz: $[X_1c_i/M] \bmod p_j,\ldots,[X_3c_ip_1p_3/M] \bmod p_j$) must be recombined, in an additive modulo $p_j$ sense. Again, a sequential or pipelined architecture can be realized.

Example

Using an MS architecture, a 4th order Chebyshev filter was realized.

The response is reported in Figure 19. It has been experimentally determined

that it requires a 16-bit moduli three-tuple to achieve satisfactory

performance.

[Figure 19: magnitude response of the 4th order Chebyshev filter; plot data not recoverable.]

JKM-RNS Architecture

The disadvantage of the MRC-RNS-MK filter is the rapid growth of hardware as a function of filter order. In order to reduce hardware complexity associated with RNS-FIR filtering, reusable (undedicated) overflow-free RNS arithmetic units may be considered. Again, if overflow scaling is imbedded in the table lookup operations, the magnitude of RNS coded numbers must be known. As previously noted, this has been the historical obstacle to IIR filtering in the RNS. The architecture which can achieve this goal is detailed in Figure 18. The multiplier $2^{n-1}$, as previously noted, is a zero-overhead operation. A timing diagram is offered in the referenced figure. It is assumed that the modular addition delays are less than the lookup table access times (say $t_M$). The difficulty with this proposed architecture is the delay associated with reprogramming the tables, for each $c_i$, from high-density low-speed (comparatively) main memory or mass storage. As a result, the overall throughput of this architecture will be unattractive.

Distributed RNS Filter

A powerful linear constant coefficient filter policy is distributed arithmetic (alias: bit-serial, bit-slice, or Peled and Liu filter). In B-bit radix-2 binary weighted form, the output of a discrete filter, satisfying

$$y(n)=\sum_{i=1}^{n} a_i\,y(n-i)+\sum_{i=0}^{n} b_i\,x(n-i)$$

is given by

$$y(n)=\sum_{j=1}^{B-1}\Big(\sum_{i=1}^{n} a_i\,y(n-i;j)+\sum_{i=0}^{n} b_i\,x(n-i;j)\Big)2^{j}-\Big(\sum_{i=1}^{n} a_i\,y(n-i;B)+\sum_{i=0}^{n} b_i\,x(n-i;B)\Big)2^{B}$$

with $j$ denoting bit location. An equivalent statement for RNS systems can be made in the RNS using the MRC. Here, a system variable would be given in MRC form as

$$Z=Z_1+Z_2p_1+Z_3p_1p_2$$

There is a minor structural difference between a distributed filter using a MRC and radix-2 format. For a three-moduli system, distributed partial products must be recombined modulo $p_1$, $p_2$, and $p_3$. Table requirements, for this architecture, are correlated directly to n, the order of the filter.
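The bit-serial table lookup idea behind distributed arithmetic can be sketched in Python for a plain sum of products. This is only an illustration of the radix-2 case under simplifying assumptions: samples are unsigned, so no sign-bit term appears, the names `build_da_table` and `da_inner_product` are hypothetical, and the MRC variant with recombination modulo each $p_j$ is not modeled.

```python
def build_da_table(coeffs):
    """Entry at address a holds the sum of coeffs[i] over the set bits i of a."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]

def da_inner_product(coeffs, samples, bits):
    """Sum of coeffs[i] * samples[i] for unsigned samples of width `bits`."""
    table = build_da_table(coeffs)
    acc = 0
    for j in range(bits):                 # one table access per bit location j
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> j) & 1) << i   # gather bit j of every sample
        acc += table[addr] << j           # weight the partial sum by 2**j
    return acc

coeffs = [3, -1, 4, 2]
samples = [5, 9, 2, 14]
result = da_inner_product(coeffs, samples, bits=4)
```

The table has $2^n$ entries for n taps, which shows why table requirements grow directly with the order of the filter.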

Bit Slice

Example: A 2nd order discrete Chebyshev filter was designed in the usual way. The infinite precision response used double precision floating point arithmetic. It can be noted, from the data displayed in Figure 20, that a three 8-bit moduli filter performs better than a 12-bit fixed point filter and closely approximates 16-bit precision. Using a 4th order model, 8-bit moduli can again be seen to offer acceptable performance (see Figure 21).

[Figure 20 (panels a to f): magnitude response plots for the 2nd order filter example; plot data not recoverable.]

[Figure 21 (panels a to d): magnitude response plots for the 4th order filter example; plot data not recoverable.]

PART III APPLICATIONS

The RNS arithmetic, developed under this AFOSR grant, has been tested in a MS, JKM, and distributed architecture. Several applications were considered. These applications are, unto themselves, worthy of special treatment. Therefore, they have been included in this report as appendices. Appendix A treats the problem of realizing a real-time Kalman filter. The developed filter (submitted for publication) represents an original approach to this class of problem. In Appendix B, a linear adaptive noise canceller is presented. Appendix C contains other papers published, or under review, which contain an AFOSR credit line.

PART IV SUMMARY

Under an AFOSR grant, RNS arithmetic, hardware, and architectures

have made major strides. Using a three moduli system, practical ultra-

high speed RNS units have been developed. The major problem of RNS-to-

decimal conversion plus magnitude scaling has been successfully treated.

In addition, new filter architectures were derived and analyzed.

The future of the RNS is considerably brighter as a result of this

study. In particular, the RNS techniques developed during this grant period will be further enhanced with the advent of VLSI.


References

1. W.K. Jenkins and B.J. Leon, "The Use of Residue Number System in the Design of Finite Impulse Response Filters," IEEE C&S, CAS-24, April 1977.

2. G.A. Jullien, "Residue Number Scaling and Other Operations Using ROM Arrays," IEEE Trans. Computers, Vol. C-27, No. 4, pp. 325-337, April 1978.

3. C.H. Huang and F.J. Taylor, "Memory Compression Scheme For Modular Arithmetic," to be published, IEEE ASSP.

4. M.A. Soderstrand, "A High Speed Low-Cost Recursive Filter Using Residue Arithmetic," Proc. IEEE, Vol. 65, No. 7, July 1977.

5. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its Applications To Computer Technology, McGraw-Hill, 1967.

6. W.K. Jenkins, "A Highly Efficient Residue-Combinatorial Architecture for Digital Filters," Proc. of the IEEE (Letter), Vol. 66, No. 6, June 1978, pp. 700-702.

7. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its Applications to Computer Technology, New York, McGraw-Hill, 1967.

8. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion For Residue-Encoded Digital Filters," IEEE Trans. on Circuits and Systems, Vol. CAS-25, No. 7, July 1978.

9. A. Baraniecka and G.A. Jullien, "On Decoding Techniques for Residue Number System Realizations of Digital Signal Processing Hardware," IEEE Trans. on Circuits and Systems, Vol. CAS-25, No. 11, November 1978.

10. J.M. Pollard, "Implementation of Number-Theoretic Transforms," Electronics Letters, Vol. 12, No. 22, July 1976, pp. 378-379.

11. H. Nussbaumer, "Digital Filters Using Read-Only Memories," Electronics Letters, Vol. 11, 1976, pp. 294-295.

12. M.A. Soderstrand and E.L. Fields, "Multiplier for Residue-Number-Arithmetic Digital Filters," Electronics Letters, Vol. 13, No. 6, March 17, 1977.

13. G. Bioul, M. Davio, and J.J. Quisquater, "A Computation Scheme for an Adder Modulo (2^n-1)," Digital Processes, Georgi Publisher, Switzerland, No. 1, (1975), pp. 309-318.

14. W.K. Jenkins, "A New Algorithm for Scaling in Residue Number Systems with Applications to Recursive Digital Filtering," Proc. 1977 IEEE Int. Symp. on Circuits and Systems, Phoenix, AZ, pp. 56-59.

15. H. Huang and F.J. Taylor, "High Speed DFT's Using Residue Numbers," IEEE Int. Conf. on Acoustics, Speech & Signal Processing, Denver, Colorado, April 1980, pp. 238-242.

16. D.P. Maher, "Mathematical Background For Generalized, Partial, and Incomplete Discrete Fourier Transforms," Proceedings of the 1980 ASSP Conference, Denver, Colorado, April 1980.

17. J.A. McClellan and C.M. Rader, Number Theory in Digital Signal Processing, Prentice Hall, 1979, Ch. 11.

18. F.J. Taylor, "Large VLSI Module Multipliers," IEEE Int. Symp. on Circuits and Systems, Houston, TX, April 1980.

19. F.J. Taylor, "Large Moduli Multipliers," Proceedings of the ICASSP 80, Denver, Colorado, April 1980.

20. A. Antoniou, Digital Filters: Analysis and Design, McGraw-Hill, 1979.

APPENDIX A

THE REALIZATION OF ADAPTIVE KALMAN FILTERS

George Papadourakis

Fred J. Taylor

Department of Electrical and Computer Engineering

University of Cincinnati

Cincinnati., Ohio 45221

Portions of this work was supported under AFOSR Grantr/1 7047-0066

Keywords: Adaptive filters, real-time computation, white stochastic process, Pfeifer-Blankenship algorithm.

ABSTRACT

An adaptive Kalman filter is considered for which the input noise covariance and the output covariance are unknown. A new procedure is presented for the identification of the unknown covariances. The proposed identification algorithm uses the autocorrelation information contained in the innovation error sequence to determine the ratio of the unknown noise covariances and to obtain the optimal Kalman filter gain. The proposed method can be easily implemented through the use of high speed [...] autocorrelation [...] data rates.

The usual formulation of the minimal variance filtering algorithm assumes complete a priori knowledge of the input and output noise covariances, say Q and R. In most practical applications Q and R are either assumed to be unknown or approximated. Several authors have presented schemes for the identification of the unknown covariances with some success (2), 00i, (7) and (11), with the use of the innovation sequence in the identification of the unknown covariances introduced by :CU,

4-<

Von m I fI CIt t10 a : i F.16 rc i a A' 1 , -09 wpn


I bIo t:ho fi rs t -11i v.- 1110c of Opr nu f1(rr" m L, tion 'unr.-

*tioni of thr ininovt ion s-iionc to dictr~rmine! tiir! rn t i o or

O ir', 'n rv.n rocv.; , *~iron ' - t !,oe p re' ;nr r ! is' i pnl i cii! I

ro,* O ns rysL,) , ilchi rfx'*! lt -. nhi t- i l .' r in t C cnns trnt

A high-speed real-time algorithm for the computation of the autocorrelation function has been presented by Pfeifer and Blankenship (12). The speed of this algorithm can be further improved through the use of in-line, threaded, and knotted code. The computation of the autocorrelation function, in real time, is

$$r(k) = \sum_{n} f(n)\, f(n+k)$$
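As a point of comparison for the fast algorithms discussed in this appendix, the autocorrelation sum can be evaluated directly. The sketch below is illustrative only: it assumes a finite record of samples and simply omits products that would index past the end of the record. The function name `autocorrelation_direct` is hypothetical and does not come from the report.

```python
def autocorrelation_direct(f, max_lag):
    """Direct evaluation of r(k) = sum_n f(n) * f(n + k) for k = 0..max_lag.

    Assumes a finite record; products that would index past the end of
    the record are omitted.
    """
    n_samples = len(f)
    r = []
    for k in range(max_lag + 1):
        # One multiply per term, so roughly N multiplies for each lag.
        r.append(sum(f[n] * f[n + k] for n in range(n_samples - k)))
    return r


samples = [1.0, 2.0, 3.0]
lags = autocorrelation_direct(samples, 2)  # [14.0, 8.0, 3.0]
```

The multiply count of this direct form grows with the product of the record length and the number of lags, which is the cost the faster mechanizations aim to reduce.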

if r(k~ is dm, I rti -f or k: on th c o rclnr o f /2~ thn

tir.I n- IF-Tr to coripti't IflFT'-(~)()m~: is co.:iptntion-

, IJIy or :I-I nl 1 A tcot~lo 1l1t2"+! roriplex ::i 1 Pies nrce

reau ,' Cin co~il.- in tIJmmu y t I 1m1~ r.It rr!,il mII tip IP1 10

P rteect inompu Lt1ttIton or thel Amtoy' rocuir Vein r am I muti 0-


-

The, S version of'211 the atocorreltion al::.orithm sts

f i s:

P-1 K

i'~~1 ~)j=O i=i

For N >> P the algorithm essentially halves the number of multiplications normally associated with direct computation. The multiplication count for N >> P can also be considerably less than that associated with FFT mechanizations.

The speed of the algorithm can be further improved by eliminating the time consumed in computing data-invariant indices (e.g., 2k, 2k+1, 2k+2). This can be achieved through the use of so-called in-line code. In an in-line code, the code is arranged in a top-down fashion. Here entry is made at the top and, without looping, the code runs to completion. The desired in-line code, having all data-invariant addresses properly computed and properly sequenced, can be obtained through the use of the methodology of (13).

Another alternative is possible through the use of threaded code. It replaces a standard program with a series

of m~odt11leS ... ich a re thre,-:d to c thor e trea .s a rcou~te? aa -Aryi hc allI reruI's t nifbpit i on i:s.

~oun~- Thcz array 4;s selriotly L:ne ILrntiead ~


'ore r Pmo v r, th r o v r r h a f;o c a tri i n- ~1 r n ri n a C owl-

nutt ion of the paz r t P r--. Comipa red to t' 1, n -lIi n e c o: n

t h threaded rode is twice ns fast and occunites less riorx'

Another option is available and is known as knotted code. Knots are tied in the thread, indicating that the process will move down the thread, "run around" inside a knot for a while, then continue down the thread to the next knot. The knot represents a subprogram in which there may exist elementary looping. Using the knotted code, the memory requirements are reduced considerably (13). It has been shown in reference (13) that a 128-point time series can be autocorrelated (up to a delay of 11 samples) using a PDP-11 minicomputer in the following time intervals:

PPP 11/55 pnp 11/7.* pnp 11!r(0r* Proran Size

Conventional .9 G9 9.72 17. 35In- inn F, 2 75 4~!21.1 75

* ache, ::ahines

Recognizing that the computation of the autocorrelation function can be performed at real-time speeds, we can proceed to the analysis of the problem.


Adi scret- tc~i ~nii ytw a orpcs 1 ~

+

F :nmnr s tz t n tr.:nc i tio~n riatri' (cons tant)

!z ~'IF~ veso s~ us n in, %:11 co n oiz-rt- P ut:fLt v c or

='r:! v,-ctor of ^auss i an esre-.wnt errors ( -!i t)

--I iS ~sucd t

I,~T

1=-L tin, or-PTA t 1-i con trol 1 !hIl c, -,nei-ovil

-1} coraicnti 6- 6 VTntn' t-t O .. -and

- Vp,1iic P5

1St _ _ .P (-) f'kit U


cal in-lor:-rtion of -11% inPut noise covwiriznce CIl an- ou tpu tnoise COY~rance in cf stimate of thIe Stajte of tle stn

Sfn~ 0y(.) i~o ta i n c ( s criurmt Ia for k=1,?2, ~

S t,- 11, st n rrd KnlIuw-an filt r:

2. Error Covariance Extrapolation:

+*'Ct -'ti jto

*U~p

4. Error Covariance Update:

PEK(+)=tI-::Kx"i) MEK 3.7

___ - PTK(!)WKK(-2K'-' T . K)

_!c nt i t ion

IT


y 9 . ~~ L c roc--r. I C*

r :. rnoz:to oro r~ r~ -.-o Za r. c c.~ ns *-j: !-.c

-K .: K*-::K-

orror_ T,:lo

in..; I C.'- .n:n io,.:*i ' I r--ct ru'st -t io P


t -t S -Iia sel tt- -,)i rran be (eIndi crv

or t a rtio o f t ,i t ru sta t is t ics nfhQ)=~ ) I n thi s

caSe PE.(-) is directly proportional to PT(-) by a factorA

i~~ r:r

!'Y oxn-: the In f in it ion of 'T( - i n Cy 1)u

P'T(-) PPT+ ,.c

PTr(+) = ifiP() t.)

1e ro lownr obsorvat ions can be madle re-,ardin~ equa-

1. PT(- ;Ic (nns (.,)pl1ic it Iy jupon the output apriort

st ItistiCs Of the Sy'Ste-4 Model (viz.: I rv-'.n ,,I n)

arid the true noisn covaripncps and 1

2. PE(-) depends upon a priori statistics only.

Consilder the -case in toh~ andk- ~ I.I'~ -nIt can hf

sec ~ ~ ~ 6-, eqerl Iro _ut i 6ns. 13.~,{i) ni 1;7 hiut

cn, I~ 'J r ~ Is Ilta -~ T >T he 7 c~-t-

_,~ -lip - I____ Tsas


~weser ies wri tn h 1. = r2 : jI rh tnn. SL~Ystao co~

var ianr as r c plotted for e~ach i tera-t ion vi th the, r)t io of

-1/21 distribuited loaihicyfromn .01 to1.

1. The crossover point of the curves is the optimal case, in which we obtain the optimal Kalman gain.

2. The theoretical covariance is increasing after the crossover point. This can be explained by the Kalman filter gain, which is decreasing in each iteration.

Thereforn, the innovation error is ariid

~.It s-hould !)- nenti-onei that in a ranl sj,'stc!-rTC-

.- n not h)ec ex p ci ft 1 dt r:r mn ne.

A1t -each i tarn-t-fon in-- the- above-r0 mi.ulatio6n th av6 uto-

--i eht~ fuCto of th nnovp tiron prbtcso- :as -al-culcat-

ec!'i cir-ty Osrv tat tfh - s!pco- -- r- of thn !ttocorre-,

tiltiori- -Fn -. --i -fr -) -tr -L~ -:~t -'-iiC -fl~r~ ~r

ieg I a~ tj As pr-iosl f i v a- v, risrio

) tl4 -o fl- (fli

hen-~Fi~ t? i - - - __6_ r


ISI

Tho~~ I'oit:i ;o pn "hoi~ othr c; r c~ t

:i On OF hinnov'r~t'O:1 PIO^CS.

In tha first ftI~t~n ih fr s airi~ 'I CThe tZ: E1

hl o~1c . 95c)i: 1 n t t - sa 2U) -n s , c ri-tt am' vort

''i I s of ll zirci chose;,n so "t~.(2 6~) 2.ftr c ach i tar--

ero h i ~ ly- "-'-. of ' ! c iit o

fo 0!,Is ca,-lLcu1tC4 as CS c' 0 7 of- s1 r~i It :hn

-a ~ prcs s ove r, a ctl ve citr ~ hs oos~l i

o r r a, e ,pra,.;:--. -tion is Uec-~n' _S S C-1 Vc-!

tiory alj-ori thm, tho val't.fc -o~ tt -:ici ' () c -r c Re

;J~ii~ I zV U1cn fla a 'do-cb Cu ri t -iro :;.l n Cno-~~

,~~~ricQ~ :::-1_1 i!t ~~~i-, t s. Jons- thei-tc a~n

Ti*: c- ri :;:r*-~


r% (.' p tn pol C! fn An dptiv I y -'1;ri I tw can bc el-r % ye:!

.1hich, ~ce~~~upon t~- cUrren.- loccotion oilI min OYr t n!-

-c!!rvn, c~m ccnvcr:e to the opti;.m.1 Cnni t i I: o rr I

a.wwo al Ori Call~ c ,r. cii 1- I; vnt~.in ni -

c rop r0c 0s so r eILue to 5 -cs ii'ip] it'. 2en 111 0. 1 :ntI

ca i be 0'c r d uSing th'2 a? i1o ri thr.C to cZ1 -n~c h~f s

,.-.,o rI'las 0f 1 -c nutoorlto uci

Y!I-t o r cI- c11 tc n p~s fi nzr -I:s-; 1 dOn

a n4 : re:

Ha -''l *1 V.4

F : .- T

Mu~, -

-fa 15 11.% Iv .i t 1

r y

AWN-


(c) i1ut~t terew t;The dots reprz25ent th 1 CttZ31,

%nue- 1f - 1.) -t cinch i teri-t ion.. The cciordi n tnn Of then--

f arc r. o v)rc!s e ntc d Th e c ur ve f i t t in, o f tes -)o in ts i s

.l r s n L2Jhn c ntin ti nIn m I1 5

:~e cna S- ' o 5 I conio 1- ir! oi, n-e r ~nn a o I 6 11

cor ~ ;li ii ~or _ re aJev i~n~o - *;~~ nd

tlh,.2 I r es tii.;mt-on is siho::n Io o-- Lthen,. 2e I, 1 CS ~r : Ood

Kit iniix- errors.

7 . sr:;:;_ VM. CO-IWZ I 0!!

a v ta r:n :i0 ol_;^ COVZXr ;!iOCC5 II Cl IN 1 ~ Ilter- i..i th thc*

ti~ ~ ~ :~-~1~n ~' t e tr sh o loih f:e Ino h

- rs n. d, ec n oc I

Cue ~I o vctio prcss 5: .ec~ii~ r rsi, I C rat io of

Li~unhn ao~r ~n c s th ir r-c tuel I.'zva s c an b e cn ICu It -

~ ' t~lof th-Tai'-ori Min is, possi!;.- Cr r.dteal

AA- ia tl ou hpt c-m- e a.chiVcd v.,Ith thed use of fst C.toL

cdr:r Z 1 oin I; 1-or! th- s. V: nuliae'r iCal- 0~~ ampj 41 usCratcs thez 1 r

r.-! I*- no *n. le ai ~ ier.I svr' h1

I! t - -inp_, 6

(T~~*~n~ r~ I ~ ~Z! Iir!~' C<i l-ino't

"T -T! *-Tt s -4 4- -

ij#:


1 2~~XC I,,1;v c.t

2. ~ ~ C ~'I-pi to inpft noisc rin.-i o i! %,;- vr~'~ i

r if-, nnt t.cr cra rit ::~s of nlolc of a

notsn covr-.rtminc2 is v:ery cti-~


r; rrc i on z a 'i fterinr thzor,, in Proc. .:

0" : :u ctin -.: or a

171

ft c, u pt.2 s ~ ,~ Ii T P'o rn ,t I '2i 7

r -~r~-.,j --in AC -: r, A -'ri cai, pp

f L L- ws

-, s as" s -;0 , : r- s L

a n --- r- anS ,i r aitjfl

:; . . Cr^o nd vol~. 1':r," :n- icfo

z ~ ofoi- itrst-i-t~~~~l~o vt:i

~' ;n:n'oi~rdr ~ie, EAl~

__f am

____ * ;ij onr..x;l - -


in-tI

o r 2i r f-b 'fr:* nr- i n-: no s~

S ~tcs -fE E:7r~s ~'~2~ Con r, vol.

- A ~-2?, '~c~;~r V7!;,; 2 ;

~~ ~ ;~t:Lo: Ia ronr. 'o.'C-2 1,

mu .- time'r Ybr% In i ro-t~~~~ ; V. I-


77.7

LC Iii:' 7r -ms. 1

.:~co1ri, 1,f 0 C~r f;- I --

-:S

-hm. * - f:~ v~ rr~ !~~~~~

mw


APPENDIX B

ADAPTIVE NOISE CANCELLING

Adaptive noise cancelling is a method of estimating signals corrupted by additive noise or interference. This method uses a 'primary' input containing the corrupted signal and a 'reference' input containing noise which is similar to the primary noise. The reference input is adaptively filtered and subtracted from the primary input to obtain the signal estimate.

Adaptive filtering before subtraction allows the treatment of inputs that are deterministic or stochastic, stationary or time variable. When the reference input is free of signal and when certain other conditions are met, the noise in the primary input can be eliminated without distortion. It is further shown that in treating periodic interference, the adaptive noise canceller acts as a notch filter with narrow bandwidth and the capability of tracking the exact frequency of the interference.

Noise cancelling is a variation of optimal filtering that is highly advantageous in many applications. It makes use of a reference input derived from one or more sensors located at points in the noise field where the signal is very weak or undetectable. This input is filtered and subtracted from the primary input containing both the signal and the noise. As a result, the primary noise is attenuated or eliminated by cancellation.

If done improperly, subtracting noise from a received signal could result in an increase in the output noise power. However, when the filtering and subtraction are controlled by an appropriate adaptive process, noise reduction, if not complete noise elimination, can be accomplished. Adaptive filtering may not be applicable in all filtering situations; it is not possible when, for example, the reference noise input is unavailable. In circumstances where adaptive noise cancelling is applicable, levels of noise reduction are often attainable that would be difficult to achieve by direct filtering.


In noise cancelling systems, the system output $z = s + n_0 - y$ should be a best fit, in the least squares sense, to the signal $s$. This is accomplished by feeding the system output back to the adaptive filter and adjusting the filter through an LMS adaptive algorithm to minimise the total system output power. Thus the system output serves as the error signal for the adaptive process.
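The reason that minimising the output power gives a least-squares fit to the signal can be seen from a short calculation. The sketch below follows the standard argument and assumes, as a condition stated only for this sketch, that the signal $s$ is zero-mean and uncorrelated with both the primary noise $n_0$ and the filter output $y$:

```latex
\begin{aligned}
z &= s + n_0 - y \\
E[z^2] &= E[s^2] + E[(n_0 - y)^2] + 2\,E[s\,(n_0 - y)] \\
       &= E[s^2] + E[(n_0 - y)^2] \\
\min E[z^2] &= E[s^2] + \min E[(n_0 - y)^2]
\end{aligned}
```

Under these assumptions the signal power $E[s^2]$ is not affected by the filter, so minimising the output power minimises $E[(n_0 - y)^2]$. Because $z - s = n_0 - y$, the same adjustment also minimises $E[(z - s)^2]$.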

THE LMS ADAPTIVE FILTER:

The LMS adaptive filter is the basic element of adaptive noise cancelling systems. The principal component of most adaptive systems is the adaptive linear combiner shown in fig 1.1. The combiner weights and sums a set of input signals to form the output signal. The input signal vector $X_j$ is defined as:

$$X_j = [x_{0j},\ x_{1j},\ \ldots,\ x_{nj}]^T$$

The input signal components are assumed to appear simultaneously on all input lines at discrete times indexed by the subscript $j$. The component $x_{0j}$ is a constant normally set to 1 unless biasing is desired. The weighting coefficients $w_0, w_1, w_2, \ldots, w_n$ are adjustable as shown in fig 1.1. The weight vector is:

$$W = [w_0,\ w_1,\ \ldots,\ w_n]^T$$

where $w_0$ is the bias weight. The output $y(j)$ is the inner product of $W$ and $X_j$. That is:

$$y(j) = X_j^T W = W^T X_j$$

The error $e(j)$ is defined as the difference between the desired response $d(j)$ and the actual response $y(j)$. In noise cancelling systems, $d(j)$ is the primary input itself.

$$e(j) = d(j) - X_j^T W = d(j) - W^T X_j$$
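The output and error of the adaptive linear combiner can be illustrated with a few lines of code. This is a minimal sketch, not the report's program: the function names and the numerical values are hypothetical and chosen only to show the inner product and the error at a single time index.

```python
def combiner_output(weights, inputs):
    """Inner product y = W^T X of the weight vector and the input vector."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs))


def combiner_error(desired, weights, inputs):
    """Error e = d - W^T X between the desired and the actual response."""
    return desired - combiner_output(weights, inputs)


# Hypothetical values: the first input is the constant 1 and the first
# weight is the bias weight.
X = [1.0, 0.5, -2.0]
W = [0.1, 2.0, 0.25]
y = combiner_output(W, X)      # 0.1*1 + 2.0*0.5 + 0.25*(-2) = 0.6
e = combiner_error(1.0, W, X)  # 1.0 - 0.6 = 0.4
```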


The adaptive algorithm has to adjust the weights of the adaptive linear combiner to minimise the mean square error. The adaptive linear combiner along with the tapped delay line forms the adaptive filter shown in fig 1.2. As before, the input signal vector is:

$$X_j = [x_j,\ x_{j-1},\ \ldots,\ x_{j-n+1}]^T$$

The components of this vector are delayed versions of the input signal $x_j$. This filter permits the adjustment of gain and phase at many frequencies simultaneously. The total length of the delay line is determined by the reciprocal of the desired filter frequency resolution.
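A tapped-delay-line adaptive filter of this kind can be sketched as follows. The weight update used here, $W \leftarrow W + 2\mu\, e\, X$, is the standard Widrow-Hoff LMS rule and is assumed to generalise the two-weight update given later in this appendix. The function name, the step size `mu`, and the test signal (an unknown two-tap response to be identified) are all hypothetical.

```python
import random


def lms_tapped_delay_line(x, d, n_taps, mu):
    """Adapt an n_taps FIR filter so that its output tracks d (LMS rule)."""
    w = [0.0] * n_taps
    taps = [0.0] * n_taps          # contents of the delay line
    errors = []
    for xj, dj in zip(x, d):
        taps = [xj] + taps[:-1]    # shift the new sample into the line
        y = sum(wi * ti for wi, ti in zip(w, taps))
        e = dj - y
        # Widrow-Hoff update: each weight moves along e * (its own input).
        w = [wi + 2.0 * mu * e * ti for wi, ti in zip(w, taps)]
        errors.append(e)
    return w, errors


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Desired response: the input passed through a hypothetical 2-tap system.
d = [0.5 * x[j] - 0.3 * (x[j - 1] if j > 0 else 0.0) for j in range(len(x))]
w, errors = lms_tapped_delay_line(x, d, n_taps=2, mu=0.05)
```

With a noise-free desired response, the weights settle close to the coefficients of the hypothetical system (0.5 and -0.3) and the error decays towards zero.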

ADAPTIVE NOISE CANCELLER AS A NOTCH FILTER:

The notch filter is required in situations where the primary input is corrupted by an additive, undesired sinusoidal interference. A notch filter can easily be realised by an adaptive noise canceller. The advantage is that it offers easy control of bandwidth and the capability of tracking the exact frequency of the interference.

Fig 2 shows the single-frequency noise canceller with two adaptive weights. The primary input is assumed to be any kind of signal: periodic, transient, stochastic, or any combination of these. The reference input is assumed to be a pure sine wave $C\cos(\omega_0 t + \phi)$. The primary and reference inputs are sampled at the frequency $\Omega = 2\pi/T$ rad/sec. The reference input is also phase shifted by 90 deg and sampled again.

Fig 3 is the flow diagram. It shows the operation of the LMS algorithm. The weights are updated as shown in the diagram by

$$w_1(j+1) = w_1(j) + 2\mu\, e(j)\, A(j)$$

$$w_2(j+1) = w_2(j) + 2\mu\, e(j)\, B(j)$$


where $e(j)$ is the error.

The reference inputs are:

$$A(j) = C\cos(\omega_0 jT + \phi)$$

$$B(j) = C\sin(\omega_0 jT + \phi)$$
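The two-weight canceller can be simulated directly from the weight updates and reference inputs above. The sketch below is not the report's FORTRAN program: the sampling rate, step size, amplitudes, and the helper `tone_amplitude` are hypothetical choices made so that the behaviour is easy to check.

```python
import math


def notch_canceller(primary, f0, fs, mu, C=1.0, phase=0.0):
    """Two-weight LMS noise canceller with a sinusoidal reference at f0 Hz."""
    w1 = 0.0
    w2 = 0.0
    output = []
    for j, d in enumerate(primary):
        arg = 2.0 * math.pi * f0 * j / fs + phase
        A = C * math.cos(arg)          # reference input
        B = C * math.sin(arg)          # 90 degree shifted reference
        y = w1 * A + w2 * B            # filter output
        e = d - y                      # canceller output, also the error
        w1 += 2.0 * mu * e * A
        w2 += 2.0 * mu * e * B
        output.append(e)
    return output


def tone_amplitude(x, f, fs):
    """Amplitude of the f Hz component (window must hold whole cycles)."""
    re = sum(v * math.cos(2.0 * math.pi * f * n / fs) for n, v in enumerate(x))
    im = sum(v * math.sin(2.0 * math.pi * f * n / fs) for n, v in enumerate(x))
    return 2.0 * math.hypot(re, im) / len(x)


fs = 1000.0
primary = [math.sin(2.0 * math.pi * 350.0 * j / fs)
           + 3.0 * math.sin(2.0 * math.pi * 60.0 * j / fs + 0.7)
           for j in range(4000)]
cleaned = notch_canceller(primary, f0=60.0, fs=fs, mu=0.01)
tail = cleaned[-1000:]                      # after the weights have settled
residual_60 = tone_amplitude(tail, 60.0, fs)
kept_350 = tone_amplitude(tail, 350.0, fs)
```

In this example the 60 Hz interference is removed from the settled output while the 350 Hz component passes with nearly unit amplitude, even though the phase of the interference (0.7 rad here) is not known to the canceller.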

The isolated impulse response from the error $e(j)$ to the filter output is obtained with the feedback loop from point 'D' to point 'B' assumed to be broken.

Let an impulse of amplitude $a$ be applied at the point of the error signal, that is at point 'C', at a discrete time $j = k$.

That is: at $j = k$, $e(j) = a$;

i.e. $e(j) = a\,\delta(j-k)$

and $\delta(i) = 1$ for $i = 0$;

$\delta(i) = 0$ for $i \neq 0$.

Therefore

$$e(j) = a\,\delta(j-k) \qquad (3)$$

The response at point 'E' is then:

$$e(j)\,A(j) = \begin{cases} a\,C\cos(\omega_0 kT + \phi) & \text{for } j = k \\ 0 & \text{for } j \neq k \end{cases} \qquad (4)$$

This is the input impulse scaled in amplitude by the instantaneous value of $A(j)$ at $j = k$.

The signal flow path from point 'E' to point 'F' is that of a digital integrator with transfer function $2\mu/(z-1)$ and impulse response $2\mu\,u(j-1)$, where $u(j)$ is the discrete unit step function.

The response at point 'F' is obtained by convolving $2\mu\,u(j-1)$ with $e(j)A(j)$,

i.e.

$$w_1(j) = 2\mu\, a\, C\cos(\omega_0 kT + \phi), \quad j \geq k+1 \qquad (5)$$

This step function, which was scaled at 'E' and delayed at 'F', is now multiplied by $A(j)$ to yield the response at 'G' as

$$y_1(j) = 2\mu\, a\, C^2 \cos(\omega_0 jT + \phi)\cos(\omega_0 kT + \phi), \quad j \geq k+1 \qquad (6)$$

The response at point 'K' is obtained in the same way. The signal flow path from point 'H' to point 'I' has an impulse response of $2\mu\,u(j-1)$, with $u(j)$ being the unit step function. The response at point 'I' is then

$$w_2(j) = 2\mu\, a\, C\sin(\omega_0 kT + \phi), \quad j \geq k+1 \qquad (7)$$

This again is multiplied by $B(j)$ to obtain the response at 'K' as

$$y_2(j) = 2\mu\, a\, C^2 \sin(\omega_0 jT + \phi)\sin(\omega_0 kT + \phi), \quad j \geq k+1 \qquad (8)$$

The combination of $y_1(j)$ and $y_2(j)$ yields the response at the filter output point 'D' as

$$y(j) = 2\mu\, a\, C^2 \cos\bigl(\omega_0 T (j-k)\bigr), \quad j \geq k+1 \qquad (9)$$

$$y(j) = 2\mu\, a\, C^2\, u(j-k-1)\cos\bigl(\omega_0 T (j-k)\bigr) \qquad (10)$$

This is a time-invariant impulse response which is proportional to the input impulse. The linear transfer function for the noise canceller can now be derived as follows. If the time $k$ is set to zero, the unit impulse response from point 'C' to point 'D' is

$$y(j) = 2\mu\, C^2\, u(j-1)\cos(\omega_0 jT) \qquad (11)$$
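The open-loop impulse response derived above can be checked numerically. The sketch below drives the two weight integrators with an impulse at time $k$, with the feedback path broken, and compares the filter output with the closed-form cosine response. All names and parameter values are hypothetical; the phase is deliberately non-zero to show that the result does not depend on it.

```python
import math


def open_loop_impulse_response(n, k, a, mu, C, w0T, phase):
    """Filter output when e(j) = a*delta(j - k) and the feedback is broken."""
    w1 = 0.0
    w2 = 0.0
    y = []
    for j in range(n):
        A = C * math.cos(w0T * j + phase)
        B = C * math.sin(w0T * j + phase)
        y.append(w1 * A + w2 * B)      # response at the filter output
        e = a if j == k else 0.0       # impulse injected at the error point
        w1 += 2.0 * mu * e * A         # digital integrator, first weight
        w2 += 2.0 * mu * e * B         # digital integrator, second weight
    return y


def closed_form(n, k, a, mu, C, w0T):
    """2*mu*a*C^2 * u(j-k-1) * cos(w0*T*(j-k))."""
    return [2.0 * mu * a * C * C * math.cos(w0T * (j - k)) if j >= k + 1 else 0.0
            for j in range(n)]


simulated = open_loop_impulse_response(50, 3, 1.5, 0.05, 2.0, 0.3, 0.7)
expected = closed_form(50, 3, 1.5, 0.05, 2.0, 0.3)
max_difference = max(abs(s - t) for s, t in zip(simulated, expected))
```

The agreement follows from the identity $\cos x \cos y + \sin x \sin y = \cos(x - y)$, which removes the phase $\phi$ and leaves a response that depends only on $j - k$.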

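The transfer function of this path is

$$G(z) = 2\mu C^2 \left[ \frac{z\,(z - \cos\omega_0 T)}{z^2 - 2z\cos\omega_0 T + 1} - 1 \right] \qquad (12)$$

$$G(z) = \frac{2\mu C^2\,(z\cos\omega_0 T - 1)}{z^2 - 2z\cos\omega_0 T + 1} \qquad (13)$$

This function can be expressed in terms of the radian sampling frequency $\Omega = 2\pi/T$ as

$$G(z) = \frac{2\mu C^2\,\bigl[z\cos(2\pi\omega_0\Omega^{-1}) - 1\bigr]}{z^2 - 2z\cos(2\pi\omega_0\Omega^{-1}) + 1} \qquad (14)$$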

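If the feedback path from 'D' to 'B' is now closed, the transfer function $H(z)$ is obtained from the feedback formula $H(z) = 1/(1 + G(z))$ as

$$H(z) = \frac{z^2 - 2z\cos(2\pi\omega_0\Omega^{-1}) + 1}{z^2 - 2(1 - \mu C^2)\,z\cos(2\pi\omega_0\Omega^{-1}) + 1 - 2\mu C^2} \qquad (15)$$

Equation 15 shows that the single-frequency noise canceller has the properties of a notch filter at the reference frequency $\omega_0$: the zeros of $H(z)$ lie on the unit circle at $z = e^{\pm j\omega_0 T}$. This is also shown experimentally.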
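The notch behaviour can be seen by evaluating the closed-loop transfer function on the unit circle. The sketch below assumes the form $H(z) = 1/(1 + G(z))$ with $G(z) = 2\mu C^2 (z\cos\omega_0 T - 1)/(z^2 - 2z\cos\omega_0 T + 1)$; the function name and the numerical values (60 Hz notch, 1000 Hz sampling, $\mu = 0.01$) are hypothetical.

```python
import cmath
import math


def notch_response(f, f0, fs, mu, C=1.0):
    """Evaluate the closed-loop H(z) at z = exp(j*2*pi*f/fs)."""
    z = cmath.exp(2j * math.pi * f / fs)
    c = math.cos(2.0 * math.pi * f0 / fs)   # cos(w0 * T)
    g = mu * C * C
    numerator = z * z - 2.0 * z * c + 1.0
    denominator = z * z - 2.0 * (1.0 - g) * z * c + 1.0 - 2.0 * g
    return numerator / denominator


gain_at_notch = abs(notch_response(60.0, 60.0, 1000.0, 0.01))
gain_far_away = abs(notch_response(350.0, 60.0, 1000.0, 0.01))
```

The gain vanishes at the reference frequency and stays close to unity away from it. Reducing the product $\mu C^2$ moves the poles closer to the zeros, which makes the notch narrower.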

APPLICATIONS:

There are a variety of practical applications, such as the cancellation of noise in speech signals, cancellation of antenna sidelobe interference, and cancellation of several kinds of interference in electrocardiography. The simulated experiments and their results will now be shown. These indicate the use of the adaptive noise canceller in various environments.


Filter 1 is the hypothetical case where various fixed frequencies are present and the undesired frequency is to be cancelled. Consider an input of fixed signals at 60 c/s, 350 c/s, 400 c/s and 450 c/s, of which the 60 c/s signal is to be eliminated. The program for Filter 1 is shown along with the output plots for the filter.

Filter 2 is another form of notch filter. It removes the 60 Hz signal from a primary input of varying frequencies. Signals with varying frequencies in the ranges of 300 Hz to 800 Hz and 40 Hz to 70 Hz are generated as the primary input to the filter. As before, the 60 Hz component is removed adaptively. Filter 2 and its output plots are shown.

The adaptive filter is equally applicable to filtering any type of varying signals and varying frequencies. This is shown by Filter 3. The primary input has the signal in varying frequencies as before, and the lower frequencies in the range of 40 Hz to 70 Hz are completely eliminated. The response for Filter 3 is also shown.


C
C     FILTER 1: THIS FILTER HAS FIXED PRIMARY INPUT FREQUENCIES
C     AT 60HZ, 350HZ, 400HZ AND 450HZ WHICH ARE ALL
C     SINUSOIDAL. THE 60HZ IS BEING CONSIDERED AS THE NOISE
C     FREQUENCY TO BE REMOVED ADAPTIVELY.
C     THE OUTPUT IS A PLOT OF POWER SPECTRUM IN DB.
C     THE FILTER I/P AND O/P PLOTS ARE EXACTLY IN SAME SCALE
C
      COMPLEX P1,P2,X2,X3,X4,X6,X7,X9
      COMPLEX Y1(257),S(257),RF1(257),RF2(257),X1(257)
      T=0.
      P1=(0.,0.)
      T2=0.
      P2=(0.,0.)
      X2=(0.,0.)
      X3=(0.,0.)
      X4=(0.,0.)
      X7=(0.,0.)
      X9=(0.,0.)
      DO 9 I=1,257
      Y1(I)=(0.,0.)
      RF1(I)=(0.,0.)
      RF2(I)=(0.,0.)
      X1(I)=(0.,0.)
    9 CONTINUE
      A=0.
      F=350.0
      DO 1 I=1,256
C --- PRIMARY INPUT ---
      IF(I.GE.75) F=400.0
      A=15.0*SIN(T)
      B=3.0*SIN(T2)
      P1=CMPLX(A,0.)
      P2=CMPLX(B,0.)
      S(I)=P1+P2
C     THE PHASE UPDATE FOR T IS NOT LEGIBLE IN THE LISTING.
C     THE FORM BELOW IS ASSUMED BY ANALOGY WITH FILTER 2.
      T=T+2.0*3.14*F/1000.0
      T2=T2+2.0*3.14*60.0/256.0
    1 CONTINUE
      A=0.
C ---- REFERENCE INPUT ----
      T=0.
      DO 2 I=1,256
      A=2.0*SIN(T)
      RF1(I)=CMPLX(A,0.)
      T=T+2.0*3.14*60.0/256.0
    2 CONTINUE
      T=0.
C ---- 90 DEG PHASE SHIFTED REFERENCE INPUT ----
      A=0.
      DO 3 I=1,256
      A=2.0*COS(T)
      RF2(I)=CMPLX(A,0.)


      T=T+2.0*3.14*60.0/256.0
    3 CONTINUE
C --- THE FILTER ---
      Y1(1)=(0.,0.)
      DO 5 I=1,256
C     ERROR = PRIMARY INPUT MINUS FILTER OUTPUT
      X1(I)=S(I)-Y1(I)
C     FIRST WEIGHT: X7 ACCUMULATES 0.125*E*RF1
      X2=X1(I)*RF1(I)
      X2=X2*0.125
      X6=X7
      X7=X2+X6
      X2=X7*RF1(I)
C     SECOND WEIGHT: X4 ACCUMULATES 0.125*E*RF2
      X3=X1(I)*RF2(I)
      X3=X3*0.125
      X9=X4
      X4=X3+X9
      X3=X4*RF2(I)
      Y1(I+1)=X2+X3
    5 CONTINUE
      IM=8
      CALL FFT(X1,IM)
      CALL FFT(Y1,IM)
      CALL FFT(S,IM)
C     S(I) PRESENTLY HAS THE FFT OF I/P S(I)
      DO 50 I=1,256
      GG=REAL(X1(I))**2+AIMAG(X1(I))**2
      GG=GG/1000.0
      X1(I)=CMPLX(GG,0.0)
      GK=REAL(Y1(I))**2+AIMAG(Y1(I))**2
      GK=GK/1000.0
      Y1(I)=CMPLX(GK,0.)
      PP=REAL(S(I))**2+AIMAG(S(I))**2
      PP=PP/1000.0
      S(I)=CMPLX(PP,0.)
   50 CONTINUE
      CALL INITT(120)
      CALL PLOTS(IBUF,1,5)
C     CALL AXIS(0.,0.,'X AXIS',-6,10.,0.,0.,1.)
C     CALL AXIS(0.,0.,'Y AXIS',6,8.,90.,0.,1.)
  558 FORMAT(I4)
      WRITE(1,660)
  660 FORMAT(' ','TO START THE PLOT, HIT RETURN ')
      READ(1,558) IED
      X=0.
      CALL PLOT(0.0,0.0,-3)
      DO 303 I=1,128
      Y=REAL(S(I))
      X=X+6
      IC=2
      CALL FACTOR(0.01)
      CALL PLOT(X,Y,IC)
  303 CONTINUE
C     P.S. OF O/P (G(I)) IS PLOTTED


      X=0.
C     CALL AXIS(0.,0.,'POW SPEC',-8,8.,0.,0.,1.)
C     CALL AXIS(0.,0.,' ',1,8.,90.,0.,1.)
      CALL PLOT(0.0,0.0,-3)
      DO 90 I=1,128
      Y=REAL(X1(I))
      IC=2
      X=X+6
      CALL FACTOR(0.01)
      CALL PLOT(X,Y,IC)
   90 CONTINUE
      CALL ANMODE
      CALL FINITT(0,0)
      STOP
      END


C **************************
C
C     FILTER 2: THIS FILTER HAS A PRIMARY INPUT OF VARYING
C     FREQUENCIES IN THE RANGE OF 40C/S TO 70C/S AND
C     300 C/S TO 800C/S. THE FREQUENCY OF THE GENERATED SIGNAL
C     IS CONTINUOUSLY VARYING SINUSOIDALLY.
C     THE AMPLITUDE VARIATION OF THE SIGNAL IS ALSO SINUSOIDAL
C     60C/S IS ASSUMED TO BE THE NOISE FREQ TO BE ELIMINATED
C     THE OUTPUT IS A PLOT OF POWER SPECTRUM IN DB.
C     THE I/P AND O/P PLOTS ARE IN THE SAME SCALE
C
      COMPLEX P1,P2,X2,X3,X4,X6,X7,X9
      COMPLEX Y1(257),S(257),RF1(257),RF2(257),X1(257)
      T=0.
      P1=(0.,0.)
      T2=0.
      P2=(0.,0.)
      X2=(0.,0.)
      X3=(0.,0.)
      X4=(0.,0.)
      X7=(0.,0.)
      DO 9 I=1,257
      Y1(I)=(0.,0.)
      RF1(I)=(0.,0.)
      RF2(I)=(0.,0.)
      X1(I)=(0.,0.)
    9 CONTINUE
      A=0.
      U=0.
      GA=0.0
      U2=0.
      GA2=0.0
C --- PRIMARY INPUT (NOISE CORRUPTED) ---
      DO 1 I=1,256
      U=U+4.0
      GA=U/100
      F=400.0+(SIN(GA)*100.0)
      U2=U2+4.0
      GA2=U2/100.0
      F2=60.0+(SIN(GA)*10.0)
      A=5.0*SIN(T)
      B=3.0*SIN(T2)
      P1=CMPLX(A,0.)
      P2=CMPLX(B,0.)
      S(I)=P1+P2
      T=T+2.0*3.14*F/1000.0
      T2=T2+2.0*3.14*F2/256.0
    1 CONTINUE
C --- REFERENCE INPUT ---
      A=0.
      T=0.


      DO 2 I=1,256
      A=2.0*SIN(T)
      RF1(I)=CMPLX(A,0.)
      T=T+2.0*3.14*60.0/256.0
    2 CONTINUE
C ---- PHASE SHIFTED REFERENCE INPUT ---
      T=0.
      A=0.
      DO 3 I=1,256
      A=2.0*COS(T)
      RF2(I)=CMPLX(A,0.)
      T=T+2.0*3.14*60.0/256.0
    3 CONTINUE
C --- THE FILTER ----
      Y1(1)=(0.,0.)
      DO 5 I=1,256
C     ERROR = PRIMARY INPUT MINUS FILTER OUTPUT
      X1(I)=S(I)-Y1(I)
      X2=X1(I)*RF1(I)
      X2=X2*0.125
      X6=X7
      X7=X2+X6
      X2=X7*RF1(I)
      X3=X1(I)*RF2(I)
      X3=X3*0.125
      X9=X4
      X4=X3+X9
      X3=X4*RF2(I)
      Y1(I+1)=X2+X3
    5 CONTINUE
      IM=8
      CALL FFT(X1,IM)
      CALL FFT(S,IM)
C     S(I) PRESENTLY HAS THE FFT OF I/P S(I)
      DO 50 I=1,128
      GG=REAL(X1(I))**2+AIMAG(X1(I))**2
      PP=REAL(S(I))**2+AIMAG(S(I))**2
  520 FORMAT(' ','INPUT=',4X,E14.5,10X,'OUTPUT=',4X,E14.5)
      S(I)=CMPLX(PP,0.)
      X1(I)=CMPLX(GG,0.0)
   50 CONTINUE
      CALL INITT(120)
      CALL PLOTS(IBUF,1,5)
      CALL PLOT(1.0,1.0,-3)
      WRITE(1,660)
      READ(1,558) IED
C     CALL AXIS(0.,0.,'INPUT POW SPEC IN DB',-20,9.,0.,0.,1.)
C     CALL AXIS(0.,0.,' ',1,7.,90.,0.,1.)
  558 FORMAT(I4)
  660 FORMAT(' ','TO START THE PLOT, HIT RETURN ')
      X=0.
C     CALL PLOT(1.0,1.0,-3)
      IC=2
      DO 303 I=1,128

Page 132: FO DGITAL SIGNAL PPOCESSIV,as the VLSI unit hut consumes more than 3.5 times the power. However, the above VLSI multiplier units are designed to work in 2's complement and Footnote

Filter 2:con.td

PP=REAL(S( I))PP=20.0*LOG1O( PP)Y=PPX =X+ 3CALL FACTOR(0.02)CALL PLOT (X,Y,IC)

303 CONTINUEC P.S- .OF 0/P (G(I)) IS PLOTTED

x=0.READ (1,'45u) IZK

1 5 8 FOIRhAT(I14)C CALL PLOT C1.0,1.0,-3)

IC=2c CALL AXIS(0..,0..,,'POVW SPEC IN DB',-14,9.,,0.,O.,1.)C CALL AXIS(O.,0.1. 1,1,7.,90.,O,1.)

CALL PLOT (1.0,1.0.,-3)DO 90 1=1,128GG=REAL(X1(I ))GG =20.0.* LOG 10(CGG )Y =GGX =X+3CALL FACTOR (0.02)CALL PLOT (XY',IC)

90 CONITINUECALL ANMODE'CALL FINITTCO,O)STOPEND




Appendix C

1. A VLSI Residue Arithmetic Multiplier, IEEE Transactions on Circuits and Systems, W.K. Jenkins Editor, revised paper accepted for publication, approx. 10 pages.

2. Large Moduli Multipliers, ICASSP 80 Proceedings, Denver, Colorado, April 8-11, 4 pages.

3. Large Moduli Multipliers, 1980 International Symposium on Circuits and Systems, Houston, Texas, April 28-30, 3 pages.

4. A New Technique For WFTA Input/Output Reordering, International Journal of Computer and Information Sciences, J. Tou Editor, accepted for Vol. 10, Number 1, approx. 15 pages.

5. The Realization of Adaptive Kalman Filter, pending ACTA, M. Hamza Editor.


PUBLISHED IN THE PROCEEDINGS OF THE ICASSP 80

DENVER, COLORADO, APRIL 9, 10, 11



LARGE MODULI MULTIPLIERS

Fred J. Taylor

Department of Electrical and Computer Engineering
University of Cincinnati, Cincinnati, Ohio 45221

ABSTRACT:

In this work we present a new table look-up storage scheme and a class of table look-up multipliers capable of working with exact (modular) numbering systems.

Memory savings associated with the new look-up multiplier, when compared to contemporary methods, are shown to be on the order of $2/N$ where $N=2^n$, n = input wordlength. Throughput is shown to be equal to that obtained using VLSI and classic architectures.

Introduction:

Digital signal processing is a study undergoing accelerated growth, acceptance, and application. With the possible exception of number theoretic transforms, digital signal processing has been principally advanced through technological achievements. These include the microprocessor, low cost high performance memory, and the read only memory (ROM). The availability of the ROM has challenged our traditional attitude towards performing digital arithmetic. In particular, the art of fixed point multiplication has undergone a partial metamorphosis through the use of ROM based table look-ups. Since multiplication has been a principal speed, cost, and complexity limitation to digital filtering, advancements in this area have been warmly received.

EXISTING LOOK-UP ARITHMETIC TECHNIQUES:

Much of the reported work on ROM based fixed point multiplication has been in support of linear shift invariant digital filtering. Authors such as Jenkins and Leon, Soderstrand, etc., have studied the cost-speed metrics of digital filters using the residue numbering system. The principal advantage of the residue numbering system is that it supports fixed point multiplication and addition without need of preserving "carry information". Thus, parallel operations are admissible. In addition, modular multipliers were shown to be realizable using table look-up methods and ROM's. Jenkins recently questioned whether the performance of the residue numbering system was due to the intrinsic properties of the system or the use of look-up multipliers. It was concluded that "it appears that when no rounding (scaling) is required, the residue structure always provides better performance with regard to multiplication. When all multiplications must be rounded, the 2's complement structure provides better performance." Since most non-trivial filter and transform applications require a high plurality of multiply and add operations (almost always insuring the overflow of the limited integer dynamic ranges currently being implemented ($<2^{16}$ typ.)), the future of residue based digital systems may appear limited at first glance. We shall, in this work, present some new results which overcome this contemporary deficiency and in fact make the potential of modular arithmetic systems even more exciting.

Residue ALU's

The disadvantages of the residue number system are manifold. Since the RNS possesses no significant digit, decimal to residue conversion, division, magnitude comparison, and arithmetic shift operations are cumbersome and should be avoided. Register overflow, due to its finite dynamic range, imposes a severe constraint on the RNS operations. Unlike weighted numbers (decimal, binary, etc.) where rounding or truncating least significant digits can control overflow, such is not the case in the RNS. Since there is an absence of least significant digits, the more general and inefficient operation known as scaling must be used. Since scaling is a form of division, its use should be discouraged. To gain insight into this problem, consider the inner product of two 31-dimensional real vectors x and y whose entries are encoded as residue digits with respect to P=(32,31,29,27). Without scaling, the dynamic range of x and y would be limited to $V^{1/2}$ where $V=(M/2)/31=25056$. Therefore, to insure that no worst case overflow can occur, a 7.3-bit (ie: $V^{1/2}=158\approx 2^{7.3}$) dynamic range limitation must be imposed on x and y. With scaling, larger input ranges can be used at the expense of statistical accuracy in the output space (analogous to roundoff errors).

Due to the dynamic range of RNS systems, one is generally forced to accept one of the following two overflow prevention strategies.

1. increase the dynamic range to a sufficiently large value by adding more moduli to P, or
2. make scaling a more efficient operation.

The first option represents a brute force attack on the problem. Such an approach will increase the cost and complexity metrics of a filter.
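The arithmetic behind the inner-product example above can be checked with a short sketch. It is an illustration only, and the variable names are not from the paper. One caveat: for P=(32,31,29,27) the printed value 25056 equals M/31 exactly, whereas (M/2)/31 would be 12528, so the sketch reproduces the printed figures by dividing M by 31.

```python
from math import prod, isqrt, log2

# Moduli from the text; they are pairwise relatively prime.
moduli = (32, 31, 29, 27)
M = prod(moduli)            # total dynamic range of the residue system

terms = 31                  # number of products summed in the inner product
V = M // terms              # reproduces the printed value 25056 (assumption: M/31)
per_input = isqrt(V)        # bound on each entry of x and y
bits = log2(per_input)      # the same bound expressed in bits

print(M, V, per_input, round(bits, 1))
```

Under these assumptions each input is held to about 158, or roughly 7.3 bits, which matches the limitation quoted in the text.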


In addition, the moduli set P must be tailored to a unique filter. The other approach appears to be the most popular at this time. Szabo and Tanaka, and others, have concentrated on the scaling efficiency through the choice of the three-tuple moduli set $P=\{2^n-1, 2^n, 2^n+1\}$. This moduli set has the ability to efficiently scale a residue number by any one of the chosen moduli. However, there is an intrinsic limitation plaguing this method and it is its dynamic range. Using a large high-speed memory unit, say 4K x 1, the input addressing space is limited to $2^{12}$. This means that a modulus $p_i$ is technically limited to $p_i<2^6$ (ie: $x_i$ and $y_i<2^6$). Therefore, the dynamic range of any modular operation is given by $M=(2^n-1)(2^n)(2^n+1)\approx 2^{3n}=2^{18}$. In many applications, an 18-bit resolution is insufficient resolution.

New Results

Two new memory efficient algorithms have been derived and are based on a novel factorization of a bilinear form. Over a real field it is obvious that

    $xy = ((x+y)/2)^2 - ((x-y)/2)^2$    (1)

which in modular form, becomes

    $\langle xy \rangle_p = \langle \phi(s^+) - \phi(s^-) \rangle_p$    (2)

where $\phi(s)=\langle s^2 \rangle_p$ with $s^+=(x+y)/2$ and $s^-=(x-y)/2$.

At first glance this algorithm, which shall be referred to as a minimal memory modular multiplier ($M^4$), would appear to be counterproductive with respect to a memory conservation metric. The memory requirements associated with the $M^4$ will be shown to be substantially less than those of direct mechanizations. First, it should be apparent that the integers $s^+$ and $s^-$ found in equation 2 are bounded from above by $2^{n+1}$. Therefore, only a (n+1)-bit table addressing space is required to realize $\phi(s^\pm)$ versus the 2n-bit space needed for direct architectures. It would appear however, that there is an exception to this rule, since one of the moduli chosen is $p=2^n+1$. Here the maximal value of $s^+$ (or $s^-$) is $2^{n+1}$ which would technically require a (n+2)-bit address. However, by using the protocol found in figure 1, the table size can be reduced to $2^{n+1}$ words for all moduli. Here, the overflow bit serves to differentiate $s=0$ from $s=2^{n+1}$.

The $M^4$ system architecture is abstracted in Figure 2. This uses $2^{n+1}$ word high-speed memory for modular arithmetic look-up operations. Using, for example, the previously referenced 4K-30ns device, moduli having an 11-bit dynamic range (vs. 6-bit in the direct form) can be mechanized. This would yield a three-moduli dynamic range on the order of $2^{33}\approx 8.6\cdot 10^9$. That is, without an increase in memory size (and therefore access time), the dynamic range of the $M^4$ is $2^{33}/2^{18}=2^{15}$ times larger than that obtainable through direct means! This large increase in dynamic range makes the RNS a viable alternative to traditional filter design methods. Both improved precision and throughput (through the reduction or absence of traditional scaling operations) can be achieved. Finally, several versions of the $M^4$ algorithm can be considered. They are summarized in figure 3.

Upon closer investigation of the table look-up data base, a potential nuisance can be found. It can be exemplified by observing that if $s^-=9$, $p=32$, then $\phi(s^-)=\langle 9^2/4 \rangle_{32}=20.25$. Therefore, it may be required that two additional fractional bits be added to the table's output word length. However, this is not the case as suggested by the following theorem:

Theorem: Let $\|v\|$ denote the integer value of v. Then

    $z = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$.

That is, only the integer value of $\phi$ need be used and the fractional bits of $\phi(s^\pm)$ can be ignored.

Proof: Let $(x+y)/2 = v + k/2$ and $(x-y)/2 = q + b/2$ where k, b = 0 or 1. Since x+y and x-y have the same parity, b = k. Then

    $z = \langle \langle (x+y)^2/4 \rangle_p - \langle (x-y)^2/4 \rangle_p \rangle_p$
    $\;\; = \langle \langle v^2 + kv + k^2/4 \rangle_p - \langle q^2 + bq + b^2/4 \rangle_p \rangle_p$
    $\;\; = \langle \langle v^2 + kv \rangle_p + k^2/4 - \langle q^2 + bq \rangle_p - b^2/4 \rangle_p$
    $\;\; = \langle \|\phi(s^+)\| - \|\phi(s^-)\| \rangle_p$

As a result, the parallel architecture is equivalent to that shown in figure 4.

Modulo p Adder

The $M^4$ multiplier requires that a modulo p adder be used to combine the two component parts of the solution (namely $\phi(s^+)$ and $\phi(s^-)$). Modulo p adders pose an interesting design problem. Unless a fast modulo p adder can be fabricated, the overhead associated with addition will offset any gain in throughput achieved through table look-ups. For the moduli chosen, $2^n-1$, $2^n$, and $2^n+1$, only the modulo $2^n$ adder can be realized directly (n-bit adder with ignored overflow). It would however be desirable to use an n-bit adder to realize the modulo $2^n-1$ and $2^n+1$ adders as well.

Using n-bit AND gates to sense the zero condition of $\langle s \rangle_N$, the overflow bit OVF, and the sign bits of $\phi(s^+)$ and $\phi(s^-)$, a combinational logic routine can be defined which will convert $\langle s \rangle_{2N}$ into $\langle s \rangle_p$. It can be noted that the mapping requirements are:

1. for $p=2^n-1$, map s to s or $s-2^n+1$
2. for $p=2^n$, map s to $s-2^n$
3. for $p=2^n+1$, map s to s or $s-2^n-1$

Suppose the moduli $p=2^n\pm 1$, n=12, are to be implemented. By using two commercially available 16x9 PLA's in parallel, the 12-bit outcome of an n-bit adder and the four control bits can be converted to a 13-bit mask. The mask would transform the output of a high-speed n-bit adder to s or $s-2^n\mp 1$, depending on the state of the 4 control bits. Based on a 25-ns 12-bit Schottky look-ahead adder, a 20-ns 16x9 PLA, and 10-ns FET mask switches, a 65-ns modulo p adder, for $p=2^n-1$, $2^n$, and $2^n+1$, can be realized.
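The quarter-square identity of equation 1 and the integer-value theorem can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's hardware design: the function names `build_table` and `m4_multiply` are hypothetical, the table is addressed directly by x+y and |x-y| with the division by 4 folded into the stored value, and the overflow-bit protocol of figure 1 is not modeled, so the table here holds 2p-1 words rather than the $2^{n+1}$ words quoted in the text.

```python
def build_table(p):
    """Quarter-square table: T[s] = floor(s*s / 4) mod p for s = 0 .. 2p-2.

    Only the integer part is stored, as the theorem permits.
    """
    return [(s * s // 4) % p for s in range(2 * p - 1)]


def m4_multiply(x, y, p, table):
    """Return (x*y) mod p using two look-ups and one modular subtraction."""
    s_plus = x + y            # address of the first look-up
    s_minus = abs(x - y)      # squaring makes the sign irrelevant
    return (table[s_plus] - table[s_minus]) % p


n = 6
for p in (2**n - 1, 2**n, 2**n + 1):
    table = build_table(p)
    # Exhaustive check against ordinary modular multiplication.
    assert all(m4_multiply(x, y, p, table) == (x * y) % p
               for x in range(p) for y in range(p))
    # Words needed here versus a directly addressed p-by-p product table.
    print(p, len(table), p * p)
```

Dropping the fraction is harmless because x+y and x-y share the same parity: when both are odd, each look-up loses exactly 1/4, and the two losses cancel in the subtraction.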


The presence of a 65-ns modulo p adder will now allow a 140-ns large moduli residue multiplier to be based on 35-ns 4Kx1 MOS memory units. For a moduli set $\{2^n-1, 2^n, 2^n+1\}$, a fixed point multiplier, having an output dynamic range of $M=(2^n-1)(2^n)(2^n+1)$, can thus be fabricated having a word rate of 7.143M multiplications per second or 28.5M multiplications per second if a pipelined architecture is used.
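The two quoted rates follow from the quoted delays, as the sketch below shows. Reading the pipelined figure as one result per 35-ns memory cycle is an assumption made here; the text does not say which stage sets the pipeline clock. The helper name `rate_mhz` is illustrative.

```python
def rate_mhz(period_ns):
    """Millions of operations per second for a given cycle time in ns."""
    return 1e3 / period_ns


latency_ns = 140.0   # multiplier latency quoted in the text
memory_ns = 35.0     # 4Kx1 MOS memory access time quoted in the text

print(round(rate_mhz(latency_ns), 3))   # non-pipelined word rate
print(round(rate_mhz(memory_ns), 2))    # rate if one result emerges per memory cycle
```

The second value is four times the first, consistent with the 140-ns path being divided into four 35-ns stages.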

Summary

The residue number system offers the potential for high speed parallel arithmetic. This class of arithmetic has been demonstrated to be useful in designing recursive algorithms, transforms, and digital filters. One of the principal limitations to its use is its limited practical dynamic range. To overcome this problem, a large moduli multiplier, for the moduli set $\{2^n-1, 2^n, 2^n+1\}$, was designed. This high-speed large moduli system was the product of the new $M^4$ algorithm and new technologies (RAM and PLA's). The practical residue multiplier is capable of supporting a pipelined execution rate of 28.5M multiplications per second.

Figure 1

References

1. G.A. Jullien, "Residue Number Scaling and Other Operations Using ROM Arrays," IEEE Trans. Computers, Vol. C-27, No. 4, pp 325-337, April 1978.

2. C.H. Huang and F.J. Taylor, "Memory Compression Scheme For Modular Arithmetic," to be published, IEEE ASSP.

3. M.A. Soderstrand, "A High Speed Low-Cost Recursive Filter Using Residue Arithmetic," Proc. IEEE, Vol. 65, No. 7, July 1977.

4. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its Applications To Computer Technology, McGraw-Hill, 1967.

5. W.K. Jenkins and B.J. Leon, "The Use Of Residue Number System in the Design of Finite Impulse Response Filters," IEEE C&S, CAS-24, April 1977.

6. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion for Residue-Encoded Digital Filters," IEEE C&S, CAS-25, July 1978.

7. A. Peled and B. Liu, "A New Hardware Realization of Digital Filters," IEEE ASSP, ASSP-22, December 1974.

8. W.K. Jenkins, "A Highly Efficient Residue-Combinatorial Architecture for Digital Filters," Proc. of the IEEE (Letter), Vol. 66, No. 6, June 1978, pp 700-702.

This work was partially supported under AFOSR Grant No. AF49620-79-C-0066.

Figure 2

Figure 3

Figure 4

LARGE MODULI MULTIPLIERS FOR SIGNAL PROCESSING

by

F.J. Taylor

University of Cincinnati

Abstract

The residue number system has recently been shown to be a viable signal processing medium. However, it does possess limitations. One of the most serious is overflow prevention through magnitude scaling. One method of overcoming this defect is to increase the dynamic range of the numbering system. To this end a new high-speed large moduli multiplier has been developed. The multiplier is the result of combining the quarter squared algorithm with recent breakthroughs in device technology. As a result, equivalent 16-bit full precision products can be obtained at a pipelined rate of 28.5M multiplies per second.

This work was partially supported under AFOSR grant F49620-79-C-0066.

Page 148: FO DGITAL SIGNAL PPOCESSIV,as the VLSI unit hut consumes more than 3.5 times the power. However, the above VLSI multiplier units are designed to work in 2's complement and Footnote

I. INTRODUCTION

Recently, the residue number system (RNS) has received renewed attention in the literature [1-3]. This mathematically mature study was, until the present, in the background of digital system design because of the historic inability of digital hardware to support RNS arithmetic [4]. However, recent breakthroughs in the area of read-only memory technology have significantly altered this case. Using high-speed bipolar ROMs, the ability of the RNS to support ultra high-speed digital filtering has been experimentally demonstrated [5]. The question of decimal-to-residue I/O operations has also been addressed [6]. However, a major obstacle to the cause of RNS filtering has been register overflow protection. In order to guarantee that system registers do not overflow during run-time, an inefficient operation referred to as scaling has to be performed. If scaling were not required, RNS filters were shown to possess higher throughput rates than those obtainable using distributed arithmetic (i.e., bit-slice; ref. [7]) [8]. However, when scaling is required, the RNS architecture was shown to be at a disadvantage. It should be remembered, however, that the distributed arithmetic filter is a constant coefficient device (i.e., shift-invariant) whereas the linear RNS filter is general (i.e., variable coefficient). Therefore the RNS provides the user with the versatility needed to perform adaptive, optimal (e.g., minimal variance in a non-stationary stochastic environment), frequency tuneable filtering which cannot be supported in a bit-slice configuration.

In this work, a new multiplier architecture is developed which strengthens the case for RNS filters by significantly reducing scaling overhead. The high-speed residue multiplier will be shown to increase the dynamic range of the RNS to a value which either reduces the number of scaling operations to a small fraction of their original number or makes scaling unnecessary. All this is accomplished without increasing the memory budget over and above that found in contemporary RNS designs.

II. RNS OVERVIEW

Interest in the RNS is due to its ability to perform high-speed arithmetic. Speed is achieved through the use of a high degree of parallelism and an absence of carry information requirements. These two attributes are a byproduct of the fact that there does not exist a most (least) significant residue digit. That is, all residue digits are of equal importance. More specifically, if $P$ is a moduli set such that $P = \{p_1, \ldots, p_L\}$, and the $p_i$'s are relatively prime, then if $x \in [-M/2, M/2)$, $x$ is uniquely represented by the $L$-tuple

$$x \leftrightarrow (x_1, \ldots, x_L) \qquad (1)$$

with

$$x_i = \begin{cases} \langle x \rangle_{p_i} & \text{if } x \ge 0 \\ \langle M - |x| \rangle_{p_i} & \text{otherwise} \end{cases} \qquad (2)$$

where $\langle x \rangle_{p_i}$ denotes $x$ modulo $p_i$ and $M = \prod_{i=1}^{L} p_i$. The bilinear composition of two integers, say $x \leftrightarrow (x_1, \ldots, x_L)$ and $y \leftrightarrow (y_1, \ldots, y_L)$, written $x \circ y$ (where $\circ$ denotes addition, subtraction, or multiplication), is given by

$$x \circ y \leftrightarrow (\langle x_1 \circ y_1 \rangle_{p_1}, \ldots, \langle x_L \circ y_L \rangle_{p_L}) \qquad (3)$$

It can be seen that each residue digit, namely $\langle x_i \circ y_i \rangle_{p_i}$, can be computed independently of all others (i.e., no carry information is required). In practice, the mapping of $x_i$ and $y_i$ into $\langle x_i \circ y_i \rangle_{p_i}$ is accomplished using table lookups where the table resides in randomly accessed read-only memory. Typical high-speed memory modules, which are currently available, are:

Device      Type   Technology   Configuration   Access Speed
10149       ROM    ECL          256x4           20 ns
SN74S...    ...    TTL          ...             35 ns
2147H       ...    HMOS         4096x1          30 ns

The product of two residues modulo $p_i$, $p_i \le 2^n$, can be precomputed and stored in a $2^m \times n$-bit memory unit where $m = 2n$. Using a large existing high-speed memory (4Kx1 at 30 ns), residues having up to six-bit integer values can be used (e.g., $P = \{64, 63, \ldots\}$). Thus, fixed-point multipliers having a dynamic range of $[-M/2, M/2)$ can be architected which have execution rates in the low nanoseconds.
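The carry-free, table-driven arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the moduli 61 and 59 are hypothetical choices added to the 64 and 63 named in the text, and the function names are invented for the example. Each per-modulus table plays the role of one ROM.

```python
from math import prod

# Assumed moduli set: 64 and 63 come from the text, 61 and 59 are
# hypothetical additions chosen to be pairwise relatively prime.
MODULI = (64, 63, 61, 59)

def to_rns(x, moduli=MODULI):
    """Encode a signed integer in [-M/2, M/2) as a tuple of residues."""
    return tuple(x % p for p in moduli)

def build_mul_table(p):
    """Precompute <a*b>_p for every residue pair, as a ROM would store it."""
    return [[(a * b) % p for b in range(p)] for a in range(p)]

TABLES = [build_mul_table(p) for p in MODULI]

def rns_mul(xr, yr):
    """Multiply digit by digit with table lookups; no carries cross digits."""
    return tuple(t[a][b] for t, a, b in zip(TABLES, xr, yr))

def from_rns(residues, moduli=MODULI):
    """Decode with the Chinese remainder theorem into [-M/2, M/2)."""
    M = prod(moduli)
    x = 0
    for r, p in zip(residues, moduli):
        Mi = M // p
        x += r * Mi * pow(Mi, -1, p)   # modular inverse, Python 3.8+
    x %= M
    return x - M if x >= M // 2 else x
```

The decode step is included only so the result can be inspected; the point of the sketch is that `rns_mul` touches each digit independently.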

The disadvantages of the residue number system are manifold. Since the RNS possesses no most significant digit, decimal to residue conversion, division, magnitude comparison, and arithmetic shift operations are cumbersome and should be avoided. Register overflow, due to the finite dynamic range, imposes a severe constraint on RNS operations. Unlike weighted numbers (decimal, binary, etc.), where rounding or truncating least significant digits can control overflow, such is not the case in the RNS. Since there is an absence of least significant digits, the more general and inefficient operation known as scaling must be used. Since scaling is a form of division, its use should be discouraged. To gain insight into this problem, consider the inner product of two 31-dimensional real vectors $x$ and $y$ whose entries are encoded as residue digits with respect to $P = \{32, 31, 29, 27\}$. Without scaling, the worst-case value of $x$ and $y$ would be limited to $\sqrt{V}$ where $V = M/31 = 25056$ (with $M = 32 \cdot 31 \cdot 29 \cdot 27 = 776736$). Therefore, to insure that no worst-case overflow can occur, a 7.3-bit (i.e., $\sqrt{V} \approx 158 \approx 2^{7.3}$) dynamic range limitation must be imposed on $x$ and $y$. With scaling, larger input ranges can be used at the expense of statistical accuracy (i.e., roundoff errors).
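The arithmetic behind this bound can be reproduced directly. The sketch below assumes the bound is taken as $M/31$, which is the reading that yields the quoted value of 25056; the variable names are illustrative only.

```python
from math import prod, sqrt, log2

moduli = (32, 31, 29, 27)   # moduli set of the inner-product example
terms = 31                  # number of products accumulated

M = prod(moduli)            # total dynamic range
V = M // terms              # worst-case budget per product (assumed M/31)
limit = sqrt(V)             # largest allowed magnitude of each input
bits = log2(limit)          # the same limit expressed in bits
```

With these inputs `M` is 776736 and `V` is 25056, so each input must stay below roughly 158, or about 7.3 bits.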

Due to the dynamic range limitation inherent in an RNS, one is forced to accept one of the following options, namely:

1. increase the dynamic range of the RNS by adding more moduli to $P$, or

2. make scaling a more efficient operation.

The first option represents a brute force attack on the problem. Such an approach will increase the cost and complexity metrics of a filter. In addition, the moduli set $P$ must be tailored to each unique filter. The other approach appears to be the most popular at this time. Szabo and Tanaka, and others, have concentrated on scaling efficiency through the choice of the three-tuple moduli set $P = \{2^n - 1, 2^n, 2^n + 1\}$. This moduli set has the ability to efficiently scale a residue number by any one of the chosen moduli. However, there is an intrinsic limitation plaguing this method, and it is its dynamic range. Using a large high-speed memory unit, say 4Kx1, the input addressing space is limited to $2^{12}$. This means that a modulus is technically limited to $p_i \le 2^6$ (i.e., $x_i y_i \le 2^{12}$). The dynamic range of any modular operation is given by $M = (2^n - 1)(2^n)(2^n + 1) \approx 2^{18}$. In many applications, an 18-bit resolution is insufficient.

III. PRINCIPAL RESULT

It is desirable to keep the previously discussed three-moduli structure for purposes of potential scaling needs. However, in order to overcome the existing disadvantage of this system, that of dynamic range, a new approach is called for. Since it is unrealistic to assume that substantially larger density high-speed memories will continue to become available, it is incumbent that more memory-efficient residue arithmetic units be designed. An efficient algorithm, which is ideally suited for this application, is known as the quarter-square multiplier [9-11].

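Over a real field it is obvious that

$$xy = \left(\frac{x+y}{2}\right)^2 - \left(\frac{x-y}{2}\right)^2 \qquad (4)$$

which in modular form becomes

$$\langle xy \rangle_p = \langle \phi(s^+) - \phi(s^-) \rangle_p \qquad (5)$$

where $\phi(s) = s^2$ with $s^+ = (x+y)/2$ and $s^- = (x-y)/2$.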

The quarter-square multiplier has been studied by J.M. Pollard (1976) in a Galois field. Questions of hardware implementation were not considered and, due to the Galois field limitation, only prime moduli could be considered. H. Nussbaumer (1976) studied the quarter-square multiplier over real fields for use in ROM-intensive digital filters. Soderstrand and Fields (1977) made brief reference to this multiplier for residue arithmetic but offered no satisfactory hardware realization. In this paper, a practical residue arithmetic quarter-square multiplier will be architected using commercially available hardware.

A problem that would seem to plague the quarter-square multiplier is the need to realize the division by two of the sums and differences found in Eq. 4. In general, the existence of an $N^{-1}$, such that $\langle N N^{-1} \rangle_p = 1$, can only be guaranteed if $N$ is relatively prime to $p$. Since one of the chosen moduli is $p = 2^n$, the multiplicative inverse of 2 cannot be guaranteed to exist. Therefore, equation 4 cannot in general be rewritten as a modular equation in which the division is replaced by multiplication with a modular inverse. The potential problem of dividing the sum or difference, found in equation 4, by 2 will be explicitly and efficiently treated for the first time later in this paper.

For a $2^m$ word memory unit, the direct product architecture (i.e., $xy$) would limit the maximal moduli to be bounded by $2^n$, $n = m/2$. In fact, this claim can be extended to the case when $p = 2^n + 1$ through use of the following modification. Observe that if $x_i = 0$, then it automatically follows that $\langle x_i y_i \rangle_{p_i} = 0$. Therefore, if $x_i = 0$ (which is a detectable condition in that the $(n+1)$st bit and the remaining $n$-bit block are zero, 0|00...0) the output register would be automatically cleared. Therefore, the lookup table need not be accessed for this case. Instead, the all-zero $n$-bit portion of the table address allocated to $x_i$ can be used to represent $x_i = 2^n$, where $x_i = 2^n$ has the bit pattern 1|00...0 (see Figure 1). Here, the table would be programmed to map $y_i$ into $\langle 2^n y_i \rangle_p$ using only a $2^{2n}$ word memory.
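The address-reuse idea for $p = 2^n + 1$ can be sketched as follows. This is a hedged illustration, not the circuit of Figure 1: it assumes the same zero-detect and field reuse is applied to both operands, and the function names are hypothetical.

```python
def build_table(n):
    """Table for p = 2**n + 1 addressed by two n-bit fields (2**(2n) words).

    A field value of 0 is reused to mean the residue 2**n, because a true
    zero operand is detected beforehand and never reaches the table.
    """
    p = (1 << n) + 1
    size = 1 << n

    def decode(field):
        return field if field else size   # all-zero field stands for 2**n

    return [(decode(addr >> n) * decode(addr & (size - 1))) % p
            for addr in range(size * size)]

def mul_mod_2n_plus_1(x, y, n, table):
    """<x*y> modulo (2**n + 1) for residues 0..2**n using the compact table."""
    if x == 0 or y == 0:          # zero detect: clear the output, skip lookup
        return 0
    mask = (1 << n) - 1
    addr = ((x & mask) << n) | (y & mask)   # 2**n folds onto the zero field
    return table[addr]
```

The table holds $2^{2n}$ words even though the residues need $n + 1$ bits each, which is the saving the paragraph describes.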

The memory requirements associated with the quarter-square multiplier are substantially less than those of direct mechanizations. First, it should be apparent that the sum and difference of $x$ and $y$ that form $s^+$ and $s^-$ in equation 5 are bounded from above by $2^{n+1}$. Therefore, only an $(n+1)$-bit table addressing space is required to realize $\phi(s^\pm)$, versus the $2n$-bit space needed for direct architectures. It would appear, however, that there is an exception to this rule. Since one of the moduli chosen is $p = 2^n + 1$, the maximal value of the sum (or difference) is $2^{n+1}$, which would technically require an $(n+2)$-bit address. However, by using the protocol found in Figure 2, which is an adaptation of the network found in Figure 1, the table size can be reduced to $2^{n+1}$ words for all moduli. Here, the overflow bit serves to differentiate a value of 0 from $2^{n+1}$.

The quarter-square architecture is abstracted in Figure 3. It uses a $2^{n+1}$ word high-speed memory for modular arithmetic lookup operations. Using, for example, the previously referenced 4K-30ns device, moduli having an 11-bit dynamic range (vs. 6-bit in the direct form) can be mechanized. This would yield a three-moduli dynamic range on the order of $2^{33} \approx 8.6 \times 10^9$. That is, without an increase in memory size (and therefore access time), the dynamic range of the quarter-square RNS is $2^{33}/2^{18} = 2^{15}$ times larger than that obtainable through direct means! This large increase in dynamic range makes the RNS a viable alternative to traditional filter design methods. Both improved precision and throughput (through the reduction or absence of traditional scaling operations) can be achieved.
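The comparison between the two addressing schemes is simple enough to tabulate. The sketch below assumes a 4K-word memory (12 address bits) as in the text; the helper names are illustrative.

```python
from math import log2

def direct_bits(address_bits):
    """Largest residue width n when the table is addressed by the pair (x, y)."""
    return address_bits // 2

def quarter_square_bits(address_bits):
    """Largest residue width n when the table is addressed by an (n+1)-bit sum."""
    return address_bits - 1

def dynamic_range(n):
    """M for the moduli set {2**n - 1, 2**n, 2**n + 1}."""
    return ((1 << n) - 1) * (1 << n) * ((1 << n) + 1)

ADDRESS_BITS = 12                                 # a 4K-word memory
n_direct = direct_bits(ADDRESS_BITS)              # direct product form
n_qs = quarter_square_bits(ADDRESS_BITS)          # quarter-square form
gain = dynamic_range(n_qs) / dynamic_range(n_direct)
```

For the 4K memory the direct form supports 6-bit moduli and the quarter-square form 11-bit moduli, and `log2(gain)` comes out very close to 15.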

Several versions of the multiplier algorithm can be considered. They are summarized in Figure 4. The first, called the sequential form, would have an estimated execution time of 240 ns based on a 60 ns lookahead adder and memory having an access time of 30 ns with a cycle time of 60 ns. The second architecture, called the parallel form, would run at a 180 ns rate. The parallel architecture is preferred because of its higher speed and simpler control. A 60 ns pipelined execution rate can be purchased at a small hardware cost.

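Example: $p = 2^{11} = 2048$, $x = 1040$, $y = 352$, then

$z = \langle xy \rangle_p = 1536$

A1: $x + y = 1392$, $s^+ = 696$; $\phi(s^+) = \langle 484416 \rangle_p = 1088$;

A1: $x - y = 688$, $s^- = 344$; $\phi(s^-) = \langle 118336 \rangle_p = 1600$;

A2: $z = \langle \phi(s^+) - \phi(s^-) \rangle_p = \langle -512 \rangle_p = 1536$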
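The example can be checked with a short sketch. It assumes the lookup table is addressed by the adder output (the sum or the magnitude of the difference) and stores the integer part of the quarter square; `phi` and `quarter_square_mul` are names invented here.

```python
def phi(s, p):
    """Table entry: integer part of s**2 / 4, reduced modulo p."""
    return (s * s // 4) % p

def quarter_square_mul(x, y, p):
    """Multiply residues x and y modulo p with two lookups and one subtraction."""
    s_plus = x + y            # first adder (A1)
    s_minus = abs(x - y)      # second adder (A1); squaring removes the sign
    return (phi(s_plus, p) - phi(s_minus, p)) % p   # modulo-p combiner (A2)
```

With `p = 2048`, `x = 1040` and `y = 352` the two table outputs are 1088 and 1600, and the combined result is 1536, in agreement with the worked example.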

Upon closer investigation of the table lookup data base, a potential nuisance can be found. It can be exemplified by observing that if $x + y = 9$ and $p = 32$, then $\phi(s^+) = \langle 9^2/4 \rangle_{32} = 20.25$. Therefore, it would appear that two additional fractional bits need to be added to the table's output wordlength. However, this is not the case, as shown by the following theorem:

Theorem: Let $[v]$ denote the integer value of $v$. Then $z = \langle [\phi(s^+)] - [\phi(s^-)] \rangle_p$. That is, only the integer value of $\phi$ need be used, and the fractional bits of $\phi(s^\pm)$ can be ignored.

Proof: For $x$, $y$ and $k$ integers, one may define two rational numbers, namely $(x+y)/2 \triangleq v + k/2$ and $(x-y)/2 \triangleq q + k/2$, where $k = 0$ or $1$. Then

$$z = \langle xy \rangle_p = \langle (x+y)^2/4 - (x-y)^2/4 \rangle_p = \langle \langle v^2 + kv + k^2/4 \rangle_p - \langle q^2 + kq + k^2/4 \rangle_p \rangle_p = \langle \langle v^2 + kv \rangle_p + k^2/4 - \langle q^2 + kq \rangle_p - k^2/4 \rangle_p = \langle [\phi(s^+)] - [\phi(s^-)] \rangle_p$$

since the two $k^2/4$ terms cancel and $v^2 + kv$ and $q^2 + kq$ are the integer values of $\phi(s^+)$ and $\phi(s^-)$.

As a result, the parallel architecture is equivalent to that shown in Figure 5. Furthermore, by deriving the above theorem over a rational field, and showing that the results pertain to the integers, several classical problems are overcome:

1. The quarter-square multiplier is not restricted to the Galois fields suggested by Pollard.

2. The question of the existence of the multiplicative inverse of 4 is now moot.
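The claim that the fractional parts can be dropped for any modulus, prime or not, can be checked exhaustively for small cases. The sketch below uses exact rational arithmetic; `exact` and `check` are hypothetical helper names, and the moduli 15, 16 and 17 are simply the set $\{2^n - 1, 2^n, 2^n + 1\}$ for $n = 4$.

```python
from fractions import Fraction
from math import floor

def exact(x, y):
    """Exact rational quarter squares ((x+y)/2)**2 and ((x-y)/2)**2."""
    return Fraction(x + y, 2) ** 2, Fraction(x - y, 2) ** 2

def check(p):
    """Verify the integer-part identity for every residue pair modulo p."""
    for x in range(p):
        for y in range(p):
            a, b = exact(x, y)
            # the two fractional parts are always equal (both 0 or both 1/4)
            assert a - floor(a) == b - floor(b)
            # so the integer parts alone reproduce the product modulo p
            assert (floor(a) - floor(b)) % p == (x * y) % p
    return True
```

Calling `check(16)` exercises the even modulus for which no inverse of 2 or 4 exists, which is the case the two numbered points address.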


The quarter-square multiplier requires that a modulo $p$ adder be used to combine the two component parts of the solution (namely $\phi(s^+)$ and $\phi(s^-)$). Modulo $p$ adders pose an interesting design problem. Unless a fast modulo $p$ adder can be fabricated, the overhead associated with addition will offset any gain in throughput achieved through high-speed table lookups. For the moduli $2^n - 1$, $2^n$, and $2^n + 1$, only the modulo $2^n$ adder can be realized directly ($n$-bit adder with ignored overflow). It would, however, be desirable to use an $n$-bit adder to realize the modulo $2^n - 1$ and $2^n + 1$ adders as well. For the purpose of clarity, let $s$ be defined to be the sum of $\phi(s^+)$ and $\phi(s^-)$. The following observation then follows:

TABLE 1

                                      Modulo 2^N adder      Modulo p adder           Example: N = 3
Case   Dynamic range of s             <s>_2^N    OVF bit    p_i        <s>_p_i       s     <s>_p_i
1      s = 0                          0          0          2^N - 1    0             0     0
2      1 <= s <= 2^N - 2              s          0          2^N - 1    s             4     4
3      s = 2^N - 1                    s          0          2^N - 1    0             7     0
4      s = 2^N                        0          1          2^N - 1    s - 2^N + 1   8     1
5      2^N + 1 <= s <= 2^(N+1) - 4    s - 2^N    1          2^N - 1    s - 2^N + 1   10    3
6      s = 0                          0          0          2^N        0             0     0
7      1 <= s <= 2^N - 1              s          0          2^N        s             4     4
8      s = 2^N                        0          1          2^N        0             8     0
9      2^N + 1 <= s <= 2^(N+1) - 2    s - 2^N    1          2^N        s - 2^N       10    2
10     s = 0                          0          0          2^N + 1    0             0     0
11     1 <= s <= 2^N - 1              s          0          2^N + 1    s             4     4
12     s = 2^N                        0          1          2^N + 1    s             8     8
13     2^N + 1 <= s <= 2^(N+1)        s - 2^N    1          2^N + 1    s - 2^N - 1   10    1


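Using $n$-bit AND gates to sense the zero condition of $\langle s \rangle_{2^N}$, the overflow bit OVF, and the sign bits of $\phi(s^+)$ and $\phi(s^-)$, combinational logic can be defined which will map $\langle s \rangle_{2^N}$ into $\langle s \rangle_p$. It can be noted from the data found in Table 1 that the mapping requirements are:

1. for $p = 2^N - 1$, map $s$ to $s$ or $s - 2^N + 1 = \langle s \rangle_{2^N} + 1$,

2. for $p = 2^N$, map $s$ to $s$ or $s - 2^N = \langle s \rangle_{2^N}$,

3. for $p = 2^N + 1$, map $s$ to $s$ or $s - 2^N - 1 = \langle s \rangle_{2^N} - 1$.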
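The three mappings can be modelled behaviourally as a correction applied to the output of an $n$-bit adder. The sketch below is an assumption-laden model, not the gate-level logic: the `kind` argument and the treatment of the single case $a = b = 2^n$ (which needs a second carry bit) are choices made for this example.

```python
def mod_add(a, b, n, kind):
    """Add residues a and b modulo p using an n-bit adder plus a correction.

    kind is -1, 0 or +1 for p = 2**n - 1, 2**n or 2**n + 1 respectively.
    """
    s = a + b
    low = s & ((1 << n) - 1)     # n-bit adder output, i.e. s modulo 2**n
    ovf = s >> n                 # overflow (carry-out) information
    if kind == 0:                # mapping 2: ignore the overflow
        return low
    if kind == -1:               # mapping 1: unchanged, cleared, or plus one
        if ovf:
            return low + 1
        return 0 if low == (1 << n) - 1 else low
    # mapping 3, p = 2**n + 1: unchanged or minus one
    if ovf == 0:
        return low
    if ovf == 1:
        return (1 << n) if low == 0 else low - 1
    return (1 << n) - 1          # only reached when a = b = 2**n
```

Every branch either passes the adder output through or changes it by one, which is why a small mask-generating PLA can follow the adder.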

Mapping two is trivially satisfied with an $n$-bit adder. The other two mappings require that $s$ remain unchanged or be decremented or incremented by unity. There are several ways to approach this problem. Bioul, Davio, and Quisquater have presented an unorthodox architecture for a modulo $(2^n - 1)$ adder using two-input gates [12]. Modulo $(2^n - 1)$ adders can also be realized through the use of end-around carries. However, compared to modulo $2^n$ addition, this approach would almost double the addition delay. This extended delay problem can be overcome through added complexity (i.e., time multiplexing two end-around-carry adders). Mappings one and three can be efficiently realized in the manner suggested by the example found in Appendix A. The functional operation of adding one (mapping 1) or subtracting one (mapping 3) from the output of an $n$-bit adder is performed by a PLA. The PLA will provide an overlay mask which accomplishes the required task. The derivation and utility of the mask can be understood in the context of the following example.

Example: Suppose $s$ is an 11-bit word having a decimal value of $s_{10} = 92$, or $s_2 = 00001011100$. If $(s_{10} - 1) = 91$, or $(s_{10} - 1)_2 = 00001011011$, is desired, one notes that only the 3 LSBs of $s$ need be altered. In general, for $n = 12$, only the following 13 distinct binary masks are required to form $(s_{10} - 1)_2$:

TABLE II. MASK

MSB       Pattern       LSB
X X X X X X X X X X X X
X X X X X X X X X X X 0
X X X X X X X X X X 0 1
          . . .
0 1 1 1 1 1 1 1 1 1 1 1

Notation:
X      = leave corresponding bit location of s unchanged (1 or 0)
0 or 1 = change corresponding bit location of s to 0 (or 1)
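The mask family can be generated and applied mechanically. The sketch below is one plausible reading of the table: it assumes the mask is selected by the number of trailing zeros in $s$, which is an assumption about the PLA's selection logic rather than something stated in the text.

```python
def decrement_masks(n):
    """Return the n + 1 masks for s - 1: 'X' keeps a bit, '0'/'1' forces it."""
    masks = ["X" * n]                                  # no change
    for k in range(n):                                 # s ends in k zeros
        masks.append("X" * (n - k - 1) + "0" + "1" * k)
    return masks

def apply_mask(s, mask):
    """Overlay a mask on the n-bit binary representation of s."""
    n = len(mask)
    bits = format(s, f"0{n}b")
    out = "".join(b if m == "X" else m for b, m in zip(bits, mask))
    return int(out, 2)

def decrement(s, n):
    """Form s - 1 (for s > 0) by picking the mask from the trailing zeros."""
    k = (s & -s).bit_length() - 1                      # trailing-zero count
    return apply_mask(s, decrement_masks(n)[k + 1])
```

For `n = 12` the generator yields 13 masks, and `decrement(92, 11)` alters only the three least significant bits to give 91, as in the example.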

Suppose the moduli $p = 2^n \pm 1$, $n = 12$, are to be implemented. By using two commercially available 16x9 PLAs in parallel, the 12-bit output of an $n$-bit adder (shown as $\langle s \rangle_{2^N}$ in Table 1) and the four previously specified control bits can be converted to a 13-bit mask. The mask would transform the output of a high-speed $n$-bit adder to $s$ or $s - 2^n \mp 1$, depending on the state of the 4 control bits. Based on a 25-ns 12-bit Schottky lookahead adder, a 30-ns 16x9 PLA, and 10-ns mask switches (in the notation of Table II), a 65-ns modulo $p$ adder, for $p = 2^n - 1$, $2^n$ and $2^n + 1$, can be realized. The presence of a 65-ns modulo $p$ adder will now allow a 140-ns large moduli residue multiplier based on 35-ns 4Kx1 HMOS memory units (see Figure 5). For moduli $2^{12} - 1$, $2^{12}$, $2^{12} + 1$, a fixed-point multiplier having an output dynamic range of $2^{32}$ can thus be fabricated having a word rate of 7.14M multiplications per second. This compares favorably with new 16x16 VLSI multipliers. Using a pipelined architecture, which requires the insertion of the storage registers found in Figure 5, a very impressive throughput figure of 28.5M multiplications per second can be achieved. It is important, and fortunate, to realize that the Intel HMOS memory unit used in this analysis has a cycle time equal to the access time. If, as is often found in practice, a memory unit has a cycle time approximately twice the access time, then the pipeline delay would increase from 35 ns to 70 ns.
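The timing figures quoted above follow from simple arithmetic, sketched below. The 30 ns PLA delay is inferred here from the 65 ns total and the 25 ns and 10 ns components, so it should be read as an assumption; the names are illustrative.

```python
# Component delays in nanoseconds; PLA_NS is inferred from the 65 ns total.
ADDER_NS, PLA_NS, SWITCH_NS = 25, 30, 10
mod_adder_ns = ADDER_NS + PLA_NS + SWITCH_NS

def rate_millions(period_ns):
    """Operations per second, in millions, for one result every period_ns."""
    return 1e3 / period_ns

unpipelined = rate_millions(140)   # one product every 140 ns
pipelined = rate_millions(35)      # one product per 35 ns memory cycle
slow_cycle = rate_millions(70)     # cycle time twice the access time
```

A 140 ns multiply corresponds to about 7.14 million products per second, a 35 ns pipeline stage to about 28.6 million, and a 70 ns stage would halve that.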


Summary:

The residue number system offers the potential for high-speed parallel arithmetic. This class of arithmetic has been demonstrated to be useful in designing recursive algorithms, transforms, and digital filters. One of the principal limitations to its use is its limited practical dynamic range. To overcome this problem, a large moduli multiplier, for the moduli set $\{2^n - 1, 2^n, 2^n + 1\}$, was designed. This high-speed large moduli system was the product of a novel algorithm and new technologies (RAM and PLAs). The practical residue multiplier is capable of supporting a pipelined execution rate of 28.5M multiplies per second.

Lastly, the performance of the residue multiplier is noted to be technology dependent. As memory densities increase and speeds improve, the multiplier performance will directly benefit. As a result, the higher speeds associated with the next generation of submicron technology devices can provide a speed-up of two to five. In the more distant future, when and if the Josephson technology becomes a viable design tool, residue multiplication rates, using the proposed methodology, may approach 500M multiplications per second.

References

1. G.A. Jullien, "Residue Number Scaling and Other Operations Using ROM Arrays," IEEE Trans. Computers, Vol. C-27, No. 4, pp. 325-337, April 1978.

2. C.H. Huang and F.J. Taylor, "Memory Compression Scheme for Modular Arithmetic," to be published, IEEE ASSP.

3. M.A. Soderstrand, "A High Speed Low-Cost Recursive Filter Using Residue Arithmetic," Proc. IEEE, Vol. 65, No. 7, July 1977.

4. N.S. Szabo and R.I. Tanaka, Residue Arithmetic and Its Applications To Computer Technology, McGraw-Hill, 1967.

5. W.K. Jenkins and B.J. Leon, "The Use of Residue Number System in the Design of Finite Impulse Response Filters," IEEE C&S, CAS-24, April 1977.

6. W.K. Jenkins, "Techniques for Residue-to-Analog Conversion for Residue-Encoded Digital Filters," IEEE C&S, CAS-25, July 1978.

7. A. Peled and B. Liu, "A New Hardware Realization of Digital Filters," IEEE ASSP, ASSP-22, December 1974.

8. W.K. Jenkins, "A Highly Efficient Residue-Combinatorial Architecture for Digital Filters," Proc. of the IEEE (Letter), Vol. 66, No. 6, June 1978, pp. 700-702.

9. J.M. Pollard, "Implementation of Number-Theoretic Transforms," Electronics Letters, Vol. 12, No. 15, July 1976, pp. 378-379.

10. H. Nussbaumer, "Digital Filters Using Read-Only Memories," Electronics Letters, Vol. 12, 1976, pp. 294-295.

11. M.A. Soderstrand and E.L. Fields, "Multipliers for Residue-Number-Arithmetic Digital Filters," Electronics Letters, Vol. 13, No. 6, March 17, 1977.

12. G. Bioul, M. Davio, and J.J. Quisquater, "A Computation Scheme for an Adder Modulo $(2^n - 1)$," Digital Processes, (1975), pp. 309-318.

APPENDIX A:

An example of a PLA-controlled $2^n + 1$ adder, for $n = 3$, is diagrammed in Figure A.1. In this figure, the sum of $A = 5$ and $B = 6$ modulo $(2^3 + 1)$ (i.e., $(5 + 6)$ modulo $9 = 2$) is outlined. Also, the addition delay for $n = 12$, based on commercially available hardware, is computed to be 10 + 25 + 30 = 65 nsec. The general architecture of the adder is diagrammed in Figure A.2.

FIGURE CAPTIONS:

Figure 1: Modulo $2^n + 1$ ALU

Figure 2: Memory Compression for $s^+$

Figure 3: Modulo p Multiplier

Figure 4: Architectures

Figure 5: Large Moduli Multiplier

Figure A.1: Example Problem

Figure A.2: General Architecture

Figure 1. Modulo $2^n + 1$ ALU

Figure 2. Memory Compression for $s^+$ (table address selected according to OVF = 0 or OVF = 1)


Figure 4. Architectures (sequential form, 240 ns; parallel form with adders A1: X+Y and A1: X-Y, 60 + 60 + 60 = 180 ns)


Figure A.1: Example Problem (labels: delay for N = 12; MSB; LSB; PLA; X = don't care states; DP = double pole, electronic)
