

High performance asynchronous FIR filter design in GaAs

J.A. Montiel-Nelson S.V. Nooshabadi

Indexing terms: FIR filter design, GaAs technology, Clocked dynamic latched logic

Abstract: An asynchronous FIR architecture design using a mixed mode logic approach in GaAs technology is presented. Combining an asynchronous design style with static and dynamic logic proves to be very suitable for high speed and low power implementation of real time mobile computing applications. The authors introduce a novel clocked dynamic latched (CDL) logic in GaAs to implement the micropipeline latches required in the single phase signalling. The reliable implementation of an 11-tap FIR filter in terms of speed, area and power dissipation in the 0.6 μm GaAs MESFET technology from Vitesse is demonstrated. This ASIC system is fully operative across the full range of process spread variations and the temperature range of 0 to 100°C. It is robust against power supply variations of 15%.

1 Introduction

There has been a considerable growth in the use of real time VLSI for DSP applications. This is due to the substantial processing power, higher speed, and reduced power dissipation offered by submicron VLSI technologies.

However, in high end applications two major problems are beginning to appear. The first problem is clock skew, which is the difference in phase of a global synchronisation signal observed at different locations on a chip. Since multiphase clocks are typically used [1, 2], a nonoverlapping period is introduced between the clock phases to absorb the clock skew on a chip. This nonoverlapping time reduces the time available for processing.

The second problem is high power dissipation (especially for low power portable devices) associated with the high speed clocking of the storage elements. In CMOS clock-based designs, many gates switch because they are connected to the clock, not because they have new inputs to process. The largest gate typically is the clock driver. It must switch at all times to provide the timing reference even if only a small part of the chip has anything useful to perform. In GaAs, the power dissipation is mainly static and therefore it is independent of the clock frequency. However, the distribution and driving of heavily loaded signals becomes a major problem in terms of propagation delay and power dissipation, which requires a very complex buffering tree structure [3-5].

© IEE, 1997
IEE Proceedings online no. 19971324
Paper first received 4th June 1996 and in final revised form 18th February 1997
J.A. Montiel-Nelson is with the Centre for Applied Microelectronics, Universidad de Las Palmas de Gran Canaria, E 35017, Las Palmas de Gran Canaria, Spain
S.V. Nooshabadi is with the School of Electrical Engineering, Northern Territory University, NT 0909, Australia

Digital gallium arsenide (GaAs) is firmly in the domain of high performance VLSI. The last few years have witnessed quick progress, breaking with the past stringent IC transistor count limitations. Vitesse, Philips, and TriQuint, among others, are providing access to GaAs MESFET processes that are designed to produce circuits with low power dissipation and high speed at high levels of integration. These characteristics are achieved using high performance self-aligned MESFETs and up to five levels of interconnection. Today, circuits with more than 1,000,000 active devices can be integrated [6], and these improvements have reached such levels that GaAs may be the technology of choice for the high speed low power applications in emerging fields like multimedia and mobile computing.

Recent research in VLSI design of high performance dedicated systems is directed towards the development of complex system architectures with operating frequencies close to the limit of a given technology. Systolic arrays, allowing increased system throughput, have been proposed as a powerful solution to this problem [7]. However, employing systolic techniques to implement FIR filter algorithms results in a large area overhead due to the use of multipliers and the need for an elaborate clocking strategy. To reduce the number of multipliers and still use a systolic architecture, several alternatives employing a single multiplier have been suggested [8, 9]. However, in these architectures the sampling rate depends on the number of filter taps. Therefore, the sampling rate is several times less than the limit offered by the technology.

The architecture presented here removes the need for the multipliers and replaces them by a few shift-and-add operations. The synchronous version of this architecture would have been a semisystolic design [7]. Going one step further, in this paper, an asynchronous wavefront approach [7] has been adopted. This eliminates the need for a complex clocking strategy.

The block diagram of an asynchronous signal processing system incorporating our FIR filter is given in Fig. 1. This system is connected to the external A/D and D/A units. The principle of operation is based on the two-phase bundled data convention [10]. This method uses transition signalling as a means of controlling the transfer of data. The two control lines are request and acknowledge. Fig. 2 depicts this method. Since we consider only transfer of data through the computational blocks and not the interconnection logic, a simple handshake implementation is the only requirement.

IEE Proc.-Circuits Devices Syst., Vol. 144, No. 5, October 1997

Fig. 1 Asynchronous DSP system (labels recoverable from the figure: analogue input, asynchronous DSP system, analogue output, and the outReq and inReq handshake lines)

Fig. 2 Two-phase bundled data convention

The handshake at the D/A side is completed by connecting its outReq output to inAck through a delay element. Similarly, the handshake at the A/D side is completed by connecting its outAck to inReq through a delay element. In Fig. 1, the A/D inReq can alternatively be connected to an external sampling signal, provided the sampling period is less than the latency of the asynchronous system. Note that the internal blocks of the DSP system communicate with each other using a similar handshake protocol.
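The transition signalling described above can be illustrated with a small behavioural model. The following is only a sketch under stated assumptions: the class `TwoPhaseChannel` and the function `transfer` are hypothetical names introduced here, and the model ignores all circuit delays. It shows that every change of level on the request or acknowledge wire counts as one event, so the wires do not return to zero between transfers.

```python
class TwoPhaseChannel:
    """Behavioural model of a bundled-data channel with two-phase signalling."""

    def __init__(self):
        self.req = 0      # request wire level
        self.ack = 0      # acknowledge wire level
        self.data = None  # bundled data lines

    def sender_ready(self):
        # A new word may be launched only when the previous request has
        # been acknowledged, i.e. both wires sit at the same level.
        return self.req == self.ack

    def send(self, word):
        if not self.sender_ready():
            raise RuntimeError("previous transfer not yet acknowledged")
        self.data = word  # data must be valid before the request event
        self.req ^= 1     # any transition, rising or falling, is an event

    def receive(self):
        if self.req == self.ack:
            raise RuntimeError("no pending request")
        word = self.data
        self.ack ^= 1     # acknowledge event
        return word


def transfer(words):
    """Send each word through the channel and log the wire events."""
    channel = TwoPhaseChannel()
    received, events = [], []
    for word in words:
        channel.send(word)
        events.append(("req", channel.req))
        received.append(channel.receive())
        events.append(("ack", channel.ack))
    return received, events
```

Calling `transfer([3, 1, 4])` in this model moves three words with three request events and three acknowledge events, alternating between rising and falling edges.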

2 Multiplierless FIR filter

To reduce the hardware complexity associated with multipliers in the FIR filter, its coefficients are represented in canonical signed digit (CSD) form. This allows us to avoid the use of a full VLSI multiplier by reducing its complexity to just a few shift-and-add operations [11]. This is done by approximating each coefficient value by a sum of a small number of signed power-of-two terms [12-14]. It makes use of the fact that in binary arithmetic, multiplying a number by a power of two is just a matter of shifting. Consequently, filters whose coefficient values are each a single power of two need multipliers of greatly reduced complexity. Since the multiplier is the slowest circuit module and occupies the largest silicon area, filters whose multipliers are reduced in this way are not only economical in silicon area but also fast.
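As an illustration of the signed power-of-two idea, the sketch below converts an integer into canonical signed digit form using the standard non-adjacent-form recurrence. It is not taken from the paper: the function names `to_csd` and `from_csd` are hypothetical, and the paper's coefficients were obtained by a discrete optimisation that this sketch does not attempt to reproduce.

```python
def to_csd(n):
    """Return the CSD digits of integer n, least significant first.

    Each digit is -1, 0 or +1 and no two adjacent digits are nonzero.
    """
    digits = []
    while n != 0:
        if n & 1:
            # +1 when n mod 4 == 1, -1 when n mod 4 == 3; the choice makes
            # the next bit zero, which enforces the non-adjacency property.
            digit = 2 - (n & 3)
            n -= digit
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def from_csd(digits):
    """Rebuild the integer from its signed digits (inverse of to_csd)."""
    return sum(digit * (1 << position) for position, digit in enumerate(digits))
```

For example, 7 is encoded as 8 - 1, so a multiplication by 7 needs one subtraction of shifted words instead of the two additions implied by the plain binary form 4 + 2 + 1.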

However, the improvement in speed and saving in silicon area are achieved at the expense of deterioration in the frequency response characteristics. The extent to which the frequency response deteriorates depends on the number of power-of-two terms used in approximating each coefficient value, the architecture of the filter, and the discrete space optimisation technique used to derive the coefficient values [11]. A frequency response ripple magnitude of -60 dB can be achieved with two power-of-two coefficients [14].

The constant coefficient FIR filter chosen for the implementation corrects the sin(x)/x distortion introduced by D/A converters incorporating a zero order hold [15]. This filter, with a transfer function that approximates x/sin(x), is used before the D/A converter to predistort the digital samples. It is a linear shift filter having 11 taps with six different coefficients. By representing the 11 coefficients h0 - h10 in CSD form, all multiplications are replaced by at most 2 additions/subtractions of shifted data words. The filter coefficients h0 - h10 in CSD form are given in Table 1. The performance of this filter is very close (±0.028 dB) to what would be obtained from an ideal filter with infinite precision coefficients [15].
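To make the size of the sin(x)/x distortion concrete, the sketch below evaluates the textbook magnitude response of a zero order hold, sin(x)/x with x = πf/fs, and the ideal x/sin(x) correction. The function names are hypothetical, and the band over which the paper's 11-tap filter actually compensates is not stated here, so the example only evaluates the ideal curves.

```python
import math


def zoh_gain(f_over_fs):
    """Magnitude of the zero order hold response at normalised frequency f/fs."""
    x = math.pi * f_over_fs
    return 1.0 if x == 0 else math.sin(x) / x


def ideal_predistortion_gain(f_over_fs):
    """Ideal x/sin(x) gain that cancels the hold response (valid for 0 <= f/fs < 1)."""
    return 1.0 / zoh_gain(f_over_fs)


def to_db(gain):
    """Convert a linear magnitude to decibels."""
    return 20.0 * math.log10(gain)
```

At half the sampling rate the hold attenuates by about 3.92 dB, which is the droop an ideal x/sin(x) predistortion would have to restore at that frequency.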

Table 1: 11-tap x/sin(x) filter coefficients in CSD form

Coeff.      CSD form
h10, h0     -2^(exponent illegible in source)
h9, h1      2^-8
h8, h2      -2^(exponent illegible in source)
h7, h3      2^-6 + 2^-8
h6, h4      -2^-4 + 2^-8
h5          1 + 2^-2
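The way a CSD coefficient turns a multiplication into shifts and adds can be sketched with the three two-term entries that are legible in Table 1. This is an illustration only: the dictionary keys are labels assumed from the filter symmetry, `FRAC_BITS` is a hypothetical guard-bit count chosen so that every shift is exact, and the real datapath uses fixed 16-bit words.

```python
from fractions import Fraction

FRAC_BITS = 8  # assumption: 8 guard bits, enough for the largest shift below

# (sign, right-shift) pairs for the two-term entries legible in Table 1.
CSD_TERMS = {
    "h5": [(+1, 0), (+1, 2)],     # 1 + 2^-2
    "h4_h6": [(-1, 4), (+1, 8)],  # -2^-4 + 2^-8
    "h3_h7": [(+1, 6), (+1, 8)],  # 2^-6 + 2^-8
}


def csd_multiply(x, terms):
    """Multiply integer x by a CSD coefficient using shifts and adds only.

    The result is scaled by 2**FRAC_BITS so that it stays an integer.
    """
    scaled = x << FRAC_BITS
    accumulator = 0
    for sign, shift in terms:
        accumulator += sign * (scaled >> shift)
    return accumulator


def coefficient_value(terms):
    """Exact rational value of the coefficient, used only for checking."""
    return sum(Fraction(sign, 1 << shift) for sign, shift in terms)
```

With these terms, `csd_multiply(12, CSD_TERMS["h5"])` returns 3840, which is 1.25 × 12 × 256, using one addition and no multiplier.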

3 Asynchronous FIR architecture

The architecture of the full 11-tap asynchronous FIR chip with its associated handshake control is illustrated in Fig. 3. Using the FIR filter in its transposed form and employing asynchronous micropipelining, the architecture is designed to minimise the delay of data distribution. The input, intermediate and output wordlengths are 14, 16 and 16 bits, respectively. Input and output micropipeline registers (MPRs), INREG and OUTREG, isolate the filter circuit from the I/O pads for the input and output buses. Hence, the pad delay is not in the critical path. Fig. 3b shows the details of a single micropipeline register stage with its associated handshake control circuitry. It consists of the handshake (HSK) circuitry, as well as the micropipeline registers for the input data (MD) and the carry save adder sum and carry signals (MCSA). The capture and pass signal (CSIGNAL) is generated by the handshake circuitry for controlling the micropipeline register and the filter tap discrete time delay registers (DTDRs). Fig. 3c depicts a single FIR filter micropipeline stage with its coefficients implemented in CSD form. Each micropipeline stage implements several filter taps as shown in Table 2.

Because CSIGNAL is used locally within a micropipeline stage, the skew problem is largely reduced. This contrasts with synchronous designs where the clock distribution poses a major design problem.

Fig. 3 FIR chip architecture
a FIR filter architecture
b Micropipeline register circuitry
c Micropipeline stage taps
INREG and OUTREG represent the input and output micropipeline registers; MPR, DTDR, MD and MCSA represent the micropipeline register, the discrete time delay register, the micropipeline register for the input data and the micropipeline register for the sum and carry signals; HSK and CSIGNAL represent handshake circuitry, and the capture and pass signal respectively; TAP and VMA represent the FIR filter tap and the vector merger adder stages

Fig. 4 Adder architecture
a Black cell
b 16-bit carry look ahead adder

Table 2: Micropipeline stages and their filter taps

Stages Filter taps implemented

To increase the throughput, a carry-save addition scheme is used to avoid carry propagation addition. In this way, the sum and carry signals propagate in parallel. The sizes of the micropipeline registers have to be doubled to store both signals. A further increase in the throughput rate is achieved by using bit level micropipelining, that is, micropipeline registers are inserted after every adder stage, as shown in Fig. 3c.
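The carry-save scheme can be sketched as follows. This is a behavioural illustration with hypothetical function names, assuming a 16-bit intermediate word as stated for the architecture; it shows why two registers (sum and carry) are needed per stage and why no carry ripples inside a stage.

```python
WIDTH = 16                 # intermediate wordlength stated in the text
MASK = (1 << WIDTH) - 1


def carry_save_add(a, b, c):
    """Reduce three words to a sum word and a carry word.

    Every bit position is computed independently, so the delay is that of
    one full adder regardless of the wordlength.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word & MASK, carry_word & MASK


def accumulate(words):
    """Fold a sequence of words into a (sum, carry) pair, one adder stage per word."""
    sum_word, carry_word = 0, 0
    for word in words:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, word & MASK)
    return sum_word, carry_word
```

The true result is only formed when the two words are finally added together, which is the job of the vector merger adder at the end of the chain.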

A vector merger adder (VMA) adds the carry save adder sum and carry signals after the last filter tap. It is implemented using a 16-bit carry-look-ahead adder. It is isolated from the last filter tap by inserting a micropipeline stage. Fig. 4 shows the architecture of the 16-bit carry-look-ahead adder. The vector merger adder is, in turn, partitioned into two micropipeline stages (VMA-I and VMA-II) to reduce the critical delay.
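A carry-look-ahead adder built from black cells can be sketched as a parallel prefix computation over generate and propagate signals. The prefix topology used below (a Kogge-Stone style network) is an assumption made for illustration, since the text only states that black cells and a 16-bit carry-look-ahead adder are used; the split into VMA-I and VMA-II is not modelled, and the function names are hypothetical.

```python
WIDTH = 16


def black_cell(upper, lower):
    """Merge two (generate, propagate) pairs; `upper` covers the higher bits."""
    g_upper, p_upper = upper
    g_lower, p_lower = lower
    return (g_upper | (p_upper & g_lower), p_upper & p_lower)


def lookahead_add(a, b, width=WIDTH):
    """Add two words modulo 2**width using a prefix network of black cells."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]
    propagate = [p for _, p in gp]

    # Each level doubles the span of bits summarised by every cell.
    distance = 1
    while distance < width:
        gp = [black_cell(gp[i], gp[i - distance]) if i >= distance else gp[i]
              for i in range(width)]
        distance *= 2

    # gp[i][0] is now the carry out of bit i for a carry-in of zero.
    result, carry_in = 0, 0
    for i in range(width):
        result |= (propagate[i] ^ carry_in) << i
        carry_in = gp[i][0]
    return result
```

Feeding the final sum and carry words of a carry-save chain into `lookahead_add` yields the same value as an ordinary 16-bit addition, with a logic depth that grows with the logarithm of the wordlength rather than linearly.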

In Fig. 3, two kinds of storage registers are employed: the micropipeline register and the discrete time delay register. The micropipeline registers store the data-in, and intermediate carry and sum signals between the micropipeline stages. The discrete time delay registers are the architectural mapping of the discrete time delay elements in the FIR algorithm.

4 Physical design

A mixed mode logic design approach was used to implement various parts of the FIR architecture efficiently. Clocked dynamic logic, direct coupled FET logic, super buffer FET logic, and ultra buffer FET logic [16, 17] families are combined to produce the most area, power, and speed efficient implementation. The communication handshake was implemented using static logic, which has a better noise margin. The data path was implemented using a mix of static and dynamic logic. A novel extension of the clocked CMOS logic (C2MOS) family [1] to GaAs MESFET was employed to implement dynamic latches called clocked dynamic latches (CDL).

The computational blocks were implemented using static logic. On the other hand, the storage elements were implemented using the clocked dynamic latches. Computational logic blocks with two clocked dynamic latch types are illustrated in Fig. 5a and b. As seen in Fig. 5c and d, two types of clocked dynamic latches were designed. Simulation results showed that the clocked dynamic latch in Fig. 5d provides a better fanout and superior noise margin. The clocked dynamic latches at the output of two cascaded logic blocks capture and pass data on the opposite phases of CSIGNAL. When the clocked dynamic latch in the first logic block is transparent/opaque, the clocked dynamic latch in the subsequent logic block is opaque/transparent.

Fig. 5 Functional cells
a Functional cell with clocked dynamic latch type I
b Functional cell with clocked dynamic latch type II
c Clocked dynamic latch type I
d Clocked dynamic latch type II
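The opposite-phase behaviour of two cascaded latches can be illustrated with a simple level-sensitive model. This is a behavioural sketch only: the class `Latch`, the function `run_pair` and the choice of which CSIGNAL level makes the first latch transparent are assumptions, and the static logic between the latches is omitted.

```python
class Latch:
    """Level-sensitive latch that is transparent at one level of CSIGNAL."""

    def __init__(self, transparent_level):
        self.transparent_level = transparent_level
        self.q = 0

    def update(self, d, csignal):
        if csignal == self.transparent_level:
            self.q = d   # transparent: the output follows the input
        return self.q    # opaque: the output holds its last value


def run_pair(samples):
    """Drive two cascaded latches with (csignal, data) samples.

    The first latch is assumed transparent when CSIGNAL is 1 and the second
    when CSIGNAL is 0, so they are never transparent at the same time.
    """
    first = Latch(transparent_level=1)
    second = Latch(transparent_level=0)
    outputs = []
    for csignal, d in samples:
        q_first = first.update(d, csignal)
        outputs.append(second.update(q_first, csignal))
    return outputs
```

In this model, data presented while the first latch is opaque never reaches the output, and each captured value appears at the second latch half a CSIGNAL cycle later.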

Fig. 6 Direct coupled FET logic family (signal labels recoverable from the figure: inReq, CSIGNAL, CLEAR, outAck, outReq)
a Muller C-element
b Toggle element

The necessary event driven logic for the handshake circuitry is depicted in Fig. 6. It is implemented using the more robust and compact direct coupled FET logic family. The computational and storage subsystems inside the micropipeline stages (Fig. 3c) form the major part of the design. Therefore, it was decided to implement the functional cells for the full adders in static logic, and the registers as clocked dynamic latches, as a compromise between the throughput rate and noise margin. Hence, the register delay, the loading on CSIGNAL and the area were reduced, while the noise margin of the static functional cells was still maintained.
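The two event driven elements named in Fig. 6 have well-known behaviours that can be sketched independently of their transistor level design. The model below follows the usual micropipeline definitions of these elements; the class names are hypothetical, and the way the elements are wired to inReq, outAck and CSIGNAL in the actual chip is not reproduced.

```python
class MullerC:
    """Muller C-element: the output changes only when both inputs agree."""

    def __init__(self, out=0):
        self.out = out

    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out  # inputs differ: hold the previous output


class Toggle:
    """Toggle element: steers successive input events to alternate outputs."""

    def __init__(self):
        self.out0 = 0
        self.out1 = 0
        self.next_is_out0 = True

    def event(self):
        # Each input transition produces a transition on exactly one output.
        if self.next_is_out0:
            self.out0 ^= 1
        else:
            self.out1 ^= 1
        self.next_is_out0 = not self.next_is_out0
        return self.out0, self.out1
```

The C-element acts as an AND for events: its output makes a transition only after both of its inputs have made one, which is what lets a stage wait for both a new request and the acknowledgement of the previous transfer.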

Each elastic micropipeline register was constructed using a single clocked dynamic latch. Each discrete time delay register was built using two clocked dynamic latches. This was done by inserting the clocked dynamic latches in the middle and at the end of the carry save adder blocks. Alternate clocked dynamic latches in the carry save adders are controlled by CSIGNAL and its complement. Fig. 7a - c depict the implementation of the discrete time delay elements and the carry save adders.

The HSPICE simulation of a 1-bit slice of the carry save adder plus the discrete time delay register in a filter tap is shown in Fig. 8. It is seen that the sum and carry outputs have a latency of half a cycle. The typical delay from the negative edge of the clock is 300 ps.

The vector merger adder is implemented in two micropipeline stages. The black cell in the carry-look-ahead adder is shown in Fig. 4b. The HSPICE simulation of the first vector merger adder micropipeline stage (VMA-I) is depicted in Fig. 9. The rippling effect in the carry output signals for the first and the ninth bits due to changes in the LSB of one of the inputs is seen in Fig. 9. The ninth carry output, being latched, has a latency of half a cycle. Propagation delay through the second vector merger adder stage (VMA-II) is identical to that of VMA-I.

Fig. 7 Carry save adder logic blocks
a Latched propagate XOR gate
b Latched sum XOR gate
c Latched carry function

Table 3: Performance of functional blocks, FIR Core, VMA-I and VMA-II in H-GaAsIII process from Vitesse

Functional   Delay (ps)           Power (mW)                  Area*
block        min.   typ.   max.   min.     typ.     max.
CSADTDR      277    300    364    0.81     1.13     1.33      7
TAP          277    300    364    14.84    19.57    22.58     110
HSK          517    581    754    3.17     4.22     4.89      17
MD           157    177    224    13.20    16.93    19.99     137
MPR          517    581    754    18.33    25.87    27.37     243
FIR Core     635    836    1061   422.62   566.58   790.88    3621
VMA-I        432    569    632    5.95     8.42     10.8      54.36
VMA-II       435    572    637    5.84     8.32     10.6      46.26

* Area is given in mm² multiplied by a scale factor that is illegible in the source.
CSADTDR represents a 1-bit slice of the carry save adder + the discrete time delay register; TAP represents one FIR filter tap; HSK, MD, MPR represent the handshake circuitry, the micropipeline register for the input data and the micropipeline register, respectively; FIR Core represents all the FIR filter taps excluding the final vector merger adder; VMA-I and VMA-II represent the first and second stages of the vector merger adder, respectively

Fig. 8 HSPICE simulation of 1-bit slice of the carry save adder + the discrete time delay register (horizontal axis: time, ns)

The performance of various functional blocks, the FIR core and the vector merger adder stages is given in Table 3. Both the FIR core and the vector merger adder are fully operative in the temperature range of 0 to 100°C, within the full range of process spread (between fast-fast and slow-slow technology process corners) [6]. They also remain fully operative under power supply variations of 300 mV. The minimum, typical, and maximum delays are based on the temperature, power supply, and process spread variations. The process spread corners are the lower left and the upper right corner of the bounding box containing pairs of threshold voltages VTD and VTE for the depletion and enhancement mode GaAs MESFETs, respectively. These pairs are obtained from processing a certain number of wafers.

Fig. 10 Buffering strategy for the micropipeline stages
HSK, MD, DTDR, MCSA, VMA and CSIGNAL represent the handshake circuitry, the micropipeline register for the input data, the discrete time delay register, the micropipeline register for the sum and carry signals, the vector merger adder, and the capture and pass signal respectively; UBFL and SBFL represent the ultra buffer FET logic and the super buffer FET logic respectively

The maximum possible sampling rate of the input data is determined by the propagation delay along the critical path. For the filter architecture in Fig. 3a, the critical path could be either the intrinsic micropipeline stage cycle time (with no delay element A inserted in the outReq path), the delay in the vector merger adder micropipeline stages, or the data-bus RC and fan-out delay plus the delay through the carry save adder and micropipeline register in the longest micropipeline stages (stages I and VI). The delays through each of these paths were simulated in HSPICE and are presented in Table 4.

Table 4: Critical path propagation delays through the long delay paths

Delay paths   Delay (ps)
              min.   typ.   max.
LSB CSAR      277    300    365
MSB CSAR      409    478    612
REQACK        517    581    754
REQFF         635    836    1061
VMA-I         432    569    632
VMA-II        435    572    637

VMA-I and VMA-II represent the delays through the vector merger adder micropipeline stages I and II, respectively; LSB CSAR and MSB CSAR are delays through a 1-bit slice of the carry save adder + the micropipeline register in the LSBs and MSB of data-in lines, respectively; REQACK and REQFF refer to the propagation delays between inReq to outAck, and inReq to CSIGNAL inputs of the discrete time delay registers and the micropipeline registers

The micropipeline stage cycle time is equal to (2 x REQACK + toggle + Muller C-element). Cycle time can be reduced to just 2 x REQACK by fast forwarding the outReq to the next stage by connecting it to the output of the Muller C-element. The discrete time delay registers in a micropipeline stage should capture the data signals before the stage goes into the pass mode. This can be guaranteed as long as the delay through a Muller C-element is less than the propagation delay between inReq and outAck (REQACK in Table 4).
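The fast-forwarded cycle time can be worked out directly from the REQACK delays listed in Table 4. The short calculation below is a sketch with hypothetical names; it covers only the handshake-limited bound of 2 x REQACK and leaves out the toggle and Muller C-element delays, which are not given numerically in the text.

```python
# REQACK propagation delays from Table 4, in picoseconds.
REQACK_PS = {"min": 517, "typ": 581, "max": 754}


def fast_forward_cycle_ps(reqack_ps):
    """Stage cycle time when outReq is fast forwarded: 2 x REQACK."""
    return 2 * reqack_ps


def rate_mhz(cycle_ps):
    """Handshake-limited rate in MHz for a cycle time given in picoseconds."""
    return 1e6 / cycle_ps


HANDSHAKE_RATE_MHZ = {
    corner: rate_mhz(fast_forward_cycle_ps(delay))
    for corner, delay in REQACK_PS.items()
}
```

Under these assumptions the slowest corner gives a cycle of 1508 ps, or roughly 663 MHz, so the handshake bound alone stays above a 500 MHz sampling rate at every corner in the table.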

There is a high loading on the MSB data-in line in each filter tap. This arises due to the CSD representation of the filter tap coefficients. Input data in 2s complement form are sign-extended, and right-shifted by several bits. Therefore the delay through a 1-bit slice of the carry save adder + the micropipeline register for the input data is maximum along the MSB of the data-in line in Fig. 3. The number of gates loading the MSB of the data-in line depends on the number of taps in a micropipeline stage and the number of shifts in each tap. The number of gates loading the MSB and the other LSBs of the data lines are given in Table 5.
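Why the sign bit ends up so heavily loaded can be seen by tracing which input bit feeds each bit position of a shifted, sign-extended word. The sketch below assumes a 14-bit input aligned to the LSB of a 16-bit word, which is an illustrative alignment rather than the chip's documented one, and the function names are hypothetical. It counts bit positions per shifted term, not the gate counts of Table 5, which also depend on how taps are grouped into stages.

```python
IN_WIDTH = 14    # input wordlength stated in the text
WORD_WIDTH = 16  # intermediate wordlength stated in the text


def source_bit(out_pos, shift, in_width=IN_WIDTH):
    """Index of the input bit seen at output position out_pos after sign
    extension and an arithmetic right shift by `shift` places."""
    return min(out_pos + shift, in_width - 1)


def msb_copies(shift, in_width=IN_WIDTH, word_width=WORD_WIDTH):
    """Number of output bit positions driven by the input sign bit."""
    return sum(1 for pos in range(word_width)
               if source_bit(pos, shift, in_width) == in_width - 1)


def shifted_word(x, shift, in_width=IN_WIDTH, word_width=WORD_WIDTH):
    """Build the shifted word bit by bit from a 14-bit two's complement x."""
    bits = [(x >> source_bit(pos, shift, in_width)) & 1
            for pos in range(word_width)]
    return sum(bit << pos for pos, bit in enumerate(bits))
```

In this model a term shifted right by 8 places takes 11 of its 16 bit positions from the sign bit, while each lower input bit feeds at most one position, which is consistent with the MSB line needing much stronger buffering.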

Fig. 10 depicts the buffering strategy for the data-in and CSIGNAL signals. CSIGNAL generated by the handshake circuitry is buffered and forked to provide the capture and pass signal for the micropipeline registers for the input data (MD) and the micropipeline registers for the sum and carry signals (MCSA). CSIGNAL is also used to clock the filter tap discrete time delay registers (DTDRs).

As shown in Fig. 10a, to speed up the delay path in the MSB of the data-in line, super buffer FET logic and ultra buffer FET logic buffers are employed to drive the carry save adder inputs. Distributing the MSB of input data to all taps in each micropipeline stage requires multiple drivers. To drive this line, a multilevel buffering tree structure is employed. The buffering strategy for each level is seen in Table 5.

The self-timed nature of the design alleviates the problems associated with the global clock distribution present in its synchronous counterpart. However, the distribution of CSIGNAL to all discrete time delay registers and micropipeline registers in a micropipeline stage at the 500 MHz sampling rate requires a three-level deep buffering tree structure. The buffering structure is shown in Fig. 10b and c. This structure is formed by inserting three levels of buffers in the CSIGNAL path. The first level of buffers drives two second-level drivers and each second-level buffer drives 16 minimum size direct coupled FET logic gates in each of the filter tap discrete time delay registers or micropipeline registers.

Table 5: Fanout on data-in bus in the micropipeline stages and buffering strategy used

Stage    Data bits   Fanout   Buffering structure
                              First level               Second level
I, VI    MSB         38       19 (UBFL) + 19 (UBFL)     2 (SBFL)
         D12 : D9    4        4 (SDCFL)                 -
         D8          3        3 (SDCFL)                 -
         D7 : D6     2        -                         -
         D5 : D0     0        -                         -
II, V    MSB         16       16 (SBFL)                 -
         D12 : D8    2        -                         -
         D7 : D4     1        -                         -
         D3 : D0     0        -                         -
III      MSB         13       13 (SBFL)                 -
         D12 : D8    2        -                         -
         D7 : D0     1        -                         -
IV       MSB         10       10 (SBFL)                 -
         D12 : D4    2        -                         -
         D3 : D2     1        -                         -
         D1 : D0     0        -                         -

- indicates no buffering required; UBFL, SBFL and SDCFL refer to the ultra buffer FET logic, the super buffer FET logic and the source follower direct coupled FET logic, respectively

The layout for the complete FIR filter is generated using the GaAs OLYMPO cell generator [18]. OLYMPO is an automatic layout generation system for GaAs VLSI. OLYMPO takes a gate-level circuit schematic in SPICE format and outputs a mask layout in CIF. The layout style is based on ring notation [19]. The floor plan and the partial layout of the full chip are shown in Fig. 11. As shown, it is a highly regular and compact architecture. The total chip area designed in the 0.6 μm H-GaAsIII process from Vitesse is 4.870 mm². As indicated in the floor plan, the filter has a latency of 9 cycle times and achieves a throughput rate of 500 MHz. The estimated total power dissipation is as low as 618 mW.

5 Conclusion

An 11-tap asynchronous FIR filter has been designed and implemented in mixed mode logic in GaAs MESFET technology. Using a mix of static and dynamic techniques, the area and power efficiency has been improved to a significant extent.

This asynchronous FIR filter has 14, 16 and 14 bits input, internal and output word lengths, respectively. It has a latency of nine cycle times and achieves a throughput rate of 500 MHz. This chip is designed in the 0.6 μm H-GaAsIII process from Vitesse, occupies an area of 4.870 mm² and dissipates a total power of 618 mW. The chip was completely simulated at the logical and electrical levels and is at present in production. This filter chip has served as a demonstration vehicle to validate the use of asynchronous design techniques for digital signal processing systems in GaAs technology. Based on this design approach, dedicated FIR filter chips can be built to meet the rapidly increasing demand for high-performance filters for applications in high-speed communications systems and video signal processing. The self-timed nature of the design alleviates the problems associated with the clock distribution present in its synchronous counterpart.

Fig. 11 FIR filter floor plan and partial layout showing all the micropipeline stages

At a 500 MHz throughput rate, board level communication between this chip and external synchronous chips becomes a major problem. Therefore, interchip timing issues must be addressed early in the design cycle to alleviate such difficulties.

6 References

1 WESTE, N., and ESHRAGHIAN, K.: ‘Principles of CMOS VLSI design’ (Addison-Wesley, 1985)

2 UNGER, S.H., and TAN, C.-J.: ‘Clocking schemes for high-speed digital systems’, IEEE Trans., 1986, C-35, (10), pp. 880-895

3 DOBBERPUHL, D.W., WITEK, R.T., ALLMON, R., ANGLIN, R., and MONTANARO, J.: ‘A 200-MHz 64-b dual-issue CMOS microprocessor’, IEEE J. Solid-State Circ., 1992, 27, (11), pp. 1555-1567

4 SAKALLAH, K.A., MUDGE, T.N., BURKS, T.M., and DAVIDSON, E.S.: ‘Synchronization of pipelines’, IEEE Trans., 1993, CAD-12, (8), pp. 1132-1146

5 FRIEDMAN, E.G., and POWELL, S.: ‘Design and analysis of hierarchical clock distribution system for synchronous standard cell/macrocell VLSI’, IEEE J., 1986, SC-21, (2), pp. 240-246

6 VITESSE: ‘Foundry design manual’ (Vitesse Semiconductor Corporation, 1992)

7 KUNG, S.Y.: ‘VLSI array processors’ (Prentice-Hall, 1988)

8 RONCELLA, R., SALETTI, R., TERRENI, P., and PIATELLI, D.: ‘Application of a systolic macrocell-based VLSI design style to the design of a single-chip high-performance FIR filter’, IEE Proc. G, Circ. Devices Syst., 1991, 138, pp. 17-21

9 ABDEL-RAHEEM, E., EL-GUIBALY, F., and ANTONIOU, A.: ‘Systolic implementation of FIR decimators and interpolators’, IEE Proc. Circ. Devices Syst., 1994, 141, pp. 489-492

10 SUTHERLAND, I.E.: ‘Micropipelines’, Commun. ACM, 1989, 32, (6), pp. 720-738

11 LIM, Y.C., EVANS, J.B., and LIU, B.: ‘Decomposition of binary integers into signed power-of-two terms’, IEEE Trans., 1991, CS-38, (6), pp. 667-672

12 PERERA, W.A.: ‘Architecture for multiplierless fast Fourier transform hardware implementation in VLSI’, IEEE Trans., 1987, ASSP-35, pp. 1750-1760

13 BOWMAR, B.W., and HUNG, J.C.: ‘Minimum roundoff noise digital filters with some power-of-two coefficients’, IEEE Trans., 1984, CS-31, pp. 833-840

14 LIM, Y.C., and LIU, B.: ‘Design of cascade form FIR filters with discrete valued coefficients’, IEEE Trans., 1988, ASSP-36, pp. 1735-1739

15 SAMUELI, H.: ‘The design of multiplierless FIR filters for compensating the D/A converter frequency response distortion’, IEEE Trans., 1988, CS-35, (8), pp. 1064-1066

16 ESHRAGHIAN, K., SARMIENTO, R., CARBALLO, P.P., and NÚÑEZ, A.: ‘Speed-area-power optimization for DCFL and SDCFL class of logic using ring notation’, Microproc. Microprogr., 1991, 3, (2), pp. 75-6

17 MONTIEL-NELSON, J.A.: ‘Characterization and performance estimation for GaAs MESFET based classes of logic for very high speed application’. Technical report, Centro de Microelectrónica Aplicada, 1994

18 MONTIEL-NELSON, J.A.: ‘Cell synthesis and compilation in GaAs’. PhD thesis, ETSI Industriales, University Las Palmas de Gran Canaria, 1994

19 PUCKNELL, D.A., and ESHRAGHIAN, K.: ‘Basic VLSI design, silicon system engineering series’ (Prentice-Hall, 1994)
