Block Processing Engine for High-Throughput Wireless Communications

Daniele Lo Iacono, Julien Zory, Ettore Messina, Nicolo' Piazzese
Advanced System Technology Group

STMicroelectronics
(daniele.loiacono, julien.zory, ettore.messina, nicolo.piazzese)@st.com

Abstract - This paper presents the Block Processing Engine (BPE), a programmable architecture specifically suited for high-throughput wireless communications. Thanks to a high degree of parallelism and a consistent use of pipelined processing, the BPE can satisfy the stringent real-time constraints imposed by emerging technologies. Its efficiency has been proven through the implementation of a dual standard frequency domain equalizer supporting 3GPP HSDPA and IEEE 802.11a.

Keywords: ASIP, VLIW, WLAN, HSDPA, FDE.

I. INTRODUCTION

Wireless communications are rapidly evolving toward a large variety of systems offering high data-rate multimedia services. Next-generation terminals are required to be more and more powerful and still flexible enough to support variations on emerging standards, or even multiple standards at once. Architectures devoted to base-band processing must sustain high computational loads while being able to switch between different algorithms or systems on demand. In this context, traditional approaches such as ASIC and DSP reveal some limitations. ASIC design exhibits high efficiency in terms of computational capacity, power consumption and real-time processing; however, its limited flexibility strongly compromises its lifetime, which is inadequate to the rapid evolution of the technology. Recent DSPs, although powerful and flexible, are still not capable of sustaining the data rates imposed by most of the emerging systems, whose requirements are growing even faster than DSP technology itself [1][2]. Despite its limitations, ASIC design is therefore often considered the only way to achieve high throughput. In this scenario, emerging Application-Specific Instruction-set Processors (ASIPs) appear to be an attractive way of satisfying real-time requirements while assuring a certain degree of flexibility. Exempt from supporting general-purpose applications, ASIPs can make use of a reduced set of dedicated instructions with a high level of parallelism, hence allowing efficiencies comparable to those of ASICs [2].

In this paper we present the Block Processing Engine (BPE), a platform embedding a programmable processor capable of simultaneously addressing a set of customized hardware modules specifically suited for wireless processing. To prove the potential of the BPE, a dual standard Frequency Domain Equalizer (FDE) supporting WCDMA/HSDPA and OFDM/WLAN 802.11a has been implemented. Latency evaluation and synthesis results demonstrate the capability of running both systems in real time at a reasonable core frequency.

II. ARCHITECTURE DESCRIPTION

A. General overview

The main concept of the Block Processing Engine (BPE) is the capability of arbitrarily associating a certain number of dedicated Processing Units (PUs) to execute computationally intensive signal processing on complex streams of data.

The block diagram of the BPE is depicted in Figure 1. The core function is a programmable controlling unit (hereafter called µC) which implements a basic Instruction-Set (IS), mainly devoted to flow control, and a dedicated IS acting on a set of PUs specifically designed for base-band wireless processing. The µC is in charge of executing the basic instructions, scheduling the dedicated instructions on the PUs and managing the data flow between the PUs and the memory sub-system.

Figure 1. Block Processing Engine Block Diagram.

To match the stringent real-time constraints imposed by the emerging high-throughput systems, the BPE must be capable of properly allocating the available resources (PUs) to perform parallel or pipelined processing on complex vectors of data.

A high degree of parallelism can be achieved by supporting the concurrent execution of groups, or bundles, of dedicated instructions. Pipelined processing requires the capability of linking the PUs to form a processing chain. To support both concurrent and pipelined processing, the BPE embeds a set of routers which can be properly configured by the µC according to the specific program flow.

To reduce the latency overhead due to the execution of multiple scalar operations on groups of data, dedicated instructions can directly manipulate arrays. This requires an adequate memory sub-system, which consists of a bank of concurrently accessible high-speed SRAM memories. The use of vectorial instructions has the side benefit of simplifying the program implementation and considerably shortening its size.

The µC is mounted as a slave peripheral on the system bus, and can be controlled by any external entity, hereafter called Communication Master (CM), having access to the bus. Using the µC internal registers, the CM can perform requests such as loading/saving data from/to any data memory of the memory bank, loading programs into the program memory, or commanding the execution of a program. Moreover, the CM is allowed to update the content of some µC internal registers, which are used as parameters by the program.

To implement different standards or different procedures within the same standard, the BPE program memory is logically divided into banks where the different programs are stored. The µC supports context switching by properly saving the status of the current program before initiating a new execution.

B. Processing Units and dedicated instructions

As stated above, the µC is in charge of scheduling the PU activities on the basis of a program loaded by the CM into the program memory. From the µC side, the PU set is intended as a collection of objects (with a given granularity), each embedding a set of functions performing specific actions. Since parallelism can be activated at PU level only, functions belonging to the same PU cannot be executed concurrently; they are considered by the µC as part of the PU private instruction sub-set. It can be argued that the trade-off between processing speed and flexibility strongly depends on the granularity of the PUs as well as on the adequate splitting of potentially concurrent functions among different PUs. Properly choosing the PU granularity helps to identify, within each PU, the possibility of being re-used for different applications.

Each assembly-like dedicated instruction has the format:

PUN.OPC[.OPM] [BS] [DX[,DY,DZ]] DO

PUN indicates the specific PU number, OPC indicates the operation code of the PU private instruction, OPM indicates the operation mode for the specific code, BS is the block (vector) size when processing complex arrays, while DX, DY, DZ and DO are the µC registers holding the information to retrieve the three input operands and the output operand respectively.

The operation mode can be considered as an additional parameter specifying the way the particular function has to be performed. The instruction operands can be either real or complex scalars (register file) or vectors (memory bank).

For instance, the dedicated instruction below uses the ALU (Arithmetic Logic Unit) processing unit to perform the complex multiplication (MUL) between an array of size B0 stored in memory M0 and a number stored in the µC data register R1:

ALU.MUL.RND B0 M0,R1 M1

The output, rounded to 16 bits according to the specified operation mode (RND), is stored into memory M1.
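
To make the field layout concrete, the short Python sketch below parses instructions written in the format above. It is purely illustrative: the regular expression and field names are ours, not part of the BPE toolchain, and the split between BS and the operands is ambiguous without knowing each PU's signature.

import re

# Illustrative parser for the dedicated-instruction format
#   PUN.OPC[.OPM] [BS] [DX[,DY,DZ]] DO
# This is a sketch of the syntax only; it is not the BPE assembler.
INSTR = re.compile(
    r"^(?P<pun>\w+)\.(?P<opc>\w+)(?:\.(?P<opm>\w+))?"   # unit, opcode, optional mode
    r"(?:\s+(?P<bs>\w+))?"                              # optional block (vector) size
    r"(?:\s+(?P<src>\w+(?:,\w+){0,2}))?"                # up to three input operands
    r"\s+(?P<dst>\w+)$"                                 # output operand
)

def parse(instruction):
    m = INSTR.match(instruction.strip())
    if m is None:
        raise ValueError("not a dedicated instruction: " + instruction)
    fields = m.groupdict()
    fields["src"] = fields["src"].split(",") if fields["src"] else []
    return fields

# Example from the text: complex vector multiply with rounding
parse("ALU.MUL.RND B0 M0,R1 M1")
# -> {'pun': 'ALU', 'opc': 'MUL', 'opm': 'RND', 'bs': 'B0', 'src': ['M0', 'R1'], 'dst': 'M1'}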

Apart from the output operand, each PU returns some status information through the µC status registers. Commonly these registers are used to count overflow occurrences when performing array processing, and can be useful to detect, for instance, quantization problems within the processing chain.

C. Macros definition

On top of classical concurrent processing, the BPE allows a sub-set of PUs to be directly linked to implement more complex functions, otherwise called macros. A macro is a way of pipelining the processing among the PUs without having to perform intermediate accesses to the memories, which has considerable advantages in terms of throughput and power consumption. It must be noted that since intermediate results of a macro are directly passed to the subsequent PUs, they are no longer available after the execution of the macro.

To associate a sub-set of PUs with a macro, each PU must be explicitly linked using a special µC register (LX). For instance, the following macro adds the scalar complex value R0 to all the elements of the complex conjugate of the FFT of the vector residing in memory M0:

M1 = R0 + CNJ(FFT(M0))

It can be implemented by linking the PUs as follows:

FTU.FFT B0 M0 LX
CLU.CNJ B0 LX LX
ALU.ADD.SAT B0 LX,R0 M1
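
In conventional array notation, the linked chain above computes the following; this is a numpy sketch of the result, not of the hardware execution (the block size of 64 is an arbitrary assumption and the saturation of ALU.ADD.SAT is ignored).

import numpy as np

M0 = np.random.randn(64) + 1j * np.random.randn(64)   # input vector (assumed B0 = 64)
R0 = 0.5 - 0.25j                                       # scalar complex register value

# FTU.FFT -> CLU.CNJ -> ALU.ADD, chained without intermediate memory accesses
M1 = R0 + np.conj(np.fft.fft(M0))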

Figure 2 shows the latency reduction when using a macro.

Figure 2. Latency of normal (A) and linked (B) executions.

D. Instruction bundles

To perform concurrent processing, the program flow is broken into bundles of instructions that are executed at once. Each bundle can be formed by either basic instructions only (B-bundle) or dedicated instructions only (D-bundle).

Since there is no instruction parallelism within the basic IS, a B-bundle merely indicates the possibility of pipelining the execution of a group of basic instructions when there is no direct dependency among them, as depicted in Figure 3.

Figure 3. Difference between normal (A) and B-bundle (B) execution.

Unlike the B-bundle, the D-bundle fully exploits the BPE parallelism. Instructions or even macros within a D-bundle are executed concurrently. It is worth noting that when fetching a D-bundle, the µC considers each macro as a single instruction to be executed concurrently with the others. As for VLIW processors, concurrency must be explicitly indicated within the program [1]. An example of D-bundle is shown in Figure 4.

CGU.SCR.INI B0 D1,D2
FTU.FFT B0 M0 LX        (macro)
CLU.CNJ B0 LX LX        (macro)
ALU.ADD.SAT B0 LX,R0 M1 (macro)
HTU.FHT B1 M2 M3

Figure 4. Example of D-bundle and related execution latency.

E. Considerations on complexity and speed

Compared to a pure ASIC implementation, the BPE suffers the major drawback of inherently introducing an overhead in terms of both complexity and speed. The impact of the additional logic, mainly the µC and the routers, on the overall design needs to be carefully investigated.

Figure 5. Synthesis results for the µC as a function of the number of PUs.

The complexity of the µC strictly depends on the maximum number of PUs that have to be simultaneously activated (i.e. the D-bundle size). In terms of processing speed, the maximum working frequency of the µC, which again depends on the size of the D-bundle, can represent a limiting factor when embedding high-speed PUs within the BPE.

Figure 6. Routers synthesis results as a function of the number of PUs.

Figures 5 and 6 summarize the synthesis results for the STM 90nm CMOS low-power technology as a function of the number of PUs. The µC complexity growth is almost linear, while the maximum frequency is still reasonably high even when using a large number of PUs. The steeper inner-router complexity growth can be justified by considering that each PU requires an additional router, whose size increases with the number of PUs.

III. RE-CONFIGURABLE HSDPA/WLAN EQUALIZER

A. System description

As a case study, the BPE has been used to implement a re-configurable equalizer supporting CDMA and OFDM signals; it is more specifically capable of demodulating WLAN 802.11a as well as multi-code UMTS/HSDPA signals. In order to fully exploit the re-usability aspects of the BPE, the equalization has been performed in the frequency domain. While FDE is commonly employed for multi-carrier systems, its use on single-carrier wideband systems has been investigated only in the last few years [3][4]. Recent studies have demonstrated that FDE for WCDMA systems is not only feasible, but even less complex than the time-domain approach [5][6][7].

Figure 7. Frequency Domain Equalizer (FDE) block diagram.

The block diagram of Figure 7 shows the HSDPA/WLAN dual equalizer; it clearly highlights the possibility of re-using most of the components. HSDPA needs additional units strictly related to CDMA systems, such as the scrambling code generation and the de-spreader (here implemented through the Hadamard transform to support multi-code detection).

The MMSE frequency domain equalization is performed using the following element-wise equation:

B = H* · R / (|H|² + σ²)    (1)

where R, H and B are the Fourier transforms of the received signal, the channel impulse response and the estimated transmitted signal respectively, * denotes complex conjugation, and σ² represents the noise variance.
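
A per-bin numpy sketch of equation (1) is given below; fixed-point scaling and the mapping onto BPE instructions are omitted.

import numpy as np

def mmse_fde(R, H, noise_var):
    # Element-wise MMSE frequency-domain equalization, equation (1):
    # R and H are the FFTs of the received block and of the (zero-padded)
    # channel impulse response, noise_var is the estimated noise variance.
    return np.conj(H) * R / (np.abs(H) ** 2 + noise_var)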

The channel response, which in the block diagram of Figure 7 has been assumed to be already available in the frequency domain, strictly depends on the system. For OFDM signals the estimation is performed in the frequency domain, by properly correlating the FFT of the received signal with the frequency-domain long preamble. For WCDMA it is first evaluated in the time domain using the available pilot channel and then transformed into the frequency domain for subsequent equalization. It has to be noted that although frequency-domain channel estimation would be possible even for WCDMA systems, the complexity would not benefit, since it would require providing the Fourier transform of the scrambling code, which is normally generated in the time domain.
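
The two estimation paths can be sketched as follows in numpy; the function names, the absence of averaging over several pilot symbols, and the correlation windows are our assumptions, not the paper's implementation.

import numpy as np

def wlan_channel_estimate(rx_long_symbol, preamble_freq):
    # 802.11a: correlate in the frequency domain with the known long preamble.
    # preamble_freq holds +/-1 (0 on unused carriers), so multiplying by its
    # conjugate is the frequency-domain correlation described above.
    return np.fft.fft(rx_long_symbol) * np.conj(preamble_freq)

def wcdma_channel_estimate(rx, pilot, fft_size):
    # WCDMA: correlate with the pilot in the time domain, then move the
    # truncated impulse-response estimate to the frequency domain.
    h_time = np.correlate(rx, pilot, mode="valid") / len(pilot)
    return np.fft.fft(h_time[:fft_size], n=fft_size)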

In OFDM systems the FFT size corresponds to the number of sub-carriers, and is thus given by the standard (64-point for IEEE 802.11a). For CDMA systems, it can be considered as a performance parameter mainly depending on the channel delay spread. In fact, the equalization of adjacent blocks requires applying overlap-and-save in order to reduce the boundary errors due to the absence of a cyclic prefix. Since the overlap-and-save factor is fixed by the delay spread, the percentage of useful symbols per block, and ultimately the throughput, can only be adjusted by acting on the FFT size. A good trade-off between throughput and complexity is to use a 256-point FFT with an overlap-and-save factor of 16 chips [7].
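
A numpy sketch of the overlap-and-save block FDE described above follows; the exact edge handling (how many chips are discarded on each side) is an assumption, and reference [7] should be consulted for the actual scheme.

import numpy as np

def fde_overlap_save(rx, h_freq, noise_var, nfft=256, overlap=16):
    # Each nfft-chip block overlaps its neighbours by `overlap` chips on both
    # sides; the edges are discarded after equalization, leaving
    # nfft - 2*overlap useful chips per block (assumed edge handling).
    step = nfft - 2 * overlap
    out = []
    for start in range(0, len(rx) - nfft + 1, step):
        R = np.fft.fft(rx[start:start + nfft])
        B = np.conj(h_freq) * R / (np.abs(h_freq) ** 2 + noise_var)   # equation (1)
        out.append(np.fft.ifft(B)[overlap:nfft - overlap])            # keep the centre
    return np.concatenate(out)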

After FDE, the WCDMA/HSDPA multi-code de-spreading is performed by first descrambling the time-domain equalized signal with the cell-specific Gold code, and then evaluating the Nc-point FHT (Fast Hadamard Transform) of the de-scrambled signal to simultaneously demodulate all the spreading codes associated with the spreading factor Nc [7].
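
A numpy sketch of this de-spreading step is shown below; note that a natural-order Hadamard transform returns the code correlations in Walsh (Hadamard) index order, which differs from the OVSF code numbering, and fixed-point details are ignored.

import numpy as np

def fht(x):
    # Radix-2 Fast Hadamard Transform (length must be a power of two).
    x = np.asarray(x, dtype=complex).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

def despread_multicode(eq_time, scrambling, Nc=16):
    # Descramble the equalized chips, then take one Nc-point FHT per symbol
    # period to demodulate all Nc codes of spreading factor Nc at once.
    chips = eq_time * np.conj(scrambling)
    n_sym = len(chips) // Nc
    return np.array([fht(chips[k * Nc:(k + 1) * Nc]) for k in range(n_sym)])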

B. Equalizer implementation

Starting from the block diagram of Figure 7, the first step toward the BPE implementation of a dual-mode equalizer is to identify the minimum set of PUs that allows performing the overall processing.

The key PU is the FTU (Fourier Transform Unit), which must be capable of performing FFTs of different sizes as well as switching between FFT and IFFT on demand. The basic architecture of the FTU is depicted in Figure 8. It consists of a single Radix-4 butterfly, serially re-used to perform FFT/IFFT of different sizes. To reduce the processing time, the data RAM and the twiddles ROM are divided into banks, in such a way that for each butterfly calculation both input data and twiddle factors are available in one clock cycle. Left and right RAM banks are used for even and odd stages of the FFT processing respectively [7].

Figure 8. Fourier Transform Unit (FTU) basic architecture.
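
The FTU reuses one radix-4 butterfly serially over the banked memories; the recursive numpy sketch below reproduces the same radix-4 decimation-in-time decomposition functionally (it is not the hardware schedule, and it assumes lengths that are powers of 4, such as 64 and 256).

import numpy as np

def fft_radix4(x):
    # Radix-4 DIT FFT; the four-term combination below is the butterfly that
    # the FTU evaluates once per cycle with banked data RAM and twiddle ROM.
    N = len(x)
    if N == 1:
        return np.asarray(x, dtype=complex)
    X0, X1, X2, X3 = (fft_radix4(x[r::4]) for r in range(4))
    w1 = np.exp(-2j * np.pi * np.arange(N // 4) / N)      # twiddle factors
    t0, t1, t2, t3 = X0, w1 * X1, w1**2 * X2, w1**3 * X3
    return np.concatenate([t0 + t1 + t2 + t3,
                           t0 - 1j * t1 - t2 + 1j * t3,
                           t0 - t1 + t2 - t3,
                           t0 + 1j * t1 - t2 - 1j * t3])

x = np.random.randn(64) + 1j * np.random.randn(64)        # 802.11a size
assert np.allclose(fft_radix4(x), np.fft.fft(x))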

The Code Generation Unit (CGU) generates portions of the UMTS cell-specific scrambling code. The CGU embeds an initialization instruction used to store a certain number of generator seeds, corresponding to different delays within the code period, into the generator internal memory, thus allowing fast initialization when a delayed code has to be generated. Since the granularity of the seed memory strongly affects the latency introduced when generating the code, it must be chosen on the basis of a proper trade-off between memory occupancy and the minimum acceptable latency. Apart from generating the code with an arbitrary delay, the CGU is equipped with a special instruction that packs shifted replicas of the code into 16-bit complex words to speed up the subsequent correlations. In fact, the CGU is designed to work side-by-side with the Correlations Bank Unit (CBU), which is mainly used for channel estimation and embeds 16 accumulators providing up to 16 correlation values per clock cycle.
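
The seed-table idea can be illustrated with the generic LFSR sketch below; the feedback taps, the seed stride and the ±1 output mapping are placeholders (the actual UMTS scrambling code generator is specified in 3GPP TS 25.213), so this only shows how stored seeds trade memory for initialization latency.

import numpy as np

TAPS = (18, 7)        # placeholder feedback taps, not the 3GPP polynomials
SEED_STRIDE = 2560    # assumed seed granularity, e.g. one seed per slot of chips

def lfsr_step(state):
    # One step of a Fibonacci LFSR over GF(2): returns (output bit, next state).
    fb = 0
    for t in TAPS:
        fb ^= (state >> (t - 1)) & 1
    return state & 1, (state >> 1) | (fb << (TAPS[0] - 1))

def build_seed_table(init_state, length, stride=SEED_STRIDE):
    # Precompute the generator state every `stride` chips, as the CGU seed memory does.
    table, state = [], init_state
    for i in range(length):
        if i % stride == 0:
            table.append(state)
        _, state = lfsr_step(state)
    return table

def code_chunk(seed_table, delay, length, stride=SEED_STRIDE):
    # Jump to the nearest stored seed, clock forward the residual offset,
    # then emit `length` chips mapped to +/-1.
    state = seed_table[delay // stride]
    for _ in range(delay % stride):
        _, state = lfsr_step(state)
    chips = np.empty(length)
    for i in range(length):
        bit, state = lfsr_step(state)
        chips[i] = 1 - 2 * bit
    return chips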

Multi-code HSDPA demodulation is performed using theHadamard Transform Unit (HTU), whose architecture is basedon the classical Radix-2 algorithm.

The remaining operations, such as the MMSE coefficient calculation, are carried out using two ad-hoc Arithmetic Logic Units (ALUs) embedding arithmetic operations on complex arrays. It is worth noting that the coexistence of two different ALUs is crucial to minimize the latency of the overall processing. In fact, since the parallelism of the BPE is at PU level only, concurrency between arithmetic instructions, or even the implementation of macros, would not be possible using a single ALU embedding all the operations.

With all the above PUs available, the implementation of the dual-mode equalizer consists of writing both the HSDPA and WLAN programs using the assembly-like language of the BPE. To increase flexibility and to allow quick testing of different scenarios, most of the system parameters are mapped on µC registers that can be updated at run-time by the CM. Typically, these registers hold dynamic parameters passed by upper layers or by different functional blocks of the physical layer, but they can also be reserved to support future modifications or evolutions of the standard. For instance the scrambling code number, the noise variance, the number of DATA fields in a WLAN frame, but also the HSDPA spreading factor, the WLAN pilot carrier number and positions, or even the FFT size can be considered as parameters passed to the program at execution time.

C. Performance evaluation

The RTL implementation of the HSDPA/WLAN equalizer has been validated against test vectors generated from a bit-true SystemC reference model. Anticipating the integration of WLAN support into UMTS/HSDPA mobile terminals, the equalizer has been tested using a multi-path fading channel model (ITU Pedestrian B), as specified by the 3GPP performance requirements for HSDPA [8].

Figure 9 shows the BER curves of the dual-mode FDE when using un-coded QPSK modulation for both WLAN and HSDPA. The WLAN frame holds 23 DATA fields, while for HSDPA a spreading factor of Nc = 16 is used.

Figure 9. Performance of the dual-mode HSDPA/WLAN equalizer.

Figure 10. Example of PUs scheduling for WLAN equalization.

D. Real-time processing

As discussed above, real-time processing capabilities depend on the way the dedicated instructions are scheduled during the different stages of the processing. Given the inherent nature of wireless processing, it can be expected that the major contribution to the latency reduction comes from the use of macros.

Figure 10 shows the timing diagram of the PU activity when performing WLAN equalization on a frame containing only two DATA fields. Most of the processing time is absorbed by the FFT calculation, which is always linked to the ALU in a macro. It must be noted that since the µC allocates each PU when fetching the D-bundle, the ALU within the macro is forced to stall until the output of the FFT is available.

Finally, it is worth noting that the latency due to channel estimation averaging and MMSE coefficients calculation is considerably shortened by the joint use of two separate ALUs.

The latency to demodulate an un-coded QPSK HSDPA slot, made of channel estimation, FDE, descrambling and multi-code de-spreading, is about 36K clock cycles. When using a working frequency of 200 MHz, the entire demodulation lasts about 180 µs (i.e. about 27% of the entire slot duration). Similarly, the demodulation of an un-coded QPSK IEEE 802.11a frame with 23 DATA fields takes about 5K cycles, corresponding to 25 µs (about 30% of the frame duration). These results largely confirm the possibility of running both systems in real time.
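
As a quick sanity check of these figures (cycle counts and clock frequency as quoted above):

f_clk = 200e6                      # core clock quoted above
print(36e3 / f_clk * 1e6, "us")    # HSDPA slot: 180 us, about 27% of the 666.7 us slot
print(5e3 / f_clk * 1e6, "us")     # 802.11a frame, 23 DATA fields: 25 us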

IV. CONCLUSIONS

We have presented the Block Processing Engine (BPE), a programmable architecture suited for base-band processing. To prove its efficiency, the BPE has been used to implement a dual-standard HSDPA/WLAN frequency-domain equalizer. Synthesis results and latency evaluation have demonstrated the capability of running both standards in real time. The high degree of flexibility and the possibility of sharing the processing resources suggest using the BPE to implement next-generation terminals supporting multiple high-throughput standards.

REFERENCES

[1] J. Eyre, J. Bier, "The Evolution of DSP Processors", IEEE Signal Processing Magazine, Mar. 2000.
[2] J. R. Cavallaro, P. Radosavljevic, "ASIP Architecture for Future Wireless Systems: Flexibility and Customization", 11th Wireless World Research Forum, Oslo, Jun. 2004.
[3] D.D. Falconer, S.L. Ariyavisitakul, A.B. Seeyar, B. Edison, "Frequency Domain Equalization for Single-Carrier Broadband Wireless Systems", IEEE Comm. Magazine, vol. 40(4), pp. 58-66, Apr. 2002.
[4] D.D. Falconer, S.L. Ariyavisitakul, "Broadband Wireless Using Single Carrier and Frequency Domain Equalization", IEEE 5th Symposium on Wireless Personal Multimedia Comm., vol. 1, pp. 27-36, Oct. 2002.
[5] I. Martoyo, T. Weiss, F. Capar, F.K. Jondral, "Low Complexity CDMA Downlink Receiver Based on Frequency Domain Equalization", IEEE 58th Vehicular Technology Conference, vol. 2(6-9), pp. 987-991, Oct. 2003.
[6] J. Pan, P. De, A. Zeira, "Low Complexity Data Detection Using Fast Fourier Transform Decomposition of Channel Correlation Matrix", IEEE Global Telecom. Conference, vol. 2, pp. 1322-1326, Nov. 2001.
[7] D. Lo Iacono, E. Messina, C. Volpe, A. Spalvieri, "Serial Block Processing for Multi-Code WCDMA Frequency Domain Equalization", IEEE WCNC 2005, New Orleans, Mar. 2005.
[8] 3GPP TS 25.101, "User Equipment (UE) Radio Transmission and Reception (FDD)", 3GPP, Technical Specification Group RAN.
[9] IEEE Std 802.11a-1999, Part 11, "Wireless LAN MAC and PHY Layer Specifications", IEEE.
