Efficient VLSI architectures for baseband signal processing in wireless base-station receivers

Efficient VLSI architectures for baseband signal processing in wireless base-station

receivers

Sridhar Rajagopal, Srikrishna Bhashyam,

Joseph R. Cavallaro, and Behnaam Aazhang

This work is supported by Nokia, TI, TATP and NSF

Introduction

Real-time VLSI architecture for multiuser channel

estimation

Multiuser channel estimation usually neglected

– high computational complexity - DSPs infeasible

– Single user sliding correlator structures used

Iterative fixed point algorithm developed

Area-Time tradeoffs presented

– Area-Constrained,Time-Constrained, Area-Time efficient

Baseband signal processing

Base-Station Receiver

Channel estimation

Detection DecodingMultiple Users

Antenna

Detected Bits

TrackingTraining

Channel estimation

Direct Path

Reflected Path

Noise +MAI

User 1

User 2

Base Station

compensate for unknown fading amplitudes and

asynchronous delays.

Need for multiuser channel estimation

Detector performance depends on accuracy of channel

estimator

Multiuser Channel Estimation

– Jointly estimate parameters for all users

– Better performance than single user estimates

Computing multiuser channel estimates

Computed by sending a training sequence of known

bits to the receiver.

When absent, detected bits can be used to update

estimates in a decision feedback mode for tracking.

Importance of multiuser estimation usually neglected

May exceed detector complexity

rbR Hiibr

bbR Tiibb

RA*R bribb

Multiuser Channel Estimation Algorithm

= {+1, -1} : Training/Tracking bits

= 8-bit integer (complex) : Received signal

N = spreading gain (typically fixed ,e.g: 32)

K = number of users (variable, <=N)

= Maximum Likelihood channel estimate

Cr

RbN

i

2Ki

bi

ri

Ai

Implementation complexity

Matrix inversions (size 32x32) per window

Unable to meet real-time on DSPs [Asilomar’99]

VLSI fixed-point architectures for matrix inversions

– Difficult to design , Finite precision problems

Typically, simpler single-user sliding correlator

structures used.

Outline

What is multiuser channel estimation?


Implementation problems

Algorithm enhancements

VLSI architectures

– Area-constrained,Time-constrained, Area-Time efficient

Conclusions

Iterative scheme for channel estimation

Bit-streaming : suitable for tracking (window length L)

Method of gradient descent

Stable convergence behavior

Simple fixed-point VLSI architecture

T00

TLL

)1i(bb

)i(bb b*bb*bRR

H00

HLL

)1i(br

)i(br r*br*bRR

)RR*A(AA )i(br

)i(bb

)1i()1i()i(

4 5 6 7 8 9 10 11 1210

-3

10-2

10-1 Comparison of Bit Error Rates (BER)

Signal to Noise Ratio (SNR)

BER

MF ActMFML ActML

O(K2N)

O(K3+K2N)

Simulations - Static multipath channel

SINR = 0 dB

Paths =3

Preamble =150

Spreading N = 31

Users K = 15

Rayleigh Fading channel with tracking

4 5 6 7 8 9 10 11 1210

-3

10-2

10-1

100

SNR

BE

R

MF - Static MF - TrackingML - Static ML - Tracking

Doppler = 10 Kmph

Outline





VLSI architectures


Conclusions

Area-Time Tradeoffs

Design for 32 users (K) and spreading code (N) 32

Target = 128 Kbps (4000 cycles at 500 MHz).

Assume single cycle addition/multiplication

Area-Constrained Architecture :Pico-cells/fewer users

Time-Constrained Architecture : Maximum data rates

Area-Time Efficient Architecture : Real-Time

Task decomposition: channel estimation

IterateCorrelation Matrices (Per Bit)

Pilot Bits

Pilot

MUX

Detected Bits

Data

MUX

AO(4K2N,8)

Rbr

O(2KN,8)

Rbb

O(2K2,8)

TIME

ChannelEstimate

to Detector

b0

(2K,1)

Tracking Window

r0

(N,8)

b(2K,1)

r(N,8)

L

Architecture design: auto-correlation

b = {+1,-1}

Multiplication is a XNOR operation

Matrix updated using XNOR gates

Auto-correlation matrix implemented as an

UP/DOWN counter(s)

T00

TLL

)1i(bb

)i(bb b*bb*bRR

Architecture design: cross-correlation

b = {+1,-1}, r = 8-bit integer vector (complex)

Multiplications reduce to additions/subtractions

Matrix (complex) can be updated with 8-bit adders

Cross-correlation matrix stored as RAM.

H00

HLL

)1i(br

)i(br r*br*bRR

Architecture design: channel estimate

A = 8-bit integer matrix (complex)

µ << 1 : Truncated multiplication [Schulte’93]

Matrix-matrix (real-complex) multiplication of integers

Forms the bottleneck (8-bit multipliers)

Concentrate on multiplication for area-time tradeoffs!

)RR*A(AA )i(br

)i(bb

)1i()1i()i(

Area-Constrained Architecture

b0

bMUX

EN

Counter

Rbb A(i)

DEMUXMUX

MAC

Add/

SubAdd/Sub

Subtract

Subtract

A(i-1)

U/D

Load Store

ji

i j

j j

r0r

b

b0

16

8

8

88

8 8

1

11

1

1

1

1

1

1

88

88

Rbr

>>8

816

T00

TLL

)1i(bb

)i(bb b*bb*bRR

H00

HLL

)1i(br

)i(br r*br*bRR

)RR*A(AA )i(br

)i(bb

)1i()1i()i(

Channel Estimate

Area-constrained Architecture: Hardware Requirements

Blocks Quantity Full AdderCells

Complex Total

Counter 1*8 8 - 8

Multiplier 1*8 64 *2 128

Adders 3*8 + 2*16 56 *2 112

Total Area 248FA cells

Total Time(N=K=32)

4K2N 128,000cycles

Time-constrained Architecture

b*bT

b0*b0T

b

b0

MUX

Rbr

M

UX

r

r0

MUX

Rbb A

Mult

Subtract >>

Subtract

2K*12K*1

2K*1 K(2K-1)*1

K(2K-1)*1

2K2*8

2KN*16

2KN*162KN*8

2K*1

N*8

N*8

N*8

2KN*8

2KN*8

ChannelEstimate

T00

TLL

)1i(bb

)i(bb b*bb*bRR

H00

HLL

)1i(br

)i(br r*br*bRR

)RR*A(AA )i(br

)i(bb

)1i()1i()i(

Auto-correlation Update in Parallel

Rbb(i,j)

Counter

bbT(i,j)U/D#

Rbb(i,i)

Counter

1

U/D#

Array of XNORs

a·b a·c a·d

b·c b·d

c·d

b c da

b (2K)

bbT (K*{2K-1}*1)Rbb (2K2*8)

Array of Counters

T00

TLL

)1i(bb

)i(bb b*bb*bRR

Cross-Correlation Update in Parallel

b c da

b (2K*1)

r (N*8)

Rbr (2KN*8)

r(j)

Rbr(i,j)

Adder

b(i)Add/Sub#

8 8

1

H00

HLL

)1i(br

)i(br r*br*bRR

Time-constrained Architecture: Hardware Requirements


Complex Total

Counter 2K2*8 16K2 - 16K2

Multiplier 4K2N*8 256K2N *2 512K2N

Adders 2KN*16 +2KN*8 +4K2N*16

48KN +64K2N

*2 96KN +128K2N

Total Area(N=K=32)

20,000,000FA cells

Total Time Log2(2K) 6 cycles

Area-Time efficient architecture design

Area - constrained Architecture

– Minimize area - single 8-bit multiplier

– 4K2N cycles (128,000 cycles ; 3.81 Kbps)

Time-constrained Architecture

– Minimize time - 4K2N 8-bit multipliers

– Log2(2K) cycles (6 cycles ; 83.33 Mbps)

Aim : To meet real-time with min. area overhead

Different parallelism levels for multipliers

Area-Time Efficient Architecture

b*bT b0*b0T

b b0

MUX

M

UX

r

r0

MUX

Mult

Subtract >>

Subtract

2K*1 2K*1

2K*12K*1

2K*1 2K*8

2K*8

1*16

1*161*8

1*1

1*8

N*8

N*8

1*8

Rbr

Counters

StoreLoad

Rbb

A(i)

DEMUXMUX

A(i-1)

1*8

Adder

1*8

2K*1

2K*8

2K*8

T00

TLL

)1i(bb

)i(bb b*bb*bRR

H00

HLL

)1i(br

)i(br r*br*bRR

)RR*A(AA )i(br

)i(bb

)1i()1i()i(

Channel Estimate

Area-Time Efficient Architecture: Hardware Requirements


Complex Total

Counter 2K*8 16K - 16K

Multiplier 2K*8 128K *2 256K

Adders 2K*16 +2*8 + 1*16

32K + 32 *2 64K + 64

Total Area(N=K=32)

10,000FA cells

Total Time 2KN 2,000cycles

Outline





VLSI architectures


Conclusions

Comparisons

Implementation ClockRate

Full AdderCells

Data Rates

C67 DSP 166 MHz - 1.02 KbpsArea 500 MHz 248 3.81 Kbps

: : : :Area-Time 500 MHz 104 256 Kbps

: : : :

Time 500 MHz 2x107 83.33 Mbps

DSPs unable to exploit bit-level parallelismInefficient storage of bitsReplacing multiplications by additions/subtractions

Scalability of Architectures with K

Disadvantages of VLSI architecturesDesign for maximum number of users in the systemIf there are fewer users,

– Turn off functional units to reduce power

– Reconfigure hardware for higher data rates (FPGA)

– Dr. Cavallaro, don’t know to handle this Question properly

– We never designed an architecture/algorithm for varying number of users dynamically. (Though we had started on it)

– What should be included in future work?

– Please give suggestions!!

Conclusions

Real-Time VLSI architecture for multiuser channel

estimation

Iterative fixed-point algorithm developed to avoid matrix

inversions

Area-Time Tradeoffs presented

– Area-Constrained, Time-Constrained, Area-Time efficient

VLSI architectures exploit bit-level computations and

parallelism to meet real-time.

Documents

Efficient VLSI architectures for baseband signal processing in wireless base-station receivers