Algorithms and Architectures for Future Wireless Base-Stations
Sridhar Rajagopal and Joseph Cavallaro, ECE Department, Rice University, April 19, 2000
This work is supported by Texas Instruments, Nokia, Texas Advanced Technology Program and NSF
4/19/00 TI Meeting 2
Overview
Future Base-Stations
Current DSP Implementation
Our Approach
– Make Algorithms Computationally Effective
– Task Partitioning for pipelining, parallelism
Processor Design for Accelerating Wireless
Evolution of Wireless Comm
First Generation: Voice
Second/Current Generation: Voice + Low-rate Data (9.6 Kbps)
Third Generation: Voice + High-rate Data (2 Mbps) + Multimedia
W-CDMA
Communication System Uplink
[Diagram: Users 1 and 2 transmit to the Base Station over direct and reflected paths; the received signal also contains Noise + MAI (multiple-access interference).]
Main Processing Blocks
Channel Estimation → Detection → Decoding
Baseband Layer of Base-Station Receiver
Proposed Base-Station (No Multiuser Detection)
TI's Wireless Basestation (http://www.ti.com/sc/docs/psheets/diagrams/basestat.htm)
Real-Time Requirements
Multiple Data Rates by Varying Spreading Factors
Detection needs to be done in real-time
– 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
Spreading Factor | Number of Bits / Frame | Data Rate Requirement
4   | 10240 | 1024 Kbps
32  | 1280  | 128 Kbps
256 | 160   | 16 Kbps
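The cycle budget on this slide follows directly from clock rate divided by data rate. A minimal sketch (ours, not from the slides) reproduces the 1953-cycle figure for a 250 MHz C6x at 128 Kbps:

```python
def cycles_per_bit(clock_hz, data_rate_bps):
    """DSP cycles available to process one bit at the given data rate."""
    return clock_hz // data_rate_bps

# 250 MHz C6x at 128 Kbps -> 1953 cycles per detected bit, as on the slide.
print(cycles_per_bit(250_000_000, 128_000))  # 1953
```

The same division gives the budget for the other rows of the table, e.g. 15625 cycles per bit at 16 Kbps.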
Current DSP Implementation
[Plot: data rates achieved (0 to 18 × 10^4) vs. number of users (9 to 15), comparing the Multiuser Detector and Matched Filter on the C67 (at 166 MHz) and on the C64 (projected, 8×), against the targeted data rate of 128 Kbps.]
Complexity
Algorithm Choice Limited by Complexity
– Multistage detection reduces the data rate by half
Main Features
– Matrix-based operations
– High levels of parallelism
– Bit-level computations
32×32 problem size for the Detector shown
Estimation, Decoding assumed pipelined
Reasons
Sophisticated, Compute-Intensive Algorithms
Need more MIPs/FLOPs performance
Unable to fully exploit pipelining or parallelism
Bit-level Computations / Storage
Our Approach
Make algorithms computationally effective
– without sacrificing error rate performance
Task Partitioning on Multiple Processing Elements
– DSPs : Core
– FPGAs : Application Specific / Bit-level Computations
Processor with reconfigurable support and extensions for wireless
Algorithms
Channel Estimation
– Avoid inversion by iterative scheme
Detection
– Avoid block-based detection by pipelining
Computations Involved
Model: r_i = A_i b_i + n_i, where r_i ∈ C^N is the received vector of spreading length N for the K users, and b_i ∈ {±1}^{2K} holds the bits of the K asynchronous users aligned at times i and i−1 (successive bits b_i, b_{i+1} overlap in time because of the users' delays).
Compute Correlation Matrices:
R_br = (1/L) Σ_{i=1}^{L} b_i r_i^H
R_bb = (1/L) Σ_{i=1}^{L} b_i b_i^T
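As a sketch of these correlation updates (pure Python, real-valued for brevity; the helper names are ours, not from the slides):

```python
def outer(u, v):
    """Outer product u v^T as a list-of-lists matrix."""
    return [[x * y for y in v] for x in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def correlations(bs, rs):
    """Accumulate R_bb (2K x 2K) and R_br (2K x N) over L training bits:
    R_bb = (1/L) sum b_i b_i^T, R_br = (1/L) sum b_i r_i^T."""
    L = len(bs)
    K2, N = len(bs[0]), len(rs[0])
    Rbb = [[0.0] * K2 for _ in range(K2)]
    Rbr = [[0.0] * N for _ in range(K2)]
    for b, r in zip(bs, rs):
        Rbb = mat_add(Rbb, outer(b, b))
        Rbr = mat_add(Rbr, outer(b, r))
    Rbb = [[x / L for x in row] for row in Rbb]
    Rbr = [[x / L for x in row] for row in Rbr]
    return Rbb, Rbr
```

For complex r_i the inner product would use the conjugate (r_i^H), as in the equations above.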
Multishot Detection
Stacking D bits gives r = A b, where the multishot matrix A ∈ C^{ND×KD} is block-banded, built from the partial matrices A_0 and A_1 (with A_i = A_0 + A_1).
Solve for the channel estimate A_i from the correlation matrices:
R_bb Â^H = R_br,  Â ∈ C^{N×2K}
Differencing Multistage Detection
Stage 0 – Matched Filter:
y_0 = Re[A^H r]
d_0 = sign(y_0)
Stage 1:
y_1 = y_0 − (A^H A − S) d_0
d_1 = sign(y_1)
Successive Stages (l ≥ 1):
x_l = d_l − d_{l−1}
y_{l+1} = y_l − (A^H A − S) x_l
d_{l+1} = sign(y_{l+1})
S = diag(A^H A)
y – soft decision
d – detected bits (hard decision)
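A pure-Python sketch of these recursions for real-valued A (our own minimal illustration, not the DSP implementation):

```python
def sign(v):
    return [1 if x >= 0 else -1 for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def multistage_detect(A, r, stages=3):
    """Differencing multistage detection: y0 = A^H r (matched filter),
    then subtract off-diagonal interference (A^H A - S) driven by the
    bit differences x_l = d_l - d_{l-1}."""
    N, K2 = len(A), len(A[0])
    Ah = [[A[i][j] for i in range(N)] for j in range(K2)]       # A^H (real A)
    AhA = [[sum(Ah[p][i] * A[i][q] for i in range(N)) for q in range(K2)]
           for p in range(K2)]
    F = [[AhA[p][q] if p != q else 0.0 for q in range(K2)]      # A^H A - S
         for p in range(K2)]
    y = matvec(Ah, r)                                           # stage 0
    d, d_prev = sign(y), None
    for _ in range(stages):
        x = d if d_prev is None else [a - b for a, b in zip(d, d_prev)]
        y = [yi - fi for yi, fi in zip(y, matvec(F, x))]
        d_prev, d = d, sign(y)
    return d
```

Once the bits stop changing, x_l is all zeros and later stages cost nothing but the (cheap) difference, which is the point of the differencing formulation.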
Iterative Scheme
Tracking: Method of Steepest Descent
Stable convergence behavior
Same Performance
Sliding-window correlation updates over the last L bits:
R_bb(L) = R_bb(L−1) − b_0 b_0^T + b_L b_L^T
R_br(L) = R_br(L−1) − b_0 r_0^H + b_L r_L^H
Gradient iteration (avoids inverting R_bb in R_bb Â^H = R_br):
Â^H_{k+1} = Â^H_k + μ (R_br − R_bb Â^H_k)
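A minimal sketch of the gradient iteration (our illustration; the step size mu and matrix sizes are arbitrary, and `A` here plays the role of Â^H solving R_bb Â^H = R_br):

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def steepest_descent(Rbb, Rbr, mu=0.1, iters=200):
    """Iterate A <- A + mu (R_br - R_bb A) toward the solution of
    R_bb A = R_br, without ever forming R_bb^{-1}."""
    rows, cols = len(Rbr), len(Rbr[0])
    A = [[0.0] * cols for _ in range(rows)]
    for _ in range(iters):
        RbbA = matmul(Rbb, A)
        A = [[A[i][j] + mu * (Rbr[i][j] - RbbA[i][j]) for j in range(cols)]
             for i in range(rows)]
    return A
```

Because R_bb changes only by a rank-two update per bit (the sliding-window formulas above), a few gradient steps per bit suffice to track the solution.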
Simulations – AWGN Channel
Detection Window = 12, SINR = 0, Paths = 3
Preamble L = 150, Spreading N = 31
Users K = 15, 10000 bits/user
MF – Matched Filter, O(K^2 N)
ML – Maximum Likelihood, O(K^3 + K^2 N)
ACT – using inversion
[Plot: comparison of Bit Error Rates (BER, 10^-3 to 10^-1) vs. Signal to Noise Ratio (SNR, 4 to 12) for MF, ActMF, ML, ActML.]
Fading Channel with Tracking
Doppler = 10 Hz, 1000 Bits, 15 users, 3 Paths
[Plot: BER (10^-3 to 10^0) vs. SNR (4 to 12) for MF and ML, static vs. tracking.]
Block Based Detector
[Diagram: each block of bits passes through the Matched Filter and Stages 1–3 as a unit; the next block (bits 12–21) cannot start until the current block (bits 2–11) has finished all stages, and adjacent blocks share their edge bits.]
Pipelined Detector
[Diagram: bits 1–12 stream continuously through the Matched Filter and Stages 1–3; each stage works on a different bit at the same time, so detection proceeds bit-by-bit instead of block-by-block.]
Task Decomposition [Asilomar99]
[Diagram: the receiver is partitioned into four pipelined blocks.
Block I – Channel Estimation (per bit): correlation updates R_br[I] O(KN), R_br[R] O(KN), R_bb O(K^2); a pilot/data MUX selects b.
Block II – Matrix products and inverse correlation matrices (per bit): A_0^H A_1, A_1^H A_1, A_0^H A_0, each O(K^2 N); solving R_bb A^H = R_br[I] and R_bb A^H = R_br[R], each O(K^2 N).
Block III – Matched Filter: A^H r, O(KND).
Block IV – Multistage Detection (per window): O(DK^2 M_e), producing the detected bits d.]
Achieved Data Rates
[Plot: data rates (0 to 3 × 10^5) vs. number of users (9 to 15) for different levels of pipelining and parallelism of tasks A and B: Sequential A + B; A B; (Parallel A) B; (Parallel A)(Pipe B); (Parallel A)(Parallel+Pipe B). Data Rate Requirement = 128 Kbps.]
VLSI Implementation
Channel Estimation as a Case Study
Area–Time Efficient Architecture
Real-Time Implementation
Bit-Level Computations – FPGAs
Core Operations – DSPs
Motivation for Architecture
Wireless, the next wave after Multimedia
Highly Compute-Intensive Algorithms
Real-Time Requirements
Outline
Processor Core with Reconfigurable Support
Permutation Based Interleaved Memory
Processor Architecture -EPIC
Instruction Set Extensions
Truncated Multipliers
Software Support Needed
Characteristics of Wireless Algorithms
Massive Parallelism
Bit-level Computations
Matrix-Based Operations
Memory Intensive
Complex-valued Data
Approximate Computations
What’s wrong with Current Architectures for these applications?
Problems with Current Architectures
UltraSPARC, C6x, MMX, IA-64
Not enough MIPs/FLOPs
Unable to fully exploit parallelism
Bit Level Computations
Memory Bottlenecks
Specialized Instructions for Wireless Communications
Why Reconfigurable
Adapt algorithms to environment
Seamless and continuous data processing during handoffs
Home Area Wireless LAN
High Speed Office Wireless LAN
Outdoor CDMA Cellular Network
Reconfigurable Support
[Diagram: OSI Layers 3–7 – user interface, translation, synchronization, transport network; OSI Layer 2 – data link layer (converts frames to bits); OSI Layer 1 – physical layer (hardware; raw bit stream).]
Different Protocols
[Diagram: transmit chain – source coding, channel coding; receive chain – channel estimation, multiuser detection, channel decoding, source decoding. Examples: MPEG-4, H.723 for voice/multimedia; convolutional and turbo codes for channel coding.]
A New Architecture
[Diagram: a processor core (GPP/DSP) with cache and prefetch queues (Q) connects through a crossbar to main memory and to reconfigurable logic; the reconfigurable logic handles the real-time I/O bit stream from the RF unit. The processor could also be packaged as an add-on PCMCIA card.]
Why Reconfigurable
Process initial bit level computations
Optimize for fast I/O transfer
[Diagram: the reconfigurable logic sits directly on the real-time I/O bit stream from the RF unit.]
Reconfigurable Support
Configuration Caches
2 × 64-bit data buses, 1 × 64-bit address bus
Control Blocks, Sequencer
Boolean values, 64-bit datapath, fast I/O
(GARP architecture at UC Berkeley)
Reconfigurable Support
Wide Path to Memory
– Data Transfer
– Minimize Load Times
Configuration Caches
– Recently displaced configurations (5 cycles)
– Can hold 4 full-size configurations
Independent Execution
Reconfigurable Support
Access to same Memory System as Processor
– Minimize overhead
When idle
– Load Configurations
– Transfer Data
Memory Interface
Access to Main Memory and L1 Data Cache
– Large, fast memory store
Memory Prefetch Queues for Sequential Accesses
– Read-aheads and write-behinds
[Diagram: the processor core (GPP/DSP), with L1 data cache and instruction cache, and the FPGA share prefetch queues (Q) and a crossbar to main memory.]
Permutation Based Interleaved Memory (PBI)
High Memory Bandwidth Needed
Stride-Insensitive Memory System for Matrices
Multiple Banks
Sustained Peak Throughput (95%)
[Diagram: the PBI system sits between the L1 data cache and main memory.]
Processor Core
64-bit EPIC Architecture with Extensions (IA-64/C6x)
Statically determined parallelism; exploit ILP
Execution Time Predictability
EPIC Principle
Explicitly Parallel Instruction Computing
Evolution of VLIW Computing
Compiler – key role
Architecture to assist the compiler
Better cope with the dynamic factors that limited VLIW parallelism
Instruction Set Extensions
To accelerate Bit level computations in Wireless
Real/Complex Integer–Bit Multiplications
– Used in Multiuser Detection, Decoding
Bit–Bit Multiplications
– Used in Outer Product Updates
– Correlation, Channel Estimation
Complex Integer–Integer Multiplications
Useful in other Signal Processing applications
– Speech, Video, …
Architecture Support
Support via Instruction Set Extensions
Minimal ALU Modifications necessary
Transparent to Register Files/Memory
Additional 8-bit Special Purpose Registers
Integer–Bit Multiplications
D = D + b·C, e.g. Cross-Correlation
[Diagram: 64-bit registers A and C feed add/subtract units into 64-bit register D; the 8-bit register b selects add or subtract per element.]
Register Renaming?
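Behaviourally, the proposed instruction reduces each ±1 multiply to a sign-controlled add. A sketch under our own encoding assumption (bit i of b set means +1, clear means −1; this is an illustration, not the actual ISA):

```python
def int_bit_mac(D, C, b):
    """D[i] += C[i] when bit i of b is 1 (+1); D[i] -= C[i] otherwise (-1).
    Models D = D + b*C using only adders, no multipliers."""
    return [d + (c if (b >> i) & 1 else -c)
            for i, (d, c) in enumerate(zip(D, C))]
```

One such instruction updates a whole row of the cross-correlation accumulator per cycle, which is why only minimal ALU modification (sign control on the adders) is needed.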
8-bit to 64-bit Conversions
D = D + b·b^T, e.g. Auto-Correlation
b1 = b(1:8), b(1:8), …, b(1:8) – the 8-bit pattern replicated eight times
b2 = b(1)b(1)…b(1), …, b(8)b(8)…b(8) – each bit replicated eight times
[Diagram: the 8-bit register b is expanded into a 64-bit register A.]
Bit–Bit Multiplications
D = D + b·b^T, e.g. Auto-Correlation
64-bit Register A = b1, 64-bit Register B = b2; an Ex-NOR gives b1·b2 in 64-bit Register C
Truth table:
b1  b2  b1·b2
0   0   1
0   1   0
1   0   0
1   1   1
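The table is just an XNOR: with ±1 bits stored as 1/0, one 64-bit Ex-NOR performs 64 bit-bit multiplications at once. A sketch of the encoding and the operation (function names are ours):

```python
def encode(bits):
    """Pack +/-1 values into the 0/1 register encoding (-1 -> 0)."""
    word = 0
    for i, b in enumerate(bits):
        if b == 1:
            word |= 1 << i
    return word

def bitbit_multiply(b1, b2, width=64):
    """Elementwise +/-1 product of two packed registers via Ex-NOR,
    matching the truth table above."""
    mask = (1 << width) - 1
    return ~(b1 ^ b2) & mask
```

For example, (+1, −1, +1) times (+1, +1, −1) is (+1, −1, −1), i.e. `bitbit_multiply(0b101, 0b011, 3)` gives `0b001`.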
Increment/Decrement
D = D + b·b^T, e.g. Auto-Correlation
[Diagram: each field of 64-bit register D is incremented or decremented by 1 according to the corresponding bit of the register holding b1·b2, producing D + b1·b2.]
Complex-valued Data Processing
Is it easy to add?
Is this worth additional ALU support?
Typically supported in software!
Truncated Multipliers
Many applications need approximate computations
Adaptive Algorithms: Y = Y + mu·(Y·C)
Truncate lower bits
Truncated Multipliers – half the area, half the delay
Can do 2 truncated multiplies in parallel with a regular multiplier
[Diagram: ALU multipliers – Multiplier 1, Multiplier 2, Truncated Multiplier.]
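A behavioural sketch of truncation (our illustration, not a gate-level model): keep only the partial products that reach the upper bits, accepting a small underestimate in exchange for roughly half the hardware.

```python
def truncated_mult(a, b, n=16):
    """Approximate the upper half (a*b) >> n of an n-bit multiply by
    summing only the partial-product bits that land at or above 2^n."""
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            acc += a >> (n - i)   # low bits of each partial product dropped
    return acc

# The result never exceeds the exact upper half, and the error stays small:
exact = (0xFFFF * 0xFFFF) >> 16
print(exact - truncated_mult(0xFFFF, 0xFFFF))  # 15
```

For adaptive updates like Y = Y + mu·(Y·C), this small bias is absorbed by the iteration, which is why approximate computation is acceptable here.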
Software Support
Greater Interaction between Compilers and Architectures
– EPIC
– Reconfigurable Logic
Compiler needs to find and exploit bit level computations
Reconfigurable Logic Programming
Other Uses
Reconfigurable Logic
– For accelerating loops of general purpose processors
Bit Level Support
– For other voice, video and multimedia applications
Software Suggestions
Limited OS Support
Compiler Efficiency – No more Assembly!
Performance Analysis Tools
Code Composer Studio 1.2
Conclusions
DSPs to play a major role in Future Base-Stations
Search for Computationally Efficient Algorithms and Better Processor Designs to meet Real-Time requirements
Reduced-Complexity Algorithms designed
Processor Core with Reconfigurable Support developed
Extra Slides
PBI Scheme
N – address length
M = 2^n banks
2^(N−n) words in each bank
To access a word:
– n-bit bank number
– (N−n)-bit address (high-order bits)
Calculation of the n-bit Bank Number
Calculate Bank Number
Use all N address bits to get the n-bit bank number: Y = A X, where A is an n × N matrix of 0s and 1s (arithmetic over GF(2))
Y = A_h X_h + A_l X_l, splitting the address into high (N−n) and low (n) parts, with A_l of rank n
Each output bit needs an N-bit parity circuit with log_k N levels of XOR gates (fan-in k)
[Diagram: the N-bit address feeds parity circuits for rows 0 … n−1 of A; the n parity bits drive a decoder that produces the 2^n bank-select signals.]
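A sketch of the bank-number computation (the matrix rows below are illustrative, not the actual A used in the memory system; note their low n bits form an identity, so A_l has rank n as required):

```python
def bank_number(address, A_rows):
    """Y = A X over GF(2): each row of A is a bit mask over the N-bit
    address; the corresponding output bit is the parity (XOR-reduce)
    of the masked address bits."""
    y = 0
    for bit, mask in enumerate(A_rows):
        y |= (bin(address & mask).count("1") & 1) << bit
    return y

# n = 2 bank bits over an N = 6 bit address; rows chosen for illustration.
rows = [0b010101, 0b101010]
print([bank_number(a, rows) for a in range(4)])  # [0, 1, 2, 3]
```

Because every address bit can influence the bank number, sequential accesses at many different strides still spread across the banks, which is the stride-insensitivity claimed for PBI.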
Interleaved Memory Model
[Diagram: an address source feeds memory banks M(0) … M(M−1) through input buffers; output buffers and a data sequencer deliver data to the data sink.]
Aspects of EPIC
Designing the Plan of Execution (POE) at Compile Time
Permitting the Compiler to play the statistics
– Conditional Branches, Memory references
Communicating the POE to the hardware
– Static Scheduling
– Branch information
Architecture Features in EPIC
Static Scheduling
– MultiOP
– Non-Unit Assumed Latency (NUAL)
The Branch Problem
– Predicated Execution
– Control Speculation
– Predicated Code Motion
The Memory Problem
– Cache Specifiers
– Data Speculation
Operation of Reconfigurable Logic
Load Configuration
– If in configuration cache, minimal time
Copy initial data with coprocessor move instructions
Start execution
Issue wait that interlocks while active
Copy registers back at kernel completion