RICE UNIVERSITY DSPs for 4G wireless systems Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro and Behnaam Aazhang This work has been supported by Nokia,

RICE UNIVERSITY

DSPs for 4G wireless systems

Sridhar Rajagopal, Scott Rixner,

Joseph R. Cavallaro and Behnaam Aazhang

This work has been supported by Nokia, TI, TATP and NSF

RICE UNIVERSITY

Motivation

Wireless Mobiledevice

BasebandProgrammable

CommunicationsProcessor

RF UnitA/DD/A

•Mobile: Switch between standards and between parameters

•Base-station: varying no. of users with different parameters

Programmability - flexibility is good

RICE UNIVERSITY

The problem

Processor Type Algorithms Data rate targets Constraints

Mobile W-CDMA, W-LAN 1Mbps, 100Mbps/#users Time,Power,AreaBase-station W-CDMA 4 Mbps Time, maybe areaBase-station W-LAN 100 Mbps Time, maybe area

GPP

DSP

FPGA

VLSI

Performance Flexibility

Best architecture for Power, Area constraints ????

RICE UNIVERSITY

An approach for the solution

Algorithms well understood at data-flow level

Can design real-time systems in VLSI.

Pushing implementation higher in the chain

Current DSPs not powerful enough for our application

Use an architecture simulator to design our own

RICE UNIVERSITY

Proposed solution

Current solutions to meet real-time(Racks of DSPs)

ProgrammableProcessor for4G wireless

systems

< x cm

< x cm

Future wireless architecturesx = 2.5 (W-CDMA BS)x = 2.0 (W-LAN BS)x = 1.5 (Mobile Handset)

JOE

RICE UNIVERSITY

Past work

Algorithms

DSP

VLSI

FPGA

IMAGINE

Multiuser channel estimationMultiuser detection

Task-partitioningParallelism Pipelining

Conventional arithmeticOn-line arithmetic

Architecture innovationsFunctional unit design and usage

DistantPast

RecentPast

Recent andNear Future

Sys

tem

Des

ign

RICE UNIVERSITY

Contents

Motivation

The “Imagine” simulator

Parallel algorithms for estimation/detection/decoding

Performance comparisons and results

RICE UNIVERSITY

The IMAGINE architecture

Stream Register FileNetworkInterface

StreamController

Imagine Stream Processor

HostProcessor

Net

wor

k

AL

U C

lust

er 0

AL

U C

lust

er 1

AL

U C

lust

er 2

AL

U C

lust

er 3

AL

U C

lust

er 4

AL

U C

lust

er 5

AL

U C

lust

er 6

AL

U C

lust

er 7

SDRAMSDRAM SDRAMSDRAM

Streaming Memory System

Mic

roco

ntr

olle

r

RICE UNIVERSITY

Why IMAGINE simulator?

RSIM, SimpleScalar: GPP simulators

Great for media processing algorithms

Has a VLIW-based cluster -- DSP comparisons A good base architecture : 1024-pt FFT

Processor Type Area Time Frequency Power Energy

Imagine[Float] 2.5 cm2 7.4 s 500 MHz 3.8 W 28 JTI C6711[Float] - 138 s 150 MHz 1.3 W 180 JTI C6411[Fixed] - 40 s 300 MHz 0.25 W 10 JVirtex II [Fixed] - 2 s 125 MHz <1 W <2 J

RICE UNIVERSITY

Simulator knobs that we can turn

Cycle-accurate simulator

Varying number of Functional units and their design

Varying memory, register sizes

Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …

Almost anything can be changed, some changes easier than others!

RICE UNIVERSITY

Caveats

2 level C++ programming

StreamC:

• transfers streams of data between main memory and stream register file (SRF)

KernelC:

• transfers streams from the SRF to the ALU clusters

Code optimized to the number of ALU clusters and the size of the data

Compiler not yet fully developed

RICE UNIVERSITY

Contents

Motivation




RICE UNIVERSITY

Typical workload representation (Base-station)

Equalization? FFT Viterbi decoding

Multiuser channel estimation Multiuser detection Viterbi decoding

Turbo decoding Multiple antenna systems (MIMO)

Wireless LAN

W-CDMA

Advanced receiver schemes

RICE UNIVERSITY

Parallel estimation/detection/decoding

Multiuser estimationreplaced matrix inversion by gradient descent

Multiuser detectionParallel Interference Cancellation (PIC)Pipelined algorithm that avoids block-based

detection

Viterbi decodingTrellis structures suited for decodingRegister exchange for survivor memoryNo traceback latency

RICE UNIVERSITY

Estimation/Detection (64,32 sizes)

TTLLbbbb bbbbRR 00 **

HHLLbrbr rbrbRR 00 **

)RR*A(AA brbb

1ii1iii RxCxLxyy )y(signd ii

H

1H10

H01

H10

H0

1H0

L R

)]AAAdiag(AAAARe[A C

]ARe[A L

)y(signd

]xAxARe[y

ii

1iH1i

H0i

MultiuserEstimation

Kernel 1,2,3

MultiuserDetection

Kernel 6, 7

Massaging matricesfor detection

Kernel 4, 5

RICE UNIVERSITY

X(0)

X(1)

X(2)X(3)

X(4)

X(5)

X(6)X(7)

X(8)

X(9)

X(10) X(11)

X(12)

X(13)

X(14) X(15)

X(0)

X(1)

X(2)X(3)

X(4)

X(5)

X(6)X(7)

X(8)

X(9)

X(10) X(11)

X(12)

X(13)

X(14) X(15)

X(0)

X(2)

X(4)X(6)

X(8)

X(10)

X(12)X(14)

X(1)

X(3)

X(5) X(7)

X(9)

X(11)

X(13) X(15)

X(0)

X(1)

X(2)X(3)

X(4)

X(5)

X(6)X(7)

X(8)

X(9)

X(10) X(11)

X(12)

X(13)

X(14) X(15)

a. Unsuitable Trellis b. Suitable Trellis c. Shuffled Suitable TrellisX(0)

X(1)

X(2)X(3)

X(4)

X(5)

X(6)X(7)

X(8)

X(9)

X(10) X(11)

X(12)

X(13)

X(14) X(15)

X(0)

X(1)

X(2)X(3)

X(4)

X(5)

X(6)X(7)

X(8)

X(9)

X(10) X(11)

X(12)

X(13)

X(14) X(15)

Trellis for rate ½ code with K = 5

Upper bound on parallel clusters for good FU utilization : N/2k

Maximum 8 parallel units for rate ½ with 16 states

RICE UNIVERSITY

Trellis structures for parallel Viterbi

Definition:

If from a present state p [1..N], set of next states are {mp} (mp has 2k elements where ‘k’ is the number of inputs at the encoder), i.e.

p {mp}then i,j [1..N] either {mi} = {mj} or {mi} {mj} =

and

a trellis that satisfies this property is denoted as a “separable” or a “fast” trellis.

RICE UNIVERSITY

X(0) X(1)

X(2)X(3)

X(4) X(5)

X(6)X(7)

X(8) X(9)

X(10) X(11)

X(12) X(13)

X(14) X(15)

Y(0) Y(1)

Y(2)Y(3)

Y(4) Y(5)

Y(6)Y(7)

Y(8) Y(9)

Y(10) Y(11)

Y(12) Y(13)

Y(14) Y(15)

X(0) X(1)

X(2)X(3)

X(4) X(5)

X(6)X(7)

X(8) X(9)

X(10) X(11)

X(12) X(13)

X(14) X(15)

Y(0) Y(1)

Y(2)Y(3)

Y(4) Y(5)

Y(6)Y(7)

Y(8) Y(9)

Y(10) Y(11)

Y(12) Y(13)

Y(14) Y(15)

a. Shuffled Suitable Trellis for ‘k=2’ b. Rearranged Shuffled Suitable Trellis for ‘k=2’

Trellis for rate 2/3 code with K = 5

Upper bound on parallel clusters for good FU utilization : N/2k

Maximum 4 parallel units for rate 2/3 with 16 states(Having 8 will involve interprocessor comm. overhead)

RICE UNIVERSITY

Survivor Management in Viterbi

Two techniquesTraceback : Commonly used Register Exchange

Traceback is good for VLSI architectures where the information bits can be decoded by proper survivor memory addressing sequentially

Drawback: Sequential and additional latency

RICE UNIVERSITY

Register exchange for decoding

Register for given node at given time contains information bits associated with surviving partial path that ends in that state

Survivors calculated in conjunction with path metrics.

Latency in conventional traceback is avoided.

Higher power consumption as entire survivor memory

contents are updated for all states for each bit.

Suited to a parallel programmable implementation as storing bits in a register for traceback touches the previous survivors anyway

RICE UNIVERSITY

Contents

Motivation




RICE UNIVERSITY

Lower bounds on + and *

0 50 100 150 200 250 30010

0

101

102

103

Ad

der

s/M

ult

iplie

rs r

equ

ired

to

mee

t re

al-t

ime

Estimation, Detection and Decoding in a W-CDMA multiuser system

Number of users

AddMul

SLOW FADING (estimation every 1000 bits)

MEDIUM FADING(estimation every 100 bits)

FAST FADING(estimation every 10 bits)

DATA RATES

RICE UNIVERSITY

Kernel 2 (mmult) for 3 +,2*

Adders have limited FU utilization

O(N3) *, O(N3) +

Multipliers 100% in loop

Divider not being utilized

Replace / with *

Communication(waiting for input)

TIM

E

LOOP

FU unavailable(input ready but

FU busy)

RICE UNIVERSITY

Kernel 2 (mmult)for 3 +,3*

better adder utilization

needs sufficient registers for scaling [register allocation may fail]

code may also need slight tuning of variables for optimization

TIM

E

RICE UNIVERSITY

Kernel computational time

Algorithm KernelFunctional unit

utilization*(3 +, 2 *)

ExecutionTime

(cycles)

Functional unitutilization*

(3 +, 3 *)

ExecutionTime

(cycles)

PerformanceImprovement(Expected:1.5)

1 70% ,100% 1224 78.6% ,78% 1064 1.15Est- 2 53% ,91% 22720 85% ,99% 14360 1.5822

imate 3 55% ,42% 1058 IN/ OUT 1058 1Total 14464

Glue 4 59% ,91% 7468 78% ,84% 5573 1.341Matrices 5 63% ,96% 12192 68% ,71% 11084 1.1

Total 16657Detect 6 67% ,100% 366 90% ,89.6% 275 1.33

7 67% ,96% 996 89% ,84.2% 760 1.31Total 1035

Decode* 8 70% ,10% 32576 32576

Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles

*Numbers subject to change

RICE UNIVERSITY

Communication overhead

Kernels(Micro-controller

executing)

Memoryoperations

Init

iali

zati

on

Idle time betweenkernels

RICE UNIVERSITY

Comparisons with TI C6701 DSPs

0 5 10 15 20 25 30 3510

-6

10-5

10-4

10-3

10-2

Ex

ecu

tio

n t

ime

(in

se

con

ds

)

Users

Single DSP implementation 2 DSP implementation Target data rate - 128 Kbps/user Our architecture based on Imagine

X

x

RICE UNIVERSITY

Kernel comparisons

1 2,3 4,5 6 710

-7

10-6

10-5

10-4

10-3

10-2

10-1

KERNELS

Exe

cuti

on t

ime

IMAGINE

TI C67: Internal Memory

TI C67: External Memory

RICE UNIVERSITY

4Gone Conclusions

Various programmable architectures can be investigated for 4G systems depending on algorithms, time, area and power constraints QUICKLY

Enormous potential for 4G system prototyping.

Programmable baseband processor design with broad system functionality, flexibility and low-power consumption that allows a smooth and fast transition from 2G to 3G to 4G systems.

RICE UNIVERSITY

Future work

Investigating bottlenecks, functional unit design

and other innovations needed to attain real-time

Power and area constraints

Scalability with data rates

Handset algorithms

The insights gained from the design can also be applied to DSPs and other processors.

Documents

RICE UNIVERSITY DSPs for 4G wireless systems Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro and Behnaam Aazhang This work has been supported by Nokia,