RICE UNIVERSITY DSPs for 4G wireless systems Sridhar Rajagopal, Scott Rixner, Joseph R. Cavallaro and Behnaam Aazhang This work has been supported by Nokia, TI, TATP and NSF


RICE UNIVERSITY

DSPs for 4G wireless systems

Sridhar Rajagopal, Scott Rixner,

Joseph R. Cavallaro and Behnaam Aazhang

This work has been supported by Nokia, TI, TATP and NSF


Motivation

[Figure: wireless mobile device with an RF unit, A/D and D/A converters, and a programmable baseband communications processor]

• Mobile: switch between standards and between parameters

• Base-station: varying number of users with different parameters

Programmability - flexibility is good


The problem

Processor type   Algorithms      Data rate targets         Constraints
Mobile           W-CDMA, W-LAN   1 Mbps, 100 Mbps/#users   Time, power, area
Base-station     W-CDMA          4 Mbps                    Time, maybe area
Base-station     W-LAN           100 Mbps                  Time, maybe area

[Figure: GPP, DSP, FPGA and VLSI placed along a performance versus flexibility axis]

Best architecture for power and area constraints?


An approach for the solution

Algorithms well understood at data-flow level

Can design real-time systems in VLSI.

Pushing implementation higher in the chain

Current DSPs not powerful enough for our application

Use an architecture simulator to design our own


Proposed solution

Current solutions to meet real-time: racks of DSPs

Proposed: a programmable processor for 4G wireless systems, measuring < x cm by < x cm

Future wireless architectures:

• x = 2.5 (W-CDMA BS)

• x = 2.0 (W-LAN BS)

• x = 1.5 (mobile handset)

JOE


Past work

[Figure: past work arranged as a system-design timeline.
Stages: Algorithms, DSP, VLSI, FPGA, IMAGINE.
Topics: multiuser channel estimation and multiuser detection; task partitioning, parallelism and pipelining; conventional arithmetic and on-line arithmetic; architecture innovations, functional unit design and usage.
Timeline labels: distant past, recent past, recent and near future.]


Contents

Motivation

The “Imagine” simulator

Parallel algorithms for estimation/detection/decoding

Performance comparisons and results


The IMAGINE architecture

[Figure: block diagram of the Imagine Stream Processor. A host processor drives the stream controller; a microcontroller sequences ALU clusters 0 to 7; the stream register file connects the clusters, the network interface (to the network) and the streaming memory system with four SDRAM banks.]


Why IMAGINE simulator?

RSIM, SimpleScalar: GPP simulators

Great for media processing algorithms

Has a VLIW-based cluster -- DSP comparisons

A good base architecture: 1024-pt FFT

Processor type      Area      Time     Frequency   Power    Energy
Imagine [Float]     2.5 cm²   7.4 µs   500 MHz     3.8 W    28 µJ
TI C6711 [Float]    -         138 µs   150 MHz     1.3 W    180 µJ
TI C6411 [Fixed]    -         40 µs    300 MHz     0.25 W   10 µJ
Virtex II [Fixed]   -         2 µs     125 MHz     <1 W     <2 µJ
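The energy column follows from power multiplied by execution time. The sketch below checks that relationship for the three rows with exact figures, assuming the times are in microseconds and the energies in microjoules; the names `fft_1024` and `energy_uj` are illustrative and do not come from the slides.

```python
# (execution time in seconds, power in watts) for the 1024-pt FFT rows
# that list exact values. Units are an assumption: times read as microseconds.
fft_1024 = {
    "Imagine [Float]": (7.4e-6, 3.8),
    "TI C6711 [Float]": (138e-6, 1.3),
    "TI C6411 [Fixed]": (40e-6, 0.25),
}


def energy_uj(time_s, power_w):
    """Return energy in microjoules using E = P * t."""
    return time_s * power_w * 1e6


for name, (time_s, power_w) in fft_1024.items():
    print(f"{name}: {energy_uj(time_s, power_w):.1f} uJ")
```

The computed values round to the 28, 180 and 10 listed in the table, which is why the faster fixed-point C6411 ends up with the lowest energy of the three despite its longer run time than Imagine.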


Simulator knobs that we can turn

Cycle-accurate simulator

Varying number of Functional units and their design

Varying memory, register sizes

Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …

Almost anything can be changed, some changes easier than others!


Caveats

2 level C++ programming

StreamC:

• transfers streams of data between main memory and stream register file (SRF)

KernelC:

• transfers streams from the SRF to the ALU clusters

Code optimized to the number of ALU clusters and the size of the data

Compiler not yet fully developed


Contents

Motivation

The “Imagine” simulator

Parallel algorithms for estimation/detection/decoding

Performance comparisons and results


Typical workload representation (Base-station)

Wireless LAN: Equalization?, FFT, Viterbi decoding

W-CDMA: Multiuser channel estimation, multiuser detection, Viterbi decoding

Advanced receiver schemes: Turbo decoding, multiple antenna systems (MIMO)


Parallel estimation/detection/decoding

Multiuser estimation

• Replaced matrix inversion by gradient descent

Multiuser detection

• Parallel Interference Cancellation (PIC)

• Pipelined algorithm that avoids block-based detection

Viterbi decoding

• Trellis structures suited for decoding

• Register exchange for survivor memory

• No traceback latency
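Replacing a matrix inversion by gradient descent can be sketched as follows. This is a minimal illustration, assuming the channel estimate A solves A·R_bb = R_br and is refined by the update A ← A − µ(A·R_bb − R_br); the matrix shapes, the step size `mu`, the iteration count and the helper names are assumptions made for the example, not values from the slides.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def gradient_descent_solve(r_bb, r_br, mu=0.5, iterations=200):
    """Iterate A <- A - mu * (A @ r_bb - r_br) starting from A = 0.

    Converges to r_br @ inverse(r_bb) when r_bb is symmetric positive
    definite and 0 < mu < 2 / (largest eigenvalue of r_bb).
    """
    rows, cols = len(r_br), len(r_br[0])
    a = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        ar = matmul(a, r_bb)
        a = [
            [a[i][j] - mu * (ar[i][j] - r_br[i][j]) for j in range(cols)]
            for i in range(rows)
        ]
    return a


# Hypothetical 2x2 example: build r_br from a known A so the answer is checkable.
a_true = [[1.0, 2.0], [3.0, 4.0]]
r_bb = [[2.0, 0.5], [0.5, 1.0]]  # symmetric positive definite
r_br = matmul(a_true, r_bb)
a_est = gradient_descent_solve(r_bb, r_br)
max_error = max(abs(a_est[i][j] - a_true[i][j]) for i in range(2) for j in range(2))
print(f"largest error after 200 iterations: {max_error:.2e}")
```

Each iteration needs only matrix multiplications and additions, which map onto adders and multipliers, whereas an explicit inverse would need divisions and data-dependent control flow.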


Estimation/Detection (64,32 sizes)

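Multiuser estimation (Kernels 1, 2, 3)

• Sliding-window updates of the correlation matrices: $R_{bb}$ from the products $b_0 b_0^T$ and $b_L b_L^T$, and $R_{br}$ from the matching Hermitian products of $b_0$, $r_0$ and $b_L$, $r_L$

• Gradient-descent channel update: $A = A - \mu\,(A R_{bb} - R_{br})$

Massaging matrices for detection (Kernels 4, 5)

• $L = \mathrm{Re}[A_1^H A_0]$

• $C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]$

• $R = L^H$

Multiuser detection (Kernels 6, 7)

• Matched-filter output $y_i$: real part of $A_0^H$ and $A_1^H$ applied to the inputs of adjacent bit intervals

• Interference cancellation: $y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}$

• Decision: $d_i = \mathrm{sign}(y_i)$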
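The detection stage can be illustrated with a simplified parallel interference cancellation loop. This sketch assumes a bit-synchronous system, so only one cross-correlation matrix is needed instead of the separate matrices for adjacent bits used on the slide; the function name `pic_detect`, the matrix `corr` and the numbers in the example are hypothetical.

```python
def sign(value):
    """Hard decision: +1 for non-negative values, -1 otherwise."""
    return 1 if value >= 0 else -1


def pic_detect(y, corr, stages=2):
    """Parallel interference cancellation on matched-filter outputs y.

    corr[u][j] is the contribution of user j's bit to user u's matched
    filter output. Each stage subtracts the interference reconstructed
    from the previous stage's decisions, for all users at once.
    """
    users = len(y)
    decisions = [sign(v) for v in y]  # stage 0: plain matched filter
    for _ in range(stages):
        cleaned = [
            y[u] - sum(corr[u][j] * decisions[j] for j in range(users) if j != u)
            for u in range(users)
        ]
        decisions = [sign(v) for v in cleaned]
    return decisions


# Hypothetical near-far case: user 1 is three times stronger than user 0.
corr = [[1.0, 1.5], [0.5, 3.0]]
true_bits = [1, -1]
y = [sum(corr[u][j] * true_bits[j] for j in range(2)) for u in range(2)]

matched_filter_only = [sign(v) for v in y]
after_pic = pic_detect(y, corr)
print(matched_filter_only, after_pic)
```

In this example the matched filter alone gets the weak user's bit wrong, and one cancellation stage recovers it. Because every user's update within a stage is independent of the others, the inner computation is a natural fit for parallel ALU clusters.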


[Figure: three trellis diagrams over states X(0) to X(15): a. Unsuitable trellis, b. Suitable trellis, c. Shuffled suitable trellis]

Trellis for rate ½ code with K = 5

Upper bound on parallel clusters for good FU utilization: $N/2^k$

Maximum 8 parallel units for rate ½ with 16 states


Trellis structures for parallel Viterbi

Definition:

If, from a present state $p \in [1..N]$, the set of next states is $\{m_p\}$ ($\{m_p\}$ has $2^k$ elements, where ‘k’ is the number of inputs at the encoder), i.e.

$p \to \{m_p\}$, then $\forall\, i, j \in [1..N]$ either $\{m_i\} = \{m_j\}$ or $\{m_i\} \cap \{m_j\} = \emptyset$

and

a trellis that satisfies this property is denoted as a “separable” or a “fast” trellis.
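The separability property can be checked mechanically. The sketch below assumes a shift-register encoder in which k input bits are shifted into a single state register at each step; that state-transition model and the function names are assumptions for illustration, since the slides do not give the encoder structure.

```python
def next_states(state, n_states, k):
    """Set of next states reachable from `state` for all 2**k inputs."""
    mask = n_states - 1  # n_states is assumed to be a power of two
    return frozenset(((state << k) | u) & mask for u in range(1 << k))


def is_separable(n_states, k):
    """True if any two next-state sets are either identical or disjoint."""
    sets = [next_states(p, n_states, k) for p in range(n_states)]
    return all(a == b or not (a & b) for a in sets for b in sets)


def independent_groups(n_states, k):
    """Number of distinct next-state sets, i.e. groups that can run in parallel."""
    return len({next_states(p, n_states, k) for p in range(n_states)})


for k in (1, 2):
    print(k, is_separable(16, k), independent_groups(16, k))
```

For 16 states this gives 8 independent groups with k = 1 and 4 with k = 2, matching the $N/2^k$ bound on parallel units: states that share a next-state set must exchange path metrics, while different groups need no communication within a trellis step.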


[Figure: two trellis diagrams from states X(0) to X(15) to states Y(0) to Y(15): a. Shuffled suitable trellis for ‘k=2’, b. Rearranged shuffled suitable trellis for ‘k=2’]

Trellis for rate 2/3 code with K = 5

Upper bound on parallel clusters for good FU utilization: $N/2^k$

Maximum 4 parallel units for rate 2/3 with 16 states (having 8 will involve interprocessor communication overhead)


Survivor Management in Viterbi

Two techniques

• Traceback: commonly used

• Register exchange

Traceback is good for VLSI architectures where the information bits can be decoded by proper survivor memory addressing sequentially

Drawback: Sequential and additional latency


Register exchange for decoding

Register for given node at given time contains information bits associated with surviving partial path that ends in that state

Survivors calculated in conjunction with path metrics.

Latency in conventional traceback is avoided.

Higher power consumption, as the entire survivor memory contents are updated for all states for each bit.

Suited to a parallel programmable implementation as storing bits in a register for traceback touches the previous survivors anyway
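Register exchange can be sketched with a small hard-decision Viterbi decoder. To keep the example short it uses a rate ½, K = 3 code with generators 7 and 5 (octal) rather than the K = 5 code discussed earlier; the code choice, the function names and the test message are assumptions for illustration.

```python
INF = float("inf")


def parity(x):
    """Parity (XOR) of the bits of x."""
    return bin(x).count("1") & 1


def step(state, bit):
    """Advance the 4-state encoder by one input bit: (next_state, outputs)."""
    reg = (bit << 2) | state  # newest bit in the most significant position
    return reg >> 1, (parity(reg & 0b111), parity(reg & 0b101))


def encode(bits):
    state, coded = 0, []
    for bit in bits:
        state, out = step(state, bit)
        coded.extend(out)
    return coded


def decode_register_exchange(coded):
    """Viterbi decoding in which each state carries its survivor bits."""
    metrics = [0, INF, INF, INF]        # the encoder starts in state 0
    registers = [[] for _ in range(4)]  # one survivor register per state
    for t in range(0, len(coded), 2):
        rx = coded[t:t + 2]
        new_metrics = [INF] * 4
        new_registers = [[] for _ in range(4)]
        for state in range(4):
            if metrics[state] == INF:
                continue
            for bit in (0, 1):
                nxt, out = step(state, bit)
                metric = metrics[state] + (out[0] != rx[0]) + (out[1] != rx[1])
                if metric < new_metrics[nxt]:
                    new_metrics[nxt] = metric
                    # Register exchange: copy the winner's survivor, append bit.
                    new_registers[nxt] = registers[state] + [bit]
        metrics, registers = new_metrics, new_registers
    return registers[0]  # two tail zeros force the encoder back to state 0


message = [1, 0, 1, 1, 0, 0, 1, 0] + [0, 0]  # data bits plus two tail zeros
received = encode(message)
received[5] ^= 1  # inject a single channel error
decoded = decode_register_exchange(received)
print(decoded == message)
```

The decoded bits are read directly from the register of the final state, so no traceback pass is needed. The cost is visible in the inner loop: every state's register is rewritten at every step, which is the extra memory traffic noted above.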


Contents

Motivation

The “Imagine” simulator

Parallel algorithms for estimation/detection/decoding

Performance comparisons and results


Lower bounds on + and *

[Figure: "Estimation, Detection and Decoding in a W-CDMA multiuser system". x-axis: number of users (0 to 300); y-axis: adders/multipliers required to meet real-time (log scale, 10^0 to 10^3). Curves for Add and Mul under slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits) and fast fading (estimation every 10 bits), annotated with data rates.]


Kernel 2 (mmult) for 3 +,2*

Adders have limited FU utilization

O(N^3) *, O(N^3) +

Multipliers 100% in loop

Divider not being utilized

Replace / with *

[Figure: functional-unit schedule over time, with the kernel loop marked. Legend: communication (waiting for input); FU unavailable (input ready but FU busy)]


Kernel 2 (mmult) for 3 +,3*

better adder utilization

needs sufficient registers for scaling [register allocation may fail]

code may also need slight tuning of variables for optimization

[Figure: functional-unit schedule over time]


Kernel computational time

Algorithm       Kernel  FU utilization*  Execution time  FU utilization*  Execution time  Performance improvement
                        (3 +, 2 *)       (cycles)        (3 +, 3 *)       (cycles)        (expected: 1.5)
Estimate        1       70%, 100%        1224            78.6%, 78%       1064            1.15
                2       53%, 91%         22720           85%, 99%         14360           1.5822
                3       55%, 42%         1058            IN/OUT           1058            1
                Total                                                     14464
Glue Matrices   4       59%, 91%         7468            78%, 84%         5573            1.341
                5       63%, 96%         12192           68%, 71%         11084           1.1
                Total                                                     16657
Detect          6       67%, 100%        366             90%, 89.6%       275             1.33
                7       67%, 96%         996             89%, 84.2%       760             1.31
                Total                                                     1035
Decode*         8       70%, 10%         32576                            32576

Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles

*Numbers subject to change
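The cycle budget and the improvement column can be reproduced with simple arithmetic. This sketch assumes the budget is the clock frequency divided by the per-user bit rate (with 1 Kbps taken as 1000 bits per second) and that the improvement is the ratio of the two execution times; the function names are illustrative.

```python
def cycles_per_bit(clock_hz, bit_rate_bps):
    """Clock cycles available in one bit period."""
    return clock_hz / bit_rate_bps


def speedup(cycles_before, cycles_after):
    """Improvement from the (3 +, 2 *) to the (3 +, 3 *) configuration."""
    return cycles_before / cycles_after


budget = cycles_per_bit(500e6, 128e3)
kernel2_gain = speedup(22720, 14360)
kernel6_gain = speedup(366, 275)

print(f"budget: {budget:.0f} cycles per bit")
print(f"kernel 2: {kernel2_gain:.4f}, kernel 6: {kernel6_gain:.2f}")
```

The budget comes to about 3906 cycles, which the slide rounds to 4000. Kernel 2, the matrix multiplication, is the only kernel that reaches the expected factor of 1.5 from adding a third multiplier, consistent with its multipliers being the saturated resource in the (3 +, 2 *) configuration.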


Communication overhead

[Figure: execution timeline showing kernels (micro-controller executing), memory operations, initialization, and idle time between kernels]


Comparisons with TI C6701 DSPs

[Figure: execution time (in seconds, log scale, 10^-6 to 10^-2) versus number of users (0 to 35). Curves: single DSP implementation; 2 DSP implementation; target data rate of 128 Kbps/user; our architecture based on Imagine]


Kernel comparisons

[Figure: execution time (log scale, 10^-7 to 10^-1) for kernel groups 1; 2,3; 4,5; 6; 7. Series: IMAGINE; TI C67 with internal memory; TI C67 with external memory]


4Gone Conclusions

Various programmable architectures can be investigated for 4G systems depending on algorithms, time, area and power constraints QUICKLY

Enormous potential for 4G system prototyping.

Programmable baseband processor design with broad system functionality, flexibility and low-power consumption that allows a smooth and fast transition from 2G to 3G to 4G systems.


Future work

Investigating bottlenecks, functional unit design and other innovations needed to attain real-time

Power and area constraints

Scalability with data rates

Handset algorithms

The insights gained from the design can also be applied to DSPs and other processors.