
RICE UNIVERSITY

Flexible wireless communication architectures

Sridhar Rajagopal

Department of Electrical and Computer Engineering

Rice University, Houston TX

Faculty Candidate Seminar – University of Rochester

March 31, 2003

This work has been supported in part by NSF, Nokia and TI


Future wireless devices demand flexibility

High data rate mobile devices with multimedia

Multiple antennas w/ complex signal processing algorithms

High performance and low power needs

Multiple algorithms and environments supported in same device

Fast design time

Bluetooth/Home Networks

Wireless Cellular

Wireless LAN


Flexibility needed in different layers

Physical Layer

MAC Layer

Network Layer

Application Layer

Support for multiple wireless environments and algorithms at high data rates

Puppeteer project at Rice: http://www.cs.rice.edu/CS/Systems/Puppeteer/

Analog RF


Research vision: Attain flexibility

Architectures:

Flexibility: support a variety of sophisticated algorithms

High Performance: GOPs of computation (Mbps)

Low Power: < 500 mW

Algorithms:

Need efficient algorithms for mapping to architectures

Fast design exploration for efficient algorithms & architectures

Design me


My contributions: Algorithms

Multi-user channel estimation: [Jnl. of VLSI Sig. Proc. ’02, ASAP ’00]

Matrix inversions

Numerical techniques

Conjugate-gradient descent for complexity reduction

Multi-user detection: [ISCAS ’01]

Block-based computation to streaming computations

Pipelining, lower memory requirements

Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]


My contributions: Architectures

Heterogeneous DSP-FPGA system designs: [ICSPAT’00]

Computer arithmetic: [Symp. on Comp. Arith. ’01]

Dynamic truncation in ASICs using on-line arithmetic

[Ph.D. Thesis]

Scalable Wireless Application-specific Processors (SWAPs)

Rapid architecture exploration for flexibility-performance tradeoffs


Scalable Wireless Application-specific Processors

Family of flexible programmable processors

Clusters of ALUs

High performance by supporting 100’s of ALUs

Can provide customization for various algorithms

Adapts (“swaps”) architecture dynamically for power

[Figure: clusters of adders (+) and multipliers (*); both the number of clusters and the number of ALUs per cluster scale]


Rapid design exploration for SWAPs

Low “complexity”, parallel, fixed-point algorithms

[Figure: architecture exploration applies ASIC design and DSP design to SWAPs, drawn as clusters of adders (+) and multipliers (*)]


Research vision summary

Provide a framework to rapidly explore:

Flexible, high performance, low power architectures (SWAPs)

Efficient algorithm design for mapping to SWAPs

Understanding of algorithms, DSPs and ASICs used

Flexibility-performance trade-off with increasing customization in SWAPs

Inter-disciplinary research:

Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, CAD, Compilers


Talk Outline

Research vision

SWAPs - Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals


SWAPs borrow from DSPs

DSPs use

Instruction Level Parallelism (ILP)

Subword Parallelism (MMX)

Current DSPs

Not enough functional units (ALUs) for GOPs of computation

• Cannot extend to more ALUs

• TI C6x DSP has 8 ALUs; need 100’s of ALUs

Cannot support more registers (area, ports)

Difficult to find ILP as ALUs increase


SWAPs borrow from ASICs

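Exploit data parallelism (DP) also

Available in many wireless algorithms

This is what ASICs do!

int i, a[N], b[N], c[N];       // 32 bits
short int d[N], e[N], f[N];    // 16 bits packed

for (i = 0; i < 1024; ++i)
{
    a[i] = b[i] + c[i];
    d[i] = e[i] + f[i];
}

ILP: the two additions in the loop body are independent of each other

DP: the loop iterations are independent of each other

Subword: the 16-bit operands pack two to a 32-bit word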
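The subword case can be illustrated in plain C. This is a minimal sketch, assuming two 16-bit lanes packed into one 32-bit word; `add_packed16` and `pack16` are hypothetical names, and the masking trick stands in for a hardware subword instruction rather than reproducing any particular DSP's instruction set.

```c
#include <stdint.h>

/* Pack two 16-bit values into one word: hi in bits 31-16, lo in bits 15-0. */
uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Add two pairs of 16-bit values with a single 32-bit addition.
 * Each lane wraps modulo 2^16 and no carry crosses from the low
 * lane into the high lane. */
uint32_t add_packed16(uint32_t x, uint32_t y)
{
    const uint32_t low15 = 0x7FFF7FFFu;        /* bits 0-14 of each lane */
    const uint32_t msb   = 0x80008000u;        /* bit 15 of each lane    */
    uint32_t sum = (x & low15) + (y & low15);  /* cannot overflow a lane */
    return sum ^ ((x ^ y) & msb);              /* fold in the top bits   */
}
```

Calling `add_packed16(pack16(1, 2), pack16(3, 4))` performs both 16-bit additions of one loop iteration pair at once, which is the effect the "16 bits packed" comment in the slide's loop refers to.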


SWAPs borrow from stream processors

[Figure: stream program. Input data (received signal) flows as streams through kernels (correlator channel estimation, matched filter, interference cancellation, Viterbi decoding) to output data (decoded bits)]

Kernels (computation) and streams (communication)

Operations on kernels use local data in clusters providing GOPs support

Streams expose data parallelism

Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.
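The kernel and stream split can be sketched in C as a chain of functions, each reading one array and writing another. This is only an illustration under assumptions: `kernel_fn`, `run_stream_program`, `kernel_scale` and `kernel_slice` are hypothetical names, the two toy kernels are placeholders for the receiver kernels in the figure, and nothing here reflects the programming interface of the Imagine processor.

```c
#include <stddef.h>
#include <string.h>

#define STREAM_MAX 1024   /* longest stream this sketch handles */

/* A kernel reads one stream and writes one stream of equal length. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Run the kernels in order; each kernel's output stream feeds the next.
 * data holds the input on entry and the final output on return.
 * Returns 0 on success, -1 if n exceeds STREAM_MAX. */
int run_stream_program(const kernel_fn *kernels, size_t count,
                       int *data, size_t n)
{
    int tmp[STREAM_MAX];
    if (n > STREAM_MAX)
        return -1;
    for (size_t k = 0; k < count; ++k) {
        kernels[k](data, tmp, n);
        memcpy(data, tmp, n * sizeof data[0]);
    }
    return 0;
}

/* Toy kernel: double every element. */
void kernel_scale(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2 * in[i];
}

/* Toy kernel: hard decision, 1 for positive values and 0 otherwise. */
void kernel_slice(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] > 0;
}
```

Inside each kernel the loop iterations touch only their own element, which is the data parallelism that the streams expose.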


SWAPs: multi-cluster DSPs

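[Figure: a DSP (1 cluster) has one set of ALUs (+++***) with internal memory and exploits ILP. SWAPs have many such clusters fed from memory, the Stream Register File (SRF), and exploit ILP within a cluster and DP across clusters]

Adapt clusters to DP

Identical clusters, same operations

Power-down unused FUs, clusters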


Arithmetic clusters in SWAPs

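ALUs (+, *, /)

Scratch-pad (Sp): indexed accesses

Comm. unit (CU): intercluster comm.

Distributed reg. files: support more ALUs

[Figure: one arithmetic cluster. Adders, multipliers and dividers, the scratch-pad (Sp) and the CU each have a local register file and connect through a cross point to the intercluster network and from/to the SRF]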


SWAPs: Physical layer algorithms

[Figure: receiver chain. Antenna, RF front-end, baseband processing (channel estimation, detection, decoding), higher (MAC/Network/OS) layers]


SWAP mapping example: Viterbi decoding

Multiple antenna systems (MIMO systems)

Complexity exponential with transmit x receive antennas

Estimation: Linear MMSE, blind, conjugate gradient….

Detection: FFT, (blind) interference cancellation….

Decoding: Viterbi, Turbo, LDPC…. & joint schemes

SWAP flexibility lets you use the best algorithms for the situation

Example for concept demonstration: Viterbi decoding


Parallel Viterbi Decoding for SWAPs

Add-Compare-Select (ACS): trellis interconnect: computations

Parallelism depends on constraint length (#states)

Traceback: searching

Conventional

• Sequential (no DP) with dynamic branching

• Difficult to implement in parallel architecture

Use Register Exchange (RE)

• Parallel solution

[Figure: detected bits enter the ACS unit, which feeds the traceback unit, which outputs decoded bits]
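A small decoder makes the ACS and register-exchange steps concrete. The sketch below is an assumption-laden illustration, not the implementation from the talk: it fixes constraint length 3 with generators 7 and 5 (octal), uses hard-decision Hamming metrics, and the names `conv_encode` and `viterbi_decode_re` are hypothetical. Each state carries its own survivor register, so there is no traceback search and the loop over states has no dependence between iterations.

```c
#include <stdint.h>

#define K        3                   /* constraint length */
#define NSTATES  (1 << (K - 1))      /* 4 trellis states */
#define TAIL     (K - 1)             /* zero bits that return the encoder to state 0 */
#define BIG      1000000             /* metric of an unreachable state */

/* Parity of the low 3 bits of v. */
static unsigned parity3(unsigned v)
{
    return (v ^ (v >> 1) ^ (v >> 2)) & 1u;
}

/* 2-bit encoder output when input `bit` arrives in `state`. */
static unsigned branch_symbol(unsigned state, unsigned bit)
{
    unsigned reg = (state << 1) | bit;          /* 3-bit shift register */
    return (parity3(reg & 7u) << 1) | parity3(reg & 5u);
}

/* Encode n message bits plus TAIL zero bits into n + TAIL symbols. */
void conv_encode(const uint8_t *msg, int n, uint8_t *sym)
{
    unsigned state = 0;
    for (int t = 0; t < n + TAIL; ++t) {
        unsigned bit = (t < n) ? msg[t] : 0u;
        sym[t] = (uint8_t)branch_symbol(state, bit);
        state = ((state << 1) | bit) & (NSTATES - 1);
    }
}

/* Decode nsym symbols (nsym <= 64) into nsym - TAIL message bits. */
void viterbi_decode_re(const uint8_t *sym, int nsym, uint8_t *out)
{
    int metric[NSTATES] = { 0, BIG, BIG, BIG }; /* encoder starts in state 0 */
    uint64_t surv[NSTATES] = { 0 };             /* one survivor register per state */

    for (int t = 0; t < nsym; ++t) {
        int new_metric[NSTATES];
        uint64_t new_surv[NSTATES];
        /* Iterations of this loop are independent of each other. */
        for (unsigned ns = 0; ns < NSTATES; ++ns) {
            unsigned bit = ns & 1u;             /* input bit that leads to ns */
            unsigned p0 = ns >> 1, p1 = p0 | (NSTATES >> 1);
            unsigned d0 = branch_symbol(p0, bit) ^ sym[t];
            unsigned d1 = branch_symbol(p1, bit) ^ sym[t];
            int m0 = metric[p0] + (int)((d0 & 1u) + (d0 >> 1));   /* add */
            int m1 = metric[p1] + (int)((d1 & 1u) + (d1 >> 1));
            unsigned best = (m0 <= m1) ? p0 : p1;                 /* compare, select */
            new_metric[ns] = (m0 <= m1) ? m0 : m1;
            new_surv[ns] = (surv[best] << 1) | bit;               /* register exchange */
        }
        for (unsigned s = 0; s < NSTATES; ++s) {
            metric[s] = new_metric[s];
            surv[s] = new_surv[s];
        }
    }
    /* The tail forces the encoder back to state 0, so read that survivor. */
    for (int i = 0; i < nsym - TAIL; ++i)
        out[i] = (uint8_t)((surv[0] >> (nsym - 1 - i)) & 1u);
}
```

The 64-bit survivor register caps the block length at 64 symbols in this sketch; a longer block would need wider registers or a sliding window.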


Parallel Viterbi needs re-ordering for SWAPs

Exploiting Viterbi DP in SWAPs:

Use RE instead of regular traceback

Re-order ACS, RE

[Figure: regular ACS connects the DP vector X(0) X(1) ... X(15) to X(0) X(1) ... X(15) through the trellis. ACS in SWAPs uses the re-ordered vector X(0) X(2) ... X(14) X(1) X(3) ... X(15) on one side and X(0) X(1) ... X(15) on the other]
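Why the even-then-odd ordering helps can be shown with a 16-state sketch. It assumes a trellis in which new states j and j + 8 are both reached from old states 2j and 2j + 1; the slide does not state the convention, so this is an assumption, and `acs_regular`, `reorder_even_odd` and `acs_reordered` are hypothetical names. After re-ordering, every ACS output is computed from operands at the same index in two contiguous halves, so the work maps onto identical clusters as a plain vector operation.

```c
#define S 16            /* trellis states X(0)..X(15) */
#define H (S / 2)

static int min_int(int a, int b) { return a < b ? a : b; }

/* bm0[p], bm1[p]: branch metrics for leaving old state p with input 0 or 1. */

/* Regular ACS: each output gathers two strided inputs, X(2j) and X(2j+1). */
void acs_regular(const int *x, const int *bm0, const int *bm1, int *out)
{
    for (int j = 0; j < H; ++j) {
        out[j]     = min_int(x[2 * j] + bm0[2 * j], x[2 * j + 1] + bm0[2 * j + 1]);
        out[H + j] = min_int(x[2 * j] + bm1[2 * j], x[2 * j + 1] + bm1[2 * j + 1]);
    }
}

/* Re-order a vector to X(0) X(2) ... X(14) X(1) X(3) ... X(15). */
void reorder_even_odd(const int *x, int *r)
{
    for (int j = 0; j < H; ++j) {
        r[j]     = x[2 * j];
        r[H + j] = x[2 * j + 1];
    }
}

/* ACS on re-ordered data: unit-stride, element-wise over the two halves. */
void acs_reordered(const int *r, const int *bm0_r, const int *bm1_r, int *out)
{
    const int *even = r, *odd = r + H;
    for (int j = 0; j < H; ++j) {
        out[j]     = min_int(even[j] + bm0_r[j], odd[j] + bm0_r[H + j]);
        out[H + j] = min_int(even[j] + bm1_r[j], odd[j] + bm1_r[H + j]);
    }
}
```

The output of `acs_reordered` is in natural order X(0) ... X(15), so one re-ordering per trellis step is needed before the next ACS; the branch metric arrays are re-ordered with the same `reorder_even_odd` call.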


Designing the SWAP architecture

More clusters better than more ALUs/per cluster

1. Decide how many clusters

Exploit DP

2. Decide what to put within each cluster

Maximize ILP with high functional unit efficiency

Search design space with “explore” tool

Time-power-area characterization

[Figure: clusters of adders (+) and multipliers (*); ILP within a cluster, DP across clusters]


Design a SWAP cluster: “Explore”

Auto-exploration of adders and multipliers for “ACS”

[Figure: instruction count (axis 40 to 160) for each combination of #Adders (1 to 5) and #Multipliers (1 to 5); each point is labeled with (Adder util%, Multiplier util%)]


“Explore” tool benefits

Instruction count vs. ALU efficiency

What goes inside each cluster

Design customized application-specific units

Better performance with increased ALU utilization

Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters

Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters

Chosen Architecture: 4 adders, 3 multipliers, 64 clusters

Explore multiple algorithms

Turn off functional units not in use for given kernel
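The chosen architecture on this slide equals the per-resource maximum of the two per-algorithm results, which the sketch below reproduces. The second half is a rough stand-in for the kind of number the exploration reports: a resource-bound estimate that ignores data dependences and scheduling, which the real tool does not. All names (`swap_config`, `cover`, `min_instructions`, `utilization_pct`) are hypothetical.

```c
typedef struct {
    int adders;        /* adders per cluster      */
    int multipliers;   /* multipliers per cluster */
    int clusters;      /* number of clusters      */
} swap_config;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers both per-algorithm choices. */
swap_config cover(swap_config a, swap_config b)
{
    swap_config c;
    c.adders      = max_int(a.adders, b.adders);
    c.multipliers = max_int(a.multipliers, b.multipliers);
    c.clusters    = max_int(a.clusters, b.clusters);
    return c;
}

/* Lower bound on instruction count for a kernel with n_add additions and
 * n_mul multiplications, ASSUMING no dependences limit the schedule. */
int min_instructions(int n_add, int n_mul, int adders, int multipliers)
{
    int t_add = (n_add + adders - 1) / adders;            /* ceiling division */
    int t_mul = (n_mul + multipliers - 1) / multipliers;
    return max_int(t_add, t_mul);
}

/* Utilization (percent, rounded down) of one unit type over `count` instructions. */
int utilization_pct(int n_ops, int units, int count)
{
    return (100 * n_ops) / (units * count);
}
```

Units present in the covering configuration but absent from an algorithm's own result (for example, two of the three multipliers while Algorithm 2 runs) are the ones that can be turned off for that kernel.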


SWAP flexibility provides power savings

Multiple algorithms

Different ALU requirements

Different cluster requirements

Turning off ALUs

Use the right #ALUs for kernel from static code schedule

Turning off clusters

Data across SRF of all clusters

Each cluster does not have access to entire SRF

Next kernel may need data from SRF of other clusters

Reconfiguration support needs to be provided


SWAPs provide cluster scaling

Use mux-demux buffers

Latency hidden - Minimal loss in performance

Can turn off clusters entirely

[Figure: the SRF connects to the clusters through mux-demux buffers]


Viterbi reconfiguration using SWAPs

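Packet 1: constraint length 7 (16 clusters)

Packet 2: constraint length 9 (64 clusters)

Packet 3: constraint length 5 (4 clusters)

[Figure: clusters not needed for the available DP can be turned OFF]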
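The cluster counts quoted in the talk (16 for constraint length 7, and 64, 16 and 4 clusters for K = 9, 7 and 5 in the peak-power table) are consistent with four trellis states per active cluster. The sketch below encodes that inferred ratio; `STATES_PER_CLUSTER`, `viterbi_states` and `active_clusters` are hypothetical names, and the ratio is an inference from the slide numbers rather than a stated design rule.

```c
#define MAX_CLUSTERS       64
#define STATES_PER_CLUSTER 4    /* inferred: 64 states on 16 clusters for K = 7 */

/* Number of trellis states for constraint length k. */
int viterbi_states(int k)
{
    return 1 << (k - 1);
}

/* Clusters kept active for constraint length k; the rest can be powered down. */
int active_clusters(int k)
{
    int c = viterbi_states(k) / STATES_PER_CLUSTER;
    if (c < 1)
        c = 1;
    if (c > MAX_CLUSTERS)
        c = MAX_CLUSTERS;
    return c;
}
```

Holding the states-per-cluster ratio fixed is what lets one kernel binary serve every constraint length: only the number of active clusters changes between packets.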


Run-time SWAP flexibility

[Figure: execution time (cycles) for 64-bit, rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Labels: kernels (computation), memory accesses, no data; activity shown for clusters and memory]


SWAP exploration for Viterbi decoding

[Figure: frequency needed to attain real-time (in MHz, log scale 1 to 1000) versus number of clusters (log scale 1 to 100) for K = 9, K = 7 and K = 5. Marked: different SWAPs (without reconfiguration), same SWAP (with reconfiguration), DSP, max DP]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time


SWAPs : Salient features

1-2 orders of magnitude better than a DSP

Any constraint length ⇒ 10 MHz at 128 Kbps

Same code for all constraint lengths

No need to re-compile or load another code as long as parallelism/cluster ratio is constant

Power savings due to dynamic cluster scaling


Expected SWAP power consumption

Power model based on [Khailany’03]

64 clusters and 1 multiplier per cluster:

0.13 micron, 1.2 V

Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW at 1 MHz)

Area: ~53.7 mm²

10 MHz, 128 Kbps with reconfiguration (DSP ~200 mW)

Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.

[Figure: power (in mW, 0 to 90) versus active clusters (0 to 70, max 64)]

Viterbi     Clusters   Peak Power
K = 9       64         ~90 mW
K = 7       16         ~28.57 mW
K = 5       4          ~13.8 mW
overhead    0          ~8.1 mW
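The quoted figures can be cross-checked with two small helpers. The first assumes peak power scales linearly with clock frequency at a fixed supply voltage, which reproduces ~9 mW/MHz x 10 MHz = ~90 mW for the SWAP and ~1 mW/MHz x ~200 MHz = ~200 mW for the DSP. The second interpolates linearly between the tabulated cluster counts; that interpolation is an assumption of this sketch, not the power model of [Khailany’03], and the names `peak_power_mw` and `estimate_power_mw` are hypothetical.

```c
#include <stddef.h>

/* Peak active power in mW, ASSUMING linear scaling with frequency. */
double peak_power_mw(double mw_per_mhz, double freq_mhz)
{
    return mw_per_mhz * freq_mhz;
}

typedef struct {
    int clusters;   /* active clusters */
    double mw;      /* peak power from the table */
} power_point;

static const power_point viterbi_power[] = {
    { 0, 8.1 }, { 4, 13.8 }, { 16, 28.57 }, { 64, 90.0 }
};

/* Piecewise-linear estimate of peak power for `active` clusters,
 * clamped to the first and last table entries. */
double estimate_power_mw(int active)
{
    size_t n = sizeof viterbi_power / sizeof viterbi_power[0];
    if (active <= viterbi_power[0].clusters)
        return viterbi_power[0].mw;
    for (size_t i = 1; i < n; ++i) {
        const power_point *a = &viterbi_power[i - 1];
        const power_point *b = &viterbi_power[i];
        if (active <= b->clusters) {
            double t = (double)(active - a->clusters)
                     / (double)(b->clusters - a->clusters);
            return a->mw + t * (b->mw - a->mw);
        }
    }
    return viterbi_power[n - 1].mw;
}
```

For a cluster count that is not in the table, such as 32, the estimate falls between the K = 7 and K = 9 entries; it should be read as a rough guide only.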


Multiuser Estimation-Detection+Decoding

Real-time target : 128 Kbps per user

[Figure: frequency needed to attain real-time (in MHz, log scale 10 to 100000) versus number of clusters (log scale 1 to 100), with curves labeled FAST, MEDIUM and SLOW. Marked: 32-user base-station, mobile, DSP]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time


Expected SWAP power : base-station

32-user base-station with 3 multipliers (×) per cluster and 64 clusters:

0.13 micron, 1.2 V

Peak Active Power: ~18.19 mW for 1 MHz (increased ×)

Area: ~93.4 mm²

Total peak base-station power consumption:

~18.19 W at 1 GHz for 32 users at 128 Kbps/user
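The total follows from the per-MHz figure if peak power is assumed to scale linearly with clock frequency at the same 1.2 V supply (an assumption that matches the numbers quoted here):

```latex
P_{\text{peak}} \approx 18.19\ \frac{\text{mW}}{\text{MHz}} \times 1000\ \text{MHz}
               = 18190\ \text{mW} = 18.19\ \text{W}
```

Spread over 32 users, that is roughly 0.57 W of peak power per 128 Kbps user.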


Current research: Flexibility vs. performance

SWAPs: 128 Kbps at ~10-100 mW for Viterbi

Borrow DP from ASICs!

Suitable for base-stations

Flexibility more important than power

Suitable for mobile devices

Power constraints tighter

Can be customized for further power savings

Handset SWAPs (H-SWAPs)

Borrow task pipelining from ASICs!

Application-specific units and specialized comm. network


Handset SWAPs: H-SWAPs

Trade Data Parallelism for Task Pipelining

[Figure: SWAPs (max. clusters and reconfigure): many +++*** clusters on one SRF, full DP. SWAPlet (limit clusters): a few +++* clusters, limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets with different ALU mixes (+++*, ++*, ++++), each with limited DP]


Sample points in architecture exploration

DSPs (1 cluster): ILP, Subword

SWAPs (multiple): ILP, Subword, DP

H-SWAPs (optimized for handsets): ILP, Subword, DP, Task Pipelining, Custom ALUs

Programmable solutions with increased customization

Performance, Power benefits


Future research: Efficient algorithms

[Figure: multiple antenna systems. Blocks: multipath channel, channel estimator, equalizer, turbo equalizer, MRC, detector, demodulator, decoder, beam-forming, coherent STC, non-coherent STC]


Future research: Architectures

Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs

Potential applications

Image processing:

Cameras: variety of compression algorithms

Biomedical applications:

Hearing aids: DSP running on body heat*

Sensor networks

Compression of data before transmission

*Quote: Gene Frantz, TI Fellow


SWAPs: Flexibility, Performance, Power

Need flexible architectures for future wireless devices

Higher data rates, lower power, more complex algorithms

Rapid Exploration for Scalable, Wireless Application-specific Processors

Flexibility vs. performance trade-offs

SWAPs: flexibility, high performance and low power

Exploit data parallelism like ASICs

1-2 orders better performance than DSPs

Turn off unused clusters and unused ALUs for low power
