RICE UNIVERSITY

Flexible wireless communication architectures

Sridhar Rajagopal

Department of Electrical and Computer Engineering, Rice University, Houston TX

Faculty Candidate Seminar – University of Rochester, March 31, 2003

This work has been supported in part by NSF, Nokia and TI



Page 2

Future wireless devices demand flexibility

High data rate mobile devices with multimedia

Multiple antennas w/ complex signal processing algorithms

High performance and low power needs

Multiple algorithms and environments supported in same device

Fast design time

Bluetooth/Home Networks

Wireless Cellular

Wireless LAN

Page 3

Flexibility needed in different layers

Physical Layer

MAC Layer

Network Layer

Application Layer

Support for multiple wireless environments and algorithms at high data rates

Puppeteer project at Rice: http://www.cs.rice.edu/CS/Systems/Puppeteer/

Analog RF

Page 4

Research vision: Attain flexibility

Architectures:
• Flexibility: support variety of sophisticated algorithms
• High Performance: GOPs of computation (Mbps)
• Low Power: < 500 mW

Algorithms:
• Need efficient algorithms for mapping to architectures

Fast design exploration for efficient algorithms & architectures

Design me

Page 5

My contributions: Algorithms

Multi-user channel estimation: [Jnl. Of VLSI Sig. Proc.’02, ASAP’00]
• Matrix-inversions
• Numerical techniques
• Conjugate-gradient descent for complexity reduction

Multi-user detection: [ISCAS’01]
• Block-based computation to streaming computations
• Pipelining, lower memory requirements

Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]
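The slide lists conjugate-gradient descent as a way to reduce the cost of matrix inversion in channel estimation. The sketch below shows the general idea only: it solves A x = b iteratively without forming the inverse. It assumes a small, real-valued, symmetric positive-definite system; the talk's actual matrices and fixed-point details are not given here, and the names `cg_solve`, `dot` and `CG_MAX_N` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CG_MAX_N 16  /* illustrative upper bound on the system size */

/* Inner product of two length-n vectors. */
double dot(const double *a, const double *b, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}

/* Solve A x = b for a symmetric positive-definite A (row-major, n x n)
   by conjugate gradients, starting from x = 0.
   Returns the number of iterations performed. */
int cg_solve(const double *A, const double *b, double *x,
             size_t n, int max_iter, double tol)
{
    double r[CG_MAX_N], p[CG_MAX_N], Ap[CG_MAX_N];
    assert(n > 0 && n <= CG_MAX_N);

    for (size_t i = 0; i < n; ++i) {
        x[i] = 0.0;
        r[i] = b[i];   /* residual b - A x with x = 0 */
        p[i] = b[i];   /* first search direction */
    }
    double rs_old = dot(r, r, n);
    int iter = 0;
    while (iter < max_iter && rs_old > tol * tol) {
        for (size_t i = 0; i < n; ++i)
            Ap[i] = dot(&A[i * n], p, n);      /* matrix-vector product */
        double alpha = rs_old / dot(p, Ap, n); /* step length */
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rs_new = dot(r, r, n);
        double beta = rs_new / rs_old;
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];         /* next conjugate direction */
        rs_old = rs_new;
        ++iter;
    }
    return iter;
}
```

Each iteration costs one matrix-vector product plus a few vector operations, so stopping after a few iterations trades accuracy for computation, which is the kind of complexity reduction the slide refers to.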

Page 6

My contributions: Architectures

Heterogeneous DSP-FPGA system designs: [ICSPAT’00]

Computer arithmetic: [Symp. On Comp. Arith’01]
• Dynamic truncation in ASICs using on-line arithmetic

[Ph.D. Thesis]

Scalable Wireless Application-specific Processors (SWAPs)

Rapid architecture exploration for flexibility-performance tradeoffs

Page 7

Scalable Wireless Application-specific Processors

Family of flexible programmable processors
• Clusters of ALUs
• High performance by supporting 100’s of ALUs
• Can provide customization for various algorithms
• Adapts (“swaps”) architecture dynamically for power

[Figure: clusters of ALUs (+ and * units, with "?" marking undetermined units). Axis labels: Scale Clusters, Scale ALUs]

Page 8

Rapid design exploration for SWAPs

Low “complexity”, parallel, fixed-point algorithms

[Figure: architecture exploration flow. Labels: Architecture Exploration, ASIC design, DSP design, apply, SWAPs (clusters of + and * units, with "?" marking undetermined units)]

Page 9

Research vision summary

Provide a framework to rapidly explore:
• Flexible, high performance, low power architectures (SWAPs)

Efficient algorithm design for mapping to SWAPs

Understanding of algorithms, DSPs and ASICs used
• Flexibility-performance trade-off with increasing customization in SWAPs

Inter-disciplinary research:
• Wireless communications, VLSI Signal Processing, Computer architecture, Computer arithmetic, CAD, Compilers

Page 10

Talk Outline

Research vision

SWAPs - Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals

Page 11

SWAPs borrow from DSPs

DSPs use
• Instruction Level Parallelism (ILP)
• Subword Parallelism (MMX)

Current DSPs
• Not enough functional units (ALUs) for GOPs of computation
  • cannot extend to more ALUs
  • TI C6x DSP has 8 ALUs; need 100’s of ALUs
• Cannot support more registers (area, ports)
• Difficult to find ILP as ALUs increase

Page 12

SWAPs borrow from ASICs

Exploit data parallelism (DP) also
• Available in many wireless algorithms
• This is what ASICs do!

    int i, a[N], b[N], c[N];        // 32 bits
    short int d[N], e[N], f[N];     // 16 bits packed

    for (i = 0; i < 1024; ++i)
    {
        a[i] = b[i] + c[i];
        d[i] = e[i] + f[i];
    }

Figure labels on the code: ILP, DP, Subword
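To make the "16 bits packed" comment concrete, here is a minimal sketch of subword parallelism in portable C: two 16-bit additions done with one 32-bit add. Real DSPs and MMX provide this as a single instruction; the helper names `pack16x2`, `add16x2` and `add_packed16` are hypothetical, and the masking trick is a software stand-in for that hardware support.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (high lane, low lane). */
uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | (uint32_t)lo;
}

/* Add both 16-bit lanes with one 32-bit addition. Each lane wraps
   modulo 2^16 and no carry leaks from the low lane into the high lane. */
uint32_t add16x2(uint32_t x, uint32_t y)
{
    const uint32_t top = 0x80008000u;            /* top bit of each lane */
    uint32_t low_sum = (x & ~top) + (y & ~top);  /* add the lower 15 bits */
    return low_sum ^ ((x ^ y) & top);            /* fold the top bits back in */
}

/* d[i] = e[i] + f[i] on 16-bit data, two elements per add. n must be even. */
void add_packed16(uint16_t *d, const uint16_t *e, const uint16_t *f, int n)
{
    assert(n >= 0 && n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        uint32_t s = add16x2(pack16x2(e[i + 1], e[i]),
                             pack16x2(f[i + 1], f[i]));
        d[i] = (uint16_t)(s & 0xFFFFu);
        d[i + 1] = (uint16_t)(s >> 16);
    }
}
```

The loop iterations are still independent of each other, so the data parallelism (DP) the slide points to remains available on top of the subword packing.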

Page 13

SWAPs borrow from stream processors

[Figure: kernels connected by streams, from Input Data to Output Data. Labels: received signal, Correlator, channel estimation, Matched filter, Interference Cancellation, Viterbi decoding, Decoded bits, Kernel, Stream]

Kernels (computation) and streams (communication)
• Operations on kernels use local data in clusters providing GOPs support
• Streams expose data parallelism

Imagine stream processor at Stanford [Rixner’01]

Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.
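A minimal sketch of the kernel/stream programming model described above, written in plain C rather than in the Imagine toolchain's own languages. Each kernel reads one stream and writes another, and intermediate streams stay in two small buffers that stand in for the stream register file. The two kernels are hypothetical placeholders, not the correlator or matched filter from the figure, and `STREAM_MAX` is an illustrative capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STREAM_MAX 64  /* illustrative capacity of the stream buffers */

/* A kernel consumes one stream and produces one stream of equal length. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Hypothetical kernel: stand-in for a filtering stage (a gain of 2). */
void kernel_gain2(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)   /* elements are independent: DP */
        out[i] = 2 * in[i];
}

/* Hypothetical kernel: hard decision on the sign of each sample. */
void kernel_hard_decision(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] > 0 ? 1 : 0;
}

/* Run the kernels in order, passing streams through two ping-pong buffers. */
void run_stream_program(const kernel_fn *kernels, size_t num_kernels,
                        const int *in, int *out, size_t n)
{
    int buf_a[STREAM_MAX], buf_b[STREAM_MAX];
    int *src = buf_a, *dst = buf_b;
    assert(n <= STREAM_MAX);

    memcpy(src, in, n * sizeof *in);
    for (size_t k = 0; k < num_kernels; ++k) {
        kernels[k](src, dst, n);
        int *tmp = src;              /* output of this kernel feeds the next */
        src = dst;
        dst = tmp;
    }
    memcpy(out, src, n * sizeof *out);
}
```

Because every kernel loop touches each stream element independently, the same program could in principle be spread across clusters, which is the sense in which streams expose data parallelism.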

Page 14

SWAPs: multi-cluster DSPs

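[Figure: a DSP (1 cluster of +++*** units, ILP) with Internal Memory, next to SWAPs with several identical clusters of +++*** units (ILP within a cluster, DP across clusters). Memory: Stream Register File (SRF)]

SWAPs adapt clusters to DP
• Identical clusters, same operations
• Power-down unused FUs, clusters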

Page 15

Arithmetic clusters in SWAPs

ALUs (+, *, /)

Scratch-pad (Sp)
• Indexed accesses

Comm. unit (CU)
• Intercluster comm.

Distributed reg. files
• Support more ALUs

[Figure: arithmetic cluster. Labels: Intercluster Network, From/To SRF, Cross Point, Local Register File, CU, Sp, SRF, ALUs (+, *, /)]

Page 16

Talk Outline

Research vision

SWAPs Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals

Page 17

SWAPs: Physical layer algorithms

[Figure: receiver chain. Labels: Antenna, RF Front-end, Baseband processing (Channel estimation, Detection, Decoding), Higher (MAC/Network/OS) Layers]

Page 18

SWAP mapping example: Viterbi decoding

Multiple antenna systems (MIMO systems)
• Complexity exponential with transmit x receive antennas

Estimation: Linear MMSE, blind, conjugate gradient….

Detection: FFT, (blind) interference cancellation….

Decoding: Viterbi, Turbo, LDPC…. & joint schemes

SWAP flexibility lets you use the best algorithms for the situation

Example for concept demonstration: Viterbi decoding

Page 19

Parallel Viterbi Decoding for SWAPs

Add-Compare-Select (ACS): trellis interconnect: computations
• Parallelism depends on constraint length (#states)

Traceback: searching
• Conventional
  • Sequential (No DP) with dynamic branching
  • Difficult to implement in parallel architecture
• Use Register Exchange (RE)
  • parallel solution

[Figure: block diagram. Labels: Detected bits, ACS Unit, Traceback Unit, Decoded bits]
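The ACS and register-exchange steps named above can be sketched as one trellis stage in C. This is a generic illustration, not the talk's implementation. It assumes a shift-register trellis in which old states 2j and 2j+1 both feed new states j (input bit 0) and j + S/2 (input bit 1), smaller path metrics are better, and branch metrics are supplied by the caller in `bm0` and `bm1`. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STATES 256  /* 2^(K-1) states for constraint length K up to 9 */

/* One trellis stage: add-compare-select plus register exchange.
   pm_*   : path metric per state (smaller is better)
   surv_* : survivor register per state, holding recent decoded bits
   bm0[p] : branch metric for leaving old state p with input bit 0
   bm1[p] : branch metric for leaving old state p with input bit 1 */
void acs_re_stage(int num_states,
                  const uint32_t *pm_old, const uint64_t *surv_old,
                  const uint32_t *bm0, const uint32_t *bm1,
                  uint32_t *pm_new, uint64_t *surv_new)
{
    assert(num_states >= 2 && num_states <= MAX_STATES);
    assert((num_states & (num_states - 1)) == 0);  /* power of two */
    int half = num_states / 2;

    for (int j = 0; j < half; ++j) {   /* butterflies are independent: DP */
        int p0 = 2 * j, p1 = 2 * j + 1;
        for (int u = 0; u < 2; ++u) {
            const uint32_t *bm = u ? bm1 : bm0;
            uint32_t m0 = pm_old[p0] + bm[p0];          /* add */
            uint32_t m1 = pm_old[p1] + bm[p1];
            int from_p1 = m1 < m0;                      /* compare */
            int ns = j + u * half;
            pm_new[ns] = from_p1 ? m1 : m0;             /* select */
            /* Register exchange: copy the winner's survivor register and
               append this stage's input bit, so no traceback is needed. */
            uint64_t s = from_p1 ? surv_old[p1] : surv_old[p0];
            surv_new[ns] = (s << 1) | (uint64_t)u;
        }
    }
}
```

Register exchange replaces the sequential, branch-heavy traceback with a copy per state per stage, so the same loop structure serves both ACS and survivor updates. After the last stage, the survivor register of the best state holds the decoded bits directly.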

Page 20

Parallel Viterbi needs re-ordering for SWAPs

Exploiting Viterbi DP in SWAPs:
• Use RE instead of regular traceback
• Re-order ACS, RE

[Figure: trellis state ordering for 16 states, with the DP vector marked.
Regular ACS: X(0) X(1) X(2) X(3) X(4) X(5) X(6) X(7) X(8) X(9) X(10) X(11) X(12) X(13) X(14) X(15) on both sides of the trellis.
ACS in SWAPs: X(0) X(2) X(4) X(6) X(8) X(10) X(12) X(14) X(1) X(3) X(5) X(7) X(9) X(11) X(13) X(15) on one side, X(0) X(1) X(2) ... X(15) on the other]
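The re-ordering in the figure (even-indexed states first, then odd-indexed) can be written as a simple index shuffle. The sketch below only shows that permutation. The reason given in the comment, that each butterfly reads states 2j and 2j+1 so the two operands become contiguous vectors, is a common motivation and an assumption here, not something the slide states. The function name is hypothetical.

```c
#include <assert.h>

/* Re-order x[0..n-1] into y so that even-indexed entries come first:
   y = x[0], x[2], ..., x[n-2], x[1], x[3], ..., x[n-1].
   After this shuffle, the two inputs of butterfly j (old states 2j and
   2j+1) sit at y[j] and y[n/2 + j], i.e. in two contiguous vectors. */
void split_even_odd(const int *x, int *y, int n)
{
    assert(n >= 0 && n % 2 == 0);
    int half = n / 2;
    for (int j = 0; j < half; ++j) {
        y[j] = x[2 * j];
        y[half + j] = x[2 * j + 1];
    }
}
```

With 16 states, the output order is 0, 2, 4, ..., 14, 1, 3, ..., 15, matching the "ACS in SWAPs" ordering in the figure.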

Page 21

Talk Outline

Research vision

SWAP Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals

Page 22

Designing the SWAP architecture

More clusters are better than more ALUs per cluster

1. Decide how many clusters
• Exploit DP

2. Decide what to put within each cluster
• Maximize ILP with high functional unit efficiency
• Search design space with “explore” tool

Time-power-area characterization

[Figure: clusters of + and * units, with "?" marking undetermined units. ILP within a cluster, DP across clusters]

Page 23

Design a SWAP cluster: “Explore”

Auto-exploration of adders and multipliers for “ACS”

[Figure: 3-D plot of instruction count (axis 40 to 160) versus #Adders (1 to 5) and #Multipliers (1 to 5). Each point is labeled (Adder util%, Multiplier util%). Point labels in extraction order: (43,58) (54,59) (39,41) (62,62) (47,43) (40,32) (70,59) (65,45) (49,33) (39,27) (80,34) (73,41) (61,33) (48,26) (39,22) (50,22) (85,24) (76,33) (60,26) (61,22) (85,17) (72,22) (72,19) (85,13) (85,11)]

Page 24

“Explore” tool benefits

Instruction count vs. ALU efficiency
• What goes inside each cluster

Design customized application-specific units
• Better performance with increased ALU utilization

Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters

Chosen Architecture: 4 adders, 3 multipliers, 64 clusters

Explore multiple algorithms
• Turn off functional units not in use for given kernel
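The chosen architecture above is the element-wise maximum of the two per-algorithm results, and the difference from each algorithm's needs is what can be switched off. A minimal sketch of that bookkeeping follows, assuming this max-then-power-down reading of the slide. The type `swap_config` and both function names are hypothetical, not part of the "explore" tool.

```c
#include <assert.h>

/* Resource counts for one design point (hypothetical type). */
typedef struct {
    int adders;       /* adders per cluster */
    int multipliers;  /* multipliers per cluster */
    int clusters;     /* number of clusters */
} swap_config;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Smallest configuration that covers every algorithm's requirement. */
swap_config cover_requirements(const swap_config *req, int count)
{
    assert(count > 0);
    swap_config c = req[0];
    for (int i = 1; i < count; ++i) {
        c.adders = max_int(c.adders, req[i].adders);
        c.multipliers = max_int(c.multipliers, req[i].multipliers);
        c.clusters = max_int(c.clusters, req[i].clusters);
    }
    return c;
}

/* Units left idle, and so candidates for power-down, when an algorithm
   needing `need` runs on `chip`. */
swap_config idle_units(swap_config chip, swap_config need)
{
    assert(need.adders <= chip.adders);
    assert(need.multipliers <= chip.multipliers);
    assert(need.clusters <= chip.clusters);
    swap_config idle = {
        chip.adders - need.adders,
        chip.multipliers - need.multipliers,
        chip.clusters - need.clusters
    };
    return idle;
}
```

With the slide's numbers, Algorithm 1 leaves 1 adder per cluster and 32 clusters idle, while Algorithm 2 leaves 2 multipliers per cluster idle.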

Page 25

SWAP flexibility provides power savings

Multiple algorithms
• Different ALU requirements
• Different cluster requirements

Turning off ALUs
• Use the right #ALUs for kernel from static code schedule

Turning off clusters
• Data across SRF of all clusters
• Each cluster does not have access to entire SRF
• Next kernel may need data from SRF of other clusters
• Reconfiguration support needs to be provided

Page 26

SWAPs provide cluster scaling

Use mux-demux buffers

Latency hidden - Minimal loss in performance

Can turn off clusters entirely

[Figure: SRF connected to Clusters through Mux-Demux buffers]

Page 27

Viterbi reconfiguration using SWAPs

Packet 1: Constraint length 7 (16 clusters)
Packet 2: Constraint length 9 (64 clusters)
Packet 3: Constraint length 5 (4 clusters)

[Figure labels: DP; Can be turned OFF]
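The three packets above are consistent with a fixed ratio of 4 trellis states per cluster, since a constraint length K code has 2^(K-1) states: 64/16, 256/64 and 16/4 are all 4. The sketch below encodes that reading. The ratio is inferred from the slide's numbers, and the 64-cluster cap is an assumption. The constant and function names are hypothetical.

```c
#include <assert.h>

enum {
    STATES_PER_CLUSTER = 4,  /* inferred from the three packets above */
    MAX_CLUSTERS = 64        /* assumed size of the largest configuration */
};

/* Number of trellis states for constraint length K. */
int viterbi_states(int K)
{
    assert(K >= 2 && K <= 16);
    return 1 << (K - 1);
}

/* Clusters to keep powered for a packet with constraint length K;
   the remaining clusters can be turned off. */
int active_clusters(int K)
{
    int c = viterbi_states(K) / STATES_PER_CLUSTER;
    if (c < 1)
        c = 1;
    if (c > MAX_CLUSTERS)
        c = MAX_CLUSTERS;
    return c;
}
```

Keeping the states-per-cluster ratio constant is also what would let one compiled kernel serve every constraint length, with only the number of active clusters changing per packet.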

Page 28

Run-time SWAP flexibility

[Figure: execution time (cycles) of Clusters and Memory activity for 64-bit, Rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5). Labels: Kernels (Computation), No Data Memory accesses]

Page 29

SWAP exploration for Viterbi decoding

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 1 to 1000) versus number of clusters (1 to 100), with curves for K = 9, K = 7 and K = 5. Annotations: Different SWAPs (Without reconfiguration), Same SWAP (With reconfiguration), DSP, Max DP]

Ideal C64x (w/o co-proc) needs ~200 MHz for real-time

Page 30

SWAPs : Salient features

1-2 orders of magnitude better than a DSP

Any constraint length ⇒ 10 MHz at 128 Kbps

Same code for all constraint lengths
• No need to re-compile or load another code as long as parallelism/cluster ratio is constant

Power savings due to dynamic cluster scaling

Page 31

Expected SWAP power consumption

Power model based on [Khailany’03]

64 clusters and 1 multiplier per cluster:
• 0.13 micron, 1.2 V
• Peak Active Power: ~9 mW at 1 MHz (DSP ~1 mW at 1 MHz)
• Area: ~53.7 mm²

10 MHz, 128 Kbps with reconfiguration (DSP ~200 mW)

Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.

[Figure: Power (in mW, 0 to 90) versus Active Clusters (0 to 70, max 64)]

Viterbi     Clusters    Peak Power
K = 9       64          ~90 mW
K = 7       16          ~28.57 mW
K = 5       4           ~13.8 mW
overhead    0           ~8.1 mW

Page 32

Multiuser Estimation-Detection+Decoding

Real-time target: 128 Kbps per user

[Figure: log-log plot of frequency needed to attain real-time (in MHz, 10 to 100000) versus number of clusters (1 to 100), with curves labeled FAST, MEDIUM and SLOW. Annotations: 32-user base-station, Mobile, DSP]

Ideal C64x (w/o co-proc) needs ~15 GHz for real-time

Page 33

Expected SWAP power : base-station

32-user base-station with 3 multipliers (X) per cluster and 64 clusters:
• 0.13 micron, 1.2 V
• Peak Active Power: ~18.19 mW for 1 MHz (increased X)
• Area: ~93.4 mm²

Total Peak Base-station power consumption:
• ~18.19 W at 1 GHz for 32 users at 128 Kbps/user
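The power figures on these slides follow from scaling the per-MHz peak active power linearly with clock frequency: ~18.19 mW per MHz at 1 GHz gives ~18.19 W, and ~9 mW per MHz at 10 MHz gives ~90 mW for the Viterbi case. A small sketch of that arithmetic follows, assuming purely linear scaling with no voltage scaling and no static power term. The function names are hypothetical.

```c
#include <assert.h>

/* Peak active power in mW, scaling a per-MHz figure linearly with clock. */
double peak_power_mw(double mw_per_mhz, double clock_mhz)
{
    assert(mw_per_mhz >= 0.0 && clock_mhz >= 0.0);
    return mw_per_mhz * clock_mhz;
}

/* Clock cycles available per decoded bit at a given clock and data rate. */
double cycles_per_bit(double clock_mhz, double rate_kbps)
{
    assert(clock_mhz >= 0.0 && rate_kbps > 0.0);
    return (clock_mhz * 1000.0) / rate_kbps;   /* kHz divided by kbit/s */
}
```

For example, `peak_power_mw(18.19, 1000.0)` gives about 18190 mW, the base-station figure, and `cycles_per_bit(10.0, 128.0)` shows that a 10 MHz clock leaves roughly 78 cycles per bit at 128 Kbps.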

Page 34

Talk Outline

Research vision

SWAP Background

Algorithm design for SWAPs

Architecture design for SWAPs

Current and Future Research Goals

Page 35

Current research: Flexibility vs. performance

SWAPs: 128 Kbps at ~10-100 mW for Viterbi
• Borrow DP from ASICs!
• Suitable for base-stations
  • Flexibility more important than power
• Suitable for mobile devices
  • Power constraints tighter
  • Can be customized for further power savings

Handset SWAPs (H-SWAPs)
• Borrow Task pipelining from ASICs!
• Application-specific units and specialized comm. network

Page 36

Handset SWAPs: H-SWAPs

Trade Data Parallelism for Task Pipelining

[Figure: three designs side by side. SWAPs (max. clusters and reconfigure): an SRF feeding many clusters of +++*** units, DP. SWAPlet (limit clusters): a few clusters of +++* units, Limited DP. H-SWAPs (collection of customized SWAPlets): several SWAPlets with different unit mixes (+++*, ++*, ++++), each with Limited DP]

Page 37

Sample points in architecture exploration

DSPs (1 cluster): ILP, Subword
SWAPs (multiple clusters): ILP, Subword, DP
H-SWAPs (optimized for handsets): ILP, Subword, DP, Task Pipelining, Custom ALUs

Programmable solutions with increased customization

Performance, Power benefits

Page 38

Future research: Efficient algorithms

[Figure: Multiple Antenna Systems. Block labels: Multipath Channel, Equalizer, MRC, Decoder, Detector, Demodulator, Non-Coherent STC, Beam-forming, Coherent STC, Channel Estimator, Channel, Turbo Equalizer]

Page 39

Future research: Architectures

Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs

Potential applications
• Image processing: Cameras: variety of compression algorithms
• Biomedical applications: Hearing aids: DSP running on body heat*
• Sensor networks: Compression of data before transmission

*Quote: Gene Frantz, TI Fellow

Page 40

SWAPs: Flexibility, Performance, Power

Need flexible architectures for future wireless devices
• Higher data rates, lower power, more complex algorithms

Rapid Exploration for Scalable, Wireless Application-specific Processors
• Flexibility vs. performance trade-offs

SWAPs - flexibility, high performance and low power
• Exploit data parallelism like ASICs
• 1-2 orders of magnitude better performance than DSPs
• Turn off unused clusters and unused ALUs for low power