RICE UNIVERSITY
Flexible wireless communication architectures
Sridhar Rajagopal
Department of Electrical and Computer Engineering
Rice University, Houston TX
Faculty Candidate Seminar – University of Rochester
March 31, 2003
This work has been supported in part by NSF, Nokia and TI
Future wireless devices demand flexibility
High data rate mobile devices with multimedia
Multiple antennas w/ complex signal processing algorithms
High performance and low power needs
Multiple algorithms and environments supported in same device
Fast design time
Bluetooth/Home Networks
Wireless Cellular
Wireless LAN
Flexibility needed in different layers
Physical Layer
MAC Layer
Network Layer
Application Layer
Support for multiple wireless environments and algorithms at high data rates
Puppeteer project at Rice: http://www.cs.rice.edu/CS/Systems/Puppeteer/
Analog RF
Research vision: Attain flexibility
Architectures:
Flexibility: support a variety of sophisticated algorithms
High performance: GOPs of computation (Mbps)
Low power: < 500 mW
Algorithms:
Need efficient algorithms for mapping to architectures
Fast design exploration for efficient algorithms & architectures
Design me
My contributions: Algorithms
Multi-user channel estimation: [Jnl. of VLSI Sig. Proc.’02, ASAP’00]
Matrix inversions
Numerical techniques: conjugate-gradient descent for complexity reduction
Multi-user detection: [ISCAS’01]
Block-based computation to streaming computations
Pipelining, lower memory requirements
Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm.’02]
My contributions: Architectures
Heterogeneous DSP-FPGA system designs: [ICSPAT’00]
Computer arithmetic: [Symp. on Comp. Arith.’01]
Dynamic truncation in ASICs using on-line arithmetic
[Ph.D. Thesis]
Scalable Wireless Application-specific Processors (SWAPs)
Rapid architecture exploration for flexibility-performance tradeoffs
Scalable Wireless Application-specific Processors
Family of flexible programmable processors
Clusters of ALUs
High performance by supporting 100’s of ALUs
Can provide customization for various algorithms
Adapts (“swaps”) architecture dynamically for power
[Figure: clusters of ALUs (+, *, ?); scale the number of clusters and the number of ALUs per cluster]
Rapid design exploration for SWAPs
[Figure: low-“complexity”, parallel, fixed-point algorithms and architecture exploration; ASIC design and DSP design are applied to SWAPs (clusters of +, *, ? units)]
Research vision summary
Provide a framework to rapidly explore flexible, high-performance, low-power architectures (SWAPs)
Efficient algorithm design for mapping to SWAPs
Understanding of algorithms, DSPs and ASICs used
Flexibility-performance trade-off with increasing customization in SWAPs
Inter-disciplinary research: wireless communications, VLSI signal processing, computer architecture, computer arithmetic, CAD, compilers
Talk Outline
Research vision
SWAPs - Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
SWAPs borrow from DSPs
DSPs use:
Instruction Level Parallelism (ILP)
Subword Parallelism (MMX)
Current DSPs:
Not enough functional units (ALUs) for GOPs of computation
• Cannot extend to more ALUs
• TI C6x DSP has 8 ALUs -- need 100’s of ALUs
Cannot support more registers (area, ports)
Difficult to find ILP as ALUs increase
SWAPs borrow from ASICs
Exploit data parallelism (DP) also
Available in many wireless algorithms
This is what ASICs do!
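#define N 1024
int i, a[N], b[N], c[N];       // 32 bits
short int d[N], e[N], f[N];    // 16 bits, packed
for (i = 0; i < N; ++i)        // DP: iterations are independent
{
    a[i] = b[i] + c[i];        // ILP: independent of the next statement
    d[i] = e[i] + f[i];        // Subword: 16-bit operands share a word
}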
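To make the subword idea concrete, here is a minimal sketch of adding two packed 16-bit values inside one 32-bit word in portable C. The helper names `add16x2` and `pack16x2` are hypothetical and are not from the talk; a real DSP would use a dedicated packed-add instruction instead.

```c
#include <stdint.h>

/* Add two pairs of 16-bit values packed into 32-bit words.
 * Each 16-bit lane wraps on its own; no carry crosses between lanes. */
static uint32_t add16x2(uint32_t x, uint32_t y)
{
    uint32_t lo = (x + y) & 0x0000FFFFu;          /* low lane, carry-out dropped */
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;  /* high lane, computed separately */
    return hi | lo;
}

/* Pack two 16-bit values into one 32-bit word (hi in the upper half). */
static uint32_t pack16x2(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}
```

For example, `add16x2(pack16x2(3, 4), pack16x2(10, 20))` equals `pack16x2(13, 24)`: one 32-bit operation performs two 16-bit additions.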
SWAPs borrow from stream processors
[Figure: kernels connected by streams. Kernels: correlator (channel estimation), matched filter, interference cancellation, Viterbi decoding. Streams: received signal (input data), decoded bits (output data)]
Kernels (computation) and streams (communication)
Operations on kernels use local data in clusters, providing GOPs support
Streams expose data parallelism
Imagine stream processor at Stanford [Rixner’01]
Scott Rixner. Stream Processor Architecture, Kluwer Academic Publishers: Boston, MA, 2001.
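The kernel/stream split can be sketched in plain C as functions (kernels) that consume one array (stream) and produce another. This is only an illustration of the programming model, not the Imagine or SWAP toolchain; the `kernel_fn` signature, the two toy kernels and `run_pipeline` are all hypothetical names.

```c
#include <stddef.h>

/* A kernel reads one input stream and writes one output stream of the
 * same length. This signature is a simplifying assumption. */
typedef void (*kernel_fn)(const int *in, int *out, size_t n);

/* Toy kernel: scale every element (stand-in for a filter). */
static void kernel_scale2(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = 2 * in[i];
}

/* Toy kernel: hard decision on the sign of every element. */
static void kernel_sign(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (in[i] >= 0) ? 1 : 0;
}

/* Run kernels back to back; streams ping-pong between buf_a and buf_b.
 * Returns the buffer that holds the final output stream. */
static int *run_pipeline(const kernel_fn *kernels, size_t num_kernels,
                         int *buf_a, int *buf_b, size_t n)
{
    int *in = buf_a, *out = buf_b;
    for (size_t k = 0; k < num_kernels; ++k) {
        kernels[k](in, out, n);
        int *tmp = in; in = out; out = tmp;  /* output feeds the next kernel */
    }
    return in;
}
```

Each kernel touches every stream element independently, which is where the data parallelism across clusters would come from.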
SWAPs: multi-cluster DSPs
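[Figure: a DSP is one cluster of ALUs (+++***) with internal memory, exploiting ILP; SWAPs place many identical clusters on a shared memory, the Stream Register File (SRF), exploiting ILP within a cluster and DP across clusters]
SWAPs adapt clusters to DP
Identical clusters, same operations
Power-down unused FUs, clusters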
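As a rough illustration of identical clusters running the same operation on different data, the sketch below simulates the clusters sequentially in C. The assignment of element `i` to cluster `i % active`, and the function names, are assumptions made for this example; it counts lockstep steps only and is not a timing model.

```c
#include <stddef.h>

/* Number of lockstep steps when `active` identical clusters each take one
 * stream element per step. Assumes active > 0. */
static size_t steps_needed(size_t n, size_t active)
{
    return (n + active - 1) / active;
}

/* a[i] = b[i] + c[i], with element i assigned to cluster (i % active).
 * The loops simulate sequentially what the clusters would do in parallel.
 * Returns the number of lockstep steps executed. */
static size_t clustered_add(const int *b, const int *c, int *a,
                            size_t n, size_t active)
{
    size_t steps = 0;
    for (size_t base = 0; base < n; base += active, ++steps)
        for (size_t cl = 0; cl < active && base + cl < n; ++cl)
            a[base + cl] = b[base + cl] + c[base + cl];
    return steps;
}
```

With 1024 elements, 64 active clusters need 16 steps and 4 active clusters need 256, which is the trade-off behind powering down clusters when less DP is available.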
Arithmetic clusters in SWAPs
ALUs (+, *, /)
Scratch-pad (Sp): indexed accesses
Comm. unit (CU): intercluster comm.
Distributed reg. files: support more ALUs
[Figure: arithmetic cluster with ALUs (+, *, /), scratch-pad (Sp) and comm. unit (CU), each with a local register file, connected through a cross point to the intercluster network and from/to the SRF]
Talk Outline
Research vision
SWAPs - Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
SWAPs: Physical layer algorithms
[Figure: receiver chain. Antenna, RF front-end, baseband processing (channel estimation, detection, decoding), higher (MAC/Network/OS) layers]
SWAP mapping example: Viterbi decoding
Multiple antenna systems (MIMO systems)
Complexity exponential with transmit x receive antennas
Estimation: Linear MMSE, blind, conjugate gradient….
Detection: FFT, (blind) interference cancellation….
Decoding: Viterbi, Turbo, LDPC…. & joint schemes
SWAP flexibility lets you use the best algorithms for the situation
Example for concept demonstration: Viterbi decoding
Parallel Viterbi Decoding for SWAPs
Add-Compare-Select (ACS): trellis interconnect, computations
Parallelism depends on constraint length (#states)
Traceback: searching
Conventional traceback:
• Sequential (no DP) with dynamic branching
• Difficult to implement in a parallel architecture
Use Register Exchange (RE):
• Parallel solution
[Figure: detected bits enter the ACS unit, followed by the traceback unit, which outputs decoded bits]
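A minimal sketch of ACS with register exchange follows. To stay short it uses a small rate-1/2 code with constraint length K = 3 (generators 7 and 5 octal), hard decisions and Hamming metrics; the talk targets K = 5 to 9. The function names, the message limit of 32 bits and the unterminated trellis are simplifying assumptions, not the talk's implementation.

```c
#include <limits.h>
#include <stdint.h>

#define NSTATES 4                 /* K = 3, so 2^(K-1) = 4 states */
#define BIG (INT_MAX / 2)         /* metric of a state not yet reachable */

/* XOR of the three low bits of v. */
static unsigned parity3(unsigned v)
{
    return (v ^ (v >> 1) ^ (v >> 2)) & 1u;
}

/* Two coded bits (value 0..3) for input `bit` leaving state `s`. */
static unsigned branch_output(unsigned s, unsigned bit)
{
    unsigned reg = (bit << 2) | s;            /* newest bit on the left */
    return (parity3(reg & 7u) << 1) | parity3(reg & 5u);
}

/* Hamming distance between two 2-bit symbols. */
static int hamming2(unsigned a, unsigned b)
{
    unsigned x = a ^ b;
    return (int)((x & 1u) + ((x >> 1) & 1u));
}

/* Encode the low n bits of msg (n <= 32), most significant bit first. */
static void encode(uint32_t msg, int n, unsigned char *coded)
{
    unsigned s = 0;
    for (int t = 0; t < n; ++t) {
        unsigned bit = (msg >> (n - 1 - t)) & 1u;
        coded[t] = (unsigned char)branch_output(s, bit);
        s = (bit << 1) | (s >> 1);
    }
}

/* Decode n symbols. Every state keeps its decoded bits in a survivor
 * register, so no sequential traceback pass is needed at the end. */
static uint32_t viterbi_re(const unsigned char *coded, int n)
{
    int pm[NSTATES] = {0, BIG, BIG, BIG};     /* encoder starts in state 0 */
    uint32_t surv[NSTATES] = {0, 0, 0, 0};

    for (int t = 0; t < n; ++t) {
        int new_pm[NSTATES];
        uint32_t new_surv[NSTATES];
        for (unsigned ns = 0; ns < NSTATES; ++ns) {   /* independent per state */
            unsigned bit = ns >> 1;                   /* input that leads to ns */
            unsigned p0 = 2u * (ns & 1u), p1 = p0 + 1u;  /* two predecessors */
            /* Add */
            int m0 = pm[p0] + hamming2(coded[t], branch_output(p0, bit));
            int m1 = pm[p1] + hamming2(coded[t], branch_output(p1, bit));
            /* Compare and select */
            unsigned best = (m1 < m0) ? p1 : p0;
            new_pm[ns] = (m1 < m0) ? m1 : m0;
            /* Register exchange: copy the winner's bits, append the new one */
            new_surv[ns] = (surv[best] << 1) | bit;
        }
        for (unsigned s = 0; s < NSTATES; ++s) {
            pm[s] = new_pm[s];
            surv[s] = new_surv[s];
        }
    }
    unsigned best = 0;
    for (unsigned s = 1; s < NSTATES; ++s)
        if (pm[s] < pm[best]) best = s;
    return surv[best];
}
```

The inner loop over `ns` has no dependence between states, which is the data parallelism a multi-cluster machine could exploit; the price of register exchange is one survivor register per state.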
Parallel Viterbi needs re-ordering for SWAPs
Exploiting Viterbi DP in SWAPs:
Use RE instead of regular traceback
Re-order ACS, RE
[Figure: regular ACS keeps the states in order X(0), X(1), ..., X(15); ACS in SWAPs uses the re-ordered DP vector X(0), X(2), ..., X(14), X(1), X(3), ..., X(15) alongside X(0), X(1), ..., X(15)]
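The re-ordering shown on the slide (even-indexed states first, then odd-indexed) can be sketched as below. The reading that this lines up the two predecessor metrics of each new state in the same vector lane is an interpretation, and `acs_vector` with its branch-metric arrays is a hypothetical helper added to show the effect.

```c
#include <stddef.h>

/* Gather even-indexed entries into the first half of y and odd-indexed
 * entries into the second half. n must be even. */
static void reorder_even_odd(const int *x, int *y, size_t n)
{
    size_t half = n / 2;
    for (size_t j = 0; j < half; ++j) {
        y[j]        = x[2 * j];
        y[half + j] = x[2 * j + 1];
    }
}

/* Lane-wise add-compare-select on the re-ordered metrics: lane j compares
 * the even entry y[j] with the odd entry y[half + j]. The branch metrics
 * bm_even and bm_odd are hypothetical inputs of this sketch. */
static void acs_vector(const int *y, const int *bm_even, const int *bm_odd,
                       int *out, size_t half)
{
    for (size_t j = 0; j < half; ++j) {
        int a = y[j] + bm_even[j];
        int b = y[half + j] + bm_odd[j];
        out[j] = (a < b) ? a : b;
    }
}
```

After the re-ordering, every lane reads its operands at a fixed stride, so the same instruction sequence can run on all clusters without per-lane index computation.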
Talk Outline
Research vision
SWAPs - Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
Designing the SWAP architecture
More clusters are better than more ALUs per cluster
1. Decide how many clusters
Exploit DP
2. Decide what to put within each cluster
Maximize ILP with high functional unit efficiency
Search design space with “explore” tool
Time-power-area characterization
[Figure: clusters of ALUs (+, *, ?); ILP within a cluster, DP across clusters]
Design a SWAP cluster: “Explore”
Auto-exploration of adders and multipliers for “ACS”
[Figure: instruction count (40 to 160) versus #adders (1 to 5) and #multipliers (1 to 5); each point is labeled (adder util%, multiplier util%)]
“Explore” tool benefits
Instruction count vs. ALU efficiency
What goes inside each cluster
Design customized application-specific units
Better performance with increased ALU utilization
Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters
Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters
Chosen architecture: 4 adders, 3 multipliers, 64 clusters
Explore multiple algorithms
Turn off functional units not in use for a given kernel
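The chosen architecture on this slide is the element-wise maximum of the two explored configurations. A small sketch of that rule, plus the units left idle for one algorithm, is shown here; the struct and function names are hypothetical and the "explore" tool itself is not reproduced.

```c
/* Resources of one explored configuration. */
struct cluster_cfg {
    int adders;       /* adders per cluster */
    int multipliers;  /* multipliers per cluster */
    int clusters;     /* number of clusters */
};

static int max_int(int a, int b) { return (a > b) ? a : b; }

/* Smallest configuration that covers both a and b. */
static struct cluster_cfg cover(struct cluster_cfg a, struct cluster_cfg b)
{
    struct cluster_cfg r;
    r.adders      = max_int(a.adders, b.adders);
    r.multipliers = max_int(a.multipliers, b.multipliers);
    r.clusters    = max_int(a.clusters, b.clusters);
    return r;
}

/* Units of `chosen` that a kernel needing `need` could switch off. */
static struct cluster_cfg unused(struct cluster_cfg chosen, struct cluster_cfg need)
{
    struct cluster_cfg r;
    r.adders      = chosen.adders - need.adders;
    r.multipliers = chosen.multipliers - need.multipliers;
    r.clusters    = chosen.clusters - need.clusters;
    return r;
}
```

With the slide's numbers, covering (3, 3, 32) and (4, 1, 64) gives (4, 3, 64), and Algorithm 2 leaves 2 multipliers per cluster idle.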
SWAP flexibility provides power savings
Multiple algorithms
Different ALU requirements
Different cluster requirements
Turning off ALUs
Use the right #ALUs for each kernel, from the static code schedule
Turning off clusters
Data is spread across the SRF of all clusters
Each cluster does not have access to the entire SRF
Next kernel may need data from the SRF of other clusters
Reconfiguration support needs to be provided
SWAPs provide cluster scaling
Use mux-demux buffers
Latency hidden - Minimal loss in performance
Can turn off clusters entirely
[Figure: mux-demux buffers between the SRF and the clusters]
Viterbi reconfiguration using SWAPs
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
[Figure: clusters along the DP axis; clusters not in use can be turned OFF]
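The three packets above fit a simple rule: a constraint-length-K code has 2^(K-1) trellis states, and the cluster counts on the slide correspond to four states per cluster. The sketch below encodes that rule; the factor of four is inferred from the three data points rather than stated in the talk, and the function names are hypothetical.

```c
/* Number of trellis states for constraint length k. */
static int viterbi_states(int k)
{
    return 1 << (k - 1);
}

/* Clusters to keep active for constraint length k, assuming four states
 * per cluster (inferred from the slide) and at most max_clusters. */
static int clusters_for_k(int k, int max_clusters)
{
    int c = viterbi_states(k) / 4;
    if (c < 1)
        c = 1;
    if (c > max_clusters)
        c = max_clusters;
    return c;
}
```

On a 64-cluster machine this gives 16, 64 and 4 active clusters for K = 7, 9 and 5, so the remaining clusters could be turned off packet by packet.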
Run-time SWAP flexibility
[Figure: execution time (cycles), split between clusters and memory, for 64-bit rate ½ packets: Packet 1 (K = 7), Packet 2 (K = 9), Packet 3 (K = 5); labels “Kernels (computation)” and “No data memory accesses”]
SWAP exploration for Viterbi decoding
[Figure: frequency needed to attain real-time (in MHz, 1 to 1000) versus number of clusters (1 to 100) for K = 9, K = 7 and K = 5; different SWAPs (without reconfiguration) compared with the same SWAP (with reconfiguration), a DSP, and the max-DP point]
Ideal C64x (w/o co-proc) needs ~200 MHz for real-time
SWAPs : Salient features
1-2 orders of magnitude better than a DSP
Any constraint length ⇒ 10 MHz at 128 Kbps
Same code for all constraint lengths: no need to re-compile or load other code, as long as the parallelism/cluster ratio is constant
Power savings due to dynamic cluster scaling
Expected SWAP power consumption
Power model based on [Khailany’03]
64 clusters and 1 multiplier per cluster:
0.13 micron, 1.2 V
Peak active power: ~9 mW at 1 MHz (DSP ~1 mW at 1 MHz)
Area: ~53.7 mm²
10 MHz, 128 Kbps with reconfiguration (DSP ~200 mW)
[Khailany’03] Brucek Khailany et al. Exploring the VLSI Scalability of Stream Processors, Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.
[Figure: power (in mW, 0 to 90) versus active clusters (max 64, 0 to 70)]
Viterbi    Clusters   Peak power
K = 9      64         ~90 mW
K = 7      16         ~28.57 mW
K = 5      4          ~13.8 mW
overhead   0          ~8.1 mW
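The table can be held as a small lookup in C, which also makes it easy to separate the fixed overhead from the part that depends on active clusters. The values are the approximate figures from the slide; treating the 0-cluster row as a fixed baseline to subtract is an assumption of this sketch, and all names are hypothetical.

```c
#include <stddef.h>

struct viterbi_power {
    int k;            /* constraint length */
    int clusters;     /* active clusters */
    double peak_mw;   /* approximate peak power from the slide, in mW */
};

static const struct viterbi_power POWER_TABLE[] = {
    {9, 64, 90.0},
    {7, 16, 28.57},
    {5, 4, 13.8},
};

/* Power with 0 active clusters, in mW (slide value). */
static const double OVERHEAD_MW = 8.1;

/* Return the table row for constraint length k, or NULL if not listed. */
static const struct viterbi_power *lookup_power(int k)
{
    size_t rows = sizeof POWER_TABLE / sizeof POWER_TABLE[0];
    for (size_t i = 0; i < rows; ++i)
        if (POWER_TABLE[i].k == k)
            return &POWER_TABLE[i];
    return NULL;
}

/* Peak power minus the fixed overhead, or -1.0 if k is not listed. */
static double cluster_power_mw(int k)
{
    const struct viterbi_power *row = lookup_power(k);
    return row ? row->peak_mw - OVERHEAD_MW : -1.0;
}
```

The K = 9 row matches the slide's ~9 mW at 1 MHz for all 64 clusters when the clock runs at 10 MHz.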
Multiuser Estimation-Detection+Decoding
Real-time target: 128 Kbps per user
[Figure: frequency needed to attain real-time (in MHz, 10 to 100000) versus number of clusters (1 to 100) for FAST, MEDIUM and SLOW cases; 32-user base-station, mobile and DSP points marked]
Ideal C64x (w/o co-proc) needs ~15 GHz for real-time
Expected SWAP power : base-station
32-user base-station with 3 X’s per cluster and 64 clusters:
0.13 micron, 1.2 V
Peak active power: ~18.19 mW for 1 MHz (increased X)
Area: ~93.4 mm²
Total peak base-station power consumption:
~18.19 W at 1 GHz for 32 users at 128 Kbps/user
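The step from ~18.19 mW at 1 MHz to ~18.19 W at 1 GHz is a linear scaling with clock frequency. The one-line helper below spells out that arithmetic; linear scaling at fixed voltage is an assumption of this sketch, and the function name is hypothetical.

```c
/* Peak power in mW for a given clock, assuming power scales linearly
 * with frequency at fixed voltage (an assumption of this sketch). */
static double peak_power_mw(double mw_per_mhz, double freq_mhz)
{
    return mw_per_mhz * freq_mhz;
}
```

`peak_power_mw(18.19, 1000.0)` is about 18190 mW, i.e. ~18.19 W, and the same rule gives 90 mW for the Viterbi-only design at 9 mW per MHz and 10 MHz.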
Talk Outline
Research vision
SWAPs - Background
Algorithm design for SWAPs
Architecture design for SWAPs
Current and Future Research Goals
Current research: Flexibility vs. performance
SWAPs: 128 Kbps at ~10-100 mW for Viterbi
Borrow DP from ASICs!
Suitable for base-stations
Flexibility more important than power
Suitable for mobile devices
Power constraints tighter
Can be customized for further power savings
Handset SWAPs (H-SWAPs)
Borrow task pipelining from ASICs!
Application-specific units and specialized comm. network
Handset SWAPs: H-SWAPs
Trade Data Parallelism for Task Pipelining
[Figure: SWAPs (max. clusters and reconfigure) use many +++*** clusters on one SRF for full DP; a SWAPlet limits the clusters (limited DP); H-SWAPs are a collection of customized SWAPlets, each with limited DP and its own ALU mix (+++*, ++*, ++++)]
Sample points in architecture exploration
[Figure: sample points. DSPs (1 cluster): ILP, subword. SWAPs (multiple clusters): ILP, subword, DP. H-SWAPs (optimized for handsets): ILP, subword, DP, task pipelining, custom ALUs]
Programmable solutions with increased customization
Performance, power benefits
Future research: Efficient algorithms
[Figure: multiple antenna systems. Blocks: multipath channel, channel estimator, equalizer, MRC, detector, demodulator, decoder, turbo equalizer, beam-forming, coherent STC, non-coherent STC]
Future research: Architectures
Generalized framework and tools for evaluating algorithm-architecture and area-time-power-flexibility trade-offs
Potential applications:
Image processing: cameras, with a variety of compression algorithms
Biomedical applications: hearing aids (DSP running on body heat*)
Sensor networks: compression of data before transmission
*Quote: Gene Frantz, TI Fellow
SWAPs: Flexibility, Performance, Power
Need flexible architectures for future wireless devices
Higher data rates, lower power, more complex algorithms
Rapid exploration for Scalable Wireless Application-specific Processors
Flexibility vs. performance trade-offs
SWAPs: flexibility, high performance and low power
Exploit data parallelism like ASICs
1-2 orders of magnitude better performance than DSPs
Turn off unused clusters and unused ALUs for low power