RICE UNIVERSITY
DSPs for 4G wireless systems
Sridhar Rajagopal, Scott Rixner,
Joseph R. Cavallaro and Behnaam Aazhang
This work has been supported by Nokia, TI, TATP and NSF
Motivation
[Figure: wireless mobile device containing an RF unit, A/D and D/A converters, and a programmable baseband communications processor]
• Mobile: switch between standards and between parameters
• Base-station: varying number of users with different parameters
Programmability - flexibility is good
The problem
Processor type   Algorithms      Data rate targets          Constraints
Mobile           W-CDMA, W-LAN   1 Mbps, 100 Mbps/#users    Time, power, area
Base-station     W-CDMA          4 Mbps                     Time, maybe area
Base-station     W-LAN           100 Mbps                   Time, maybe area
[Figure: GPP, DSP, FPGA and VLSI placed along performance and flexibility axes]
Best architecture under power and area constraints?
An approach for the solution
Algorithms well understood at data-flow level
Can design real-time systems in VLSI.
Pushing implementation higher in the chain
Current DSPs not powerful enough for our application
Use an architecture simulator to design our own
Proposed solution
Current solutions to meet real-time: racks of DSPs
Programmable processor for 4G wireless systems: < x cm by < x cm
Future wireless architectures:
• x = 2.5 (W-CDMA BS)
• x = 2.0 (W-LAN BS)
• x = 1.5 (mobile handset)
Past work
[Figure: system design timeline]
Areas: Algorithms, DSP, VLSI, FPGA, IMAGINE
Topics:
• Multiuser channel estimation, multiuser detection
• Task partitioning, parallelism, pipelining
• Conventional arithmetic, on-line arithmetic
• Architecture innovations, functional unit design and usage
Time axis: distant past, recent past, recent and near future
Contents
Motivation
The “Imagine” simulator
Parallel algorithms for estimation/detection/decoding
Performance comparisons and results
The IMAGINE architecture
[Figure: Imagine stream processor block diagram: host processor, stream controller, microcontroller, stream register file, network interface (to the network), ALU clusters 0 to 7, and a streaming memory system with four SDRAM banks]
Why IMAGINE simulator?
RSIM, SimpleScalar: GPP simulators
Great for media processing algorithms
Has a VLIW-based cluster, for DSP comparisons
A good base architecture: 1024-pt FFT
Processor type       Area      Time     Frequency   Power    Energy
Imagine [float]      2.5 cm²   7.4 µs   500 MHz     3.8 W    28 µJ
TI C6711 [float]     -         138 µs   150 MHz     1.3 W    180 µJ
TI C6411 [fixed]     -         40 µs    300 MHz     0.25 W   10 µJ
Virtex II [fixed]    -         2 µs     125 MHz     <1 W     <2 µJ
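The energy column of the 1024-point FFT comparison can be cross-checked as energy = power × time. The sketch below assumes the times are in microseconds and the energies in microjoules, which is the reading under which each row is self-consistent; the dictionary and function names are illustrative only.

```python
# Rows of the 1024-point FFT comparison: (name, time in microseconds, power in watts).
# Assumption: times are microseconds, so watts * microseconds gives microjoules.
# The Virtex II power is quoted as an upper bound (< 1 W), so 1.0 W is used here.
rows = [
    ("Imagine [float]", 7.4, 3.8),
    ("TI C6711 [float]", 138.0, 1.3),
    ("TI C6411 [fixed]", 40.0, 0.25),
    ("Virtex II [fixed]", 2.0, 1.0),
]


def energy_uj(time_us, power_w):
    """Energy in microjoules for a run of time_us microseconds at power_w watts."""
    return power_w * time_us


energies = {name: energy_uj(t, p) for name, t, p in rows}
for name, e in energies.items():
    print(f"{name:18s} {e:7.1f} uJ")
```

Under these assumptions the products land close to the quoted 28, 180, 10 and 2 (upper bound) figures.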
Simulator knobs that we can turn
Cycle-accurate simulator
Varying number of Functional units and their design
Varying memory, register sizes
Graphical tools to investigate FU utilization, bottlenecks, memory stalls, communication overhead …
Almost anything can be changed, some changes easier than others!
Caveats
Two-level C++ programming
StreamC:
• transfers streams of data between main memory and stream register file (SRF)
KernelC:
• transfers streams from the SRF to the ALU clusters
Code optimized to the number of ALU clusters and the size of the data
Compiler not yet fully developed
Typical workload representation (Base-station)
Wireless LAN: equalization (?), FFT, Viterbi decoding
W-CDMA: multiuser channel estimation, multiuser detection, Viterbi decoding
Advanced receiver schemes: turbo decoding, multiple antenna systems (MIMO)
Parallel estimation/detection/decoding
Multiuser estimation
• Replaced matrix inversion by gradient descent
Multiuser detection
• Parallel Interference Cancellation (PIC)
• Pipelined algorithm that avoids block-based detection
Viterbi decoding
• Trellis structures suited for decoding
• Register exchange for survivor memory
• No traceback latency
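The idea of replacing a matrix inversion by gradient descent can be sketched on a small real-valued system. This is a minimal illustration, not the estimator from the slides: the matrix `R`, the vector `b`, the step size `mu` and the iteration count are all hypothetical values chosen so the iteration converges.

```python
def matvec(M, v):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve_by_gradient_descent(R, b, mu=0.1, iters=200):
    """Approximate the solution of R x = b without forming R^-1.

    Each step moves x against the residual (R x - b). For a symmetric
    positive definite R this converges when mu < 2 / (largest eigenvalue).
    """
    x = [0.0] * len(b)
    for _ in range(iters):
        residual = [rx - bi for rx, bi in zip(matvec(R, x), b)]
        x = [xi - mu * ri for xi, ri in zip(x, residual)]
    return x


# Hypothetical 2x2 example; the exact solution is x = [1/11, 7/11].
R = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = solve_by_gradient_descent(R, b)
print(x)
```

Each iteration uses only multiply-accumulate operations, which is the kind of work that maps onto adder and multiplier units; no divider is needed.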
Estimation/Detection (64,32 sizes)
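Multiuser estimation (kernels 1, 2, 3):
$R_{bb} = R_{bb} - b_0 b_0^T + b_L b_L^T$
$R_{br} = R_{br} - r_0 b_0^H + r_L b_L^H$
$A = A - \mu (A R_{bb} - R_{br})$

Massaging matrices for detection (kernels 4, 5):
$L = \mathrm{Re}[A_1^H A_0]$
$C = \mathrm{Re}[A_0^H A_0 + A_1^H A_1 - \mathrm{diag}(A_0^H A_0 + A_1^H A_1)]$
$R = L^H$

Multiuser detection (kernels 6, 7):
$y_i = \mathrm{Re}[A_1^H x_{i-1} + A_0^H x_i]$
$d_i = \mathrm{sign}(y_i)$
$y_i = y_i - L x_{i-1} - C x_i - R x_{i+1}$
$d_i = \mathrm{sign}(y_i)$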
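One stage of parallel interference cancellation can be sketched as follows. This is a generic PIC stage under stated assumptions, not the kernels from the slides: `y` holds matched-filter outputs per bit and per user, `d_prev` holds the previous bit decisions, `C` is the same-bit cross-correlation with a zero diagonal, `L` couples a bit to the preceding one, and the transpose of `L` is assumed to couple it to the following one. All numeric values are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def pic_stage(y, d_prev, L, C):
    """Subtract the estimated interference from every bit, then re-decide.

    y[i] and d_prev[i] are per-user lists for bit i. All users and all bits
    are updated from the previous decisions only, so the work is parallel.
    """
    n_bits, n_users = len(y), len(y[0])
    zero = [0] * n_users
    Lt = [list(col) for col in zip(*L)]  # transpose of L (assumed coupling)
    d_new = []
    for i in range(n_bits):
        before = d_prev[i - 1] if i > 0 else zero
        after = d_prev[i + 1] if i + 1 < n_bits else zero
        interference = [
            a + b + c
            for a, b, c in zip(matvec(L, before), matvec(C, d_prev[i]), matvec(Lt, after))
        ]
        d_new.append([sign(yi - ii) for yi, ii in zip(y[i], interference)])
    return d_new


# Hypothetical two-user, three-bit example with a strong second user.
true_bits = [[1, -1], [-1, -1], [1, 1]]
C = [[0.0, 1.8], [1.8, 0.0]]
L = [[0.0, 0.3], [0.1, 0.0]]
y = [[-0.9, -7.5], [-3.0, -10.4], [2.5, 10.7]]  # matched-filter outputs for true_bits
d0 = [[sign(v) for v in row] for row in y]       # initial decisions; user 1, bit 0 is wrong
d1 = pic_stage(y, d0, L, C)
print(d0, d1)
```

In this example the weak user's first bit is decided wrongly by the matched filter alone and is corrected after one cancellation stage.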
[Figure: 16-state trellises over states X(0) to X(15): a. Unsuitable Trellis, b. Suitable Trellis, c. Shuffled Suitable Trellis]
Trellis for rate ½ code with K = 5
Upper bound on parallel clusters for good FU utilization: N/2^k
Maximum 8 parallel units for rate ½ with 16 states
Trellis structures for parallel Viterbi
Definition:
Let the set of next states reachable from a present state $p \in [1..N]$ be $\{m_p\}$, i.e. $p \to \{m_p\}$, where $\{m_p\}$ has $2^k$ elements and ‘k’ is the number of inputs at the encoder.
If, $\forall\, i, j \in [1..N]$, either $\{m_i\} = \{m_j\}$ or $\{m_i\} \cap \{m_j\} = \emptyset$,
the trellis is denoted as a “separable” or a “fast” trellis.
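The separability property can be checked mechanically. The sketch below assumes a plain shift-register encoder in which the next state is the present state shifted by k bits with the k input bits appended; that state-transition model, the function names and the 0-based state numbering are assumptions for illustration, not taken from the slides.

```python
def next_states(p, n_states, k):
    """Set of next states of present state p for k input bits per step (assumed model)."""
    return frozenset(((p << k) | u) % n_states for u in range(1 << k))


def trellis(n_states, k):
    """Next-state sets for every present state 0..n_states-1."""
    return [next_states(p, n_states, k) for p in range(n_states)]


def is_separable(next_sets):
    """True if any two next-state sets are either identical or disjoint."""
    return all(a == b or a.isdisjoint(b) for a in next_sets for b in next_sets)


rate_half = trellis(16, 1)         # 16 states, k = 1
rate_two_thirds = trellis(16, 2)   # 16 states, k = 2
groups_half = len(set(rate_half))
groups_two_thirds = len(set(rate_two_thirds))

# A hand-made counterexample: the two sets overlap without being equal.
not_separable = [frozenset({0, 1}), frozenset({1, 2})]

print(is_separable(rate_half), groups_half)
print(is_separable(rate_two_thirds), groups_two_thirds)
print(is_separable(not_separable))
```

Under this model the number of distinct next-state groups equals N/2^k, which matches the bound on parallel units: 8 for k = 1 and 4 for k = 2 with 16 states.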
[Figure: trellises from states X(0)..X(15) to states Y(0)..Y(15): a. Shuffled Suitable Trellis for ‘k=2’, b. Rearranged Shuffled Suitable Trellis for ‘k=2’]
Trellis for rate 2/3 code with K = 5
Upper bound on parallel clusters for good FU utilization: N/2^k
Maximum 4 parallel units for rate 2/3 with 16 states
(Having 8 will involve interprocessor communication overhead)
Survivor Management in Viterbi
Two techniques
• Traceback: commonly used
• Register exchange
Traceback is good for VLSI architectures, where the information bits can be decoded sequentially by proper survivor memory addressing
Drawback: Sequential and additional latency
Register exchange for decoding
Register for given node at given time contains information bits associated with surviving partial path that ends in that state
Survivors calculated in conjunction with path metrics.
Latency in conventional traceback is avoided.
Higher power consumption, as the entire survivor memory contents are updated for all states for each bit.
Suited to a parallel programmable implementation as storing bits in a register for traceback touches the previous survivors anyway
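Register exchange can be sketched with a small hard-decision Viterbi decoder in which every state carries a register holding the information bits of its surviving path, so no traceback pass is needed. The constraint length K = 5 follows the slides; the generator polynomials (23, 35 in octal), the Hamming metric and all function names are assumptions for illustration.

```python
K = 5                        # constraint length, as on the slides
N_STATES = 1 << (K - 1)      # 16 states
GENERATORS = (0o23, 0o35)    # assumed rate-1/2 generators, not taken from the slides


def parity(x):
    return bin(x).count("1") & 1


def branch_output(state, u):
    """Two coded bits produced when input bit u enters the encoder in this state."""
    reg = (u << (K - 1)) | state
    return [parity(reg & g) for g in GENERATORS]


def encode(bits):
    state, out = 0, []
    for u in bits:
        out.extend(branch_output(state, u))
        state = ((u << (K - 1)) | state) >> 1
    return out


def decode(received):
    """Viterbi decoding with register exchange (hard decisions, Hamming metric)."""
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)       # encoder starts in state 0
    registers = [[] for _ in range(N_STATES)]    # one survivor register per state
    for t in range(0, len(received), 2):
        pair = received[t:t + 2]
        new_metrics = [inf] * N_STATES
        new_registers = [[] for _ in range(N_STATES)]
        for nxt in range(N_STATES):
            u = nxt >> (K - 2)                   # input bit that leads into nxt
            for low_bit in (0, 1):
                prev = ((nxt << 1) & (N_STATES - 1)) | low_bit
                if metrics[prev] == inf:
                    continue
                dist = sum(e != r for e, r in zip(branch_output(prev, u), pair))
                metric = metrics[prev] + dist
                if metric < new_metrics[nxt]:
                    new_metrics[nxt] = metric
                    # Exchange: copy the winner's register and append the new bit.
                    new_registers[nxt] = registers[prev] + [u]
        metrics, registers = new_metrics, new_registers
    best = min(range(N_STATES), key=lambda s: metrics[s])
    return registers[best]


message = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
coded = encode(message)
corrupted = list(coded)
corrupted[5] ^= 1            # flip one channel bit
decoded_clean = decode(coded)
decoded_noisy = decode(corrupted)
print(decoded_clean == message, decoded_noisy == message)
```

The decoded bits are read directly from the register of the best final state. The cost is that every register is rewritten at every step, which is the higher memory traffic noted above.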
Lower bounds on + and *
[Figure: adders/multipliers required to meet real-time (log scale, 10^0 to 10^3) versus number of users (0 to 300), for estimation, detection and decoding in a W-CDMA multiuser system. Curves for adders and multipliers under slow fading (estimation every 1000 bits), medium fading (estimation every 100 bits) and fast fading (estimation every 10 bits); annotation: data rates]
Kernel 2 (mmult) for 3 +, 2 *
Adders have limited FU utilization
O(N^3) *, O(N^3) +
Multipliers 100% in loop
Divider not being utilized
Replace / with *
[Figure: FU schedule over time with the loop marked; legend: communication (waiting for input), FU unavailable (input ready but FU busy)]
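Why moving from 2 to 3 multipliers is expected to give about 1.5x on a matrix multiply can be seen from a simple operation-count bound. The sketch assumes a schoolbook n x n product and one operation per functional unit per cycle; the matrix size and the function names are hypothetical.

```python
import math


def mmult_op_counts(n):
    """(multiplies, adds) for a schoolbook n x n by n x n matrix product."""
    return n ** 3, n ** 2 * (n - 1)


def cycle_lower_bound(mults, adds, n_mul, n_add):
    """Cycles needed if each unit completes one operation per cycle (assumption)."""
    return max(math.ceil(mults / n_mul), math.ceil(adds / n_add))


mults, adds = mmult_op_counts(32)                      # hypothetical size
cycles_3add_2mul = cycle_lower_bound(mults, adds, n_mul=2, n_add=3)
cycles_3add_3mul = cycle_lower_bound(mults, adds, n_mul=3, n_add=3)
expected_speedup = cycles_3add_2mul / cycles_3add_3mul
print(cycles_3add_2mul, cycles_3add_3mul, round(expected_speedup, 3))
```

With roughly equal numbers of multiplies and adds, two multipliers are the bottleneck while three adders sit partly idle; a third multiplier balances the two and the bound improves by close to 3/2.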
Kernel 2 (mmult) for 3 +, 3 *
better adder utilization
needs sufficient registers for scaling [register allocation may fail]
code may also need slight tuning of variables for optimization
Kernel computational time
Algorithm       Kernel  FU utilization*  Execution time  FU utilization*  Execution time  Performance improvement
                        (3 +, 2 *)       (cycles)        (3 +, 3 *)       (cycles)        (expected: 1.5)
Estimate        1       70%, 100%        1224            78.6%, 78%       1064            1.15
                2       53%, 91%         22720           85%, 99%         14360           1.5822
                3       55%, 42%         1058            IN/OUT           1058            1
                Total                                                     14464
Glue matrices   4       59%, 91%         7468            78%, 84%         5573            1.341
                5       63%, 96%         12192           68%, 71%         11084           1.1
                Total                                                     16657
Detect          6       67%, 100%        366             90%, 89.6%       275             1.33
                7       67%, 96%         996             89%, 84.2%       760             1.31
                Total                                                     1035
Decode*         8       70%, 10%         32576           -                32576           -
Time available at 128 Kbps for each of 32 users at 500 MHz : 4000 cycles
*Numbers subject to change
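The cycle budget and the improvement column can be recomputed from the quoted figures. This is a small arithmetic sketch; the variable names are illustrative, and the cycle counts are copied from the kernel table (kernels 1 to 7, which list both configurations).

```python
CLOCK_HZ = 500e6        # 500 MHz clock
BIT_RATE = 128e3        # 128 Kbps per user

# Cycles available per bit period; the slide quotes roughly 4000.
cycles_per_bit = CLOCK_HZ / BIT_RATE

# kernel -> (cycles with 3 adders, 2 multipliers; cycles with 3 adders, 3 multipliers)
kernel_cycles = {
    1: (1224, 1064),
    2: (22720, 14360),
    3: (1058, 1058),
    4: (7468, 5573),
    5: (12192, 11084),
    6: (366, 275),
    7: (996, 760),
}

speedups = {k: before / after for k, (before, after) in kernel_cycles.items()}
glue_total = kernel_cycles[4][1] + kernel_cycles[5][1]
detect_total = kernel_cycles[6][1] + kernel_cycles[7][1]

print(cycles_per_bit)
for k, s in speedups.items():
    print(k, round(s, 3))
print(glue_total, detect_total)
```

Only the matrix-multiply kernel (kernel 2) reaches the expected factor of about 1.5; the other kernels gain less from the extra multiplier.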
Communication overhead
[Figure: execution timeline showing kernels (micro-controller executing), memory operations, initialization, and idle time between kernels]
Comparisons with TI C6701 DSPs
[Figure: execution time (in seconds, log scale, 10^-6 to 10^-2) versus users (0 to 35). Curves: single DSP implementation, 2 DSP implementation, target data rate - 128 Kbps/user, our architecture based on Imagine]
Kernel comparisons
[Figure: execution time (log scale, 10^-7 to 10^-1) for kernels 1; 2,3; 4,5; 6; 7. Series: IMAGINE, TI C67 with internal memory, TI C67 with external memory]
Conclusions
Various programmable architectures can be investigated quickly for 4G systems, depending on the algorithms and on time, area and power constraints.
Enormous potential for 4G system prototyping.
Programmable baseband processor design with broad system functionality, flexibility and low power consumption, allowing a smooth and fast transition from 2G to 3G to 4G systems.
Future work
Investigating bottlenecks, functional unit design and other innovations needed to attain real-time
Power and area constraints
Scalability with data rates
Handset algorithms
The insights gained from the design can also be applied to DSPs and other processors.