Implementing Multiuser Channel Estimation and Detection for W-CDMA
Sridhar Rajagopal, Srikrishna Bhashyam,
Joseph R. Cavallaro and Behnaam Aazhang
Rice University
{sridhar,skrishna,cavallar,aaz}@rice.edu
This work is supported by Nokia, Texas Instruments, Texas Advanced Technology Program and NSF
Organization
Joint Estimation & Detection
An Implementation-Friendly Scheme
Simulations
Architectural Features
– Task Partitioning
– Area-Time Tradeoffs
Conclusions
Future Work
Base-Station with MUD
[Figure: base-station receiver for multiple users: antenna, demodulator, pilot/data MUX, channel estimation, multiuser detection, decoder and detected bits, with decision feedback through a delay to the channel estimator]
Joint Estimation & Detection
Jointly estimate the channel response and detect all the users' bits.
Shown to have better performance as well as reduced computational complexity.
Maximum-Likelihood-Based Channel Estimation
– [C.Sengupta et al. : PIMRC'1998, WCNC'1999]
Differencing Multistage Detection based on Parallel Interference Cancellation
– [G.Xu et al. : SPIE'1999]
Computations Involved
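Model
$r_i = A_i b_i + \eta_i$
– $b_i \in \{-1,+1\}^{2K}$: bits of the K asynchronous users aligned at times i and i-1
– $r_i \in \mathbb{C}^N$: received vector of spreading length N for the K users; $\eta_i$: noise
[Figure: bits b_i and b_{i-1} of each user, offset by its delay, along the time axis]
Compute Correlation Matrices
$R_{br} = \sum_i b_i r_i^H$, $R_{bb} = \sum_i b_i b_i^T$ (sums over the estimation window)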
Multishot Detection
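$r = A b + \eta$, with $r = [r_0^T \; r_1^T \cdots r_{D-1}^T]^T$ and $b = [b_{1,0} \cdots b_{K,0} \;\cdots\; b_{1,D-1} \cdots b_{K,D-1}]^T$ over a detection window of D bits
$A = \begin{bmatrix} A_0 & & & \\ A_1 & A_0 & & \\ & \ddots & \ddots & \\ & & A_1 & A_0 \end{bmatrix} \in \mathbb{C}^{ND \times KD}$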
$A_i = [A_0 \; A_1] \in \mathbb{C}^{N \times 2K}$
Solve for the channel estimate $A_i$: $R_{bb} A_i^H = R_{br}$
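As a minimal real-valued sketch of the inversion-based estimate (assumptions: a real channel, so $A^H$ reduces to $A^T$; a noise-free preamble; i.i.d. ±1 entries in each $b_i$; `solve` and `ml_channel_estimate` are hypothetical helper names), the code forms $R_{bb}$ and $R_{br}$ and solves $R_{bb} A^T = R_{br}$ by Gauss-Jordan elimination. This is exactly the matrix inversion/decomposition step listed among the drawbacks below.

```python
import random

def solve(M, B):
    """Solve M X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(v) for v in M[i]] + [float(v) for v in B[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))  # largest pivot in column c
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for i in range(n):
            if i != c:
                f = aug[i][c]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[c])]
    return [row[n:] for row in aug]

def ml_channel_estimate(bits, received):
    """Return A^T solving R_bb A^T = R_br (real-valued stand-in for R_bb A^H = R_br)."""
    twoK, N = len(bits[0]), len(received[0])
    Rbb = [[sum(b[j] * b[k] for b in bits) for k in range(twoK)] for j in range(twoK)]
    Rbr = [[sum(b[j] * r[n] for b, r in zip(bits, received)) for n in range(N)]
           for j in range(twoK)]
    return solve(Rbb, Rbr)

# Usage: with a noise-free preamble the estimate reproduces the channel.
rng = random.Random(0)
N, twoK, L = 8, 4, 50
A = [[rng.gauss(0, 1) for _ in range(twoK)] for _ in range(N)]
bits = [[rng.choice((-1, 1)) for _ in range(twoK)] for _ in range(L)]
received = [[sum(A[n][k] * b[k] for k in range(twoK)) for n in range(N)] for b in bits]
A_hat_T = ml_channel_estimate(bits, received)
```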
Differencing Multistage Detection
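Stage 0 [Matched Filter Detector]
$y_0 = A^H r$, $d_0 = \mathrm{sign}(\mathrm{Re}[y_0])$
Stage 1 [to build the differencing vector]
$y_1 = y_0 - (A^H A - S)\, d_0$, $d_1 = \mathrm{sign}(\mathrm{Re}[y_1])$
Successive Stages
$x_l = d_l - d_{l-1}$, $y_{l+1} = y_l - (A^H A - S)\, x_l$, $d_{l+1} = \mathrm{sign}(\mathrm{Re}[y_{l+1}])$
$S = \mathrm{diag}(A^H A)$
y - soft decision
d - detected bits (hard decision)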
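The recursion above can be checked with a small sketch (assumptions: a real-valued, integer test channel stands in for the complex $A$ so every quantity is computed exactly; `pic` and `differencing_pic` are hypothetical names). Because $y_{l+1}$ telescopes to $y_0 - (A^H A - S)\, d_l$, the differencing form gives the same decisions as plain parallel interference cancellation, but after the first stage it only touches the columns where a decision flipped.

```python
import random

def sign(v):
    """Hard decision on a real soft-decision vector (ties go to +1)."""
    return [1 if x >= 0 else -1 for x in v]

def pic(C, y0, stages):
    """Plain multistage PIC: y_{l+1} = y_0 - (C - S) d_l, with S = diag(C)."""
    K = len(y0)
    d = sign(y0)
    history = [d]
    for _ in range(stages):
        y = [y0[j] - sum(C[j][k] * d[k] for k in range(K) if k != j) for j in range(K)]
        d = sign(y)
        history.append(d)
    return history

def differencing_pic(C, y0, stages):
    """Differencing form: x_l = d_l - d_{l-1}, y_{l+1} = y_l - (C - S) x_l."""
    K = len(y0)
    y, d_prev, d = list(y0), [0] * K, sign(y0)   # d_{-1} = 0, so x_0 = d_0
    history = [d]
    for _ in range(stages):
        x = [a - b for a, b in zip(d, d_prev)]
        # x_0 = d_0 is dense; later x_l is nonzero only where a decision flipped
        changed = [k for k in range(K) if x[k] != 0]
        y = [y[j] - sum(C[j][k] * x[k] for k in changed if k != j) for j in range(K)]
        d_prev, d = d, sign(y)
        history.append(d)
    return history

# Usage with an integer-valued real channel so both forms are computed exactly.
rng = random.Random(1)
N, K = 16, 6
A = [[rng.randint(-3, 3) for _ in range(K)] for _ in range(N)]
b = [rng.choice((-1, 1)) for _ in range(K)]
r = [sum(A[n][k] * b[k] for k in range(K)) + rng.randint(-4, 4) for n in range(N)]
C = [[sum(A[n][j] * A[n][k] for n in range(N)) for k in range(K)] for j in range(K)]  # A^H A
y0 = [sum(A[n][k] * r[n] for n in range(N)) for k in range(K)]                         # A^H r
```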
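Structure of A^H A
$A^H A = \begin{bmatrix} A_0^H A_0 + A_1^H A_1 & A_1^H A_0 & & \\ A_0^H A_1 & A_0^H A_0 + A_1^H A_1 & \ddots & \\ & \ddots & \ddots & A_1^H A_0 \\ & & A_0^H A_1 & A_0^H A_0 \end{bmatrix}$, a $KD \times KD$ matrix
Not difficult to compute A^H A: A is block bi-diagonal, so A^H A is block tri-diagonal and built from only three distinct K × K products ($A_0^H A_0$, $A_1^H A_1$, $A_0^H A_1$). Use this structure.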
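A small sketch of exploiting that structure (assumptions: real integer blocks stand in for complex $A_0$, $A_1$; `matmul_tn`, `multishot_A` and `gram_from_blocks` are hypothetical names): it builds the full multishot $A$ for comparison, then assembles $A^H A$ from the three $K \times K$ products alone and checks that the two agree.

```python
import random

def matmul_tn(P, Q):
    """Return P^T Q for real matrices stored as lists of rows."""
    return [[sum(P[n][j] * Q[n][k] for n in range(len(P))) for k in range(len(Q[0]))]
            for j in range(len(P[0]))]

def multishot_A(A0, A1, D):
    """Full ND x KD multishot matrix: A0 on the block diagonal, A1 just below it."""
    N, K = len(A0), len(A0[0])
    A = [[0] * (K * D) for _ in range(N * D)]
    for blk in range(D):
        for n in range(N):
            for k in range(K):
                A[blk * N + n][blk * K + k] = A0[n][k]
                if blk + 1 < D:
                    A[(blk + 1) * N + n][blk * K + k] = A1[n][k]
    return A

def gram_from_blocks(A0, A1, D):
    """A^T A assembled from A0^T A0, A1^T A1 and A0^T A1 only (block tri-diagonal)."""
    K = len(A0[0])
    G00, G11, G01 = matmul_tn(A0, A0), matmul_tn(A1, A1), matmul_tn(A0, A1)
    G = [[0] * (K * D) for _ in range(K * D)]
    for blk in range(D):
        last = blk + 1 == D
        for j in range(K):
            for k in range(K):
                # diagonal: A0^T A0 + A1^T A1, except the last block (A0^T A0 only)
                G[blk * K + j][blk * K + k] = G00[j][k] + (0 if last else G11[j][k])
                if not last:
                    G[(blk + 1) * K + j][blk * K + k] = G01[j][k]  # A0^T A1 below
                    G[blk * K + j][(blk + 1) * K + k] = G01[k][j]  # A1^T A0 above
    return G

rng = random.Random(2)
N, K, D = 5, 3, 4
A0 = [[rng.randint(-2, 2) for _ in range(K)] for _ in range(N)]
A1 = [[rng.randint(-2, 2) for _ in range(K)] for _ in range(N)]
A = multishot_A(A0, A1, D)
```

Only three $K \times K$ products are formed instead of the full $KD \times KD$ product, which is why the per-window cost stays at $O(K^2 N)$.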
Drawbacks
Matrix Inversion / Decomposition Needed
Result not available until the end of the computation
– Delay before Detection
Difficult for Tracking
Higher Precision Needed
– Floating-Point Units
Larger Memory Requirements
– Storage of elements to compute the inverse
– Float = 32 bits / Input accuracy = 12-14 bits
SLOW! Difficult to meet Real-Time requirements
– [S.Rajagopal et al. : TI DSPFest'1999]
Proposed Base-Station
No Multiuser Detection
TI's Wireless Basestation (http://www.ti.com/sc/docs/psheets/diagrams/basestat.htm)
New Scheme
Iterative Method to find the Channel Estimates
– [S.Bhashyam et al. : WCNC’2000 (submitted)]
Can be easily adapted to Tracking for Fading Channels
Fixed Point Implementation
Estimates ready for detection Immediately
Simpler Hardware and Software.
– Computation Savings only Per Bit
Iterative Scheme
Tracking – Slow Fading : Large Window L
– Fast Fading : Smaller Window L
Method of Steepest Descent
Stable convergence behavior
μ fixed : Bit-by-Bit update
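Closely Matches the Scheme with Inversions
Sliding window of L bits: when b_L enters the window, b_0 leaves it
$R_{bb} \leftarrow R_{bb} + b_L b_L^T - b_0 b_0^T$
$R_{br} \leftarrow R_{br} + b_L r_L^H - b_0 r_0^H$
$A_i^H = A_{i-1}^H - \mu\,(R_{bb} A_{i-1}^H - R_{br})$ (one step per bit, driving $R_{bb} A^H$ toward $R_{br}$)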
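A minimal sketch of the bit-by-bit tracking loop, under stated assumptions: a real, static, noise-free channel; i.i.d. ±1 bits (in the model, the second half of $b_i$ actually repeats earlier bits); and a hypothetical helper `track_channel`. The step size $\mu = 1/(2K \cdot L)$ is our choice, not the slides': every eigenvalue of $R_{bb}$ is at most $2K \cdot L$ (Gershgorin), so this fixed $\mu$ keeps steepest descent stable.

```python
import random
from collections import deque

def track_channel(A_true, num_bits, L, seed=0):
    """Sliding-window steepest descent for A^T (real-valued, noise-free sketch)."""
    rng = random.Random(seed)
    N, twoK = len(A_true), len(A_true[0])
    mu = 1.0 / (twoK * L)  # Gershgorin: lambda_max(R_bb) <= 2K*L, so mu*lambda_max <= 1
    Rbb = [[0] * twoK for _ in range(twoK)]
    Rbr = [[0.0] * N for _ in range(twoK)]
    X = [[0.0] * N for _ in range(twoK)]  # current estimate of A^H (= A^T here)
    window = deque()

    def accumulate(b, r, sgn):
        # sgn = +1 adds b b^T and b r^T to the window sums; sgn = -1 removes them
        for j in range(twoK):
            for k in range(twoK):
                Rbb[j][k] += sgn * b[j] * b[k]
            for n in range(N):
                Rbr[j][n] += sgn * b[j] * r[n]

    for _ in range(num_bits):
        b = [rng.choice((-1, 1)) for _ in range(twoK)]  # known / fed-back bits
        r = [sum(A_true[n][k] * b[k] for k in range(twoK)) for n in range(N)]
        window.append((b, r))
        accumulate(b, r, +1)                   # + b_L b_L^T, + b_L r_L^T
        if len(window) > L:
            accumulate(*window.popleft(), -1)  # - b_0 b_0^T, - b_0 r_0^T
        if len(window) == L:
            # One fixed-mu step per bit: X <- X - mu (R_bb X - R_br)
            grad = [[sum(Rbb[j][m] * X[m][n] for m in range(twoK)) - Rbr[j][n]
                     for n in range(N)] for j in range(twoK)]
            X = [[X[j][n] - mu * grad[j][n] for n in range(N)] for j in range(twoK)]
    return X

rng = random.Random(3)
A = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]  # N = 8, 2K = 4
A_hat_T = track_channel(A, num_bits=450, L=32)
```

A larger L averages more bits per step (slow fading); a smaller L forgets old bits faster (fast fading), at the price of a noisier $R_{bb}$ once noise is present.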
Simulations - AWGN Channel
[Figure: comparison of BER using channel estimates by inversion and by iteration; BER (10^-3 to 10^-1) vs. SNR (4 to 12); curves MF, ActMF, ML, ActML]
Detection Window = 12
SINR = 0
Paths = 3
Preamble = 150
10000 bits/user
MF – Matched Filter
ML – Maximum Likelihood
Act – using inversion
Fading Channel with Tracking
[Figure: BER (10^-3 to 10^0) vs. SNR (4 to 12); curves MF - Static, MF - Tracking, ML - Static, ML - Tracking]
Doppler = 10 Hz, 1000 Bits, 15 users, 3 Paths
DSP Implementation
Texas Instruments C6201
– Fixed-Point Processor
– 200 MHz
32-bit VLIW Architecture
8 Functional Units
– 2 Multipliers
– 4 Adders
– 2 Load/Store
TI C Compiler
Simulation
Work in Progress!
Why Is Fixed Point Better?
– Faster on DSPs
– Higher Clock Speeds / Faster Multiplications
– More SIMD Parallelism due to Smaller Word Length
– Software Code Simpler to Write
– Smaller Program Size
Problems
– Input Bit Precision Analysis
– Overflows
Task Partitioning of the Algorithm
[Figure: the base-station receiver diagram (antenna, demodulator, channel estimation, multiuser detection, decoder, decision feedback) repeated to show where the algorithm is partitioned]
Task Decomposition
[Figure: task flow graph along a TIME axis, Blocks I to IV, split into Task A (Channel Estimation) and Task B (Multistage Detection); pilot/data MUX inputs, bits b and decisions d; S.Das et al : Asilomar'99]
Correlation Matrices (Per Bit): Rbr[R], Rbr[I] O(KN); Rbb O(K^2)
Iterate: A[R], A[I] O(K^2 N)
Matrix Products: A0^H A0, A1^H A1, A0^H A1 O(K^2 N); A^H r O(KND)
Multistage Detection (Per Window): O(DK^2 M)
Channel Estimation Architecture
Detection Architecture: one version already available
– [G.Xu - Master’s Thesis 1999]
Advantages over DSP Implementation:
– Optimal Memory Utilization
– Custom Blocks for exploiting available pipelining and parallelism
– Parts could be mapped to FPGA / Reconfigurable logic
– Shows theoretical bounds for maximum achievable Data Rates
– Shows how tasks could be split among different processors
Block Diagram
[Figure: channel estimation datapath with real (R) and imaginary (I) halves. Blocks and their operation counts: bb^T and b0b0^T (2K^2), MUX / Inverter (2K^2), Rbb (2K^2), Rbr[R], Rbr[I] (KN), Multiplier (2K^2 N), Atmp >> (4K^2), A[R], A[I] (KN); inputs b, b0 (2K, 1-bit) and r, r0 (N, 8-bit) from the window buffer]
Each block shows the number of "operations" in it.
Channel Estimation
[Figure: real (R) half of the channel estimation datapath with the window buffer; 1-bit and 8-bit paths]
$R_{bb} \leftarrow R_{bb} + b_L b_L^T - b_0 b_0^T$
$R_{br} \leftarrow R_{br} + b_L r_L^H - b_0 r_0^H$
$A_i^H = A_{i-1}^H - \mu\,(R_{bb} A_{i-1}^H - R_{br})$
Auto-correlation Structure
$R_{bb} \leftarrow R_{bb} + b_L b_L^T - b_0 b_0^T$
[Figure: bb^T and b0b0^T (2K^2) → MUX / Inverter (2K^2) → Rbb (2K^2)]
• b, b0 are 1-bit
• Subtraction by using an inverter
• Rbb using a counter
• Fully Parallel: 2K^2 elements, O(1) time
• Pipelined [with LOAD]: 2K elements, O(K) time
• Serial [with LOAD]: 1 element, O(2K^2) time
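A sketch of the counter-based $R_{bb}$ update, assuming the bit encoding 1 ↔ +1, 0 ↔ −1 (our assumption; the slides do not give the encoding, and the gate choice in the table may reflect a different one). Under this encoding each ±1 product is +1 exactly when the two stored bits agree, so every entry of $R_{bb}$ is an up/down counter whose net step per bit is 0 or ±2. `rbb_counter_update` and `rbb_direct` are hypothetical names.

```python
import random

def rbb_counter_update(counters, b_new, b_old):
    """R_bb <- R_bb + b_L b_L^T - b_0 b_0^T using 1-bit inputs and up/down counters."""
    n = len(b_new)
    for j in range(n):
        for k in range(n):
            up = 1 if b_new[j] == b_new[k] else -1     # new outer-product term (XNOR)
            down = 1 if b_old[j] == b_old[k] else -1   # term leaving the window
            counters[j][k] += up - down                # net step is 0 or +/-2

def rbb_direct(window):
    """Reference: sum of b b^T over the window with stored bits mapped to +/-1."""
    n = len(window[0])
    s = [[2 * b - 1 for b in bits] for bits in window]
    return [[sum(v[j] * v[k] for v in s) for k in range(n)] for j in range(n)]

# Usage: slide a window of L bit-vectors and compare with direct recomputation.
rng = random.Random(4)
twoK, L = 6, 10
stream = [[rng.randint(0, 1) for _ in range(twoK)] for _ in range(40)]
counters = rbb_direct(stream[:L])   # initial LOAD of the counters
for t in range(L, len(stream)):
    rbb_counter_update(counters, stream[t], stream[t - L])
```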
Cross-Correlation Structure
$R_{br} \leftarrow R_{br} + b_L r_L^H - b_0 r_0^H$
[Figure: MUX (N), MUX (2K) and Inverter (2K) → Rbr[R] (KN)]
• r is 8-bit, b is 1-bit
• Rbr using 8-bit adders
• Based on the sign of b
• Fully Parallel: KN, O(1)
• Pipelined: N, O(K)
• Serial: 1, O(KN)
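A sketch of the multiplier-free $R_{br}$ update for the real (R) part, assuming the same 1 ↔ +1, 0 ↔ −1 bit encoding and signed 8-bit samples. The sign of b selects either r or its negation (the inverter), so only adders are needed; accumulator word length and overflow are not modeled. `rbr_adder_update` and `rbr_direct` are hypothetical names.

```python
import random

def rbr_adder_update(Rbr, b_new, r_new, b_old, r_old):
    """R_br <- R_br + b_L r_L^T - b_0 r_0^T using only add/subtract (no multiplier)."""
    for k in range(len(b_new)):
        for n in range(len(r_new)):
            Rbr[k][n] += r_new[n] if b_new[k] else -r_new[n]  # b = +1: add, b = -1: negate
            Rbr[k][n] -= r_old[n] if b_old[k] else -r_old[n]  # remove the oldest term

def rbr_direct(bits, samples):
    """Reference: sum of b r^T with stored bits mapped to +/-1 (uses multiplies)."""
    return [[sum((2 * b[k] - 1) * r[n] for b, r in zip(bits, samples))
             for n in range(len(samples[0]))] for k in range(len(bits[0]))]

rng = random.Random(5)
twoK, N, L, T = 4, 8, 10, 30
bits = [[rng.randint(0, 1) for _ in range(twoK)] for _ in range(T)]
samples = [[rng.randint(-128, 127) for _ in range(N)] for _ in range(T)]  # 8-bit r
Rbr = rbr_direct(bits[:L], samples[:L])
for t in range(L, T):
    rbr_adder_update(Rbr, bits[t], samples[t], bits[t - L], samples[t - L])
```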
Iterative Update Structure
$A_i^H = A_{i-1}^H - \mu\,(R_{bb} A_{i-1}^H - R_{br})$
[Figure: Rbb (2K^2) and Rbr[R] (KN) → Multiplier (2K^2 N) → Atmp[R] >> (4K^2) → A[R] (KN)]
• 8-bit multipliers
• 16-bit adders for the multiplier
• 8-bit adders for A
• Parallel: KN, O(K)
• Pipelined: N, O(K^2)
• Serial: 1, O(K^2 N)
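The ">>" block suggests that $\mu$ is applied as a right shift; the sketch below assumes exactly that ($\mu = 2^{-S}$ with $2^S = 2K \cdot L$) and stores $A^H$ (real part) as integers scaled by $2^F$. The scaling, the choice of S, the integer test channel and the name `fixed_point_update` are all our assumptions; the 8-bit and 16-bit word lengths are not modeled.

```python
import random

def fixed_point_update(X, Rbb, Rbr, frac_bits, shift):
    """One integer step X <- X - ((R_bb X - (R_br << frac_bits)) >> shift).

    X holds A^H (real part) scaled by 2**frac_bits; mu = 2**-shift is applied with
    an arithmetic right shift instead of a multiplier.
    """
    twoK, N = len(X), len(X[0])
    grad = [[sum(Rbb[j][m] * X[m][n] for m in range(twoK)) - (Rbr[j][n] << frac_bits)
             for n in range(N)] for j in range(twoK)]
    return [[X[j][n] - (grad[j][n] >> shift) for n in range(N)] for j in range(twoK)]

rng = random.Random(6)
N, twoK, L, F = 8, 4, 32, 8
S = (twoK * L).bit_length() - 1     # 2**S = 2K*L bounds lambda_max(R_bb)
A = [[rng.randint(-3, 3) for _ in range(twoK)] for _ in range(N)]  # integer test channel
bits = [[rng.choice((-1, 1)) for _ in range(twoK)] for _ in range(L)]
r = [[sum(A[n][k] * b[k] for k in range(twoK)) for n in range(N)] for b in bits]
Rbb = [[sum(b[j] * b[k] for b in bits) for k in range(twoK)] for j in range(twoK)]
Rbr = [[sum(b[j] * rv[n] for b, rv in zip(bits, r)) for n in range(N)] for j in range(twoK)]
X = [[0] * N for _ in range(twoK)]
for _ in range(300):
    X = fixed_point_update(X, Rbb, Rbr, F, S)
A_hat_T = [[round(x / 2 ** F) for x in row] for row in X]
```

The floor in the shift leaves a small residual error of a few least-significant bits, which is why the result is rounded back to the integer test channel.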
Elements in each block
Block | Requires | Area-Time Tradeoff | Fully Parallel Implementation
bb^T, b0b0^T | 1-bit AND gates | 2K^2 | 2K^2
Rbb | 8-bit UP/DOWN counters | 2K [with LOAD] | 2K^2
Rbr[R,I] | 8-bit adders | 2N | 4KN
Y[R,I] | 8-bit adders | 4K | 4KN
Multiplier[R,I] | 8-bit multipliers / 16-bit adders | 4K / 4K | 4KN / 4K
Window buffer | 1-bit shift registers / 8-bit shift registers | L / 2L | L / 2L
Atmp[R,I] | 8-bit subtractors | 2K | 4KN
TIME | | O(K^2) | O(K)
Example: N = 32, L = 100, K = 32
Fully Parallel Solution: ~4,000 Multipliers (4KN = 4096), ~12,000 Adders: O(K) = O(32) Time
Pipelined Solution: ~100 Multipliers, ~300 Adders: O(K^2) ≈ O(1,000) Time
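A quick check of the example's arithmetic, reading the "Elements in each block" table one way (our interpretation): only the 8-bit multipliers and the 8-bit adders/subtractors of the Rbr, Y and Atmp rows are counted, while counters, 16-bit adders and shift registers are left out. `unit_counts` is a hypothetical helper. The fully parallel figures reproduce 4KN = 4096 and 12KN = 12288 exactly; the pipelined figures (128 and 256) are of the same order as the slide's rounded 100 and 300.

```python
def unit_counts(N, K):
    """Operator counts for the channel-estimation datapath (one reading of the table)."""
    return {
        "parallel": {"multipliers": 4 * K * N,
                     "adders": 4 * K * N + 4 * K * N + 4 * K * N,  # Rbr + Y + Atmp
                     "time": K},                                     # O(K)
        "pipelined": {"multipliers": 4 * K,
                      "adders": 2 * N + 4 * K + 2 * K,               # Rbr + Y + Atmp
                      "time": K * K},                                # O(K^2)
    }

counts = unit_counts(N=32, K=32)
```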
Conclusions
Iterative Scheme for Joint Estimation & Detection
No loss in performance relative to channel estimation by inversion
Suitable for Hardware Implementation
– On DSPs, FPGAs and ASICs
Supports Tracking for Fading Channels
Fixed Point Implementation Feasible
ASIC architecture
– To exploit available pipelining and parallelism
Multiuser Channel Estimation and Detection algorithms are POSSIBLE to IMPLEMENT for W-CDMA.
Future Work
MS
Extend Architecture to Long Codes
Task-partition the algorithm on the Sundance Multi-DSP/FPGA board to achieve real-time performance
Post-MS
Downlink
Architectures to Minimize Power Consumption / Area
Implement Coding/Decoding Blocks and integrate them
RENE’
EXTRA SLIDES
Data Rates Achieved
[Figure: data rates for different levels of pipelining and parallelism; data rate (0 to 3 × 10^5) vs. number of users (9 to 15); curves range from (Parallel A)(Parallel + Pipe B) down to Sequential A + B]
Data Rate Requirement = 128 Kbps
Assuming Channel Estimation is Real-Time
Fading Channel
SNR = 10 dB, Doppler = 10 Hz, 1000 Bits
[Figure: error rates of users for fading channel; error rate (0 to 0.2) vs. user index (0 to 15); curves ML, MF, MLact, MFact]