Algorithms and Architectures for Future Wireless Base-Stations
Sridhar Rajagopal and Joseph Cavallaro, ECE Department, Rice University, April 19, 2000
This work is supported by Texas Instruments, Nokia, Texas Advanced Technology Program and NSF
4/19/00 TI Meeting 2
Overview
Future Base-Stations
Current DSP Implementation
Our Approach
– Make Algorithms Computationally Effective
– Task Partitioning for pipelining, parallelism
Processor Design for Accelerating Wireless
Evolution of Wireless Comm
First Generation: Voice
Second/Current Generation: Voice + Low-rate Data (9.6 Kbps)
Third Generation: Voice + High-rate Data (2 Mbps) + Multimedia
W-CDMA
Communication System Uplink
[Diagram: Users 1 and 2 transmit to the Base Station over direct and reflected paths; the received signal also contains Noise + MAI (multiple-access interference).]
Main Processing Blocks
Channel Estimation → Detection → Decoding
Baseband Layer of Base-Station Receiver
Proposed Base-Station (No Multiuser Detection)
TI's Wireless Basestation (http://www.ti.com/sc/docs/psheets/diagrams/basestat.htm)
Real-Time Requirements
Multiple Data Rates by Varying Spreading Factors
Detection needs to be done in real-time
– 1953 cycles available in a C6x DSP at 250 MHz to detect 1 bit at 128 Kbps
Spreading Factor | Number of Bits / Frame | Data Rate Requirement
4   | 10240 | 1024 Kbps
32  | 1280  | 128 Kbps
256 | 160   | 16 Kbps
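The cycle budget on this slide follows directly from clock rate divided by data rate. A minimal sketch (ours, not from the slides) reproduces the 1953-cycle figure for a 250 MHz C6x at 128 Kbps:

```python
def cycles_per_bit(clock_hz, data_rate_bps):
    """DSP cycles available to process one bit at the given data rate."""
    return clock_hz // data_rate_bps

# 250 MHz C6x at 128 Kbps -> 1953 cycles per detected bit, as on the slide.
print(cycles_per_bit(250_000_000, 128_000))  # 1953
```

The same division gives the budget for the other rows of the table, e.g. 15625 cycles per bit at 16 Kbps.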
Current DSP Implementation
[Plot: data rates achieved (0 to 18 × 10^4) vs. number of users (9 to 15), comparing the Multiuser Detector and Matched Filter on the C67 (at 166 MHz) and on the C64 (projected, 8×), against the targeted data rate of 128 Kbps.]
Complexity
Algorithm Choice Limited by Complexity
– Multistage detection reduces the data rate by half
Main Features
– Matrix-based operations
– High levels of parallelism
– Bit-level computations
32×32 problem size for the Detector shown
Estimation, Decoding assumed pipelined
Reasons
Sophisticated, Compute-Intensive Algorithms
Need more MIPs/FLOPs performance
Unable to fully exploit pipelining or parallelism
Bit-level Computations / Storage
Our Approach
Make algorithms computationally effective
– without sacrificing error rate performance
Task Partitioning on Multiple Processing Elements
– DSPs : Core
– FPGAs : Application Specific / Bit-level Computations
Processor with reconfigurable support and extensions for wireless
Algorithms
Channel Estimation
– Avoid inversion by iterative scheme
Detection
– Avoid block-based detection by pipelining
Computations Involved
Model: r_i = A_i b_i + n_i, where r_i ∈ C^N is the received vector of spreading length N for the K users, and b_i ∈ {±1}^{2K} holds the bits of the K asynchronous users aligned at times i and i−1 (successive bits b_i, b_{i+1} overlap in time because of the users' delays).
Compute Correlation Matrices:
R_br = (1/L) Σ_{i=1}^{L} b_i r_i^H
R_bb = (1/L) Σ_{i=1}^{L} b_i b_i^T
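As a sketch of these correlation updates (pure Python, real-valued for brevity; the helper names are ours, not from the slides):

```python
def outer(u, v):
    """Outer product u v^T as a list-of-lists matrix."""
    return [[x * y for y in v] for x in u]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def correlations(bs, rs):
    """Accumulate R_bb (2K x 2K) and R_br (2K x N) over L training bits:
    R_bb = (1/L) sum b_i b_i^T, R_br = (1/L) sum b_i r_i^T."""
    L = len(bs)
    K2, N = len(bs[0]), len(rs[0])
    Rbb = [[0.0] * K2 for _ in range(K2)]
    Rbr = [[0.0] * N for _ in range(K2)]
    for b, r in zip(bs, rs):
        Rbb = mat_add(Rbb, outer(b, b))
        Rbr = mat_add(Rbr, outer(b, r))
    Rbb = [[x / L for x in row] for row in Rbb]
    Rbr = [[x / L for x in row] for row in Rbr]
    return Rbb, Rbr
```

For complex r_i the inner product would use the conjugate (r_i^H), as in the equations above.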
Multishot Detection
Stacking D bits gives r = A b, where the multishot matrix A ∈ C^{ND×KD} is block-banded, built from the partial matrices A_0 and A_1 (with A_i = A_0 + A_1).
Solve for the channel estimate A_i from the correlation matrices:
R_bb Â^H = R_br,  Â ∈ C^{N×2K}
Differencing Multistage Detection
Stage 0 – Matched Filter:
y_0 = Re[A^H r]
d_0 = sign(y_0)
Stage 1:
y_1 = y_0 − (A^H A − S) d_0
d_1 = sign(y_1)
Successive Stages (l ≥ 1):
x_l = d_l − d_{l−1}
y_{l+1} = y_l − (A^H A − S) x_l
d_{l+1} = sign(y_{l+1})
S = diag(A^H A)
y – soft decision
d – detected bits (hard decision)
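A pure-Python sketch of these recursions for real-valued A (our own minimal illustration, not the DSP implementation):

```python
def sign(v):
    return [1 if x >= 0 else -1 for x in v]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def multistage_detect(A, r, stages=3):
    """Differencing multistage detection: y0 = A^H r (matched filter),
    then subtract off-diagonal interference (A^H A - S) driven by the
    bit differences x_l = d_l - d_{l-1}."""
    N, K2 = len(A), len(A[0])
    Ah = [[A[i][j] for i in range(N)] for j in range(K2)]       # A^H (real A)
    AhA = [[sum(Ah[p][i] * A[i][q] for i in range(N)) for q in range(K2)]
           for p in range(K2)]
    F = [[AhA[p][q] if p != q else 0.0 for q in range(K2)]      # A^H A - S
         for p in range(K2)]
    y = matvec(Ah, r)                                           # stage 0
    d, d_prev = sign(y), None
    for _ in range(stages):
        x = d if d_prev is None else [a - b for a, b in zip(d, d_prev)]
        y = [yi - fi for yi, fi in zip(y, matvec(F, x))]
        d_prev, d = d, sign(y)
    return d
```

Once the bits stop changing, x_l is all zeros and later stages cost nothing but the (cheap) difference, which is the point of the differencing formulation.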
Iterative Scheme
Tracking: Method of Steepest Descent
Stable convergence behavior
Same Performance
Sliding-window correlation updates over the last L bits:
R_bb(L) = R_bb(L−1) − b_0 b_0^T + b_L b_L^T
R_br(L) = R_br(L−1) − b_0 r_0^H + b_L r_L^H
Gradient iteration (avoids inverting R_bb in R_bb Â^H = R_br):
Â^H_{k+1} = Â^H_k + μ (R_br − R_bb Â^H_k)
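A minimal sketch of the gradient iteration (our illustration; the step size mu and matrix sizes are arbitrary, and `A` here plays the role of Â^H solving R_bb Â^H = R_br):

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def steepest_descent(Rbb, Rbr, mu=0.1, iters=200):
    """Iterate A <- A + mu (R_br - R_bb A) toward the solution of
    R_bb A = R_br, without ever forming R_bb^{-1}."""
    rows, cols = len(Rbr), len(Rbr[0])
    A = [[0.0] * cols for _ in range(rows)]
    for _ in range(iters):
        RbbA = matmul(Rbb, A)
        A = [[A[i][j] + mu * (Rbr[i][j] - RbbA[i][j]) for j in range(cols)]
             for i in range(rows)]
    return A
```

Because R_bb changes only by a rank-two update per bit (the sliding-window formulas above), a few gradient steps per bit suffice to track the solution.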
Simulations – AWGN Channel
Detection Window = 12, SINR = 0, Paths = 3
Preamble L = 150, Spreading N = 31
Users K = 15, 10000 bits/user
MF – Matched Filter, O(K^2 N)
ML – Maximum Likelihood, O(K^3 + K^2 N)
ACT – using inversion
[Plot: comparison of Bit Error Rates (BER, 10^-3 to 10^-1) vs. Signal to Noise Ratio (SNR, 4 to 12) for MF, ActMF, ML, ActML.]
Fading Channel with Tracking
Doppler = 10 Hz, 1000 Bits, 15 users, 3 Paths
[Plot: BER (10^-3 to 10^0) vs. SNR (4 to 12) for MF and ML, static vs. tracking.]
Block Based Detector
[Diagram: each block of bits passes through the Matched Filter and Stages 1–3 as a unit; the next block (bits 12–21) cannot start until the current block (bits 2–11) has finished all stages, and adjacent blocks share their edge bits.]
Pipelined Detector
[Diagram: bits 1–12 stream continuously through the Matched Filter and Stages 1–3; each stage works on a different bit at the same time, so detection proceeds bit-by-bit instead of block-by-block.]
Task Decomposition [Asilomar99]
[Diagram: the receiver is partitioned into four pipelined blocks.
Block I – Channel Estimation (per bit): correlation updates R_br[I] O(KN), R_br[R] O(KN), R_bb O(K^2); a pilot/data MUX selects b.
Block II – Matrix products and inverse correlation matrices (per bit): A_0^H A_1, A_1^H A_1, A_0^H A_0, each O(K^2 N); solving R_bb A^H = R_br[I] and R_bb A^H = R_br[R], each O(K^2 N).
Block III – Matched Filter: A^H r, O(KND).
Block IV – Multistage Detection (per window): O(DK^2 M_e), producing the detected bits d.]
Achieved Data Rates
[Plot: data rates (0 to 3 × 10^5) vs. number of users (9 to 15) for different levels of pipelining and parallelism of tasks A and B: Sequential A + B; A B; (Parallel A) B; (Parallel A)(Pipe B); (Parallel A)(Parallel+Pipe B). Data Rate Requirement = 128 Kbps.]
VLSI Implementation
Channel Estimation as a Case Study
Area–Time Efficient Architecture
Real-Time Implementation
Bit-Level Computations – FPGAs
Core Operations – DSPs
Motivation for Architecture
Wireless, the next wave after Multimedia
Highly Compute-Intensive Algorithms
Real-Time Requirements
Outline
Processor Core with Reconfigurable Support
Permutation Based Interleaved Memory
Processor Architecture -EPIC
Instruction Set Extensions
Truncated Multipliers
Software Support Needed
Characteristics of Wireless Algorithms
Massive Parallelism
Bit-level Computations
Matrix-Based Operations
Memory Intensive
Complex-valued Data
Approximate Computations
What’s wrong with Current Architectures for these applications?
Problems with Current Architectures
UltraSPARC, C6x, MMX, IA-64
Not enough MIPs/FLOPs
Unable to fully exploit parallelism
Bit Level Computations
Memory Bottlenecks
Specialized Instructions for Wireless Communications
Why Reconfigurable
Adapt algorithms to environment
Seamless and continuous data processing during handoffs
Home Area Wireless LAN
High Speed Office Wireless LAN
Outdoor CDMA Cellular Network
Reconfigurable Support
[Diagram: OSI Layers 3–7 – user interface, translation, synchronization, transport network; OSI Layer 2 – data link layer (converts frames to bits); OSI Layer 1 – physical layer (hardware; raw bit stream).]
Different Protocols
[Diagram: transmit chain – source coding, channel coding; receive chain – channel estimation, multiuser detection, channel decoding, source decoding. Examples: MPEG-4, H.723 for voice/multimedia; convolutional and turbo codes for channel coding.]
A New Architecture
[Diagram: a processor core (GPP/DSP) with cache and prefetch queues (Q) connects through a crossbar to main memory and to reconfigurable logic; the reconfigurable logic handles the real-time I/O bit stream from the RF unit. The processor could also be packaged as an add-on PCMCIA card.]
Why Reconfigurable
Process initial bit level computations
Optimize for fast I/O transfer
[Diagram: the reconfigurable logic sits directly on the real-time I/O bit stream from the RF unit.]
Reconfigurable Support
Configuration Caches
2 × 64-bit data buses, 1 × 64-bit address bus
Control Blocks, Sequencer
Boolean values, 64-bit datapath, fast I/O
(GARP architecture at UC Berkeley)
Reconfigurable Support
Wide Path to Memory
– Data Transfer
– Minimize Load Times
Configuration Caches
– Recently displaced configurations (5 cycles)
– Can hold 4 full-size configurations
Independent Execution
Reconfigurable Support
Access to same Memory System as Processor
– Minimize overhead
When idle
– Load Configurations
– Transfer Data
Memory Interface
Access to Main Memory and L1 Data Cache
– Large, fast memory store
Memory Prefetch Queues for Sequential Accesses
– Read-aheads and write-behinds
[Diagram: the processor core (GPP/DSP), with L1 data cache and instruction cache, and the FPGA share prefetch queues (Q) and a crossbar to main memory.]
Permutation Based Interleaved Memory (PBI)
High Memory Bandwidth Needed
Stride-Insensitive Memory System for Matrices
Multiple Banks
Sustained Peak Throughput (95%)
[Diagram: the PBI system sits between the L1 data cache and main memory.]
Processor Core
64-bit EPIC Architecture with Extensions (IA-64/C6x)
Statically determined parallelism; exploit ILP
Execution Time Predictability
EPIC Principle
Explicitly Parallel Instruction Computing
Evolution of VLIW Computing
Compiler – key role
Architecture to assist the compiler
Better cope with the dynamic factors that limited VLIW parallelism
Instruction Set Extensions
To accelerate Bit level computations in Wireless
Real/Complex Integer–Bit Multiplications
– Used in Multiuser Detection, Decoding
Bit–Bit Multiplications
– Used in Outer Product Updates
– Correlation, Channel Estimation
Complex Integer–Integer Multiplications
Useful in other Signal Processing applications
– Speech, Video, …
Architecture Support
Support via Instruction Set Extensions
Minimal ALU Modifications necessary
Transparent to Register Files/Memory
Additional 8-bit Special Purpose Registers
Integer–Bit Multiplications
D = D + b·C, e.g. Cross-Correlation
[Diagram: 64-bit registers A and C feed add/subtract units into 64-bit register D; the 8-bit register b selects add or subtract per element.]
Register Renaming?
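Behaviourally, the proposed instruction reduces each ±1 multiply to a sign-controlled add. A sketch under our own encoding assumption (bit i of b set means +1, clear means −1; this is an illustration, not the actual ISA):

```python
def int_bit_mac(D, C, b):
    """D[i] += C[i] when bit i of b is 1 (+1); D[i] -= C[i] otherwise (-1).
    Models D = D + b*C using only adders, no multipliers."""
    return [d + (c if (b >> i) & 1 else -c)
            for i, (d, c) in enumerate(zip(D, C))]
```

One such instruction updates a whole row of the cross-correlation accumulator per cycle, which is why only minimal ALU modification (sign control on the adders) is needed.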
8-bit to 64-bit Conversions
D = D + b·b^T, e.g. Auto-Correlation
b1 = b(1:8), b(1:8), …, b(1:8) – the 8-bit pattern replicated eight times
b2 = b(1)b(1)…b(1), …, b(8)b(8)…b(8) – each bit replicated eight times
[Diagram: the 8-bit register b is expanded into a 64-bit register A.]
Bit–Bit Multiplications
D = D + b·b^T, e.g. Auto-Correlation
64-bit Register A = b1, 64-bit Register B = b2; an Ex-NOR gives b1·b2 in 64-bit Register C
Truth table:
b1  b2  b1·b2
0   0   1
0   1   0
1   0   0
1   1   1
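The table is just an XNOR: with ±1 bits stored as 1/0, one 64-bit Ex-NOR performs 64 bit-bit multiplications at once. A sketch of the encoding and the operation (function names are ours):

```python
def encode(bits):
    """Pack +/-1 values into the 0/1 register encoding (-1 -> 0)."""
    word = 0
    for i, b in enumerate(bits):
        if b == 1:
            word |= 1 << i
    return word

def bitbit_multiply(b1, b2, width=64):
    """Elementwise +/-1 product of two packed registers via Ex-NOR,
    matching the truth table above."""
    mask = (1 << width) - 1
    return ~(b1 ^ b2) & mask
```

For example, (+1, −1, +1) times (+1, +1, −1) is (+1, −1, −1), i.e. `bitbit_multiply(0b101, 0b011, 3)` gives `0b001`.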
Increment/Decrement
D = D + b·b^T, e.g. Auto-Correlation
[Diagram: each field of 64-bit register D is incremented or decremented by 1 according to the corresponding bit of the register holding b1·b2, producing D + b1·b2.]
Complex-valued Data Processing
Is it easy to add?
Is this worth additional ALU support?
Typically supported in software!
Truncated Multipliers
Many applications need approximate computations
Adaptive Algorithms: Y = Y + mu·(Y·C)
Truncate lower bits
Truncated Multipliers – half the area, half the delay
Can do 2 truncated multiplies in parallel with a regular multiplier
[Diagram: ALU multipliers – Multiplier 1, Multiplier 2, Truncated Multiplier.]
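A behavioural sketch of truncation (our illustration, not a gate-level model): keep only the partial products that reach the upper bits, accepting a small underestimate in exchange for roughly half the hardware.

```python
def truncated_mult(a, b, n=16):
    """Approximate the upper half (a*b) >> n of an n-bit multiply by
    summing only the partial-product bits that land at or above 2^n."""
    acc = 0
    for i in range(n):
        if (b >> i) & 1:
            acc += a >> (n - i)   # low bits of each partial product dropped
    return acc

# The result never exceeds the exact upper half, and the error stays small:
exact = (0xFFFF * 0xFFFF) >> 16
print(exact - truncated_mult(0xFFFF, 0xFFFF))  # 15
```

For adaptive updates like Y = Y + mu·(Y·C), this small bias is absorbed by the iteration, which is why approximate computation is acceptable here.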
Software Support
Greater Interaction between Compilers and Architectures
– EPIC
– Reconfigurable Logic
Compiler needs to find and exploit bit level computations
Reconfigurable Logic Programming
Other Uses
Reconfigurable Logic
– For accelerating loops of general purpose processors
Bit Level Support
– For other voice, video and multimedia applications
Software Suggestions
Limited OS Support
Compiler Efficiency – No more Assembly!
Performance Analysis Tools
Code Composer Studio 1.2
Conclusions
DSPs to play a major role in Future Base-Stations
Search for Computationally Efficient Algorithms and Better Processor Designs to meet Real-Time requirements
Reduced-Complexity Algorithms designed
Processor Core with Reconfigurable Support developed
Extra Slides
PBI Scheme
N – address length
M = 2^n banks
2^(N−n) words in each bank
To access a word:
– n-bit bank number
– (N−n)-bit address (high-order bits)
Calculation of the n-bit Bank Number
Calculate Bank Number
Use all N address bits to get the n-bit bank number: Y = A X, where A is an n × N matrix of 0s and 1s (arithmetic over GF(2))
Y = A_h X_h + A_l X_l, splitting the address into high (N−n) and low (n) parts, with A_l of rank n
Each output bit needs an N-bit parity circuit with log_k N levels of XOR gates (fan-in k)
[Diagram: the N-bit address feeds parity circuits for rows 0 … n−1 of A; the n parity bits drive a decoder that produces the 2^n bank-select signals.]
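A sketch of the bank-number computation (the matrix rows below are illustrative, not the actual A used in the memory system; note their low n bits form an identity, so A_l has rank n as required):

```python
def bank_number(address, A_rows):
    """Y = A X over GF(2): each row of A is a bit mask over the N-bit
    address; the corresponding output bit is the parity (XOR-reduce)
    of the masked address bits."""
    y = 0
    for bit, mask in enumerate(A_rows):
        y |= (bin(address & mask).count("1") & 1) << bit
    return y

# n = 2 bank bits over an N = 6 bit address; rows chosen for illustration.
rows = [0b010101, 0b101010]
print([bank_number(a, rows) for a in range(4)])  # [0, 1, 2, 3]
```

Because every address bit can influence the bank number, sequential accesses at many different strides still spread across the banks, which is the stride-insensitivity claimed for PBI.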
Interleaved Memory Model
[Diagram: an address source feeds memory banks M(0) … M(M−1) through input buffers; output buffers and a data sequencer deliver data to the data sink.]
Aspects of EPIC
Designing the Plan of Execution (POE) at Compile Time
Permitting the Compiler to play the statistics
– Conditional Branches, Memory references
Communicating the POE to the hardware
– Static Scheduling
– Branch information
Architecture Features in EPIC
Static Scheduling
– MultiOP
– Non-Unit Assumed Latency (NUAL)
The Branch Problem
– Predicated Execution
– Control Speculation
– Predicated Code Motion
The Memory Problem
– Cache Specifiers
– Data Speculation
Operation of Reconfigurable Logic
Load Configuration
– If in configuration cache, minimal time
Copy initial data with coprocessor move instructions
Start execution
Issue wait that interlocks while active
Copy registers back at kernel completion