DSP Lecture 01

Why use Digital Signal Processing processors?
What are the typical DSP algorithms?
Parameters to consider when choosing a DSP processor.
Programmable vs ASIC DSP.
Texas Instruments’ TMS320 family.
Chapter 1, Slide *
Why go digital?
Digital signal processing techniques are now so powerful that sometimes it is extremely difficult, if not impossible, for analogue signal processing to achieve similar performance.
Examples:
Adaptive filters.
Why go digital?
Analogue signal processing is achieved by using analogue components such as:
Resistors.
Capacitors.
Inductors.
The inherent tolerances associated with these components, temperature, voltage changes and mechanical vibrations can dramatically affect the effectiveness of the analogue circuitry.
Why go digital?
Change applications.
Correct applications.
Update applications.
Why NOT go digital?
High-frequency signals cannot be processed digitally for two reasons:
Analog-to-digital converters (ADCs) cannot work fast enough.
The application can be too complex to be performed in real-time.
Real-time processing
DSP processors have to perform tasks in real-time, so how do we define real-time?
The definition of real-time depends on the application.
Example: a 100-tap FIR filter is performed in real-time if the DSP can perform and complete the following operation between two samples:
y(n) = a(0)x(n) + a(1)x(n-1) + ... + a(99)x(n-99)
We can say that we have a real-time application if:
Waiting Time ≥ 0, where the waiting time is the sample period minus the processing time.
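The real-time condition can be sketched in C as follows. This is a minimal illustration, not part of the original slides: it assumes the condition is that the waiting time (sample period minus processing time) must not be negative, and the names `fir_output`, `waiting_time_us` and `is_real_time`, the coefficient array `a`, and the microsecond units are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 100  /* filter length used in the example */

/* One output sample of an FIR filter: the sum of a[k] * x(n - k).
 * `history` holds the most recent NUM_TAPS inputs, newest first,
 * so history[k] is x(n - k). */
double fir_output(const double a[NUM_TAPS], const double history[NUM_TAPS])
{
    double y = 0.0;
    for (size_t k = 0; k < NUM_TAPS; k++) {
        y += a[k] * history[k];
    }
    return y;
}

/* Waiting time = sample period - processing time, in microseconds. */
double waiting_time_us(double sample_rate_hz, double processing_time_us)
{
    assert(sample_rate_hz > 0.0);
    double sample_period_us = 1.0e6 / sample_rate_hz;
    return sample_period_us - processing_time_us;
}

/* Returns 1 if the processing fits between two samples, 0 otherwise. */
int is_real_time(double sample_rate_hz, double processing_time_us)
{
    return waiting_time_us(sample_rate_hz, processing_time_us) >= 0.0;
}
```

For instance, at an 8 kHz sample rate the sample period is 125 µs, so a filter that takes 100 µs per output leaves 25 µs of waiting time; the same filter at 48 kHz would not meet the deadline.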
Why do we need DSP processors?
Why not use a General Purpose Processor (GPP) such as a Pentium instead of a DSP processor?
What is the power consumption of a Pentium and a DSP processor?
What is the cost of a Pentium and a DSP processor?
Use a DSP processor when the following are required:
Cost saving.
Smaller size.
Use a GPP processor when the following are required:
Large memory.
What are the typical DSP algorithms?
The Sum of Products (SOP) is the key element in most DSP algorithms:
What does it take to do this fast … and easy?
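[Figure: ADC → DSP → DAC signal chain, with an analogue waveform (amplitude A against time t) at the input; labels x, Y and count.]
for (i = 0; i < count; i++){
sum += m[i] * n[i]; }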
Over the next 20 slides, we want to provide an example to anchor the presentation and provide context. What better algorithm than the standard sum of products? The question lead-in is: “So, what problem are we trying to solve?” The basics of DSP involve first sampling an analog signal and converting it to digital. What do we do then? We run some type of algorithm to shape or modify the signal, which is easily done in the digital realm. The time between samples is therefore our limit on how fast the algorithm must run. What does a typical algorithm look like? This: a simple sum of products. Let’s look at a typical DSP algorithm and see how the processor is designed to handle it.
Spend about 1 minute on this slide. If the group is VERY new to DSP, you might embellish slightly on any areas you feel comfortable with. But remember, the focus is not WHY DSP, it is “assuming you know why you’d want to use this algorithm, let’s see how the processor is built to handle it”.
The lead-in to the next slide is the question shown on the slide. Also state that we plan to write the code for this algorithm and see how the architecture is designed to handle it efficiently.
Fastest Execution of MACs
Ease of C Programming
Even using natural C, the ‘C6000 Architecture can perform 2 to 4 MACs per cycle
Compiler generates 80-100% efficient code
Multiply-Accumulate (MAC) in Natural C Code
for (i = 0; i < count; i++){
sum += m[i] * n[i]; }
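For reference, the loop above can be wrapped into a self-contained function. The slides only show the loop body, so the function name `sum_of_products`, the `float` element type and the parameter list are assumptions made for this sketch.

```c
#include <assert.h>

/* Sum of products of two arrays holding `count` elements each. */
float sum_of_products(const float *m, const float *n, int count)
{
    assert(count >= 0);
    float sum = 0.0f;
    for (int i = 0; i < count; i++) {
        sum += m[i] * n[i];  /* one multiply-accumulate (MAC) per iteration */
    }
    return sum;
}
```

Calling it with m = {1, 2, 3} and n = {4, 5, 6} accumulates 4 + 10 + 18.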
How does the ‘C6000 achieve such performance from C?
Sample Compiler Benchmarks
Great out-of-box experience
Code available at: www.ti.com/sc/c6000compiler
‘C6000 Compiler excels at Natural C
While dual-MAC speeds math intensive algorithms, flexibility of 8 independent functional units allows the compiler to quickly perform other types of processing
All ‘C6000 instructions are conditional allowing efficient hardware pipelining
Instruction set and CPU hardware orthogonality allow the compiler to achieve 80-100% efficiency
[Figure: register file, A0 to A31.]
{ int i; float sum = 0;
for (i = 0; i < count; i++) {
sum += m[i] * n[i]; } …
SINGLE-CYCLE LOOP KERNEL:
The ‘C6000 compiler generates code that performs at the rate of 2 MACs per cycle!
It does this by performing two taps (results) per cycle. That is, all 40 results in about 20 cycles.
The compiler generates these results from natural ANSI C code - no “tweaking” required.
Side Notes:
For simplicity, and since we were running out of room on the foil, the compiler output was abbreviated. The actual compiler results are slightly different for two reasons:
Actually it takes about 28 cycles to calculate 40 terms: 20 iterations (2 MACs per cycle) plus 8 cycles of setup. If we were doing 1000 taps, it would take 508 cycles.
Due to latency of some of the instructions, the code must be unrolled to achieve maximum performance. That is, the compiler actually generates a four-cycle loop which calculates 8 results. Again, the rate is still 2 MACs per cycle.
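The cycle counts quoted in these notes follow a simple pattern: the number of taps divided by the MAC rate, plus a fixed setup cost. A rough sketch, assuming the 2 MACs per cycle and the 8 setup cycles quoted above; the function name `estimate_cycles` is hypothetical, and real compiler output will differ with unrolling and instruction latency.

```c
#define MACS_PER_CYCLE 2u  /* rate quoted in the notes above */
#define SETUP_CYCLES   8u  /* setup cost quoted in the notes above */

/* Rough cycle estimate for a sum of products over `taps` terms. */
unsigned estimate_cycles(unsigned taps)
{
    /* Round up so that an odd number of taps still costs a full cycle. */
    unsigned kernel_cycles = (taps + MACS_PER_CYCLE - 1u) / MACS_PER_CYCLE;
    return kernel_cycles + SETUP_CYCLES;
}
```

With these assumptions, 20 kernel cycles plus setup gives the 28 cycles mentioned above, and 1000 taps gives 508.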
We’re not ignoring all that needs to be done... but if there is high interest, encourage attendance of 4-day workshop...
[Figure: CPU block diagram showing internal memory, external memory, the internal buses, and the eight functional units .D1, .M1, .L1, .S1, .D2, .M2, .L2, .S2.]
The point of this slide is to transition from the CPU description (now in the lower-right-hand block) to the internal buses diagram.
This slide should only take a couple seconds to present.
‘C6000 Internal Buses
The first bus is program.
If asked about the 256-bit bus: it lets us fetch 8 instructions simultaneously, so that an instruction can execute on each of the 8 functional units in parallel.
Two data buses - one for each register set (A & B).
Each ‘C62x data bus can load/store 32-bits/cycle.
The ‘C67x can load up to 64 bits per cycle, supporting single-cycle loads of double-float values or the ability to load 4 single-precision floats per cycle. (Stores are still 32-bit, but that’s OK since DSPs perform many more reads than writes.)
‘C64x performs 64-bit loads and stores.
Read and write buses for DMA: this allows the DMA to support single-cycle transfer rates (a DMA read and write in one cycle).
Note, on 6211, 6711, and 6712, EDMA is serviced on-chip by a 64-bit bus. The external bus, though, is 32-bits for the ‘11 devices and 16-bits for the ‘12.
Internal
Memory
The point of this slide is to transition to the peripherals description.
Essentially, the next few slides describe each peripheral. One slide per peripheral with a few bullets to highlight the key features.
Don’t get into too much detail on any one peripheral - unless the question is simple/quick to answer.
The McBSP and EDMA are covered in more detail later in this workshop. The others cannot be examined further due to limited time. The 4-day workshop spends more time examining other peripherals.
[Figure: CPU with a 4K level-1 program cache and a 4K level-1 data cache.]
The CPU can access two dedicated level-1 caches: a 4K direct-mapped cache for program code and a 4K 2-way data cache. These level-1 caches provide single-cycle access to the CPU.
The level-2 memory is larger and a bit slower. It’s accessed whenever there is a level-1 cache miss. Even though it’s a little slower than the level-1 memory, it’s still faster than going off-chip. If the term “level-2 cache” sounds familiar, it’s because many personal computers now employ this same type of mechanism.
The level-1 vs. level-2 access is all automatic. YOU, the programmer, don’t have to worry about a thing. Just write your code as you normally would and the hardware figures out the quickest way to get the CPU your code and data.
What if the code/data isn’t in either the level-1 or level-2 memory? Then ...
‘C6711 Cache Logic
‘C6711 Cache Details
Level 1 Program
16 instr. in 5 cycles
Line Size: 1024 bits
Note, the data paths are larger than expected. In fact, when there is a transfer from Level-2 to either program or data Level-1, two transfers actually take place. That is, two fetch packets, or 32 bytes of data are transferred to the Level-1 caches. This “look ahead” or “burst” feature was designed to minimize Level-1 cache misses.
L1P: 4 Kbytes = 1K instructions = 128 fetch packets (FP)
Line size is 512 bits = 16 instructions = 2 FP
L2: Line size is 1024 bits = 4 FP (2x L1P line size)
= 128 bytes (4x L1D line size)
Internal EDMA bus is 64 bits wide, though 6211/6711 devices only have 32-bit external bus. (6712 has 16-bit external bus.)
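To make the L1P figures above concrete, the sketch below shows how a direct-mapped cache of that size could split an address into a tag, a line index and a byte offset. The 32-bit byte address, the struct layout and the function names `l1p_split` and `l1p_conflict` are assumptions for illustration; the slides do not describe the actual ‘C6711 indexing logic.

```c
#include <stdint.h>

#define L1P_SIZE_BYTES 4096u  /* 4 Kbytes */
#define L1P_LINE_BYTES 64u    /* 512 bits = 16 instructions */
#define L1P_NUM_LINES  (L1P_SIZE_BYTES / L1P_LINE_BYTES)  /* 64 lines */

typedef struct {
    uint32_t tag;     /* identifies which memory block occupies the line */
    uint32_t index;   /* which cache line the address maps to */
    uint32_t offset;  /* byte position inside the line */
} cache_addr_t;

/* Split a byte address into the fields of a direct-mapped cache lookup. */
cache_addr_t l1p_split(uint32_t addr)
{
    cache_addr_t parts;
    parts.offset = addr % L1P_LINE_BYTES;
    parts.index  = (addr / L1P_LINE_BYTES) % L1P_NUM_LINES;
    parts.tag    = addr / L1P_SIZE_BYTES;
    return parts;
}

/* Two addresses evict each other in a direct-mapped cache when they
 * share a line index but carry different tags. Returns 1 in that case. */
int l1p_conflict(uint32_t a, uint32_t b)
{
    cache_addr_t pa = l1p_split(a);
    cache_addr_t pb = l1p_split(b);
    return pa.index == pb.index && pa.tag != pb.tag;
}
```

Under these assumptions, two code addresses exactly 4 Kbytes apart map to the same line and compete for it, which is the usual cost of a direct-mapped design.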
DSP processors are optimised to perform multiplication and addition operations.
Multiplication and addition are done in hardware and in one cycle.
Example: 4-bit multiply (unsigned).
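Without a hardware multiplier, a multiplication has to be built from shifts and additions, one step per bit of the multiplier. The sketch below shows that shift-and-add procedure for the unsigned 4-bit case. It illustrates the general technique only, not how the ‘C6000 multiplier is implemented, and the function name `mul4_shift_add` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply two unsigned 4-bit values (0..15) using shifts and adds.
 * The result needs at most 8 bits (15 * 15 = 225). */
uint8_t mul4_shift_add(uint8_t a, uint8_t b)
{
    assert(a < 16 && b < 16);
    uint8_t product = 0;
    for (int bit = 0; bit < 4; bit++) {
        if (b & (1u << bit)) {
            /* Add the multiplicand shifted to this bit's weight. */
            product = (uint8_t)(product + (a << bit));
        }
    }
    return product;
}
```

The loop takes four conditional add steps, whereas a hardware multiplier delivers the same product in a single cycle.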
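[Table: parameters to consider when choosing a DSP processor. Recoverable parameter names: internal L2 cache, DMA channels, multiprocessor support, supply voltage, power management. Recoverable values (row and column assignment lost): 32, 32K, 32K, 512K, 32-bit, 64-bit, 40-bit, 1200 MFLOPS.]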
Applications which require:
Higher power consumption.
Can be slower than fixed-point counterparts and larger in size.
Floating vs. Fixed point processors
It is the application that dictates which device and platform to use in order to achieve optimum performance at a low cost.
For educational purposes, use the floating-point device (C6711) as it can support both fixed and floating point operations.
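The practical difference shows up in how a fractional multiply is written. On a fixed-point device, fractions are commonly stored as scaled integers and the programmer manages the scaling, while a floating-point device handles it in hardware. A minimal sketch, assuming the Q15 format (1 sign bit, 15 fractional bits); the type and function names are hypothetical, and it assumes the compiler implements right shift of negative integers as an arithmetic shift, which is common but implementation-defined in C.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* value = integer / 32768, range [-1, 1) */

/* Fixed-point multiply: the programmer handles scaling and overflow. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* Q30 intermediate */
    product += 1 << 14;                         /* round to nearest */
    product >>= 15;                             /* back to Q15 (arithmetic shift assumed) */
    if (product > INT16_MAX) {
        product = INT16_MAX;                    /* saturate the -1 * -1 case */
    }
    return (q15_t)product;
}

/* Floating-point multiply: scaling is handled by the number format. */
float float_mul(float a, float b)
{
    return a * b;
}
```

Here 0.5 is stored as 16384 and 0.25 as 8192, so `q15_mul(16384, 16384)` plays the role of `float_mul(0.5f, 0.5f)`.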
Application Specific Integrated Circuits (ASICs) are semiconductors designed for dedicated functions.
The advantages and disadvantages of using ASICs are listed below:
Advantages
Lowest Cost
Control Systems
Motor Control
Comm Infrastructure
Wireless Base-stations
Texas Instruments’ TMS320 family
TMS320C64x: The C64x fixed-point DSPs offer the industry's highest level of performance to address the demands of the digital age. At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS with costs as low as $19.95. In addition to a high clock rate, C64x DSPs can do more work each cycle with built-in extensions. These extensions include new instructions to accelerate performance in key application areas such as digital communications infrastructure and video and image processing.
TMS320C62x: These first-generation fixed-point DSPs represent breakthrough technology that enables new equipment and energizes existing implementations for multi-channel, multi-function applications, such as wireless base stations, remote access servers (RAS), digital subscriber loop (xDSL) systems, personalized home security systems, advanced imaging/biometrics, industrial scanners, precision instrumentation and multi-channel telephony systems.
TMS320C67x: For designers of high-precision applications, C67x floating-point DSPs offer the speed, precision, power savings and dynamic range to meet a wide variety of design needs. These dynamic DSPs are the ideal solution for demanding applications like audio, medical imaging, instrumentation and automotive.
TI Floating-Point Innovation
First commercially successful floating-point DSP: ‘C30 (1987)
First floating-point DSP with multiprocessing support: ‘C40 (1991)
First $10 floating-point DSP: ‘C32 (1995)
First 1-GFLOPS DSP: ‘C6701 (1998)
First $5 floating-point DSP: ‘C33 (1999)
First 2-level cache floating-point DSP: ‘C6711 (1999)
First to offer 600 MFLOPS for under $10: ‘C6712 (2000)