Why use Digital Signal Processing processors?
What are the typical DSP algorithms?
Parameters to consider when choosing a DSP processor.
Programmable vs ASIC DSP.
Texas Instruments’ TMS320 family.
Chapter 1, Slide *
Why go digital?
Digital signal processing techniques are now so powerful that
sometimes it is extremely difficult, if not impossible, for
analogue signal processing to achieve similar performance.
Examples:
Adaptive filters.
Why go digital?
Analogue signal processing is achieved by using analogue components
such as:
Resistors.
Capacitors.
Inductors.
The inherent tolerances associated with these components,
temperature, voltage changes and mechanical vibrations can
dramatically affect the effectiveness of the analogue
circuitry.
Why go digital?
Change applications.
Correct applications.
Update applications.
Why NOT go digital?
Very high-frequency signals cannot be processed digitally, for two
reasons:
Analogue-to-digital converters (ADCs) cannot sample fast enough.
The application can be too complex to be performed in
real time.
Real-time processing
DSP processors have to perform tasks in real-time, so how do we
define real-time?
The definition of real-time depends on the application.
Example: a 100-tap FIR filter is performed in real-time if the DSP
can perform and complete the following operation between two
samples:
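The operation meant here is the FIR convolution sum, which in its usual form is y(n) = a(0)x(n) + a(1)x(n-1) + ... + a(99)x(n-99). The sketch below assumes that standard form and shows everything that has to finish between two samples: shifting the new input into the delay line, then one multiply-accumulate per tap. The names `fir_push_and_filter`, `coeff` and `delay` are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 100  /* filter length from the example above */

/*
 * Process one new input sample and return one output sample.
 * delay[k] holds x(n - k), so delay[0] is the newest input.
 * Assumed standard FIR form: y(n) = sum over k of coeff[k] * x(n - k).
 */
float fir_push_and_filter(const float *coeff, float *delay,
                          size_t taps, float new_sample)
{
    float y = 0.0f;
    size_t k;

    assert(coeff != NULL && delay != NULL && taps > 0);

    /* Age the delay line by one sample, oldest value falls off the end. */
    for (k = taps - 1; k > 0; k--) {
        delay[k] = delay[k - 1];
    }
    delay[0] = new_sample;

    /* One multiply-accumulate per tap. */
    for (k = 0; k < taps; k++) {
        y += coeff[k] * delay[k];
    }
    return y;
}
```

Calling this once per ADC sample with `taps` set to `NUM_TAPS` gives the per-sample workload that the real-time condition is measured against.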
We can say that we have a real-time application if:
Waiting Time ≥ 0
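As a rough sketch of this condition, assume the waiting time is the sample period minus the processing time, so the application is real-time when that difference is not negative. The function names and the example clock rate and cycle counts are hypothetical and do not come from the slides.

```c
#include <assert.h>

/*
 * Waiting time per sample, in seconds (assumed definition):
 *   waiting = sample period - processing time
 * where processing time = cycles needed per sample / CPU clock rate.
 */
double waiting_time(double sample_rate_hz, double cycles_per_sample,
                    double cpu_hz)
{
    double sample_period;
    double processing_time;

    assert(sample_rate_hz > 0.0 && cpu_hz > 0.0);

    sample_period = 1.0 / sample_rate_hz;
    processing_time = cycles_per_sample / cpu_hz;
    return sample_period - processing_time;
}

/* Returns 1 when the work fits between two samples, 0 otherwise. */
int is_real_time(double sample_rate_hz, double cycles_per_sample,
                 double cpu_hz)
{
    return waiting_time(sample_rate_hz, cycles_per_sample, cpu_hz) >= 0.0;
}
```

For instance, 58 cycles per sample at a 48 kHz sample rate on a 150 MHz clock leaves plenty of waiting time, while one million cycles per sample does not fit.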
Why do we need DSP processors?
Why not use a General Purpose Processor (GPP) such as a Pentium
instead of a DSP processor?
What is the power consumption of a Pentium and a DSP
processor?
What is the cost of a Pentium and a DSP processor?
Use a DSP processor when the following are required:
Cost saving.
Smaller size.
Use a GPP processor when the following are required:
Large memory.
What are the typical DSP algorithms?
The Sum of Products (SOP) is the key element in most DSP
algorithms:
What does it take to do this fast … and easy?
[Figure: block diagram of the signal path x → ADC → DSP → DAC → Y.]
for (i = 0; i < count; i++) {
    sum += m[i] * n[i];
}
Over the next 20 slides, we want to provide an example to anchor
the presentation and provide context. What better algorithm than
the standard sum-of-products? The question lead-in is “So, what
problem are we trying to solve?” The basics of DSP involve first
sampling an analog signal and converting it to digital. What do we
do then? We run some type of algorithm to shape or modify the
signal, which is easily done in the digital realm. So the time
between samples is our limit on how fast we need to run the
algorithm. What does a typical algorithm look like? This: a simple
sum-of-products. Let’s look at a typical DSP algorithm and see how
the processor is designed to handle it.
Spend about 1 minute on this slide. If the group is VERY new to
DSP, you might embellish slightly on any areas you feel comfortable
with. But remember, the focus is not WHY DSP, it is “assuming you
know why you’d want to use this algorithm, let’s see how the
processor is built to handle it”.
The lead-in to the next slide is the question shown on the slide.
Also state that we plan to write the code for this algorithm and see
how the architecture is designed to handle it efficiently.
OLD INFO
Fastest Execution of MACs
Ease of C Programming
Even using natural C, the ‘C6000 Architecture can perform 2 to 4
MACs per cycle
Compiler generates 80-100% efficient code
Multiply-Accumulate (MAC) in Natural C Code
for (i = 0; i < count; i++){
sum += m[i] * n[i]; }
How does the ‘C6000 achieve such performance from C?
Sample Compiler Benchmarks
Great out-of-box experience
Code available at: www.ti.com/sc/c6000compiler
How does the ‘C6000 achieve such performance from C?
HIDDEN SLIDE
To view this slide while presenting (in case of customer questions
on C efficiency), click the button in the far upper-right
corner.
‘C6000 Compiler excels at Natural C
While dual-MAC speeds math-intensive algorithms, the flexibility of 8
independent functional units allows the compiler to quickly perform
other types of processing
All ‘C6000 instructions are conditional, allowing efficient hardware
pipelining
Instruction set and CPU hardware orthogonality allow the compiler
to achieve 80-100% efficiency
[Figure: register file, A0 to A31.]
;** --------------------------------------------------*
{ int i; float sum = 0;
for (i = 0; i < count; i++) {
    sum += m[i] * n[i]; } …
SINGLE-CYCLE LOOP KERNEL:
The ‘C6000 compiler generates code that performs at the rate of 2
MACs per cycle!
It does this by performing two taps (results) per cycle. That is,
all 40 results in about 20 cycles.
The compiler generates these results from natural ANSI C code - no
“tweaking” required.
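To picture what two results per cycle means, the sketch below writes the same sum of products with two independent accumulators, so the two multiply-accumulates in each loop pass do not depend on each other. This is hand-written C for illustration only: it is not the compiler's output, and the function name is hypothetical. The point of these notes is that the compiler finds this parallelism itself from the plain loop.

```c
#include <assert.h>

/*
 * Sum of products using two independent accumulators.
 * Each loop pass performs two multiply-accumulates that could, in
 * principle, be issued side by side on separate multiplier units.
 */
float sum_of_products_x2(const float *m, const float *n, int count)
{
    float sum0 = 0.0f;
    float sum1 = 0.0f;
    int i;

    assert(count >= 0);

    for (i = 0; i + 1 < count; i += 2) {
        sum0 += m[i] * n[i];           /* even-indexed terms */
        sum1 += m[i + 1] * n[i + 1];   /* odd-indexed terms */
    }
    if (i < count) {
        sum0 += m[i] * n[i];           /* leftover term when count is odd */
    }
    return sum0 + sum1;
}
```

Splitting the accumulator changes the order of the floating-point additions, so results can differ from the single-accumulator loop in the last bits.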
Side Notes:
For simplicity, and since we were running out of room on the foil,
the compiler output was abbreviated. The actual compiler results
are slightly different for two reasons:
It actually takes something like 28 cycles to calculate 40 terms:
20 loop cycles (2 results per cycle) plus 8 cycles of setup. If we
were doing 1000 taps, it would take 508 cycles.
Due to the latency of some of the instructions, the code must be
unrolled to achieve maximum performance. That is, the compiler
actually generates a four-cycle loop which calculates 8 results.
Again, the rate is still 2 MACs per cycle.
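Taking the figures quoted in these notes (2 MACs per cycle and about 8 cycles of setup), the cycle count for an even number of taps can be estimated as taps / 2 + 8. The sketch below only encodes that arithmetic; the function name is hypothetical and the result is an estimate, not a measured benchmark.

```c
#include <assert.h>

#define MACS_PER_CYCLE 2u  /* rate quoted in the notes */
#define SETUP_CYCLES   8u  /* setup overhead quoted in the notes */

/* Estimated cycles for a sum of products over an even number of taps. */
unsigned estimate_cycles(unsigned taps)
{
    assert(taps % MACS_PER_CYCLE == 0u);
    return taps / MACS_PER_CYCLE + SETUP_CYCLES;
}
```

With these constants, 40 taps come to 28 cycles and 1000 taps come to 508 cycles, matching the numbers in the notes.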
We’re not ignoring all that needs to be done... but if there is
high interest, encourage attendance of 4-day workshop...
[Figure: CPU with functional units .D1, .M1, .L1, .S1, .D2, .M2, .L2 and .S2, connected to internal and external memory.]
Internal Buses
The point of this slide is to transition from the CPU description
(now in the lower-right-hand block) to the internal buses
diagram.
This slide should only take a couple seconds to present.
‘C6000 Internal Buses
The first bus is program.
If asked about 256-bit bus, this allows us to fetch 8 instructions
simultaneously, which allows us to execute an instruction on each
of our 8 functional units in parallel.
Two data buses - one for each register set (A & B).
Each ‘C62x data bus can load/store 32-bits/cycle.
The ‘C67x can load up to 64 bits per cycle, supporting single-cycle
loads of double-float values or the ability to load 4
single-precision floats per cycle. (Stores are still 32-bit, but
that’s OK since DSPs perform many more reads than writes.)
The ‘C64x performs 64-bit loads and stores.
Read and write buses for DMA: this allows the DMA to support
single-cycle transfer rates (a DMA read and write in one
cycle).
Note, on 6211, 6711, and 6712, EDMA is serviced on-chip by a 64-bit
bus. The external bus, though, is 32-bits for the ‘11 devices and
16-bits for the ‘12.
The point of this slide is to transition to the peripherals
description.
Essentially, the next few slides describe each peripheral. One
slide per peripheral with a few bullets to highlight the key
features.
Don’t get into too much detail on any one peripheral - unless the
question is simple/quick to answer.
The McBSP and EDMA are covered in more detail later in this
workshop. The others cannot be examined further due to limited
time. The 4-day workshop spends more time examining other
peripherals.
[Figure: CPU with a 4K program cache and a 4K data cache.]
The CPU can access two dedicated level-1 caches: a 4K direct-mapped
cache for program code and a 4K 2-way data cache. These level-1 caches
provide single-cycle access to the CPU.
The level-2 memory is larger and a bit slower. It’s accessed
whenever there is a level-1 cache miss. Even though it’s a little
slower than the level-1 memory, it’s still faster than going
off-chip. If the term “level-2 cache” sounds familiar, it’s because
many personal computers now employ this same type of
mechanism.
The level-1 vs. level-2 access is all automatic. YOU, the
programmer, don’t have to worry about a thing. Just write your code
as you normally would and the hardware figures out the quickest
way to get the CPU your code and data.
What if the code/data isn’t in either the level-1 or level-2
memory? Then ...
‘C6711 Cache Logic
HIDDEN FOIL
This foil is here so that it could be linked into the student
notes. If you find this diagram useful, you can either ‘un-hide’ it
or click the top arrow on the preceding foil.
‘C6711 Cache Details
Level 1 Program
16 instr. in 5 cycles
Line Size: 1024 bits
HIDDEN FOIL
This foil was included to add the width of the data paths on the
diagram two foils ago. If you want to use this diagram, you can
either ‘un-hide’ it or, click on the bottom arrow in the upper
right corner of the foil two preceding this one.
Note, the data paths are larger than expected. In fact, when there
is a transfer from Level-2 to either program or data Level-1, two
transfers actually take place. That is, two fetch packets, or 32
bytes of data are transferred to the Level-1 caches. This “look
ahead” or “burst” feature was designed to minimize Level-1 cache
misses.
L1P: 4 Kbytes = 1K instructions = 128 fetch packets (FP)
Line size is 512 bits = 16 instructions = 2 FP
L2: Line size is 1024 bits = 4 FP (2x L1P line size)
= 128 bytes (4x L1D line size)
Internal EDMA bus is 64 bits wide, though 6211/6711 devices only
have 32-bit external bus. (6712 has 16-bit external bus.)
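The cache sizes above can be cross-checked with a few lines of arithmetic. The sketch assumes 32-bit instructions and 8 instructions per fetch packet, which follows from the 256-bit program bus fetching 8 instructions at once; the constant and function names are hypothetical.

```c
#include <assert.h>

enum {
    INSTR_BYTES        = 4,     /* one 32-bit instruction */
    FETCH_PACKET_INSTR = 8,     /* 8 instructions = 256 bits */
    L1P_BYTES          = 4096,  /* 4-Kbyte level-1 program cache */
    L1P_LINE_BITS      = 512,
    L2_LINE_BITS       = 1024
};

/* Number of instructions the level-1 program cache can hold. */
unsigned l1p_instructions(void)
{
    return L1P_BYTES / INSTR_BYTES;
}

/* Number of fetch packets the level-1 program cache can hold. */
unsigned l1p_fetch_packets(void)
{
    return l1p_instructions() / FETCH_PACKET_INSTR;
}

/* Instructions contained in one cache line of the given width. */
unsigned line_instructions(unsigned line_bits)
{
    assert(line_bits % 32u == 0u);
    return line_bits / 8u / INSTR_BYTES;
}

/* Fetch packets contained in one cache line of the given width. */
unsigned line_fetch_packets(unsigned line_bits)
{
    return line_instructions(line_bits) / FETCH_PACKET_INSTR;
}
```

These reproduce the figures listed: 1K instructions and 128 fetch packets in L1P, 16 instructions (2 fetch packets) per L1P line, and 4 fetch packets (128 bytes) per L2 line.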
DSP processors are optimised to perform multiplication and addition
operations.
Multiplication and addition are done in hardware and in one
cycle.
Example: 4-bit multiply (unsigned).
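As an illustration of what a hardware multiplier replaces, the sketch below multiplies two 4-bit unsigned values by shift-and-add, adding one shifted partial product for each set bit of the multiplier. A processor without a hardware multiplier needs a loop of this kind, taking several steps, whereas a DSP produces the same product in one cycle. The function name and test operands are hypothetical.

```c
#include <assert.h>

/*
 * Multiply two 4-bit unsigned values (0..15) by shift-and-add.
 * The result needs at most 8 bits (15 * 15 = 225).
 */
unsigned mul4_shift_add(unsigned a, unsigned b)
{
    unsigned product = 0u;
    int bit;

    assert(a <= 15u && b <= 15u);

    for (bit = 0; bit < 4; bit++) {
        if (b & (1u << bit)) {
            product += a << bit;   /* add the partial product for this bit */
        }
    }
    return product;
}
```

The loop always runs four passes, one per multiplier bit, which is the work a single-cycle hardware multiply does at once.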
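[Table: parameters to consider when choosing a DSP processor. Surviving row labels: internal L2 cache, DMA channels, multiprocessor support, supply voltage, power management. Surviving values, no longer attributable to a row or column: 32, 32K, 32K, 512K, 32-bit, 64-bit, 40-bit, 1200 MFLOPS.]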
Applications which require:
Higher power consumption.
Can be slower than fixed-point counterparts and larger in
size.
Floating vs. Fixed point processors
It is the application that dictates which device and platform to
use in order to achieve optimum performance at a low cost.
For educational purposes, use the floating-point device (C6711) as
it can support both fixed and floating point operations.
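To make the fixed-point versus floating-point distinction concrete, the sketch below multiplies fractions in Q15 fixed point (16-bit integers with 15 fractional bits) and converts to and from float. Q15 is a common fixed-point convention assumed here for illustration; it is not specified in these slides, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define Q15_ONE 32768  /* 2^15, the scale factor for Q15 */

/* Convert a float in the range -1.0 <= x < 1.0 to Q15 (truncating). */
int16_t float_to_q15(float x)
{
    assert(x >= -1.0f && x < 1.0f);
    return (int16_t)(x * Q15_ONE);
}

/* Convert a Q15 value back to float. */
float q15_to_float(int16_t q)
{
    return (float)q / Q15_ONE;
}

/*
 * Multiply two Q15 values. The 32-bit product is in Q30 and is scaled
 * back to Q15 by dividing by 2^15 (truncating toward zero).
 * The single case -1.0 * -1.0 would overflow and is rejected.
 */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide;

    assert(!(a == INT16_MIN && b == INT16_MIN));

    wide = (int32_t)a * (int32_t)b;
    return (int16_t)(wide / Q15_ONE);
}
```

On a fixed-point device the programmer has to track this scaling by hand, while a floating-point device handles the exponent in hardware, which is part of why floating point is easier to start with.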
Application Specific Integrated Circuits (ASICs) are semiconductors
designed for dedicated functions.
The advantages and disadvantages of using ASICs are listed
below:
Advantages
Lowest Cost
Control Systems
Motor Control
Comm Infrastructure
Wireless Base-stations
Texas Instruments’ TMS320 family
TMS320C64x: The C64x fixed-point DSPs offer the industry's highest
level of performance to address the demands of the digital age. At
clock rates of up to 1 GHz, C64x DSPs can process information at
rates up to 8000 MIPS with costs as low as $19.95. In addition to a
high clock rate, C64x DSPs can do more work each cycle with
built-in extensions. These extensions include new instructions to
accelerate performance in key application areas such as digital
communications infrastructure and video and image processing.
TMS320C62x: These first-generation fixed-point DSPs represent
breakthrough technology that enables new equipment and energizes
existing implementations for multi-channel, multi-function
applications, such as wireless base stations, remote access servers
(RAS), digital subscriber loop (xDSL) systems, personalized home
security systems, advanced imaging/biometrics, industrial scanners,
precision instrumentation and multi-channel telephony
systems.
TMS320C67x: For designers of high-precision applications,
C67x floating-point DSPs offer the speed, precision, power savings
and dynamic range to meet a wide variety of design needs. These
dynamic DSPs are the ideal solution for demanding applications like
audio, medical imaging, instrumentation and automotive.
TI Floating-Point Innovation
First commercially successful floating-point DSP: ‘C30 (1987)
First floating-point DSP with multiprocessing support: ‘C40 (1991)
First $10 floating-point DSP: ‘C32 (1995)
First 1-GFLOPS DSP: ‘C6701 (1998)
First $5 floating-point DSP: ‘C33 (1999)
First 2-level cache floating-point DSP: ‘C6711 (1999)
First to offer 600 MFLOPS for under $10: ‘C6712 (2000)