Avnet SpeedWay Workshops
1
Accelerating Your Success™
V10_1_1_2
Avnet SpeedWay Design Workshop™
Creating FPGA-based Co-Processors for DSPs Using Model Based Design Techniques
Lecture 3: Xilinx FPGA Meets TI DSP
Model-Based Design Flow
• Develop Executable Spec in Simulink
• Partition Between DSP and FPGA Co-Processor
• Design Exploration for Targeting Hardware
• Verify Hardware in HW Co-simulation
• Implement Stand-Alone Video System
This is where we are in the model-based design flow.
The Problem We Wish to Solve
Partitioning an algorithm efficiently between DSP and FPGA can be a daunting task unless one has knowledge of each device’s architecture, capabilities and design tools.
We contrast the architectures of DSP and FPGA, offering guidelines to allocate different portions of the algorithm to each.
To get maximum efficiency from an FPGA co-processor, designers need:
- guidelines to identify the high computation-load sections of video and image processing algorithms that are suitable for off-loading to co-processors
- a flexible design flow to explore partitioning between software and hardware, from verification to implementation
Agenda
• Xilinx FPGA Meets TI DSP
• Xilinx FPGA design for temporal template matching
• Hardware / software algorithm partitioning between TI DSP and Xilinx FPGA
Xilinx FPGA Meets TI DSP:
• We start with a quick review of the TI DM6437 DaVinci Digital Media Processor, followed by a basic intro to Xilinx FPGA architecture, in order to understand how the two can best work together.
Xilinx FPGA design for temporal template matching:
•Next we focus on techniques to best solve the problem at hand, temporal template matching for video using the FPGA.
We conclude with some guidelines for hardware / software algorithm partitioning between TI DSP and Xilinx FPGA
TI DSP Meet Xilinx FPGA
• Programmable DSPs - the classic answer to real-time signal processing
• FPGAs - increasingly used in real-time signal processing
• FPGAs complement DSPs for:
  – System logic multiplexing
  – New peripheral or bus interface implementation
  – Performance acceleration in the signal processing chain
TI is the world’s leading manufacturer of DSP processors. Xilinx is the world’s leading manufacturer of FPGAs.
Let’s begin by examining the fundamental nature of DSPs and FPGAs. Then we will proceed to combine the two in a system.
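DSP Processors Meet FPGAs
[Figure: DSP processor block diagram showing the DSP core, program / data storage, peripherals, connectivity, and the video processing subsystem. The DSP core is annotated with the sum of products it computes:]
$$\sum_{k=1}^{N} x[n-k] \times h[k]$$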
DSP processors and FPGAs (Field Programmable Gate Arrays) are fundamentally different yet complementary devices.
At the heart of the TI DSP processor is the core.
<mouse click>
It is surrounded by memory for program and data storage, a variety of peripherals (timers, PWM, etc.), connectivity interfaces (Ethernet, USB, SPI, etc.) and, in DaVinci for example, specialized subsystems for video in/out.
DSP characteristics:
• Sequential instruction execution
• Software programmable in a high-level language, e.g. C
• Ideal for complex algorithms
• Wide variety of fixed-function peripherals on-chip
• Rich ecosystem of authorized third-party software providers across a vast range of applications: video, VoIP, surveillance, communications, consumer, etc.
At the heart of the Xilinx FPGA is the programmable hardware fabric, which provides the fundamental structures to build custom logic.
Other structural elements include block RAM, routing matrix, and clock management, including both a PLL and DLL. The IO block contains the structure for interfacing to external devices. There are many selectable IO standards the IO block can be configured to use. Some of those standards include LVCMOS, SSTL, HSTL, LVDS and many others not usually offered in DSP processors.
FPGA characteristics:
• Executes parallel computations in hardware
• Ideal for fast, high-performance custom functions
• Rich variety of resources on-chip
• In-system programmable
• Designed in a hardware description language
• Rich ecosystem of third-party IP providers across a vast range of applications: video, VoIP, surveillance, communications, consumer, etc.
We propose to unite TI DSP processors and Xilinx FPGAs into system solutions where each device performs what it does best.
TI Embedded Processors
[Figure: TI embedded processor portfolio, plotted by performance and power. Families shown: Microcontrollers (ultra-low power, general, and real-time control); ARM (code compatibility, ISA); Low Power DSPs (low power, low cost signal processing); High Performance DSPs (high MHz / multi-core signal processing); Digital Media Processors (video performance, ARM ease of use); Applications Processors (low power, high performance GUI/browser apps). Devices shown: MSP430, C2000, C5000, C6000, C64x, C67x/C64x, multi-core, ARM9, Cortex A8, DM3xx, DM644x, DM646x, OMAP3xx, OMAP-L1X, plus the next ARM, MCU and OMAP generations. The DM6437 DaVinci is highlighted.]
TI offers a rich ecosystem of embedded processors, many combining a DSP core with an ARM core.
DSP core: better at complex mathematical applications
- High Performance DSPs
- Low Power DSPs
ARM: better at advanced UI and system control (ARM9, Cortex A8, etc.)
<mouse click>
The DSP processor that we focus on today is the DM6437, part of the DaVinci Digital Media processor family.
TI TMS320DM6437 Processor Architecture
Features
New C64x+™ Core
– C64x+™ Core @ Up to 600 MHz
Memory
– 80 KB L1D, 32 KB L1P Cache/SRAM
– 128 KB L2 Cache/SRAM
Peripherals
– Video Port Sub-System (VPSS): Input (CCDC), Output (w/DACs), Resizer, OSD, and Camera Control
– Two EMIFs: DDR2-266: 32 bits, 133 MHz; EMIF 2.1
– 10/100 Ethernet MAC, MII or RMII; PCI 33 MHz; HPI; McASP
– VLYNQ™ – Serial Interface to FPGAs
– UART (2), I2C, SPI, GPIO, PWM (3), CAN (HECC), 64-bit Timers (2)
[Figure: DM6437 block diagram, with all blocks connected through a switch fabric. DSP Subsystem: C64x+ DSP 600-MHz core, L1P 32 KB, L1D 80 KB, L2 128 KB cache. Video Processing Subsystem: front end (CCD controller video interface, preview, histogram/3A, resizer) and back end (on-screen display (OSD), video encoder (VENC), four 10-bit DACs). Peripherals: EDMA; system (JTAG, OSC, PLL, DDR PLL, watchdog timer, PWM ×3, 64-bit timer ×2); serial interfaces (UART ×2, SPI, I2C, CAN, McASP, McBSP ×2); connectivity (EMAC, VLYNQ, PCI 33, HPI); program/data storage (DDR2 controller (32b), EMIF (8b)).]
This is the architecture of the DM6437 SOC. This device is just one of 7 DaVinci processors. This is the DSP on the Avnet Spartan3A-DSP DaVinci Evaluation Platform.
DaVinci offers an array of on-chip resources for video processing, notably:
• Improved video performance with a 50 percent cost reduction over previous DSP digital media processors
• Built-in DACs save approximately $2 to $4 on overall BOM cost
• VPSS offloads the DSP: up to 40% DSP offload for the DM6437 provides up to 240 MHz of processor savings for more features or higher quality
    Preview engine   Up to 15%
    Resizer          Up to 10%
    OSD              Up to 15%
    Total            Up to 40% for DM6437
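As a quick check of the offload figures, assuming the 600 MHz maximum core clock quoted in the feature list is the basis for the savings estimate:

$$15\% + 10\% + 15\% = 40\%, \qquad 0.40 \times 600\ \text{MHz} = 240\ \text{MHz}$$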
The shaded blocks represent functions supported by the Avnet Board Support Package for Simulink.
We draw your attention to the VPSS and the on-chip VLYNQ serial interface, both of which are featured in this seminar.
Presentation Flow
[Figure: model-based design tool flow. MATLAB® and Simulink® (algorithm and system design) feed two branches. MathWorks / TI branch: Real-Time Workshop Embedded Coder, Targets, Links generate C / ASM for Code Composer, verified through Link for CCS. Xilinx branch: Xilinx System Generator for DSP generates HDL for ISE, verified through hardware co-simulation. Both branches target the Avnet Spartan3A-DSP DaVinci Development Kit.]
1. Introduce tool flow for DaVinci Digital Media Processor
2. Introduce DaVinci Digital Media Processor architecture
3. Introduce Xilinx FPGA architecture
4. Introduce Xilinx System Generator for DSP
In day 1, we saw how the TI design environment fits into the Model-Based Design flow for video and image processing. We also covered an introduction to TI DaVinci Digital Media Processors in the previous slides.
<mouse click>
We continue with a basic intro to Xilinx FPGA architecture in order to understand how DSP and FPGA can best work together.
Xilinx FPGA Architecture
• Logic Fabric
  – Gates and flip-flops
• Embedded Blocks
  – Memory
  – DSP/Multipliers
  – Clock management
  – High speed serial I/O
  – Soft/Hard processors
• Programmable I/Os
• In-System Programmable
  – JTAG
Memory
• Block RAM
  – RAM or ROM
  – True dual port
    • Separate read and write ports
  – Independent port size
    • Data width translation
  – Excellent for video line buffers, FIFOs
[Figure: dual-port block RAM symbol. Port A: CLKA, ADDRA, DIA, DIPA, DOA, DOPA. Port B: CLKB, ADDRB, DIB, DIPB, DOB, DOPB.]
Block RAM Configurations
Configuration   Depth   Data bits   Parity bits
16K x 1         16K     1           0
8K x 2          8K      2           0
4K x 4          4K      4           0
2K x 9          2K      8           1
1K x 18         1K      16          2
512 x 36        512     32          4
You will work extensively with these memory blocks in the labs.
Clock Management
• Digital Clock Managers (DCMs)
  – Clock de-skew
  – Phase shifting
  – Clock multiplication
  – Clock division
  – Frequency synthesis
[Figure: DCM symbol with input CLKIN and outputs CLK0, CLK90, CLKFX.]
Programmable I/Os
• Single-ended
• Differential / LVDS
• Programmable I/O standards
  – Multiple I/O banks
• DDR I/O registers
• On-chip termination

Standard                 Output VCCO   Input VREF
Single-ended:
LVTTL                    3.3V          --
LVCMOS33                 3.3V          --
LVCMOS25                 2.5V          --
LVCMOS18                 1.8V          --
LVCMOS15                 1.5V          --
LVCMOS12                 1.2V          --
PCI 32/64 bit 33MHz      3.3V          --
SSTL2 Class I            2.5V          1.25V
SSTL2 Class II           2.5V          1.25V
SSTL18 Class I           1.8V          0.9V
HSTL Class I             1.5V          0.75V
HSTL Class III           1.5V          0.9V
HSTL18 Class I           1.8V          0.9V
HSTL18 Class II          1.8V          0.9V
HSTL18 Class III         1.8V          1.1V
GTL                      --            0.8V
GTL+                     --            1.0V
Differential:
LVDS2.5                  2.5V          --
Bus LVDS2.5              2.5V          --
Ultra LVDS2.5            2.5V          --
LVDS_ext2.5              2.5V          --
RSDS                     2.5V          --
LDT2.5                   2.5V          --

[Figure: I/O block with register pairs and DDR muxes on the input, output and 3-state paths, connected to the pad; I/Os are grouped into I/O banks.]
Out of this rich offering of IO standards,
XtremeDSP DSP48A Slice
• Integrated XtremeDSP Slice
  – Application optimized capacity
    • 3400A – 126 DSP48As
    • 1800A – 84 DSP48As
  – Integrated pre-adder optimized for filters
  – 40 opmodes
  – 250 MHz operation, standard speed grade
  – Compatible with Virtex-DSP
• High-performance and flexibility as computation engine for DSP and video
Transcript:
The DSP48A is an optimized implementation for the Spartan-class devices. A key new feature is the addition of a pre-adder, which is used in symmetric filters, one of the most common implementations in the target markets. In the DSP48 and DSP48E implementations, the pre-adder is implemented using FPGA logic resources. Including it in the DSP48A reduces logic utilization, increases performance and lowers power. The DSP48A operates at 250 MHz in the -4 standard speed grade. In the Spartan-DSP domain, we will always emphasize the standard speed grade parts when we discuss performance, as this is the lower-cost path that most customers will want to pursue.
The new DSP48A is most closely related to the DSP48 in Virtex-4. There is migration capability between the 3 implementations, especially if the FFT and FIR compilers are used. More detail, including a summary table, is provided in the Reference section. The customer presentation also provides more details.
The other main new feature is the expanded amount of BRAM, roughly 2X the ratio of BRAM to logic compared to other Spartan-3 generation devices. The number of BRAMs is matched to the number of DSP48s, as is done in Virtex-DSP. The BRAM has also been enhanced to achieve about a 25% speed increase over Spartan-3A. There are of course additional benefits to having more and higher-performance BRAM in Spartan-3 class devices, for example in application areas such as embedded processing, where MicroBlaze and the soft embedded IP can take advantage of the additional memory.
Spartan-3A/3AN/3ADSP Family
Device                 50A/N   200A/N   400A/N   700A/N   1400A/N   1800AD      3400AD
Gates                  50K     200K     400K     700K     1.4M      1.8M        3.4M
Logic Cells            1,584   4,032    8,064    13,248   25,344    37,440      53,712
Block RAM bits         54K     288K     360K     360K     576K      1512K       2268K
Distributed RAM bits   11K     28K      56K      92K      176K      260K        373K
18x18 Multipliers      3       16       20       20       32        84 DSP48A   126 DSP48A
DCMs                   2       4        4        8        8         8           8
Maximum I/O            144     248      311      372      502       519         469
The chart shows the various members of the Spartan-3A family – including Spartan-3A, Spartan-3AN, and Spartan-3A DSP.
<mouse click> The focus for today is the 1800 version of the Spartan-3A DSP since that is the device we’ll use on the hardware today.
From Sequential to Full Parallel Processing
• FPGAs can deploy hardware resources to suit the task
[Figure, sequential: 256-tap FIR filter, sequential implementation. Data In and Coefficients feed a single MACC unit (multiplier, adder, register) that produces Data Out; 256 clock cycles are needed per output. 500 MHz / 256 clock cycles = 2 MSPS. Lowest resource usage.]
[Figure, parallel: 256-tap FIR filter, fully parallel implementation. Data In passes through a register chain feeding 256 multipliers with coefficients C0, C1, C2, C3, …, C255, whose products are summed by registered adders to produce Data Out. 500 MHz / 1 clock cycle = 500 MSPS. Highest performance.]
Xilinx FPGAs can implement a wide range of DSP functions, with the flexibility to deploy the right mix of hardware resources appropriate for the task at hand.
FIR filters are used extensively in DSP and serve here as the basis for comparing general inner-loop computation structures. A FIR filter is a sum of products of coefficients and samples held in a time-skew buffer, or pipeline. The same design considerations apply to all inner-loop computations: FIR and IIR filters, correlators, moving averages, and SAD.
Shown here are 2 implementations of the same 256-tap FIR filter, both of which can be implemented in a Xilinx FPGA:
• Using a single time-shared MAC, each output sample takes 256 clock cycles. Clocking at 500 MHz yields a sample rate of only about 2 MSPS.
• However, what if you had 256 of those MAC structures in one device? Now you get a filter result every clock cycle. Running the clock at 500 MHz yields a 500 MSPS sample rate, 256 times faster than the sequential MAC can achieve! This is the power of parallelism in Xilinx FPGAs, which integrate large numbers of these DSP resources to achieve extremely high performance.
These compute-intensive repetitive inner-loop computations will be prime candidates to off-load from DSP to FPGA when partitioning an algorithm.
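The trade-off between the two structures can be sketched numerically. This is a minimal illustration, assuming the 500 MHz clock and 256 taps shown on the slide; the function names are hypothetical and not part of any Xilinx tool.

```python
def sequential_rate_msps(f_clock_mhz: float, taps: int) -> float:
    """One time-shared MAC: every output sample costs `taps` clock cycles."""
    return f_clock_mhz / taps


def parallel_rate_msps(f_clock_mhz: float) -> float:
    """One MAC per tap: a new output sample is produced every clock cycle."""
    return f_clock_mhz


TAPS = 256             # filter length from the slide
F_CLOCK_MHZ = 500.0    # clock rate from the slide

seq = sequential_rate_msps(F_CLOCK_MHZ, TAPS)
par = parallel_rate_msps(F_CLOCK_MHZ)

print(f"sequential: {seq:.2f} MSPS using 1 MAC")
print(f"parallel:   {par:.0f} MSPS using {TAPS} MACs")
print(f"speed-up:   {par / seq:.0f}x")
```

At equal clock rates the speed-up equals the number of taps, which is also the factor by which the multiplier count grows.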
MACC-based FIR Using DSP48A
[Figure: DSP48A configured as a multiply-accumulate unit. x[n-k] and b_k feed the multiplier; the adder and Z^-1 register accumulate the products into y[n].]
$$y[n] = \sum_{k=0}^{N-1} x[n-k]\, b_k$$
• Implement in single DSP48A overclocked at fclock = fs x N
  • N = length of FIR filter, or number of coefficients
  • fs = filter throughput, in Mega-samples / second
  • fclock = computation clock rate (maximum 250 MHz in Spartan3A-DSP, lowest speed grade)
• Ex. 25-tap FIR with fclock = 250 MHz achieves 10 Mega-samples/second
• Most efficient use of hardware
Let’s focus on the MACC-based FIR in more detail. A single DSP48A can implement a FIR by summing each term sequentially, using a single multiplier-accumulator or ‘MACC’, to produce a result after N clock cycles, where N = filter length. In contrast to a fully parallel implementation, a serial implementation time-shares a single accumulator.
[mouse click and pause for effect]
It reduces hardware by a factor of N compared to parallel structures, but also reduces filter sampling rate throughput by the same factor : fs = fclock / N. Consequently, the MACC FIR is the optimal structure at lower sampling rates.
A comprehensive tutorial on usage of DSP48 to implement FIR filters over a wide range of filter length and desired throughput is listed in the reference section.
--------------------------------------------------------------------------
Supplementary notes:
To build an accumulator the output of the adder is registered by the flip-flop in the slice, to capture the result at node P. This result is then routed back round to the adder. Hence, each clock cycle a new input will be presented to the ‘C’ input and added to the result calculated from the previous clock cycle.
The key message is that Xilinx FPGAs offer a lot of flexibility to implement DSP functions using DSP48, tailored to the computation task at hand.
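The serial MACC structure can be modelled in a few lines. This is a behavioural sketch only, not a description of the DSP48A internals: the function name `macc_fir` and the test values are hypothetical, and each inner-loop iteration stands for one clock cycle of the single multiply-accumulate unit.

```python
def macc_fir(samples, coeffs):
    """FIR built on one time-shared MACC: N clock cycles per output sample.

    Returns (outputs, clock_cycles).
    """
    n_taps = len(coeffs)
    history = [0] * n_taps            # history[k] holds x[n-k]
    outputs = []
    cycles = 0
    for x in samples:
        history = [x] + history[:-1]  # shift the newest sample in
        acc = 0
        for k in range(n_taps):       # one multiply-accumulate per clock
            acc += history[k] * coeffs[k]
            cycles += 1
        outputs.append(acc)           # result is ready after N cycles
    return outputs, cycles


coeffs = [1, 2, 3]
samples = [1, 0, 0, 4]
outputs, cycles = macc_fir(samples, coeffs)
print(outputs, cycles)

# Throughput from the slide: fs = fclock / N
fs_msps = 250.0 / 25
print(fs_msps)
```

Feeding an impulse returns the coefficients in order, which is a convenient sanity check for this kind of model.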
Presentation Flow
[Figure: model-based design tool flow. MATLAB® and Simulink® (algorithm and system design) feed two branches. MathWorks / TI branch: Real-Time Workshop Embedded Coder, Targets, Links generate C / ASM for Code Composer, verified through Link for CCS. Xilinx branch: Xilinx System Generator for DSP generates HDL for ISE, verified through hardware co-simulation. Both branches target the Avnet Spartan3A-DSP DaVinci Development Kit.]
1. Introduce tool flow for DaVinci Digital Media Processor
2. Introduce DaVinci Digital Media Processor architecture
3. Introduce Xilinx FPGA architecture
4. Introduce Xilinx System Generator for DSP
We continue with an overview of Xilinx System Generator for DSP.
System Generator for DSP
• System Generator enables the use of Simulink for FPGA design
  – Design DSP applications in FPGAs without hardware design experience
• Designs are constructed using a Xilinx provided blockset
• FPGA implementation files, optimized for Xilinx devices, are automatically generated
System Generator is a DSP design tool from Xilinx that enables the use of The MathWorks model-based design environment, Simulink, for FPGA design. Previous experience with Xilinx FPGAs or RTL design methodologies is not required when using System Generator. Designs are captured in the DSP-friendly Simulink modeling environment using a Xilinx-specific blockset. All of the downstream FPGA implementation steps, including synthesis and place and route, are automatically performed to generate an FPGA programming file.
->->->-> slides 23 .. 26 for reference only / transform into a 5 min. demo
The Xilinx DSP Blockset
• Over 90 DSP building blocks available
• Abstracts away the details of the FPGA hardware architecture
• Enables design migration between technologies
• Leverages Xilinx IP to deliver high quality of results
Over 90 DSP building blocks are provided in the Xilinx DSP blockset for Simulink. These blocks include the common DSP building blocks such as adders, multipliers and registers. Also included are a set of complex DSP building blocks such as forward error correction blocks, FFTs, filters and memories. These blocks leverage the Xilinx IP core generators to deliver optimized results for the selected device.
FIR Filter Generation
• Automatically generated performance optimized FIR filters
  – Takes full advantage of the Virtex-4 DSP48 blocks to achieve 500 MHz performance
  – Supports multi-rate, oversampled, multi-channel and coefficient optimization
• MathWorks FDA Tool integration provides graphical filter design and coefficient generation
[Figure: FIR Compiler and FDA Tool.]
System Generator includes a FIR Compiler block that targets the dedicated DSP48 hardware resources in Virtex-4 and Virtex-5 devices to create highly optimized implementations that can run in excess of 500 MHz. Configuration options allow generation of direct, polyphase decimation, polyphase interpolation and oversampled implementations. Standard MATLAB functions such as fir2, or The MathWorks FDATool, can be used to create coefficients for the Xilinx FIR Compiler.
System Generator / Project Navigator Integration
• Combine System Generator with RTL blocks in ISE’s Project Navigator to form complete systems
• Supports multiple instantiations of System Generator designs as sub-blocks
• Manage constraints of multiple System Generator designs
Allows persistence of ISE place and route settings between design iterations.
->->->-> update screen shots to our model
Agenda
• Xilinx FPGA Meets TI DSP
• Xilinx FPGA design for temporal template matching
• Hardware / software algorithm partitioning between TI DSP and Xilinx FPGA
•We now focus on techniques to best solve the problem at hand, temporal template matching for video using Xilinx FPGAs.
Block Matching for Video & Imaging
• Integral part of most motion-compensated video coding standards, e.g. MPEG 1, MPEG 2, H.264
• Video stabilization, video analytics, target tracking
• Find the best match for a selected block (‘template’) in current frame
• Calculate motion vector between ‘template’ block location in previous frame and its counterpart in current frame
• Similarity measure for best match:
  – Mean Absolute Error (MAE)
  – Mean Square Error (MSE)
  – Sum of the Absolute Difference (SAD)
[Figure: previous frame and current frame, with the motion vector between the template block locations.]
Block matching can employ various algorithms.
<Mouse click>
SAD will be used in our work throughout this seminar.
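Exhaustive Search SAD Block Matching
Template pixel array = T(i,j)
$$SAD(x,y) = \sum_{i=1}^{Template\_height} \sum_{j=1}^{Template\_width} \left| T(i,j) - I(i,j) \right|$$
where I(i,j) is the image pixel beneath template pixel T(i,j) when the template is placed at (x,y).
Starting at top-left corner (0,0), sweep template across ROI
Exhaustive search = SAD calculated at each and every pixel location in ROI
[Figure: template swept across the region of interest (ROI), shown at positions (50, 50), (100, 75) and (200, 100): SAD(50,50) > 0, SAD(100,75) approaching 0, SAD(200,100) = 0.]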
Exhaustive search = SAD calculated at each and every pixel location in ROI by displacing template by 1 pixel at a time.
The lower the SAD result, the better the match between the template and the pixel region beneath it. SAD = 0 indicates a perfect match.
It is obvious that an exhaustive search is very compute intensive, but will produce the best block-matching performance.
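The exhaustive search described above can be sketched in software as follows. This is a plain-Python illustration under stated assumptions: the helper names `sad_at` and `exhaustive_search`, the tiny 6 x 5 test image and the 2 x 2 template are all hypothetical, and indices are zero-based with (x, y) as the top-left corner of the template.

```python
def sad_at(image, template, x, y):
    """Sum of absolute differences with the template's top-left corner at (x, y)."""
    return sum(
        abs(template[i][j] - image[y + i][x + j])
        for i in range(len(template))
        for j in range(len(template[0]))
    )


def exhaustive_search(image, template):
    """Evaluate SAD at every valid position; return (best_sad, x, y)."""
    t_h, t_w = len(template), len(template[0])
    i_h, i_w = len(image), len(image[0])
    best = None
    for y in range(i_h - t_h + 1):        # displace the template 1 pixel at a time
        for x in range(i_w - t_w + 1):
            score = sad_at(image, template, x, y)
            if best is None or score < best[0]:
                best = (score, x, y)
    return best


# Black image with the template pasted at x = 3, y = 2.
template = [[5, 6], [7, 8]]
image = [[0] * 6 for _ in range(5)]
for i in range(2):
    for j in range(2):
        image[2 + i][3 + j] = template[i][j]

best = exhaustive_search(image, template)
print(best)
```

The nested loops make the cost explicit: every candidate position requires Template_height x Template_width subtract, absolute-value and add operations.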
SAD in Xilinx System Generator for DSP
[Figure: input image, 72 x 54 pixels, full black with template at position (20,20). Plots: SAD calculated by MATLAB at start of simulation, with best match at (20,20); SAD generated in FPGA hardware (System Generator).]
Let’s illustrate the practical use of Simulink and System Generator in a video design flow for pattern matching using sum-of-absolute differences (SAD) targeting Xilinx Spartan3A-DSP.
The model contains a testbench consisting of a synthetic input image, 72 x 54 pixels, full black except for a 22x18 template inserted at location (20,20). At the start of simulation, the input image is read into a MATLAB workspace array named Test_Image_IN by a callback function and presented as stimuli to the model. Simulation time is set to 54x72 + 3. (The extra 3 is to flush out initial pipeline latency.)
->->->-> explain data vector is row-major as set up in init script
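The stimulus layout and simulation length can be illustrated with a short sketch. It is written in Python rather than MATLAB, and it assumes a row-major pixel order and a full-white template, neither of which is confirmed by the model itself; all variable names are hypothetical.

```python
WIDTH, HEIGHT = 72, 54            # input image size from the testbench
TEMPLATE_W, TEMPLATE_H = 22, 18   # template size from the testbench
X0, Y0 = 20, 20                   # template location
PIPELINE_FLUSH = 3                # extra samples to flush pipeline latency

# Full-black image with the template region set to white (assumed content).
image = [[0] * WIDTH for _ in range(HEIGHT)]
for r in range(TEMPLATE_H):
    for c in range(TEMPLATE_W):
        image[Y0 + r][X0 + c] = 255

# Row-major flattening: pixel (row, col) lands at index row * WIDTH + col.
stimulus = [pixel for row in image for pixel in row]

sim_time = HEIGHT * WIDTH + PIPELINE_FLUSH
first_lit = stimulus.index(255)

print(len(stimulus), sim_time, first_lit)
```

One sample is presented per simulation step, so the simulation length is the pixel count plus the pipeline flush.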
[1st mouse click]
The MATLAB callback function at start of simulation also calculates SAD between the template and each pixel in the input image. In the SAD calculation, the best match of a template within an image frame is the minimum value, or the darkest region on the SAD plot. As expected, the darkest region in the calculated SAD corresponds to the upper-left corner of the template.
[2nd mouse click]
During simulation, SAD between the template and each pixel in the input image is calculated in hardware. Results are displayed at the end of simulation and compared against the MATLAB-calculated SAD. Note that the MATLAB-calculated SAD image is identical to the hardware-generated image. This confirms proper operation of the hardware SAD function.
Note: the black boundary along the bottom and right side of the SAD plots represents the limits of displacement of the template inside the image. Practical pattern-matching algorithms usually limit the search to a sub-region of interest within the whole frame.
SAD in Xilinx System Generator for DSP
• How do video pixels move through FPGA memory?
• How are pixels presented to the SAD computation engine?
[Figure: 22 x 18 template at position (50, 50).]
Let’s continue with an illustration of how the SAD algorithm is executed in FPGA hardware. We start by focusing on how the image is stored in FPGA memory and presented to the SAD computation engine.
An Efficient Line Buffer Strategy
Transposition: incoming pixel rows become columns
* In hardware this is just clever indexing
SAD / Image Management in FPGA
[Figure: video frame (H x V) and transposed line buffer, ROI_width columns wide and Template_height = 18 pixels tall. For each current pixel: read 1 row + current pixel, drop the expired pixel, and shift the result in to the Addressable Shift Register (ASR), Template_height tall, along the shift direction.]
When slide appears, you point to Lena, top-left and say:
'Pixels don't all just appear simultaneously in a frame of video. Rather, pixels are synchronized by horizontal and vertical sync signals. Each video line is displayed by pixels arriving from left <you point to the current pixel as the white outlined square on bridge of her nose> <mouse-click> to right.
In order to send the video to the SAD engine, we store pixels in a line buffer in transposed fashion, meaning a horizontal line of pixels is stored vertically in the transposed line buffer like this <point to white outlined square that appeared in the transposed line buffer> …
… <mouse-click> <holding your breath to show that you have yet to finish the thought, you point to next pixel as the white outlined square moving toward Lena's right eye, then point to white outlined square as it is stored in the transposed line buffer>
… <mouse-click> <point to next pixel as the white outlined square under Lena's right eye, then point to white outlined square as it is stored in the transposed line buffer>
Each location in the line buffer memory holds a column of the video image of height template_height. The dual port RAM stores as many lines of video as the height of the template.
Under control of an address counter at the pixel rate, as each new pixel arrives:
- read out pixel column of Template_height from port A of the transposed line buffer
- slice off and discard top-most pixel from oldest video line (leftmost column of the transposed line buffer, this is not shown for clarity)
- concatenate current pixel as LSByte, result is new pixel column
- shift this new pixel column into addressable shift register (ASR) for SAD engine
- store the new pixel column into port B at next pixel period
< Mouse click > Line buffer and Addressable Shift Register (ASR) are available as parameterizable blocks in System Generator.
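The per-pixel update of the transposed line buffer and ASR can be modelled behaviourally. This is a sketch under stated assumptions, not the System Generator implementation: the function `stream_frame` is hypothetical, columns are held as Python lists (oldest pixel first), and the one-pixel-period delay of the port B write is not modelled.

```python
from collections import deque


def stream_frame(frame, template_height, template_width):
    """Push a frame, pixel by pixel, through a transposed line buffer and an ASR.

    Returns the final line buffer and the ASR contents (oldest column first).
    """
    width = len(frame[0])
    # One column of template_height pixels per horizontal position.
    line_buffer = [[0] * template_height for _ in range(width)]
    # The ASR keeps the most recent template_width columns for the SAD engine.
    asr = deque(maxlen=template_width)

    for row in frame:                                # video lines arrive top to bottom
        for address, pixel in enumerate(row):        # address counter at pixel rate
            column = line_buffer[address]            # read column (port A)
            new_column = column[1:] + [pixel]        # drop oldest, append current pixel
            asr.append(new_column)                   # shift column into the ASR
            line_buffer[address] = new_column        # write column back (port B)
    return line_buffer, list(asr)


# 4 x 5 test frame; pixel value encodes its position as 10 * row + column.
frame = [[10 * r + c for c in range(4)] for r in range(5)]
line_buffer, asr = stream_frame(frame, template_height=3, template_width=2)
print(asr)
```

After the last pixel, the ASR holds the window of the last three video lines at the two right-most columns, which is exactly the region a 2 x 3 template would be compared against.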
Addressable Shift Register
SAD / Image Management in FPGA
Under control of an address counter at the pixel rate, as each new pixel arrives:
• read out pixel column of Template_height from port A of the transposed line buffer (dual port RAM)
• slice off and discard top-most pixel from oldest video line
• concatenate current pixel as LSByte, result is new pixel column
• shift this new pixel column into addressable shift register (ASR) for SAD engine
• store the new pixel column into port B at next pixel period
[Figure: Video IN feeds the transposed line buffer, which feeds the SAD block; the path is labeled Template_width.]
Video data is Y (luminance), 8-bit pixels. The dual port RAM stores as many lines of video as the height of the template, in this case Template_height = 18. These 18 pixels form words of 18 x 8 = 144 bits. The notation UFix refers to ‘unsigned fixed-point’.
Dual port RAM block is mapped to one or more BRAM elements when the design is netlisted.
Note the flexibility of System Generator in managing video data.
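At the bit level, the 144-bit column word and the "discard top pixel, concatenate current pixel as LSByte" step can be sketched with integer arithmetic. The helper names `push_pixel` and `unpack` are hypothetical; the widths come from the notes above (8-bit pixels, Template_height = 18).

```python
PIXEL_BITS = 8
TEMPLATE_HEIGHT = 18
WORD_BITS = PIXEL_BITS * TEMPLATE_HEIGHT   # 18 x 8 = 144 bits
WORD_MASK = (1 << WORD_BITS) - 1


def push_pixel(word: int, pixel: int) -> int:
    """Shift the column up by one pixel, dropping the top byte, and
    concatenate the current pixel as the least significant byte."""
    return ((word << PIXEL_BITS) | (pixel & 0xFF)) & WORD_MASK


def unpack(word: int) -> list:
    """Split a column word into its pixels, oldest (top) pixel first."""
    return [
        (word >> (PIXEL_BITS * i)) & 0xFF
        for i in reversed(range(TEMPLATE_HEIGHT))
    ]


word = 0
for pixel in range(1, 20):      # 19 pixels: the first one falls off the top
    word = push_pixel(word, pixel)

print(WORD_BITS, unpack(word))
```

Masking to 144 bits plays the role of the slice operation that discards the expired pixel.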
ACC-based Sum of Absolute Difference (SAD)
$$SAD(x,y) = \sum_{i=1}^{Template\_height} \sum_{j=1}^{Template\_width} \left| T(i,j) - I(i,j) \right|$$
[Figure: the template and image pixel are subtracted and passed through abs to the C input of a DSP48A. The opmode port selects P = C or P = P + C, and the Z^-1 register accumulates the output P, giving 1 line of SAD:]
$$\sum_{j=1}^{Template\_width} \left| T(1,j) - I(1,j) \right|$$
• Each line of block-match in single DSP48A overclocked at fclock = fs x Template_width
  • Template_width = width of template, in pixels
  • fs = pixel rate, in Mega-pixels / second
  • fclock = computation clock rate (maximum 250 MHz in Spartan3A-DSP, lowest speed grade)
In a similar fashion to the MACC FIR, a single DSP48A in accumulator mode can implement one line of pattern matching SAD by summing each term sequentially to produce a result after template_width clock cycles, where template_width = width of the template, in pixels.
[mouse click and pause for effect]
Note the sequencer controlling the opmode port of the DSP48A. Over 40 user-controlled operating modes (opmodes) can dynamically adapt XtremeDSP Slice functions from clock cycle to clock cycle, optimizing performance through resource sharing to create custom sequential computation engines. Each XtremeDSP Slice is individually controllable.
For an accumulator function, select the C input on the 1st of N clock cycles to initialize the accumulator, then select the feedback path from the registered output P to sum with the C input on the remaining N-1 clock cycles.
Note that using the opmode in a dynamic fashion has no impact on the performance due to the registered opmode input. If all input and output registers are used, no matter the operation, DSP48A can achieve 250 MHz. In Virtex-5, DSP48E can achieve 550 MHz.
Note:
The absolute value function of the difference between template and pixel value is not implemented in the adder of the DSP48A. Rather, it is implemented in fabric using the sign bit of (template – pixel) to select +/-. The reason is that DSP48A doesn’t support opmodes P = P + C + Cin, or P = P - C – Cin. However, these are supported in DSP48 (Virtex-4) and DSP48E (Virtex-5). In these families, the sign bit could be used to select the appropriate opmode to pull the absolute value function into DSP48 or DSP48E, thereby reducing overall resource requirements.
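The opmode sequencing and the fabric absolute-value step can be sketched behaviourally. This is an illustrative model only: `abs_diff_fabric` and `sad_line` are hypothetical names, and the two branches stand for the P = C and P = P + C opmodes described above rather than for actual opmode encodings.

```python
def abs_diff_fabric(template_pixel: int, image_pixel: int) -> int:
    """Absolute difference done outside the DSP48A: the sign bit of
    (template - pixel) selects whether the difference is negated."""
    diff = template_pixel - image_pixel
    sign_bit = diff < 0
    return -diff if sign_bit else diff


def sad_line(template_row, image_row) -> int:
    """One DSP48A in accumulator mode: one term per clock cycle."""
    p = 0
    for cycle, (t, i) in enumerate(zip(template_row, image_row)):
        c = abs_diff_fabric(t, i)     # value presented on the C input
        if cycle == 0:
            p = c                     # opmode P = C: initialize the accumulator
        else:
            p = p + c                 # opmode P = P + C: accumulate via feedback
    return p


line_sad = sad_line([10, 20, 30], [12, 20, 25])
print(line_sad)
```

Loading C directly on the first cycle avoids a separate clear cycle between consecutive lines of SAD.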
SAD Computation Using DSP48A
• Combined sequential and parallel processing to suit task
[Figure: Video IN feeds the transposed line buffer and then the SAD block, which is overclocked X Template_width. The SAD block contains Template_height DSP48A’s, each taking the difference between template and image pixel, applying abs, and accumulating the result through a Z^-1 register.]
There are as many DSP48A SAD engines as the height in lines of the template. An extra adder block (not shown for clarity) sums all DSP48A SAD engines for the final SAD value at the current pixel location.
Note the upsampling block prior to the SAD block. It serves to clock each DSP48A at 22X the pixel rate, or __ MHz x 22 = ___ MHz. It is good design practice to run the DSP48A close to its maximum speed in order to make best use of the resource.
Optional / explanatory notes:
Recall the sequencer controlling the opmode port of the DSP48A to create the accumulator function for the sum of absolute differences. The C input on the 1st of N clock cycles initializes the accumulator, then the feedback path from the registered output P sums with the C input on the remaining N-1 clock cycles. The output of the DSP48A at the final clock cycle is the SAD of one line of the template with incoming video pixels in the line buffer.
This SAD engine can compute an exhaustive search of the template at each pixel location in the frame in real time at the video pixel rate. In practice, the SAD calculations are usually confined to a sub-region of interest in the frame by the search algorithm.
You will build and implement the SAD in Xilinx System Generator during lab 3.
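As a rough behavioral sketch of the sequencing described in the notes above (not the System Generator design built in lab 3), the per-line accumulator and the final adder can be modeled in Python. The function names are hypothetical, and the absolute difference is computed with Python's abs() in place of the fabric logic.

```python
def sad_line_engine(template_line, pixel_line):
    """Model one DSP48A SAD engine over N = len(template_line) clock cycles."""
    assert len(template_line) == len(pixel_line)
    p = 0  # registered output P
    for cycle, (t, x) in enumerate(zip(template_line, pixel_line)):
        c = abs(t - x)      # absolute difference, presented on the C input
        if cycle == 0:
            p = c           # 1st cycle: C initializes the accumulator
        else:
            p = p + c       # remaining N-1 cycles: P feedback summed with C
    return p                # SAD of one template line


def sad_at_location(template, window):
    """Sum all line engines (the extra adder block) for one pixel location."""
    return sum(sad_line_engine(t, w) for t, w in zip(template, window))


# Tiny 2x2 illustration with made-up values
template = [[1, 2], [3, 4]]
window = [[2, 2], [0, 8]]
print(sad_at_location(template, window))
```

Each call to sad_line_engine corresponds to one DSP48A running template_width cycles per pixel, and sad_at_location plays the role of the extra adder that combines the template_height engines.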
Agenda
• Xilinx FPGA Meets TI DSP
• Xilinx FPGA design for temporal template matching
• Hardware / software algorithm partitioning between TI DSP and Xilinx FPGA
We now present some guidelines for hardware / software algorithm partitioning between TI DSP and Xilinx FPGA.
System Execution Control
Typical DSP applications:
– 20% of program code consumes 80% of required MIPS
– 20% of program code requires time-consuming, difficult-to-maintain assembly coding to increase system performance
– Challenge to reduce the processing load in 20% of software and manage the complexity of the remaining 80% of the code
[Figure: DSP Algorithm MIPS Budget. Data IN, Computation, Data OUT; 80% of code uses 20% of MIPS, and 20% of code uses 80% of MIPS.]
In typical DSP applications, 20% of the program code consumes 80% of the required MIPS. This 20% of the program code often requires time-consuming, error-prone, and difficult-to-maintain assembly coding to increase overall system performance. This code also becomes far less portable than the remaining 80% of the code that focuses on initialization and system execution control. At the same time, that 80% of the code reflects the majority of the system’s complexity. This utilization outcome creates a double challenge for DSP software engineers. They must reduce the processing load in 20% of the software and manage the complexity of the remaining 80% of the code.
FPGA co-processing is well suited to addressing the 80% processing load caused by 20% of the algorithm code. The challenge is to identify what the DSP should off-load to a coprocessor.
Note: scheduling also becomes a factor as the MIPS load increases.
Profiling real-time code execution on DSP
Code Profiling with MathWorks Tools
• Uses DSP/BIOS statistics to measure execution time
• Identify segments of generated code to off-load to FPGA co-processor
The code profiler in Target Support Package TC6 uses DSP/BIOS statistics objects to measure the execution time of code segments generated by individual subsystems. A code profile report helps you identify segments of generated code that are candidates for off-loading to an FPGA co-processor.
In-depth technical information on code profiling is available at the following:
http://www.mathworks.com/access/helpdesk/help/toolbox/tic6000/index.html?/access/helpdesk/help/toolbox/tic6000/f8-7016.html
Code Profiling in Target Support Package TC6
• profiling report by subsystem of the Simulink model
Note the correspondence between the code profiling report and relevant sections of the Simulink model.
Observe the motion estimation sum-of-absolute differences takes 29.79 ms, which is (29.79/45.82) = 65% of the CPU load. Recall from the lecture that SAD is a compute-intensive inner-loop type calculation. Based on these numbers, we choose to partition the video stabilization model by off-loading the SAD-based motion estimation to an FPGA co-processor.
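The arithmetic behind this partitioning decision can be written out as a small Python check. Only the two timings quoted above (29.79 ms and 45.82 ms) come from the profile report; the function name is a hypothetical helper introduced for illustration.

```python
def cpu_share(segment_ms: float, total_ms: float) -> float:
    """Fraction of the total profiled execution time spent in one segment."""
    return segment_ms / total_ms


# Timings quoted from the profile report discussed above
sad_ms = 29.79
total_ms = 45.82

share = cpu_share(sad_ms, total_ms)
print(f"SAD motion estimation: {share:.0%} of CPU load")
```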
Guidelines for off-load to FPGA Co-Processor
DSP Core:
• Hi-level decision-making
Video Processing Subsystem:
– Frame buffer
– Scaler
– OSD
– Histogram
– CCD controller
FPGA Co-Processor:
– Parallel execution
– High computation rates in fixed-point math
– Repetitive calculations
– Nested inner loops
• ex. FFT, MAC FIR, moving average, correlation, SAD
– Fast access to deeply pipelined time-skew buffers
– Wide data words
– Custom peripherals
Selecting portions of an algorithm suitable for off-load to FPGA co-processor.
High-level decision-making should remain in the DSP to take advantage of the excellent quality of results from the MathWorks TC6 code-generation tools for the DM6437. Additionally, DaVinci has an on-board video processing subsystem optimized for video operations, including frame buffer in DDR2, fractional image scaling, OSD, histogram and CCD controller. Consequently, these functions can remain on the DSP side to utilize these resources.
Handling wide data words is a distinct advantage of the FPGA. For example, the transposed line buffer in the FPGA SAD engine uses a dual-port RAM to store words of (template_height x 8) bits = 144 bits. This is impractical for a fixed-width DSP processor.
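To illustrate the wide-word idea, the following Python sketch packs one column of 8-bit pixels into a single integer word, as the transposed line buffer does with one RAM word per column. The function names are hypothetical, and the template height of 18 lines is inferred from 144 / 8 rather than stated explicitly in the slides.

```python
TEMPLATE_HEIGHT = 18   # assumption: inferred from 144 bits / 8 bits per pixel
PIXEL_BITS = 8


def pack_column(pixels, bits=PIXEL_BITS):
    """Pack one column of pixels into a single wide word (row 0 in the LSBs)."""
    mask = (1 << bits) - 1
    word = 0
    for row, p in enumerate(pixels):
        word |= (p & mask) << (row * bits)
    return word


def unpack_column(word, height=TEMPLATE_HEIGHT, bits=PIXEL_BITS):
    """Recover the individual pixels from a wide word."""
    mask = (1 << bits) - 1
    return [(word >> (row * bits)) & mask for row in range(height)]


column = list(range(100, 100 + TEMPLATE_HEIGHT))   # made-up pixel values
word = pack_column(column)
print(word.bit_length() <= TEMPLATE_HEIGHT * PIXEL_BITS)
print(unpack_column(word) == column)
```

In the FPGA the whole word is read in one clock cycle, so all template_height SAD engines receive their pixel at once; a fixed-width processor would need several memory accesses for the same column.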
Ideally, the ratio of computation required to data crossing the interface between DSP and FPGA should be high, as it is for an FFT. In the current context, motion estimation is an excellent candidate for the FPGA, because very little data needs to cross the interface between DSP and FPGA, while template matching using SAD in the FPGA is highly compute-intensive.
Guidelines for off-load to FPGA Co-Processor
• Subset of algorithms self-contained and tightly grouped
• Occupy significant processing load
[Figure: total processing load divided among algorithm blocks A, B, C, D and E, with B + C + D grouped together between A and E.]
Optimal candidates for off-load to the FPGA co-processor will occupy a significant portion of the total processing load in the stand-alone DSP.
Model-based design in Simulink is particularly efficient at revealing such candidates because it inherently structures the algorithm into hierarchical subsystems which are clearly reflected in the profile report.
Guidelines for off-load to FPGA Co-Processor
• Computation bound, not IO bound
• Data transfer time < compute time
[Figure: algorithm blocks A and E, with the block group B, C, D repeated three times.]
Data transfer time must not overwhelm computation time, otherwise gain from faster computation time is lost.
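A simple way to reason about this guideline is a back-of-the-envelope speedup model, sketched below in Python. The model, the function name and all timing values are hypothetical illustrations, not measurements from the workshop design; it assumes the transfer and the FPGA computation happen one after the other.

```python
def offload_speedup(dsp_compute_ms, fpga_compute_ms, transfer_ms):
    """Speedup of a block when moved from the DSP to the FPGA co-processor.

    Assumes transfer and FPGA computation are sequential (no overlap).
    """
    return dsp_compute_ms / (fpga_compute_ms + transfer_ms)


# Hypothetical numbers: computation bound, small transfer time
print(offload_speedup(dsp_compute_ms=30.0, fpga_compute_ms=1.0, transfer_ms=2.0))

# Hypothetical numbers: IO bound, transfer time swamps the faster computation
print(offload_speedup(dsp_compute_ms=30.0, fpga_compute_ms=1.0, transfer_ms=40.0))
```

A result below 1 means the off-load is slower than leaving the block on the DSP, which is the case the guideline warns against.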
Guidelines for off-load to FPGA Co-Processor
• No processor dependency in the calculation while the coprocessor operates on the data
• Ex. FFT, SAD
[Figure: algorithm blocks A, B, C, D and E.]
Ideally, data transfer to the co-processor need occur only at the start of the co-processor compute phase, and results returned at the end. This reduces intermediate data transfers and handshaking complexity.
Examples: FFT, SAD
Agenda
• Xilinx FPGA Meets TI DSP
• Xilinx FPGA design for temporal template matching
• Hardware / software algorithm partitioning between TI DSP and Xilinx FPGA
… proceed to lab 3 Algorithm Partitioning Between the DSP and FPGA
We began with a quick review of the TI DM6437 DaVinci Digital Media Processor, followed by a basic intro to Xilinx FPGA architecture in order to understand how the 2 can best work together.
We then focused on techniques to best solve the problem at hand, temporal template matching for video using the FPGA.
Finally, we concluded with some guidelines for hardware / software algorithm partitioning between TI DSP and Xilinx FPGA.
Reference Slides
For Further Details visit www.xilinx.com/dsp
• 5 Application Notes available in the Virtex-4 User Guide in regard to implementation specifics
• Many Reference Designs in: VHDL, Verilog, System Generator for DSP
Xilinx FIR Filter Implementation Guides
http://www.xilinx.com/xlnx/xweb/xil_publications_display.jsp?sGlobalNavPick=&sSecondaryNavPick=&category=-1210767&iLanguageID=1
Spartan3A-DSP DSP48A User Guide
http://www.xilinx.com/support/documentation/user_guides/ug431.pdf
XtremeDSP Device Portfolio
Series | Family | Device | DSP Performance (GMAC/s) | Max Block RAM Memory Bandwidth (Gbps) | Max DSP Frequency (MHz) | XtremeDSP DSP48* Slices | Min Footprint (mm) | Distributed RAM (Kb) | Block RAM (Kb) | Logic Cells | High Speed Connectivity
Spartan-DSP | Spartan-3A DSP | 3SD1800A | 21 (1) | 1,512 (1) | 250 (1) | 84 | 19x19 | 260 | 1,512 | 37,440 | 176 x 622+ Mb/s LVDS pairs
Spartan-DSP | Spartan-3A DSP | 3SD3400A | 32 (1) | 2,268 (1) | 250 (1) | 126 | 19x19 | 373 | 2,268 | 53,712 | 208 x 622+ Mb/s LVDS pairs
Virtex-DSP | Virtex-5 SXT | 5VSX35T | 106 (2) | 6,653 (2) | 550 (2) | 192 | 27x27 | 520 | 3,024 | 34,816 | 180 x 1.25 Gb/s LVDS pairs, 8 x 3.2 Gb/s Transceivers
Virtex-DSP | Virtex-5 SXT | 5VSX50T | 158 (2) | 10,454 (2) | 550 (2) | 288 | 27x27 | 780 | 4,752 | 52,224 | 240 x 1.25 Gb/s LVDS pairs, 12 x 3.2 Gb/s Transceivers
Virtex-DSP | Virtex-5 SXT | 5VSX95T | 352 (2) | 19,325 (2) | 550 (2) | 640 | 27x27 | 1,520 | 8,784 | 94,208 | 320 x 1.25 Gb/s LVDS pairs, 16 x 3.2 Gb/s Transceivers
Virtex-DSP | Virtex-4 SX | 4VSX25 | 64 (2) | 4,608 (2) | 500 (2) | 128 | 27x27 | 160 | 2,304 | 23,040 | 120 x 1+ Gb/s LVDS pairs
Virtex-DSP | Virtex-4 SX | 4VSX35 | 96 (2) | 6,912 (2) | 500 (2) | 192 | 27x27 | 240 | 3,456 | 34,560 | 224 x 1+ Gb/s LVDS pairs
Virtex-DSP | Virtex-4 SX | 4VSX55 | 256 (2) | 11,520 (2) | 500 (2) | 512 | 27x27 | 384 | 5,760 | 55,296 | 360 x 1+ Gb/s LVDS pairs
(1) In Standard Speed Grade (2) In Fast Speed Grade
Transcript:
Here is the resulting XtremeDSP device portfolio. Only the devices labeled XtremeDSP are shown so it is not comprehensive. The LX and FX-type devices are not listed only for the sake of making this fit into one page so you can see where the new devices fit in capabilities and features.
This chart is also a little different than the typical family chart you are used to seeing. The top portion lists the DSP capabilities emphasizing the GMAC and memory bandwidth throughput. The bottom portion lists the primary DSP-centric care-abouts in the device feature set. As previously mentioned, the reference section displays a complete feature-list device table.
The key takeaways for Spartan-3A DSP are the DSP48As, the amount of Block RAM, and the performance in the standard speed grade, which is the lower-cost speed grade, in contrast to the typical model of quoting the highest numbers for the most costly device.
Xilinx DSP Courses
2 DSP Courses:
- DSP Implementation Techniques (3 day)
- DSP Design Flow (3 day)
To educate students on efficient DSP design in Xilinx FPGAs using the latest system level design tools
Over 500 Pages
For Further Details Contact: www.xilinx.com/education, (877) 959-2527
[Figure: three sample course slides (Xilinx Confidential; prepared by Niall Battson, Xilinx, 2004).
1. Embedded Control MACC FIR. Filter specification: sampling frequency = 3.84 MHz, coefficients = 117. Diagram labels: counter, input data memory (117 x 18), coefficients (117 x 18), control (117 x 10), DSP48 slice (opmode = 0100101). Filter size: 1 XDSP slice, 1 block RAM, 28 slices. Max sample rate = clock rate / number of taps. Callouts: control logic is reduced to only a counter; the address signal is fed back and its value must always point to the next required address; the load signal is the MSB of the coef/control memory space and the required latency is already in the signal; the A port is a concatenation of the coefficients, coefficient address, WE, CE and Load signals; care must be taken in setting up the memory, especially the initial values, so that the addressing kick-starts; more than 256 coefficients at greater than 9-bit data requires more block RAM, and the traditional control technique should be used.
2. Exploiting many slow data streams. Filter specification: sampling frequency = 1.2288 MHz, coefficients = 104, channels = 3. Diagram labels: inputs x1(n), x2(n), x3(n); outputs y1(n), y2(n), y3(n); coefficients (104 x 18); DSP48 slice (opmode = 0100101). Filter size: 1 XDSP slice, 1 block RAM, 116 slices (30 for control). Max sample rate = clock rate / (number of taps x number of channels). Callouts: time division multiplexing of the numerous channels is performed with a mux running C times faster than the input; the output mux would most likely take the form of a bank of capture registers that are enabled as appropriate; the control is a little trickier here as it has to accommodate the TDM data stream, and the embedded control can also be used; colours represent the different channels going through the filter, and each filter is processed one at a time instead of interleaved.
3. Four dimensions? Chart of sample rate (MHz) versus number of coefficients (N, log scale), with regions for Parallel FIR Filters (increasing number of multipliers), Transpose FIR, Systolic FIR (symmetric and non-symmetric), Semi-Parallel Dist Mem FIR, Semi-Parallel BRAM FIR, Dist Mem MACC FIR, Embedded Control MACC FIR and Normal Control MACC FIR. Callouts: time to put everything together; consider multi-channel and multi-rate FIR filters together; how do the boundary lines change as the interpolation or decimation rate increases?]
TI DaVinci™ technology based development tools enable fast time-to-market
• DM355: DM355 & DM335 Digital Video Evaluation Module, TMDSEVM355, $495
• DM6467: DM6467 Digital Video Evaluation Module, TMDSEVM6467, $1,995
• OMAP™ 3: OMAP 3 Digital Video Evaluation Module, TMDXEVM3503, $1,495
• DM643x: DM6437 Digital Video Development Platform, TMDSVDP6437, $495
• DM644x: DM644x Digital Video Evaluation Module, TMDSEVM6446, $2,495
• DM647/8: DM648 Digital Video Development Platform, TMDXDVP648, $1,295
eXpressDSP™ Licensable Software from TI
[Table: video / imaging codec availability by platform (C644x, DM355, OMAP35xx). Codecs listed: JPEG e/d, MPEG-2 MP d, MPEG-2 e, MPEG-4 SP/H.263 e, MPEG-4 SP/H.263 d, MPEG-4 ASP e/d, H.264 BP e, H.264 BP d, H.264 MP d, H.264 MP e, VC1 d, VC1 e. Status entries include ● and ○ availability markers, HW, HW 720p, BETA NOW or BETA 2Q with GA 4Q08, Planned, Planned 720p and Decode Planned.]
Legend:
● Available now
○ Available now (3P IP may be purchased/sub-licensed through ASP)
Yellow highlight: included in BASIC Bundle (by device platform)
GA = General availability
e = encode, d = decode, BP = Baseline Profile, SP = Simple Profile, MP = Main Profile, ASP = Advanced Simple Profile
All video/imaging codecs listed are up to D1 resolution unless otherwise indicated
• FREE EVALUATION provided for all TI software codecs
• Extensive, growing roadmap
• Cross-platform availability with API compatibility
• Complete listing of TI software inventory, including technical documentation, available on www.ti.com/digitalmediasoftware or www.ti.com/dms
• Integration Support must be contracted through a TI Authorized Software Provider: www.ti.com/asp
• BASIC BUNDLE (yellow highlighted items) AVAILABLE through eStore post production release (GA)
www.ti.com/dms
TI DM6437 User Guides
TMS320DM643x Video Processing Front End (VPFE) User's Guide
http://focus.ti.com/lit/ug/spru977/spru977.pdf
How to Use the VPBE and VPFE Driver on TMS320DM643x Devices
http://focus.ti.com/lit/an/spraap3a/spraap3a.pdf
TMS320DM643x Video Processing Back End (VPBE) User's Guide
http://focus.ti.com/lit/ug/spru952a/spru952a.pdf
Spectrum Digital DM6437 EVM Support Home (Revision E)
http://c6000.spectrumdigital.com/evmdm6437/reve/
TMS320DM643x DMP VLYNQ Port User's Guide
http://focus.ti.com/lit/ug/spru938b/spru938b.pdf
Complete technology offering
For any digital video product from capture to view
• Processors: Tuned for any video application
• Tools: Speed time to market
• Software: Optimized and ready to go
[Figure: DaVinci technology diagram. Signal chain: Capture, Process, Deliver, Receive, View. Application areas: Medical Imaging, Video Infrastructure, Portable Video, Video Security, Video Phones, Automotive Vision & Infotainment, Camera, IP Set-Top Box, Future Video Products.]
TI has and will continue to focus on, develop and promote a complete technology offering for all digital video applications from capture to display and viewing. This complete offering is based on the DaVinci Technology that combines processors, tools, software and system expertise (4 “pillars” of DaVinci) with support to enable innovation, ease of use and faster time to market.