
Avnet SpeedWay Workshops

1

Accelerating Your Success™

V10_1_1_2

Avnet Speedway Design Workshop™

Creating FPGA-based Co-Processors for DSPs Using Model Based Design Techniques

Lecture 3: Xilinx FPGA Meets TI DSP


2


Develop Executable Spec in Simulink

Partition Between DSP and FPGA Co-Processor

Model-Based Design Flow

Design Exploration for Targeting Hardware

Verify Hardware in HW Co-simulation

Implement Stand-Alone Video System

This is where we are in the model-based design flow.


3


The Problem We Wish to Solve

Partitioning an algorithm efficiently between DSP and FPGA can be a daunting task unless one has knowledge of each device’s architecture, capabilities and design tools.

We contrast the architectures of DSP and FPGA, offering guidelines to allocate different portions of the algorithm to each.

In order to leverage maximum efficiency from an FPGA co-processor, designers need:

- guidelines to identify the high computation-load sectors of video and image processing algorithms suitable for off-loading to co-processors
- a flexible design flow to explore partitioning between software and hardware, from verification to implementation


4


Agenda

• Xilinx FPGA Meets TI DSP

• Xilinx FPGA design for temporal template matching

• Hardware / software algorithm partitioning between TI DSP and Xilinx FPGA

Xilinx FPGA Meets TI DSP:

• We start with a quick review of the TI DM6437 DaVinci Digital Media Processor, followed by a basic intro to Xilinx FPGA architecture, in order to understand how the two can best work together.

Xilinx FPGA design for temporal template matching:

• Next we focus on techniques to best solve the problem at hand: temporal template matching for video using the FPGA.

We conclude with some guidelines for hardware / software algorithm partitioning between TI DSP and Xilinx FPGA


5


TI DSP Meets Xilinx FPGA

• Programmable DSPs - the classic answer to real-time signal processing

• FPGAs - increasingly used in real-time signal processing

• FPGAs complement DSPs for:

– System logic multiplexing
– New peripheral or bus interface implementation
– Performance acceleration in the signal processing chain

TI is the world’s leading manufacturer of DSP processors. Xilinx is the world’s leading manufacturer of FPGAs.

Let’s begin by examining the fundamental nature of DSPs and FPGAs. Then we will proceed to combine the 2 in a system.


6


DSP Processors Meet FPGAs

[Figure: DSP block diagram showing the DSP core, program / data storage, connectivity, peripherals, and the video processing subsystem. The core is annotated with the sum of products $\sum_{k=1}^{N} x[n-k] \times h[k]$.]

DSP processors and FPGAs (Field Programmable Gate Arrays) are fundamentally different yet complementary devices.

At the heart of the TI DSP processor is the core.

<mouse click>

It is surrounded by memory for program and data storage, a variety of peripherals (timers, PWM, etc.), connectivity interfaces (Ethernet, USB, SPI, etc.) and, in DaVinci for example, specialized subsystems for video in/out.

DSP characteristics:
• Sequential instruction execution
• Software programmable in a high-level language, e.g. C
• Ideal for complex algorithms
• Wide variety of fixed-function peripherals on-chip
• Rich ecosystem of authorized 3rd-party software providers across a vast range of applications: video, VoIP, surveillance, communications, consumer, etc.

At the heart of the Xilinx FPGA is the programmable hardware fabric, which provides the fundamental structures to build custom logic.

Other structural elements include block RAM, routing matrix, and clock management, including both a PLL and DLL. The IO block contains the structure for interfacing to external devices. There are many selectable IO standards the IO block can be configured to use. Some of those standards include LVCMOS, SSTL, HSTL, LVDS and many others not usually offered in DSP processors.

FPGA characteristics:
• Execute parallel computations in hardware
• Ideal for fast, high-performance custom functions
• Rich variety of resources on-chip
• In-system programmable
• Design in hardware-description language
• Rich ecosystem of 3rd-party IP providers across a vast range of applications: video, VoIP, surveillance, communications, consumer, etc.

We propose to unite TI DSP processors and Xilinx FPGAs into system solutions where each device performs what it does best.


7


TI Embedded Processors

[Figure: TI embedded processor roadmap, plotted as performance versus power. Categories shown: Microcontrollers (ultra-low power, general, and real-time control); ARM-based Applications Processors (low power, high performance GUI/browser apps); Digital Media Processors (video performance, ARM ease of use); Low Power DSPs (low power, low cost signal processing); High Performance DSPs (high MHz / multi-core signal processing). Devices named: MSP430, C2000, C5000, C6000 (C64x, C67x/C64x, multi-core), ARM9, Cortex-A8, OMAP3xx, OMAP-L1x, DM3xx, DM644x, DM646x. Callout: DM6437 DaVinci.]

TI offers a rich eco-system of embedded processors, many with combined DSP core + ARM.

DSP core = better at complex mathematics applications

- High Performance DSPs

- Low Power DSPs

ARM = better at advanced UI and system control (ARM9, Cortex-A8, etc.)

<mouse click>

The DSP processor that we focus on today is the DM6437, part of the DaVinci Digital Media processor family.


8


TI TMS320DM6437 Processor Architecture

Features

New C64x+™ Core
– C64x+™ core at up to 600 MHz

Memory
– 80 KB L1D, 32 KB L1P cache/SRAM
– 128 KB L2 cache/SRAM

Peripherals
– Video Port Sub-System (VPSS): input (CCDC), output (with DACs), resizer, OSD, and camera control
– Two EMIFs: DDR2-266 (32 bits, 133 MHz); EMIF 2.1
– 10/100 Ethernet MAC, MII or RMII; PCI 33 MHz; HPI; McASP
– VLYNQ™: serial interface to FPGAs
– UART (2), I2C, SPI, GPIO, PWM (3), CAN (HECC), 64-bit timers (2)

[Figure: DM6437 block diagram. DSP subsystem: C64x+ DSP 600 MHz core, 32 KB L1P, 80 KB L1D, 128 KB L2 cache. Video processing subsystem: front end (CCD controller video interface, preview, histogram/3A, resizer) and back end (on-screen display, video encoder with four 10-bit DACs). System: watchdog timer, PWM x3, 64-bit timer x2, PLLs, OSC, JTAG. Serial interfaces: UART x2, SPI, I2C, CAN, McASP, McBSP x2. Connectivity: EMAC, VLYNQ, PCI 33, HPI. Program/data storage: 32-bit DDR2 controller, 8-bit EMIF. The blocks are joined by the switch fabric and EDMA.]

This is the architecture of the DM6437 SOC. This device is just one of 7 DaVinci processors. This is the DSP on the Avnet Spartan3A-DSP DaVinci Evaluation Platform.

DaVinci offers an array of on-chip resources for video processing, notably:

• Improved video performance with a 50 percent cost reduction over previous DSP digital media processors

• Built-in DACs save about $2 to $4 on overall BOM cost

• VPSS offloads the DSP: up to 40% DSP off-load for DM6437 frees up to 240 MHz of processor capacity (40% of the 600 MHz core clock) for more features or higher quality

VPSS block       DSP off-load
Preview engine   Up to 15%
Resizer          Up to 10%
OSD              Up to 15%
Total            Up to 40% for DM6437

The shaded blocks represent functions supported by the Avnet Board Support Package for Simulink.

We draw your attention to the VPSS and the on-chip VLYNQ serial interface, both of which are featured in this seminar.


9


Presentation Flow

[Figure: tool flow. MathWorks: MATLAB® and Simulink® (algorithm and system design) feed two branches. TI branch: Real-Time Workshop Embedded Coder, Targets, Links generate C / ASM for Code Composer, verified through Link for CCS. Xilinx branch: Xilinx System Generator for DSP generates HDL for ISE, verified through hardware co-simulation. Both branches target the Avnet Spartan3A-DSP DaVinci Development Kit.]

Numbered topics on the diagram:

1. Introduce tool flow for DaVinci Digital Media Processor
2. Introduce DaVinci Digital Media Processor architecture
3. Introduce Xilinx FPGA architecture
4. Introduce Xilinx System Generator for DSP

On day 1, we saw how the TI design environment fits into the Model-Based Design flow for video and image processing. We also covered an introduction to TI DaVinci Digital Media Processors in the previous slides.

<mouse click>

We continue with a basic intro to Xilinx FPGA architecture in order to understand how DSP and FPGA can best work together


10


Xilinx FPGA Architecture

• Logic Fabric
 – Gates and flip-flops
• Embedded Blocks
 – Memory
 – DSP/Multipliers
 – Clock management
 – High speed serial I/O
 – Soft/Hard processors
• Programmable I/Os
• In-System Programmable
 – JTAG


11


Memory

• Block RAM
 – RAM or ROM
 – True dual port
   • Separate read and write ports
 – Independent port size
   • Data width translation
 – Excellent for video line buffers, FIFOs

[Figure: dual-port block RAM symbol with port A signals CLKA, ADDRA, DIA, DIPA, DOA, DOPA and port B signals CLKB, ADDRB, DIB, DIPB, DOB, DOPB]

Block RAM Configurations

Configuration   Depth   Data bits   Parity bits
16K x 1         16K     1           0
8K x 2          8K      2           0
4K x 4          4K      4           0
2K x 9          2K      8           1
1K x 18         1K      16          2
512 x 36        512     32          4

You will work extensively with these memory blocks in the labs.


12


Clock Management

• Digital Clock Managers (DCMs)
 – Clock de-skew
 – Phase shifting
 – Clock multiplication
 – Clock division
 – Frequency synthesis

[Figure: DCM symbol with input CLKIN and outputs CLK0, CLK90, CLKFX]


13


Programmable I/Os

• Single-ended
• Differential / LVDS
• Programmable I/O standards
 – Multiple I/O banks
• DDR I/O registers
• On-chip termination

Standard                 Output VCCO   Input VREF

Single-ended:
LVTTL                    3.3V          --
LVCMOS33                 3.3V          --
LVCMOS25                 2.5V          --
LVCMOS18                 1.8V          --
LVCMOS15                 1.5V          --
LVCMOS12                 1.2V          --
PCI 32/64 bit 33MHz      3.3V          --
SSTL2 Class I            2.5V          1.25V
SSTL2 Class II           2.5V          1.25V
SSTL18 Class I           1.8V          0.9V
HSTL Class I             1.5V          0.75V
HSTL Class III           1.5V          0.9V
HSTL18 Class I           1.8V          0.9V
HSTL18 Class II          1.8V          0.9V
HSTL18 Class III         1.8V          1.1V
GTL                      --            0.8V
GTL+                     --            1.0V

Differential:
LVDS2.5                  2.5V          --
Bus LVDS2.5              2.5V          --
Ultra LVDS2.5            2.5V          --
LVDS_ext2.5              2.5V          --
RSDS                     2.5V          --
LDT2.5                   2.5V          --

[Figure: I/O block diagram showing registered 3-state, output, and input paths with DDR muxes, connected to the PAD, and the I/O banks]

Out of this rich offering of IO standards,


14


XtremeDSP DSP48A Slice

• Integrated XtremeDSP Slice
 – Application optimized capacity
   • 3400A – 126 DSP48As
   • 1800A – 84 DSP48As
 – Integrated pre-adder optimized for filters
 – 40 opmodes
 – 250 MHz operation, standard speed grade
 – Compatible with Virtex-DSP
• High-performance and flexibility as computation engine for DSP and video

Transcript:

The DSP48A is an optimized implementation for the Spartan-class devices. A key new feature is the addition of a pre-adder, which is used in symmetric filters, one of the most common implementations in the target markets. In the DSP48 and DSP48E implementations, the pre-adder is implemented using FPGA logic resources. Including it in the DSP48A reduces logic utilization, increases performance and lowers power. The DSP48A operates at 250 MHz in the -4 standard speed grade. In the Spartan-DSP domain, we will always emphasize the standard speed grade parts when we discuss performance, as this is the lower-cost path that most customers will want to pursue.

The new DSP48A is most closely related to the DSP48 in Virtex-4. There is migration capability between the 3 implementations, especially if the FFT and FIR compilers are used. In the Reference section, more detail is provided, including a summary table. The customer presentation also provides more details.

The other main new feature is the expansion of the amount of BRAM, roughly 2X the ratio of BRAM to logic compared to other Spartan-3 generation devices. The number of BRAMs is matched to the number of DSP48s, as is done in Virtex-DSP. The BRAM has also been enhanced to achieve about a 25% speed increase over Spartan-3A. There are of course additional benefits to having more and higher-performance BRAM in Spartan-3 class devices, for example in application areas such as embedded processing, where MicroBlaze and the soft embedded IP can take advantage of the additional memory.


15


Spartan-3A/3AN/3ADSP Family

Device                 50A/N   200A/N   400A/N   700A/N   1400A/N   1800AD      3400AD
Gates                  50K     200K     400K     700K     1.4M      1.8M        3.4M
Logic Cells            1,584   4,032    8,064    13,248   25,344    37,440      53,712
Block RAM bits         54K     288K     360K     360K     576K      1512K       2268K
Distributed RAM bits   11K     28K      56K      92K      176K      260K        373K
18x18 Multipliers      3       16       20       20       32        84 DSP48A   126 DSP48A
DCMs                   2       4        4        8        8         8           8
Maximum I/O            144     248      311      372      502       519         469

The chart shows the various members of the Spartan-3A family – including Spartan-3A, Spartan-3AN, and Spartan-3A DSP.

<mouse click> The focus for today is the 1800 version of the Spartan-3A DSP since that is the device we’ll use on the hardware today.


16


From Sequential to Full Parallel Processing

[Figure: two implementations of a 256-tap FIR filter. Sequential implementation: Data In and Coefficients feed a single MACC unit (multiplier, adder, register) that produces Data Out; 256 clock cycles needed; 500 MHz / 256 clock cycles = 2 MSPS. Fully parallel implementation: a register chain on Data In feeds 256 multipliers with coefficients C0 ... C255, followed by a registered adder chain that produces Data Out; 500 MHz / 1 clock cycle = 500 MSPS.]

• FPGAs can deploy hardware resources to suit the task

• Lowest resource usage • Highest performance

Xilinx FPGAs can implement a wide range of DSP functions, with the flexibility to deploy the right mix of hardware resources appropriate for the task at hand.

FIR filters are used extensively in DSP and will serve here as the basis for comparison of general inner-loop computation structures. A FIR is a sum of products of coefficients with a time-skew buffer, or pipeline, of samples. The same design considerations apply to all inner-loop type computations: FIR and IIR filters, correlators, moving average, SAD.

Shown here are 2 implementations of the same 256-tap FIR filter, both of which can be implemented in a Xilinx FPGA:

• Using a single time-shared MAC, each output sample takes 256 clock cycles. Even clocking at 1 GHz would yield only around 4 MSPS.

• However, what if you had 256 of those MAC structures in one device? Now you can get a filter result every clock cycle. Running the clock at 400 MHz yields a 400 MSPS sample rate, 100 times faster than the sequential MAC can achieve! This is the power of parallelism in Xilinx FPGAs, which integrate large numbers of these DSP resources to achieve extremely high performance.

These compute-intensive repetitive inner-loop computations will be prime candidates to off-load from DSP to FPGA when partitioning an algorithm.
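The trade-off between the two structures can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `fir_sample_rate_msps` and the intermediate 4-MACC case are assumptions added here, not part of the workshop material. It uses the 500 MHz clock shown on the slide.

```python
def fir_sample_rate_msps(clock_mhz, taps, macc_units):
    """Output sample rate in MSPS when `taps` multiply-accumulates per
    sample are shared across `macc_units` units clocked at `clock_mhz`."""
    cycles_per_sample = -(-taps // macc_units)  # ceiling division
    return clock_mhz / cycles_per_sample

# The two cases on the slide: 256-tap FIR at 500 MHz
sequential = fir_sample_rate_msps(500.0, 256, 1)     # one time-shared MACC
parallel = fir_sample_rate_msps(500.0, 256, 256)     # one MACC per tap

# A hypothetical middle ground: 4 MACC units, 64 cycles per sample
semi_parallel = fir_sample_rate_msps(500.0, 256, 4)

print(f"sequential:    {sequential:.2f} MSPS")
print(f"semi-parallel: {semi_parallel:.2f} MSPS")
print(f"parallel:      {parallel:.2f} MSPS")
```

The middle case is the point of the slide's first bullet: the number of hardware units can be chosen anywhere between 1 and N to match the sample rate the task actually needs.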


17


MACC-based FIR Using DSP48A

[Figure: MACC datapath inside a DSP48A. x[n-k] and b_k feed the multiplier (Multiply); the adder with its Z^-1 feedback register (Accumulate) produces y[n].]

$$y[n] = \sum_{k=0}^{N-1} x[n-k]\, b_k$$

• Implement in single DSP48A overclocked at fclock = fs x N
• N = length of FIR filter, or number of coefficients
• fs = filter throughput, in Mega-samples / second
• fclock = computation clock rate (maximum 250 MHz in Spartan3A-DSP, lowest speed grade)

• Ex. 25-tap FIR with fclock = 250 MHz achieves 10 Mega-samples/second

• Most efficient use of hardware

Let’s focus on the MACC-based FIR in more detail. A single DSP48A can implement a FIR by summing each term sequentially using a single multiplier-accumulator, or ‘MACC’, to produce a result after N clock cycles, where N = filter length. In contrast to a fully parallel implementation, a serial implementation time-shares a single MACC.

[mouse click and pause for effect]

It reduces hardware by a factor of N compared to parallel structures, but also reduces filter sampling rate throughput by the same factor : fs = fclock / N. Consequently, the MACC FIR is the optimal structure at lower sampling rates.
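As a behavioral illustration of the serial structure, here is a minimal software sketch. It is not the DSP48A implementation and not System Generator code; the function names and the test data are assumptions made for this example. It performs one multiply-accumulate per modeled clock cycle and counts the cycles, so the relation fs = fclock / N can be read off directly.

```python
def macc_fir(samples, coeffs):
    """FIR computed the way a single time-shared MACC would compute it:
    one multiply-accumulate per clock cycle, N cycles per output sample."""
    n_taps = len(coeffs)
    delay_line = [0] * n_taps            # x[n], x[n-1], ..., x[n-N+1]
    outputs = []
    clock_cycles = 0
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        acc = 0
        for k in range(n_taps):          # N sequential MACC operations
            acc += delay_line[k] * coeffs[k]
            clock_cycles += 1
        outputs.append(acc)
    return outputs, clock_cycles

def macc_fir_throughput_msps(fclock_mhz, n_taps):
    """fs = fclock / N for the serial MACC structure."""
    return fclock_mhz / n_taps

# An impulse input returns the coefficients, one per output sample
y, cycles = macc_fir([1, 0, 0, 0], [3, 2, 1])
print(y, cycles)

# The slide's example: 25 taps at 250 MHz
print(macc_fir_throughput_msps(250.0, 25), "MSPS")
```

Four input samples through a 3-tap filter cost 12 modeled cycles, which is the factor-of-N throughput penalty the notes describe.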

A comprehensive tutorial on usage of DSP48 to implement FIR filters over a wide range of filter length and desired throughput is listed in the reference section.

--------------------------------------------------------------------------

Supplementary notes:

To build an accumulator the output of the adder is registered by the flip-flop in the slice, to capture the result at node P. This result is then routed back round to the adder. Hence, each clock cycle a new input will be presented to the ‘C’ input and added to the result calculated from the previous clock cycle.

The key message is that Xilinx FPGAs offer a lot of flexibility to implement DSP functions using DSP48, tailored to the needs of the computation task at hand.


18


Presentation Flow

[Figure: tool flow diagram, as on the earlier Presentation Flow slide. MathWorks: MATLAB® and Simulink® (algorithm and system design). TI branch: Real-Time Workshop Embedded Coder, Targets, Links; C / ASM; Code Composer; Link for CCS. Xilinx branch: Xilinx System Generator for DSP; HDL; ISE; hardware co-simulation. Target: Avnet Spartan3A-DSP DaVinci Development Kit.]

Numbered topics on the diagram:

1. Introduce tool flow for DaVinci Digital Media Processor
2. Introduce DaVinci Digital Media Processor architecture
3. Introduce Xilinx FPGA architecture
4. Introduce Xilinx System Generator for DSP

We continue with an overview of Xilinx System Generator for DSP


19


System Generator for DSP

• System Generator enables the use of Simulink for FPGA design
 – Design DSP applications in FPGAs without hardware design experience

• Designs are constructed using a Xilinx provided blockset

• FPGA Implementation files, optimized for Xilinx devices, are automatically generated

System Generator is a DSP design tool from Xilinx that enables the use of Simulink, the MathWorks model-based design environment, for FPGA design. Previous experience with Xilinx FPGAs or RTL design methodologies is not required when using System Generator. Designs are captured in the DSP-friendly Simulink modeling environment using a Xilinx-specific blockset. All of the downstream FPGA implementation steps, including synthesis and place and route, are automatically performed to generate an FPGA programming file.

->->->-> slides 23 .. 26 for reference only / transform into a 5 min. demo


20


The Xilinx DSP Blockset

• Over 90 DSP building blocks available
• Abstracts away the details of the FPGA hardware architecture
• Enables design migration between technologies
• Leverages Xilinx IP to deliver high quality of results

Over 90 DSP building blocks are provided in the Xilinx DSP blockset for Simulink. These blocks include the common DSP building blocks such as adders, multipliers and registers. Also included are a set of complex DSP building blocks such as forward error correction blocks, FFTs, filters and memories. These blocks leverage the Xilinx IP core generators to deliver optimized results for the selected device.


21


FIR Filter Generation

• Automatically generated performance optimized FIR filters
 – Takes full advantage of the Virtex-4 DSP48 blocks to achieve 500 MHz performance
 – Supports multi-rate, oversampled, multi-channel and coefficient optimization

• MathWorks FDA Tool integration provides graphical filter design and coefficient generation

FIR Compiler

FDA Tool

System Generator includes a FIR Compiler block that targets the dedicated DSP48 hardware resources in Virtex-4 and Virtex-5 devices to create highly optimized implementations that can run in excess of 500 MHz. Configuration options allow generation of direct, polyphase decimation, polyphase interpolation and oversampled implementations. Standard MATLAB functions such as fir2, or the MathWorks FDATool, can be used to create coefficients for the Xilinx FIR Compiler.


22


• Combine System Generator with RTL blocks in ISE’s Project Navigator to form complete systems

• Supports multiple instantiations of System Generator designs as sub-blocks

• Manage constraints of multiple System Generator designs

System Generator / Project Navigator Integration

Allows persistence of ISE place and route settings between design iterations.

->->->-> update screen shots to our model


23


Agenda




•We now focus on techniques to best solve the problem at hand, temporal template matching for video using Xilinx FPGAs.


24


Block Matching for Video & Imaging

• Integral part of most motion-compensated video coding standards, e.g. MPEG-1, MPEG-2, H.264

• Video stabilization, video analytics, target tracking
• Find the best match for a selected block (‘template’) in current frame
• Calculate motion vector between ‘template’ block location in previous frame and its counterpart in current frame

• Similarity measure for best match:
 – Mean Absolute Error (MAE)
 – Mean Square Error (MSE)
 – Sum of the Absolute Difference (SAD)

Current frame

Previous frame

Motion Vector

.

Block matching can employ various algorithms.

<Mouse click>

SAD will be used in our work throughout this seminar.


25


Exhaustive Search SAD Block Matching

Template pixel array = T(i,j)

$$SAD(x,y) = \sum_{i=1}^{Template\_height} \sum_{j=1}^{Template\_width} |T(i,j) - I(i,j)|$$

where I(i,j) is the image pixel beneath template pixel T(i,j) when the template is positioned at (x,y).

Starting at top-left corner (0,0), sweep template across the region of interest (ROI).

[Figure: template swept across the ROI, shown at positions (50, 50), (100, 75), and (200, 100): SAD(50,50) > 0; SAD(100,75) approaching 0; SAD(200,100) = 0]

Exhaustive search = SAD calculated at each and every pixel location in ROI by displacing template by 1 pixel at a time.

The lower the SAD result, the better the match between the template and the pixel region beneath it. SAD = 0 indicates a perfect match.

It is obvious that an exhaustive search is very compute intensive, but will produce the best block-matching performance.
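To make the cost of an exhaustive search concrete, here is a minimal software sketch of the search described above. It is a reference model only, not the FPGA implementation. The function names, the tiny 6 x 5 test image, and the convention that x is the column and y is the row (both 0-based) are assumptions made for this example.

```python
def sad(template, image, x, y):
    """SAD between the template and the image region whose
    top-left corner is at column x, row y."""
    return sum(
        abs(template[i][j] - image[y + i][x + j])
        for i in range(len(template))
        for j in range(len(template[0]))
    )

def exhaustive_search(template, image):
    """Evaluate SAD at every valid displacement, moving the template
    1 pixel at a time; return (best_sad, x, y)."""
    t_h, t_w = len(template), len(template[0])
    i_h, i_w = len(image), len(image[0])
    best = None
    for y in range(i_h - t_h + 1):
        for x in range(i_w - t_w + 1):
            score = sad(template, image, x, y)
            if best is None or score < best[0]:
                best = (score, x, y)
    return best

# Full-black image with the template pasted at x=3, y=2
template = [[9, 7], [5, 3]]
image = [[0] * 6 for _ in range(5)]
for i in range(2):
    for j in range(2):
        image[2 + i][3 + j] = template[i][j]

best = exhaustive_search(template, image)
print(best)   # SAD of 0 marks the perfect match
```

Each displacement costs Template_height x Template_width absolute differences, and there is one displacement per pixel location in the ROI, which is why the search is such a heavy inner loop.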


26


SAD in Xilinx System Generator for DSP

[Figure: input image, 72 x 54 pixels, full black with template at position (20,20)]

[Figure: SAD calculated by MATLAB at start of simulation; best match at (20,20)]

[Figure: SAD generated in FPGA hardware (System Generator)]

Let’s illustrate the practical use of Simulink and System Generator in a video design flow for pattern matching using sum-of-absolute differences (SAD) targeting Xilinx Spartan3A-DSP.

The model contains a testbench consisting of a synthetic input image, 72 x 54 pixels, full black except for a 22 x 18 template inserted at location (20,20). At the start of simulation, a callback function reads the input image into a MATLAB workspace array named Test_Image_IN, which is presented as stimuli to the model. Simulation time is set to 54 x 72 + 3 = 3891. (The extra 3 flush out the initial pipeline latency.)

->->->-> explain data vector is row-major as set up in init script

[1st mouse click]

The MATLAB callback function at start of simulation also calculates SAD between the template and each pixel in the input image. In the SAD calculation, the best match of a template within an image frame is the minimum value, or the darkest region on the SAD plot. As expected, the darkest region in the calculated SAD corresponds to the upper-left corner of the template.

[2nd mouse click]

During simulation, SAD between the template and each pixel in the input image is calculated in hardware. Results are displayed at the end of simulation and compared against the MATLAB-calculated SAD. Note that the MATLAB-calculated SAD image is identical to the hardware-generated image. This confirms proper operation of the hardware SAD function.

Note: the black boundary along the bottom and right side of the SAD plots represents the limits of displacement of the template inside the image. Practical pattern-matching algorithms usually limit the search to a sub-region of interest within the whole frame.


27


SAD in Xilinx System Generator for DSP

• How do video pixels move through FPGA memory?

• How are pixels presented to the SAD computation engine?

[Figure: 22 x 18 template at position (50, 50) in the video frame]

Let’s continue with an illustration of how the SAD algorithm is executed in FPGA hardware. We start by focusing on how the image is stored in FPGA memory and presented to the SAD computation engine.


28


An Efficient Line Buffer Strategy

Transposition: incoming pixel rows become columns*

* In hardware this is just clever indexing


29


SAD / Image Management in FPGA

[Figure: video frame (H x V) beside the transposed line buffer (ROI_width wide, Template_height = 18 high) and the Addressable Shift Register (ASR, Template_height high). Labels: current pixel, expired pixel, "Read 1 row + current pixel", "Shift in", shift direction.]

When slide appears, you point to Lena, top-left and say:

'Pixels don't all just appear simultaneously in a frame of video. Rather, pixels are synchronized by horizontal and vertical sync signals. Each video line is displayed by pixels arriving from left <you point to the current pixel as the white outlined square on bridge of her nose> <mouse-click> to right.

In order to send the video to the SAD engine, we store pixels in a line buffer in transposed fashion, meaning a horizontal line of pixels is stored vertically in the transposed line buffer like this <point to white outlined square that appeared in the transposed line buffer> …

… <mouse-click> <holding your breath to show that you have yet to finish the thought, you point to next pixel as the white outlined square moving toward Lena's right eye, then point to white outlined square as it is stored in the transposed line buffer>

… <mouse-click> <point to next pixel as the white outlined square under Lena's right eye, then point to white outlined square as it is stored in the transposed line buffer>

Each location in the line buffer memory holds a column of the video image of height template_height. The dual port RAM stores as many lines of video as the height of the template.

Under control of an address counter at the pixel rate, as each new pixel arrives:
- read out pixel column of Template_height from port A of the transposed line buffer
- slice off and discard top-most pixel from oldest video line (leftmost column of the transposed line buffer; this is not shown, for clarity)
- concatenate current pixel as LSByte; result is new pixel column
- shift this new pixel column into addressable shift register (ASR) for SAD engine
- store the new pixel column into port B at next pixel period

< Mouse click > Line buffer and Addressable Shift Register (ASR) are available as parameterizable blocks in System Generator.


30


Addressable Shift Register


31


SAD / Image Management in FPGA

Under control of an address counter at the pixel rate, as each new pixel arrives:
• read out pixel column of Template_height from port A of the transposed line buffer (dual port RAM)
• slice off and discard top-most pixel from oldest video line
• concatenate current pixel as LSByte; result is new pixel column
• shift this new pixel column into addressable shift register (ASR) for SAD engine
• store the new pixel column into port B at next pixel period

[Figure: datapath from Video IN through the transposed line buffer to the SAD engine, with Template_width annotations]

Video data is Y (luminance), 8-bit pixels. The dual port RAM stores as many lines of video as the height of the template, in this case Template_height = 18. These 18 pixels form words of 18 x 8 = 144 bits. The notation UFix refers to ‘unsigned fixed-point’.

Dual port RAM block is mapped to one or more BRAM elements when the design is netlisted.
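The column update can be modeled in software with ordinary integers standing in for the 144-bit words. This is a behavioral sketch under stated assumptions, not the System Generator design: `push_pixel`, `unpack`, the 4-pixel ROI width, and the test pattern are hypothetical names and values chosen for illustration, and the one-pixel-period delay on the port B write is ignored.

```python
TEMPLATE_HEIGHT = 18
PIXEL_BITS = 8
WORD_BITS = TEMPLATE_HEIGHT * PIXEL_BITS     # 18 x 8 = 144 bits
WORD_MASK = (1 << WORD_BITS) - 1

def push_pixel(line_buffer, address, pixel):
    """Update one column word of the transposed line buffer.

    Reads the stored column, discards the top-most (oldest) pixel by
    shifting it out of the most significant byte, and concatenates the
    current pixel as the least significant byte. Returns the new column,
    which would also be shifted into the ASR.
    """
    old_column = line_buffer[address]                        # port A read
    new_column = ((old_column << PIXEL_BITS) | pixel) & WORD_MASK
    line_buffer[address] = new_column                        # port B write
    return new_column

def unpack(column):
    """Split a column word into its pixels, oldest video line first."""
    return [(column >> (PIXEL_BITS * k)) & 0xFF
            for k in reversed(range(TEMPLATE_HEIGHT))]

roi_width = 4                          # hypothetical, tiny for illustration
line_buffer = [0] * roi_width
for line in range(20):                 # stream 20 video lines
    for x in range(roi_width):         # address counter at the pixel rate
        column = push_pixel(line_buffer, x, (line * 10 + x) & 0xFF)

print(unpack(line_buffer[0]))          # pixels from the last 18 lines at x = 0
```

After 20 lines have streamed through, each word holds only the most recent 18 lines at its column position; the two oldest lines have been shifted out, which is the "expired pixel" behavior on the slide.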

Note the flexibility of System Generator in managing video data.


32


ACC-based Sum of Absolute Difference (SAD)

$$SAD(x,y) = \sum_{i=1}^{Template\_height} \sum_{j=1}^{Template\_width} |T(i,j) - I(i,j)|$$

• Each line of block-match in single DSP48A overclocked at fclock = fs x Template_width
• Template_width = width of template, in pixels
• fs = pixel rate, in Mega-pixels / second
• fclock = computation clock rate (maximum 250 MHz in Spartan3A-DSP, lowest speed grade)

[Figure: one line of SAD in a DSP48A. The template pixel and image pixel are subtracted and passed through abs; the result drives the C input of the DSP48A, whose adder and Z^-1 register accumulate into P. A sequencer on the opmode port selects P = C or P = P + C. Output: 1 line of SAD, $\sum_{j=1}^{Template\_width} |T(1,j) - I(1,j)|$.]

In a similar fashion to the MACC FIR, a single DSP48A in accumulator mode can implement one line of pattern matching SAD by summing each term sequentially to produce a result after template_width clock cycles, where template_width = width of the template, in pixels.

[mouse click and pause for effect]

Note the sequencer controlling the opmode port of the DSP48A. Over 40 user-controlled operating modes (opmodes) can dynamically adapt XtremeDSP slice functions from clock cycle to clock cycle, optimizing performance through resource sharing to create custom sequential computation engines. Each XtremeDSP slice is individually controllable.

For an accumulator function, select the C input on the 1st of N clock cycles to initialize the accumulator, then select the feedback path from the registered output P to sum with the C input on the remaining N-1 clock cycles.
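The opmode sequencing can be illustrated with a small behavioral model. This is a sketch only: the opmode labels are written as the equations shown on the slide rather than as actual DSP48A opmode bit patterns, and `accumulate_line` and the sample pixel values are hypothetical.

```python
OPMODE_LOAD = "P = C"            # 1st cycle: initialize the accumulator
OPMODE_ACCUMULATE = "P = P + C"  # remaining N-1 cycles: add C to registered P

def accumulate_line(c_inputs, stale_p=999):
    """Model one DSP48A in accumulator mode over len(c_inputs) cycles.

    `stale_p` stands for whatever the P register held from the previous
    line; selecting the C input on the 1st cycle overwrites it, so no
    separate clear cycle is needed.
    """
    p = stale_p
    trace = []
    for cycle, c in enumerate(c_inputs):
        opmode = OPMODE_LOAD if cycle == 0 else OPMODE_ACCUMULATE
        p = c if opmode == OPMODE_LOAD else p + c
        trace.append((opmode, p))
    return p, trace

# Absolute differences for one template line, as presented to the C input
template_line = [10, 20, 30, 40]
image_line = [12, 18, 30, 45]
abs_diffs = [abs(t - i) for t, i in zip(template_line, image_line)]

line_sad, trace = accumulate_line(abs_diffs)
for opmode, p in trace:
    print(f"{opmode:10s} -> P = {p}")
```

Because the first cycle loads rather than adds, consecutive lines can be accumulated back to back with no idle cycle between them.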

Note that using the opmode in a dynamic fashion has no impact on the performance due to the registered opmode input. If all input and output registers are used, no matter the operation, DSP48A can achieve 250 MHz. In Virtex-5, DSP48E can achieve 550 MHz.

Note:

The absolute value function of the difference between template and pixel value is not implemented in the adder of the DSP48A. Rather, it is implemented in fabric using the sign bit of (template – pixel) to select +/-. The reason is that DSP48A doesn’t support opmodes P = P + C + Cin, or P = P - C – Cin. However, these are supported in DSP48 (Virtex-4) and DSP48E (Virtex-5). In these families, the sign bit could be used to select the appropriate opmode to pull the absolute value function into DSP48 or DSP48E, thereby reducing overall resource requirements.


33


SAD Computation Using DSP48A

• Combined sequential and parallel processing to suit task

[Figure: Video IN feeds the transposed line buffer, whose output (Template_width) is overclocked x Template_width and feeds a bank of Template_height DSP48As. Each DSP48A stage subtracts the image pixel from the template pixel, takes abs, and accumulates through a Z^-1 register; together they produce the SAD.]

There are as many DSP48A SAD engines as the height in lines of the template. An extra adder block (not shown for clarity) sums all DSP48A SAD engines for the final SAD value at the current pixel location.

Note the upsampling block prior to the SAD block. It serves to clock each DSP48A at 22X the pixel rate, or __ MHz x 22 = ___ MHz. It is good design practice to run the DSP48A close to its maximum speed in order to make best use of the resource.

Optional / explanatory notes:

Recall the sequencer controlling the opmode port of the DSP48A to create the accumulator function for the sum of absolute differences. The C input on the 1st of N clock cycles initializes the accumulator, then the feedback path from the registered output P sums with the C input on the remaining N-1 clock cycles. The output of the DSP48A at the final clock cycle is the SAD of one line of the template with incoming video pixels in the line buffer.

This SAD engine can compute an exhaustive search of the template at each pixel location in the frame in real time at the video pixel rate. In practice, the SAD calculations are usually confined to a sub-region of interest in the frame by the search algorithm.
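The combined structure, parallel across template lines and sequential within each line, can be sketched as follows. This is a software reference model under stated assumptions, not the lab design: the function names and the 2 x 2 test data are hypothetical. The second function only rearranges fclock = fs x Template_width to show the upper bound on pixel rate implied by the 250 MHz limit and the 22-pixel template width.

```python
def sad_engine(template, window):
    """One accumulator per template line (parallel across lines), each
    summing its line sequentially; a final adder combines the line sums."""
    line_sums = []
    for t_line, w_line in zip(template, window):   # Template_height engines
        acc = 0
        for t, w in zip(t_line, w_line):           # Template_width cycles
            acc += abs(t - w)
        line_sums.append(acc)
    return sum(line_sums), line_sums

def max_pixel_rate_mhz(fclock_max_mhz, template_width):
    """fs = fclock / Template_width."""
    return fclock_max_mhz / template_width

total, per_line = sad_engine([[1, 2], [3, 4]], [[1, 5], [0, 4]])
print(total, per_line)

# Upper bound on pixel rate for a 22-pixel-wide template at 250 MHz
limit = max_pixel_rate_mhz(250.0, 22)
print(f"{limit:.2f} Mpixels/s")
```

A wider template lowers this bound proportionally, since each DSP48A needs Template_width clock cycles per pixel period.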

You will build and implement the SAD in Xilinx System Generator during lab 3.


34


Agenda




We now present some guidelines for hardware / software algorithm partitioning between TI DSP and Xilinx FPGA


35


System Execution Control

Typical DSP applications:
– 20% of program code consumes 80% of required MIPS
– 20% of program code requires time-consuming, difficult-to-maintain assembly coding to increase system performance
– Challenge to reduce the processing load in 20% of software and manage the complexity of the remaining 80% of the code

[Diagram: DSP Algorithm MIPs Budget. Data IN → Computation → Data OUT, with the computation split into two parts: 80% of code using 20% of MIPS, and 20% of code using 80% of MIPS.]

In typical DSP applications, 20% of the program code consumes 80% of the required MIPS. This 20% of the program code often requires time-consuming, error-prone, and difficult-to-maintain assembly coding to increase overall system performance. This code also becomes far less portable than the remaining 80% of the code that focuses on initialization and system execution control. At the same time, that 80% of the code reflects the majority of the system’s complexity. This utilization outcome creates a double challenge for DSP software engineers. They must reduce the processing load in 20% of the software and manage the complexity of the remaining 80% of the code.

FPGA co-processing is well suited to addressing the 80% processing load caused by 20% of the algorithm code. The challenge is to identify what the DSP should off-load to a coprocessor.

Note: scheduling also becomes a factor as the MIPs load increases


36


Profiling real-time code execution on DSP

Code Profiling with MathWorks Tools

• Uses DSP/BIOS statistics to measure execution time

• Identify segments of generated code to off-load to FPGA co-processor

The code profiler in Target Support Package TC6 uses DSP/BIOS statistics objects to measure the execution time of code segments generated by individual subsystems. A code profile report helps you identify segments of generated code that are candidates for off-loading to an FPGA co-processor.

In depth technical information on code profiling is available at the following:

http://www.mathworks.com/access/helpdesk/help/toolbox/tic6000/index.html?/access/helpdesk/help/toolbox/tic6000/f8-7016.html


37


Code Profiling in Target Support Package TC6

• profiling report by subsystem of the Simulink model

Note the correspondence between the code profiling report and relevant sections of the Simulink model.

Observe the motion estimation sum-of-absolute differences takes 29.79 ms, which is (29.79/45.82) = 65% of the CPU load. Recall from the lecture that SAD is a compute-intensive inner-loop type calculation. Based on these numbers, we choose to partition the video stabilization model by off-loading the SAD-based motion estimation to an FPGA co-processor.
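The arithmetic behind this partitioning decision can be sketched as follows. Only the two timings quoted above come from the profile report; the helper name `cpu_share` is hypothetical, and the "remaining" figure assumes the SAD is fully removed from the DSP and ignores any transfer or handshaking overhead.

```python
# Figures quoted from the profile report in the notes above.
SAD_MS = 29.79    # motion estimation (SAD) execution time
TOTAL_MS = 45.82  # total profiled execution time


def cpu_share(segment_ms, total_ms):
    """Fraction of the profiled time spent in one code segment."""
    return segment_ms / total_ms


sad_share = cpu_share(SAD_MS, TOTAL_MS)
# DSP time left if the SAD is fully off-loaded (overheads ignored).
remaining_ms = TOTAL_MS - SAD_MS

print(f"SAD share of CPU load: {sad_share:.0%}")
print(f"DSP time without SAD: {remaining_ms:.2f} ms")
```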



38



Guidelines for off-load to FPGA Co-Processor

– Parallel execution
– High computation rates in fixed-point math
– Repetitive calculations
– Nested inner loops
  • ex. FFT, MAC FIR, moving average, correlation, SAD
– Fast access to deeply pipelined time-skew buffers
– Wide data words
– Custom peripherals

DSP Core
• Hi-level decision-making

Video processing Subsystem
– frame buffer
– Scaler
– OSD
– Histogram
– CCD controller

Selecting portions of an algorithm suitable for off-load to FPGA co-processor.

High-level decision-making should remain in DSP to take advantage of excellent quality of results from MathWorks TC6 code-generation tools for DM6437. Additionally, DaVinci has an on-board video processing subsystem optimized for video operations including frame buffer in DDR2, fractional image scaling, OSD, histogram, CCD controller. Consequently, these functions can remain on the DSP-side to utilize these resources.

Handling wide data words is a distinct advantage of the FPGA. For example, the transposed line buffer in the FPGA SAD engine uses a dual-port RAM to store words of (template_height x 8) bits = 144 bits. This is impractical for a fixed-width DSP processor.
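As a rough illustration of the wide-word idea, the sketch below packs one column of 8-bit pixels into a single 144-bit word and unpacks it again. The template height of 18 is inferred from 144 / 8, and the bit ordering (pixel 0 in the least significant byte) and the function names are assumptions made for this example only.

```python
TEMPLATE_HEIGHT = 18  # inferred from 144 bits / 8 bits per pixel
PIXEL_BITS = 8


def pack_column(pixels):
    """Pack one column of 8-bit pixels into one wide word (pixel 0 in the LSBs)."""
    word = 0
    for i, p in enumerate(pixels):
        if not 0 <= p < (1 << PIXEL_BITS):
            raise ValueError("pixel out of 8-bit range")
        word |= p << (i * PIXEL_BITS)
    return word


def unpack_column(word, count=TEMPLATE_HEIGHT):
    """Recover the individual pixels from a packed word."""
    mask = (1 << PIXEL_BITS) - 1
    return [(word >> (i * PIXEL_BITS)) & mask for i in range(count)]


column = list(range(100, 100 + TEMPLATE_HEIGHT))  # hypothetical pixel values
word = pack_column(column)
word_bits = TEMPLATE_HEIGHT * PIXEL_BITS
print(word_bits, unpack_column(word) == column)
```

In the FPGA a single RAM read delivers all 18 pixels at once, whereas a fixed-width processor would need several loads plus shifting and masking to do the same.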

Ideally, the ratio of computation required to data crossing the interface between DSP and FPGA should be high, as it is for an FFT. In the current context, motion estimation is an excellent candidate for the FPGA: very little data needs to cross the interface between DSP and FPGA, while template matching using SAD in the FPGA is highly compute-intensive.



39



• subset of algorithms self-contained and tightly grouped
• occupy significant processing load

[Diagram: algorithm blocks A, B, C, D and E. The total processing load is shown as A, B + C + D, and E.]

Optimal candidates for off-load to the FPGA co-processor will occupy a significant portion of the total processing load in the stand-alone DSP.

Model-based design in Simulink is particularly efficient at revealing such candidates because it inherently structures the algorithm into hierarchical subsystems which are clearly reflected in the profile report.


40



• Computation bound, not IO bound
• Data transfer time < compute time

[Diagram: algorithm blocks A and E, with the group B, C, D shown three times.]

Data transfer time must not overwhelm computation time, otherwise gain from faster computation time is lost.
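A minimal sketch of this check is shown below. The function names and all timing values are hypothetical, and "compute time" is taken here to mean the co-processor's compute time, which is one possible reading of the slide.

```python
def is_compute_bound(fpga_compute_ms, transfer_ms):
    """Slide criterion: data transfer time < compute time."""
    return transfer_ms < fpga_compute_ms


def offload_is_worthwhile(dsp_compute_ms, fpga_compute_ms, transfer_ms):
    """True when transfer plus FPGA compute beats computing on the DSP alone."""
    return transfer_ms + fpga_compute_ms < dsp_compute_ms


# Hypothetical timings in milliseconds, for illustration only.
good = offload_is_worthwhile(dsp_compute_ms=30.0, fpga_compute_ms=2.0, transfer_ms=1.0)
bad = offload_is_worthwhile(dsp_compute_ms=30.0, fpga_compute_ms=2.0, transfer_ms=40.0)
print(good, bad)
```

In the second case the FPGA computes just as fast, but the transfer alone takes longer than the original DSP computation, so the gain is lost.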


41



[Diagram: algorithm blocks A, B, C, D and E.]

• no processor dependency in the calculation while coprocessor operates on the data

• Ex. FFT, SAD

Ideally, data transfer to the co-processor need occur only at the start of the co-processor compute phase, and results returned at the end. This reduces intermediate data transfers and handshaking complexity.

Examples: FFT, SAD


42


Agenda




… proceed to lab 3 Algorithm Partitioning Between the DSP and FPGA

We began with a quick review of the TI DM6437 DaVinci Digital Media Processor, followed by a basic intro to Xilinx FPGA architecture in order to understand how the 2 can best work together.

We then focused on techniques to best solve the problem at hand, temporal template matching for video using the FPGA.

Finally, we concluded with some guidelines for hardware / software algorithm partitioning between TI DSP and Xilinx FPGA


43


Reference Slides


44


5 Application Notes available in the Virtex-4 User Guide in regard to implementation specifics

Many Reference Designs in:
- VHDL
- Verilog
- System Generator for DSP

For Further Details visit…..

www.xilinx.com/dsp

Xilinx FIR Filter Implementation Guides

http://www.xilinx.com/xlnx/xweb/xil_publications_display.jsp?sGlobalNavPick=&sSecondaryNavPick=&category=-1210767&iLanguageID=1


45


Spartan3A-DSP DSP48A User Guide

http://www.xilinx.com/support/documentation/user_guides/ug431.pdf


46


XtremeDSP Device Portfolio

Virtex-DSP: Virtex-5 SXT (5VSX95T, 5VSX50T, 5VSX35T) and Virtex-4 SX (4VSX55, 4VSX35, 4VSX25)
Spartan-DSP: Spartan-3A DSP (3SD3400A, 3SD1800A)

Device                                  5VSX95T     5VSX50T     5VSX35T     4VSX55      4VSX35      4VSX25      3SD3400A    3SD1800A
DSP Performance (GMAC/s)                352 (2)     158 (2)     106 (2)     256 (2)     96 (2)      64 (2)      32 (1)      21 (1)
Max Block RAM Memory Bandwidth (Gbps)   19,325 (2)  10,454 (2)  6,653 (2)   11,520 (2)  6,912 (2)   4,608 (2)   2,268 (1)   1,512 (1)
Max DSP Frequency (MHz)                 550 (2)     550 (2)     550 (2)     500 (2)     500 (2)     500 (2)     250 (1)     250 (1)
XtremeDSP DSP48* Slices                 640         288         192         512         192         128         126         84
Min Footprint (mm)                      27x27       27x27       27x27       27x27       27x27       27x27       19x19       19x19
Distributed RAM (Kb)                    1,520       780         520         384         240         160         373         260
Block RAM (Kb)                          8,784       4,752       3,024       5,760       3,456       2,304       2,268       1,512
Logic Cells                             94,208      52,224      34,816      55,296      34,560      23,040      53,712      37,440

High Speed Connectivity:
- 5VSX95T: 320 x 1.25 Gb/s LVDS pairs, 16 x 3.2 Gb/s Transceivers
- 5VSX50T: 240 x 1.25 Gb/s LVDS pairs, 12 x 3.2 Gb/s Transceivers
- 5VSX35T: 180 x 1.25 Gb/s LVDS pairs, 8 x 3.2 Gb/s Transceivers
- 4VSX55: 360 x 1+ Gb/s LVDS pairs
- 4VSX35: 224 x 1+ Gb/s LVDS pairs
- 4VSX25: 120 x 1+ Gb/s LVDS pairs
- 3SD3400A: 208 x 622+ Mb/s LVDS pairs
- 3SD1800A: 176 x 622+ Mb/s LVDS pairs

(1) In Standard Speed Grade
(2) In Fast Speed Grade

Transcript:

Here is the resulting XtremeDSP device portfolio. Only the devices labeled XtremeDSP are shown so it is not comprehensive. The LX and FX-type devices are not listed only for the sake of making this fit into one page so you can see where the new devices fit in capabilities and features.

This chart is also a little different than the typical family chart you are used to seeing. The top portion lists the DSP capabilities emphasizing the GMAC and memory bandwidth throughput. The bottom portion lists the primary DSP-centric care-abouts in the device feature set. As previously mentioned, the reference section displays a complete feature-list device table.

The key takeaways for Spartan-3A DSP are the DSP48As, the amount of Block RAM, and the performance in the standard speed grade. Quoting the lower-cost speed grade departs from the typical model of providing the highest numbers with the most costly device.
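The GMAC/s figures in the portfolio chart appear consistent with a simple product of DSP48 slice count and maximum DSP frequency. The sketch below checks that reading; the one-MAC-per-slice-per-clock assumption and the helper name `peak_gmacs` are ours, and the slice counts, frequencies and quoted GMAC/s values are taken from the chart.

```python
# (DSP48 slices, max DSP frequency in MHz, quoted GMAC/s) per device.
DEVICES = {
    "5VSX95T": (640, 550, 352),
    "5VSX50T": (288, 550, 158),
    "5VSX35T": (192, 550, 106),
    "4VSX55": (512, 500, 256),
    "4VSX35": (192, 500, 96),
    "4VSX25": (128, 500, 64),
    "3SD3400A": (126, 250, 32),
    "3SD1800A": (84, 250, 21),
}


def peak_gmacs(slices, mhz):
    """Peak MAC rate in GMAC/s, assuming one MAC per slice per clock."""
    return slices * mhz / 1000.0


for name, (slices, mhz, quoted) in DEVICES.items():
    print(f"{name}: {peak_gmacs(slices, mhz):.1f} GMAC/s (chart: {quoted})")
```

Each computed value lands within 0.5 GMAC/s of the charted figure, which suggests the chart rounds the product to the nearest integer.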


47


Xilinx Confidential
Prepared by: Niall Battson (Xilinx), 2004

Embedded Control MACC FIR

Filter Specification: Sampling Frequency = 3.84 MHz, Coefficients = 117

[Diagram: a counter generates the data address for a 117 x 18 input data memory. A second memory holds the coefficients (117 x 18) and control (117 x 10), and supplies the coefficient address, WE, CE and Load signals. Both memories feed a DSP48 slice with opmode = 0100101, producing output yn from input xn.]

• Control logic is reduced to only a counter
• Address signal is feedback, and its value must always point to the next required address
• Load signal is the MSB on the coef/control memory space, and the required latency is already in the signal
• Care must be taken in setting up the memory, especially the initial values, so that the addressing kick-starts
• The A port is a concatenation of the coefficients, coefficient address, WE, CE and Load signals
• More than 256 coefficients at greater than 9-bit data requires more Block RAM, and the traditional control technique should be used

Filter Size: 1 XDSP Slice, 1 Block RAM, 28 Slices

Max Sample Rate = Clock Rate / Number of Taps

Exploiting many slow data streams

Filter Specification: Sampling Frequency = 1.2288 MHz, Coefficients = 104, Channels = 3

[Diagram: inputs x1(n), x2(n) and x3(n) are time-multiplexed into one DSP48 slice (opmode = 0100101) with a 104 x 18 coefficient memory. Control logic drives the data address, coefficient address, we and Load signals. Outputs are y1(n), y2(n) and y3(n).]

• Time Division Multiplexing of the numerous channels is performed with a mux running at C times faster than the input
• The output mux would most likely take the form of a bank of capture registers that are enabled as appropriate
• The control is a little trickier here as it has to accommodate the TDM data stream. The embedded control can also be used.
• Colours represent the different channels going through the filter. Each filter is processed one at a time instead of interleaved.

Filter Size: 1 XDSP Slice, 1 Block RAM, 116 Slices (30 for control)

Max Sample Rate = Clock Rate / (Number of Taps x Number of Channels)
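To make the two filter specifications above concrete, the sketch below computes the clock rate a single MACC would need in each case. It assumes the relation Max Sample Rate = Clock Rate / (Number of Taps x Number of Channels) as read from these slides; the function name `required_clock_mhz` is hypothetical.

```python
def required_clock_mhz(sample_rate_mhz, taps, channels=1):
    """Clock rate needed for one MACC to process every tap of every channel."""
    return sample_rate_mhz * taps * channels


# Embedded Control MACC FIR: 3.84 MHz sampling, 117 coefficients, 1 channel.
single = required_clock_mhz(3.84, 117)

# Multi-channel MACC FIR: 1.2288 MHz sampling, 104 coefficients, 3 channels.
multi = required_clock_mhz(1.2288, 104, channels=3)

print(f"single-channel: {single:.2f} MHz")
print(f"three-channel:  {multi:.4f} MHz")
```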

Four dimensions?

• Time to put everything together!
• Consider Multi-Channel Multi-Rate FIR Filters together
• How do the boundary lines change as the interpolation or decimation rate increases?

[Chart: Sample Rate (MHz), log scale from 0.001 to 500, versus Number of Coefficients (N), log scale from 10 to 1000. An arrow is labeled "Increasing No. of Multipliers". Regions shown: Parallel FIR Filters (Transpose FIR; Systolic FIR, symmetric & non), Semi-Parallel BRAM FIR, Semi-Parallel Dist Mem FIR, Dist Mem MACC FIR, Embedded Control MACC FIR, Normal Control MACC FIR.]

For Further Details Contact…..

www.xilinx.com/education
(877) 959-2527

2 DSP Courses:

- DSP Implementation Techniques (3 day)

- DSP Design Flow (3 day)

To educate students on efficient DSP design in Xilinx FPGAs using the latest system level design tools

Over 500 Pages

Xilinx DSP Courses


48


TI DaVinci™ technology based development tools enable fast time-to-market

- DM355: DM355 & DM335 Digital Video Evaluation Module, TMDSEVM355, $495
- DM643x: DM6437 Digital Video Development Platform, TMDSVDP6437, $495
- DM644x: DM644x Digital Video Evaluation Module, TMDSEVM6446, $2,495
- DM647/8: DM648 Digital Video Development Platform, TMDXDVP648, $1,295
- DM6467: DM6467 Digital Video Evaluation Module, TMDSEVM6467, $1,995
- OMAP™ 3: OMAP 3 Digital Video Evaluation Module, TMDXEVM3503, $1,495



49


• FREE EVALUATION provided for all TI software codecs
• Extensive, growing roadmap
• Cross-platform availability with API compatibility
• Complete listing of TI software inventory, including technical documentation, available on www.ti.com/digitalmediasoftware or www.ti.com/dms
• Integration Support must be contracted through a TI Authorized Software Provider: www.ti.com/asp
• BASIC BUNDLE (yellow highlighted items) AVAILABLE through eStore post production release (GA)

[Table: Video / Imaging software by platform, with columns C644x, DM355 and OMAP35xx. Rows: ● JPEG e/d, ○ MPEG-2 e, ● MPEG-2 MP d, ● MPEG-4 SP/H.263 e, ● MPEG-4 SP/H.263 d, ○ MPEG-4 ASP e/d, ● H.264 BP e, ● H.264 BP d, ● H.264 MP d, ○ H.264 MP e, ● VC1 d, ○ VC1 e. Cell entries include HW, HW 720p, 720p, BETA NOW, BETA 2Q, GA 4Q08, Planned and Decode Planned.]

● Available now
○ Available now (3P IP may be purchased/sub-licensed through ASP)
Included in BASIC Bundle (by device platform)
GA = General availability
All video/imaging codecs listed are up to D1 resolution unless otherwise indicated
e – encode, d – decode, BP – Baseline Profile, SP – Simple Profile, MP – Main Profile, ASP – Advanced Simple Profile

eXpressDSP™ Licensable Software from TI


50


TI DM6437 User Guides

TMS320DM643x Video Processing Front End (VPFE) User's Guide
http://focus.ti.com/lit/ug/spru977/spru977.pdf

How to Use the VPBE and VPFE Driver on TMS320DM643x Devices
http://focus.ti.com/lit/an/spraap3a/spraap3a.pdf

TMS320DM643x Video Processing Back End (VPBE) User's Guide
http://focus.ti.com/lit/ug/spru952a/spru952a.pdf

Spectrum Digital DM6437 EVM Support Home (Revision E)
http://c6000.spectrumdigital.com/evmdm6437/reve/

TMS320DM643x DMP VLYNQ Port User's Guide
http://focus.ti.com/lit/ug/spru938b/spru938b.pdf


51


Complete technology offering
For any digital video product from capture to view

[Diagram: DaVinci TECHNOLOGY across the video chain: Capture, Process, Deliver, Receive, View.]

Processors: Tuned for any video application
Tools: Speed time to market
Software: Optimized and ready to go

Applications: Medical Imaging, Video Infrastructure, Portable Video, Video Security, Video Phones, Automotive Vision & Infotainment, Camera, IP Set-Top Box, Future Video Products

TI has and will continue to focus on, develop and promote a complete technology offering for all digital video applications from capture to display and viewing. This complete offering is based on the DaVinci Technology that combines processors, tools, software and system expertise (4 “pillars” of DaVinci) with support to enable innovation, ease of use and faster time to market.