

A TMD-MPI/MPE Based Heterogeneous Video System

by Tony Ming Zhou

Supervisor: Professor Paul Chow

April 2010


Abstract:

Advancements in FPGA technology have enabled reconfigurable, large-scale hardware system designs. In recent years, heterogeneous systems comprised of embedded processors, memory units, and a wide variety of IP blocks have become an increasingly popular approach to building future computing systems. The TMD-MPI project has extended the standard software message passing interface, MPI, to the scope of FPGA hardware design. It provides a new programming model that enables transparent communication and synchronization between tasks running on the heterogeneous processing devices in the system. In this thesis project, we present the design and characterization of a TMD-MPI based heterogeneous video processing system comprised of hardware peripheral cores and a software video codec. By hiding low-level architectural details from the designer, TMD-MPI can improve development productivity and reduce design difficulty. In particular, with the abstraction TMD-MPI provides, the software video codec approach is an easy entry point into hardware design. The primary focus is the functionality and the different configurations of the TMD-MPI based heterogeneous design.


Acknowledgements

I would like to thank supervisor Professor Paul Chow for his patience and

guidance, and the University of Toronto and the department of Engineering Science

for the wonderful five year journey that led me to this point. Special thanks go to the

TMD-MPI research group, in particular to Sami Sadaka, Kevin Lam, Kam Pui Tang,

and Manuel Saldaña. Last but not least, I would like to thank my family and my

friends for always being there for me. Stella, Kay, Grace, Amy, Rui, Qian, Chunan, and

David, you have painted the colours of my university life.


Glossary

MPI: Message Passing Interface
API: Application Program Interface
FIFO: First-In-First-Out
NetIf: Network Interface
MPE: Message Passing Engine
TMD: Originally meant Toronto Molecular Dynamics machine, but this definition was rescinded as the platform is not limited to Molecular Dynamics. The name was kept in homage to earlier TM-series projects at the University of Toronto.
VGA: Video Graphics Array
RGB: Red-Green-Blue Colour Model
FPS: Frames Per Second
DVI: Digital Visual Interface
FSL: Xilinx Fast Simplex Link
HDL: Hardware Description Language
TX: Transmission/Transmitting
RX: Reception/Receiving
MPMC: Multi-Ported Memory Controller
BRAM: Xilinx Block RAM
PLB: Processor Local Bus


Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
2 Background
  2.1 Literature Review
  2.2 Distributed/Shared Memory Approaches
  2.3 The Building Blocks
3 Methods and Findings
  3.1 The Video System in Software
  3.2 The Video System on FPGA
      3.2.1 System Block Diagram
      3.2.2 Distributed Memory Model
      3.2.3 Shared Memory Model
4 Discussions and Conclusions
  4.1 Software vs. Hardware
  4.2 Conclusions and Future Directions
References
Appendix A: Video System – Software Prototype
Appendix B: Video System – Hardware System
Appendix C: File Structure


1 Introduction

1.1 Motivation

Chip development has become increasingly difficult due to transistor physical scaling limitations. Parallel processing stands out as one of the best

alternative solutions for performance improvements. Message Passing Interface, or

MPI, is a specification for an API that allows computers to communicate with one

another. After over a decade of development, it has become the de facto standard for

communications among software processes that model a parallel program with

distributed memory.

Hardware engines are generally better suited for parallel applications

compared to software. Modern day FPGA technology has enabled advances in

hardware design. With the aid of HDLs and FPGA reprogrammability, software programs can now be accelerated in hardware without the high cost of ASIC design.

However, unlike low-level hardware design, high-level system integration can be complex and time consuming. Professor Paul Chow and his research group at the University of Toronto have built a lightweight subset implementation of the MPI standard, called TMD-MPI.

It provides software and hardware middleware layers of abstraction for communication, enabling portable interaction between embedded processors, CEs, and x86 processors. Previous work demonstrated that TMD-MPI is a feasible high-level programming model for multiple embedded processors, but complex systems with heterogeneous processing units had yet to be tested [1].

1.2 Objectives

The development of TMD-MPI is still in its infancy compared to the MPI standard, and implementations and characterizations of designs are lacking. This undergraduate thesis project attempts to fill that gap through the design and characterization of a TMD-MPI based heterogeneous video system.


Once a simple, working heterogeneous system has been successfully demonstrated, this thesis focuses on expanding the network of software elements to exploit more parallelism.

2 Background

2.1 Literature Review

Although heterogeneous systems have numerous performance and energy

advantages, design complexity remains a major factor limiting their use. Successful

designs require developer expertise in multiple languages and tools. For instance, a

typical FPGA heterogeneous system engineer should possess knowledge of HDLs, software coding, the interface details between source and destination engines, CAD tools, and vendor-specific FPGA details. Ideally, a specialized hardware or software element of the system could be designed independently from the other elements, yet still be portable and easy to integrate into the overall system. TMD-MPI achieves this by abstracting away the details of coordination between different task-specific elements, and, in addition, it provides an easy-to-use entry point.

Similar attempts have been made by OpenFPGA and NSF CHREC:

a) OpenFPGA released its General API Specification 0.4 in 2008 in an attempt to propose an industry-standard API for high-level language access to reconfigurable FPGAs in a portable manner [2]. The scope of TMD-MPI is much larger: the types of interaction in GenAPI are very limited, as it focuses only on the low-level x86-FPGA interaction and does not deal with higher levels.

b) NSF CHREC, on the other hand, developed a conceptually similar framework adopting the message-passing approach. A more careful inspection reveals the differences: the hardware and software elements in their SCF heterogeneous design are statically mapped to one another [3]. In contrast, the mapping of TMD-MPI nodes is dynamically defined, implying that point-to-point communication paths can be redirected at run time for greater versatility.


2.2 Distributed/Shared Memory Approaches

The primary goal of this thesis is functionality rather than performance. Performance considerations aside, two high-level approaches can be adopted.

The first is a distributed memory system where all processing units are equipped with local memories. Because local data is not accessible by ranks other than its owner, the video frames must be passed as messages from one rank to another. Video streaming has a unidirectional flow of data, which makes it a well-suited application for the distributed memory approach.

The second is a shared memory approach, where all the processing units share a common memory space. Provided that the video data in memory are properly managed by a special engine and the memory interface is not port-limited, the memory contents are accessible by all the processing units. As a result, designating a task to a processing unit can be as simple as passing a memory address.

The desired video frame is 640 by 480 pixels in 32-bit RGB format, which is equivalent to 1200 kilobytes. If the desired frame rate is 30 FPS, then passing whole frames as messages requires the network to handle roughly 1200 KB * 30, about 35 MB/s. The shared memory approach introduces a far less traffic-intensive way of communicating by passing 32-bit addresses as messages: the simplest application only needs to broadcast a base memory address to all processing units, resulting in a total network data traffic of 32 bits per rank.
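The arithmetic can be checked with a short C calculation (a sketch only; the 640x480, 32-bit RGB, and 30 FPS figures are the ones quoted above):

    #include <stdio.h>

    int main(void)
    {
        const unsigned width = 640, height = 480, bytes_per_pixel = 4, fps = 30;

        unsigned frame_bytes = width * height * bytes_per_pixel;            /* 1,228,800 B = 1200 KB  */
        double frames_mb_per_s = (double)frame_bytes * fps / (1024 * 1024); /* ~35 MB/s of frame data */
        unsigned shared_bits_per_rank = 32;                                  /* one address per rank   */

        printf("Frame size: %u KB\n", frame_bytes / 1024);
        printf("Frames-as-messages network load: %.1f MB/s\n", frames_mb_per_s);
        printf("Addresses-as-messages network load: %u bits per rank\n", shared_bits_per_rank);
        return 0;
    }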

The drawbacks of the shared memory approach are longer memory access times and a proneness to data corruption. By comparison, the shared memory approach exhibits a more asynchronous nature: if a section of memory is assigned to more than one codec, race conditions can cause premature or delayed memory updates. Moreover, if the FPGA is not robust and a bit flips in a message on the network, a bad memory address in the shared memory model is more likely to result in a catastrophic failure than a bad pixel in the distributed memory model.

2.3 The Building Blocks

In this project, point-to-point communication channels are implemented using Xilinx Fast Simplex Links (FSLs), which are essentially FIFOs. The FIFO is a powerful abstraction for on-chip communication and is able to handle the bandwidth of this video system. Because FSLs are unidirectional, they are implemented in pairs for the transmission and reception of data. In the special case of a CE's interface with the TMD-MPE, an extra pair of command FIFOs is required exclusively for MPI commands (Fig. 1).

Figure 1. The FIFO pairs act as point-to-point communication channels.

The software elements are implemented using Xilinx Microblaze soft-core processors. The message passing protocol is brought in at compile time by declaring the TMD-MPI library in the source code. An additional message passing engine, or MPE, was also created to perform a subset of the message passing functions in hardware. The hardware elements can be designed in any HDL, but they always require a TMD-MPE to connect to the network.
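As an illustration of what using the TMD-MPI library looks like from the software side, the sketch below uses only the standard MPI calls that TMD-MPI subsets; the header name, rank numbers, and data are assumptions for illustration, not project code.

    #include <mpi.h>   /* TMD-MPI exposes an MPI-style interface; the exact header name is assumed */

    int main(int argc, char *argv[])
    {
        int rank;
        unsigned pixel = 0;

        MPI_Init(&argc, &argv);                /* join the on-chip message passing network */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* every node is identified by its rank     */

        /* Hypothetical ranks: rank 1 produces a 32-bit word, rank 2 consumes it. */
        if (rank == 1)
            MPI_Send(&pixel, 1, MPI_UNSIGNED, 2, 0, MPI_COMM_WORLD);
        else if (rank == 2)
            MPI_Recv(&pixel, 1, MPI_UNSIGNED, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }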

Figure 2. Simplified NetIf diagram showing the TX and RX muxes.

Left, RX half with RX FIFOs. Right, TX half with TX FIFOs.


In order to enable dynamic mapping of the nodes, extra state machines would be required in each element/core. The Network Interface block provides this path-routing function and eliminates the need for extra states in each core. Another issue arises as the system gets large: the Microblaze soft processor only has eight FSL channels available, limiting the number of nodes it can communicate with. The Network Interface block also solves this issue by abstracting the channel multiplexing away from the Microblazes. It functions as a multiplexer controlled by the destination rank in the TX half, and as a demultiplexer controlled by the source rank in the RX half (Fig. 2). Connection-wise, on one side of the NetIf, a pair of Xilinx FSLs connects the NetIf and the current node; on the other side, the NetIf may be connected to all the other NetIfs for maximum freedom (Fig. 3a), or to fewer channels for improved performance at the cost of reduced system visibility (Fig. 3b).

Figure 3. Different ways of interconnecting the nodes; a NetIf is attached to each element: (a) left, (b) right.
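Conceptually, the NetIf's TX half is a multiplexer selected by the destination rank and its RX half is a demultiplexer selected by the source rank. The toy C model below captures only that selection logic; it is not the HDL, and the routing-table array is an assumption.

    #define NUM_CHANNELS 8   /* hypothetical channel count, for illustration only */

    /* TX half: the destination rank of the outgoing message selects the channel. */
    int tx_select(int dest_rank, const int channel_rank[NUM_CHANNELS])
    {
        for (int ch = 0; ch < NUM_CHANNELS; ch++)
            if (channel_rank[ch] == dest_rank)
                return ch;       /* forward the word on this channel */
        return -1;               /* destination rank not reachable   */
    }

    /* RX half: the source rank selects which incoming FIFO is drained. */
    int rx_select(int src_rank, const int channel_rank[NUM_CHANNELS])
    {
        return tx_select(src_rank, channel_rank);   /* same lookup, opposite direction */
    }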

Lastly, this thesis project was built upon a video system made available by Professor Paul Chow's summer student Jeff Goeders. That three-node system is a scalable TMD-MPI based video processing framework; it currently supports video streaming from the VGA port to the DVI port on the Xilinx Virtex-5 board [5].


3 Methods and Findings

3.1 The Video System in Software

Implementing TMD-MPI requires first understanding how to use the MPI specification. Logically, the first step was to build a video frame processor prototype entirely in software.

This prototype application utilizes the multiple CPU cores available on a PC; if the available cores are insufficient, the missing cores are simulated as additional processes on the existing cores. In the application, the input and output video frames were replaced with bitmap images of the same RGB format, the processing codec was coded in C++, and both the distributed memory and shared memory models were implemented with the aid of readily available software libraries.

Results:

Two instances of the application are shown below in Figures 4 and 5. Memory activity has been omitted from the figures. The black arrows in Figure 4 symbolize the flow of actual frame data as messages, whereas the black arrows in Figure 5 symbolize the flow of memory pointers as messages. Although streaming is better suited to a distributed memory model, the streaming and parallel processing applications can be interchanged between the two models.

Figure 4. Streaming using the distributed memory message passing model (codec stages: add noise, invert colour, flip image).
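A minimal MPICH2-style sketch of the Figure 4 pipeline is shown below, assuming each rank applies one effect and a whole frame fits in a single message; the apply_effect() stub is hypothetical and this is not the actual prototype source.

    #include <mpi.h>
    #include <stdlib.h>

    #define FRAME_WORDS (640 * 480)      /* one 32-bit word per pixel */

    /* Hypothetical codec stub: every stage simply inverts the pixel values in place. */
    void apply_effect(unsigned *frame, int rank)
    {
        for (long i = 0; i < FRAME_WORDS; i++)
            frame[i] = ~frame[i];
        (void)rank;
    }

    int main(int argc, char *argv[])
    {
        int rank, size;
        unsigned *frame = malloc(FRAME_WORDS * sizeof *frame);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank > 0)   /* receive a frame from the previous stage (rank 0 would load the bitmap) */
            MPI_Recv(frame, FRAME_WORDS, MPI_UNSIGNED, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        apply_effect(frame, rank);       /* add noise, invert colour, flip image, ... */

        if (rank < size - 1)             /* forward the whole frame to the next stage */
            MPI_Send(frame, FRAME_WORDS, MPI_UNSIGNED, rank + 1, 0, MPI_COMM_WORLD);

        free(frame);
        MPI_Finalize();
        return 0;
    }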


Figure 5. Parallel processing using the shared memory message passing model: data is passed to the ranks as pointers into memory, and the processed data is sent to the output node as a pointer.

3.2 The Video System on FPGA

3.2.1 System Block Diagram:

Figure 6. The heterogeneous video processing system block diagram


The system block diagram in Figure 6 lays out the overall picture of the heterogeneous video processing system. The Network Interface Block in the centre, comprised of many interconnected NetIfs, is shown as a single block for simplicity. The region above the NetIfs Block consists of the hardware elements; hardware CEs interface with the system network through TMD-MPEs. Rank 1 is a video decoder that takes the VGA input and places it onto the system network, and Rank 2 is a video receiver that gets the video frames from the network and stores them in the external memory. All of the software elements reside below the NetIfs Block: the special Rank 0 process and a network of Microblazes running several specialized codec-related processes labelled Rank 3-N. Software elements interface with the system network through the TMD-MPI software library.

The 256 MB of external memory is not only ported for Rank 2's storage and the DVI-out core's video output; it is also useful when a hardware engine's local memory is insufficient, or when a shared memory model is implemented. The MPMC is the memory interface between the system and the external memory.

In MPI, nodes identify each other by rank, and the sending and receiving of messages require source and destination ranks. Rank 0 acts as the central command centre of the system: it initializes all the ranks at the start of runtime, and it configures and reconfigures the mapping between ranks during runtime. In the codec network of Microblazes labelled Rank 3-N, the hardware settings may or may not be identical, depending on the peripheral devices needed by each process. In general, the quickest and most efficient way to implement the codec processes is to duplicate a fully functioning Microblaze along with its peripherals and settings, and load each copy with different source code for its codec-specific function.

3.2.2 Distributed Memory Model:

Jeff Goeders' framework demonstrated video streaming from a video decoder core (Rank 1) to a memory storage core (Rank 2); eventually, the video is output to a DVI port. The communication channel between the two cores is established using the TMD-MPI rank-to-rank model described earlier. Therefore, a Microblaze (Rank 3) can be inserted into the streaming path between the two cores simply by changing the destination rank of Rank 1 and the source rank of Rank 2. Functionality-wise, the following tasks must be performed by the Microblaze:

- Receive frames from Rank 1, in units of whole 640 x 480 frames
- Apply the video codec effects
- Send the modified frames out to Rank 2, in units of whole 640 x 480 frames

The requirements above pose several challenges. First of all, the Microblaze must operate at more than double the rate at which Rank 1 and Rank 2 respectively send and receive data, and extra clock cycles are needed for the video processing itself. The most direct solution is to parallelize the task by dividing each frame into smaller units that are distributed equally among multiple Microblazes.
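One way to express the frame-splitting idea is sketched below, assuming the frame is divided by rows and the codec Microblazes occupy ranks 3 and up; the rank numbering and helper name are illustrative assumptions, not the project code.

    #include <mpi.h>

    #define WIDTH   640
    #define HEIGHT  480
    #define WORKERS 4                       /* number of codec Microblazes   */
    #define FIRST_WORKER_RANK 3             /* assumed first codec rank      */

    /* Hand each of the WORKERS codec ranks an equal slice of the frame. */
    void distribute_frame(unsigned *frame)
    {
        int rows_per_worker = HEIGHT / WORKERS;            /* 120 rows each  */
        for (int w = 0; w < WORKERS; w++) {
            unsigned *slice = frame + (long)w * rows_per_worker * WIDTH;
            MPI_Send(slice, rows_per_worker * WIDTH, MPI_UNSIGNED,
                     FIRST_WORKER_RANK + w, 0, MPI_COMM_WORLD);
        }
    }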

The second challenge is the local memory constraint on the Microblaze. Messages are sent and received in units of whole 1200 KB frames. Unlike the hardware engines, the Microblaze suffers from the higher-level interaction it has with the FIFOs: hardware engines can receive and send data one 32-bit FIFO entry at a time, but a Microblaze must receive each 1200 KB message as a whole. The distributed local memory and BRAM available to the Microblaze are both too small for this, so the only option is the external off-chip memory. Note that using an external memory slightly violates the principles of a true distributed local memory model.

Results:

As expected, a single Microblaze-based codec suffers from poor performance. Although the soft-core processor operates at the same 100 MHz system clock frequency, the effective rate of video streaming works out to only 1-10 MHz (Fig. 7).

Figure 7. The Microblaze as the bottleneck in the streaming path.


There are three factors contributing to the slowdown of the Microblaze. First, a Microblaze interfaces with the Xilinx FSL less efficiently than hardware does, and the extra clock cycles it needs recur for every 32-bit data entry, that is, for every pixel on the FIFO. Second, because of the large size of a video frame (1200 KB), a local memory approach is not applicable and an external off-chip memory is used instead. The external memory must share the PLB bus with other peripheral devices, so bus arbitration and the extra traffic introduce significant delay (Fig. 8); in addition, the memory is off the FPGA chip, and both the complex MPMC interface and the long physical distance translate to more delay. The last factor is the implicitly sequential execution of instructions in a normal processor [1]; however, its contribution is small, since a well-designed video codec should be well cached and pipelined by the processor.

Figure 8. The peripheral devices that are connected to the Microblaze in this video system.

The number of remaining ports on the MPMC limits the number of codec Microblazes to six. Even if they were perfectly parallelized, the combined effective frequency would be 60 MHz, still short of the speed of the other cores. Clearly, the design needed a new direction, so the shared memory model was introduced.

3.2.3 Shared Memory Model:

The shared memory approach, as described in an earlier section, enjoys the benefit of significantly reduced network traffic at the cost of memory access time. Implementations of the model may vary, but the key idea is that only a finite number of messages is sent to the codec Microblazes. In this project, six 32-bit words are sent to each codec Microblaze, carrying the base and high addresses of the video, the type of codec to run, and other control signals. The Xilinx FSL interface delay associated with these six messages is negligible compared to the per-pixel delay cycles of the distributed memory model.
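The six control words might be laid out as in the sketch below. Only the fact that six 32-bit words carrying the base address, high address, codec type, and other control signals are sent comes from the text; the field order, names, and the in-place inversion are assumptions.

    #include <mpi.h>
    #include <stdint.h>

    /* Hypothetical layout of the six 32-bit words sent to each codec Microblaze. */
    enum { CTL_BASE_ADDR, CTL_HIGH_ADDR, CTL_CODEC_TYPE, CTL_CTRL0, CTL_CTRL1, CTL_CTRL2, CTL_WORDS };

    /* Rank 0 side: pack the addresses and codec selection into one short message. */
    void send_control(int codec_rank, unsigned base, unsigned high, unsigned codec)
    {
        unsigned ctl[CTL_WORDS] = { base, high, codec, 0, 0, 0 };
        MPI_Send(ctl, CTL_WORDS, MPI_UNSIGNED, codec_rank, 0, MPI_COMM_WORLD);
    }

    /* Codec Microblaze side: receive the six words, then work on shared memory in place. */
    void codec_worker(void)
    {
        unsigned ctl[CTL_WORDS];
        MPI_Recv(ctl, CTL_WORDS, MPI_UNSIGNED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        volatile unsigned *p = (volatile unsigned *)(uintptr_t)ctl[CTL_BASE_ADDR];
        unsigned words = (ctl[CTL_HIGH_ADDR] - ctl[CTL_BASE_ADDR]) / (unsigned)sizeof(unsigned);
        for (unsigned i = 0; i < words; i++)
            p[i] = ~p[i];            /* e.g. colour inversion; no pixel crosses the network */
    }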

Note that the shared memory approach is at a disadvantage in memory access time only when it is compared to a true distributed memory model. Because of the limited local memory available to the Microblazes, an external memory was used for the distributed memory system in this project. Given that, a simple analysis shows that the two models exhibit practically identical memory access times: the distributed-memory Microblaze first writes to memory as it receives the data, then reads the data back for processing and transfer to the next node; the shared-memory Microblaze first reads the data for processing, then updates the memory contents. Exactly one read and one write take place in both models.

The same framework by Jeff Goeders was used as the groundwork on which the shared memory model is built. The codec-related Microblazes are added to the system without any source or destination changes to Rank 1 and Rank 2. Memory space management becomes a crucial task in this model: the video storage core's and the DVI-out core's memory spaces are separate, and the codec Microblaze's memory accesses span both spaces, since it is the agent that transfers the data from one space to the other (Fig. 9).

Figure 9. Microblaze spans two spaces as both the video processor and transferor.


Results:

The speed improvement is evident and, as far as the measurements show, linearly scalable (Fig. 10). The frame rate was measured for up to four Microblazes, and a linear trend was observed. The codec effects include a darkening effect, addition of noise, colour inversion, and colour change. The spread of the data is expected because different codecs were applied during the multiple runs.

The FPS was measured by a special function within each Microblaze, as opposed to being measured at the DVI-out core. Therefore, despite the limited number of available MPMC ports, the FPS benchmark for an increased number of Microblazes can still be simulated by reducing the share of the task handled by each Microblaze.

In the shared memory model, most of the message passing occurs between the special Rank 0 node and the codec Microblazes. Extra measures must be taken to ensure the proper sequencing of tasks for data correctness; as a result, the software code in this model tends to be more complex.

Figure 10. Measured frame rate as the number of codec Microblazes increases.


4 Discussions and Conclusions

4.1 Software vs. Hardware

All of the software cores except Rank 0 are replaceable with hardware cores. One of the main goals is to assess the effectiveness of the software approach to TMD-MPI programming.

The functionality of the TMD-MPI library far exceeds that of its hardware counterpart, the TMD-MPE. There are 25 MPI commands available in the TMD-MPI library, whereas the TMD-MPE supports only 3 MPI commands in total: synchronous send, asynchronous send, and receive.
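Since many higher-level operations can be composed from those basic commands, a higher-level layer could, for example, build a broadcast out of point-to-point sends; the sketch below illustrates the idea in standard MPI terms and is not part of the TMD-MPE.

    #include <mpi.h>

    /* Naive broadcast built only from point-to-point send/receive, the kind of
     * composition a higher-level MPE could perform on top of the basic commands. */
    void simple_bcast(unsigned *word, int root, int rank, int size)
    {
        if (rank == root) {
            for (int r = 0; r < size; r++)
                if (r != root)
                    MPI_Send(word, 1, MPI_UNSIGNED, r, 0, MPI_COMM_WORLD);
        } else {
            MPI_Recv(word, 1, MPI_UNSIGNED, root, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }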

For the design of these codecs, given the same task and clock, a specialized hardware engine is expected to outperform the software process. Hardware engines can be optimized for specialized tasks and are better suited for parallel applications; processors, on the other hand, are general-purpose oriented, which makes them less suited for specialized tasks. Moreover, the experimental results in the previous sections suggest that the software processes are associated with more delay, another major factor limiting a software process's performance.

In terms of development speed and difficulty, the software method has clear advantages because of better scalability and reduced compile time. For instance, functionally different processes run on structurally identical processors, so scaling the software processes is as simple as duplicating an existing Microblaze and its settings. Compared to hardware development, compile and debugging times in software are significantly reduced: on average, regenerating the system bitstream after modifying a codec takes 1 minute for a software change but 90 minutes for a hardware change.

Well-designed hardware engines should occupy less area, but software-based codecs have lower development costs. The cost function should not be evaluated simply on chip area and development cost, since there are many other contributing factors; thus, the cost comparison is inconclusive.


The comparisons drawn above are summarized in the following table:

                 Software        Hardware
Functionality    Very good       Limited (3 commands)
Performance      Slow            Very fast
Development      Fast            Slow
Cost             Inconclusive    Inconclusive

4.2 Conclusions and Future Directions

This thesis project is an implementation of a TMD-MPI based heterogeneous video processing system. More specifically, the video processing units were implemented using Microblaze soft-core processors executing C programs in parallel. The characterization and analysis have demonstrated that TMD-MPI is a feasible and efficient approach to heterogeneous system design.

The performance drawbacks of software processes limit the data throughput. Since TMD-MPI provides a scalable way to enable parallel processing, performance can be improved by duplicating the current software processes. Although the shared memory model outperformed the distributed memory model, a true distributed local memory model was not achieved, so the comparison is slightly unfair. Finally, the shared memory model provides more abstraction for the developer: by passing memory addresses as messages, it involves less network traffic and fewer hardware modifications. Based on the experience gained throughout this project, the shared memory model is the more scalable solution.

The TMD-MPI programming model is common to both software and hardware. For the MPI commands, a simple script could be written to convert between TMD-MPI and TMD-MPE commands. As C-to-HDL technology advances, automatic conversion of the software code becomes possible. The TMD-MPI approach thus suggests the possibility of an efficient, automated method of hardware development for designers with little hardware background.


The following is a list of future directions:

- Implement more MPI commands in the TMD-MPE for better hardware functionality. Since many functions are built upon the three basic ones available in the MPE, a higher-level MPE that utilizes the TMD-MPE's basic functions could be introduced.

- Expand the software codec network, and try more complex structures and parallelization for characterization. Because of the limited number of MPMC ports, the codec network of Microblazes may need to adopt both the distributed and shared memory models. As the control signals get complicated for Rank 0, local hierarchy methodologies such as the tree structure in Fig. 3b might be needed.

- Implement a multi-board system to explore a higher level of scalability. The building pieces are already available: Sami Sadaka has a homogeneous video processing system, and Kevin Lam has a gigabit Ethernet bridge.

- Develop a cost-effective method of automated C-to-HDL conversion so that developers can enjoy both the benefits of efficient software development and hardware-accelerated performance. One might wish to conduct a study of tools such as the Nios II C-to-Hardware Acceleration Compiler, Impulse C, and FPGAC. Although a completely different research topic, it is one of great interest to the TMD-MPI project.


References:

[1] M. Saldana, A. Patel, C. Madill, and P. Chow, "MPI as an Abstraction for Software-Hardware Interaction for HPRCs," Second International Workshop on High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), Austin, TX, USA, 2008.

[2] OpenFPGA, "OpenFPGA General API Specification 0.4." [Online]. Available: http://www.openfpga.org/Standards%20Documents/OpenFPGA-GenAPIv0.4.pdf [Accessed: Feb. 20, 2010].

[3] V. Aggarwal, R. Garcia, A. George, and H. Lam, "SCF: A Device- and Language-Independent Task Coordination Framework for Reconfigurable, Heterogeneous Systems," HPRCTA, Portland, Oregon, November 15, 2009.

[4] MPICH2: High-Performance and Widely Portable MPI. [Online]. Available: http://www.mcs.anl.gov/research/projects/mpich2/ [Accessed: Feb. 20, 2010].

[5] J. Goeders, "A Scalable, MPI-Based Video Processing Framework," University of Toronto, August 2009.

[6] M. Saldana, "Message Passing Engine (MPE) User's Guide," ArchES Computing, September 2009.


Appendix A: Video System – Software Prototype

Description:
This software MPI application is a picture-frame parallel processing program. There are currently five defined ranks in the system and two distinct tests; the number of ranks can easily be expanded. The project infrastructure was organized in Microsoft Visual Studio/C++, and it must run under the MPICH2 environment. A multi-core computer is not strictly necessary, because MPICH2 can simulate the missing cores as additional processes on the existing ones. Instructions for running the program on multiple computers are also given below.

Key Functions:
void MPE_Master(): Executed by Rank 0 only. The function first creates and initializes the shared memory, then defines the tasks to be performed and assigns them to the other ranks.
void MPE_Slave(): Executed by all non-zero ranks. The function contains a polling loop that continuously polls for tasks from Rank 0; each received task is translated and the corresponding codec is executed.
void Codec(int add, int size, int rank): The codec function takes three parameters: "add" determines the base address in memory at which processing starts, "size" is the size of the frame section to process, and "rank" determines which codec to run.

Instructions for running on a single computer:
1) Set up MPICH2.
2) Go to the working directory, debug folder.
3) Run the MPI program in the command prompt:

"mpiexec -n 5 source.bmp mpi_test1.exe"
"mpiexec -n 5 source2.bmp mpi_test2.exe"
Format: mpiexec -n arg1 arg2 executable
  arg1 - the number of processes/ranks
  arg2 - the input picture frame to be processed
  executable - your MPI program generated by the compiler

Instructions for running on multiple computers:
1) Make sure that the MPICH2 versions are the same on all computers/machines.
2) Copy the executable to the same directory on each machine (node), for example "C:\Program Files\MPICH2\examples\cpi.exe".
3) Set network connections: ensure that each machine allows its files to be shared with the other computers.
4) Set Windows Firewall: ensure that Windows Firewall allows file sharing by checking the corresponding option.
5) Add the MPICH2 path to the Windows User Variables and System Variables.
6) Run the MPI program in the command prompt, for example:
"mpiexec -hosts 2 domainnameA 1 domainnameB 1 c:\program files\mpich2\examples\cpi.exe"

Software Variable Definitions:
1) Tasks are MPI messages of type int array, assigned by Rank 0 through the MPI_Send command. Example task declaration:
"int taskname[TASKSIZE] = {0, 1, 640*480, 0, 0, 0, 0, 0};"
Format: t[TASKSIZE] = {
  0: source rank,
  1: destination rank,
  2: size of frame to access,
  3: memory address/pointer of the frame,
  4-7: unused
}

2) Tags are transferred with each MPI message and determine the type of message the MPI command carries.
TAG: the message only contains information about the source and destination.
TAG_ext: the extended version of the previous tag; in addition to the source and destination ranks, the memory address and size are carried in the message.
dieTAG: signal to shut down the current core.
3) A rank is a unique identity for each process. Rank 0 is always the system control centre, responsible for initializing the system and assigning tasks.

Current Tests:
Test 1: The memory pointer is passed from codec to codec; the entire frame is operated on.
Test 2: Different memory pointers are assigned by Rank 0. The frame is divided into three sections, which are handled by three different ranks/codecs/processes.
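The master/slave interaction described in this appendix can be condensed into the following sketch, which uses the documented task layout; TASKSIZE, the tag values, and the helper names are assumptions rather than the actual prototype code.

    #include <mpi.h>

    #define TASKSIZE 8
    #define TAG_ext  1          /* extended task tag (placeholder value) */
    #define dieTAG   2          /* shutdown signal   (placeholder value) */

    /* Rank 0: build a task {src, dst, size, address, 0...} and hand it to a slave. */
    void assign_task(int slave, int src, int dst, int size, int addr)
    {
        int task[TASKSIZE] = { src, dst, size, addr, 0, 0, 0, 0 };
        MPI_Send(task, TASKSIZE, MPI_INT, slave, TAG_ext, MPI_COMM_WORLD);
    }

    /* Non-zero ranks: poll for tasks until told to shut down. */
    void slave_loop(int rank)
    {
        int task[TASKSIZE];
        MPI_Status st;
        for (;;) {
            MPI_Recv(task, TASKSIZE, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == dieTAG)
                break;
            /* Codec(task[3], task[2], rank) would run here on the shared frame. */
        }
    }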


Appendix B: Video System – Hardware System

Description:
The hardware system is a heterogeneous video processing system built upon Jeff Goeders' video streaming framework. The software ranks provide a scalable solution to video processing; this project has been developed as a proof-of-concept implementation. The hardware engines interface with the network through TMD-MPE v1.0, and the soft-core processors interface with the network through the TMD-MPI library v1.0.

The Block Diagram:

Instructions for generating the bitstream:
1) First source the Xilinx ISE 10.1 suite: "source /opt/xilinx10.1/10.1.sh"
2) In the ./xps_streaming and ./xps folders, type:
  "make -f system.make clean" (cleans the created netlist)
  "make -f system.make libs" (generates the software libraries)
  "make -f system.make bits" (generates the unmodified bitstream)

Distributed model, streaming example:
3) In the project root folder, ./xps_streaming needs to be renamed to ./xps.
4) Execute the Python script to generate the final bitstream: "./compile.py ./streaming.cfg". Find the generated bitstream in ./bit.

Shared memory model example:
3) In the project root folder, the system is built in ./xps.
4) Execute the Python script to generate the final bitstream: "./compile.py ./compile.cfg". Find the generated bitstream in ./bit.

Tips for adding additional ranks:

- The NetIf channels must be connected in the correct order; each channel number corresponds to the NetIf number it connects with.
- The order in which a new core is defined in the ./*.cfg file is the order of the ranks (0 being the first item, 1 next, 2 following, etc.).
- The new rank definition must be added to the ./rt/rt_m0_b0_f0.mem file; the new rank number is added at the beginning, since the routing table is defined in reverse order.
- The number of routing table words must increase by 2 every time a new rank is added; this parameter, called C_NUM_OF_WORDS, can be found in ./xps/system.mhs.
- The new rank should always be initialized by Rank 0 to ensure the correct order of operation and prevent race conditions.

DIP Switches:
The DIP switches at the bottom right of the Virtex-5 board are used for both debugging and system settings.
SW_Pin1 & SW_Pin2: Used for the UART mux; supports up to 4 Microblazes.
SW_Pin3 & SW_Pin4: Used only in the shared memory model; must be set before the system initializes.

SW_Pin[4:3]  Codec Effect
0            no modifications to the video
1            introduces a dotted pattern to the video
2            causes colour inversion of the video
3            divides the screen into four sections and the following codecs are applied: dotted

SW_Pin 5-8: The four-bit control signal for the debug mux; signals from different cores are displayed on the GPIO LEDs based on this value:

SW_Pin[8:5]  Signal displayed on LEDs
0            vga_in_to_fsl_0_o_DBG_H_CNT
1            vga_in_to_fsl_0_o_DBG_V_CNT
2            vga_in_fsl_to_mpe_0_o_DBG_CS
3            vga_in_fsl_to_mpe_0_o_DBG_SEND_CNT
4            vga_mpe_to_ram_0_o_DBG_RECV_CS
5            vga_mpe_to_ram_0_o_DBG_RECV_CNT
6            vga_mpe_to_ram_0_o_DBG_PLB_CS
7            vga_mpe_to_ram_0_o_DBG_WRITE_FRAME_CNT
8            vga_in_analyze_0_o_DBG_H_SYNC_WIDTH
9            vga_in_to_fsl_0_o_DBG_V_SYNC_WIDTH

Rank 0 – The Control Microblaze:
This software process is responsible for initializing the system and directing the traffic for the codec processes.

Parameter     Description
BASEADDR      The base address defined for the video frame to be processed
HIGHADDR      The high address defined for the video frame to be processed
TFT_ADDR      The starting address of the DVI-out memory space
TMR_ADDR      The address of the xps_timer counter
GPIO_DIP_SW   The address for reading SW_Pin[4:3]
FPS_DISPLAY   The enable signal for displaying the FPS information
DEBUG_LEVEL   The debugging level (see the source code for more details)
DEBUG_REPS    The number of repetitions for certain debugging stages

Other Tips:

- A null-modem RS-232 cable is required for the Virtex-5 board.
- The system infrastructure can be updated purely through the system.mhs and system.mss files.
- Initializing the DVI can sometimes fail if the port is plugged in; simply remove it and plug it back in after the configuration is complete.


Appendix C: File Structure

The design is located in /work/zhoutony/video_proc_Microblaze/

File/Folder              Description
EDK-XUPV5-LX110T-Pack    XUP Virtex-5 board definitions
arches-mpi               TMD-MPI software library
compile.py               Script to compile source code
compile.cfg              Configuration file for the shared memory model
streaming.cfg            Configuration file for the distributed memory model
doc                      Documentation folder
sim_scripts              Scripts and files necessary for simulation
xps                      Shared memory model project files
xps_streaming            Distributed memory streaming project files
src/mb0_streaming.c      Rank 0 code for the distributed memory model
src/mb1_streaming.c      Rank 3 code for the distributed memory model
src/mb0_multi_main.c     Rank 0 code for the shared memory model
src/mbx_multi_main.c     Rank 3-N code for the shared memory model
