

Analysis and Acceleration for Target Recognition

by

Jairam Ramanathan

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of

Master of Engineering

at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2001


© Jairam Ramanathan, MMI. All rights reserved.

The author hereby grants to MIT permission to reproduce and distribute publicly paper and electronic copies of this thesis document in whole or in part.

Author .......................................... Department of Electrical Engineering and Computer Science

February 6, 2001

Certified by ....... Paul D. Fiore
Senior Principal Engineer, BAE SYSTEMS
VI-A Company Thesis Supervisor

Certified by ....... Dan E. Dudgeon
Senior Staff, MIT Lincoln Laboratory
MIT Thesis Supervisor

Accepted by ......... Arthur C. Smith
Chairman, Department Committee on Graduate Students

Analysis and Acceleration for Target Recognition

by

Jairam Ramanathan

Submitted to the Department of Electrical Engineering and Computer Science on February 6, 2001, in partial fulfillment of the requirements for the degree of Master of Engineering

Abstract

This thesis examined the hardware acceleration properties of automatic target recognition algorithms. It specifically focused on an algorithm produced by the System-Oriented High Range Resolution Automatic Recognition Program at Wright-Patterson Air Force Base. Analysis of this algorithm determined which calculations would be most suitable for, and derive the most benefit from, hardware acceleration. The algorithm was appropriately modified and restructured to ease its hardware translation while not significantly affecting the algorithm's performance. A portion of the algorithm was then implemented and executed on a custom hardware board containing multiple field-programmable gate arrays, and the timing and algorithmic performance were compared with the corresponding software execution statistics. The near order-of-magnitude speedup showed the viability of custom hardware acceleration for target recognition algorithms.

VI-A Company Thesis Supervisor: Paul D. Fiore
Title: Senior Principal Engineer, BAE SYSTEMS

MIT Thesis Supervisor: Dan E. Dudgeon
Title: Senior Staff, MIT Lincoln Laboratory


Acknowledgments

Generous financial support for my work was provided by BAE SYSTEMS (formerly Sanders, a Lockheed Martin Company). In particular at BAE SYSTEMS, I would like to thank Dr. Cory Myers for his advice and guidance over the past three years. I would also like to thank my thesis advisors, Dr. Paul Fiore of BAE SYSTEMS and Dr. Dan Dudgeon of MIT Lincoln Laboratory, for their help in bringing my thesis to completion.

I would like to thank Ken Smith, John Zaino, and Marion Reine of BAE SYSTEMS, and Eric Pauer, formerly of Sanders, for their assistance and advice while I was undertaking my research.

I would finally like to thank my parents for their constant support and encouragement while I pursued my goals.


Contents

1 Introduction
1.1 Automatic Target Recognition
1.2 Acceleration Issues
1.3 Overview of the Thesis

2 Project Background
2.1 Initial Implementation
2.2 Wordlength Analysis
2.2.1 Bit Precision Assignment
2.2.2 Scheduling
2.3 Final Implementation
2.4 Synthesis and Generation
2.5 Target Device
2.6 Summary

3 SHARP Algorithm
3.1 SAR to HRR Conversion
3.2 Least Squares Fitting
3.3 Preprocessing
3.3.1 Power Transformation
3.4 Algorithm Performance
3.5 Summary

4 Software Analysis
4.1 Timing Analysis
4.2 Least Squares Fit
4.2.1 QR Factorization Approach
4.2.2 Matched Filter Approach
4.3 Power Transformation
4.4 Revised Timing Analysis
4.5 Summary

5 Hardware-Specific Analysis
5.1 Power Transformation
5.1.1 Vector Transformation
5.1.2 Vector Averaging
5.2 Least Squares Fitting
5.2.1 Bias Removal and Magnitude Normalization
5.2.2 Correlation
5.3 Summary

6 Performance Comparison
6.1 Bias Removal and Normalization
6.2 Correlation
6.3 Final Comparison Estimates
6.3.1 Algorithm Performance
6.3.2 Algorithm Timing
6.4 Proposed Improvements
6.5 Summary

7 Summary

A Acronyms

List of Figures

1-1 Overview of forced decision and threshold decision methods. (a) Forced decision. (b) Threshold decision.
1-2 SHARP Approach
2-1 Ptolemy Screenshots. (a) ACS Domain Palette (b) Example Dataflow Diagram
2-2 Wordlength Analysis Tool Outputs. (a) All cost-variance pairs (b) Pareto-optimal cost-variance pairs
3-1 Segmentation of SAR image to simulate Doppler filtering
3-2 SHARP Algorithm Block Diagram
4-1 Range shifting of a bmp263 target by using 70-windows. (a) Original 80-vector (b)-(l) 11 70-wide range shifts from rightmost to leftmost.
4-2 Range shifting of a bmp263 target by using 80-windows. (a) Original 90-vector. (b)-(l) 11 80-wide range shifts from rightmost to leftmost.
5-1 Power Transformation Block Diagram
5-2 Divider Output - Floating Point and <4.8> Fixed Point
5-3 LUT Output - Floating Point Input and <4.8> Fixed Point Input
5-4 LUT Output - Floating Point and <1.8> Fixed Point
5-5 LUT Output - Floating Point Input and <1.8> Transformed Divider Input
5-6 Normalization Block Diagram
5-7 Ptolemy Normalization Design
5-8 Correlation Block Diagram
5-9 Ptolemy Correlation Design
6-1 Execution Schedule for the Normalization Routine
6-2 FPGA Occupancy for the Normalization Routine
6-3 Input Signature to Normalization Routine
6-4 Normalization Routine - Software and Hardware
6-5 Normalization Routine - Floating Point and Modified Fixed Point
6-6 Execution Schedule for the Correlation Routine
6-7 FPGA Occupancy for the Correlation Routine
6-8 Correlation Target Signature
6-9 Correlation Template Signature
6-10 Correlation by Range Shift

List of Tables

3.1 Confusion Matrix by Vehicle Using Original AFRL Code
3.2 Confusion Matrix by Vehicle Type Using Original AFRL Code
4.1 Timing Analysis of Original Software
4.2 Confusion Matrix by Vehicle Using Matched Filter
4.3 Confusion Matrix by Vehicle Type Using Matched Filter
4.4 Timing Analysis of Optimized Software
5.1 Elementary Function Blocks
6.1 Normalization Design Bit Precisions
6.2 Correlation Design Bit Precisions
6.3 Confusion Matrix by Vehicle Using Fixed-Point Simulation
6.4 Confusion Matrix by Vehicle Type Using Fixed-Point Simulation

Chapter 1

Introduction

Automatic target recognition (ATR) is an important part of many military applications. The ability to discriminate between hostile and friendly targets as well as the ability to differentiate between various hostile targets are often mission-critical objectives for real-time systems. As such, it is a common priority to produce the best possible recognition performance in the least possible time.

In terms of execution speed, there are limits to what can currently be accomplished in software. Many high-speed applications are turning to dedicated hardware to provide an execution speed that software cannot attain. ATR is a promising candidate for hardware execution: ATR systems are computationally intensive, but the computations performed are highly repetitive, so even a minor speedup of the core operations can significantly improve overall execution time.

In this chapter, we present a brief introduction to ATR, examine the major issues to be considered for acceleration, discuss our method of approach, and finally present an outline for the remainder of the thesis.

1.1 Automatic Target Recognition

Target recognition entails classifying a target given characteristic signatures (templates) of several target classes. It is important to note that the number of different types of observable targets will generally exceed the number of targets for which templates are available. Consequently, a design choice must be made to deal with the situation in which the target does not belong to any of the template classes. There are two reasonable options: the identifier can either match the target as well as possible to one of the template classes (forced decision) or it can leave the target unclassified (threshold decision). These approaches are shown in Figure 1-1.


Figure 1-1: Overview of forced decision and threshold decision methods. (a) Forced decision. (b) Threshold decision.

There are strengths and shortcomings to both threshold-decision and forced-decision approaches. Forced-decision methods are guaranteed to classify all known targets; however, they will also (incorrectly) identify all unknown targets. Conversely, threshold-decision methods will leave most unknown targets unclassified, but they will generally also fail to identify some known targets.

It is computationally more intensive to implement a threshold-decision approach. Before classifying any target, the decision must be made whether that target should be classified at all. In many cases, all the calculations for a forced-decision approach must be repeated in a threshold-decision approach, but end up unused due to the threshold comparison. Consequently, the method chosen for a particular application depends strongly on the performance and computational specifications. The extent to which the template set spans the possible target set also strongly impacts the choice of approach.
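The forced-decision/threshold-decision distinction can be made concrete with a small sketch. The code below is illustrative only — the normalized-correlation scoring rule, the template values, and the threshold are invented for the example and are not taken from SHARP:

```python
import math

def classify(signature, templates, threshold=None):
    """Toy template matcher: scores a target signature against each
    template by normalized correlation.  threshold=None gives a forced
    decision (the best match always wins); a numeric threshold gives a
    threshold decision (return None, i.e. "unknown", when even the
    best score is too low)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    scores = {}
    for name, tmpl in templates.items():
        dot = sum(a * b for a, b in zip(signature, tmpl))
        scores[name] = dot / (norm(signature) * norm(tmpl))
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None  # leave the target unclassified
    return best

templates = {
    "tank":  [1.0, 4.0, 2.0, 0.5],
    "truck": [3.0, 1.0, 1.0, 2.0],
}
unknown = [0.1, 0.1, 5.0, 0.1]  # resembles neither template closely
print(classify(unknown, templates))                 # forced decision
print(classify(unknown, templates, threshold=0.9))  # threshold decision
```

Note that the threshold variant performs all the work of the forced variant and then may discard the answer, which mirrors the computational-cost point made above.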


Figure 1-2: SHARP Approach

The Air Force Research Laboratory (AFRL) at Wright-Patterson Air Force Base (WPAFB) has developed an algorithm that implements a forced-decision method. The algorithm attempts to classify targets given their high range resolution (HRR) radar signatures. Their approach is called the System-Oriented HRR Automatic Recognition Program (SHARP) [32]. The basic premise of the SHARP approach is shown in Figure 1-2.

The SHARP objective is to develop and mature advanced air-to-ground HRR ATR capabilities for transition into suitable operational Air Force airborne platforms. It is clear that in an application of this type, it is desirable to minimize the latency between data collection and classification. Therefore, this thesis examined a framework that provided the SHARP algorithm with a significant speedup via hardware acceleration.

Field-programmable gate arrays (FPGAs) have received a great deal of attention in recent years. FPGAs are hardware devices that contain a large array of small configurable logic blocks (CLBs) surrounded by many routing resources [10]. Consequently, they can be easily reconfigured to perform functions required by the user. This thesis used custom hardware containing FPGAs to achieve the hardware acceleration of the SHARP algorithm.

1.2 Acceleration Issues

The first stage of acceleration is purely at the software level. In order to properly reap the benefits of hardware acceleration, an algorithm must be reorganized and modified in a manner that is conducive to hardware mapping. In many cases, this reorganization actually provides a speed gain for software execution, and the gain is even more pronounced in the hardware domain. These types of improvements are generally achieved by using alternate methods for calculations that increase the computational efficiency of the algorithm. This research examined the SHARP algorithm to determine what portions would benefit the most from an algorithmic speedup, and then attempted to improve their efficiency.

Another class of algorithm modification may result in better performance when implemented in hardware, but may actually slow the performance in software. Typically, these types of improvements are made by modifying the control flow of the algorithm. Software implementations of algorithms are able to make full use of the incoming data and alter their behaviors based on this data. However, it is expensive and inefficient to change the steps of an algorithm in hardware. Designs in which the flow of the data is independent of what the data contains are more suitable for implementation in hardware and result in better performance than algorithms with variable control flows. Every effort must be made to remove data dependencies from the control flow in order to make the best use of the available hardware resources. This research examined all data dependencies in the SHARP algorithm and designed reasonable alternatives for dealing with them.

An additional factor that must be considered in the mapping of an algorithm to hardware is the precision of the data. While most software algorithms use floating-point calculations, it is currently impractical to implement floating-point operations in hardware. Even assuming fixed-point arithmetic, decisions must be made regarding the data precisions to be kept at various points in the algorithm. It is important to note that whenever floating-point data is truncated to a fixed-width precision, error is introduced. Therefore, bit precisions were chosen in a manner that minimized the effect on the output.
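As a rough illustration of how truncation to a fixed precision introduces error, here is a toy quantizer in the major-bit/wordlength style described in Section 2.2. The function name and the example <2.10> precision are invented for this sketch and do not come from the actual designs:

```python
import math

def quantize(x, major, wordlength):
    """Truncate x to a two's-complement fixed-point value whose most
    significant bit has weight 2**major and which keeps `wordlength`
    bits in total (a sketch of the major-bit/wordlength model)."""
    lsb = 2.0 ** (major - wordlength + 1)   # weight of the least significant bit
    k = math.floor(x / lsb)                 # truncation introduces error < 1 LSB
    k = max(-(2 ** (wordlength - 1)), min(2 ** (wordlength - 1) - 1, k))  # saturate
    return k * lsb

# pi in a hypothetical <2.10> format: major bit 2, 10 bits total
q = quantize(3.14159, major=2, wordlength=10)
print(q)                # → 3.140625
print(abs(q - 3.14159)) # the truncation error introduced by narrowing
```

Widening the wordlength shrinks the LSB and thus the truncation error, but (as discussed in Chapter 2) at a hardware cost — which is exactly the trade-off the wordlength analysis tool optimizes.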

While the aforementioned considerations can be viewed as hardware independent, it is essential that the actual specifics of the hardware be taken into consideration. It is common practice to make use of libraries of basic hardware elements when generating a large-scale design. It is possible that these libraries may not contain some elements required by the design. In such cases, a decision must be made to either create a design for the missing element or alter the algorithm to make use of existing blocks. These decisions generally must be made on a case-by-case basis because the complexities of both approaches depend on the specific situation.

In addition, the physical hardware resources must be considered. Many hardware boards contain more than one FPGA, so it is necessary to determine an efficient way to partition the algorithm to make use of the available resources. Ideally, a design would use as much of each FPGA's area as possible, since speedup is generally proportional to area in FPGA designs. The transfer of data between FPGAs must also be considered: there are typically a limited number of interconnects between FPGAs, restricting the amount of data that can be passed from one FPGA to another. This research attempted to partition the SHARP algorithm in a manner that made use of the available resources while still facilitating analysis and debugging.

A final concern to be addressed is data transfer to and from the hardware. For most applications, the time needed to write data to the memory of the hardware board and to read back the output is not negligible. In fact, for small-scale designs, the data transfer time may exceed the time actually spent on computation. Consequently, it is important that the data transfer be managed so that the speed benefit gained by executing the computation in hardware is not outweighed by the time lost in data transfer.
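A back-of-the-envelope model makes this point concrete. The sketch below (all numbers hypothetical) charges the transfer time against the raw hardware speedup to give the net speedup actually observed:

```python
def effective_speedup(t_compute_sw, hw_speedup, t_transfer):
    """Net speedup of offloading a routine to hardware once the I/O
    time to move data on and off the board is charged against it.
    All times in the same (arbitrary) units."""
    t_hw_total = t_compute_sw / hw_speedup + t_transfer
    return t_compute_sw / t_hw_total

# hypothetical numbers: a 100 ms software routine, 10x raw hardware speedup
print(effective_speedup(100.0, 10.0, 5.0))    # modest transfer time: ~6.7x net
print(effective_speedup(100.0, 10.0, 200.0))  # transfer dominates: net slowdown
```

When the transfer term dominates, the net speedup falls below 1 — the hardware version is slower than software, which is why small-scale designs can fail to pay off.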

By addressing the above issues, this thesis shows that it is possible to produce a hardware design that provides a significant speedup in execution time for the SHARP algorithm while not adversely affecting the algorithm's performance. As a result, it will be more feasible to incorporate computationally intensive ATR methods in practical real-time systems.

1.3 Overview of the Thesis

We begin in Chapter 2 by presenting our approach to generating hardware designs. The approach was to utilize the tools produced by the Algorithm Analysis and Mapping Environment for Adaptive Computing Systems (ACS) [4] project at BAE SYSTEMS. In this chapter, we will present an overview of the methodology of the ACS project.

In Chapter 3, we present a development of the SHARP algorithm. The necessary computations are arranged in two separate components that execute sequentially in order to facilitate analysis. We also present the results of executing the SHARP algorithm in order to measure its performance.

Chapter 4 begins the analysis of the SHARP algorithm in a way that prepares the algorithm for hardware implementation, without making decisions that rely on the use of any particular hardware device. All modifications to the SHARP algorithm are performed strictly at the software level. We use a timing analysis in order to better focus our efforts and present the modified performance and timing results at the end of the chapter.

In Chapter 5, we discuss modifications and analyses that focus on the hardware level. This chapter discusses both design decisions that would be necessary for any hardware implementation as well as decisions that are specifically intended for the hardware board used on the ACS program.

Chapter 6 presents the hardware implementation of the algorithm and compares its performance with that of its software counterpart. We provide timing estimates and discuss how they could be improved through further design modification. We also discuss the extent to which the hardware acceleration benefits the algorithm as a whole.

Finally, in Chapter 7, we summarize our findings and discuss the results of this research.


Chapter 2

Project Background

In order to create the hardware designs used to accelerate the SHARP algorithm, this research made use of the tools developed by the Algorithm Analysis and Mapping Environment for Adaptive Computing Systems (ACS) [4] project at BAE SYSTEMS. The goal of this project is to decrease the amount of time spent creating hardware designs by automating much of the design process.

The ACS project has specifically concentrated on implementing digital signal processing (DSP) algorithms in field-programmable gate arrays (FPGAs), because FPGAs offer very high-speed processing at a small hardware cost [11]. To demonstrate the potential of FPGA computation for DSP algorithms, the ACS tools were used to implement a Winograd Discrete Fourier Transform (DFT) [24] and a linear FM detector [25].

It has also been shown [30] that FPGAs are particularly suitable for ATR applications because logic can be configured down to the bit level. While FPGAs do not offer the computing power of other technologies, such as application-specific integrated circuits (ASICs), their high speed-to-cost ratio makes them attractive to developers. Additionally, unlike ASICs, which are programmed at the factory, FPGAs can be repeatedly reprogrammed by the user to suit the functionality needed [10]. FPGAs have been used to accelerate both target detection [17] and target recognition algorithms, typically using SAR imagery [6], [9], [30].

For these reasons, it was decided to employ the tools provided by the ACS project in order to achieve hardware acceleration of the SHARP algorithm. Because development of the ACS tools was continually progressing, this involved contributing to the actual development of the software in order to be able to reap the benefits of its capabilities.

This chapter will present the methodology of the ACS project. Section 2.1 will describe the method used to lay out an algorithm in software as a starting point from which hardware design can begin. Section 2.2 will present the tool used to automatically set bit precisions throughout the algorithm as well as to schedule the execution of the algorithm components. Section 2.3 will describe how this information was then employed to begin the generation of hardware-level components. Section 2.4 will show how the design process was brought to completion, resulting in full hardware designs that are ready for implementation. Finally, in Section 2.5, we present specifics of the target hardware device used on the ACS program.

2.1 Initial Implementation

The first step taken in the ACS program to create a hardware design is to lay out the algorithm in Ptolemy [29], a software program developed at the University of California at Berkeley. Ptolemy is a simulation tool that enables designers to build and test large-scale systems using elementary building blocks. Since this is very much the same approach taken when building large-scale hardware designs, the ACS project created a new domain in Ptolemy that provides the functionality necessary to begin hardware design. A very similar approach, using the Khoros simulation tool produced by Khoral Research [16], is shown in [19].

From an interface standpoint, the ACS domain of Ptolemy is very similar to its other domains. Namely, a system is built up by placing elementary computational blocks (such as adders, multipliers, etc.) on the workspace and connecting them together to build more complex functions. Figure 2-1 shows both a typical group of blocks (called a "palette") and a dataflow diagram, which will be used again later in this thesis. However, the difference between the ACS domain and the other domains in Ptolemy lies in the interpretation of the individual blocks.

Because Ptolemy is at its core a simulation environment, most domains are used for simulation. For example, in the synchronous dataflow (SDF) domain, the functionality of an adder is to sum its inputs and produce the result as its output. However, the goal of the ACS domain is not simulation; instead, each ACS block tries to encapsulate the information necessary to implement the corresponding element in hardware. Consequently, at the simplest level of abstraction, the adder block contains the instructions for building an adder in hardware.

Naturally, there are more layers of complexity to be addressed. There is no universal adder design capable of handling all situations. Different applications may require different precisions on the input or output, or may require different input-to-output latencies. Consequently, the ACS blocks have two primary functions. The first function is to generate files that can be used in conjunction with the wordlength analysis tool (to be discussed in Section 2.2) to help the designer configure the block to meet the needs of the larger system. We discuss this function below. The second function is to actually generate a design that fits the criteria that the designer has specified. This function will be discussed in Section 2.3.

The intention of the ACS program was to make each element as configurable as possible. For example, in addition to being able to set bit precisions on the input and output, designers would be able to choose among different designs for the same element, depending on whether area or latency was a priority. However, while the blocks have been structured so that these capabilities can be added, they are not currently implemented. Consequently, the main use of the first function of the ACS blocks is to determine the bit precisions for the fixed-precision calculations that occur in hardware.

The ACS blocks have many variable parameters. Many blocks operate on vectors rather than scalars, so vector lengths can be specified. Particular blocks, such as a gain element, may have parameters specific to their function, such as the gain level. Additionally, all blocks have bit precision parameters. If the designer knows the precisions that are required for the system and knows that the underlying design will be able to handle those precisions, they can be manually entered and locked.

Figure 2-1: Ptolemy Screenshots. (a) ACS Domain Palette (b) Example Dataflow Diagram

In many cases, the designer may not know what the precision requirements are for a specific block. In these cases, the wordlength analysis tool discussed in Section 2.2 can be used to suggest wordlengths. In order to use the wordlength tool, the ACS blocks must be able to generate files that describe their input/output relationships. These files are produced as MATLAB [21] scripts. They consist of files that calculate the output of the block from its input, calculate the possible output range given the ranges of its inputs, characterize the input-to-output latency, describe any restrictions imposed by the underlying design on the input and output wordlengths, and estimate the space needed for the design based on its parameters. The generation of these files will be discussed in Section 2.3.

2.2 Wordlength Analysis

The wordlength analysis tool serves two primary purposes. First, it is responsible for choosing bit precisions for all inputs and outputs that have not already been locked by the user. There are two components to the bit precision of a fixed-point number: the major bit, indicating the significance of the highest-order bit of the number in two's complement form, and the wordlength, indicating the number of bits used to quantize that number.

The second function of the tool is to generate scheduling information for the design. This information is necessary to compensate for the different latencies that may exist along different datapaths in the design. The schedule will be used in Section 2.3 to ensure that all inputs to a particular block arrive at the same time, regardless of when they are first available in hardware. We discuss each of these functions below.

2.2.1 Bit Precision Assignment

The first step in assigning bit precisions is to choose all of the major bits. Typically, this can be done easily using the information each block provides concerning its output range span. The tool imposes the requirement that the input precisions be chosen; consequently, the input range span is known. From this, the range spans for every node in the design can be calculated. As a result, each major bit is set to the minimum value that spans the required range.
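This range propagation can be sketched as simple interval arithmetic: each block derives its output span from its input spans, and the major bit is the smallest MSB weight covering that span. The helper names and the simplified covering rule below are illustrative, not the ACS tool's actual code:

```python
import math

def add_range(a, b):
    """Interval propagation through an adder: output span from input spans."""
    return (a[0] + b[0], a[1] + b[1])

def mul_range(a, b):
    """Interval propagation through a multiplier: check all corner products."""
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def major_bit(span):
    """Smallest m such that an MSB of weight 2**m covers the span
    (simplified; a real tool also handles two's-complement asymmetry)."""
    m = max(abs(span[0]), abs(span[1]))
    return math.ceil(math.log2(m)) if m > 0 else 0

x = (-1.0, 1.0)       # e.g. a normalized input sample
y = (0.0, 3.0)
s = add_range(x, y)   # (-1.0, 4.0)
print(major_bit(s))   # → 2
```

Chaining these propagation rules from the (known) input spans through the dataflow yields a span, and hence a major bit, for every node in the design.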

Once the major bits have been set, it is necessary to choose wordlengths for each node. Unlike the major bits, which are essentially prechosen due to range analysis, there is a large degree of flexibility in the wordlengths. Any set of wordlengths that satisfies the constraints specified by the different blocks is considered a feasible combination. However, most feasible combinations are of little practical value. It is therefore necessary to establish a way to score each combination and determine which of these combinations are "optimal."

The method used to optimize the design wordlengths is described in detail in [10] and [13]. The tool makes use of the Markov chain Monte Carlo (MCMC) method [7] to choose wordlengths that are jointly optimized with regard to output noise variance and hardware cost. Specifically, the only wordlength combinations of interest are those for which no other design has both lower variance and lower hardware cost. These combinations are known as "Pareto-optimal" points.

The MCMC method is used as follows. The initial starting point is a large number of randomly chosen wordlength combinations. Note that all wordlengths that have been explicitly chosen by the designer will not be changed. Combinations that violate any of the constraints imposed by the individual blocks are rejected. Presumably, unless the wordlength constraints are highly restrictive, many feasible combinations remain.

At this point, the combinations are evaluated with regard to cost and variance. Any combinations that are not Pareto-optimal are rejected. The remaining combinations are randomly perturbed: from a given combination, the subsequent combination will either be identical to the original or will differ by 1 in one of the wordlengths. Assuming there are n wordlengths to be chosen, this results in 2n + 1 possible perturbations. Again, all designs that violate constraints or are not Pareto-optimal are rejected.
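The iteration described above can be sketched as follows. The cost and variance models here are invented stand-ins (total wordlength as cost, a 2^(-2w) quantization-noise term per node as variance), and the loop structure is simplified, but the Pareto-filter-plus-perturbation mechanism is the same:

```python
import random

# toy models (assumed for illustration): hardware cost grows with total
# wordlength; truncation-noise variance shrinks as 2**(-2w) per node
def cost(wls):
    return sum(wls)

def variance(wls):
    return sum(2.0 ** (-2 * w) for w in wls)

def dominated(p, pop):
    """True if some other combination has lower-or-equal cost AND
    variance, with at least one strictly lower (Pareto dominance)."""
    cp, vp = cost(p), variance(p)
    return any((cost(q) <= cp and variance(q) < vp) or
               (cost(q) < cp and variance(q) <= vp) for q in pop)

def search(n_nodes=3, bounds=(4, 16), iters=2000, seed=0):
    """Start from random feasible combinations, discard dominated ones,
    and randomly perturb survivors by +/-1 in a single wordlength."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(*bounds) for _ in range(n_nodes))
           for _ in range(50)]
    for _ in range(iters):
        p = list(rng.choice(pop))
        i = rng.randrange(n_nodes)
        p[i] += rng.choice((-1, 1))          # one of the 2n possible moves
        if bounds[0] <= p[i] <= bounds[1] and tuple(p) not in pop:
            pop.append(tuple(p))
        pop = [q for q in pop if not dominated(q, pop)]  # keep the front
    return sorted(set(pop), key=cost)

front = search()
# moving along the front, paying more cost never buys a worse variance
vs = [variance(p) for p in front]
assert all(a >= b for a, b in zip(vs, vs[1:]))
print(len(front))
```

The surviving combinations are exactly the kind of cost-variance front shown in Figure 2-2b, from which the designer picks the highest-cost design that still fits the physical hardware.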

Through repeated iteration of this process, several steady-state points will be reached. From any steady-state point, altering any of the wordlengths will either violate a constraint or result in a higher output variance or hardware cost. These jointly optimized combinations are then presented to the designer. Typically, the designer will want to choose the design with the highest cost that will still fit in the physical hardware, since these designs will usually have the lowest variances. However, in some circumstances, a lower-cost design may be desired, so the user is free to choose a different design. Figure 2-2 shows typical outputs of the wordlength analysis: Figure 2-2a shows all of the cost-variance pairs encountered during iteration, while Figure 2-2b shows only those pairs that are Pareto-optimal.

Once the wordlengths have been set, all that remains to be done is scheduling, after which the design is ready to be created as in Section 2.3.

2.2.2 Scheduling

Recall that the purpose of scheduling is essentially to synchronize the inputs to all

blocks. This is achieved through use of the latency information provided by each

block. Latency generally depends on the wordlength chosen for a block. Therefore,

it is to be expected that different wordlength combinations will result in different

schedules.

Scheduling is achieved in a feed-forward sense, beginning with the data sources and

progressing through to the data sinks. All blocks that are connected to the sources

and have no other inputs can execute as soon as the schedule is launched. This

rationale is applied to every block. Namely, a block can execute as soon as all of its

inputs are available. Using the latency information, it is reasonably straightforward

to schedule all the blocks.
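This as-soon-as-possible rule can be sketched in a few lines. The following Python sketch assumes the design is given as a predecessor list with per-block latencies; the names and data layout are illustrative, not the ACS tool's API:

```python
def schedule(edges, latency):
    """Feed-forward (as-soon-as-possible) scheduling: a block starts as
    soon as all of its inputs are available.  `edges` maps each block to
    its list of predecessor blocks; sources (no predecessors) start at
    cycle 0."""
    start = {}

    def start_time(block):
        if block not in start:
            # an input arrives when its producer starts plus that
            # producer's latency
            start[block] = max((start_time(p) + latency[p]
                                for p in edges[block]), default=0)
        return start[block]

    for block in edges:
        start_time(block)
    return start
```

For example, with `edges = {"src": [], "a": ["src"], "b": ["src"], "add": ["a", "b"]}` and `latency = {"src": 0, "a": 1, "b": 3, "add": 1}`, the adder starts at cycle 3 even though block `a`'s output is ready at cycle 1, which is exactly the situation requiring an inserted synchronization delay.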

Note that in many cases, this scheduling implies that while the output of a block

may be available, it may not actually be consumed for many clock cycles. Conse-

quently, some delay must be introduced to synchronize the datapaths in hardware.

The wordlength analysis tool includes these anticipated delays in its cost estimates.

The insertion of these delays will be discussed below.

Figure 2-2: Wordlength Analysis Tool Outputs. (a) All cost-variance pairs (b) Pareto-optimal cost-variance pairs


2.3 Final Implementation

Once the wordlengths and schedule have been generated, this information is once

again used by the ACS domain in Ptolemy to actually begin creating the design. The

bit precisions provided by the wordlength analysis tool are entered and locked into

Ptolemy. Ptolemy then uses the generation capability of each block to actually create

designs that meet the user specifications. This is currently achieved in one of two

ways. The block will either contain instructions for generating a custom-designed

Very High Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL)

file or initiate a call to the external Core Generator tool, produced by Xilinx [34]. An

invocation of the Core Generator will result in both a VHDL file and an Electronic

Design Interchange Format (EDIF) file.

It is from these underlying designs that the Ptolemy blocks were initially con-

structed. All of the information that the blocks provide to the wordlength analysis tool consists of properties of either the custom-made designs or the generic designs provided

by the Core Generator. Consequently, the characteristics of the designs have been

encapsulated and analyzed before any designs are actually instantiated.

In addition to generating the individual hardware designs, Ptolemy is also respon-

sible for connecting the individual blocks so that they function together as a system.

There are two factors to be considered in making connections between blocks. First,

any differences in bit precision on either end of a connection must be accounted for.

Note that the two ends of a connection never differ in the position of the most significant bit, only in wordlength. Additionally, the wordlength can only decrease across a connection, since an increase in wordlength would only result in artificial bits being appended. Consequently, Ptolemy ensures that all connections are made properly by trimming off any unneeded low-order bits.

Additionally, Ptolemy must implement the delays implied by the schedule. For

example, if a block produces its output at clock cycle 1, but the block that will

consume this output can only execute at clock cycle 3 due to latency on another

input, Ptolemy inserts a delay of 2 clock cycles on this connection to synchronize the


inputs. These additions to accommodate varying bit precisions and input latencies

are made through the inclusion of additional VHDL files.

2.4 Synthesis and Generation

There are two remaining tasks to be completed before the design can be used. The

VHDL files that have been generated are read into Synplify, a software tool produced

by Synplicity [28]. This tool is used to simulate the hardware execution of the design

to ensure that it performs as expected. Once the simulation results are approved, the

design is synthesized. Recall that only the VHDL files produced by invoking the Core

Generator have corresponding EDIF files. Synthesis of the design produces EDIF files

for the remaining VHDL files.

The final task is processing all of the EDIF files using the Alliance Series tools

produced by Xilinx [34]. This results in final bitstreams which can be used to program

the hardware and actually execute the design. Execution is accomplished through use

of software libraries provided by the board manufacturer to communicate with the

hardware. The Alliance tools were used to target an Annapolis Micro Systems [2]

Wildforce hardware board, which is described below.

2.5 Target Device

The hardware details of the Wildforce board are presented in [3]. Here we will only

present a brief description of those features of the Wildforce that will play a role in

our design process.

The Wildforce board has 5 Xilinx [34] XC4062 FPGAs labeled CPE0 and PE1-PE4. However, CPE0 is used for the external interface: all start signals are sent to CPE0, and CPE0 returns all the interrupts. Consequently, there are 4 FPGAs available for

use in design. With each FPGA is an associated memory bank, which can contain

262,144 (256K) 36-bit values.

Each FPGA has a 36-bit connection to its associated memory. Additionally, there


is a 32-bit connection from each FPGA to a local bus which can be used to transfer

data between FPGAs. Through the use of the Peripheral Component Interconnect

(PCI) bus, the external world also has 36-bit connections to each of the memory

banks.

2.6 Summary

This chapter presented the approach of the ACS project. Ptolemy was used in con-

junction with the wordlength analysis tool to fully specify the design and then to

begin hardware generation. Several tools were then used to simulate and complete

generation of the design, resulting in bitstreams that could actually be implemented

in a reconfigurable hardware board.

We also presented some specifications of the Wildforce board, which was used

as the target device by the ACS program. The role of these specifications in our

hardware-specific analysis of the SHARP algorithm will be shown in Chapter 5. The

results of using the ACS tools for accelerating the SHARP algorithm will be presented

in Chapter 6.


Chapter 3

SHARP Algorithm

Moving target identification is a long-standing military objective. [26] provides a brief

overview of the recent history of ATR development, culminating in the current focus

on hardware implementation for target identification. The ability to classify an un-

known target regardless of velocity, orientation, or heading has obvious applications.

Various approaches have been taken to attempt to handle this problem. Traditional

synthetic aperture radar (SAR) approaches have been tried [5]; however, SAR ATR of

moving targets has been less successful than SAR ATR of stationary targets. Due to

the long processing interval inherent with SAR imagery, target information is prone

to blurring, resulting in degraded ATR performance.

Target identification using moving target indication (MTI) radar has also been at-

tempted [14]. MTI radar has the advantage over SAR imagery of being able to detect

and track moving targets. However, MTI radar lacks the bandwidth necessary for

target discrimination. Consequently, although targets can be detected, they cannot

be identified.

As a result, the Air Force Research Laboratory (AFRL) at Wright-Patterson Air

Force Base (WPAFB) has begun investigating the potential of high range resolution

(HRR) radar for the moving target problem [22]. HRR radar is well-suited for iden-

tifying moving targets for two reasons. Unlike SAR, HRR radar signatures can be

formed quickly, allowing the ground clutter to be separated from the target through

the use of Doppler filtering [20]. Additionally, because of its large range resolution,


HRR provides the bandwidth lacking in MTI radar, making target discrimination

feasible.

The AFRL initiated the System-oriented HRR Automatic Recognition Program

(SHARP) [32] in order to develop HRR ATR capability. As part of the SHARP pro-

gram, an algorithm was designed that implemented a forced-decision ATR approach.

The algorithm classifies targets based on their signatures and aspect angles using an

extensive template set.

The SHARP algorithm makes use of the dataset from the Moving and Stationary

Target Acquisition and Recognition (MSTAR) [15] database to measure its perfor-

mance. The subset of the MSTAR dataset used had seven classes, consisting of three

BMP2 armored personnel carriers (APCs), one BTR70 APC, and three T72 main

battle tanks. There were two sets of data available: the training set and the test set.

The training set was taken at a 17° depression angle and consisted of roughly 230 different aspect angles for each class spread across the total 360° span. The test set was taken at a 15° depression angle and consisted of roughly 195 aspect angles for each class.

The results of using the SHARP algorithm to classify the entire test set are stored

in a confusion matrix. Confusion matrices are described in [8] and provide a way

to measure the performance of a classification algorithm. The rows of the confusion

matrix represent the test set, while the columns represent the template set. The value

in row i and column j of the matrix is the percentage of signatures in Class i that

were classified as Class j.
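As a concrete illustration (a generic sketch, not the AFRL code), the row-normalized percentages can be computed from lists of actual and predicted class labels:

```python
import numpy as np

def confusion_matrix(actual, predicted, n_classes):
    """Element (i, j) is the percentage of class-i test signatures
    that were classified as class j."""
    counts = np.zeros((n_classes, n_classes))
    for a, p in zip(actual, predicted):
        counts[a, p] += 1
    # each row sums to 100% of that class's test signatures
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)
```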

The MSTAR data was available as SAR imagery rather than HRR signatures.

In Section 3.1, we discuss the calculations necessary to obtain simulated HRR data

from the SAR images. These calculations were necessary so that there was data with

which to test the algorithm performance, but they were not considered part of the

algorithm (which is intended to operate on real HRR data). The SHARP algorithm

consists of some preprocessing and a least squares fitting, but we present these out of

order in Sections 3.2 and 3.3 because their development is more intuitive in this order.

Finally, in Section 3.4 we present the results of executing the SHARP algorithm on


the MSTAR dataset.

3.1 SAR to HRR Conversion

The approach of the SHARP algorithm is to employ the characteristics of HRR

radar signatures to achieve better results for moving target recognition. However,

the MSTAR dataset used to determine algorithm performance consisted only of SAR

imagery. It was therefore necessary to perform some preprocessing of the data in

order to be able to simulate HRR data and get a better estimate of the algorithm

performance.

The method used to obtain simulated HRR data from the available SAR images is

discussed in [14] and [32]. The dimensions of the original SAR images were 128×128. These were subsampled by taking the center 101×70 pixels where the actual target was contained, as shown in Figure 3-1. This was done to simulate the results of the Doppler filtering which would be performed on actual HRR signatures to separate the target from the ground clutter. The subimages were then zero padded back up to 128×128 so that an inverse Fast Fourier Transform (IFFT) could be taken. The IFFT was taken on the dimension that was 70 pixels wide before zero padding and resulted in range signatures collected over the radar aperture (the dimension that was originally 101 pixels wide).
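These steps can be sketched with NumPy as follows. This is an illustrative reconstruction, not the original code: the choice of rows as the 101-wide aperture dimension is an assumption, and the inverse Taylor deweighting step is omitted.

```python
import numpy as np

def sar_to_hrr(sar_chip):
    """Simulated HRR formation from a 128x128 complex SAR chip:
    keep only the center 101x70 pixels (simulating Doppler filtering),
    zero-pad back to 128x128, and take the IFFT along the dimension
    that was 70 pixels wide before padding."""
    padded = np.zeros((128, 128), dtype=complex)
    r0 = (128 - 101) // 2          # 13: top of the 101-pixel window
    c0 = (128 - 70) // 2           # 29: left edge of the 70-pixel window
    padded[r0:r0 + 101, c0:c0 + 70] = sar_chip[r0:r0 + 101, c0:c0 + 70]
    return np.fft.ifft(padded, axis=1)   # one range signature per row
```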

Figure 3-1: Segmentation of SAR image to simulate Doppler filtering

The resulting signatures were finally deweighted in angle using an inverse Taylor



Figure 3-2: SHARP Algorithm Block Diagram

window [31] to make the energy of the signatures uniform. The result was a 128×128 matrix, representing 128 signatures that were each 128-wide in range. It is important to note that only the center 101×70 of this matrix represented real data while the rest was zero padding. This 128×128 matrix was used as the input to the algorithm;

it would be compressed to a single signature, which was then used for classification.

The method of performing this compression will be discussed in Section 3.3.

3.2 Least Squares Fitting

The dataflow of the SHARP algorithm is shown in Figure 3-2, a revision of Figure

1-1. We begin by presenting the least squares fit routine.

The SHARP algorithm minimizes an error metric and bases its classification deci-

sion on that minimum error. Consequently, it makes sense to begin our development

of the algorithm by defining this metric. Given a target vector d and a template

vector m, the SHARP algorithm tries to fit the vectors to each other by altering the

bias and magnitude of m. Specifically, the algorithm determines an optimal offset $k_1$ and an optimal gain $k_2$ such that $k_1 + k_2 m_i \approx d_i$, or equivalently,

\[
\begin{bmatrix} \mathbf{1} & m \end{bmatrix} k \approx d. \tag{3.1}
\]

Since in most cases it is not possible to find a k such that Equation 3.1 is exactly

satisfied, the least squares method provides a k that minimizes the error between the


two sides of Equation 3.1. This solution is developed in [27] and is given by

\[
k = \left( \begin{bmatrix} \mathbf{1} & m \end{bmatrix}^T \begin{bmatrix} \mathbf{1} & m \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{1} & m \end{bmatrix}^T d. \tag{3.2}
\]

Our primary focus is not to determine the least squares solution to this equation, but

to determine the error associated with that solution because our objective is to find

the minimum error. Naturally, the error is dependent on the choice of k and is given

by

\[
\epsilon^2 = \left( \begin{bmatrix} \mathbf{1} & m \end{bmatrix} k - d \right)^T \left( \begin{bmatrix} \mathbf{1} & m \end{bmatrix} k - d \right). \tag{3.3}
\]

Consequently, the SHARP algorithm determines the least squares solution k to Equa-

tion 3.1 and then evaluates Equation 3.3 to determine the resultant error.
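This fit-and-evaluate step can be sketched directly from Equations 3.2 and 3.3 in NumPy (an illustrative sketch, not the AFRL implementation, which uses the optimizations discussed in Chapter 4):

```python
import numpy as np

def ls_fit_error(m, d):
    """Solve Equation 3.2 for k = [k1, k2] and evaluate the fitting
    error of Equation 3.3 for template m and target d."""
    A = np.column_stack([np.ones_like(m), m])      # [1 m]
    k = np.linalg.solve(A.T @ A, A.T @ d)          # Equation 3.2
    residual = A @ k - d
    return k, float(residual @ residual)           # Equation 3.3
```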

Now that we have defined a metric by which to compare the various template

signatures, we must define the template set to which this metric will be applied. The

template set spans two dimensions which we must take into account: aspect angle

and range. Clearly, one possible solution is the brute force attempt; we can consider

template vectors spanning all aspect angles and consider all possible range shifts and

minimize the error across all the resulting comparisons. However, this is a highly

inefficient approach that fails to utilize knowledge of the data.

The SHARP algorithm assumes that the signal information is contained in the

center 70 range gates of the entire 128-gate span. The algorithm further assumes that

the templates are already close to being correctly aligned. Consequently, the decision

was made to only consider shifts within five range gates. This results in a dramatic

savings in computation when compared with considering all 139 possible range shifts.

In order to facilitate performing the eleven necessary range shifts, the target vector

is padded on each end with five zeros to produce an 80-wide vector. The eleven shifts

are generated by taking subvectors (1-70, 2-71, ... , 11-80) of the padded vector.
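The padding-and-windowing scheme can be sketched as follows (a minimal NumPy illustration, assuming a 70-gate input vector):

```python
import numpy as np

def range_shifted_targets(d):
    """Pad a 70-gate target vector with five zeros on each end and
    return the eleven 70-wide windows (gates 1-70, 2-71, ..., 11-80)."""
    padded = np.concatenate([np.zeros(5), d, np.zeros(5)])   # 80-wide
    return [padded[i:i + 70] for i in range(11)]
```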

To cut down on the number of aspects we must consider, we make use of the

fact that we know the aspect angle of the target vector. It therefore makes sense to

choose some subset of angles centered around the target aspect. In fact, [1] shows


that performance may actually be improved by constraining the aspect angles to avoid

mismatches that may occur at drastically different angles. The SHARP algorithm

only considers templates whose aspects are within 5° of the aspect angle of the target.

This is again a huge savings since a 360-degree span has been reduced to a 10-degree

span.

The result of these decisions is to reduce the number of error calculations that

must be computed to a manageable number. The template data contained signatures

that were multiples of 1° apart in aspect. Consequently, the maximum number of aspect angles to be considered for any given target aspect angle a is eleven (i.e., a − 5°, a − 4°, …, a + 5°). Therefore, we see that for a given target signature, we must

perform at most 11 (aspects) x 11 (ranges) x 7 (classes) = 847 error calculations.

3.3 Preprocessing

We recall that the input data was available as a 128×128 matrix. Additionally, we

know that the least squares fitting component of the SHARP algorithm expects the

data as a 70-wide (or 80-wide, depending on zero padding) vector. Consequently, it is

apparent that some preprocessing had to be done on the input data before the least

squares fit could be performed. Since we know that only the center 101x70 of the

input matrix represented the target data, only this submatrix was used for processing.

The first step in the preprocessing was to remove automatic gain control and

range effects from the signatures in order to improve classifier performance. This was

achieved through a normalization. Recall that each input signature r was 70-wide

and contained range gate magnitudes. From this vector, a power vector p was formed

such that $p_i = r_i^2$. The root mean square (RMS) amplitude of p was then calculated

and is given by

\[
p_{\mathrm{rms}} = \sqrt{\frac{1}{128} \sum_{i=1}^{70} p_i}, \tag{3.4}
\]

where the 1/128 factor is due to the mean being taken over all 128 range gates (including

zero padding). The power vector was then normalized by this RMS amplitude to


obtain

\[
n = \frac{p}{p_{\mathrm{rms}}}. \tag{3.5}
\]

Combining Equations 3.4 and 3.5, we obtain

\[
n = \frac{p}{\sqrt{\frac{1}{128} \sum_{i=1}^{70} p_i}}. \tag{3.6}
\]

We finally substitute for p and obtain

\[
n = \frac{r^2}{\sqrt{\frac{1}{128} \sum_{i=1}^{70} r_i^2}}, \tag{3.7}
\]

where the squaring is done elementwise.

Before the resulting signatures could be averaged to produce a single vector, an

additional transformation had to be performed to satisfy the constraints of the least

squares fit routine. This is discussed below in Section 3.3.1.

3.3.1 Power Transformation

It is important to note that using the method of error calculation above made the

SHARP algorithm a Gaussian classifier. As shown in [33], this means that the incom-

ing data was assumed to have a Gaussian distribution. Equivalently, the incoming

data had to be completely characterized by its first and second moments, i.e. its

mean (μ) and variance (σ²). This assumption was implicitly made by using Equation

3.3 to perform the least squares fit. Determining the optimal bias $k_1$ was equivalent to matching the means of the two signatures because the new mean $\mu_{\mathrm{new}}$ was equal to $k_1$. Similarly, finding the optimal gain $k_2$ was equivalent to matching the variances because the new variance $\sigma^2_{\mathrm{new}}$ was equal to $k_2^2 \sigma^2$.

Although the data was assumed to be Gaussian, it has been shown that the

statistics of HRR data are better modeled as having a Rayleigh distribution [22].

Consequently, in order to maximize the performance of the classifier, it was necessary

to transform the data. Empirical studies conducted in [1] determined that a trans-


form of the form $t = n^c$ would transform the distribution such that it more closely

resembled a Gaussian distribution. Specifically, using c = 0.2 resulted in improved

classifier performance.

Combining the power transformation with Equation 3.7, we see that the calcula-

tion performed on each of the 101 signatures is

\[
t = \left( \frac{r^2}{\sqrt{\frac{1}{128} \sum_{i=1}^{70} r_i^2}} \right)^{0.2}. \tag{3.8}
\]
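Combining Equations 3.4 through 3.8, the per-signature preprocessing can be sketched as follows (an illustrative NumPy version, not the original code):

```python
import numpy as np

def preprocess_signature(r):
    """Normalization plus power transformation (Equation 3.8) for one
    70-gate magnitude signature r.  The 1/128 factor reflects the mean
    being taken over all 128 range gates, including the zero padding."""
    p = r ** 2                                # power vector, p_i = r_i^2
    p_rms = np.sqrt(p.sum() / 128.0)          # Equation 3.4
    return (p / p_rms) ** 0.2                 # n = p / p_rms, t = n^0.2
```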

Up to now, the preprocessing was identical for both the template and target sig-

natures. At this point, the computation finally differed. Because we wanted the

template signatures to be as accurate as possible, all 128 signatures were averaged

together to obtain a single 70-gate vector. However, the target signatures should at-

tempt to simulate the quality of real HRR signatures. Consequently, only the center

eight profiles were averaged. Target vectors were finally padded with five zeros on

either end to accommodate the range shifting of the least squares fit.

3.4 Algorithm Performance

We present the results of execution of the SHARP algorithm on the 1,362 MSTAR test

signatures in two different formats. Table 3.1 shows the confusion matrix arranged

by vehicle. Table 3.2 shows the confusion matrix arranged by vehicle class. Recall

that element (i, j) indicated the percentage of Class i signatures that were assigned to

Class j. Through these tables, we can see quantitatively how the SHARP algorithm

performs.

We see from Table 3.1 that the performance by vehicle was certainly not optimal.

In fact, for the BMP263 and BMP2c21, less than 50% of the signatures were correctly

classified. However, we see from Table 3.2 that the majority of misclassifications were

still the same vehicle type. Consequently, we see that while the SHARP algorithm did

not perform well on a vehicle-by-vehicle basis, it did perform well on vehicle types. In

fact 86.64% of the target signatures were correctly matched with their vehicle class.


Table 3.1: Confusion Matrix by Vehicle Using Original AFRL Code

                                     Predicted
Actual       BMP263   BMP266   BMP2c21  BTR70c21  T72132   T72812   T72s7
BMP263       47.69%   16.92%   23.08%    3.59%    3.59%    2.56%    3.59%
BMP266        7.69%   60.00%   14.87%    4.10%    4.10%    3.59%    5.64%
BMP2c21      21.54%   12.31%   49.74%    4.62%    4.10%    3.59%    4.10%
BTR70c21      3.59%    4.10%    5.64%   76.92%    3.08%    3.08%    3.59%
T72812        0.00%    2.05%    1.54%    1.54%    5.64%   75.38%   13.85%
T72s7         1.57%    4.19%    2.62%    1.57%    8.90%   10.99%   70.16%

Table 3.2: Confusion Matrix by Vehicle Type Using Original AFRL Code

                       Predicted
Actual        BMP2     BTR70     T72
BMP2         84.62%    4.10%   11.28%
BTR70        13.33%   76.92%    9.74%
T72           6.87%    1.20%   91.92%

A detailed timing analysis of the SHARP algorithm will be presented in Chapter

4, but we note here that the time required for classification of the test signatures (ex-

cluding the time necessary for template formation) was approximately 21.64 minutes.

While a speedup of a process that only takes 21.64 minutes to execute is not very

significant, it is important to note that in a practical system, this time would be much

larger. It is likely that the template set would be much more densely populated in

aspect and would contain data for many depression angles. Consequently, we could

expect a realistic execution time to be on the order of days. In this case, the speedup

provided by hardware acceleration would clearly be significant.

We present the results of algorithm execution not only to demonstrate what the

capabilities and shortcomings of the SHARP algorithm are, but also to provide a

point of reference for our further analysis. Since we are attempting to provide the

same capability as the SHARP algorithm, but with hardware acceleration which will

significantly decrease the execution time, we need a quantifiable way to compare

the results. We would therefore like to see that executing some or all of the SHARP

algorithm in hardware does not significantly change the confusion matrices from those

in Tables 3.1 and 3.2.


3.5 Summary

In this chapter, we presented the methods used in the SHARP algorithm. We demon-

strated how the SAR imagery from the MSTAR dataset was processed to produce

simulated HRR signatures for classification. The SHARP algorithm was developed in

two parts: preprocessing and least squares fitting. The preprocessing consisted of a

normalization and a power transformation. This power transformation was necessary

because the least squares fit routine was implemented as a Gaussian classifier.

Since the goal of this research was to accelerate the algorithm without significantly

affecting performance, the results of executing the SHARP algorithm were presented

to serve as a reference point for future comparisons. Chapter 4 will begin our anal-

ysis of the SHARP algorithm, which will rely heavily upon an understanding of the

framework presented in this chapter.


Chapter 4

Software Analysis

In attempting to accelerate an algorithm, it is important that the original software be

optimized as much as possible. Achieving a 10 x speedup through the use of hardware

acceleration is not very impressive if a 5 x speedup can be gained simply by improving

the software. Thus, it is only possible to get a true measure of the impact of hardware

acceleration by comparison to a software implementation that is fully optimized.

Additionally, this software optimization can typically lead to better hardware

performance. Naturally, functions that require few computations will execute faster

in both hardware and software. In hardware, however, using fewer computational elements also frees FPGA area. Therefore, the existing blocks can be implemented using larger

area designs which can either improve the precision of the calculation or reduce its

latency. Consequently, it is clear that software optimization is an important step

towards hardware acceleration.

In Section 4.1, we present a timing analysis of the SHARP algorithm. Section

4.2 presents an analysis of the least squares fit routine. The section begins with the

original AFRL approach and then presents an alternative that significantly improves

its efficiency. Section 4.3 presents optimizations that target the preprocessing in the

SHARP algorithm. Finally, in Section 4.4, we present a revised timing analysis to

demonstrate the effects of these modifications and refocus further study.


4.1 Timing Analysis

It is important to recognize that the SHARP algorithm consisted of two separate

parts. Template signature formation comprised the first part while target signature

formation and classification made up the second part. The important distinction

between these parts is that the former could be computed and stored ahead of time.

The compilation of the template library was not considered part of the algorithm execution since the template data was static. Once the templates had been

computed, they could simply be stored and read on subsequent execution. Conse-

quently, our analysis of the SHARP algorithm focused only on the formation and

classification of target signatures.

We began our analysis by examining the execution of the SHARP algorithm to

determine the relative amounts of computation involved. A profiling of this sort

was helpful to determine what calculations dominated the execution time and would

therefore demonstrate the greatest benefit from an acceleration. Conversely, it also

prevented us from undertaking a detailed analysis to gain a significant speedup in a

computation that did not make up a significant portion of the algorithm execution

time. Through analysis of the initial execution of the algorithm, we were able to

better focus our efforts and achieve the greatest speedup.

All timing tests were conducted on a Sun Microsystems Ultra 5 running at 360

MHz with 320 MB of RAM. As mentioned above, we ignored the time required for

template formation. Consequently, we broke our analysis into three categories: target

loading/preprocessing, least squares fitting, and classification. The timing is shown

in Table 4.1 for all seven target classes, which we recall consisted of 1,362 signatures

(128x 128 matrices). All timings are shown in seconds.

As expected, the first priority for acceleration was the least squares fit routine.

Least squares fitting required many error calculations, each of which required several

vector multiplications. Additionally, the least squares fit was performed on many

vectors. Consequently, it was no surprise that it was responsible for most of the

execution time.


Table 4.1: Timing Analysis of Original Software

           Loading/Preprocessing   Least Squares Fit   Classification
BMP263            24.55                 164.16              0.11
BMP266            24.89                 160.08              0.11
BMP2c21           24.56                 162.55              0.11
BTR70             24.62                 161.00              0.11
T72132            24.54                 156.27              0.11
T72812            25.05                 161.48              0.11
T72s7             24.50                 159.52              0.11
Total            172.71                1125.06              0.77

We also see that the secondary area for acceleration was the preprocessing. While

we could not do much to improve the load time of the data, we could improve the

time required to preprocess the data into power-transformed HRR signatures. Con-

sequently, we also focused on the power transformation and attempted to improve its

execution time.

4.2 Least Squares Fit

We recall from Chapter 3 that the goal of the least squares fitting was to minimize the

error between the target signature and the class to which it was assigned. Specifically,

the SHARP algorithm determined the class that minimized the error between the

target signature and all template signatures that were within 5° in aspect and within

5 range gate shifts. In this context, the error to be minimized was

\[
\epsilon^2 = \left( \begin{bmatrix} \mathbf{1} & m \end{bmatrix} k - d \right)^T \left( \begin{bmatrix} \mathbf{1} & m \end{bmatrix} k - d \right), \tag{4.1}
\]

where m and d are column vectors representing the template and target signatures,

respectively, and k is a 2 x 1 vector consisting of the optimal offset and gain between

the vectors.


4.2.1 QR Factorization Approach

The approach taken by the original AFRL code to calculate the fitting error was

relatively straightforward. The code first calculated the least squares solution k and

then plugged it into Equation 4.1 to arrive at the fitting error. However, rather than

simply using Equation 3.2 to calculate k, the AFRL code employed a QR factorization

in order to reduce the amount of calculation necessary.

From [27], we know that the goal of QR factorization is to take a matrix A and

produce two matrices, Q and R, such that A = QR. Additionally, Q must have

orthonormal columns (implying that QTQ = I) and R must be upper triangular.

The AFRL took advantage of this method by taking the QR factorization of [ 1 m ].

Since the dimensions of [ 1 m ] were 70 x 2, the dimensions of Q and R were 70 x 2

and 2 x 2, respectively. This factorization was then substituted into Equation 3.2.

This substitution is shown below:

\[
k = \left( \begin{bmatrix} \mathbf{1} & m \end{bmatrix}^T \begin{bmatrix} \mathbf{1} & m \end{bmatrix} \right)^{-1} \begin{bmatrix} \mathbf{1} & m \end{bmatrix}^T d \tag{4.2}
\]
\[
k = \left( (QR)^T QR \right)^{-1} (QR)^T d \tag{4.3}
\]
\[
k = \left( R^T Q^T Q R \right)^{-1} R^T Q^T d \tag{4.4}
\]
\[
k = \left( R^T R \right)^{-1} R^T Q^T d. \tag{4.5}
\]
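A sketch of this approach in NumPy (illustrative, not the AFRL code): the factorization is computed once per template, and only a cheap solve against the 2×2 triangular factor R is repeated per range shift. Since Q has orthonormal columns, Equation 4.5 simplifies further to k = R⁻¹QᵀD.

```python
import numpy as np

def qr_fit_all_shifts(m, shifted_targets):
    """Factor [1 m] = QR once per template, then solve for k for each
    shifted target d.  Because Q^T Q = I, (R^T R)^{-1} R^T Q^T d
    reduces to R^{-1} Q^T d with R a 2x2 upper-triangular matrix."""
    A = np.column_stack([np.ones_like(m), m])
    Q, R = np.linalg.qr(A)                # paid once per template
    return [np.linalg.solve(R, Q.T @ d) for d in shifted_targets]
```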

It is not clear that using QR factorization provided a benefit over the straightforward

calculation. In fact, from [27] we know the QR factorization of a 70 × 2 matrix requires

280 multiplications. Additionally, we can easily calculate the number of multiplica-

tions in Equations 4.2 and 4.5 to be 430 and 161, respectively. It would thus seem

that the QR factorization approach required 441 total multiplications (11 more than

straightforward calculation). However, the benefit of the QR approach was in the

context of its use.

Recall that in order to compare one target signature to one template signature,

there were eleven range shifts to be considered. For straightforward calculation,

this resulted in 11 x 430 = 4,730 multiplications. However, in the QR approach,


factorization only had to be done once per template. Consequently, only 280 + 11

x 161 = 2,051 multiplications were required. While this was clearly an improvement

over the straightforward approach, it could be improved further using a matched

filter.

4.2.2 Matched Filter Approach

The foundation for using a matched filter approach for the SHARP algorithm is

developed in [12]. We will present this development, extend it to implementation,

and demonstrate the resulting benefit.

The development of the matched filter approach began with the original equation

by which the SHARP algorithm performed the least squares fitting,

\[
\begin{bmatrix} \mathbf{1} & m \end{bmatrix} k \approx d. \tag{4.6}
\]

From [27], we know that the least squares solution k is found by solving

\[
\begin{bmatrix} \mathbf{1} & m \end{bmatrix}^T \begin{bmatrix} \mathbf{1} & m \end{bmatrix} k = \begin{bmatrix} \mathbf{1} & m \end{bmatrix}^T d, \tag{4.7}
\]

or equivalently,

\[
\begin{bmatrix} N & N\bar{m} \\ N\bar{m} & m^T m \end{bmatrix} k = \begin{bmatrix} N\bar{d} \\ m^T d \end{bmatrix}, \tag{4.8}
\]

where N is the length of m and d (in this case, 70), and $\bar{m}$ and $\bar{d}$ are the means of m and d, respectively. Solving the first line of this equation gives

Nk1 + Nm̄k2 = Nd̄, (4.9)

k1 = d̄ - m̄k2. (4.10)


Substituting back into Equation 4.6 gives

[1 m] [(d̄ - m̄k2), k2]^T ≈ d, (4.11)

d̄1 - k2m̄1 + k2m ≈ d, (4.12)

and finally,

(m - m̄1)k2 ≈ (d - d̄1). (4.13)

Note that m - m̄1 is simply m with its bias removed. Similarly, d - d̄1 is d with its

bias removed. Defining m̃ = m - m̄1 and d̃ = d - d̄1, we obtain

m̃k2 ≈ d̃. (4.14)

From this, we know the least squares solution for k2 is

k2 = (m̃^T d̃) / (m̃^T m̃). (4.15)

Using Equations 4.14 and 4.15, we can calculate the fitting error:

ε² = (k2m̃ - d̃)^T (k2m̃ - d̃), (4.16)

ε² = (k2m̃^T - d̃^T)(k2m̃ - d̃), (4.17)

ε² = k2² m̃^T m̃ - 2k2 m̃^T d̃ + d̃^T d̃, (4.18)

ε² = d̃^T d̃ - (m̃^T d̃)² / (m̃^T m̃). (4.19)

We see from Equation 4.19 that if m̃ and d̃ were normalized so that |m̃| = |d̃| = 1,

then minimizing the least squares error could be accomplished by maximizing m̃^T d̃,

which is precisely the objective of a matched filter [23]. Therefore, we see that by

removing the bias from the template and target vectors and then normalizing them to

unit magnitude, we could accomplish the necessary error minimization by maximizing

the correlation between the two vectors. However, it was important to realize that

calculating the bias and normalizing were expensive calculations. Consequently, it


was desirable that these operations not be performed frequently.
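The equivalence derived above can be checked numerically; the signatures below are random stand-ins for real template and target vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.random(70)  # stand-in template signature
d = rng.random(70)  # stand-in target signature

def prep(v):
    v = v - v.mean()              # bias removal
    return v / np.linalg.norm(v)  # normalization to unit magnitude

mt, dt = prep(m), prep(d)
corr = mt @ dt

# With |m~| = |d~| = 1, Equation 4.19 collapses to 1 - corr**2, so minimizing
# the least squares error is the same as maximizing the correlation.
err = dt @ dt - (mt @ dt) ** 2 / (mt @ mt)
assert np.isclose(err, 1 - corr ** 2)
```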

Recall from Section 3.2 that a single target vector was compared against seven

template classes. Within each template class, signatures were only considered if their

aspect angles were within 5° of the target signature. Additionally, the target signature

was shifted up to 5 range gates in either direction to align the signatures. Clearly, the

choice of template vector had no bearing on processing the target vector. However, a

bias removal and normalization was necessary for every range shift using the AFRL

approach for range shifting.

The range shifting done by the algorithm was performed by padding the original

70-vector with 5 zeros on either end and then taking a window of size 70 from the

resulting 80-vector. However, Figure 4-1 shows the problem with such an approach.

Figure 4-1a shows the 80-vector obtained by zero-padding one of the bmp263 target

signatures. Figures 4-1b through 4-1l show the 11 range shifts of that signature.

The 70 samples containing target information were elements 6 through 75 of the

80-vector. However, every window besides the center one left out at least one of

these samples. In the extreme case, windows 1-70 and 11-80 left out 5 samples of the

original vector as shown in Figures 4-1b and 4-1l. In many cases, this did not matter

because the actual target signatures were generally smaller than 70 range gates, but

in some cases, real data samples were lost. Consequently, the mean of the vector

could potentially change with every range shift, requiring constant bias removal and

normalization.

Instead of accepting this conclusion and taking the resulting performance degra-

dation, we proposed a modification to the algorithm which enabled the bias removal

and normalization to be performed only once for each target vector. This modification

was to ensure that samples were never dropped. This was accomplished by providing

additional zero padding and taking a larger window. Specifically, the original 70-

vector was padded with 10 zeros on either end resulting in a 90-vector. The 11 range

shifts were then performed by taking a subwindow of size 80. Note that the target

samples were contained in elements 11-80 of the 90-vector. We then see in Figure 4-2

that each range shift included all 70 of the target samples. Consequently, the mean

Figure 4-1: Range shifting of a bmp263 target by using 70-windows. (a) Original 80-vector. (b)-(l) 11 70-wide range shifts from rightmost to leftmost.

did not change across the range shift, implying that bias removal and normalization

only needed to be performed once for each target vector.
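A small sketch illustrates the difference between the two padding schemes (the ramp signature is illustrative, not real data):

```python
import numpy as np

sig = np.linspace(0.1, 2.0, 70)    # illustrative 70-gate target signature

# Original approach: pad with 5 zeros per side, take 11 windows of width 70.
wins70 = [np.pad(sig, 5)[i:i + 70] for i in range(11)]

# Modified approach: pad with 10 zeros per side, take 11 windows of width 80.
wins80 = [np.pad(sig, 10)[i:i + 80] for i in range(11)]

# Every 80-wide window retains all 70 target samples, so its mean is the same
# for every shift; the 70-wide windows drop samples and their means vary.
means70 = {round(float(w.mean()), 9) for w in wins70}
means80 = {round(float(w.mean()), 9) for w in wins80}
assert len(means80) == 1 and len(means70) > 1
```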

Making this modification did slightly change the performance of the algorithm

since forcing all the target samples to be kept changed some of the error calculations

from the original version. The modified performances are shown in Tables 4.2 and

4.3. As before, element (i, j) indicates the percentage of Class i signatures that were

identified as Class j. Note that the algorithm performance actually improved slightly

because by not dropping samples, we were using a more accurate measurement of the

least squares error.

Table 4.2: Confusion Matrix by Vehicle Using Matched Filter

                                     Predicted
Actual      BMP263   BMP266   BMP2c21  BTR70c21  T72132   T72812   T72s7
BMP263      47.69%   16.41%   24.10%    4.62%     2.56%    2.05%    2.56%
BMP266       7.69%   60.51%   13.33%    4.10%     4.62%    3.59%    6.15%
BMP2c21     21.03%   12.82%   50.26%    3.59%     4.10%    4.10%    4.10%
BTR70c21     3.59%    4.10%    6.15%   77.44%     3.08%    3.08%    2.56%
T72132       2.04%    1.02%    4.08%    0.51%    76.02%    7.90%    6.12%
T72812       2.04%    1.04%    4.04%    0.5%     76.02%    7.90%   12%
T72s7        1.05%    3.66%    2.62%    1.05%     8.90%   11.52%   71.20%

Table 4.3: Confusion Matrix by Vehicle Type Using Matched Filter

                       Predicted
                BMP2     BTR70     T72
       BMP2    84.62%    4.10%   11.28%
Actual BTR70   13.85%   77.44%    8.72%
       T72      5.84%    1.20%   92.96%

We recall that QR factorization provided an advantage over straightforward calcu-

lation because the cost of factorization was amortized across the eleven range shifts.

Similarly, the matched filter approach provided an advantage over QR factorization

because the cost of bias removal and normalization was amortized across the number

of template signatures to be considered. Bias removal and normalization required 80

multiplications as did the calculation of the correlation. Consequently, to compare

against M template signatures, QR factorization required 2,051M multiplications,

Figure 4-2: Range shifting of a bmp263 target by using 80-windows. (a) Original 90-vector. (b)-(l) 11 80-wide range shifts from rightmost to leftmost.

while the matched filter approach required 80 + M x 11 x 80 = 80 + 880M mul-

tiplications. In addition, by comparing Tables 4.1 and 4.4, we see that adopting a

matched filter approach provided a significant advantage over the QR factorization

approach in the algorithm timing.
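The amortization argument can be expressed directly (counts from the text; M is the number of template signatures to be considered):

```python
def qr_muls(M):
    # 2,051 multiplications per template comparison with the QR approach
    return 2051 * M

def matched_filter_muls(M):
    # one bias removal/normalization (80 multiplications), then 80
    # multiplications for each of the 11 range shifts per template
    return 80 + M * 11 * 80

assert qr_muls(1) == 2051 and matched_filter_muls(1) == 960
# The matched filter wins for every M >= 1, and the gap grows with M.
assert all(matched_filter_muls(M) < qr_muls(M) for M in range(1, 100))
```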

4.3 Power Transformation

In considering the power transformation, we chose not to conduct an in-depth analysis

to determine if alternate calculations could achieve the same result more efficiently.

Instead, we noted that there was one straightforward but critical change that could

be implemented which resulted in a dramatic speedup.

Recall from Section 3.3.1 that for target formation, the center eight profiles of

the 101 available signatures were averaged together to produce the target signature,

while template formation employed all 101 profiles. However, the original AFRL code

performed the power transformation and range gate normalization on all profiles in

both cases. The only difference between the template and target formation in the

AFRL code was that the target formation code used only eight of these profiles for

averaging, in spite of having preprocessed all of the signatures.

It was apparent that we could gain a substantial savings in computation simply by

eliminating the unnecessary calculations from the original AFRL code. By ensuring

that the preprocessing only occurred on the eight profiles used in the averaging rather

than all 101 profiles, we expected a speedup on the order of 12x. This ensured that the

time spent on computing the power transformation and the range gate normalization

would be dominated by the time required for the least squares fitting. The next section

will present a revised timing analysis that incorporates the above modifications.

4.4 Revised Timing Analysis

Table 4.4 shows the results of the timing analysis on the fully optimized software. All

timings are shown in seconds.


Table 4.4: Timing Analysis of Optimized Software

           Loading/Preprocessing   Least Squares Fit   Classification
BMP263             2.79                  34.22               .11
BMP266             2.53                  33.93               .12
BMP2c21            2.58                  34.55               .12
BTR70              2.54                  34.14               .11
T72132             2.63                  34.67               .11
T72812             2.69                  34.20               .11
T72s7              2.57                  33.64               .12

Total             18.33                 239.35               .80

By comparing Tables 4.1 and 4.4, we see that the software optimizations made

a significant impact upon the execution time of the algorithm. The total execution

time was reduced from 1298.54 seconds to 258.48 seconds. Additionally, the ratio

between least squares fitting time and preprocessing time had almost doubled, fur-

ther supporting the supposition that least squares fitting dominated the execution.

Therefore, accelerating the least squares fitting remained our highest priority.

4.5 Summary

This chapter began the analysis of the SHARP algorithm. The timing analysis pre-

sented in Section 4.1 led us to focus our efforts primarily on the least squares fitting

routine, and secondarily on the power transformation. The QR factorization approach

used by the AFRL was presented and analyzed. It was then replaced with a matched

filter approach, which significantly decreased its execution time. The unnecessary

calculations were removed from the power transformation which also resulted in a

speedup.

Finally, we presented a revised timing analysis. This analysis showed that the

algorithm execution time had been greatly reduced due to our optimizations and fur-

ther supported the claim that accelerating the least squares fitting was the highest

priority. Having concluded our software analysis, we begin a hardware-specific anal-

ysis in Chapter 5 that alters the already modified algorithm to further improve its


anticipated hardware performance.


Chapter 5

Hardware-Specific Analysis

In this chapter, we examine the SHARP algorithm from a hardware standpoint.

Thus far, hardware has not been a factor in our analysis. This chapter considers the

difficulties that may present themselves when mapping the SHARP algorithm into

hardware using the ACS program approach.

Recall from Section 1.2 that our concerns with regards to hardware implemen-

tation are control flow modifications, bit precisions, design availability, partitioning,

and data transfer. While Chapter 2 indicated that most of the bit precision decisions

could be left to the wordlength analysis tool, this chapter discusses situations in which

the tool would not suffice, and considers the remaining hardware issues. These issues

are analyzed with regards to our two areas of interest: the power transformation and

the least squares fitting.

Our hardware-specific analysis begins with the power transformation in Section

5.1. We ultimately decided not to implement the power transformation in hardware,

but this section explains why this decision was made and provides insight which can

be used in further analysis. We then analyze the least squares fit in Section 5.2,

concluding with the Ptolemy diagrams of our design in Figures 5-7 and 5-9.


5.1 Power Transformation

We begin our analysis with a list of the elementary blocks available for use in our

design in Table 5.1. From these blocks, we were able to construct larger systems that

implemented the functionality we desired.

Table 5.1: Elementary Function Blocks

Accumulator          Produces a running total of its input
Chop                 Produces a subvector of its input
Const                Produces a constant value
Divider              Produces the quotient of its two inputs
Lookup Table (LUT)   Performs a table lookup on its input
Maximum              Produces the maximum of its two inputs
Multiply             Produces the product of its two inputs
Register             Repeats its input
Shift                Performs a bitshift on its input
Sqrt                 Produces the square root of its input
Square               Squares its input
Subtractor           Produces the difference of its two inputs

Recall from Section 3.3.1 that the ultimate purpose of the power transformation

was to average eight power-transformed vectors. Consequently, from an implementa-

tion standpoint, there were two components to consider. First, the range gate nor-

malization and power transformation calculation had to be performed on the eight

input vectors. Second, the eight vectors had to be averaged together. We consider

these components in Sections 5.1.1 and 5.1.2.

5.1.1 Vector Transformation

From Chapter 3, we recall that the equation to be implemented on each vector is

t = (r / r_RMS)^0.4, (5.1)

where r contains the magnitudes of the individual range gates, r_RMS is the RMS

amplitude of r, and the exponentiation is done elementwise. This calculation is

represented as a block diagram in Figure 5-1.


The vector was squared at the outset. The lower datapath was then used to calculate

the RMS amplitude. The vector was squared again and then accumulated. The

Chop was used to read the 90 partial sums produced by the Accumulator and only

output the final sum. The Shift was a right shift by 7 bits, which was equivalent to a

division by 128. The square root of the value was then taken, resulting in the RMS

amplitude. The Register repeated the amplitude 90 times so that each element of the

squared 90-vector could be normalized. Finally, a LUT was used to implement the

function y = x^0.2, completing the calculation.

input → Square → Divider → LUT → output

         Square → Accumulator → Chop → Shift → Sqrt → Register → Divider

Figure 5-1: Power Transformation Block Diagram
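A floating point sketch of this datapath may help; the x^0.2 lookup and the shift-by-7 approximation follow the description above, and `power_transform` is an illustrative name, not the AFRL code:

```python
import numpy as np

def power_transform(r):
    # Floating point model of the Figure 5-1 datapath.
    s = r ** 2                            # first Square
    # lower path: Square, Accumulator, Shift (divide by 128), Sqrt
    rms = np.sqrt(np.sum(s ** 2) / 128.0)
    # upper path: Divider, then LUT implementing y = x**0.2
    return (s / rms) ** 0.2

sig = np.pad(np.random.default_rng(1).random(70), 10)  # zero-padded 90-vector
out = power_transform(sig)
# The RMS normalization makes the result invariant to an overall gain on r.
assert out.shape == (90,) and np.allclose(out, power_transform(2 * sig))
```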

There were two issues that needed to be considered to make this design feasi-

ble. First, the output precision of the divider had to be manually chosen. This was

necessary for all dividers and was a consequence of the inability to sufficiently char-

acterize the divider input. Specifically, the denominator input could theoretically be

very close to zero. As a result, the major bit chosen by the wordlength analysis tool

would be extremely conservative and result in a high quantization error. Therefore,

we attempted to choose an appropriate major bit for the divider output based on our

knowledge of the system.

In order to choose the divider precision, we considered the system after the first

Square and before the LUT. Note that if we removed the Shift (which is equivalent

to a gain of 1/128) then the system was a straightforward normalization. That is, the

output vector was simply the input vector divided by its magnitude. As a result,

without the Shift, the output range span could be no larger than [-1,1]. We then had

to trace the effect of the Shift. Passing through the Sqrt resulted in a factor of 1/(8√2).

The Register did not affect this factor, but passing through the Divider inverted it, so

the divider output range span could be no larger than [-8√2, 8√2] ≈ [-11.3, 11.3].

Consequently, the output major bit of the divider could be set to 4, which would

Figure 5-2: Divider Output - Floating Point and <4.8> Fixed Point

easily span this range.
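The range bound can be retraced numerically; this is only a sketch of the argument above:

```python
import math

# Without the Shift, the divider output is a unit normalization: range [-1, 1].
# The Shift contributes a gain of 1/128 before the Sqrt, i.e. a factor of
# 1/(8*sqrt(2)) after it; the Divider inverts that factor.
factor_after_sqrt = math.sqrt(1.0 / 128.0)   # 1 / (8 * sqrt(2))
bound = 1.0 / factor_after_sqrt              # 8 * sqrt(2), about 11.31
assert math.isclose(bound, 8 * math.sqrt(2))
assert bound < 2 ** 4    # so a major bit of 4 spans the range
```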

The second factor to be considered was the LUT. The underlying design pro-

vided by Xilinx supported up to 256 256-bit values [34]. This effectively meant the

wordlength of the input to the LUT had to be 8, since there were only 2^8 = 256 dif-

ferent addresses in the LUT. Combining this with our earlier choice of major bit, the

output precision of the divider had to be set to <4.8>, using the notation <mb.wl>

to indicate a major bit of mb and a wordlength of wl. This meant that the least

significant bit (LSB) only had a significance of 2^-3, so all quantizations were only

guaranteed to be within .125 of the corresponding floating point values. Figure 5-2

shows the output of the Divider in floating point and at <4.8> precision for a typical

bmp263 target. Figure 5-3 shows the corresponding outputs of the exponentiation.

Figure 5-4 shows the floating point output and the same output quantized to <1.8>

precision.
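In this notation the LSB of an <mb.wl> value appears to carry a weight of 2^(mb-wl+1), which is what makes the stated .125 bound for <4.8> come out; a quantization sketch under that assumption:

```python
# Round-to-nearest quantization for <mb.wl> fixed point, assuming an LSB
# weight of 2**(mb - wl + 1) (inferred from the stated .125 bound for <4.8>).
def quantize(x, mb, wl):
    lsb = 2.0 ** (mb - wl + 1)
    return round(x / lsb) * lsb

assert quantize(1.17, 4, 8) == 1.125                  # <4.8>: LSB = 0.125
assert abs(quantize(1.17, 1, 8) - 1.17) <= 2.0 ** -7  # <1.8>: LSB = 2**-6
```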

It is apparent from Figures 5-2 and 5-3 that <4.8> precision was not suffi-

cient to characterize the input. However, Figure 5-4 shows that <1.8> precision was

Figure 5-3: LUT Output - Floating Point Input and <4.8> Fixed Point Input

Figure 5-4: LUT Output - Floating Point and <1.8> Fixed Point

Figure 5-5: LUT Output - Floating Point Input and <1.8> Transformed Divider Input

acceptable for the output. Consequently, if we could make the input more closely

resemble the output, 8-bit precision would be sufficient. We accomplished this by

inserting two Sqrt's between the Divider and LUT. We used the Sqrt's to maintain

16-bit precision until the input of the LUT, which was set to <1.8> precision. How-

ever, the input had already been raised to the 0.25 power by the Sqrt's, so the LUT

was used to implement the function y = x0.8, resulting in an overall exponent of 0.2.

The output precision was then set to <1.16>. The floating point and <1.16> fixed

point outputs are shown in Figure 5-5.

5.1.2 Vector Averaging

In order to be able to average the eight transformed vectors, it was necessary to

develop a way to parallelize them. The only reasonable way to do this was through

the use of a memory element. The output of the transformation calculation could

be stored in memory. Then, when all eight vectors were in memory, an address


generator could output the vectors in parallel. Specifically, the first element of each

vector would be produced, followed by the second element of each vector until all

elements had been produced.

The advantage of rearranging the elements in this fashion was that it was then

possible to produce the average of the 8 vectors. In order to produce the first element

of the average vector, the first elements of each of the 8 original vectors could be

passed into an accumulator in serial order. The output of the accumulator could be

downsampled to produce the final sum, which could be passed through a gain of 1/8

to produce the final average.
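The proposed reordering amounts to the following (illustrative data; the address generator is modeled here by a simple transpose):

```python
import numpy as np

vecs = np.random.default_rng(2).random((8, 90))  # 8 transformed 90-vectors

# Address generator: emit the first element of every vector, then the second,
# and so on -- i.e. read the memory bank out in transposed order.
stream = vecs.T.reshape(-1)

# Accumulate each group of 8, downsample to the totals, apply a gain of 1/8.
avg = stream.reshape(90, 8).sum(axis=1) / 8.0
assert np.allclose(avg, vecs.mean(axis=0))
```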

Unfortunately, time was not available for this approach to be taken. The com-

plexity involved in including a memory bank into the calculation and developing the

necessary address generator did not permit its implementation. Fortunately, as Sec-

tion 4.4 indicates, the time saved in performing the power transformation in hardware

was not the dominating factor in the final performance. Consequently, we concen-

trated the remainder of our analysis on the least squares fit.

5.2 Least Squares Fitting

The least squares fitting consisted of two distinct parts. First, the input vectors

had their biases removed and their magnitudes normalized. Second, the correlation

between the target vectors and the appropriate template vectors was calculated.

5.2.1 Bias Removal and Magnitude Normalization

We begin as before by presenting a block diagram of the desired calculation in Figure

5-6.

In order to calculate the bias, a Chop was used to take an 80-subvector of the

input 90-vector. The vector was accumulated and a Chop was used to extract the

total sum. The bias was finally calculated by multiplying by a constant of 1/80. The bias

was then repeated 80 and 90 times, so that it could be subtracted from both the original

90-vector and the 80-subvector. Note that this could also have been achieved through



Figure 5-6: Normalization Block Diagram

the use of one Register and Subtractor operating on the 90-vector. However, because

the magnitude which was calculated next operated on the 80-vector, an additional

Chop would have been necessary, increasing the latency of the design. Consequently,

two Registers and Subtractors were used to operate on both the 90-vector and the

80-subvector.

To calculate the magnitude, the 80-vector was squared and accumulated. Once

again, a Chop was used to get the total sum. The square root of this sum was the

desired magnitude. The magnitude was repeated 90 times and used to normalize the

90-vector by using a Divider.
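A floating point sketch of the Figure 5-6 datapath follows; which 80-subvector the Chop extracts is not specified here, so the centered window below is an assumption:

```python
import numpy as np

def normalize(v90):
    v80 = v90[5:85]                  # Chop: 80-subvector (assumed centered)
    bias = v80.sum() / 80.0          # Accumulator, Chop, Multiply by 1/80
    c80 = v80 - bias                 # Subtractor on the 80-subvector
    c90 = v90 - bias                 # Subtractor on the full 90-vector
    mag = np.sqrt(np.sum(c80 ** 2))  # Square, Accumulator, Chop, Sqrt
    return c90 / mag                 # Register, Divider

out = normalize(np.pad(np.random.default_rng(3).random(70), 10))
assert out.shape == (90,)
assert np.isclose(np.linalg.norm(out[5:85]), 1.0)
```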

We note once again that we had to manually set the output range of the divider.

Fortunately, the purpose of this divider was to divide the elements of a vector by its

magnitude, so we fixed its output range to [-1,1]. Aside from this, all precisions were

left for the wordlength analysis tool to assign.

The normalization design was estimated to be large enough to require an entire

FPGA dedicated to it. Consequently, PE1 was used for normalization, requiring the

signature data to be written to the memory bank of PE1. Chapter 6 will show this

estimate was accurate, as PE1 had a relatively high design occupancy. Figure 5-7

shows the Ptolemy graph for the normalization design. Note that Ptolemy requires

buffers (Shifts of 0) to be placed after the inputs, but these did not affect our analysis,

so we disregard them.

Figure 5-7: Ptolemy Normalization Design


Figure 5-8: Correlation Block Diagram

5.2.2 Correlation

The block diagram for the correlation calculation is shown in Figure 5-8. There

were two inputs to the correlation: a target signature and an angle input which was

used to produce a template signature. Once both signatures had been generated, the

target signature (which was 90-wide) was passed to 11 parallel Chops which took the

11 different range shifts. Each of these 80-vectors was correlated with the template

vector through the use of a Multiply, an Accumulator, and a Chop. Finally, the 11

correlations were passed through a binary tree of Maximums to produce the maximum

correlation.
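In floating point, this datapath reduces to eleven dot products and a maximum; the sketch below models that, not the hardware's fixed-point behavior:

```python
import numpy as np

def max_correlation(target90, template80):
    # 11 Chops feeding Multiply/Accumulator pairs, then a Maximum tree
    return max(target90[i:i + 80] @ template80 for i in range(11))

rng = np.random.default_rng(4)
t90, tpl = rng.random(90), rng.random(80)
best = max_correlation(t90, tpl)
# Equivalent to the maximum of a "valid"-mode sliding correlation.
assert np.isclose(best, np.correlate(t90, tpl, mode="valid").max())
```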

There were many issues to be resolved with the correlation design. The most

difficult decision was how to handle the template library. In software execution, all

templates within 5° in aspect were considered. Due to nonuniform aspect spacing,

this meant that the number of templates being considered was aspect-dependent.

Consequently, there was a data-dependent control flow.

In order to remove this dependency, the template set was filled out to contain 360

templates. Because templates were spaced in multiples of 1°, any angles that were not

represented had templates containing all zeros inserted. Additionally, to facilitate the



±5° aspect span, the template sets were given 5° wraparound at each end. Therefore,

for a given angle, there were exactly 11 templates per class to be considered and they

appeared sequentially in the template library. This resulted in a template library size

of 7 (classes) x 370 (vectors) x 80 (elements/vector) = 207,200 elements.

An additional problem was how the template library would be used. Because

there was no way to hold a target vector in place, the decision was made to only

perform one vector comparison per initiation. Note that this meant that each target

vector had to be repeated 77 times (7 (classes) x 11 (aspect angles/class)) for all the

comparisons to be made. Consequently, a simple address generator was built into the

template library which would output one signature given its starting address. Clearly,

this is not the most efficient approach, but timing constraints prevented an alternate

approach (i.e. a memory element could be used to hold the target vector in place)

from being investigated.

The target set consisted of 1,362 vectors. Since each vector had to be repeated 77

times, the input dataset consisted of 104,874 90-vectors, meaning 9,438,660 elements

had to be written to the hardware memory. Because the memory banks only had

262,144 addresses, this dataset was broken into 37 smaller pieces. To make processing

simpler, vectors were not split across pieces; rather, each download consisted of 37

90-vectors repeated 77 times (256,410 elements).
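The dataset bookkeeping above checks out:

```python
# Dataset sizes from the text.
targets = 1362
repeats = 7 * 11                  # 7 classes x 11 aspect angles per class
vectors = targets * repeats       # repeated input 90-vectors
elements = vectors * 90           # 90 elements per vector
per_download = 37 * repeats * 90  # 37 vectors repeated 77 times
assert (repeats, vectors, elements, per_download) == (77, 104874, 9438660, 256410)
```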

The only remaining difficulties were related to bit precision. The decision was

made to put all of the correlation calculation (not necessarily the template address

generation) in a single FPGA. If the correlation was not able to fit into a single FPGA,

additional designs would be necessary in order to be able to route the template data

to multiple FPGAs, due to the limited number of interconnects between FPGAs.

The most expensive elements in the correlation design were the Multiplies. It was

not possible to fit 11 16-bit multipliers in a single FPGA. Consequently, the multipliers

were reduced to 12-bit input precision. This still resulted in accuracy down to 2^-10

(assuming <1.12> precision), which would be sufficient for determining maximum

correlations. Figure 5-9 shows the Ptolemy graph for the correlation design. Again,

the buffers can be disregarded.

Figure 5-9: Ptolemy Correlation Design

At this point, since the crucial design elements had been decided, the only re-

maining issue was to make FPGA assignments. Since there were only two large

components (the normalization and the correlation), this was relatively easy. The

data input and normalization had already been placed in PEI. The address input

was sent to PE2. The template library and address generator were placed in PE3.

Finally, the correlator and output memory were assigned to PE4.

5.3 Summary

This chapter analyzed the SHARP algorithm to eliminate any difficulties in creating

a hardware design for it. After analyzing the power transformation, it was decided

that the complexity inherent in creating a design for it outweighed the benefit that

would be gained through its hardware acceleration. Consequently, execution of the

power transformation was left in software.

The least squares fit was also analyzed with regards to hardware issues. The

biggest difficulty encountered was handling the template library. The library was

modified to remove data dependencies and a compromise was made to repeat the

input data set in order to reduce the complexity of the design. Although this com-

promise greatly reduced the efficiency of the design, Chapter 6 will show that the

resulting hardware acceleration still provided a significant speedup. This chapter fi-

nally presented the Ptolemy diagrams that were used to produce the hardware designs

that will be presented in Chapter 6.


Chapter 6

Performance Comparison

Our initial intent was to provide hardware acceleration for the SHARP algorithm,

such that a standalone application could be developed that implemented a portion of

the algorithm in hardware, and demonstrated an appreciable speedup over its fully

software counterpart. Unfortunately, due to project funding and timing constraints, it

was not possible to fully meet this objective. However, here we present the results we

were able to obtain which show that the ultimate objectives are completely attainable.

Recall from Chapter 5 that our decision was to only implement the least squares

fit routine in hardware, due to the inherent complexity of implementing the power

transformation. Consequently, the least squares fit was split into two components.

Bias removal and normalization comprised the first part while correlation and a max-

imum tree comprised the second. These designs were created separately in order to

facilitate testing, and also because they are completely modular; there would be no

difficulty in merging the components once each had been fully tested.

In Section 6.1, we present the normalization design. We consider both its timing

and algorithmic performance. We also suggest further modifications that result in

improved performance. Section 6.2 presents the correlation design. Although this

design was not fully functional, its performance and timing were estimated. Section

6.3 combines the two designs and compares the estimated timing and performance

results with the software results presented in Chapter 4. Finally, Section 6.4 indicates

how these designs could be further modified to improve the overall performance.


Table 6.1: Normalization Design Bit Precisions

Block          Input 1    Input 2    Output
input                                <4.16>
Chop           <4.16>                <4.16>
Accumulator    <4.16>                <11.17>
Chop           <11.16>               <11.15>
Const                                <-6.14>
Multiply       <11.15>    <-6.14>    <4.16>
Register       <4.15>                <4.15>
Register       <4.16>                <4.16>
Subtractor     <4.16>     <4.15>     <5.16>
Subtractor     <4.16>     <4.16>     <5.16>
Square         <5.15>                <10.16>
Accumulator    <10.16>               <17.18>
Chop           <17.17>               <17.17>
Sqrt           <17.16>               <9.15>
Register       <9.15>                <9.15>
Divider        <5.14>     <9.15>     <6.16>
output         <6.16>

6.1 Bias Removal and Normalization

We begin first with the precisions chosen by the wordlength analysis tool, shown in

Table 6.1. Elements are listed as they appear in Figure 5-6, in column-major order.

There are two factors of interest in this table. First, the input precision was

arbitrarily chosen to be <4.16>. We know from our analysis of the power transfor-

mation in Section 5.1.1 that a precision of <1.16> would be more appropriate and

provide better precision on the input. Additionally, we note that the Divider output

was not manually set to <1.16>. The <6.16> output precision was a result of the

conservative precision assignment made by the wordlength analysis tool. Figure 6-1

shows the execution schedule for the design, indicating an execution time of 312 clock

cycles. Figure 6-2 shows a floorplan of PE1 for this design that indicates the area of

the design. This floorplan was generated using the graphical floorplanner tool in the

Xilinx Alliance tools.

The floating point software and fixed point hardware results for the input bmp263

Figure 6-1: Execution Schedule for the Normalization Routine

Figure 6-2: FPGA Occupancy for the Normalization Routine

Figure 6-3: Input Signature to Normalization Routine

Figure 6-4: Normalization Routine - Software and Hardware

Figure 6-5: Normalization Routine - Floating Point and Modified Fixed Point

signature shown in Figure 6-3 are shown in Figure 6-4. We can see that the normal-

ization only performed well in hardware at output values close to zero. As the output

signature moved away from 0, the error between the software and hardware versions

increased. By examining the precisions, it was determined that the cause of this poor

performance was a result of the two Accumulators and the Square. The output pre-

cisions of these blocks were only accurate to within 2^-5, 2^0, and 2^-5, respectively. By

modifying these precisions so that precision was accurate up to the LSB of the input,

we would hope to see better results. In fact the floating point output is compared

with a software implementation using these fixed-point precision in Figure 6-5.

Here, we see excellent performance, even using fixed-precision arithmetic. We

would also expect this performance to further improve when the input precision and

divider precision are manually set to <1.16> since these precisions better fit the data.

Additionally, we expect the design cost to decrease slightly because the wordlengths

would grow more slowly given an initial major bit of 1.
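One consistent reading of the <a.b> notation used in this chapter is that a gives the major (most significant) bit position and b the total wordlength, so the LSB weight is 2^(a-b+1); this matches the widening from <9.16> to <9.23> in Section 6.2 and the remark above about an initial major bit of 1. A small sketch of quantization under this assumed interpretation:

```python
def quantize(x, major_bit, wordlength):
    """Round x onto a signed fixed-point grid written <major_bit.wordlength>;
    the LSB weight is 2**(major_bit - wordlength + 1)."""
    lsb = 2.0 ** (major_bit - wordlength + 1)
    hi = 2.0 ** (major_bit + 1) - lsb        # largest representable value
    lo = -2.0 ** (major_bit + 1)             # most negative representable value
    return max(lo, min(hi, round(x / lsb) * lsb))

# A <9.16> value has LSB weight 2**-6, far coarser than the 2**-13 LSB of a
# <2.16> input; widening to <9.23> makes the grid fine enough to match it.
err_916 = abs(quantize(0.3, 9, 16) - 0.3)
err_923 = abs(quantize(0.3, 9, 23) - 0.3)
```

Under this reading, the accumulator's original <9.16> output cannot represent changes smaller than 2^-6, which is consistent with the precision loss observed in Figure 6-4.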


Table 6.2: Correlation Design Bit Precisions

Block             Input 1   Input 2   Output
data input                            <1.16>
angle input                           <15.16>
Template Library  <15.16>             <1.16>
Chop              <1.16>              <1.12>
Multiply          <1.12>    <1.12>    <2.16>
Accumulator       <2.16>              <9.16>
Chop              <9.16>              <4.16>
Maximum           <4.16>    <4.16>    <4.16>
output            <4.16>

6.2 Correlation

While project funding prevented a working correlation design from being fully tested,

we had all the information necessary to simulate execution and predict what the result

would have been. We begin as before with the precisions provided by the wordlength

analysis tool. The precisions along the 11 parallel paths and the precisions of the 10

Maximums were identical, so we present a condensed version of the precision table in

Table 6.2. Figure 6-6 shows the corresponding execution schedule, taking a total of

100 clock cycles. Figure 6-7 shows the occupancy of PE4, since most of the calculation

was there. PE2 and PE3 only contained hardware for address and template generation

and, as a result, were mostly unoccupied.

We see from Table 6.2 that once again, precision was lost in the accumulator.

Consequently, we again modified the accumulator precision to be accurate to the

LSB of its input (i.e. from <9.16> to <9.23>). Figure 6-10 shows the comparison

between software floating point and fixed point simulation when the target signature

shown in Figure 6-8 (already normalized) was correlated with the template signature

shown in Figure 6-9. Figure 6-10 shows the correlations for each of the 11 range shifts.

We can see from Figure 6-10 that with the modified precisions, the correlation

design worked well with fixed-precision arithmetic. Section 6.3 will present results for

the overall algorithm execution.
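The correlation stage compares a normalized target signature against a template at each of the 11 range shifts and keeps the maximum. A rough software sketch of this behavior, using hypothetical random signatures and a simple sliding dot-product form (the actual hardware datapath follows Table 6.2):

```python
import numpy as np

def correlate_shifts(target, template, n_shifts=11):
    """Dot-product correlation of a target signature against a template
    at each of n_shifts range shifts (a simple sliding form)."""
    seg_len = len(target) - n_shifts + 1
    return [float(np.dot(target[:seg_len], template[s:s + seg_len]))
            for s in range(n_shifts)]

rng = np.random.default_rng(0)
target = rng.standard_normal(90)
target /= np.linalg.norm(target)          # output of the normalization stage
template = rng.standard_normal(90)

scores = correlate_shifts(target, template)
best_shift = int(np.argmax(scores)) + 1   # range shift with maximum correlation
```

The final Maximum block in Table 6.2 plays the role of the `argmax` here, selecting the best of the 11 shifted correlations.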

Figure 6-6: Execution Schedule for the Correlation Routine


Figure 6-7: FPGA Occupancy for the Correlation Routine


Figure 6-8: Correlation Target Signature

Figure 6-9: Correlation Template Signature


Figure 6-10: Correlation by Range Shift

6.3 Final Comparison Estimates

6.3.1 Algorithm Performance

Our first concern should be the fidelity of the algorithm. Whether the hardware

version of the algorithm executes faster than its software counterpart is irrelevant if

the hardware version performs very poorly. Thus, in Tables 6.3 and 6.4 we present the

confusion matrices obtained through software execution, but with the least squares

fit performed using the precisions we have chosen for the hardware implementation.

Table 6.3: Confusion Matrix by Vehicle Using Fixed-Point Simulation

                                    Predicted
Actual     BMP263   BMP266  BMP2c21 BTR70c21  T72132   T72812    T72s7
BMP263     46.15%   15.90%   24.10%    4.10%   3.08%    4.10%    2.56%
BMP266      7.69%   58.97%   14.36%    4.10%   4.10%    4.62%    6.15%
BMP2c21    20.51%   13.33%   50.77%    4.10%   4.62%    2.56%    4.10%
BTR70c21    4.62%    6.15%    5.13%   75.38%   3.08%    2.56%    3.08%
T72132      2.04%    1.53%    4.08%    0.00%  75.00%   10.20%    7.14%
T72812      0.00%    2.05%    1.54%    1.03%   6.15%   74.87%   14.36%
T72s7       1.57%    4.19%    2.09%    1.05%   8.90%   11.51%   70.68%


Table 6.4: Confusion Matrix by Vehicle Type Using Fixed-Point Simulation

                    Predicted
              BMP2    BTR70      T72
Actual BMP2  83.93%    4.10%   11.97%
      BTR70  15.90%   75.38%    8.72%
        T72   6.36%    0.69%   92.96%

Comparing with Tables 4.2 and 4.3, we can see that the performance degraded slightly due to the use of fixed-precision calculation. However, this degradation did not appear to be very significant, since the fixed point results were within 2% of the floating point results.
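Row-normalized confusion matrices like Tables 6.3 and 6.4 can be built from paired actual/predicted labels; a minimal sketch, where the `confusion_matrix` helper and the example labels are illustrative rather than the SHARP code:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Row-normalized confusion matrix: entry [a][p] is the percentage of
    class-a examples classified as p (rows are actual, columns predicted)."""
    counts = {a: Counter() for a in classes}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return {a: {p: 100.0 * counts[a][p] / max(1, sum(counts[a].values()))
                for p in classes}
            for a in classes}

cm = confusion_matrix(["BMP2", "BMP2", "T72", "T72"],
                      ["BMP2", "T72", "T72", "T72"],
                      ["BMP2", "BTR70", "T72"])
# cm["BMP2"]["BMP2"] is 50.0; cm["T72"]["T72"] is 100.0
```

Each row sums to 100% (up to rounding), as in Tables 6.3 and 6.4, so the diagonal entries give the per-class recognition rates.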

6.3.2 Algorithm Timing

We also estimated the timing performance of the hardware version of the algorithm.

Because we know that we will suffer some performance degradation through the use

of fixed-point arithmetic, a hardware implementation is not advantageous when com-

pared with the software version unless it runs significantly faster.

A timing test was performed to measure the time necessary to transfer all of the

input (data input as well as angle input) to the hardware board and read back the

output in the same manner that this would be done in the final working version. The

total time spent in data transfer was 6.99 seconds. We must now estimate the time

spent in calculation. Recall that the normalization and correlation designs took 312

and 100 clock cycles to execute, respectively. However, the last 90 clock cycles of

the normalization design were used for writing the output vector to a memory bank

because the design was standalone. Consequently, in a combined design, the total

execution time would be 312 + 100 - 90 = 322 clock cycles to compare one target

vector to one template vector.

The data was processed in 37 pieces, each consisting of 37 vectors, repeated 77

times. This results in a total of 37 x 37 x 77 x 322 = 33,942,986 clock cycles to process

all vectors. Since the board was running at a clock speed of 2.5 MHz, we should


expect a total computation time of 13.58 seconds. Consequently, the total time spent

in least squares fitting is 20.57 seconds. Because we have not altered the remainder

of the algorithm, we can use the timing measurements from Table 4.4 to estimate the

total execution time for the accelerated algorithm.

We should still expect 18.33 seconds to preprocess the data. Our timing estimate

for least squares fitting is 20.57 seconds. The time for classification should remain

unchanged at 0.80 seconds. Consequently, the estimated total time for execution of

the accelerated SHARP algorithm is 39.70 seconds. This is clearly an improvement

over the software execution time of 258.48 seconds and indicates that even with a

suboptimal design such as the one proposed here, a significant benefit can be gained

over software execution.
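The timing estimate above can be reproduced with a few lines of arithmetic; all constants are the measured or carried-over figures quoted in this section and Table 4.4:

```python
# Least squares fit: measured transfer time plus estimated computation time.
transfer_s = 6.99                        # measured input/output transfer time
cycles = 37 * 37 * 77 * 322              # pieces x vectors x repeats x cycles/compare
compute_s = cycles / 2.5e6               # board clocked at 2.5 MHz
least_squares_s = transfer_s + compute_s

# Preprocessing and classification times carried over from Table 4.4.
total_s = 18.33 + least_squares_s + 0.80
speedup = 258.48 / total_s               # vs. the software execution time
```

Evaluating this gives the 13.58 s computation, 20.57 s fit, and 39.70 s total quoted above, for a speedup of roughly 6.5x.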

6.4 Proposed Improvements

Even though we have shown that the above design provides a significant speedup for

the SHARP algorithm, there are several further modifications that can be made that

will further improve the hardware speedup of the SHARP algorithm. Here, we briefly

discuss two improvements.

The first improvement is fairly straightforward. Even though the Wildforce board

was only running at 2.5 MHz, it supports clock speeds up to 50 MHz [3]. Consequently,

if the clock speed could be increased, we would expect a directly proportional decrease

in the computation time of the least squares fit. The board was only clocked at 2.5

MHz due to a difficulty encountered in receiving board interrupts at higher clock

speeds. Consequently, if this problem were resolved, the execution time could be

decreased by simply increasing the Wildforce clock speed.

The second improvement is to design a memory element to hold target vectors in

place. Recall that the lack of this element required the dataset to be repeated 77 times.

Consequently, we should expect a dramatic savings if we successfully implemented it.

However, in order to make full use of it, the template library address generator would

have to be expanded to produce all 77 comparison template vectors.


The first effect of this change would be to reduce the data transfer time. The input

dataset would only consist of 1,362 90-vectors and 1,362 aspect angles. The output

would consist of 77 correlations for each of the 1,362 input vectors. Consequently,

only one download would be necessary. We conservatively estimate the data transfer

time to be 1 second.

Additionally, for each input vector, normalization only occurs once, while corre-

lation happens 77 times. Consequently, even if we assume a one-time cost of 90 clock

cycles to store the target vector and a cost of 50 clock cycles for address generation

(a conservative estimate), then the total number of clock cycles to fully process one

vector would be at most 312 + 90 + 77 x (50 + 100) = 11,952 clock cycles. To pro-

cess all 1,362 vectors (still at 2.5 MHz) would then take approximately 6.5 seconds.

Including data transfer, we see that even our conservative estimate of 7.5 seconds is

a significant improvement over the estimate of 20.57 seconds in Section 6.3.
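The improved-design estimate can be checked the same way, using the per-vector cycle budget given above:

```python
# Per-vector cost with a target-vector memory element: normalize once,
# store the vector, then 77 correlations with address generation.
cycles_per_vector = 312 + 90 + 77 * (50 + 100)
total_cycles = 1362 * cycles_per_vector      # all 1,362 input vectors
compute_s = total_cycles / 2.5e6             # still at 2.5 MHz
estimate_s = 1.0 + compute_s                 # plus ~1 s conservative transfer
```

This yields the 11,952 cycles per vector, roughly 6.5 s of computation, and the conservative 7.5 s total quoted above.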

6.5 Summary

In this chapter, we have presented timing and performance results, both measured

and estimated, for the hardware acceleration of the SHARP algorithm. We have

demonstrated that although the hardware design we presented was not optimal, its

inefficiency was outweighed by the benefit of simply running in hardware.

Our timing estimate indicated that hardware execution would result in a speedup

on the order of 6.5x when compared with the optimized software implementation

results presented in Chapter 4, while the performance of the algorithm in hardware

remained within 2% of its software counterpart.

We also indicated areas in which the design could be further improved. By in-

creasing the Wildforce clock speed and using a memory element to eliminate the need

to repeat the dataset 77 times, the algorithm speedup could be increased to well over

an order of magnitude, demonstrating the potential of reconfigurable hardware.


Chapter 7

Summary

ATR is an evolving military application. As technology advances, the performance

standards for ATR systems are set higher. Until ATR systems achieve perfect recog-

nition, they will continue increasing in computational intensity as technology permits

in order to improve performance.

It is naturally important that the time for classification remain relatively con-

stant as the classification performance increases. The merits of perfect recognition

are questionable if the recognition cannot be included in a real-time system. Conse-

quently, for military applications, it is important to investigate technologies that are

increasing in computational power as well as speed.

FPGAs have a promising future in areas of intensive computing. Reconfigurable

computing has recently received a great deal of attention due to its computational

capabilities coupled with its affordability and ease of design. Numerous defense-

related projects have begun examining the potential of FPGAs for computationally

intensive applications.

This thesis examined the potential of FPGA acceleration for a particular ATR

system, the SHARP algorithm. While the identification performance of the algorithm

was clearly not precise enough to be included in a practical system, it served as an

instructional case to demonstrate the capabilities of reconfigurable hardware for target

recognition systems.

This thesis also demonstrated the advances being made in FPGA design tools.


The amount of time being spent in hardware design is being drastically reduced

due to the existence of tools like those provided by the ACS program. While the

development of these tools is still far from maturation, this thesis demonstrated that

they can already be used to design and implement sophisticated systems while hiding

most of the low-level decisions from the designer. As this field matures, we can expect

even more low-level decisions to be abstracted away, while achieving more efficient

underlying designs.

By examining these technologies through the framework of a comparatively simple

ATR algorithm, this thesis created a foundation that demonstrates what the current

capabilities are and where the needs for further study lie. It is hoped that the methods

and results presented in this thesis will serve as a basis to advance future research.


Appendix A

Acronyms

ACS: Adaptive Computing Systems

AFRL: Air Force Research Laboratory

APC: Armored Personnel Carrier

ASIC: Application Specific Integrated Circuit

ATR: Automatic Target Recognition

CLB: Configurable Logic Block

DFT: Discrete Fourier Transform

EDIF: Electronic Design Interchange Format

FPGA: Field-Programmable Gate Array

HRR: High Range Resolution

IFFT: Inverse Fast Fourier Transform

LSB: Least Significant Bit

LUT: Lookup Table

MCMC: Markov Chain Monte Carlo

MSTAR: Moving and Stationary Target Acquisition and Recognition

MTI: Moving Target Indication

RMS: Root Mean Square

SAR: Synthetic Aperture Radar

SDF: Synchronous Dataflow

SHARP: System-Oriented HRR Automatic Recognition Program


VHDL: VHSIC Hardware Description Language

VHSIC: Very High Speed Integrated Circuit

WPAFB: Wright-Patterson Air Force Base


Bibliography

[1] Air Force Research Laboratory, An Air-to-Ground Classification Analysis. AFRL

Internal Memo, 1994.

[2] Annapolis Micro Systems, Inc., Annapolis Micro Systems Home Page.

http://www.annapmicro.com, 2001.

[3] Annapolis Micro Systems, Inc., Wildforce Reference Manual, 1999.

[4] BAE SYSTEMS, Algorithm Analysis and Mapping Environment for Adaptive

Computing Systems. http://www.sanders.com/adv-tech/aam.htm, 2000.

[5] B. Bhanu, G. Jones III, Object Recognition Results Using MSTAR Synthetic

Aperture Radar Data. IEEE Workshop on Computer Vision Beyond the Visible

Spectrum: Methods and Applications, pages 55-62, 2000.

[6] Y. Cho, Optimized Automatic Target Recognition Algorithm on Scalable

Myrinet/Field Programmable Array Nodes. IEEE Asilomar Conference on Sig-

nals, Systems, and Computers, 2000.

[7] P. Djuric, Bayesian Methods for Signal Processing. IEEE Signal Processing Mag-

azine, pages 26-28, September 1998.

[8] R. Duda, P. Hart, Pattern Classification and Scene Analysis. John Wiley & Sons,

Inc. 1973.

[9] P. Fiore, P. Topiwala, Bit-ordered Tree Classifiers for SAR Target Classification.

IEEE Asilomar Conference on Signals, Systems, and Computers, 1997.


[10] P. Fiore, A Custom Computing Framework for Orientation and Photogrammetry.

Ph.D. Thesis, Massachusetts Institute of Technology, June 2000.

[11] P. Fiore, C. Myers, J. Smith, E. Pauer, Rapid Implementation of Mathematical

and DSP Algorithms in Configurable Computing Devices. SPIE International

Symposium on Voice, Video, and Data Communications, November 1998.

[12] P. Fiore, Suitability of AFRL Forced Decision Algorithm for ACS Demo. Sanders

Internal Memo, August 1999.

[13] P. Fiore, Wordlength Optimization via the MCMC for Custom DSP Computing.

Submitted to IEEE Transactions on Signal Processing, November 2000.

[14] D. Gross et al. High Range Resolution Ground Moving Target ATR Using Ad-

vanced Space-Based SAR/MTI Concepts. AIAA Space Technology Conference &

Exposition, September 1999.

[15] R. Hummel, Moving and Stationary Target Acquisition and Recognition

(MSTAR). http://www.darpa.mil/spa/Programs/mstar.htm, 2001.

[16] Khoral Research, Inc. Khoral Home Page. http://www.khoral.com, 2001.

[17] D. Kottke, P. Fiore, Systolic Array for Acceleration of Template Based ATR,

IEEE International Conference on Image Processing, 1997.

[18] T. Lamont-Smith, Translation to the Normal Distribution for Radar Clutter.

IEEE Proceedings - Radar, Sonar, and Navigation, Vol. 147, No. 1, pages 17-22,

February 2000.

[19] B. Levine et al. Mapping of an Automated Target Recognition Application from

a Graphical Software Environment to FPGA-based Reconfigurable Hardware.

IEEE Symposium on Field-Programmable Custom Computing Machines, pages

292-293, 1999.


[20] T. Marzetta, E. Martinsen, C. Plum, Fast Pulse Doppler Radar Processing Ac-

counting for Range Bin Migration. IEEE National Radar Conference, pages 264-

268, 1993.

[21] The Mathworks, MATLAB Introduction Home Page.

http://www.mathworks.com/products/matlab/, 2001.

[22] R. Mitchell, R. DeWall, Overview of High Range Resolution Radar Target Iden-

tification. SPIE Automatic Target Recognition Conference, pages 35-47, October

1994.

[23] A. Oppenheim, R. Schafer, Digital Signal Processing. Prentice-Hall, 1975.

[24] E. Pauer, P. Fiore, J. Smith, C. Myers, Algorithm Analysis and Mapping En-

vironment for Adaptive Computing Systems. ACM International Symposium on

Field-Programmable Gate Arrays, 1999.

[25] E. Pauer, P. Fiore, J. Smith, Algorithm Analysis and Mapping Environment

for Adaptive Computing Systems: Further Results. IEEE Symposium on Field-

Programmable Custom Computing Machines, pages 264-265, April 1999.

[26] J. Ratches, C. Walters, R. Buser, B. Guenther, Aided and Automatic Target

Recognition Based Upon Sensory Inputs from Images Forming Systems. IEEE

Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 9, pages

1004-1019, September 1997.

[27] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, 1998.

[28] Synplicity, Inc., Synplicity Home Page, http://www.synplicity.com, 2001.

[29] University of California at Berkeley, The Ptolemy Project.

http://ptolemy.eecs.berkeley.edu, 2001.

[30] J. Villasenor et al. Configurable Computing Solutions for Automatic Target

Recognition. IEEE Symposium on Field-Programmable Custom Computing Ma-

chines, pages 70-79, 1996.


[31] A. Wilkinson, R. Lord, M. Inggs, Stepped-Frequency Processing by Reconstruc-

tion of Target Reflectivity Spectrum. IEEE Southern African Symposium on

Communications and Signal Processing, September 1998.

[32] R. Williams et al. Automatic Target Recognition of Time Critical Moving Targets

Using 1D High Range Resolution (HRR) Radar. IEEE Radar Conference, pages

54-59, 1999.

[33] A. Willsky, G. Wornell, J. Shapiro, Stochastic Processes, Detection, and Estima-

tion, 6.432 Course Notes, 1999.

[34] Xilinx Corporation, Xilinx Home Page. http://www.xilinx.com, 2000.
