Enhanced SAR Image Processing Using a Heterogeneous Multiprocessor
by
YU SHI
Linköpings universitet
http://www.ep.liu.se
Abstract
Synthetic aperture radar (SAR) is a pulse-focusing airborne radar
that can produce high-resolution radar images. A number of image
formation algorithms have been developed for this kind of radar, but
the computational burden is still heavy, so SAR image processing is
normally performed “off-line”. The Fast Factorized Back Projection
(FFBP) algorithm is considered a computationally efficient algorithm
for SAR image formation, and several implementations have tried to
bring the processing “on-line”. The CELL Broadband Engine is one of
the newest multi-core processors, jointly developed by Sony, Toshiba
and IBM. CELL is good at parallel computation and floating-point
arithmetic, both of which suit the demands of SAR image formation.
This thesis implements the FFBP algorithm on the CELL Broadband
Engine and compares the results with previous projects. The goal of
the project is to make real-time SAR image formation possible.
Keywords: CELL Broadband Engine, synthetic aperture radar, Fast
Factorized Back Projection (FFBP) algorithm, C language, parallel
programming, parallel computing, Matlab
Acknowledgements
Firstly, I want to thank my supervisor, Doctor Di Wu, and my
examiner, Professor Dake Liu, for their warm-hearted help and
heuristic guidance during my final thesis. I would also like to thank
all my friends in Linköping, who have shared their happiness with me
all the time. I also cherish the love from my family and my
girlfriend Xiaoqian Liu.
Yu Shi
Linköping, 11 February 2008
2.3.1 Data collection
2.3.2 Pulse compression
2.3.3 Doppler Effect and Doppler Filter Bank
2.3.4 Detection
2.3.5 Resolving
1.1 Background
Detection and tracking are the two primary functions of radar.
Synthetic aperture radar (SAR) can create a high-resolution
two-dimensional image of a static ground scene. The technique was
first described and demonstrated by Carl Wiley of Goodyear Aircraft
in 1951. SAR is an active system with all-weather and day-or-night
capabilities, so it can acquire high-resolution images at any time.
Figure 1.1 is an image of Washington, D.C. taken by Synthetic
Aperture Radar.
Figure 1.1 Synthetic Aperture Radar image of Washington, D.C. [14]
As innovative ideas and new technologies emerge, SAR is used much
more extensively than ever before, in both military and civilian
applications such as Reconnaissance, Surveillance, and Targeting;
Treaty Verification and Nonproliferation; Navigation and Guidance;
Foliage and Ground Penetration; and Associative Aperture Synthesis
Radar.
Because of the heavy calculation burden and limited computational
capabilities, the real-time performance of SAR systems is poor. To
improve it, this thesis discusses the implementation of an advanced
SAR algorithm called FFBP (Fast Factorized Back Projection) on the
STI CELL. FFBP is an efficient algorithm for creating
high-resolution images in a SAR system and has attracted a lot of
attention recently. The STI CELL was jointly developed by Sony,
Toshiba, and IBM around a new parallel computation concept.
1.2 Purpose of the Thesis
The purpose of the thesis is to implement the Fast Factorized Back
Projection (FFBP) algorithm on the CELL Broadband Engine in order to
improve the efficiency of SAR signal processing and to make
real-time signal processing possible. The kernel of the FFBP
algorithm is accelerated by means of parallel computation and SIMD.
1.3 Way of work
This project is based on the C language extended with the CELL
intrinsics, which take the place of in-line assembly language, and
on the IBM FULL SYSTEM SIMULATOR for the Cell Broadband Engine. To
ease the workload, Matlab was used to prototype the algorithm used
in this project. The results are also compared with earlier work on
an FPGA to illustrate the efficiency of parallel computation.
1.4 Outline
Chapter 2 describes the basic processing flow of airborne radar.
Chapter 3 gives a short introduction to synthetic aperture radar
(SAR). Chapter 4 describes the Fast Factorized Back Projection
(FFBP) algorithm and the way it works in this thesis.
Chapter 5 presents an overview of the CELL Broadband Engine,
covering both the hardware and the software features. Chapter 6
discusses the main constraints of this project and the
considerations made in the design stage. Chapter 7 covers the
software details of this project, including the design of the data
structures and of the work flow. Chapter 8 presents the results of
the implementation in Chapter 7 and compares them with previous
work. Chapter 9 discusses the results and future work.
Introduction to Radar System and overview of Radar Signal
Processing
2.1 Radar fundamentals
Radar is short for “radio detection and ranging”. Its main uses are
detection, tracking, and imaging, of which detection is the most
fundamental.
Figure 2.1 This RADAR is the ARPA Long-Range Tracking and
Instrumentation Radar (ALTAIR) located in the Kwajalein atoll on
the island of Roi-Namur in the Ronald Reagan Ballistic Missile
Defense Test Site. It was initially developed and built between
1968 and 1970. [4]
As shown in Figure 2.2, the main parts of a radar are the
transmitter, waveguide, duplexer, receiver, an electronic section
including software, and links to end users. The carrier frequencies
of most radar systems lie between 1 and 40 GHz.
Figure 2.2 Radar Components [5]
2.2 Introduction to Airborne Radar System
Airborne radar is radar equipment carried on an aircraft platform
and used in military or civilian applications. Airborne radar can be
used for weather assessment, navigation, altimetry, mapping, and
military combat.
SAR (synthetic aperture radar) is the technique used on airborne
radar to obtain high-resolution maps of terrain. It uses the Doppler
shift produced by targets on the ground as the aircraft passes by to
map the earth's surface with high resolution in both the range and
azimuth dimensions.
Airborne radars have to overcome some unique design difficulties,
mainly caused by clutter and size limitations. Clutter consists of
unwanted echoes, mainly from the ground, which can interfere with
the echoes from targets. Size limitations influence the design of
the antenna and the choice of radio frequency.
Figure 2.3 Illustration of an airborne radar environment
2.3.1 Data collection
The radar collects the echoes reflected from objects in order to
detect and range them. The distance between the radar and an object
is proportional to the travel time of the wave. Each sample of the
echo received after a transmitted pulse therefore corresponds to a
certain distance from the radar, and the data collected from each
pulse is called range bins.
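The relation between travel time and distance can be sketched as follows. This is only an illustration: the sampling rate, the function names and the example delays are assumptions and do not come from the radar system studied in this thesis.

```python
# Speed of light in vacuum, in metres per second.
C = 299_792_458.0

def echo_range(round_trip_time_s):
    """Distance to a reflector from the round-trip travel time of its echo."""
    # The wave travels to the object and back, hence the factor 1/2.
    return C * round_trip_time_s / 2.0

def range_bin_index(round_trip_time_s, sample_rate_hz):
    """Index of the range bin that an echo with this delay falls into."""
    # Hypothetical receiver: samples are taken at a fixed rate after each pulse.
    return int(round_trip_time_s * sample_rate_hz)

print(echo_range(10e-6))                # roughly 1.5 km for a 10 microsecond delay
print(range_bin_index(10.01e-6, 50e6))  # bin index at an assumed 50 MHz sampling rate
```

A later echo lands in a higher range bin, which is why one received pulse can be stored as one row of range bins.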
2.3.2 Pulse compression
To obtain both high range resolution and long detection range, the
transmitted pulse would have to be extremely short and carry very
high power. Because of practical power limits, the PRF (Pulse
Repetition Frequency) has to be kept low enough and a comparatively
long pulse has to be transmitted instead, which leaves us in a
dilemma. In addition, range resolution is reduced by the spreading
of the pulse during its travel time. Pulse compression is one of the
techniques that resolves this dilemma. The radar sends out modulated
pulses that are long enough for the practical power limits not to be
exceeded. When the echoes are received, the radar demodulates them
to compress the pulse. Three pulse compression methods are in common
use: linear frequency modulation (chirp), binary phase modulation
and polyphase modulation.
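As an illustration of the chirp method, the sketch below compresses one noise-free echo by correlating it with the transmitted pulse (a matched filter). The pulse length, sweep rate, delay and function names are assumptions chosen for the example, not parameters of a real radar.

```python
import cmath

def linear_chirp(num_samples, bandwidth_fraction=0.5):
    """Complex baseband linear FM pulse; frequency sweeps across the pulse."""
    # Phase is quadratic in the sample index, so frequency grows linearly.
    rate = bandwidth_fraction / num_samples
    return [cmath.exp(1j * cmath.pi * rate * n * n) for n in range(num_samples)]

def compress(received, pulse):
    """Matched filter: correlate the received signal with the transmitted pulse."""
    out = []
    for lag in range(len(received) - len(pulse) + 1):
        acc = sum(received[lag + n] * pulse[n].conjugate()
                  for n in range(len(pulse)))
        out.append(abs(acc))
    return out

pulse = linear_chirp(64)
delay = 40
received = [0j] * delay + pulse + [0j] * 60   # one echo, no noise
profile = compress(received, pulse)
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak)  # index of the compressed peak, equal to the echo delay
```

The long 64-sample pulse collapses into a narrow peak at the echo delay, which is the point of transmitting a long modulated pulse instead of a short one.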
2.3.3 Doppler Effect and Doppler Filter Bank
The Doppler Effect is a shift in the frequency of a wave radiated,
reflected, or received by an object in motion. [7] As illustrated in
Figure 2.4, the Doppler Effect occurs when a point wave source
moves.
Figure 2.4 A wave radiated from a point source when stationary (a)
and when moving (b). Wave is compressed in direction of motion,
spread out in opposite direction, and unaffected in direction
normal to motion. [7]
A Doppler filter bank is designed to detect the echoes from many
different sources simultaneously, based on their differences in
Doppler frequency. If the filters had no side lobes, a filter would
respond only to signals whose frequency falls within its frequency
band. In practice, however, signals whose frequencies lie outside a
filter's frequency band can still appear at its output because of
the filter side lobes.
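One common way to build such a bank is a discrete Fourier transform over the echoes of successive pulses. The sketch below assumes this DFT-based construction, which the text above does not specify, and uses made-up tone frequencies.

```python
import cmath

def doppler_filter_bank(samples):
    """Magnitudes at the outputs of an N-filter bank formed by a DFT."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)))
        for k in range(n)
    ]

def tone(cycles, num_pulses):
    """Echo samples with a Doppler shift of `cycles` per `num_pulses` pulses."""
    return [cmath.exp(2j * cmath.pi * cycles * t / num_pulses)
            for t in range(num_pulses)]

on_bin = doppler_filter_bank(tone(3, 16))     # centred in filter 3
off_bin = doppler_filter_bank(tone(3.4, 16))  # between filters 3 and 4
strongest = max(range(16), key=on_bin.__getitem__)
print(strongest)
print(off_bin[8])  # non-zero: side-lobe leakage into a distant filter
```

The on-bin tone appears in exactly one filter, while the off-bin tone leaks into all filters, which is the side-lobe effect described above.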
2.3.4 Detection
Detection is performed after Doppler filtering. It identifies
potential targets according to the energy of the received signals.
For a target to be detected, its signal energy plus the accompanying
noise energy must exceed a certain threshold value, as shown in
Figure 2.5. If the detection threshold is too low, too many objects
are detected as targets. If it is too high, some targets are missed
in the detection process. The detector contains a circuit called the
Constant False Alarm Rate (CFAR) circuit, whose responsibility is to
determine the detection threshold.
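The threshold trade-off can be sketched with a cell-averaging CFAR detector. The text does not say which CFAR variant is used, so the cell-averaging scheme, the window sizes and the scale factor below are assumptions for illustration.

```python
def ca_cfar(power, num_train=4, num_guard=1, scale=4.0):
    """Cell-averaging CFAR: flag cells whose power exceeds scale * local noise."""
    detections = []
    half = num_train + num_guard
    for i in range(half, len(power) - half):
        # Training cells on both sides, skipping the guard cells next to cell i.
        left = power[i - half:i - num_guard]
        right = power[i + num_guard + 1:i + half + 1]
        noise = (sum(left) + sum(right)) / (2 * num_train)
        if power[i] > scale * noise:
            detections.append(i)
    return detections

cells = [1.0] * 30
cells[12] = 20.0  # one strong echo in otherwise flat noise

print(ca_cfar(cells))              # moderate threshold: only the echo is reported
print(ca_cfar(cells, scale=25.0))  # threshold too high: the target is missed
print(ca_cfar(cells, scale=0.5))   # threshold too low: noise cells are reported
```

Because the threshold follows the locally estimated noise level, the false alarm rate stays roughly constant when the noise level changes.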
Figure 2.5 Integrated noise energy at the end of successive
integration times. On average, for a target to be detected, the
integrated signal energy must be greater than Smin. [7], p. 136
2.3.5 Resolving
When the time of flight is longer than the interval between two
transmitted pulses, the echoes can be aliased, and it is then
impossible to determine which transmitted pulse an echo belongs to.
[16] p11 A further problem is that it is difficult to distinguish
aliased echoes from non-aliased ones. This can be overcome with
resolving techniques, analogous to range resolving, in which PRF
switching is used over a number of subsequent coherent processing
intervals (CPIs).
3.1 Introduction to SAR
SAR is short for Synthetic Aperture Radar, which is used for imaging
terrain clutter. Mounted on an airplane or a satellite, a SAR can
form high-resolution two-dimensional radar images from
low-resolution aperture data. Because of the limited computing power
available at the time and its complex digital signal processing
algorithms, SAR was not an efficient radar when it was first
introduced in the 1950s. As DSP technology developed and the
algorithms evolved, SAR came back into engineers' consideration
because of its ability to obtain high azimuth resolution.
3.2 SAR fundamentals
As mentioned in the previous chapters, a radar map has to provide
high resolution in both the range and azimuth dimensions. High range
resolution can be achieved by pulse compression, as in the other
radar systems discussed in Chapter 2. Azimuth is perpendicular to
range, as shown in Figure 3.1. High azimuth resolution normally
cannot be achieved by conventional means. SAR, however, provides a
relatively high resolution in azimuth, which is an important
advantage over other radars.
Figure 3.1
The azimuth resolution a of a SAR antenna can be calculated by
a = R*λ/D (3.1) [2]
where R is the distance from the antenna to the target, D is the
length of the antenna, and λ is the wavelength. Formula 3.1 shows
that a finer resolution in azimuth requires a longer antenna. For
example, for a target distance R = 1 km, a wavelength λ = 50 cm, and
a required azimuth resolution a = 50 cm, the antenna length would
have to be D = R*λ/a = 1 km, which is clearly impossible in
practice. Instead of building a large physical phased array antenna,
a single array element that moves through successive element
positions to form the complete array is a better choice. [3] Since
the waves travel at the speed of light, the speed of the airplane
can be neglected and the start-stop approximation can be used. Along
its route, the single array element sends pulses and receives echoes
at each position. The data collected at each position is then
coherently combined to emulate the large array antenna that would
usually be built in microwave hardware. This process is shown in
Figure 3.2, which illustrates the case where the antenna is on an
airplane platform.
Figure 3.2
The range bins are then put together to create an echo matrix, which
the digital processor uses for further processing.
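The worked example for formula 3.1 can be checked with a few lines of code. The function names are hypothetical; the numbers are the ones used in the text.

```python
def azimuth_resolution(range_m, wavelength_m, antenna_length_m):
    """Formula 3.1: a = R * lambda / D."""
    return range_m * wavelength_m / antenna_length_m

def required_antenna_length(range_m, wavelength_m, resolution_m):
    """Formula 3.1 solved for the antenna length: D = R * lambda / a."""
    return range_m * wavelength_m / resolution_m

# R = 1 km, lambda = 50 cm, required resolution a = 50 cm.
print(required_antenna_length(1000.0, 0.5, 0.5))  # antenna length in metres
# A hypothetical 2 m physical antenna at the same range and wavelength.
print(azimuth_resolution(1000.0, 0.5, 2.0))       # resolution in metres
```

The second call shows how coarse the azimuth resolution of a short physical antenna is, which is the motivation for synthesizing a long aperture from many positions.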
3.3 Doppler Effect consideration
Under the start-stop approximation, the plane is assumed to stand
still at each element position, so no Doppler Effect (see Chapter 2)
appears in the process above. This is not true in a real situation:
the Doppler Effect does exist and the echoes have different
frequency shifts. To reduce this error, the frequency shift can be
calculated from the distance between two echoes.
4.1 Comparison among SAR algorithms
4.1.1 Algorithms based on FFT
Since SAR was invented in the 1950s, many image formation algorithms
have been developed to help SAR create high-resolution maps from
low-resolution aperture data. In early SAR systems, most of them
operated in the frequency domain, using the Fast Fourier Transform
(FFT) or the Discrete Fourier Transform (DFT). Figure 4.1 shows an
example in which ERS-2 SAR data over a study area has been enhanced
using an FFT-based filtering approach and also using the Frost
filtering technique.
Figure 4.1 Block diagram for FFT based filtering
FFT techniques can reduce the calculation burden from N^2 to a
complexity of N log N when N is a power of two (2^x) [2]. However,
they are computationally efficient only when the flight trajectory
is linear and the speed is constant, which is not the case in
practice.
4.1.2 Back-Projection techniques
Back-projection techniques in the time domain avoid these problems,
since a back-projection algorithm can handle irregularly sampled
echo data. Back-projection is widely used in computed tomography
(CT), but it is considered computationally inefficient in SAR
because of the differences between the sensor geometries of CT and
SAR. One extensively used back-projection algorithm is Global Back
Projection (GBP), with a computational complexity in the order of
N^3. Another is Factorized Back Projection (FBP), which has a
reduced computational burden of N^2 log N, but this algorithm is
still slow and causes some minor errors. Compared to GBP, one FBP
algorithm called Fast Factorized Back Projection (FFBP) is more
computationally efficient and speeds up image formation. It is the
algorithm used in this thesis. The calculation burden of FFBP, in
each iteration, is n * MN * log_n(L), where n is an integer with a
lowest value of 3, L is the length of the whole aperture, and MN is
the number of resulting pixels. This lower computational burden
makes it possible to process SAR images in real time.
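To give a feeling for the orders of growth quoted in this section, the sketch below evaluates them for one problem size. The logarithm base (2) and the sample size N = 1024 are assumptions made for the example; constant factors are ignored, so the numbers are relative only.

```python
import math

def direct_transform_operations(n):
    """Order N^2, the burden before FFT techniques are applied."""
    return n ** 2

def fft_operations(n):
    """Order N log N, valid when N is a power of two."""
    return n * math.log2(n)

def gbp_operations(n):
    """Order N^3 for Global Back Projection."""
    return n ** 3

def fbp_operations(n):
    """Order N^2 log N for Factorized Back Projection."""
    return n ** 2 * math.log2(n)

n = 1024
print(direct_transform_operations(n) / fft_operations(n))  # gain from the FFT
print(gbp_operations(n) / fbp_operations(n))               # gain from factorization
```

In both cases one factor of N is replaced by log N, which for N = 1024 is a reduction of about two orders of magnitude.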
4.2 FFBP work-flow
FFBP is an algorithm that accelerates image formation in the time
domain. Under the assumption of a straight trajectory and constant
flying speed, several apertures with low angular resolution can be
merged iteratively into one larger aperture, as illustrated in
Figure 4.2, to reduce data redundancy and gain angular resolution.
FFBP proceeds as follows: the echo data is first factorized into a
number of decimated data sets for sub images over a number of
stages, and each data set is then back-projected to the
corresponding sub image. As shown in Figure 4.2, in each iteration
the contributing apertures are combined into new, larger apertures
with higher angular resolution. In this case, three apertures are
merged into one larger aperture in each iteration. The procedure is
repeated for several iterations until the full aperture with full
angular resolution is obtained.
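The number of iterations follows directly from the merge factor. The sketch below assumes that every pulse of the 729-pulse data set used later in this thesis starts as its own aperture. That starting point and the function name are assumptions for illustration.

```python
def merge_stages(num_apertures, merge_factor=3):
    """Aperture counts after each merge, until one full aperture remains."""
    counts = [num_apertures]
    while counts[-1] > 1:
        if counts[-1] % merge_factor:
            raise ValueError("aperture count is not a power of the merge factor")
        counts.append(counts[-1] // merge_factor)
    return counts

stages = merge_stages(729)
print(stages)           # aperture count after each iteration
print(len(stages) - 1)  # number of iterations needed
```

With a merge factor of three, 729 = 3^6 apertures collapse into the full aperture after six iterations.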
Figure 4.2 Simplified illustration of FFBP
The problem here is how to decide which apertures should be merged
into one aperture. This has to be calculated for every aperture,
using the FFBP formula given below. After each iteration, a new echo
matrix is formed, either as input to the next iteration or as the
final result to be stored. The details of the FFBP algorithm are as
follows. As shown in Figure 4.3, the data-set for a given sub image
mn in stage s is given by
$$d^{(s)}_{mn}(t, y_p) = \sum_{q=Q(p-1)+1}^{Qp} d^{(s-1)}_{m'n'}\left(t + \Delta r_{pq,mn},\; y_q\right) \exp\left(j \omega_c \Delta r_{pq,mn}\right) \qquad [13]$$
where $d^{(s-1)}_{m'n'}$ is the data-set for the sub image in the
previous stage that contains the sub image mn, and
$\Delta r_{pq,mn} = (y_p - y_q)\sin\theta_{q,mn}$ is the delay
required to focus at the centre of the sub image mn, where
$\theta_{q,mn}$ is the angle from $y_q$ to the sub image centre. [12]
The along-track sample positions are decimated by the factor Q, and
the new sample positions $y_p$ are given in [13].
Figure 4.3 At each stage of the FFBP algorithm, groups of
along-track samples (from the appropriate data-set in the previous
stage) are combined and focused to the centre of each sub image.
[13]
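The focusing delay (y_p - y_q) sin θ can be checked numerically against the exact difference in range. The sketch below is a minimal illustration under stated assumptions: the angle is measured from the across-track direction, the new sample position is taken as the centre of its group of Q = 3 samples, and the sub image centre at 5000 m across-track and 300 m along-track is made up.

```python
import math

def focus_delay(y_p, y_q, centre_x, centre_y):
    """Delta r = (y_p - y_q) * sin(theta_q) for one old sample position y_q."""
    # theta_q: angle from position y_q to the sub image centre, measured
    # from the across-track direction (an assumed geometry convention).
    theta_q = math.atan2(centre_y - y_q, centre_x)
    return (y_p - y_q) * math.sin(theta_q)

def exact_range_difference(y_p, y_q, centre_x, centre_y):
    """Range from y_q to the centre minus range from y_p to the centre."""
    r_q = math.hypot(centre_x, centre_y - y_q)
    r_p = math.hypot(centre_x, centre_y - y_p)
    return r_q - r_p

group = [0.0, 0.8, 1.6]          # Q = 3 along-track positions, 0.8 m apart
y_new = sum(group) / len(group)  # assumed: new position at the group centre
delays = [focus_delay(y_new, y_q, 5000.0, 300.0) for y_q in group]
exact = [exact_range_difference(y_new, y_q, 5000.0, 300.0) for y_q in group]
print(delays)
print(exact)
```

For positions this close together the approximation agrees with the exact range difference to well below a millimetre, and the sample at the group centre needs almost no delay.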
The calculation burden of FFBP, in each iteration, is then
n * MN * log_n(L), where n is an integer with a lowest value of 3, L
is the length of the whole aperture, and MN is the number of
resulting pixels.
4.3 Parallelization of FFBP
The FFBP algorithm described above can be processed in parallel in
order to generate images in real time. The algorithm can be
parallelized at two levels of granularity, coarse and fine. At the
coarse level, the whole data set is split into several loosely
coupled subsets, and each subset can be processed independently
where data dependencies allow. At the fine level, parallelism is
exploited within the data subsets, by means of instruction-level,
thread-level or data-level parallelism.
4.4 Existing FFBP Implementation
Andreas Hast and Lars Johansson describe an FPGA implementation of
FFBP in their paper “FAST FACTORIZED BACK-PROJECTION IN AN FPGA”.
The paper shows that an implementation on an FPGA with a hard CPU
core can feasibly calculate FFBP in real time. Hast and Johansson
used Matlab code provided by Annelie Wyholt (PhD student at Chalmers
University of Technology) to simulate the output of the CARABAS
system, and their study is based on this simulated SAR system. Their
implementation runs in real time when the plane speed is less than
194 m/s. The Rockwell Sabreliner 40A used as the SAR platform,
however, has a cruise speed of around 236 m/s without the antennas,
which leaves room to improve the performance in this project.
Chapter 5
Overview of Cell Broadband Engine and Parallel Programming
Cell's full name is Cell Broadband Engine Architecture, abbreviated
CBEA or, in shortened form, Cell BE. It was jointly developed by
Sony, Toshiba, and IBM (STI). Cell is a member of the Cell Broadband
Engine Architecture (CBEA) microprocessor family and was initially
designed for game applications and media-rich devices. Owing to its
advanced architecture and strong computational abilities, however,
Cell is now widely used not only for its initial purpose but also
for other computation-intensive applications.
Figure 5.1 Layout of IBM CELL
5.1 Design considerations
Power use, memory use, and processor frequency are the three main
performance-limiting factors of contemporary microprocessors. They
are called the three performance-limiting walls. To scale the power
wall, Cell contains different kinds of cores. One core, the PPE, is
optimized to run an operating system and control-intensive code. The
other eight cores, the SPEs, are specialized for compute-intensive
(data-plane) applications. Today's symmetric multiprocessors run at
multi-gigahertz frequencies, and the latency caused by DRAM is
approaching 1000 cycles, which means that data transfer between main
storage and the processor dominates program performance. To reduce
this latency, the SPEs adopt a 3-level memory structure (main
storage, local stores and register files) and asynchronous DMA
transfers between main storage and the local stores. The technique
of deepening instruction pipelines to obtain higher operating
frequencies has reached its limit. Specializing the cores for
different tasks, as with the PPE and the SPEs, allows each kind of
core to be designed for high frequency without excessive overhead.
5.2 Architecture Overview
The Cell Broadband Engine is a single-chip multiprocessor initially
designed for game consoles and media-rich applications, such as the
PS3 and high-definition television. It has nine cores on the chip,
working on a coherent, shared memory. Because of its architecture
and computational features, it is also used much more broadly, for
example for compute-intensive workloads and servers.
5.3 Architecture features
As illustrated in Figure 5.2, the main parts of the Cell Broadband
Engine are the PowerPC Processor Element, the Synergistic Processor
Elements, and the Element Interconnect Bus.
Figure 5.2 Overview of Cell Broadband Engine architecture
The main processor, the PowerPC Processor Element (PPE), consists of
a Power Processing Unit (PPU) connected to a 512 KB L2 cache. The
PPU is a 64-bit PowerPC Architecture reduced instruction set
computer (RISC) core. The PPE's tasks are running the operating
system, managing system resources and coordinating the SPEs'
threads. The key design goals of the PPE are to maximize the
performance/power ratio as well as the performance/area ratio. [10]
The PPU is dual-issue and supports two threads, and it supports both
the PowerPC instruction set and the Vector/SIMD Multimedia Extension
instruction set. Figure 5.3 illustrates how the dual-issue mechanism
works.
A Synergistic Processor Element (SPE) consists of a Synergistic
Processing Unit (SPU) and a Memory Flow Controller (MFC), as shown
in Figure 5.4. An SPU is a compute engine with SIMD support. It
contains a RISC core, 256 KB of local storage, and a large
128-entry, 128-bit register file used for both floating-point and
integer operations. Like the PPU, the SPUs are dual-issue (shown in
Figure 5.3): an Even Pipeline contains the floating-point and
fixed-point units, and an Odd Pipeline contains the Permute Unit,
Local Store Unit, Channel Unit, and Branch Unit. The SPEs are not
intended to run an operating system and can only operate on data in
their own local storage.
Figure 5.4 Synergistic Processing Element (SPE) block diagram
[6]
Data transfer for the SPEs relies on the MFC, which is a channel
interface between local storage and main memory. The MFC contains a
DMA controller with an associated MMU, as well as an Atomic Unit to
handle synchronization operations with other SPUs and the PPU. [10]
The MFC performs asynchronous DMA transfers between main storage and
the local store, which means that it moves data and instructions
between main storage and the local store independently of the SPU.
The Element Interconnect Bus (EIB) is a circular bus through which
the PPE and the SPEs communicate coherently with each other and with
main storage and the I/O module. The EIB is made of two 128-bit data
channels with a 4-ring structure (two clockwise and two
counterclockwise) for data, and a tree structure for commands, as
illustrated in Figure 5.5. The internal bandwidth of the EIB is 384
GB/s and it supports more than 100 DMA memory requests between main
storage and the SPEs.
Figure 5.5 The EIB grapples with eight concurrent transactions
[17]
Three types of storage domain are defined in the CELL Broadband
Engine: one main-storage domain, eight SPE local store domains, and
eight SPE channel domains. Main storage is configured by the
operating system running on the PPE and can be shared by all
processors and memory-mapped devices. In contrast, the local store
and channels of each SPE are private to the SPU, LS, and MFC of that
SPE.
5.4 Parallel Programming on CELL
The PPE's instruction set is an extended version of the PowerPC
instruction set. The extensions consist of the Vector/SIMD
Multimedia Extension instruction set and a few additions and
changes. Although the SPE instruction set is similar to the
Vector/SIMD Multimedia Extension part of the PPE's instruction set,
the two are different, so programs for the PPE and for the SPEs must
be compiled separately, with the PPU compiler and the SPU compiler
respectively.
5.4.1 SIMD Vectorization
SIMD (Single Instruction, Multiple Data) operation is the most
prominent programming feature of the CELL Broadband Engine. A vector
is a pack of data stored in a one-dimensional array that serves as
the operand of a SIMD operation. SIMD processing exploits data-level
parallelism: one single instruction is applied to multiple data
elements at the same time. To support SIMD operations, both the PPE
and the SPEs have 128-bit registers that hold multiple data elements
as a single vector. SIMD is supported by the PPE's Multimedia
Extension instruction set and by the SPE instruction set. Figure 5.6
takes four concurrent add operations as an example to illustrate how
SIMD works.
Figure 5.6 Four Concurrent Add Operations
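The idea of four concurrent add operations can be emulated in a few lines. This sketch only models the behaviour of one 128-bit vector add on four 32-bit elements. It does not use the CELL intrinsics, and the function names are hypothetical.

```python
def simd_add(a, b):
    """Emulate one 128-bit SIMD add: four 32-bit lanes, one instruction."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("a 128-bit vector holds exactly four 32-bit elements")
    return [x + y for x, y in zip(a, b)]

def vector_sum(xs, ys):
    """Add two arrays four elements at a time, as a vectorized loop would."""
    out = []
    for i in range(0, len(xs), 4):
        out.extend(simd_add(xs[i:i + 4], ys[i:i + 4]))
    return out

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))
print(vector_sum(list(range(8)), [1] * 8))  # two vector adds instead of eight scalar adds
```

An array whose length is a multiple of four needs only a quarter as many add instructions as the scalar loop.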
Both the Multimedia Extension instruction set and the SPU
instruction set have C-language extensions. These relieve
programmers of intensive assembly-language programming, since a
C-language function call is a convenient substitute for in-line
assembly-language instructions.
5.4.2 Data Parallelization Methods
Depending on the requirements and constraints of the application,
there are two different methods to organize related data. One is
called an array of structures (AOS) and the other a structure of
arrays (SOA). Consider, for example, subdivision surfaces in which
the single triangle defined by floating-point vertices a, b, and c
in Figure 5.7 below is subdivided into multiple triangles. [12]
Figure 5.7 Point subdivision illustrated
AOS is also called vector-across form. It keeps each three- or
four-component vertex in a single SIMD vector. This data-packing
approach is a very natural way of representing a 3-dimensional
vertex and often produces small code. However, it typically produces
less efficient code and generally requires significant loop
unrolling to improve its efficiency. The other method, a structure
of arrays (SOA), is also called parallel-array form. Here, each
corresponding data value for each vertex is stored in a
corresponding location in a set of vectors. [18] p74 Depending on
the algorithm, this method may produce more efficient code than AOS.
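The two layouts can be compared on a small example. The vertex values and the helper name below are made up for illustration. Each tuple or list of four values stands for one 128-bit SIMD vector.

```python
# Array of structures (AOS): one vertex (x, y, z, w) per SIMD vector.
aos = [
    (1.0, 2.0, 3.0, 1.0),  # vertex a
    (4.0, 5.0, 6.0, 1.0),  # vertex b
    (7.0, 8.0, 9.0, 1.0),  # vertex c
    (0.0, 1.0, 0.0, 1.0),  # vertex d
]

def aos_to_soa(vertices):
    """Structure of arrays (SOA): one vector per component, across vertices."""
    xs, ys, zs, ws = (list(component) for component in zip(*vertices))
    return {"x": xs, "y": ys, "z": zs, "w": ws}

soa = aos_to_soa(aos)
print(soa["x"])  # the x components of four vertices share one vector

# In SOA form, one vector operation handles the same component of four
# vertices at once, for example scaling all x coordinates by two.
scaled_x = [2.0 * x for x in soa["x"]]
print(scaled_x)
```

In AOS form the same scaling would touch one lane of four different vectors, which is one reason SOA can lead to more efficient SIMD code.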
5.5 Software Development Kit
The CELL Broadband Engine comes with a Software Development Kit
(SDK) for developing programs. The SDK includes the required tools
and some examples that highlight the general principles for
developers. The components of the SDK include the IBM Full System
Simulator for the CELL Broadband Engine (systemsim), a system root
image, GNU tools (C and C++ compilers, linkers, assemblers and
binary utilities for both PPU and SPU), the IBM xlc compiler, newlib
for the SPU, gdb debuggers, PPC64 Linux with CBE enhancements, the
SPE runtime management library, a static timing analysis tool,
performance tools, an Eclipse-based Integrated Development
Environment (IDE), standardized SIMD math libraries, and example
source code.
Chapter 6
Design Considerations
Fast Factorized Back Projection (FFBP, described in Chapter 4) has
been used for image formation. However, because of the calculation
burden and limited computing capabilities, it is performed off-line,
even though FFBP is more efficient than GBP. This chapter discusses
the possibility of real-time FFBP for SAR with CELL as the
processor.
6.1 Real-Time Performance Issues
Because of the heavy computation, SAR data is normally processed
“off-line”. Our target here is to make the processing “on-line”.
When the pulse spacing is fixed, the calculation time determines the
maximum aircraft speed of the SAR. As described in the paper “FAST
FACTORIZED BACK-PROJECTION IN AN FPGA” [2], the airplane used in
CARABAS is a Rockwell Sabreliner 40A with a cruise speed of around
236 m/s without antennas. This speed directly determines the time
constraints of this project. The FPGA implementation of FFBP in
“FAST FACTORIZED BACK-PROJECTION IN AN FPGA” finished the following
number of operations in 2.33 seconds, which is the upper time limit
for this project.
Table 6.1 Number of operations in each iteration

Operation            Number of operations in each iteration
Add                  1.26157824 * 10^9
Subtraction          1.1943936 * 10^7
Multiply             1.7915904 * 10^7
Divide               5.49421056 * 10^8
Square Root (Sqrt)   2.985984 * 10^6 (consists of many operations)
Sine                 2.985984 * 10^6 (consists of many operations)
Cosine               2.985984 * 10^6 (consists of many operations)
Inverse Sine         2.985984 * 10^6 (consists of many operations)
Round                2.985984 * 10^6 (consists of many operations)
Ceil                 4.478976 * 10^6 (consists of many operations)

Total number in each iteration: 1866240000
Total number in six iterations: 11197440000
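The totals above translate into a required processing rate. The sketch below assumes that the 2.33 seconds reported for the FPGA implementation cover all six iterations, which the text does not state explicitly. The variable names are hypothetical.

```python
OPS_PER_ITERATION = 1_866_240_000  # total per iteration from Table 6.1
ITERATIONS = 6
FPGA_TIME_S = 2.33                 # time reported for the FPGA implementation

total_ops = OPS_PER_ITERATION * ITERATIONS
# Assumption: the 2.33 s budget applies to the whole six-iteration run.
required_rate = total_ops / FPGA_TIME_S

print(total_ops)      # operations in six iterations
print(required_rate)  # operations per second needed to match the FPGA
```

Under this assumption, a CELL implementation has to sustain several billion operations per second to match the FPGA.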
6.2 Rounding errors
In traditional FFBP implementations, for example on FPGAs, the
digital signal data is scaled, because such platforms normally do
not support floating-point numbers on a limited budget. Extra time
therefore has to be spent on estimating the errors and correcting
them. The Cell Broadband Engine supports 32-bit floating-point
numbers very well, so there is no need for scaling during
computation, and the precision problems of the traditional methods
are not an issue here.
6.3 Program Efficiency Issues
To achieve high program efficiency on Cell, programmers have to
study carefully the data dependencies, the program dependencies and,
most importantly here, the possibility of programming the algorithm
in a parallel way. The SAR data is stored in an echo matrix, and no
data dependency exists between any two rows of this matrix, since
every echo is independent of the others. The data set can therefore
simply be divided by rows into several parts, and each part assigned
to an SPE. This makes it possible to map FFBP onto CELL as a
parallel program. Within the program, the program structure is a
critical factor for efficiency because of the dual-issue structure
of the SPE cores. CPI (cycles per instruction) is the measure used
for the quality of the program structure.
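The row-wise division can be sketched as below for the 729 rows of the echo matrix and the eight SPEs. The contiguous, nearly equal split and the function name are assumptions for illustration, not necessarily the partitioning used in the implementation.

```python
def split_rows(num_rows, num_workers):
    """Contiguous (start, stop) row ranges whose sizes differ by at most one."""
    base, extra = divmod(num_rows, num_workers)
    ranges, start = [], 0
    for worker in range(num_workers):
        # The first `extra` workers take one additional row each.
        stop = start + base + (1 if worker < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

parts = split_rows(729, 8)
for spe, (start, stop) in enumerate(parts):
    print(spe, start, stop, stop - start)
```

Since 729 is not divisible by eight, one SPE receives 92 rows and the other seven receive 91 rows each.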
Chapter 7
Software Implementation
This project is implemented in the C language extensions of the Cell
Broadband Engine, using parallel programming. Matlab was also used
to model and simulate the algorithm at the initial stage and to
validate the results of the implementation. The implementation is
based on the IBM FULL SYSTEM SIMULATOR that is included in CELL SDK
2.0.
7.1 Data to be processed
We assume that the data to be processed comes from the CARABAS SAR
system and is ready for FFBP calculation. Matlab code that simulates
the incoming radar data was available. It was written by Lars
Ulander in September 1997 and modified by Anelie Wyholt in May 2005.
The necessary CARABAS SAR system parameters were also available.
Several simplifications and assumptions are made in the radar data:
[2]
1. Range ambiguities ignored.
2. Flat Earth geometry ignored.
3. Fixed antenna pointing across-track is assumed.
4. Constant gain in elevation.
5. Real-time range compression of transmitted FM signal. [2]
The raw radar data is stored as a 729 * 2048 double complex matrix
that is based on a pulse spacing of 0.8 m and an aperture angle of
90 degrees. This matrix is stored in memory and is represented in
polar coordinates. [2]
7.2 Data Store
The data matrix is a 729 * 2048 double complex matrix, a type that
is not directly supported by CELL. We therefore split the complex
matrix into two real matrices, one for the real parts and one for
the imaginary parts. Each of them takes up 729 * 2048 * 32 bits =
46656 K bits of memory. We also have to allocate two more memory
areas to store the newly generated matrices from the FFBP
calculation. In total we need 46656 * 4 = 186624 K bits of memory
to store the data. As mentioned in previous chapters, three storage
domains are defined in the CELL Broadband Engine: one main storage
domain, eight SPE local store domains (256 KB each), and eight SPE
channel domains. [35] The main storage used in this project is 256
M bits, so the data matrices have to be placed in main storage.
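The storage arithmetic above can be checked with a few lines of C. This is only a sketch: the constant and function names are chosen here for illustration, and 1 K is taken as 1024 bits, which is the convention that reproduces the figure of 46656 K bits.

```c
#include <stdint.h>

/* Matrix dimensions and element width as stated in the text. */
enum { ROWS = 729, COLS = 2048, BITS_PER_ELEMENT = 32, NUM_MATRICES = 4 };

/* Size of one real-valued matrix in K bits (1 K = 1024 bits). */
static uint64_t matrix_kbits(void)
{
    return (uint64_t)ROWS * COLS * BITS_PER_ELEMENT / 1024u;
}

/* Total for the two input matrices plus the two output matrices. */
static uint64_t total_kbits(void)
{
    return matrix_kbits() * NUM_MATRICES;
}

/* Returns 1 if the whole data set fits into a store of the given size. */
static int fits_in(uint64_t store_kbits)
{
    return total_kbits() <= store_kbits;
}
```

With these definitions, `fits_in(256u * 1024u)` (256 M bits of main storage) is true, while `fits_in(256u * 8u)` (one 256 KB local store) is false, which is why the matrices have to live in main storage.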
7.3 Data transfer
The SPEs perform most of the calculation work, and an SPE can only
operate on instructions and data held in its local store. Since the
initial data has to be stored in main memory, it must be
transferred from main memory to the local store, and this has to be
done in a way that does not hurt the efficiency of the whole
program. As mentioned in previous chapters, the PPU and the SPUs
use the MFC’s DMA to transfer data and instructions from main
memory. An SPU can execute instructions while the DMA transfers
data and instructions autonomously and asynchronously, which hides
the latency caused by data transfer. In other words, we try to
overlap data movement with computation. Figure 7.1 shows a simple
double-buffering flow chart, which is one of the best methods to
achieve this overlap. To make maximum use of the DMA transfer
capacity and improve transfer efficiency, DMA lists are used to a
large extent in this program.
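The double-buffering idea can be sketched in portable C. This is not the code used in this project: `fake_dma_get` is a hypothetical stand-in that copies synchronously, whereas a real MFC transfer would run in the background, and the chunk size and the summation are placeholders for the real work.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 4 };   /* elements per transfer, chosen only for illustration */

/* Hypothetical stand-in for an asynchronous DMA "get"; here a plain copy. */
static void fake_dma_get(float *local, const float *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof *local);
}

/* Sums `total` elements (a multiple of CHUNK) from "main memory" using two
   local buffers: while buffer `cur` is processed, buffer `nxt` is filled. */
static float sum_double_buffered(const float *main_mem, size_t total)
{
    float buf[2][CHUNK];
    size_t chunks = total / CHUNK;
    float sum = 0.0f;
    int cur = 0;

    assert(total % CHUNK == 0);
    if (chunks == 0)
        return 0.0f;
    fake_dma_get(buf[cur], main_mem, CHUNK);        /* fetch the first chunk */
    for (size_t c = 0; c < chunks; ++c) {
        int nxt = cur ^ 1;
        if (c + 1 < chunks)                         /* start the next fetch */
            fake_dma_get(buf[nxt], main_mem + (c + 1) * CHUNK, CHUNK);
        /* On hardware: wait here until the transfer into buf[cur] is done. */
        for (size_t i = 0; i < CHUNK; ++i)          /* compute on current */
            sum += buf[cur][i];
        cur = nxt;
    }
    return sum;
}
```

The point of the structure is that the fetch for chunk c + 1 is issued before chunk c is processed, so on real hardware the transfer time is hidden behind the computation.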
7.4.1 Work Partition
The first and most important issue in this project is how to
partition and allocate the work. Four arrays are used to store the
data: two for the input data (real part and imaginary part) and two
for the output data of every iteration. As explained in Figure 4.1,
this allocation method eliminates the inter-row data dependency. We
chose to partition the work evenly by rows into 8 parts and to
allocate one part to each of the 8 SPEs. However, 729 is not
divisible by 8, so we take the last row out for the PPE to process
during its waiting time and partition the remaining 728-row matrix
into 8 parts of 91 rows each. The idea is to allocate as much work
as possible to the SPEs without leaving the PPE idle.
This project uses a large data set, so we cannot keep the results
of every iteration. We therefore keep only the newest results and
the results of the previous iteration, and output the results of
the last iteration. The memory allocated for the matrices is kept
16-byte aligned in order to obtain correct and fast DMA transfers.
There are also special techniques for parallel programming that
improve the program’s efficiency. The first is to exploit SIMD
manually and to choose the optimal SIMD strategy for this project.
Since we implement FFBP on CELL in C, auto-vectorization by the
compiler may not achieve optimal results for this application
because of the flexibility of the C language. Therefore, I used
intrinsics as much as possible in this project, since intrinsics
are essentially inline assembly with high computational efficiency.
To obtain high performance, we chose Structure of Arrays (SOA) as
the SIMD strategy (introduced in Chapter 5).
The next issue is to understand what dual-issue is and how to use
it. To obtain high efficiency, both pipelines should be kept busy
for as long as possible, which means achieving a high dual-issue
rate. We therefore arrange the program structure by choosing
instructions according to the pipeline they execute on.
Table 7.1 Instructions for the even pipeline [12]
Pipeline 0 (even) instructions                            Latency (clocks)
Single precision floating-point ops                       6
Double precision floating-point ops                       6+7
Integer multiplies, integer/float convert,
interpolate estimate                                      7
Immediate loads, logical ops, integer add/subtract,
sign extend, count leading zero, select bits,
carry/borrow generate                                     2
Pipeline 1 (odd) instructions                             Latency (clocks)
Loads and stores, branch hints, channel operations,
moves to/from SPRs                                        6
                                                          4
The SDK includes a static timing analyzer called spu-timing, which
annotates the assembly instructions of a program with the
instruction-pipeline state. In other words, it shows where
dual-issue was achieved and where dual-issue was possible but did
not occur because of dependencies. This helped me to schedule the
instruction sequence in order to obtain the maximum dual-issue
rate.
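The row partition described at the beginning of this section can be sketched as follows. The type and function names are illustrative assumptions and are not taken from the project code; only the numbers (729 rows, 8 SPEs, leftover rows for the PPE) come from the text.

```c
#include <assert.h>

enum { TOTAL_ROWS = 729, NUM_SPES = 8 };

/* Half-open row range [first, first + count) handled by one worker. */
struct row_range { int first; int count; };

/* Rows left over after an even split; these go to the PPE. */
static int ppe_rows(void)
{
    return TOTAL_ROWS % NUM_SPES;
}

/* Row range for SPE number `spe` (0 .. NUM_SPES - 1). */
static struct row_range spe_range(int spe)
{
    struct row_range r;
    assert(spe >= 0 && spe < NUM_SPES);
    r.count = TOTAL_ROWS / NUM_SPES;
    r.first = spe * r.count;
    return r;
}

/* Row range for the PPE: the remaining rows at the end of the matrix. */
static struct row_range ppe_range(void)
{
    struct row_range r;
    r.count = ppe_rows();
    r.first = TOTAL_ROWS - r.count;
    return r;
}
```

Each SPE receives 91 consecutive rows, and the PPE receives the single remaining row with index 728.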
7.4.2 Program Skills
To avoid expensive branches and the stalls caused by
mispredictions, select-bits instructions are used to reduce the
number of branches. Where a branch cannot be eliminated, the
__builtin_expect language extension is used to reduce the number of
branch mispredictions.
In this project, DMA transfers are always initiated by the SPEs.
This is done for the following reasons:
1. There are eight times more SPEs than PPEs.
2. The SPE command queue is twice as deep as the proxy command
queue.
3. Consumer-managed transfers are easier to synchronize.
4. The number of cycles needed to initiate a transfer from the SPE
is smaller than the number of cycles needed to initiate the same
transfer from the PPE. [12]
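The two branch-related techniques named at the start of this section can be illustrated in portable C. This is a sketch under assumptions: `select_bits`, `clamp_index`, `checked_div` and the `UNLIKELY` macro are hypothetical helpers written for this example, and the bitwise select only imitates what a select-bits instruction does. `__builtin_expect` itself is a GCC and Clang extension.

```c
#include <stdint.h>

/* __builtin_expect is a GCC/Clang extension; fall back to a no-op elsewhere. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Branch-free select: takes bits from a where mask is 1, from b where 0. */
static uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Clamp an index to [0, limit] without an if/else on the data path. */
static uint32_t clamp_index(uint32_t idx, uint32_t limit)
{
    /* mask is all ones when idx is too large, all zeros otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(idx > limit);
    return select_bits(limit, idx, mask);
}

/* A branch that cannot be removed: mark the error path as unlikely. */
static int checked_div(int num, int den, int *out)
{
    if (UNLIKELY(den == 0))
        return -1;
    *out = num / den;
    return 0;
}
```

The select form trades a possible misprediction for a few cheap logical operations, which suits a processor without dynamic branch prediction.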
Since the number of loop iterations in this project’s code is
constant, we SIMDize and unroll the loops, which helps the
optimizer to schedule instructions efficiently. Because of the
basic loop structure, loop unrolling also improves the dual-issue
rate, since computation can proceed at the same time as data is
loaded and stored. The following data shows the effectiveness of
unrolling loops for a sample workload that performed OpenGL-like
coordinate transformation and lighting.
Table 7.3 Performance metrics for unrolling loops [12]
In the mathematical model of the FFBP algorithm, data dependencies
exist throughout the whole model, and without optimization they
would cause dependency stalls. We reduce these dependencies by
performing several calculations in parallel and by reducing the
number of variables in order to save registers.
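Loop unrolling and the reduction of dependency chains can be shown together on a simple summation. This is an illustrative sketch, not the FFBP kernel: the function names and the unroll factor of four are assumptions made for the example.

```c
#include <stddef.h>

/* Straightforward loop: every addition depends on the previous one. */
static float sum_simple(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Unrolled by four with independent accumulators, so the four additions in
   one pass do not wait for each other. A tail loop handles n % 4 != 0. */
static float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Because floating-point addition is not associative, the unrolled version may round differently from the simple loop on general data, so results should be compared with a tolerance when such a transformation is validated.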
7.5 Program design
7.5.1 Kernel Algorithm and Matlab model
In this project, I used the same Matlab model as in the thesis
“FAST FACTORIZED BACK-PROJECTION IN AN FPGA” (Andreas Hast and Lars
Johansson), and the input data is simulated data that I created
myself. The key issue of FFBP is to decide which elements are to be
combined into a new element. The combination itself is a simple
addition in this Matlab model, so the address calculation consumes
almost all of the running time of the code.
7.5.2 Work Steps
The design process of this parallel program consists of four steps:
1. Analyze the Matlab code and rewrite it in C:
The FFBP algorithm is first written in Matlab and then rewritten in
C, since mathematical equations are much easier to program in
Matlab. In this step we also used the inline math library for the
CELL Broadband Engine and made sure that all calculation results
were correct.
2. SIMDize the code on the PPE:
The code from the previous step can be SIMDized in several ways,
but the programmer should carefully choose one SIMD strategy to
make the parallel code efficient. In this project we chose SOA
(mentioned in 5.4.2). We also checked the results of this step and
evaluated the efficiency of the code.
3. Migrate the PPE code onto one SPE:
This step involved creating SPE threads on the PPE, migrating the C
instructions and CELL PPE intrinsics to SPE intrinsics, and adding
DMA transfers to move data between main memory and the local store.
Here again the results must be checked, and we spent more time on
optimization, for example by using asynchronous DMA transfers to
improve efficiency. Figure 7.4 shows the address mapping from the
SPE local store to main memory using the address offset method.
4. Parallelize the code across 8 SPEs:
Since there is little data dependency within one iteration, we
partition the data to parallelize the computation across the 8
SPEs. In this step we check the results and do the final
optimization of the code.
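The SOA layout chosen in step 2 can be contrasted with the interleaved layout in a small sketch. The type names, the buffer length N and the element-wise addition are assumptions made for illustration; the project code itself uses SIMD intrinsics rather than plain loops.

```c
#include <stddef.h>

enum { N = 8 };  /* number of complex samples, chosen only for illustration */

/* Array of Structures (AoS): real and imaginary parts interleaved. */
struct complex_aos { float re, im; };

/* Structure of Arrays (SoA): all real parts together, all imaginary parts
   together, so four consecutive floats can fill one 128-bit register. */
struct complex_soa { float re[N]; float im[N]; };

/* Convert an AoS buffer of N samples into the SoA layout. */
static void aos_to_soa(const struct complex_aos *in, struct complex_soa *out)
{
    for (size_t i = 0; i < N; ++i) {
        out->re[i] = in[i].re;
        out->im[i] = in[i].im;
    }
}

/* Element-wise complex addition in SoA form: two simple loops over
   contiguous floats, the access pattern that SIMD code needs. */
static void add_soa(const struct complex_soa *a, const struct complex_soa *b,
                    struct complex_soa *sum)
{
    for (size_t i = 0; i < N; ++i)
        sum->re[i] = a->re[i] + b->re[i];
    for (size_t i = 0; i < N; ++i)
        sum->im[i] = a->im[i] + b->im[i];
}
```

Splitting the complex matrix into a real and an imaginary matrix, as described in section 7.2, is the same idea applied to the whole data set.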
Figure 7.2 shows the process flow of the PPE and Figure 7.3 shows
the flow of the SPEs. As illustrated in Figure 7.2, the PPE reads
the initial input data into main memory and then creates SPE
threads to allocate work to the different SPEs. After that, the PPE
processes the special case, the last row of the input data matrix,
while the SPEs process the major part of the work. When the PPE has
finished the special case, it waits for the SPEs to finish their
jobs and checks whether all the work is done. If so, the PPE
outputs the calculation results. If not, the PPE swaps the input
data matrix address and the output data matrix address and then
performs the same job in the next iteration.
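The address swap between iterations amounts to a ping-pong scheme over two buffers. The sketch below shows only that control flow: `one_iteration` is a placeholder that is not the FFBP merge, and the buffer length is an assumption. The six iterations match the iteration count used in this thesis.

```c
#include <stddef.h>

enum { LEN = 4, ITERATIONS = 6 };

/* Placeholder for one iteration of work: NOT the FFBP merge, only a
   stand-in that reads `in` and writes `out`. */
static void one_iteration(const float *in, float *out)
{
    for (size_t i = 0; i < LEN; ++i)
        out[i] = in[i] + 1.0f;
}

/* Runs ITERATIONS passes, swapping the input and output addresses after
   each pass. Returns a pointer to the buffer holding the final result. */
static const float *run_iterations(float *buf_a, float *buf_b)
{
    float *in = buf_a, *out = buf_b;
    for (int it = 0; it < ITERATIONS; ++it) {
        one_iteration(in, out);
        float *tmp = in;   /* swap: this pass's output is the next input */
        in = out;
        out = tmp;
    }
    return in;             /* after the last swap, `in` holds the newest data */
}
```

Only two buffers are ever needed, which matches the decision to keep just the newest results and those of the previous iteration.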
As shown in Figure 7.3, the SPEs initiate the DMA transfers
themselves and read in data through their own DMAs, which saves DMA
initiation time and data transfer time. While a DMA transfer is in
progress, each SPE calculates address offsets in advance; these
offsets are used later, when the DMA has finished the data
transfer. In the output procedure, results are written from the
local store to main memory through DMA, and this is overlapped with
calculating the address offsets for the next iteration. Finally,
each SPE checks whether it has finished all the required
iterations. If not, the SPE continues with the same operations in
the next iteration; if so, it returns a “success” signal to the PPE
and ends its thread.
Figure 7.4 shows the address mapping from the local store to main
memory by address offset. The SPEs obtain the start addresses of
the input data in main memory from the PPE and then calculate the
address offsets of the target data they need. By adding these two
address values, an SPE can locate the target data in main memory.
Figure 7.4 Address mapping from SPE local store to main
memory
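The address offset method reduces to adding a computed offset to a base address. The sketch below assumes a row-major matrix of 32-bit floats; the storage order, the function names and the example base address are assumptions, not details confirmed by the project code.

```c
#include <assert.h>
#include <stdint.h>

enum { ROWS = 729, COLS = 2048 };

/* Byte offset of element (row, col) in a row-major matrix of floats. */
static uint64_t element_offset(uint32_t row, uint32_t col)
{
    assert(row < ROWS && col < COLS);
    return ((uint64_t)row * COLS + col) * sizeof(float);
}

/* Effective address in main memory = base address from the PPE + offset. */
static uint64_t effective_address(uint64_t base, uint32_t row, uint32_t col)
{
    return base + element_offset(row, col);
}
```

With 4-byte floats, one row occupies 2048 * 4 = 8192 bytes, so `effective_address(0x10000, 2, 3)` is 0x10000 + 2 * 8192 + 12.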
8.1 Algorithm kernel test
The kernel of the FFBP algorithm is to find the positions of the
elements that are used to generate a new element in the radar data
matrix. Once the range between apertures and the size of the radar
data matrix are decided, these positions can be determined. The
results of this project agree with the results from Matlab, which
means that the kernel part of the algorithm is implemented
correctly. Because original radar data was not available, a test of
the whole implementation could not be performed. One corner case
should be noted here: the inverse sine and inverse cosine
functions, because of their restricted domains. Matlab handles this
automatically, but the C language does not. I therefore used the
math library in CELL toolchain-3.3 to overcome this difficulty.
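One common way to guard against arguments that fall slightly outside the domain of the inverse sine and cosine is to clamp them first. This is a generic sketch and not necessarily what the math library used in this project does; `clamp_unit` is a hypothetical helper, and clamping is only appropriate when the excess is caused by rounding.

```c
/* Clamp x to the closed interval [-1, 1], the domain of asin and acos.
   A NaN input is returned unchanged because both comparisons are false. */
static float clamp_unit(float x)
{
    if (x > 1.0f)
        return 1.0f;
    if (x < -1.0f)
        return -1.0f;
    return x;
}
```

A call such as `asinf(clamp_unit(x))`, with `asinf` from `<math.h>`, then always receives a valid argument instead of producing NaN for a value like 1.0000001.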
8.2 Precision analysis
The CELL Broadband Engine supports floating-point numbers very
well, so no special consideration of rounding errors is needed. In
the previous projects implemented on FPGAs, however, errors were
introduced because the FPGAs did not support floating-point
numbers. Since SAR demands high precision in the results, CELL’s
ability to calculate with floating-point numbers fits this demand
very well. This feature also lightens the burden on programmers and
improves their productivity.
8.3 Real-Time Performance
As mentioned in section 6.1, the cruise speed of the platform,
which in this thesis is a Rockwell Sabreliner 40A airplane,
determines the time constraints of this implementation. In the
thesis “FAST FACTORIZED BACK-PROJECTION IN AN FPGA” [2], an FPGA
was used to calculate six iterations on a 729 * 2048 matrix, which
took about 2.33 seconds and allowed a maximum speed of 194 m/s.
However, the maximum cruise speed of the plane is 236 m/s.
According to the results of this project, an SPU needs 273875486
cycles for the same job. The SPU’s target clock frequency at
introduction is 3.2 * 10^9 Hz, so the SPE needs about 0.085586
seconds to finish the job, which is a six-iteration calculation. If
we round this time up to 0.1 second, the maximum plane speed
allowed is 4462 m/s, which is much greater than 236 m/s. This gives
the SAR designer great flexibility.
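The conversion from the measured cycle count to execution time is a single division, shown here as a small sketch. The function names are chosen for illustration; the cycle count, the clock frequency and the FPGA time are the figures quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Execution time in seconds for a given cycle count and clock frequency. */
static double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}

/* Ratio of a reference run time to a new run time. */
static double speedup(double reference_seconds, double new_seconds)
{
    assert(new_seconds > 0.0);
    return reference_seconds / new_seconds;
}
```

`cycles_to_seconds(273875486ULL, 3.2e9)` gives about 0.085586 seconds, and `speedup(2.33, 0.085586)` is roughly 27, which is the ratio between the quoted FPGA time and the SPE time.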
8.4 SPE efficiency analysis
The IBM FULL SYSTEM SIMULATOR allows cycle-accurate simulation of
the SPUs, and it also provides a statistics function called
SPUStats. Table 8.1 and Table 8.2 show the statistics from one of
the 8 SPUs. In Table 8.1, Total Cycle Count is the total number of
SPU run cycles. Performance Cycle Count is the total number of run
cycles of the code we are interested in. Performance Instruction
Count is the total number of instructions executed within the code
we are interested in, and the number in brackets is the count
without NOP instructions. Performance CPI is the average number of
cycles per instruction (CPI), and this number is an important
indicator of the efficiency of the code. Branch Instructions is the
total number of branch-type instructions executed. Hint
Instructions is the count of HBR-type instructions executed. Hint
Hits is the count of executed instructions that were loaded from
the hint target prefetch buffer. [15]
Table 8.1
Total Cycle Count                  273875486
Performance Cycle Count            273875486
Performance Instruction Count      250210140 (249648855)
Performance CPI                    1.09 (1.10)
Branch instructions                280953
Branch taken                       280311
Branch not taken                   642
Hint instructions                  280152
Hint hit                           279732
In Table 8.2:
Single Cycle means cycles in which only one non-NOP instruction was
executed.
Dual Cycle means cycles in which 2 non-NOP instructions were
executed (Dual Issue).
NOP Cycle means cycles in which only NOP instructions were
executed.
Branch Miss Stalls means cycles in which a branch mispredict
prevented any instruction from executing.
Prefetch Miss Stalls means cycles in which instruction run-out
occurred.
Dependency Stalls means cycles in which source/target operand
dependencies prevented any instruction from being issued.
FP Resource Stalls means cycles in which shared use of FPU stages
prevented any instruction from being issued.
Hint Target Stall means cycles for which the target load delay for
a hinted branch prevented instruction fetch.
Pipe Hazard Stall means cycles for which pipeline scheduling
hazards prevented instruction issue.
Channel Stall means cycles for which the pipeline was stalled
waiting on channel operations to complete.
Init Cycles means cycles elapsed in the SPU pipeline initialization
sequence. [15]
Table 8.2
Cycle Type                           Cycle Number   Ratio
Single cycle                         217201647      79.3%
Dual cycle                           16223604       5.9%
Nop cycle                            280902         0.1%
Stall due to branch miss             20345          0.0%
Stall due to prefetch miss           0              0.0%
Stall due to dependency              36121830       13.2%
Stall due to fp resource conflict    0              0.0%
Hint target stall                    12             0.0%
Pipe hazard stall                    35             0.0%
Channel stall cycle                  4027100        1.5%
SPU Initialization cycle             9              0.0%
Total Cycle                          273875486      100%
The stalls caused by DMA conflicts are counted as Channel stall
cycles and make up only 1.5% of the total cycles, which is in an
acceptable range. I therefore did not pay special attention to this
problem and did not schedule DMA globally. There are methods to
handle these conflicts, however, such as storing data in different
memory banks.
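The CPI and the cycle ratios reported by the simulator can be recomputed from the raw counts. This is a small sketch with helper names chosen for illustration; the numbers passed in are those listed in the two tables.

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per instruction. */
static double cpi(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Share of `part` in `total`, in percent. */
static double percent(uint64_t part, uint64_t total)
{
    assert(total > 0);
    return 100.0 * (double)part / (double)total;
}
```

`cpi(273875486ULL, 250210140ULL)` is about 1.09 and `cpi(273875486ULL, 249648855ULL)` is about 1.10, matching the two Performance CPI values. `percent(16223604ULL, 273875486ULL)` gives the dual-issue share of about 5.9%, and `percent(36121830ULL, 273875486ULL)` gives the dependency stall share of about 13.2%.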
Chapter 9
Conclusions and Future work
The purpose of this thesis is to turn the “off-line” signal
processing of SAR into “on-line” signal processing by using the
CELL Broadband Engine and the FFBP algorithm.
9.1 Conclusions
This project is one part of the many parts of SAR digital signal
processing, so we cannot give an exact processing time for the
whole SAR process. However, according to the benchmark results and
the comparison with the results from “FAST FACTORIZED
BACK-PROJECTION IN AN FPGA” (Chapter 8), this implementation is
fast enough to be used in real-time processing.
9.2 Implementation complexity
Since this project is based on the CELL Broadband Engine, there is
no need to design a new SoC specifically for this algorithm.
Engineers only need to master C and Matlab and to have some
knowledge of the algorithm. The main workload is the optimization
of the code.
9.3 Future Work
This project is only one step of the whole SAR process, which
cannot be completed in a single project. Many aspects of this
project can still be improved, such as the efficiency of the code.
Energy consumption is not considered in this project, although it
is a very important issue in signal processing systems and even
more so in small ones. Energy consumption may therefore be a very
interesting topic when implementing FFBP on the CELL Broadband
Engine.
in Electrical Engineering and Computer Systems Engineering, Andreas
Hast, Lars Johansson
3. Fundamentals of Radar Signal Processing, Mark A. Richards,
Ph.D., pp392
4. Wikipedia.org
http://en.wikipedia.org/wiki/Image:Radar_antenna.jpg
5. Wikipedia.org
http://en.wikipedia.org/wiki/Image:Radar_composantes.png
6. Real-Time Space-Time Adaptive Processing on the STI CELL
Multiprocessor, Yi-Hsien Li, LITH-ISY-EX--07/3953SE, pp6, Linköping
2007
7. Introduction to Airborne Radar, second edition, George W.
Stimson, SciTech Publishing, Inc., Mendham, New Jersey, pp189
8. Wikipedia.org http://en.wikipedia.org/wiki/Cell_microprocessor
9. Programming Tutorial DRAFT, Software Development Kit for
Multicore Acceleration Version 3.0, pp5, IBM
10. Cell Broadband Engine Architecture and its first
implementation, Thomas Chen, Systems Performance, IBM; Ram
Raghavan, Systems Performance, IBM; Jason Dale, Systems
Performance, IBM; Eiji Iwata, Microprocessor Development, Sony
Computer Entertainment Inc., 29 Nov 2005
http://www.ibm.com/developerworks/power/library/pa-cellperf/
11. Programming Tutorial DRAFT, Software Development Kit for
Multicore Acceleration Version 3.0, pp35, IBM
12. Maximizing the power of the Cell Broadband Engine processor: 25
tips to optimal application performance, Daniel A. Brokenshire
http://www-128.ibm.com/developerworks/power/library/pa-celltips1/
13. A COMPARISON OF FAST FACTORISED BACK-PROJECTION AND WAVENUMBER
ALGORITHMS FOR SAS IMAGE RECONSTRUCTION, A. J. Hunter, M. P. Hayes,
P. T. Gough, Acoustics Research Group, Dept. Electrical and
Computer Engineering, University of Canterbury, New Zealand, Email:
a.hunter,m.hayes,p.gough @elec.canterbury.ac.nz
14. Sandia National Laboratories,
http://www.sandia.gov/radar/sar.html
15. Using Systemsim to Guide Application Transformation and Tuning
for the Cell Broadband Engine, Michael Kistler, David Murrell,
Vipin Sachdeva
16. Efficient Parallel Architectures for Future Radar Signal
Processing, Anders Åhlander, Department of Computer Science and
Engineering, CHALMERS UNIVERSITY OF TECHNOLOGY, GÖTEBORG, SWEDEN
2007
17. Meet the experts: David Krolak on the Cell Broadband Engine EIB
bus, Power Architecture editors, developerWorks, IBM
http://www.ibm.com/developerworks/power/library/pa-expert9/
18. Cell Broadband Engine Programming Tutorial, Version 2.1, IBM
Systems and Technology Group, 2070 Route 52, Bldg. 330, Hopewell
Junction, NY 12533-6351
Copyright
The publishers will keep this document online on the Internet - or
its possible replacement - for a considerable time from the date of
publication barring exceptional circumstances.
The online availability of the document implies a permanent
permission for anyone to read, to download, to print out single
copies for your own use and to use it unchanged for any
non-commercial research and educational purpose. Subsequent
transfers of copyright cannot revoke this permission. All other
uses of the document are conditional on the consent of the
copyright owner. The publisher has taken technical and
administrative measures to assure authenticity, security and
accessibility.
According to intellectual property law the author has the right to
be mentioned when his/her work is accessed as described above and
to be protected against infringement.
For additional information about the Linköping University
Electronic Press and its procedures for publication and for
assurance of document integrity, please refer to its WWW home page:
http://www.ep.liu.se/
© 2008, Yu Shi