Enhanced SAR Image Processing using A Heterogeneous Multiprocessor

by
YU SHI
Linköpings universitet
http://www.ep.liu.se
Abstract
Synthetic aperture radar (SAR) is an airborne radar technique that focuses the echoes of many pulses to produce high-resolution radar images. A number of image processing algorithms have been developed for this kind of radar, but the computational burden is still heavy, so SAR image processing is normally performed off-line.
The Fast Factorized Back Projection (FFBP) algorithm is considered a computationally efficient algorithm for SAR image formation, and several implementations have tried to bring the processing on-line.
The CELL Broadband Engine is a recent multi-core processor jointly developed by Sony, Toshiba and IBM. CELL is well suited to parallel computation and floating-point arithmetic, both of which SAR image formation demands.
This thesis implements the FFBP algorithm on the CELL Broadband Engine and compares the results with previous projects. The aim is to make real-time SAR image formation possible.
Keywords: CELL Broadband Engine, synthetic aperture radar, Fast Factorized Back Projection (FFBP) algorithm, C language, parallel programming, parallel computing, Matlab
Acknowledgements
Firstly, I want to thank my supervisor, Doctor Di Wu, and my examiner, Professor Dake Liu, for their warm-hearted help and heuristic guidance during my final thesis. I would also like to thank all my friends in Linköping, who have always shared their happiness with me. I also cherish the love of my family and my girlfriend Xiaoqian Liu.
Yu Shi
Linköping, 11 February 2008
2.3.1 Data collection
2.3.2 Pulse compression
2.3.3 Doppler Effect and Doppler Filter Bank
2.3.4 Detection
2.3.5 Resolving
1.1 Background
Detection and tracking are the two primary functions of radar. Synthetic aperture radar (SAR) can create high-resolution, two-dimensional images of static ground scenes; it was first described and demonstrated by Carl Wiley of Goodyear Aircraft in 1951. SAR is an active system with all-weather, day-or-night capability, so it can take high-resolution images at any time. Figure 1.1 shows an image of Washington, D.C. taken by synthetic aperture radar.
Figure 1.1 Synthetic Aperture Radar image of Washington, D.C. [14]
As new ideas and technologies emerge, SAR is used much more extensively than ever before, in both military and civilian applications such as reconnaissance, surveillance and targeting; treaty verification and nonproliferation; navigation and guidance; foliage and ground penetration; and Associative Aperture Synthesis Radar.
Because of the heavy computational burden and limited computing capability, SAR systems have so far struggled to operate in real time. To improve this, this thesis discusses the implementation of an advanced SAR algorithm, Fast Factorized Back Projection (FFBP), on the STI CELL. FFBP is an efficient algorithm for creating high-resolution SAR images and has attracted considerable attention recently. The STI CELL, jointly developed by Sony, Toshiba, and IBM, is built around a new parallel computation concept.
1.2 Purpose of the Thesis
The purpose of this thesis is to implement the Fast Factorized Back Projection (FFBP) algorithm on the CELL Broadband Engine in order to improve the efficiency of SAR signal processing and make real-time processing possible. The kernel of the FFBP algorithm is accelerated using parallel computation and SIMD.
1.3 Way of work
This project is written in C extended with the CELL intrinsics (which take the place of in-line assembly language) and runs on the IBM Full System Simulator for the Cell Broadband Engine. To ease the prototyping workload, Matlab was used to prototype the algorithm. The results are also compared with previous work on an FPGA to illustrate the efficiency of parallel computation.
1.4 Outline
Chapter 2 describes the basic processing flow of airborne radar.
Chapter 3 gives a short introduction to synthetic aperture radar (SAR).
Chapter 4 describes the Fast Factorized Back Projection (FFBP) algorithm and the way it is used in this thesis.
Chapter 5 presents an overview of the CELL Broadband Engine, covering both its hardware and its software features.
Chapter 6 discusses the main constraints of this project and the considerations made at the design stage.
Chapter 7 covers the software details of this project, including the design of the data structures and the work flow.
Chapter 8 presents the results of the implementation in Chapter 7 and compares them with previous work.
Chapter 9 discusses the results and future work.
Introduction to Radar System and overview of Radar Signal Processing
2.1 Radar fundamentals
Radar is short for “radio detection and ranging”. Its main uses are detection, tracking, and imaging, of which detection is the most fundamental.
Figure 2.1 This RADAR is the ARPA Long-Range Tracking and Instrumentation Radar (ALTAIR) located in the Kwajalein atoll on the island of Roi-Namur in the Ronald Reagan Ballistic Missile Defense Test Site. It was initially developed and built between 1968 and 1970. [4]
As shown in Figure 2.2, the main parts of a radar are the transmitter, waveguide, duplexer, receiver, an electronic section including software, and links to end users. The carrier frequencies of most radar systems lie between 1 and 40 GHz.
Figure 2.2 Radar Components [5]
2.2 Introduction to Airborne Radar System
Airborne radar is radar equipment mounted on an aircraft platform and used in military or civilian applications, such as weather assessment, navigation, altimetry, mapping, and military combat.
SAR (synthetic aperture radar) is the technique used on airborne radar to obtain high-resolution maps of terrain. It uses the Doppler shift produced by ground targets as the aircraft passes by to map the earth's surface with high resolution in both the range and azimuth dimensions.
Airborne radars have to overcome unique design difficulties, mainly caused by clutter and size limitations. Clutter consists of unwanted echoes, mainly from the ground, that interfere with the echoes from targets. Size limitations constrain the design of the antenna and the radio frequency that can be used.
Figure 2.3 Illustration of an airborne radar environment
2.3.1 Data collection
Radar collects the echoes reflected from objects in order to detect and range them. The distance between the radar and an object is proportional to the travel time of the wave. Each sample of the echo received after a transmitted pulse therefore corresponds to a certain distance from the radar, and the data collected for these distances are called range bins.
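The proportionality between distance and travel time can be written as R = c·t/2, since the wave travels to the object and back. The following is a minimal Python sketch of that relation; the function names and the receiver sample rate are illustrative assumptions, not part of the thesis implementation.

```python
import math

# Speed of light in vacuum, in metres per second.
C = 299_792_458.0


def echo_range_m(round_trip_time_s):
    """Distance to a reflector, given the two-way travel time of its echo."""
    # The pulse travels out and back, hence the factor 1/2.
    return C * round_trip_time_s / 2.0


def range_bin_index(round_trip_time_s, sample_rate_hz):
    """Index of the receiver sample (range bin) in which the echo arrives.

    sample_rate_hz is a hypothetical receiver sampling rate.
    """
    return math.floor(round_trip_time_s * sample_rate_hz)


# An echo received 10 microseconds after transmission is about 1.5 km away.
print(echo_range_m(10e-6))
```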
2.3.2 Pulse compression
To achieve both high range resolution and long detection range, the transmitted pulse would have to be extremely short and have very high power. Because of practical power limits, the PRF (Pulse Repetition Frequency) has to be kept low enough, and a comparatively long pulse has to be transmitted, which leaves us in a dilemma. In addition, range resolution is degraded because the echo is spread over the duration of the pulse. Pulse compression is one of the techniques that resolve this dilemma. The radar transmits modulated pulses that are long enough to stay within the practical power limits. When the echoes are received, the radar demodulates them to compress the pulse. Three pulse compression methods are in common use: linear frequency modulation (chirp), binary phase modulation, and polyphase modulation.
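As an illustration of linear frequency modulation, the sketch below generates a baseband chirp, places it in a longer receive window, and compresses it with a matched filter (correlation with the transmitted pulse). It is a plain-Python toy under assumed parameters (pulse length, sweep bandwidth, delay), not the processing chain of any particular radar.

```python
import cmath
import math


def chirp(n_samples, bandwidth_fraction=0.5):
    """Baseband linear FM pulse sweeping over bandwidth_fraction of the sample rate."""
    sweep_rate = bandwidth_fraction / n_samples  # cycles per sample squared
    centre = n_samples / 2.0
    return [cmath.exp(1j * math.pi * sweep_rate * (n - centre) ** 2)
            for n in range(n_samples)]


def matched_filter(received, pulse):
    """Correlate the received samples with the pulse; return |output| per lag."""
    n_lags = len(received) - len(pulse) + 1
    profile = []
    for lag in range(n_lags):
        acc = sum(received[lag + i] * pulse[i].conjugate()
                  for i in range(len(pulse)))
        profile.append(abs(acc))
    return profile


def compress_example(n_pulse=64, delay=40, total=160):
    """Return (lag of the compressed peak, peak magnitude) for one delayed echo."""
    pulse = chirp(n_pulse)
    received = [0j] * total
    for i, sample in enumerate(pulse):
        received[delay + i] = sample
    profile = matched_filter(received, pulse)
    peak = max(profile)
    return profile.index(peak), peak


print(compress_example())
```

The long pulse is compressed into a narrow peak located at the echo delay, and the peak magnitude equals the number of pulse samples, which is how the energy of a long, low-power pulse is recovered.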
2.3.3 Doppler Effect and Doppler Filter Bank
The Doppler Effect is a shift in the frequency of a wave radiated, reflected, or received by objects in motion. [7] As illustrated in Figure 2.4, the Doppler Effect occurs when a point wave source moves.
Figure 2.4 A wave radiated from a point source when stationary (a) and when moving (b). Wave is compressed in direction of motion, spread out in opposite direction, and unaffected in direction normal to motion. [7]
A Doppler filter bank is designed to separate the echoes from many different sources simultaneously, based on their differences in Doppler frequency. If the filters had no side lobes, each filter would respond only to signals whose frequency falls within its frequency band. In practice, because of filter side lobes, a filter may also pass some signals whose frequencies lie outside its band.
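For a radar that both transmits and receives, the Doppler shift of an echo from a reflector closing at radial velocity v is f_d = 2v/λ. The sketch below computes this shift and maps it to one filter of an idealized bank of equally spaced filters; the bank layout (n_filters filters spanning one PRF) and all parameter values are assumptions made for illustration.

```python
def doppler_shift_hz(radial_velocity_mps, wavelength_m):
    """Two-way Doppler shift of an echo from a reflector with the given closing speed."""
    return 2.0 * radial_velocity_mps / wavelength_m


def doppler_filter_index(doppler_hz, prf_hz, n_filters):
    """Index of the filter whose centre frequency is nearest to doppler_hz.

    Assumes an idealized bank of n_filters filters spaced prf_hz / n_filters
    apart; frequencies beyond the PRF wrap around (alias).
    """
    spacing_hz = prf_hz / n_filters
    return int(round(doppler_hz / spacing_hz)) % n_filters


# A 3 cm wavelength and a closing speed of 150 m/s give a shift of about 10 kHz.
print(doppler_shift_hz(150.0, 0.03))
print(doppler_filter_index(1000.0, 8000.0, 32))
```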
2.3.4 Detection
Detection is performed after Doppler filtering and identifies potential targets from the energy of the received signals. For a target to be detected, its signal energy plus the accompanying noise energy must exceed a certain threshold value, as shown in Figure 2.5. If the detection threshold is too low, too many objects are falsely detected as targets. If it is too high, some targets are missed. The detector contains a Constant False Alarm Rate (CFAR) circuit, which is responsible for determining the detection threshold.
Figure 2.5 Integrated noise energy at end of successive integration times. On average, for a target to be detected, the integrated signal energy must be greater than Smin. [7] p. 136
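One common way to set such a threshold adaptively is cell-averaging CFAR, in which the noise level around each cell is estimated from neighbouring training cells. The thesis does not say which CFAR variant is meant, so the following Python sketch is only an assumed example, with hypothetical window sizes and scale factor.

```python
def ca_cfar(power, n_train=8, n_guard=2, scale=5.0):
    """Cell-averaging CFAR over a list of power samples.

    For each cell under test, the noise level is estimated as the mean of
    n_train training cells (half on each side), skipping n_guard guard cells
    next to the cell. A detection is declared when the cell exceeds
    scale * noise. All parameter values are illustrative assumptions.
    """
    half = n_train // 2
    margin = half + n_guard
    detections = []
    for i in range(margin, len(power) - margin):
        left = power[i - margin:i - n_guard]
        right = power[i + n_guard + 1:i + margin + 1]
        noise = (sum(left) + sum(right)) / (len(left) + len(right))
        if power[i] > scale * noise:
            detections.append(i)
    return detections


# Flat noise floor with one strong echo in cell 25.
samples = [1.0] * 50
samples[25] = 20.0
print(ca_cfar(samples))
```

Because the threshold follows the local noise estimate, raising the noise floor everywhere does not by itself create detections, which is what keeps the false alarm rate roughly constant.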
2.3.5 Resolving
When the time of flight is longer than the interval between two transmitted pulses, the echoes can be aliased, and it is then impossible to determine which transmitted pulse an echo belongs to. [16] p. 11 A further problem is that aliased and non-aliased echoes are difficult to tell apart. This can be overcome by resolving, analogously to range resolving, in which PRF switching is used over a number of subsequent coherent processing intervals (CPIs).
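The aliasing described above can be made concrete with the unambiguous range R_u = c/(2·PRF): an echo from beyond R_u appears at its true range modulo R_u. The sketch below shows, under assumed PRF values, how measurements at two different PRFs can be combined to recover the true range; the search routine is a simple illustration and not the method of reference [16].

```python
C = 299_792_458.0  # speed of light, m/s


def unambiguous_range_m(prf_hz):
    """Largest range whose echo returns before the next pulse is transmitted."""
    return C / (2.0 * prf_hz)


def apparent_range_m(true_range_m, prf_hz):
    """Range at which an echo appears when echoes beyond R_u are aliased."""
    return true_range_m % unambiguous_range_m(prf_hz)


def resolve_range(app1_m, prf1_hz, app2_m, prf2_hz, max_range_m, tol_m=1.0):
    """Find a range consistent with the apparent ranges seen at two PRFs.

    Steps through the candidates app1 + k * R_u1 and returns the first one
    that also matches app2 modulo R_u2 within tol_m, or None if there is none.
    """
    r1 = unambiguous_range_m(prf1_hz)
    r2 = unambiguous_range_m(prf2_hz)
    k = 0
    while app1_m + k * r1 <= max_range_m:
        candidate = app1_m + k * r1
        residual = (candidate - app2_m) % r2
        if min(residual, r2 - residual) <= tol_m:
            return candidate
        k += 1
    return None


true_range = 250_000.0
a1 = apparent_range_m(true_range, 1000.0)
a2 = apparent_range_m(true_range, 1250.0)
print(resolve_range(a1, 1000.0, a2, 1250.0, 400_000.0))
```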
3.1 Introduction to SAR
SAR is short for Synthetic Aperture Radar and is used to image terrain clutter. Mounted on an airplane or a satellite, a SAR forms high-resolution, two-dimensional radar images from low-resolution aperture data. Because of limited computing power and the complexity of its digital signal processing algorithms, SAR was not an efficient radar when it was first introduced in the 1950s. As DSP technology and algorithms evolved, SAR regained the attention of engineers because of its ability to obtain high azimuth resolution.
3.2 SAR fundamentals
As mentioned in the previous chapters, a radar map has to provide high resolution in both the range and azimuth dimensions. As in the other radar systems discussed in Chapter 2, high range resolution is achieved by pulse compression. Azimuth is perpendicular to range, as shown in Figure 3.1. High azimuth resolution normally cannot be achieved by conventional means; SAR, however, provides relatively high azimuth resolution, which is an important advantage over other radars.
Figure 3.1
The azimuth resolution a of a SAR antenna can be calculated by
a = R*λ/D (3.1) [2]
where R is the distance from the antenna to the target, D is the length of the antenna, and λ is the wavelength. Equation 3.1 shows that a finer azimuth resolution requires a longer antenna. For example, with a target distance R = 1 km and a wavelength λ = 50 cm, an azimuth resolution of a = 50 cm requires an antenna 1 km long, which is clearly impossible in practice. Instead of building a large physical phased array antenna, a single array element can be moved through the successive element positions to form the complete array. [3] Since the waves travel at the speed of light, the speed of the airplane can be neglected, and the start-stop approximation can be used. Along the flight route, the single array element transmits pulses and receives echoes at each position. The data collected at each position are then coherently combined to simulate a large array antenna, in which this combination is usually done in microwave hardware. This process is shown in Figure 3.2 for an antenna on an airplane platform.
Figure 3.2
The range bins are then put together into an echo matrix that the digital processor uses for further processing.
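The antenna-length argument based on Equation 3.1 can be checked numerically. The short Python sketch below simply rearranges a = R·λ/D; the function names are illustrative.

```python
def azimuth_resolution_m(range_m, wavelength_m, antenna_length_m):
    """Equation 3.1: a = R * lambda / D."""
    return range_m * wavelength_m / antenna_length_m


def required_antenna_length_m(range_m, wavelength_m, resolution_m):
    """Equation 3.1 solved for the antenna length: D = R * lambda / a."""
    return range_m * wavelength_m / resolution_m


# R = 1 km, lambda = 50 cm, a = 50 cm, as in the example in the text.
print(required_antenna_length_m(1000.0, 0.5, 0.5))  # antenna length in metres
```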
3.3 Doppler Effect consideration
Under the start-stop approximation, the plane is assumed to stand still between two element positions, so the Doppler Effect described in Chapter 2 does not appear in the process above. In reality this is not true: the Doppler Effect does exist, and the echoes have different frequency shifts. To reduce this error, the frequency shift can be calculated from the distance between two echoes.
4.1 Comparison among SAR algorithms
4.1.1 Algorithms based on FFT
Since SAR was invented in the 1950s, many image formation algorithms have been developed to help it create high-resolution maps from low-resolution aperture data. In early SAR, most of them operated in the frequency domain, using the Fast Fourier Transform (FFT) or the Discrete Fourier Transform (DFT). Figure 4.1 shows an example in which ERS-2 SAR data over a study area were enhanced using an FFT-based filtering approach together with the Frost filtering technique.
Figure 4.1 Block diagram for FFT based filtering
FFT techniques reduce the computational burden from N^2 to N log N when N is a power of two (2^x) [2]. They are computationally efficient only when the flight trajectory is linear and the speed is constant, which is not the case in practice.
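To give a feeling for the difference between N^2 and N log N, the sketch below evaluates both growth terms for N = 2048, the row length of the echo matrix used later in this thesis. The counts are orders of growth only, not exact operation counts of any specific DFT or FFT implementation.

```python
def dft_growth(n):
    """Order-of-growth term of a direct DFT: N^2."""
    return n * n


def fft_growth(n):
    """Order-of-growth term of a radix-2 FFT: N * log2(N), N a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n * (n.bit_length() - 1)


n = 2048
print(dft_growth(n), fft_growth(n), dft_growth(n) / fft_growth(n))
```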
4.1.2 Back-Projection techniques
Back-projection in the time domain avoids these problems, since the back-projection algorithm can handle irregularly sampled echo data. Back-projection is widely used in computed tomography (CT), but it is considered computationally inefficient for SAR because of the differences between the sensing geometries of CT and SAR. One widely used back-projection algorithm is Global Back Projection (GBP), with a computational complexity in the order of N^3. Another is Factorized Back Projection (FBP), which reduces the computational burden to N^2 log N, but it is still slow and introduces some minor errors. Compared with GBP, the FBP variant called Fast Factorized Back Projection (FFBP) is more computationally efficient and speeds up image formation; it is the algorithm used in this thesis. The computational burden of FFBP in each iteration is n·MN·log_n L, where n is an integer with a lowest value of 3, L is the length of the whole aperture, and MN is the number of resulting pixels. This lower computational burden makes it possible to process SAR images in real time.
4.2 FFBP work-flow
FFBP is an algorithm that accelerates image formation in the time domain. Under the assumption of a straight trajectory and constant flying speed, several apertures with low angular resolution are iteratively merged into one larger aperture, as illustrated in Figure 4.2, to reduce data redundancy and gain angular resolution. FFBP proceeds as follows: the echo data are first factorized into a number of decimated data sets for sub images over a number of stages, and are then back-projected to the corresponding sub image. As shown in Figure 4.2, in each iteration the contributing apertures are combined into new, larger apertures with higher angular resolution. In this case, three apertures are merged into one larger aperture in each iteration. The procedure is repeated for several iterations until the full aperture with full angular resolution is obtained.
Figure 4.2 Simplified illustration of FFBP
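The iterative merging can be sketched by counting how many apertures remain after each stage. The Python snippet below assumes a merge factor of three and starts from 729 aperture positions, the number of rows of the echo matrix described in Chapter 7; with these values the full aperture is reached after six stages, which matches the six iterations listed in Table 6.1. The function is an illustration of the bookkeeping only, not of the back-projection itself.

```python
def ffbp_stages(n_apertures, merge_factor=3):
    """Number of apertures remaining after each merging stage.

    Each stage merges merge_factor neighbouring apertures into one, so
    n_apertures must be a power of merge_factor.
    """
    remaining = []
    while n_apertures > 1:
        if n_apertures % merge_factor:
            raise ValueError("aperture count is not a power of the merge factor")
        n_apertures //= merge_factor
        remaining.append(n_apertures)
    return remaining


stages = ffbp_stages(729)
print(stages, len(stages))
```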
The problem is how to decide which apertures should be merged into one. This has to be calculated for every aperture, using the FFBP formula given below. After each iteration, a new echo matrix is formed, either as input to the next iteration or as the final result to be stored. The mathematical details of the FFBP algorithm are as follows. As shown in Figure 4.2, the data-set for a given sub image mn in stage s is given by
$$ d^{(s)}_{mn}(t, y_p) = \sum_{q=Q(p-1)+1}^{Qp} d^{(s-1)}_{m'n'}\left(t + \Delta r_{pq,mn},\, y_q\right) \exp\left(j\omega_c\,\Delta r_{pq,mn}\right) \qquad [13] $$
where $d^{(s-1)}_{m'n'}$ is the data-set for the sub image in the previous stage that contains the sub image mn, and $\Delta r_{pq,mn} = (y_p - y_q)\sin\theta_{q,mn}$ is the delay required to focus at the centre of the sub image mn, where $\theta_{q,mn}$ is the angle from $y_q$ to the sub image centre. [12]
The along-track sample positions are decimated by the factor Q, and the new sample positions are given in [13].
Figure 4.3 At each stage of the FFBP algorithm, groups of along-track samples (from the appropriate data-set in the previous stage) are combined and focused to the centre of each sub image. [13]
The computational burden of FFBP in each iteration is then n·MN·log_n L, where n is an integer with a lowest value of 3, L is the length of the whole aperture, and MN is the number of resulting pixels.
4.3 Parallelization of FFBP
The FFBP algorithm described above can be processed in parallel in order to generate images in real time. It can be parallelized at two levels of granularity, coarse and fine. At the coarse level, the whole data set is split into several loosely coupled subsets, each of which can be processed independently when data dependencies allow. At the fine level, parallelism is exploited within the data subsets, through instruction-level, thread-level or data-level parallelism.
4.4 Existing FFBP Implementation
Andreas Hast and Lars Johansson describe their FPGA implementation of FFBP in their paper “FAST FACTORIZED BACK-PROJECTION IN AN FPGA”. The paper shows that an FPGA with a hard CPU core can feasibly compute FFBP in real time. Hast and Johansson used Matlab code provided by Annelie Wyholt (PhD student at Chalmers University of Technology) to simulate the output of the CARABAS system, and their study is based on this simulated SAR system. Their implementation runs in real time when the plane speed is below 194 m/s. The Rockwell Sabreliner 40A used as the SAR platform, however, has a cruise speed of around 236 m/s without the antennas, which leaves room for this project to improve on that performance.
Chapter 5
Overview of Cell Broadband Engine and Parallel Programming
Cell's full name is Cell Broadband Engine Architecture, abbreviated CBEA in full or Cell BE in part. It was jointly developed by Sony, Toshiba, and IBM (STI). Cell is a member of the Cell Broadband Engine Architecture (CBEA) microprocessor family and was initially designed for game applications and media-rich devices. Owing to its advanced architecture and strong computational ability, however, Cell is now widely used not only for its initial purpose but also for other computation-intensive applications.
Figure 5.1 Layout of IBM CELL
5.1 Design considerations
Power consumption, memory access, and processor frequency are the three main performance-limiting factors of contemporary microprocessors; they are called the three performance-limiting walls. To scale the power wall, Cell contains different kinds of cores. One core, the PPE, is optimized to run an operating system and control-intensive code; eight further cores, the SPEs, are specialized for compute-intensive (data-plane) applications. Today's symmetric multiprocessors have entered the multi-gigahertz era, and the latency caused by DRAM is approaching 1000 cycles, which means that data transfer between main storage and the processor dominates program performance. To reduce this latency, the SPEs adopt a three-level memory structure (main storage, local stores and register files) with asynchronous DMA transfers between main storage and the local stores. Deepening the instruction pipelines to obtain higher operating frequencies has reached its limit. Specializing the cores for different tasks, as with the PPE and the SPEs, allows each kind of core to be designed for high frequency without excessive overhead.
5.2 Architecture Overview
The Cell Broadband Engine is a single-chip multiprocessor initially designed for game consoles and media-rich applications, such as the PS3 and high-definition television. Its nine cores operate on a coherent, shared memory. Because of its architecture and computational features, it is also used much more broadly, for example for compute-intensive workloads and servers.
5.3 Architecture features
As illustrated in Figure 5.2, the main parts of Cell Broadband Engine include PowerPC Processor Element, Synergistic Processor Elements, and Element Interconnect Bus.
Figure 5.2 Overview of Cell Broadband Engine architecture
The main processor, the PowerPC Processor Element (PPE), consists of a Power Processing Unit (PPU) connected to a 512 KB L2 cache. The PPU is a 64-bit PowerPC Architecture reduced instruction set computer (RISC) core. The PPE runs the operating system, manages system resources and coordinates the SPEs' threads. The key design goals of the PPE are to maximize the performance/power ratio as well as the performance/area ratio. [10] The PPU is dual-issue and dual-threaded, and it supports both the PowerPC instruction set and the Vector/SIMD Multimedia Extension instruction set. Figure 5.3 illustrates how the dual-issue mechanism works.
A Synergistic Processor Element (SPE) consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC), as shown in Figure 5.4. An SPU is a compute engine with SIMD support; it contains a RISC core, 256 KB of local storage, and a large 128-entry, 128-bit register file used for both floating-point and integer operations. Like the PPU, the SPU is dual-issue (shown in Figure 5.3): its even pipeline contains the floating-point and fixed-point units, and its odd pipeline contains the permute unit, local store unit, channel unit, and branch unit. SPEs are not intended to run an operating system and can only operate on data in their own local storage.
Figure 5.4 Synergistic Processing Element (SPE) block diagram [6]
SPE data transfer relies on the MFC, which is the channel interface between local storage and main memory. The MFC contains a DMA controller with an associated MMU, as well as an Atomic Unit to handle synchronization operations with other SPUs and the PPU. [10] The MFC performs asynchronous DMA transfers between main storage and the local store, which means that it moves data and instructions between them independently of the SPU.
The Element Interconnect Bus (EIB) is a circular bus through which the PPE and SPEs communicate coherently with each other and with main storage and the I/O module. The EIB is made of two 128-bit data channels with a 4-ring structure (two clockwise and two counterclockwise) for data, and a tree structure for commands, as illustrated in Figure 5.5. The internal bandwidth of the EIB is 384 GB/s, and it supports more than 100 DMA memory requests between main storage and the SPEs.
Figure 5.5 The EIB grapples with eight concurrent transactions [17]
Three types of storage are defined in the CELL Broadband Engine: one main-storage domain, eight SPE local store domains, and eight SPE channel domains. Main storage is configured by the operating system running on the PPE and can be shared by all processors and memory-mapped devices. In contrast, the local store and channels of each SPE are private to that SPE's SPU, LS, and MFC.
5.4 Parallel Programming on CELL
The PPE's instruction set is an extended version of the PowerPC instruction set, which adds the Vector/SIMD Multimedia Extension instruction set and a few additions and changes. Although the SPE instruction set is similar to the Vector/SIMD Multimedia Extension part of the PPE's instruction set, the two are different, so programs for the PPE and the SPEs must be compiled separately, with the PPU compiler and the SPU compiler respectively.
5.4.1 SIMD Vectorization
SIMD (Single Instruction, Multiple Data) operation is the most prominent programming feature of the CELL Broadband Engine. A vector is a pack of data elements stored in a one-dimensional array that serves as the operand of a SIMD operation. SIMD processing exploits data-level parallelism: a single instruction is applied to multiple data elements at once. To support SIMD operations, both the PPE and the SPEs have 128-bit registers that hold multiple data elements as a single vector. SIMD is supported by the PPE's Multimedia Extension instruction set and by the SPE instruction set. Figure 5.6 uses four concurrent add operations as an example of how SIMD works.
Figure 5.6 Four Concurrent Add Operations
Both the Multimedia Extension instruction set and the SPU instruction set have C-language extensions. These relieve programmers of intensive assembly-language work, since a C function call is a convenient substitute for in-line assembly instructions.
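The four concurrent add operations can be emulated in a few lines of Python to show what a single SIMD add does to a 128-bit vector of four 32-bit floats. This is only a behavioural model with illustrative function names; on the SPU the addition would be a single intrinsic call (for example `spu_add` in the SPU C-language extensions) rather than a loop.

```python
import struct

# A 128-bit vector register modelled as 16 bytes holding four 32-bit floats.
VECTOR_FORMAT = ">4f"


def pack_vector(values):
    """Pack four floats into one 16-byte vector."""
    if len(values) != 4:
        raise ValueError("a 128-bit vector holds exactly four 32-bit floats")
    return struct.pack(VECTOR_FORMAT, *values)


def unpack_vector(vector):
    """Return the four float elements of a 16-byte vector."""
    return list(struct.unpack(VECTOR_FORMAT, vector))


def simd_add(va, vb):
    """Element-wise add of two vectors, modelling one SIMD add instruction."""
    sums = (a + b for a, b in zip(unpack_vector(va), unpack_vector(vb)))
    return struct.pack(VECTOR_FORMAT, *sums)


va = pack_vector([1.0, 2.0, 3.0, 4.0])
vb = pack_vector([10.0, 20.0, 30.0, 40.0])
print(unpack_vector(simd_add(va, vb)))
```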
5.4.2 Data Parallelization Methods
Depending on the requirements and constraints of the application, there are two different ways to organize related data. One is called an array of structures (AOS) and the other a structure of arrays (SOA). Consider, for example, subdivision surfaces in which the single triangle defined by floating-point vertices a, b, and c in Figure 5.7 below is subdivided into multiple triangles. [12]
Figure 5.7 Point subdivision illustrated
AOS is also called vector-across form; it keeps each three- or four-component vertex in a single SIMD vector. This data-packing approach is a very natural way of representing a three-dimensional vertex and often produces small code. However, it typically produces less efficient code and generally requires significant loop unrolling to improve efficiency. The other method, a structure of arrays (SOA), is also called parallel-array form. Here, each corresponding data value for each vertex is stored in a corresponding location in a set of vectors. [18] p. 74 Depending on the algorithm, this method may produce more efficient code than AOS.
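The two layouts can be contrasted with a small Python sketch. Vertices are first held in AOS form as (x, y, z) tuples and then converted to SOA form, where all x values, all y values and all z values are stored together. Computing edge midpoints, as needed when subdividing triangles, then becomes one uniform operation per coordinate array, which is the pattern that maps well onto SIMD vectors. The function names and the dictionary representation are assumptions made for illustration.

```python
def aos_to_soa(vertices):
    """Convert [(x, y, z), ...] (AOS) into {'x': [...], 'y': [...], 'z': [...]} (SOA)."""
    xs, ys, zs = zip(*vertices)
    return {"x": list(xs), "y": list(ys), "z": list(zs)}


def midpoints_soa(a, b):
    """Midpoints between corresponding vertices of two SOA vertex sets.

    Every coordinate array is processed by the same element-wise operation.
    """
    return {axis: [(p + q) / 2.0 for p, q in zip(a[axis], b[axis])]
            for axis in ("x", "y", "z")}


# Vertex a and vertex b of two triangles, stored in AOS form.
vertex_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
vertex_b = [(2.0, 2.0, 0.0), (4.0, 2.0, 2.0)]
print(midpoints_soa(aos_to_soa(vertex_a), aos_to_soa(vertex_b)))
```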
5.5 Software Development Kit
The CELL Broadband Engine comes with a Software Development Kit (SDK) for developing programs. The SDK includes the required tools and some examples that highlight the general principles for developers. The components of the SDK include the IBM Full System Simulator for the CELL Broadband Engine, systemsim, a system root image, GNU tools (C and C++ compilers, linkers, assemblers and binary utilities for both PPU and SPU), the IBM xlc compiler, newlib for the SPU, gdb debuggers, PPC64 Linux with CBE enhancements, the SPE runtime management library, a static timing analysis tool, performance tools, an Eclipse-based Integrated Development Environment (IDE), standardized SIMD math libraries, and example source code.
Chapter 6
Design Considerations
Fast Factorized Back Projection (FFBP, described in Chapter 4) has been used for image formation. Although FFBP is more efficient than GBP, the computational burden and limited computing capability mean that it is still performed off-line. This chapter discusses the possibility of real-time FFBP for SAR with CELL as the processor.
6.1 Real-Time Performance Issues
Because of the heavy computation, SAR data processing is normally performed off-line. The goal here is to make it an on-line process. When the pulse spacing is fixed, the calculation time determines the maximum aircraft speed of the SAR. As described in the paper “FAST FACTORIZED BACK-PROJECTION IN AN FPGA” [2], the airplane used in CARABAS is a Rockwell Sabreliner 40A with a cruise speed of around 236 m/s without antennas. This speed directly determines the time constraints of this project. The FPGA implementation of FFBP in that paper completed the following number of operations in 2.33 seconds, which is the upper time limit for this project.
Table 6.1 Number of operations in each iteration

Operation              Number in each iteration
Add                    1.26157824 * 10^9
Subtraction            1.1943936 * 10^7
Multiply               1.7915904 * 10^7
Divide                 5.49421056 * 10^8
Square root (Sqrt)     2.985984 * 10^6 (consists of many operations)
Sine                   2.985984 * 10^6 (consists of many operations)
Cosine                 2.985984 * 10^6 (consists of many operations)
Inverse sine           2.985984 * 10^6 (consists of many operations)
Round                  2.985984 * 10^6 (consists of many operations)
Ceil                   4.478976 * 10^6 (consists of many operations)

Total number of operations in each iteration: 1866240000
Total number of operations in six iterations: 11197440000
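The totals in Table 6.1 translate into a throughput requirement. The sketch below multiplies the per-iteration total by the six iterations and divides by the 2.33 s time limit. It assumes that the 2.33 s covers all six iterations, which the text does not state explicitly, so the resulting rate should be read as an estimate under that assumption.

```python
OPERATIONS_PER_ITERATION = 1_866_240_000  # total per iteration, from Table 6.1
ITERATIONS = 6
TIME_LIMIT_S = 2.33  # FPGA processing time, taken as the upper limit


def total_operations():
    """Total number of operations over all iterations."""
    return OPERATIONS_PER_ITERATION * ITERATIONS


def required_rate_ops_per_s(time_limit_s=TIME_LIMIT_S):
    """Sustained operation rate needed to finish within time_limit_s.

    Assumes the time limit applies to all six iterations together.
    """
    return total_operations() / time_limit_s


print(total_operations())
print(required_rate_ops_per_s())
```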
6.2 Rounding errors
In traditional implementations of FFBP, for example on an FPGA, the digital signal data are scaled, since such platforms normally do not support floating-point numbers because of their limited budget. Extra time then has to be spent estimating the resulting errors and correcting them. The Cell Broadband Engine supports 32-bit floating-point numbers very well, so no scaling is needed during computation, and the precision problems of traditional methods are not an issue here.
6.3 Program Efficiency Issues
To achieve high program efficiency on Cell, programmers have to study data dependencies, program dependencies and, most importantly here, whether the algorithm can be programmed in a parallel way. The SAR data are stored in an echo matrix, and no data dependency occurs between any two rows of this matrix, since every echo is independent of the others. The data set can therefore simply be divided by rows into several parts, each of which is assigned to one SPE. This makes it possible to map FFBP onto CELL as a parallel program. Within the program, the structure of the code is a critical factor for efficiency because of the dual-issue structure of the SPE cores. CPI (cycles per instruction) is used to measure the quality of the program structure.
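The row-wise division can be sketched as follows. The function splits the 729 rows of the echo matrix into contiguous blocks for the eight SPEs as evenly as possible. The partitioning rule (leftover rows go to the first workers) is an assumption for illustration and not necessarily the scheme used in this project.

```python
def partition_rows(n_rows, n_workers):
    """Split n_rows into n_workers contiguous (start, stop) blocks.

    Block sizes differ by at most one row; leftover rows go to the first blocks.
    """
    base, extra = divmod(n_rows, n_workers)
    blocks = []
    start = 0
    for worker in range(n_workers):
        count = base + (1 if worker < extra else 0)
        blocks.append((start, start + count))
        start += count
    return blocks


# 729 echo rows distributed over 8 SPEs.
for spe, (start, stop) in enumerate(partition_rows(729, 8)):
    print(spe, start, stop, stop - start)
```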
Chapter 7
Software Implementation
This project is implemented in the C language extensions of the Cell Broadband Engine, in a parallel programming style. Matlab was also used to model and simulate the algorithm at the beginning and to validate the results of this implementation. The implementation is based on the IBM Full System Simulator included in CELL SDK 2.0.
7.1 Data to be processed
We assume that the data to be processed comes from the CARABAS SAR system and is ready for FFBP calculation. The Matlab code that simulates the incoming radar data was available; it was written by Lars Ulander in September 1997 and modified by Anelie Wyholt in May 2005. The necessary CARABAS SAR system parameters were also available. Several simplifications and assumptions are made in the radar data here: [2]
1. Range ambiguities ignored.
2. Flat Earth geometry ignored.
3. Fixed antenna pointing across-track is assumed.
4. Constant gain in elevation.
5. Real-time range compression of transmitted FM signal. [2]
The raw radar data is stored as a 729 * 2048 double complex matrix, based on a pulse spacing of 0.8 m and an aperture angle of 90 degrees. This matrix is stored in memory and is represented in polar coordinates. [2]
7.2 Data Store
The data matrix is a 729 * 2048 double complex matrix, which is not directly supported by CELL. We therefore have to divide this complex matrix into two double real matrices, each of which takes up 729 * 2048 * 32 bits = 46656 K bits of memory. We also have to allocate two memory areas to store the newly generated matrices from the FFBP calculation. In total, we need 46656 * 4 = 186624 K bits of memory to store the data. As mentioned in the previous chapters, three storage domains are defined in the CELL Broadband Engine: one main storage domain, eight SPE local store domains (256 K each), and eight SPE channel domains. [35] The main storage used in this project is 256 M bits, so we put the data matrices in main storage.
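The memory budget above can be checked with a few lines of C. This is only a sketch of the arithmetic in this section: the function names are hypothetical, and 1 K bit is assumed to be 1024 bits, which is the convention that reproduces the 46656 K bits figure.

```c
#include <assert.h>
#include <stdint.h>

enum { ROWS = 729, COLS = 2048, BITS_PER_ELEMENT = 32, NUM_MATRICES = 4 };

/* Size of one real matrix in Kbit (assuming 1 Kbit = 1024 bits). */
uint64_t matrix_kbits(uint64_t rows, uint64_t cols, uint64_t bits_per_element)
{
    return rows * cols * bits_per_element / 1024u;
}

/* Footprint of the two input and two output matrices, in Kbit. */
uint64_t total_kbits(void)
{
    return (uint64_t)NUM_MATRICES * matrix_kbits(ROWS, COLS, BITS_PER_ELEMENT);
}

/* Returns 1 if all four matrices fit in a main storage of the given
   size in Mbit (assuming 1 Mbit = 1024 Kbit), otherwise 0. */
int fits_in_main_storage(uint64_t main_storage_mbits)
{
    assert(main_storage_mbits > 0);
    return total_kbits() <= main_storage_mbits * 1024u;
}
```

With the numbers of this project, the four matrices need 186624 K bits, about 182 M bits, which fits in the 256 M bits of main storage but is far larger than any single local store.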
7.3 Data transfer
The SPEs perform most of the calculation work, and an SPE can only operate on instructions and data in its local store domain. The initial data, however, has to be stored in the main memory domain, so we have to transfer data from main memory to local store and decide how to do this without hurting the efficiency of the whole program. As mentioned in the previous chapters, the PPU and the SPUs use the MFC’s DMA to transfer data and instructions from main memory. An SPU can execute instructions while the DMA transfers data and instructions autonomously and asynchronously, which hides the latency caused by data transfer. In other words, we try to overlap data movement with computation. Figure 7.1 shows a simple double-buffer flow chart; double buffering is one of the best methods to achieve this autonomous and asynchronous overlap. To make maximum use of the DMA’s transfer ability and improve transfer efficiency, we used DMA lists to a large extent in this program.
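The control flow of double buffering can be illustrated in portable C. The sketch below is not the thesis code: `fetch_chunk` is a hypothetical stand-in that copies synchronously with `memcpy`, whereas on an SPE the fetch would be an asynchronous DMA get (for example `mfc_get` followed later by a tag-status wait), so that the copy really overlaps with the computation. The chunk size and the summation workload are also assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 256  /* elements per local buffer; hypothetical size */

/* Stand-in for a DMA get: copies `count` elements starting at `first`
   from main memory into a local buffer. Synchronous in this sketch. */
void fetch_chunk(float *local, const float *main_mem, size_t first, size_t count)
{
    memcpy(local, main_mem + first, count * sizeof(float));
}

/* Sums n elements using two local buffers that alternate roles:
   one is being computed on while the other receives the next chunk. */
float sum_double_buffered(const float *main_mem, size_t n)
{
    float buf[2][CHUNK];
    size_t cur = 0;      /* index of the buffer being computed on */
    size_t first = 0;    /* first element of the current chunk */
    size_t count;
    float total = 0.0f;

    assert(main_mem != NULL || n == 0);
    if (n == 0)
        return 0.0f;

    count = n < CHUNK ? n : CHUNK;
    fetch_chunk(buf[cur], main_mem, 0, count);   /* prime the first buffer */

    while (first < n) {
        size_t next_first = first + count;
        size_t next_count = 0;

        if (next_first < n) {
            next_count = (n - next_first) < CHUNK ? (n - next_first) : CHUNK;
            /* Start the transfer of the next chunk into the other buffer
               before working on the current one. */
            fetch_chunk(buf[cur ^ 1], main_mem, next_first, next_count);
        }

        for (size_t i = 0; i < count; i++)       /* compute on current chunk */
            total += buf[cur][i];

        cur ^= 1;                                /* swap buffer roles */
        first = next_first;
        count = next_count;
    }
    return total;
}
```

The point of the structure is that the request for chunk k+1 is issued before chunk k is processed; with a real asynchronous DMA engine the wait for chunk k+1 would be placed just before the buffers are swapped.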
7.4.1 Work Partition
The first and most important issue in this project is how to partition and allocate the whole work. We used four arrays to store data: two for the input data (real part and imaginary part), and two for the output results of every iteration. As explained in figure 4.1, this allocation method eliminates the inter-row data dependency. We then chose to partition the work evenly by rows into 8 parts and allocate one part to each of the 8 SPEs. However, 729 is not divisible by 8, so we take the last row out for the PPE to process during its waiting time and partition the remaining 728-row matrix into 8 parts of 91 rows each. The idea is to allocate as much work as we can to the SPEs without leaving the PPE idle while it waits.
This project uses a large dataset, so we cannot keep all the results after each iteration. Therefore, we keep only the newest results and the results of the previous iteration, and output the results of the last iteration. The memory allocated for the matrices is kept 16-byte-aligned so that DMA transfers are correct and fast.
Several parallel programming techniques are also used to improve the program’s efficiency. The first is to exploit SIMD by hand and to choose the optimal SIMD strategy for this project. Since we implement FFBP on CELL in C, auto-vectorization by the compiler may not achieve optimal results for this application because of the flexibility of the C language. Therefore, I used intrinsics as much as possible, since intrinsics are essentially inline assembly with high computational efficiency. To gain high performance, we chose Structure of Array (SOA) as the SIMD strategy (introduced in Chapter 5).
The next issue is to understand what dual-issue is and how to use it. To obtain high efficiency, we should keep both pipelines busy for as long as possible, that is, achieve a high dual-issue rate. We therefore arrange the program structure by choosing instructions according to their pipelines.
Table 7.1 Instructions for the even pipeline [12]
Pipeline 0 (even) instructions            Latency (clocks)
Single precision floating-point ops       6
Double precision floating-point ops       6+7
Integer multiplies, integer/float         7
  convert, interpolate estimate
Immediate loads, logical ops, integer     2
  add/subtract, sign extend, count
  leading zero, select bits,
  carry/borrow generate
Pipeline 1 (odd) instructions             Latency (clocks)
Loads and stores, branch hints,           6
  channel operations, moves to/from SPRs
                                          4
The SDK includes a static timing analyzer called spu-timing, which annotates the assembly instructions of a program with the instruction-pipeline state. In other words, it shows where dual-issue was successfully accomplished and where dual-issue was possible but did not occur because of dependencies. This helps me schedule the instruction sequence to obtain the maximum dual-issue rate.
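The row partition described at the start of this section can be sketched as follows. The type and function names are hypothetical, and the sketch assumes rows are numbered from 0 and that each SPE receives one contiguous block of rows.

```c
#include <assert.h>

#define NUM_SPES 8

/* A contiguous block of matrix rows. */
typedef struct {
    int first_row;  /* index of the first row in the block */
    int num_rows;   /* number of consecutive rows in the block */
} row_block;

/* Rows assigned to SPE number `spe` (0..7): the evenly divisible part
   of the matrix is split into NUM_SPES equal blocks. */
row_block spe_rows(int total_rows, int spe)
{
    row_block b;
    int per_spe = total_rows / NUM_SPES;

    assert(spe >= 0 && spe < NUM_SPES);
    b.first_row = spe * per_spe;
    b.num_rows = per_spe;
    return b;
}

/* Leftover rows that do not divide evenly; these are handled by the PPE. */
row_block ppe_rows(int total_rows)
{
    row_block b;
    int per_spe = total_rows / NUM_SPES;

    b.first_row = per_spe * NUM_SPES;
    b.num_rows = total_rows - b.first_row;
    return b;
}
```

For the 729-row matrix this gives each SPE 91 rows (SPE 0 starts at row 0, SPE 7 at row 637) and leaves the single row 728 for the PPE.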
7.4.2 Program Skills
In order to avoid expensive branches and stalls due to mispredictions, select-bits instructions are used to reduce branches. Where branches cannot be eliminated, the __builtin_expect language extension is used to reduce the number of branch mispredictions. In this project, DMA transfers are always initiated by the SPEs. This is done for these reasons:
1. There are eight times as many SPEs as PPEs.
2. The SPE command queue is twice as deep as the proxy command queue.
3. Consumer-managed transfers are easier to synchronize.
4. The number of cycles to initiate a transfer from the SPE is smaller than the number of cycles to initiate the same transfer from the PPE. [12]
Since the number of loop iterations in this project’s code is constant, we use SIMD to unroll the loops, which helps the optimizer schedule instructions efficiently. Because of the basic loop structure, this loop unrolling also improves the dual-issue rate by computing at the same time as loading and storing data. The following data shows the effectiveness of loop unrolling on a sample workload that performed OpenGL-like coordinate transformation and lighting.
Table 7.3 Performance metrics for unrolling loops [12]
In the math model of the FFBP algorithm, data dependency exists throughout the whole model, which would cause dependency stalls without optimization. We reduce this dependency by parallelizing several calculations and by reducing the number of variables to save registers.
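The techniques of this section can be illustrated in scalar, portable C. This is a sketch under stated assumptions, not the SPE code of the project: `select_bits` only mimics the idea of the SPU select-bits instruction with ordinary bit operations, the `UNLIKELY` macro and all function names are hypothetical, and whether the comparison compiles without a branch depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch hint for compilers that provide __builtin_expect (GCC, Clang). */
#if defined(__GNUC__)
#define UNLIKELY(c) __builtin_expect(!!(c), 0)
#else
#define UNLIKELY(c) (c)
#endif

/* Takes the bits of `a` where `mask` is 1 and the bits of `b` elsewhere. */
uint32_t select_bits(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Clamps idx to at most limit with a mask instead of an if/else. */
uint32_t clamp_index(uint32_t idx, uint32_t limit)
{
    /* All ones when idx > limit, all zeros otherwise. */
    uint32_t mask = (uint32_t)0 - (uint32_t)(idx > limit);
    return select_bits(limit, idx, mask);
}

/* Sum with the loop unrolled four times; n must be a multiple of 4.
   Four independent accumulators shorten the dependency chain between
   consecutive additions. */
float sum_unrolled4(const float *x, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

    assert(n % 4 == 0);
    if (UNLIKELY(x == NULL))   /* rare error path, hinted as unlikely */
        return 0.0f;

    for (size_t i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Using several accumulators is one way of "parallelizing several calculations": each addition depends only on its own accumulator, so consecutive instructions do not have to wait for each other.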
7.5 Program design
7.5.1 Kernel Algorithm and Matlab model
In this project, I used the same Matlab model as in the paper “FAST FACTORIZED BACK-PROJECTION IN AN FPGA” (Andreas Hast and Lars Johansson); the input data is simulated data that I created myself. The key issue of FFBP is to decide which elements are to be combined into a new element. The combination itself is a simple add operation in this Matlab model, so the address calculation consumes almost all the running time of the code.
7.5.2 Work Steps
The design process of this parallel program includes four steps:
1. Analyze the Matlab code and re-write it in C:
Write the FFBP algorithm in Matlab, and then re-write it in C, since math equations are much easier to program in Matlab. In this step, we also used the inline math library for the CELL Broadband Engine and made sure that all the calculation results were right.
2. SIMDize the code on the PPE:
There are several ways to SIMDize the code from the previous step, but programmers should carefully choose one SIMD strategy to improve parallel efficiency. In this project, we chose SOA (mentioned in 5.4.2). We also checked the results from this step and evaluated the efficiency of the code.
3. Migrate the PPE code onto one SPE:
This step involved creating threads for the SPE on the PPE, migrating the C instructions and CELL PPE intrinsics to SPE intrinsics, and adding DMA transfers to move data between main memory and local store. Here again, checking the results is a must, and we spent more time on optimization, for example by using asynchronous DMA transfers to improve efficiency. Figure 7.4 shows the address mapping from SPE local store to main memory using the address offset method.
4. Parallelize the code across 8 SPEs:
Since there is little data dependency within one iteration, we partition the data to parallelize the computation across the 8 SPEs. In this step, we check the results and do the final optimization of the code.
Figure 7.2 shows the process flow of the PPE and Figure 7.3 shows the flow of the SPEs. As illustrated in Figure 7.2, the PPE reads the initial input data into main memory and then creates SPE threads to allocate work to the different SPEs. After that, the PPE processes the special case, the last row of the input data matrix, while the SPEs process the major part of the work. When the PPE has finished the special case, it waits for the SPEs to finish their jobs and checks whether all the work is done. If so, the PPE outputs the calculation results. If not, the PPE swaps the input data matrix address and the output data matrix address and then does the same job in the next iteration.
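The address swap between iterations can be sketched in plain C. This is an illustration only: `add_one` is a hypothetical placeholder for one FFBP iteration, and thread creation and the waiting on the SPEs are left out.

```c
#include <assert.h>
#include <stddef.h>

/* One pass over the data: reads `in`, writes `out`. */
typedef void (*iteration_fn)(const float *in, float *out, size_t n);

/* Hypothetical placeholder for one iteration of the real algorithm. */
void add_one(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] + 1.0f;
}

/* Runs `iterations` passes over two buffers. After every pass the input
   and output addresses are swapped, so no data is copied between
   iterations. Returns the buffer that holds the newest results. */
float *run_iterations(float *buf_a, float *buf_b, size_t n,
                      int iterations, iteration_fn step)
{
    float *in = buf_a;
    float *out = buf_b;

    assert(iterations >= 0);
    for (int it = 0; it < iterations; it++) {
        float *tmp;

        step(in, out, n);
        tmp = in;       /* swap roles: this pass's output becomes */
        in = out;       /* the input of the next pass */
        out = tmp;
    }
    return in;          /* after the swap, `in` holds the newest results */
}
```

With an even number of passes, such as the six iterations used here, the final results end up in the buffer that originally held the input data.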
As shown in Figure 7.3, the SPEs initiate DMA transfers themselves and read in data through their DMAs, which saves DMA initialization time and data transfer time. While the DMA is working, each SPE calculates the address offsets in advance; the offsets are used later, when its DMA has finished the data transfer. The results are written from local store to main memory through DMA, and this output is overlapped with calculating the address offsets for the next iteration. Finally, the SPE checks whether it has finished all the required iterations. If not, the SPE continues with the same operations in the next iteration; if so, it returns a “success” signal to the PPE and ends the thread.
Figure 7.4 shows the address mapping from local store to main memory by address offset. The SPEs get the initial address of the input data in main memory from the PPE and then calculate the address offsets of the target data they need. By adding these two address parameters, the SPEs can find the target data in main memory.
Figure 7.4 Address mapping from SPE local store to main memory
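The base-plus-offset calculation can be written out as a short sketch. The row-major layout, the 4-byte element size used in the example and the function names are assumptions made for illustration; the alignment check reflects the quadword (16-byte) alignment that DMA transfers rely on.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address in main memory of element (row, col) of a matrix
   stored row by row, starting at `base`. Row-major layout is assumed. */
uint64_t element_address(uint64_t base, uint32_t row, uint32_t col,
                         uint32_t cols, uint32_t elem_bytes)
{
    uint64_t offset;

    assert(col < cols);
    offset = ((uint64_t)row * cols + col) * elem_bytes;
    return base + offset;   /* base address from the PPE plus local offset */
}

/* Returns 1 if the address is aligned to a 16-byte boundary. */
int is_dma_aligned(uint64_t ea)
{
    return (ea & 0xF) == 0;
}
```

For a 2048-column matrix of 4-byte elements, every row starts 8192 bytes after the previous one, so a 16-byte-aligned base address keeps the start of every row aligned as well.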
8.1 Algorithm kernel test
The kernel of the FFBP algorithm here is to find the positions of the elements used to generate a new element in the radar data matrix. Once the range between apertures and the size of the radar data matrix are decided, the positions can be determined. Compared with the results from Matlab, the results from this project are correct, which means that the kernel part of the algorithm is implemented correctly. Because original radar data was not available, the whole implementation could not be tested. One corner case should be noted here: the inverse sine and inverse cosine, because of their restricted domains. Matlab handles this automatically, but the C language does not, so I used the math library in CELL toolchain-3.3 to overcome this difficulty.
8.2 Precision analysis
The CELL Broadband Engine supports floating-point numbers very well, so no special consideration of rounding errors is needed. In the previous projects, which were implemented on FPGAs, errors were introduced because the FPGA did not support floating-point numbers. Since SAR demands high precision in the results, the floating-point capability of CELL fits this demand very well. This feature also lightens the burden on programmers and improves their efficiency.
8.3 Real-Time Performance
As mentioned in chapter 6.1, the cruise speed of the platform, which in this thesis is a Rockwell Sabreliner 40A airplane, decides the time constraints of this implementation. In the thesis “FAST FACTORIZED BACK-PROJECTION IN AN FPGA” [2], an FPGA was used to calculate six iterations on a 729*2048 matrix, which takes about 2.33 seconds and allows a maximum speed of 194 m/s. However, the maximum cruise speed of the plane is 236 m/s. According to the results of this project, it takes the SPU 273875486 cycles to do the same job. The SPU’s target clock frequency at introduction is 3.2 * 10^9 Hz, so it takes about 0.085586 seconds for the SPE to finish the job, which is a six-iteration calculation. If we round this time up to 0.1 second, the maximum plane speed allowed will be 4462 m/s, which is much greater than 236 m/s. This gives the designer of the SAR great flexibility in the design.
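The conversion from simulated cycles to execution time can be reproduced with a short sketch. The function names are hypothetical; the inputs are the cycle count, clock frequency and FPGA time quoted above.

```c
#include <assert.h>

/* Execution time in seconds for a cycle count at a given clock frequency. */
double cycles_to_seconds(double cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return cycles / clock_hz;
}

/* How many times faster the new time is than the reference time. */
double speedup(double t_reference, double t_new)
{
    assert(t_new > 0.0);
    return t_reference / t_new;
}
```

With 273875486 cycles at 3.2 * 10^9 Hz the time is about 0.0856 seconds, roughly 27 times shorter than the 2.33 seconds of the FPGA implementation.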
8.4 SPE efficiency analysis
The IBM Full System Simulator allows cycle-accurate simulation of the SPUs and provides a statistics function called SPUStats. Table 8.1 and Table 8.2 show the statistics from one of the 8 SPUs. In Table 8.1, Total Cycle Count is the total number of SPU run cycles. Performance Cycle Count is the total number of run cycles of the code we are interested in. Performance Instruction Count is the total number of instructions executed within that code; the number in brackets is the same count with NOP instructions excluded, which is consistent with the CPI value given in brackets. Performance CPI is the average number of cycles per instruction (CPI), an important indicator of the efficiency of the code. Branch instructions is the total number of branch-type instructions executed. Hint instructions is the count of hint-for-branch (HBR) type instructions executed. Hint hits is the count of executed instructions that were loaded from the hint target prefetch buffer. [15]
Table 8.1
Total Cycle Count                273875486
Performance Cycle Count          273875486
Performance Instruction Count    250210140 (249648855)
Performance CPI                  1.09 (1.10)
Branch instructions              280953
Branch taken                     280311
Branch not taken                 642
Hint instructions                280152
Hint hit                         279732
In Table 8.2:
Single Cycle means cycles in which only one non-NOP instruction was executed.
Dual Cycle means cycles in which 2 non-NOP instructions were executed (dual issue).
NOP Cycle means cycles in which only NOP instructions were executed.
Branch Miss Stalls means cycles in which a branch mispredict prevented any instruction from executing.
Prefetch Miss Stalls means cycles in which instruction run-out occurred.
Dependency Stalls means cycles in which source/target operand dependencies prevented any instruction from being issued.
FP Resource Stalls means cycles in which shared use of FPU stages prevented any instruction from being issued.
Hint Target Stall means cycles for which the target load delay for a hinted branch prevented instruction fetch.
Pipe Hazard Stall means cycles for which pipeline scheduling hazards prevented instruction issue.
Channel Stall means cycles for which the pipeline was stalled waiting on channel operations to complete.
Init Cycles means cycles elapsed in the SPU pipeline initialization sequence. [15]
Table 8.2
Cycle Type                          Cycle Number   Ratio
Single cycle                        217201647      79.3%
Dual cycle                          16223604       5.9%
Nop cycle                           280902         0.1%
Stall due to branch miss            20345          0.0%
Stall due to prefetch miss          0              0.0%
Stall due to dependency             36121830       13.2%
Stall due to fp resource conflict   0              0.0%
Stall due to hint target            12             0.0%
Stall due to pipe hazard            35             0.0%
Channel stall cycle                 4027100        1.5%
SPU Initialization cycle            9              0.0%
Total Cycle                         273875486      100%
The stall caused by DMA conflicts is called the channel stall cycle; it is only 1.5% of the total cycles, which is in an acceptable range. I therefore did not pay special attention to this problem and did not schedule DMA globally. There are methods to handle these conflicts, such as storing data in different memory banks.
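The derived figures in Table 8.1 and Table 8.2 can be recomputed from the raw counts. The sketch below uses hypothetical function names and only the numbers reported in the tables.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of cycles per instruction. */
double cpi(uint64_t cycles, uint64_t instructions)
{
    assert(instructions > 0);
    return (double)cycles / (double)instructions;
}

/* Share of `part` in `total`, in percent. */
double cycle_share_percent(uint64_t part, uint64_t total)
{
    assert(total > 0);
    return 100.0 * (double)part / (double)total;
}
```

Dividing 273875486 cycles by 250210140 instructions gives a CPI of about 1.09, and by 249648855 instructions about 1.10, matching Table 8.1. The dual-issue share of 16223604 cycles is about 5.9% and the dependency stall share about 13.2%, so dependency stalls, rather than DMA waits, are the largest loss after single-issue cycles.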
Chapter 9
Conclusions and Future work
The purpose of this thesis is to turn the “off-line” signal processing of SAR into “on-line” signal processing by using the CELL Broadband Engine and the FFBP algorithm.
9.1 Conclusions
This project is one part of many in SAR digital signal processing, so we cannot give an exact processing time for the whole SAR process. However, according to the benchmark results and the comparison with the results from “FAST FACTORIZED BACK-PROJECTION IN AN FPGA” (Chapter 8), this project is qualified for use in real-time processing.
9.2 Implement complexity
Since this project is based on the CELL Broadband Engine, there is no need to design a new SoC specifically for this algorithm. Engineers only need to master C and Matlab and have some knowledge of the algorithm. The main workload is the optimization of the code.
9.3 Future Work
This project is only one of the steps of the SAR process, which cannot be finished in a single project. Many aspects of this project can still be improved, such as the efficiency of the code. Energy consumption is not considered in this project, but it is a very important issue in signal processing systems, and even more so in small ones. Energy consumption may therefore be a very interesting topic when implementing FFBP on the CELL Broadband Engine.
in Electrical Engineering and Computer Systems Engineering, Andreas Hast Lars Johansson
3. Fundamentals of Radar Signal Processing, Mark A. Richards, Ph.D., pp392
4. Wikipedia.org http://en.wikipedia.org/wiki/Image:Radar_antenna.jpg
5. Wikipedia.org http://en.wikipedia.org/wiki/Image:Radar_composantes.png
6. Real-Time Space-Time Adaptive Processing on the STI CELL Multiprocessor, Yi-Hsien Li, LITH-ISY-EX--07/3953SE, pp6, Linköping 2007
7. Introduction to airborne radar, second edition, George W. Stimson, SciTech Publishing, Inc., Mendham, New Jersey, pp189
8. Wikipedia.org http://en.wikipedia.org/wiki/Cell_microprocessor
9. Programming Tutorial DRAFT, Software Development Kit for Multicore Acceleration Version 3.0, pp5, IBM
10. Cell Broadband Engine Architecture and its first implementation, Thomas Chen, Ram Raghavan, Jason Dale (Systems Performance, IBM), Eiji Iwata (Microprocessor Development, Sony Computer Entertainment Inc.), 29 Nov 2005, http://www.ibm.com/developerworks/power/library/pa-cellperf/
11. Programming Tutorial DRAFT, Software Development Kit for Multicore Acceleration Version 3.0, pp35, IBM
12. Maximizing the power of the Cell Broadband Engine processor: 25 tips to optimal application performance, Daniel A. Brokenshire, http://www-128.ibm.com/developerworks/power/library/pa-celltips1/
13. A Comparison of Fast Factorised Back-Projection and Wavenumber Algorithms for SAS Image Reconstruction, A. J. Hunter, M. P. Hayes, P. T. Gough, Acoustics Research Group, Dept. Electrical and Computer Engineering, University of Canterbury, New Zealand
14. Sandia National Laboratories, http://www.sandia.gov/radar/sar.html
15. Using Systemsim to Guide Application Transformation and Tuning for the Cell Broadband Engine, Micheal Kistler, David Murrell, Vipin Sachdeva
16. Efficient Parallel Architectures for Future Radar Signal Processing, Anders Åhlander, Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, 2007
17. Meet the experts: David Krolak on the Cell Broadband Engine EIB bus, Power Architecture editors, developerWorks, IBM, http://www.ibm.com/developerworks/power/library/pa-expert9/
18. Cell Broadband Engine Programming Tutorial, Version 2.1, IBM Systems and Technology Group, 2070 Route 52, Bldg. 330, Hopewell Junction, NY 12533-6351
Copyright
The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.
The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.
According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/
© 2008, Yu Shi