
Computationally Intensive SIP Algorithms on HPC

Jose Unpingco, Judy Gardiner, Laura Humphrey, and Stanley Ahalt Ohio Supercomputer Center, Columbus, OH

{unpingco, judithg, humphrey, ahalt}@osc.edu

Abstract

Three algorithms that are computationally or memory intensive were implemented in MATLAB and parallelized using MatlabMPI. The parallel algorithms were tested on selected high performance clusters. Significant performance improvements were noted in the parallel versions of two of the algorithms and in one variant of the third.

1. Introduction

The Department of Defense (DoD) uses many common signal and image processing (SIP) algorithms that are computationally intensive, memory intensive, or both. Such algorithms implemented on high performance computing (HPC) platforms can benefit from parallelization, either by rewriting the whole algorithm in parallel or only a portion of it. This parallelization can focus on improving execution times, on managing memory for very large problems, or on both. Three SIP-related algorithms of use to the DoD SIP community were implemented in full or in part in serial, then rewritten in parallel for better performance. The first algorithm deals with Woodbury and Cholesky updates for rank-one updating of matrix inverses for space-time adaptive processing (STAP). The second concerns large-scale random variate generation for sample-based (i.e., non-analytic) distributions. The third deals with the triangulation and interpolation of data sets that are too large to fit into the memory of a single machine.

Each algorithm was first implemented in part or in full using MATLAB and then rewritten in parallel using MatlabMPI. The Woodbury and Cholesky updates and the generation of random variates are computationally intensive. The parallel versions of these algorithms thus distribute the computational load across multiple processors. Data triangulation is both computationally and memory intensive. The parallel version of this algorithm divides the data into smaller, overlapping segments that can be distributed to many processors so that each processor triangulates over a smaller portion of the data, thus lowering the memory requirement for each processor and distributing the computational load. Significant performance improvements were noted in the parallel versions of the Woodbury and Cholesky updates and the triangulation problem. One of the random variate generation techniques also showed significant improvement. These algorithms were developed and delivered to the government as part of a User Productivity Enhancement and Technology Transfer (PET) project.

2. One-Step Matrix Inverse Updates

2.1. Motivation

Interference cancellation techniques such as those used in STAP rely upon developing a statistical covariance matrix $\hat{R}$ of the interference environment as in

$$\hat{R} = \frac{1}{M}\sum_{k=1}^{M} x_k x_k^H \qquad (1)$$

wherein M snapshots of N×1 space-time data ($x_k$) are averaged. To detect a target based upon an exemplar target snapshot, v, a detection statistic such as the Generalized Likelihood Ratio Test (GLRT)[1] is computed as in

$$\frac{\left| v^H \hat{R}^{-1} z \right|^2}{\left( v^H \hat{R}^{-1} v \right)\left( 1 + z^H \hat{R}^{-1} z \right)} \qquad (2)$$

Whether the resulting scalar is above or below a defined threshold determines whether the test snapshot (z) is declared to contain a target. In order to detect targets reliably, the set of data snapshots used must be representative of the local interference environment. In practice, this means that the set of data vectors must be chosen in an overlapped sliding-hole strategy in which $\hat{R}$ is computed many times as the hole slides across the entire dataset.


Ultimately, from a computational standpoint, this means recomputing the covariance matrix with rank-one updates as in

$$\hat{R}_{w+1} = \hat{R}_w + x x^H \qquad (3)$$

and then computing the inverse of the result. This is computationally intensive and is the motivation for this work.
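For readers who want to experiment with the detection statistic, the following MATLAB fragment is a minimal illustration (not the project code) of forming the covariance estimate of equation (1) and evaluating the GLRT of equation (2); the dimensions, the synthetic snapshots, and the threshold value are example choices only.

```matlab
% Illustration only: covariance estimate (1) and GLRT statistic (2).
N = 64;   M = 256;                        % example snapshot length and count
X = randn(N, M) + 1i*randn(N, M);         % stand-ins for the training snapshots x_k
v = randn(N, 1) + 1i*randn(N, 1);         % exemplar target snapshot
z = randn(N, 1) + 1i*randn(N, 1);         % test snapshot

Rhat = (X * X') / M;                      % (1/M) * sum_k x_k x_k^H
Riv  = Rhat \ v;                          % solve instead of forming the inverse
Riz  = Rhat \ z;
glrt = abs(v' * Riz)^2 / real((v' * Riv) * (1 + z' * Riz));
isTarget = glrt > 0.1;                    % compare with an application-defined threshold
```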

2.2. Woodbury Formula and Cholesky Decompositions

The Woodbury formula, sometimes called the Woodbury update, matrix inversion lemma, Sherman-Morrison-Woodbury formula, or Woodbury matrix identity, is, in its rank-one form,

$$(A + uv^T)^{-1} = A^{-1} - \alpha\, A^{-1} u v^T A^{-1} \qquad (4)$$

where

$$\alpha = \frac{1}{1 + v^T A^{-1} u} \qquad (5)$$

and provides an insightful view into the mechanics of re-computing the matrix inverse upon such updates. Unfortunately, while mathematically exact, this update formula can be numerically unstable. This means that small initial errors (e.g., round-off) can result in growing errors when the formula is applied successively. Furthermore, the Woodbury formula itself would gain very little if coded in parallel, as communication overhead would slow the implementation. Gains can be made, however, by parallelizing code in which a matrix inverse must be updated many times. We have therefore parallelized an algorithm related to space-time adaptive processing of radar data. The inputs to this algorithm are a large m×n matrix, an m×1 vector s, and an m×1 vector d. The algorithm takes submatrices $Y_i$ of size m×4m from the input matrix by concatenating columns i through i+4m−1 of the input matrix, i.e., by sliding a window of size m×4m across the input matrix. A weight is then calculated for each submatrix. This weight, call it $\beta_i$, is calculated using the formula

$$\beta_i = \frac{w_i^T d}{w_i^T w_i} \qquad (6)$$

where $w_i = R_i^{-1} s$ and $R_i = Y_i Y_i^T$. It is assumed that $R_i$ is invertible for all i; note that in this case $R_i$ is positive definite. The first inverse must be computed directly. However, subsequent values of $R_i^{-1}$ may be computed using two Woodbury updates. First we remove column i−1 of the input matrix, denoted x, forming an intermediate result

$$T_i^{-1} = R_{i-1}^{-1} + \alpha_x R_{i-1}^{-1} x x^T R_{i-1}^{-1}, \quad \text{where} \quad \alpha_x = \frac{1}{1 - x^T R_{i-1}^{-1} x}.$$

Next we add column i+4m−1, denoted y, giving

$$R_i^{-1} = T_i^{-1} - \alpha_y T_i^{-1} y y^T T_i^{-1}, \quad \text{where} \quad \alpha_y = \frac{1}{1 + y^T T_i^{-1} y}.$$

This approach is much faster than a naïve implementation in which $R_i^{-1}$ is computed directly as $(R_{i-1} - x x^T + y y^T)^{-1}$.
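A minimal sketch of the two rank-one updates just described, assuming the previous window's inverse is already available (real-valued data for simplicity; the function and variable names are illustrative and the function would live in its own file, woodbury_slide.m):

```matlab
% Slide the window by one column using two Woodbury rank-one updates.
% Rinv: inverse from the previous window; x: departing column; y: arriving column.
function Rinv_new = woodbury_slide(Rinv, x, y)
    u    = Rinv * x;                                % remove x:  (R - x*x')^{-1}
    Tinv = Rinv + (u * u') / (1 - x' * u);
    w    = Tinv * y;                                % add y:     (T + y*y')^{-1}
    Rinv_new = Tinv - (w * w') / (1 + y' * w);
end
```

The weight for the new window is then obtained as w_i = Rinv_new * s, exactly as in the direct approach but without re-inverting the matrix.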

Since $R_i = Y_i Y_i^T$ is positive definite, a Cholesky decomposition can be used so that $R_i = L_i L_i^T$, where $L_i$ is a lower triangular matrix with positive diagonal elements. Note that use of the Cholesky factors is preferred over explicit calculation of the inverse in most situations due to greater efficiency and numerical stability. Thus, a more stable update for the problem can be obtained by computing the Cholesky decomposition of $R_1$ to obtain $L_1$ and then solving for $w_i$ with two triangular solves: $L_i v_i = s$ by forward substitution, followed by $L_i^T w_i = v_i$ by back substitution (see the MATLAB documentation for linsolve, taking note of the LT and UT options). Subsequent Cholesky factors $L_{i+1}$ may be obtained using positive and negative Cholesky rank-one updates, removing column i−1 from and adding column i+4m−1 to $R_i$ (see the MATLAB documentation for cholupdate). Updates to the Cholesky factors are computed using orthogonal transformations, which are extremely stable numerically. Using Cholesky updates and triangular substitutions is also more computationally efficient than calculating the inverse matrices directly.
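A compact sketch of the Cholesky-based variant follows. It is illustrative rather than the delivered code: the data are synthetic, the window bookkeeping is schematic, and it uses the upper-triangular convention U'U = R_i that MATLAB's chol and cholupdate expect, so it works with U = L_i^T rather than L_i itself.

```matlab
% Sketch of the Cholesky-update variant with synthetic data.
m = 64;  n = 1000;  win = 4*m;
A = randn(m, n);                          % stand-in for the m-by-n input matrix
s = randn(m, 1);  d = randn(m, 1);
numWin = n - win + 1;
beta   = zeros(1, numWin);
optLT  = struct('LT', true);              % forward-substitution option
optUT  = struct('UT', true);              % back-substitution option

U = chol(A(:, 1:win) * A(:, 1:win)');     % upper factor of the first window: U'*U = R_1
for i = 1:numWin
    if i > 1
        U = cholupdate(U, A(:, i+win-1), '+');   % add column i+4m-1
        U = cholupdate(U, A(:, i-1),     '-');   % remove column i-1
    end
    v = linsolve(U', s, optLT);           % forward substitution:  U' v = s
    w = linsolve(U,  v, optUT);           % back substitution:     U w = v, so w = R_i \ s
    beta(i) = (w' * d) / (w' * w);        % weight statistic, as in equation (6)
end
```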

2.3. Results

We measured the execution times for our implementations running on various numbers of processors on both Powell (at the Army Research Laboratory Major Shared Resource Center [ARL MSRC]) and hpc11 (at the Aeronautical Systems Center [ASC] MSRC). Each script was timed while using 1, 2, 4, 6, and 8 processors. Ten trials were run for each number of processors, and the median time was recorded. The input data consisted of a 64×10,000 data matrix and conformable vectors, all generated using STAP radar data. An overlapping sliding window was used to generate the individual matrices. For the direct matrix results, the individual matrices associated with the overlapped windows were computed on separate processors independently. The reduction in median time with the number of processors results from spreading the data across more available processors.


The results for the Woodbury method are for computing the matrix update as defined in formula (4). The resulting speedup arises from spreading the computation across more processors. Note that the numerical problems with this method remain. Timing results for the numerically more stable Cholesky method are also shown in Figure 1.
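The parallel timing runs distribute disjoint sets of windows to the MatlabMPI processes. The fragment below is a hedged sketch of that pattern, assuming MatlabMPI's standard MPI_Init / MPI_Comm_rank / MPI_Send / MPI_Recv interface; process_windows is a hypothetical helper that runs one of the update loops sketched earlier over the windows it is given.

```matlab
% Hedged MatlabMPI sketch: round-robin the sliding windows over the processes.
MPI_Init;
comm   = MPI_COMM_WORLD;
nProc  = MPI_Comm_size(comm);
myRank = MPI_Comm_rank(comm);                 % ranks run 0 .. nProc-1

numWin = 9745;                                % 10,000 - 4*64 + 1 windows for the test matrix
myWins = (myRank + 1):nProc:numWin;           % this process's window indices
myBeta = process_windows(myWins);             % hypothetical helper (see sketches above)

if myRank > 0
    MPI_Send(0, myRank, comm, myWins, myBeta);        % ship results to rank 0
else
    beta = zeros(1, numWin);
    beta(myWins) = myBeta;
    for src = 1:nProc-1
        [wins, vals] = MPI_Recv(src, src, comm);      % gather the other ranks' results
        beta(wins) = vals;
    end
end
MPI_Finalize;
```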

2.4. Summary

The Woodbury and Cholesky algorithms are much faster than the direct approach, in which the full inverse is computed at each step. They are so fast, in fact, that a single update is not worth parallelizing with MatlabMPI. For the larger computations in which these updates are embedded, however, parallelization did accelerate the results.

3. Random Variate Generation

3.1. Motivation

Many system simulations require noise modeling in order to represent real-world performance. For instance, reliable system noise modeling in anti-jam systems can make or break performance because the underlying noise process is the limiting factor in interference suppression. Unfortunately, for modern high-fidelity systems, the available methods to model system noise are limited to those distributions that correspond to existing numerical random variate methods (e.g., Gaussian, Weibull). As a result, many high-resolution system models, even when real-world noise data are available, are reduced to noise modeling using the standard random noise models, irrespective of whether those models can adequately represent the real-world noise. For example, inter-symbol interference in communication systems is usually modeled in bulk as a Rayleigh process, which has a well-known parametric analytical form. This kind of bulk modeling is generally acceptable for relatively low-resolution systems that benefit from long integration times. However, modern surveillance systems, especially Ground Moving Target Indication on the Global Hawk platform, do not enjoy this luxury. Thus, the problem is this: given examples of real system noise, how can a large-scale system simulation, which may run thousands of realizations in order to generate meaningful performance statistics, obtain more variates from the underlying distribution?

3.2. Methods

Since no single method for generating new random variates from a sample dataset is best in all situations, three methods were implemented (minimal MATLAB sketches of each follow the list):

1. Inverse cumulative distribution function (inverse cdf) method – this method creates a cdf F(x) = P(X ≤ x) for the dataset under the assumption that all random variates in the dataset are equally probable. New random variates can then be obtained by generating a random value u on the interval [0, 1] and calculating F⁻¹(u). An advantage is that the cdf and new random variates are quick and easy to compute. The disadvantages are that the generated samples are always contained within the minimum-maximum interval of the dataset, and the implied pdf is piecewise constant (the cdf consists of piecewise linear segments).

2. Kernel estimation with rejection sampling – this method creates a pdf for the dataset by placing a kernel at each point in the dataset, summing the kernels, and normalizing the sum. New random variates are obtained by generating a value in the domain of the pdf and then generating a second value between 0 and the maximum height of the pdf. If the second value falls below the value of the pdf at the first value, the first value is returned as a new random variate; otherwise, it is rejected. An advantage of this method is that it can approximate the shape of almost any pdf. However, it is less efficient than the other two methods. It also requires the user to choose an appropriate shape and width (or smoothing factor) for the kernels.

3. Gaussian mixture models (GMMs)[2] – this method uses K-means clustering and an expectation-maximization algorithm to fit a specified number of weighted Gaussian distributions to the dataset. A new random variate is obtained by sampling from one of the Gaussians, chosen with probability proportional to its weight. Some advantages of this method are that it can approximate many types of distributions and that the estimation of the pdf and the generation of new random variates are very efficient. A disadvantage is that the number of Gaussians to use in the model must be chosen appropriately. Also, this method works best for multi-modal distributions, with the number of Gaussians equal to the number of modes.
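The following MATLAB functions are minimal re-implementations of the three methods for one-dimensional data (each function would live in its own M-file); the kernel shape and width, the mixture parameterization, and the way the empirical cdf is built are example choices rather than the delivered code, and the GMM fitting step itself is assumed to have been done elsewhere.

```matlab
function r = icdf_sample(data, nOut)
% Method 1: invert the piecewise-linear empirical cdf of the dataset.
    xs = sort(data(:));
    F  = linspace(0, 1, numel(xs))';          % empirical cdf levels
    r  = interp1(F, xs, rand(nOut, 1));       % F^{-1}(u) for u ~ U(0,1)
end

function r = kde_reject_sample(data, nOut, h)
% Method 2: rejection sampling from a triangular-kernel density estimate
% with half-width h (e.g., h = 0.1). The kernel peak 1/h bounds the pdf.
    data = data(:);
    lo = min(data) - h;  hi = max(data) + h;
    kde = @(x) mean(max(0, 1 - abs(bsxfun(@minus, x, data.'))/h), 2) / h;
    r = zeros(nOut, 1);  k = 0;
    while k < nOut
        cand = lo + (hi - lo) * rand(nOut, 1);       % first value: point in the domain
        u    = rand(nOut, 1) / h;                    % second value: uniform on [0, 1/h]
        keep = cand(u <= kde(cand));                 % accept where u falls below the pdf
        take = min(numel(keep), nOut - k);
        r(k+1:k+take) = keep(1:take);
        k = k + take;
    end
end

function r = gmm_sample(w, mu, sigma, nOut)
% Method 3: draw from a fitted 1-D Gaussian mixture with weights w, means mu,
% and standard deviations sigma (K-means/EM fitting assumed already done).
    w = w(:);  mu = mu(:);  sigma = sigma(:);
    cumw = cumsum(w).' / sum(w);
    comp = sum(bsxfun(@gt, rand(nOut, 1), cumw), 2) + 1;  % pick components by weight
    r = mu(comp) + sigma(comp) .* randn(nOut, 1);
end
```

For example, gmm_sample([0.5 0.3 0.2], [0 2 5], [0.5 0.5 1], 1e5) draws 100,000 variates from a three-component mixture.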

3.3. Results

To evaluate each of the methods, a sample dataset was created by drawing samples from the multi-modal distribution shown in Figure 2. The methods were coded and implemented on Powell and Falcon using datasets of different sizes and different numbers of desired outputs.


Triangular kernels with a width of 0.1 were used for the kernel estimation method, and three Gaussians were used for the GMM. In all cases, each method produced samples that approximated the distribution of Figure 2 well. The table below summarizes the execution times (in seconds) on Powell using different numbers of processors when 1,000 samples were used as input and 1,000 or 100,000 outputs were generated. Since the methods differ in approach, they do not scale equally with the number of processors used.

Machine   Case                                Processors   Accept-Reject     GMM     ICDF
Powell    1,000 inputs vs. 1,000 outputs           2             0.69        0.42     0.21
                                                   4             0.85        0.27     0.26
                                                   6             0.90        0.25     0.49
                                                   8             0.42        0.42     0.30
                                                  10             0.57        0.68     0.54
Powell    1,000 inputs vs. 100,000 outputs         2           120.60        0.63    47.04
                                                   4            29.27        0.29     4.60
                                                   6            21.01        0.36     1.55
                                                   8            15.06        0.50     0.91
                                                  10             9.80        0.54     0.75

The table shows that execution times are short for 1,000 inputs and outputs regardless of which method is used. Due to communication costs, additional processors do not shorten execution times. This remains true for the GMM method even when the number of outputs is increased to 100,000. The acceptance-rejection (kernel estimation) method is more computationally expensive for 100,000 outputs, and its execution times shorten as more processors are used. The same is true for the inverse cdf method, though it is not as expensive, so its execution times are shorter overall. In results not shown here, the number of inputs was increased to 25,000. The general trends remained the same. Execution times for the inverse cdf method and the GMM method increased only slightly, on the order of tenths of a second. Kernel estimation is more expensive, and its overall execution times increased more noticeably when the dataset size was increased to 25,000.

3.4. Summary

The methods implemented provide ways to generate more random variates after estimating the underlying distribution of a sample dataset. Each has its own advantages and disadvantages. All were able to approximate a multimodal triangular distribution well. The kernel estimation method saw the largest improvements when implemented in parallel, but was also the slowest method. The GMM method did not see improvements when implemented in parallel, but was the fastest method of the three.

4. Interpolation/Triangulation for Large Datasets

4.1. Motivation

Large datasets are collected with respect to variables and coordinates that are convenient and useful for the data collection methodology. Expedient analysis, on the other hand, sometimes requires using the data in coordinates and variables different from those used in the data collection. Specifically, data collected in uniform plaid rasters (i.e., the difference between successive values of a coordinate is a constant) are fastest to index and use. Missing data points or non-plaid layouts result in large data sets that are difficult to manipulate and analyze. In particular, consider the problem of creating a new data point using interpolation. With a uniform plaid raster, it is very easy to find the points in the existing set of data that enclose the new data point, since they are uniformly spaced. Interpolation can then be performed over these points to find the value for the new data point. For a non-plaid set of data, the problem of finding the enclosing points is not so easy. A common solution is to create a triangulation, a graph (i.e., a set of edges and vertices) of the data whose faces are triangles. Then, the existing data points that enclose the new data point can be found using the triangulation, and interpolation can be performed using these points. The purpose of this work is to provide a prototypical MatlabMPI code that shows how a large data set, which may be too large to fit in a single processor's memory, can be distributed across multiple processors. Each portion of the data can then be re-sampled or transformed into a format more amenable to further analysis, and a triangulation can be generated for each portion so that new points can be created through interpolation.

4.2. Methods

For security reasons, a freely available digital elevation model (DEM) was used to study this problem. This is a good dataset because, under the appropriate coordinate transform, the data no longer lie on a plaid grid. The exercise is to generate interpolated data in earth-centered earth-fixed (ECEF) x-y-z coordinates from plaid latitude/longitude/height DEM data. The MPI code uses MATLAB's griddata function, which creates a Delaunay triangulation for a set of points and uses it to interpolate values for new points. Because the model is so large, it was broken into "tiles." For this purpose, polar coordinates were used because they are regular in angle. Thus, the model was divided into nine overlapping tiles according to angle, and each tile was assigned to one processor.


Incoming points were converted temporarily to polar coordinates to determine which tile they belonged to. Because creating a Delaunay triangulation is very computationally expensive, each tile was then similarly divided into much smaller "slices," so that griddata was called only on the necessary slices.
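A schematic of the per-tile step appears below. It is not the delivered code: the DEM values are synthetic, the tiling angle and overlap are illustrative, and the tile index is fixed rather than derived from the MatlabMPI rank (each rank would select its own tile via MPI_Comm_rank, as in the earlier sketch).

```matlab
% Schematic per-tile interpolation: select the tile's points (with overlap)
% and let griddata triangulate and interpolate the query locations.
nTiles = 9;   margin = 0.02;                      % illustrative overlap (radians)
edges  = linspace(0, 2*pi, nTiles + 1);

theta = 2*pi*rand(1e5, 1);                        % stand-in tiling angle per DEM point
rho   = 1 + 0.1*rand(1e5, 1);                     % stand-in second coordinate
hgt   = sin(3*theta) + 0.01*randn(1e5, 1);        % stand-in DEM heights

t = 4;                                            % this process's tile (myRank + 1)
inTile = theta >= edges(t) - margin & theta <= edges(t+1) + margin;

qTheta = edges(t) + (edges(t+1) - edges(t)) * rand(5000, 1);   % incoming points in the tile
qRho   = 1 + 0.1*rand(5000, 1);

qHgt = griddata(theta(inTile), rho(inTile), hgt(inTile), qTheta, qRho);
```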

4.3. Results

In order to get an estimate for execution times, we first tested the griddata command by itself on one of the Ohio Supercomputer Center's systems using a slice of one map tile about 0.0583° wide in longitude and 5,000 new data points, then 10,000 points, then 20,000 points randomly generated within the slice. The times show that the setup cost of the griddata function, which includes creating a Delaunay triangulation for each slice of the digital elevation model, dominates the cost of performing the interpolation for the input points; it is on the order of several minutes even for this small slice. The main code was tested by generating input points over 30 similar slices. The new points appeared to align well with the existing points. Execution times were recorded for 5,000 points per slice on Powell (ARL) and Falcon (ASC): 15 minutes (900 seconds) on Falcon and 50 minutes (3,000 seconds) on Powell, with most of the time being spent on calls to griddata. It should be noted that MATLAB has a delaunay function, which constructs a Delaunay triangulation for a set of points. As previously mentioned, this appears to be the most computationally expensive part of griddata, and thus of the program as a whole. If delaunay were used instead, the Delaunay triangulation could be saved and searched using the function tsearch. Each processor could then load the necessary Delaunay triangulation or triangulations instead of recalculating them every time. This would greatly reduce execution time.
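A hedged sketch of that alternative is shown below: the triangulation is computed once with delaunay, saved, and then reused with tsearch (the triangle-location function available in MATLAB releases of that era), with the linear interpolation done from barycentric weights. The data here are synthetic placeholders for a tile's points.

```matlab
% Sketch: triangulate once, save, and reuse the triangulation for new queries.
x = rand(2e4, 1);  y = rand(2e4, 1);  v = sin(6*x) .* cos(6*y);   % synthetic tile data
xq = rand(5e3, 1); yq = rand(5e3, 1);                             % query points

tri = delaunay(x, y);                    % expensive step, done once ...
save tile_tri.mat tri x y v              % ... and saved alongside the tile data

load tile_tri.mat                        % later runs simply reload the triangulation
t  = tsearch(x, y, tri, xq, yq);         % enclosing triangle for each query point
ok = ~isnan(t);                          % ignore points outside the convex hull

% Linear interpolation from barycentric weights inside each triangle.
i1 = tri(t(ok), 1);  i2 = tri(t(ok), 2);  i3 = tri(t(ok), 3);
den = (y(i2)-y(i3)).*(x(i1)-x(i3)) + (x(i3)-x(i2)).*(y(i1)-y(i3));
w1  = ((y(i2)-y(i3)).*(xq(ok)-x(i3)) + (x(i3)-x(i2)).*(yq(ok)-y(i3))) ./ den;
w2  = ((y(i3)-y(i1)).*(xq(ok)-x(i3)) + (x(i1)-x(i3)).*(yq(ok)-y(i3))) ./ den;
w3  = 1 - w1 - w2;
vq     = nan(size(xq));
vq(ok) = w1.*v(i1) + w2.*v(i2) + w3.*v(i3);
```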

4.4. Summary

New points in earth-centered earth-fixed (ECEF) x-y-z coordinates were generated from digital elevation model data using a MatlabMPI program. The digital elevation model was, as an intermediate step, converted to polar coordinates and divided by angle into nine smaller, overlapping tiles. Each tile was assigned to one processor, and each processor further divided its tile into slices. MATLAB's griddata function was used to create triangulations and perform interpolation over the necessary slices. An alternate and most likely much faster method would use the delaunay function to create and save the triangulations so that they only have to be generated once.

5. Summary

The User Productivity Enhancement and Technology Transfer (PET) project work used MatlabMPI for three broad topics: matrix inverse updates, random variate generation, and interpolation/triangulation for large datasets. Because communication overhead makes it unprofitable to parallelize a single matrix inverse update, the work developed a more stable (and very fast) Cholesky-based alternative to the Woodbury update formula for a specific STAP methodology and parallelized the surrounding computation over the many windows that require it. The random variate generation work implemented three accepted methods to generate new random variates from an exemplar dataset. Interpolation for large datasets that are difficult to manipulate due to non-plaid layouts or missing data points was addressed by developing a method to spread the underlying dataset across multiple processors' memories. Each processor can then generate Delaunay triangulations and perform interpolation for points that lie in its domain.

Acknowledgements

This publication was made possible through support provided by DoD High Performance Computing Modernization Program PET activities through Mississippi State University under contract GS04T01BFC0060. The opinions expressed herein are those of the author(s) and do not necessarily reflect the views of the DoD or Mississippi State University.

References

1. Kelly, E.J., "An Adaptive Detection Algorithm." IEEE Transactions on Aerospace and Electronic Systems, vol. AES-22, no. 1, pp. 115–127, Mar. 1986.
2. Russell, S. and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., Prentice Hall, New Jersey, 2003.


Figure 1. Median execution time versus number of processors for the direct inverse, Woodbury, and Cholesky methods

Figure 2. 5,000 samples from the distribution used to generate sample random variates
