



Article

High-performance hybrid CPU and GPU parallel algorithm for digital volume correlation

Mark Gates¹, Michael T Heath² and John Lambros³

Abstract
We present a hybrid Message Passing Interface (MPI) and graphics processing unit (GPU)-based parallel digital volume correlation (DVC) algorithm for measuring three-dimensional (3D) displacement and strain fields inside a material undergoing motion or deformation. Our algorithm achieves resolution comparable to that achieved in two-dimensional (2D) digital image correlation (DIC), in time that is commensurate with the image acquisition time, in this case, using microcomputed tomography (μCT) for scanning images. For DVC, the volume of data and number of correlation points both grow cubically with the linear dimensions of the image. We turn to parallel computing to gain sufficient processing power to scale to high resolution, and are able to achieve more than an order-of-magnitude increase in resolution compared with previous efforts that are not based on a parallel framework.

Keywords
digital volume correlation, X-ray tomography, strain measurement, parallel computing, GPU computing, image registration

1. Introduction

Three-dimensional (3D) digital volume correlation (DVC) is a technique used to measure 3D displacement and strain fields throughout the interior of a material that has been imaged using 3D techniques such as X-ray microcomputed tomography (μCT) or confocal microscopy. DVC is an extension of two-dimensional (2D) digital image correlation (DIC) (Chu et al., 1985), which is widely used in experimental mechanics to measure surface displacements (Sutton et al., 2009). DVC provides a useful experimental complement to 3D numerical simulations such as finite-element analysis (FEA). For example, DVC can generate the full-field 3D experimental results required to compare with 3D finite-element simulations for validation purposes (Zauel et al., 2006; Mostafavi et al., 2013), results that are otherwise difficult or impossible to obtain. In addition to validation, DVC can also play a vital role in determining input parameters for simulations (Liu, 2005; Moulart et al., 2009; Efstathiou et al., 2010; Rossi and Pierron, 2011). DVC has been used to analyze materials with complex microstructures, such as bone and other biological materials (Bay et al., 1999; Smith et al., 2002; Zauel et al., 2006; Liu and Morgan, 2007; Franck et al., 2011), rock (Lenoir et al., 2007), and concrete (Hild et al., 2013; Yang et al., 2013); as well as complex behaviors, such as material fatigue (Rannou et al., 2010; Mostafavi et al., 2013), which can be difficult to simulate. Thus, although only a relatively small number of researchers have studied DVC, it is a powerful tool with significant future potential.

1.1. Background

The DVC process starts with a material sample possessing a random internal pattern of features that are detectable by the 3D imaging device. Such an internal "speckle" pattern can arise either from inherent internal material microstructure, such as in bone, or by manufacturing samples with embedded particles (Haldrup et al., 2006; Franck et al., 2007; Germaneau et al., 2007).

1 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA
2 Department of Computer Science, University of Illinois at Urbana-Champaign, USA
3 Department of Aerospace Engineering, University of Illinois at Urbana-Champaign, USA

Corresponding author:
Mark Gates, Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, 1122 Volunteer Boulevard, Suite 203, Knoxville, TN 37996, USA.
Email: [email protected]

The International Journal of High Performance Computing Applications, 1–15. © The Author(s) 2014. Reprints and permissions: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/1094342013518807. hpc.sagepub.com


As depicted in Figure 1(a), a reference 3D image of the undeformed sample is captured, then a motion or deformation is applied, and a 3D image of the resulting deformed sample is captured either under load (in situ) or after unloading (ex situ), as schematically illustrated in Figure 1(b). In Figure 1, the 3D images captured are indicated by 3 slices, out of a total of approximately 1000 such tomographic image slices through the material interior, each one revealing internal pattern features. The 3D data are represented digitally as intensity values at voxel locations of a 3D grid. We define a grid of correlation points on the reference image, with a 3D subset surrounding each point. For each correlation point, DVC determines the deformation that maps the subset in the reference image (Figure 1(a)) to the subset in the deformed image (Figure 1(b)) with the best correlation. In Figure 1 this happens to be a rigid rotation about the z-axis, which is illustrated in the DVC result of Figure 1(c) superposed on the reference image.

This mapping between the reference image and the deformed image is given by a shape function, x(u), which defines the degrees of freedom (DOFs) to be determined at each point. Common mappings include translation (3 DOFs), rotation (6 DOFs), and affine (12 DOFs), with more DOFs being more expensive to compute but generally yielding more accurate results.

To determine the deformation u, DVC seeks the best match between the subset in the reference image, f(x), and the subset in the deformed image, g(x(u)), by optimizing an objective function that measures their similarity, computed as a summation over all voxels x in the subset. We use the least-squares objective function,

$$ c(\mathbf{u}) = \frac{\sum_{\mathbf{x}} \left( f(\mathbf{x}) - g(\mathbf{x}(\mathbf{u})) \right)^2}{\sum_{\mathbf{x}} f(\mathbf{x})^2} = \frac{\| f - g \|^2}{\| f \|^2}. \qquad (1) $$

The normalized cross-correlation function (Chu et al., 1985) has also been used for DVC and is amenable to the same parallelization techniques presented here. In general, f and g do not exactly match, but we seek the best possible match by finding the deformation u that optimizes the objective function c(u).

DVC achieves sub-voxel accuracy, resulting in a deformation u with non-integer coordinates, requiring interpolation to evaluate the deformed image g(x(u)) between voxels. To achieve high-accuracy DVC results, we implement C² tricubic B-spline interpolation (Dierckx, 1993).
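To make the evaluation of equation (1) concrete, the following is a minimal serial sketch. The helpers map() (the shape function x(u)) and interp_g() (the tricubic B-spline evaluation of the deformed image) are hypothetical placeholders standing in for the routines described in this paper, not the actual implementation.

```cpp
// Minimal serial sketch of the least-squares objective in equation (1).
// The reference subset f is a contiguous s*s*s array of intensities;
// map() and interp_g() are hypothetical stand-ins for the shape function
// x(u) and the tricubic B-spline interpolation of the deformed image g.
#include <cstddef>

struct Point { double x, y, z; };

double objective(const float* f, int s,
                 Point (*map)(const Point&),        // x(u) for the current u
                 double (*interp_g)(const Point&))  // spline interpolation of g
{
    double num = 0.0, den = 0.0;
    for (int k = 0; k < s; ++k)
        for (int j = 0; j < s; ++j)
            for (int i = 0; i < s; ++i) {
                // local subset coordinates, centered on the correlation point
                Point x{ i - (s - 1) / 2.0, j - (s - 1) / 2.0, k - (s - 1) / 2.0 };
                double fx = f[(std::size_t)k * s * s + (std::size_t)j * s + i];
                double gx = interp_g(map(x));        // g(x(u)) at a sub-voxel location
                num += (fx - gx) * (fx - gx);
                den += fx * fx;
            }
    return num / den;                                // c(u) = ||f - g||^2 / ||f||^2
}
```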

1.2. Related work

DVC has been developed by various research groups since 1999, as summarized in Table 1. The evolution has moved toward higher-order shape functions, starting from translation, then rotation, then an affine transformation. Image interpolation has generally been cubic (C¹) or cubic spline (C²), although some larger studies have used less expensive linear (C⁰) interpolation. Most DVC studies have been relatively modest in size, computing fewer than 16,000 points. Some groups have computed up to 60,000 points with linear image interpolation and lower-order shape functions. Execution times have varied significantly, from 0.47 to 40 seconds per subset, depending on the images, computer, subset size, and complexity of the method.

As an alternative to the local DVC described above, Roux et al. (2008) introduced global DVC, where a finite-element mesh is used to define a global mapping that is optimized over the entire image simultaneously, instead of computing individual subsets independently. Since the entire image is coupled together, the parallelization of global DVC would necessarily be different than the approach used here for local DVC. Recently, Leclerc et al. (2012) accelerated with a GPU a pixel-based global DVC, where each element is a single pixel. In their approach, the GPU computation is primarily linear algebra, solving the resulting system, which has several million DOFs.

Outside mechanical engineering, DIC and DVC share much in common with image registration algorithms, which seek to align images that either partially overlap, or are taken at different times or with different imaging techniques (multi-modal). Medical imaging in particular uses both 2D and 3D image registration (Shams et al., 2010; Flucka et al., 2011), for instance, fusing a CT scan with an MRI scan, or aligning CT scans taken at different times. A wide variety of techniques are used for registration (Zitova and Flusser, 2003; Modersitzki, 2009), from tracking features to similarity measures based on pixel intensity, such as least squares (i.e. sum-of-squared differences) or mutual information. Several medical imaging researchers have used GPUs to accelerate registration, typically accelerating computation of the objective function (Klein et al., 2010; Saxena et al., 2010; Glocker et al., 2011). These have had moderate resolution on images from 3 to 26 megavoxels. While using similar techniques, the goal of image registration is different than DVC. Registration seeks to align two images, either to fuse them into a new image or to compare them directly. As such, it requires accuracy of approximately one voxel. The time constraint may also be several minutes or less, while a patient is undergoing an operation. In contrast, DVC is concerned with the actual deformation field and its derivatives, which yield strain. Thus, DVC requires higher accuracy, on the order of 1/10 voxel, measured at a large number of correlation points to give a densely sampled displacement field.

1.3. Major contributions

Our goal is to perform DVC with resolution (i.e. density of correlation points in each dimension) and accuracy comparable to that achieved in 2D DIC, with a correlation time that is commensurate with the image acquisition time. Depending on the resolution, each CT scan can take an hour, plus the time needed to reconstruct the 3D image. Because of the vastly increased volume of data associated with the reference and deformed images in 3D over 2D, DVC requires substantially more computation and storage than DIC to achieve similar resolution.


The present work builds upon our previous work (Gates et al., 2011), which focused on optimizing the serial algorithm, as summarized in Section 2. In particular, we previously examined the coarse search strategy to find the initial u0, the choice of optimization algorithm, spline basis function, and use of smoothing splines. A cursory explanation of our statically scheduled parallel DVC algorithm was also given, which has been refined and expanded in the current work.

In this work, we explain in more detail our data storage that facilitates the parallel DVC algorithm, and analyze the cost and scalability of various implementations of a parallel DVC algorithm. We observe that only a small portion of the image data is in use at any one time, so by careful memory management we have developed an out-of-core algorithm that scales to much larger images, described in Section 3. In contrast with the modest resolutions previously achieved with DVC, summarized in Table 1, a typical 2D DIC grid has up to 100² correlation points, so a comparable resolution in 3D requires a grid with 1,000,000 points. Such high resolution is required to resolve high strain gradients within a material effectively. To reach our goal of a 100³ point correlation grid, at 0.25 seconds per correlation point, would require 3 days of serial computing time. We turn to parallel computing to gain sufficient processing power to scale to these large problem sizes. We explore two forms of parallelism: coarse-grained parallelism using MPI to compute different correlation points simultaneously, discussed in Section 4, and fine-grained parallelism using GPUs to compute the objective function for each correlation point in parallel, discussed in Section 5.

Figure 1. DVC applied to a 3D image with 10° rotation about the z-axis, showing 3 out of approximately 1000 images: (a) reference image; (b) deformed image; (c) displacement field. The subset in reference image (a) is mapped to the subset in deformed image (b). The displacement field (c) is computed on a 5 × 5 × 2 grid of correlation points. The subset size shown is 41³ voxels. (Scale bar: 1 mm.)

Table 1. Selection of prior work on DVC. A dash (—) denotes entries not reported in papers; * denotes entries using global DVC instead of local DVC.

Reference                | Material   | Image interpolation | Shape function | Correlation points/elements
-------------------------|------------|---------------------|----------------|----------------------------
Bay et al. (1999)        | Bone       | Cubic               | Translation    | 5500
Smith et al. (2002)      | Bone       | Cubic               | Rotation       | 125
Verhulp et al. (2004)    | Al foam    | Cubic               | Affine         | 2130
Zauel et al. (2006)      | Bone       | Cubic               | Translation    | —
Franck et al. (2007)     | Agarose    | N/A                 | Axial strain   | 3375
Lenoir et al. (2007)     | Rock       | Linear              | Translation    | 60,000
Roux et al. (2008)       | Solid foam | —                   | Affine         | 1331 *
Forsberg et al. (2010)   | Wood       | Cubic spline        | Affine         | 960
Hall et al. (2010)       | Sand       | Linear              | Rotation       | 50,000
Rannou et al. (2010)     | Cast iron  | —                   | Affine + crack | 729 *
Limodin et al. (2011)    | Cast iron  | Cubic spline        | Affine         | 15,625 *
Gates et al. (2011)      | PDMS       | Cubic spline        | Affine         | 59,000
Sjödahl et al. (2012)    | Sugar      | Cubic spline        | Affine         | 8,300
Morgeneyer et al. (2012) | Al         | Cubic spline        | Affine         | 16,000 *
This work                | Ceramic    | Cubic spline        | Affine         | 1,030,000


In addition to a statically scheduled parallel scheme, we propose a dynamically scheduled master–worker scheme, which exhibits superior efficiency and scalability. For our GPU implementation, we highlight the importance of tuning to find the best algorithm parameters, and demonstrate how multiple CPU cores can effectively share a single GPU.

2. DVC algorithm

Before looking at the parallel implementation, we summarize the serial implementation of DVC as detailed in Gates et al. (2011), and analyze its computational cost. For added robustness, here we have added a fast Fourier transform (FFT) coarse search. Some choices made in anticipation of our parallel algorithm are noted. The pseudocode is given in Algorithm 2, in the context of our parallel implementation.

For mapping the reference image to the deformed image, we use an affine shape function (Chu et al., 1985), which defines 12 DOFs, namely the displacements in each dimension and their first derivatives,

$$ \mathbf{u} = \begin{bmatrix} u & v & w & \dfrac{\partial u}{\partial x} & \dfrac{\partial u}{\partial y} & \dfrac{\partial u}{\partial z} & \dfrac{\partial v}{\partial x} & \dfrac{\partial v}{\partial y} & \dfrac{\partial v}{\partial z} & \dfrac{\partial w}{\partial x} & \dfrac{\partial w}{\partial y} & \dfrac{\partial w}{\partial z} \end{bmatrix}^T. $$

As illustrated in Figure 2, for each correlation point we let f(x) be the subset in the reference image and g(x(u)) be the subset in the deformed image, with a point x = [x, y, z] in the reference subset related to the corresponding point x(u) in the deformed subset by the affine shape function

$$ \mathbf{x}(\mathbf{u}) = \begin{bmatrix} x + u + \dfrac{\partial u}{\partial x} x + \dfrac{\partial u}{\partial y} y + \dfrac{\partial u}{\partial z} z \\[1ex] y + v + \dfrac{\partial v}{\partial x} x + \dfrac{\partial v}{\partial y} y + \dfrac{\partial v}{\partial z} z \\[1ex] z + w + \dfrac{\partial w}{\partial x} x + \dfrac{\partial w}{\partial y} y + \dfrac{\partial w}{\partial z} z \end{bmatrix}, \qquad (2) $$

where x, y, z form a local coordinate system within each subset.
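For illustration, a direct transcription of the affine shape function of equation (2) is sketched below; the struct and function names are illustrative, not taken from the paper's code.

```cpp
// Sketch of the affine shape function in equation (2), mapping a local
// subset coordinate (x, y, z) to its location in the deformed image.
// The 12-entry layout of u follows the vector defined above.
struct Vec3 { double x, y, z; };

Vec3 shape_affine(const double u[12], const Vec3& p)
{
    // u[0..2]  = u, v, w (translations)
    // u[3..5]  = du/dx, du/dy, du/dz
    // u[6..8]  = dv/dx, dv/dy, dv/dz
    // u[9..11] = dw/dx, dw/dy, dw/dz
    Vec3 q;
    q.x = p.x + u[0] + u[3] * p.x + u[4]  * p.y + u[5]  * p.z;
    q.y = p.y + u[1] + u[6] * p.x + u[7]  * p.y + u[8]  * p.z;
    q.z = p.z + u[2] + u[9] * p.x + u[10] * p.y + u[11] * p.z;
    return q;
}
```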

For each correlation point, the DVC algorithm first determines an initial u0 to start the optimization procedure from. This initial deformation must be close enough to the global optimum for the optimization algorithm to converge, typically within a few pixels. If neighboring correlation points have been previously computed, we start with a first-order extrapolation from those results to determine a starting u0. Otherwise, we start with a user-specified displacement, often zero. Then a coarse search algorithm samples the objective function at all voxels in a search region around u0, seeking an improved u0.
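A minimal sketch of this voxel-level coarse search is shown below, under the assumption that only the three translation entries of u are varied; evaluate_c() is a placeholder for the objective of equation (1).

```cpp
// Sketch of the voxel-level coarse search: sample c(u) at integer
// translations in a kc-wide cube around u0 and keep the best candidate.
// evaluate_c() is a placeholder for the objective function; the search
// over translations only is an assumption for illustration.
#include <limits>

void coarse_search(double u0[12], int kc,
                   double (*evaluate_c)(const double u[12]))
{
    double best = std::numeric_limits<double>::max();
    double best_u[3] = { u0[0], u0[1], u0[2] };
    int h = kc / 2;
    for (int dz = -h; dz <= h; ++dz)
        for (int dy = -h; dy <= h; ++dy)
            for (int dx = -h; dx <= h; ++dx) {
                double u[12];
                for (int m = 0; m < 12; ++m) u[m] = u0[m];
                u[0] += dx; u[1] += dy; u[2] += dz;   // integer voxel offset
                double c = evaluate_c(u);
                if (c < best) {
                    best = c;
                    best_u[0] = u[0]; best_u[1] = u[1]; best_u[2] = u[2];
                }
            }
    u0[0] = best_u[0]; u0[1] = best_u[1]; u0[2] = best_u[2];
}
```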

To enhance robustness of the coarse search, when it cannot find a u0 such that c(u0) is below a user-defined threshold, in this work we augment it with a second coarse search over a much larger region. This uses the normalized cross-correlation objective function,

$$ \tilde{c}(\mathbf{u}) = \frac{\sum_{\mathbf{x}} f(\mathbf{x})\, g(\mathbf{x}(\mathbf{u}))}{\left( \sum_{\mathbf{x}} f(\mathbf{x})^2 \, \sum_{\mathbf{x}} g(\mathbf{x}(\mathbf{u}))^2 \right)^{1/2}} = \frac{\langle f, g \rangle}{\| f \| \cdot \| g \|}, $$

which can be evaluated efficiently for all voxels in a k_fft × k_fft × k_fft search region using the FFT (Gonzalez and Richard, 2007) as

$$ \langle f, g \rangle = \mathrm{IFFT}\!\left( \overline{\mathrm{FFT}(f_{\mathrm{pad}})} \cdot \mathrm{FFT}(g_{\mathrm{pad}}) \right), $$

and

$$ \| g \| = \left( \mathrm{IFFT}\!\left( \overline{\mathrm{FFT}(1_{\mathrm{pad}})} \cdot \mathrm{FFT}(g^2_{\mathrm{pad}}) \right) \right)^{1/2}, $$

where the subset f is of size s³, g is of size (s + k_fft)³, 1 is the unit function of size s³; f_pad, g_pad, g²_pad and 1_pad are suitably zero-padded to size (2(s−1) + k_fft)³. We use the implementation in the FFTW library (Frigo and Johnson, 2005), and ensure that 2(s−1) + k_fft has prime factors of only 2, 3, 5, and 7 for best efficiency. While c̃(u) differs from c(u), both have an optimum value where the images are aligned, so c̃(u) provides a good coarse search for c(u). This combination of coarse search algorithms enhances robustness of the overall DVC process, leveraging results from previous correlation points when available, but not requiring such results, which may be computed by another parallel process and therefore be unavailable.
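As a sketch of how the ⟨f, g⟩ term can be evaluated with FFTW's single-precision real transforms, the fragment below computes the cross-correlation for all integer shifts at once. The zero-padding, the g² and unit-function terms of the normalization, and the peak extraction are omitted, and the exact conjugation and shift conventions are assumptions of this sketch rather than the paper's implementation.

```cpp
// Sketch of the <f, g> term of the FFT coarse search using FFTW's
// single-precision real-to-complex transforms. fpad and gpad are the
// zero-padded subsets of edge length n = 2(s-1) + kfft.
#include <fftw3.h>
#include <vector>

std::vector<float> cross_correlate(std::vector<float> fpad,
                                   std::vector<float> gpad, int n)
{
    int nr = n * n * n;              // real grid points
    int nc = n * n * (n / 2 + 1);    // complex coefficients in r2c layout
    fftwf_complex* F = fftwf_alloc_complex(nc);
    fftwf_complex* G = fftwf_alloc_complex(nc);

    fftwf_plan pf = fftwf_plan_dft_r2c_3d(n, n, n, fpad.data(), F, FFTW_ESTIMATE);
    fftwf_plan pg = fftwf_plan_dft_r2c_3d(n, n, n, gpad.data(), G, FFTW_ESTIMATE);
    fftwf_execute(pf);
    fftwf_execute(pg);

    // pointwise conj(F) * G, scaled by 1/nr since FFTW's inverse is unnormalized
    for (int i = 0; i < nc; ++i) {
        float re = F[i][0] * G[i][0] + F[i][1] * G[i][1];
        float im = F[i][0] * G[i][1] - F[i][1] * G[i][0];
        F[i][0] = re / nr;
        F[i][1] = im / nr;
    }

    std::vector<float> corr(nr);
    fftwf_plan pb = fftwf_plan_dft_c2r_3d(n, n, n, F, corr.data(), FFTW_ESTIMATE);
    fftwf_execute(pb);

    fftwf_destroy_plan(pf); fftwf_destroy_plan(pg); fftwf_destroy_plan(pb);
    fftwf_free(F); fftwf_free(G);
    return corr;   // corr[k*n*n + j*n + i] holds <f, g> at integer shift (i, j, k)
}
```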

Once a suitable u0 has been found, a standard optimization algorithm refines the solution u to sub-pixel accuracy. Based on previous investigation (Gates et al., 2011), we use the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm with a cubic line search due to Fletcher (1987), although other algorithms such as Levenberg–Marquardt have also been used (Bay et al., 1999). BFGS requires evaluating the objective function c(u) and its gradient ∇c(u), but not its Hessian.

To evaluate c(u) between integer voxels, we use tricubic B-spline interpolation. First, the B-spline coefficients must be computed from the pixel intensities, sometimes known as prefiltering. One potential scheme computes coefficients for the entire n × n × n image, which involves solving three banded (n+2) × (n+2) systems with an (n+2) × [(n+2)²] right-hand side (RHS) matrix, transposing the RHS after each solve (Dierckx, 1993; Buis and Dyksen, 1996). Using a banded solver, such as LAPACK's gbsv (Anderson et al., 1999), the cost to solve for all the RHS is O(3n³).

Figure 2. Representation of reference subset f and deformed subset g with linear shape function.


A parallel implementation could distribute the RHS among p processors, avoiding communication during each solve but incurring three all-to-all communications, one after each solve for the transpose, involving p(p−1) messages of n³/p² floats each. The actual cost depends on the MPI implementation and underlying network topology. This becomes a preprocessing step that must be finished before any correlation points can be computed.

Instead of computing spline coefficients over the entire image, our scheme computes a spline over each subset, plus a small padding region of b voxels. Before each correlation function evaluation, we check that the bounding box (bbox) for the deformed g subset (dashed line in Figure 3) is still contained within this spline region, and re-compute the spline over a new region if necessary. For an N × N × N grid of correlation points, the cost is O(3N³(s+2b)³). For a spacing of d voxels between subsets, N ≈ n/d, yielding a total cost of O(3n³((s+2b)/d)³). Unlike the previous scheme, our scheme has no communication. Asymptotically, both schemes have an O(n³) cost, so comparison would require experiments to determine the constants involved. In our scheme, instead of being an expensive preprocessing step, the spline computations now occur throughout the computation. This is a significant benefit when computing a small number of correlation points that do not cover the entire image, for instance with a small-scale analysis before computing a full analysis.
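The bounding-box test performed before each objective evaluation can be sketched as follows; the Box type, the helper names, and the commented usage are illustrative assumptions, not the paper's actual code.

```cpp
// Sketch of the bounding-box test: if the deformed subset still fits
// inside the currently cached spline region (subset plus b voxels of
// padding), the spline is reused; otherwise it is recomputed.
struct Box { double lo[3], hi[3]; };

bool contains(const Box& spline_region, const Box& bbox)
{
    for (int d = 0; d < 3; ++d)
        if (bbox.lo[d] < spline_region.lo[d] || bbox.hi[d] > spline_region.hi[d])
            return false;
    return true;
}

// Usage inside the objective evaluation (hypothetical helpers):
//   Box bbox = bounding_box_of_deformed_subset(u);
//   if (!contains(spline_region, bbox)) {
//       spline_region = expand(bbox, b);            // add b voxels of padding
//       recompute_spline_coefficients(spline_region);
//   }
```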

Evaluating the objective function is a sum over all voxels in the subset, so it has complexity O(s³), with the constant accounting for 1 flop/voxel if no interpolation is needed (as in the coarse search), to 278 flops/voxel for cubic B-spline interpolation. Evaluating ∇c with a spline is an additional 248 flops/voxel.

The overall complexity for each correlation point is thus

$$ C_{\mathrm{pt}} = \underbrace{O(k_c^3 s^3)}_{\text{coarse search}} + \underbrace{O\!\left( t (s + k_{\mathrm{fft}})^3 \log_2 (s + k_{\mathrm{fft}}) \right)}_{\text{FFT search}} + \underbrace{O\!\left( 3 (s + 2b)^3 \right)}_{\text{B-spline coefficients}} + \underbrace{O(k_{\mathrm{bfgs}} s^3 / a)}_{\text{BFGS}}, \qquad (3) $$

where s is the subset width (e.g. 41), b is the padding around the subset (set to 5), k_c is the width of the coarse search volume (on average, 3), t is the percentage of points that invoke the FFT search (on average, 1.7%), k_fft is the width of the FFT search region (set to 40), and k_bfgs is the number of c(u) evaluations during BFGS (on average, 21.5). On average, BFGS takes 5.6 iterations and the cubic line search adds 2.8 evaluations per BFGS iteration. When using the GPU implementation introduced in Section 5, a is the speedup factor of the GPU implementation over the CPU implementation. The total serial complexity for an N × N × N correlation grid is

$$ N^3 C_{\mathrm{pt}} + \underbrace{O\!\left( 2 \{ s + b + d(N-1) \} n^2 \right)}_{\text{read images}}. \qquad (4) $$

The number of slices read for each image is bounded by the image size, n, in which case this becomes N³ C_pt + O(2n³).

3. Data storage

To achieve our goal of computing high-resolution DVC in an acceptable time, two interrelated issues must be addressed: the first issue is how to store data so that we can scale the problem to a large image size within the amount of memory available to each processor; the second issue is how to parallelize the DVC code to reduce the required wall clock time.

For 2D DIC, processing two n × n images in memory requires 2n² bytes for image data and (n+2)² floats for the B-spline coefficients used for image interpolation, which is easily managed for typical resolutions. For 3D DVC, however, storing two n × n × n images can be a challenge. It requires 2n³ bytes for image data and (n+2)³ floats for spline coefficients. For n = 1024, this requires 6 GB total, while at our scanner's maximum 4000 × 4000 × 2300 resolution it requires 206 GB total, exceeding the memory of most shared memory computers. A parallel implementation could distribute the memory over p processors, requiring O(6n³/p) bytes per process. However, examining how DVC accesses images reveals that we can use much less memory. 3D images are typically stored as a series of 2D images (e.g. in TIFF format), each representing a single slice with a constant value of the z coordinate (see Figure 1 for coordinate definition). Each subset intersects only a small number of these slices, so rather than attempting to load the entire 3D image, we developed a data structure that loads only those slices that are currently in use, as shown in Figure 4.

Figure 3. Derivative terms in initial transformation u0 are zero, so initial subset is square. The outer gray region is padding added before computing the spline. After the optimization step, derivative terms in u1 are non-zero, introducing strain and rotation. Dashed lines show the bounding box, which here is still inside the initial spline region.


Thus, we reduce the required image data from 2n³/p to 2(s+b)n² per process, independent of p. For the spline coefficients, the spline is computed over a single subset instead of the entire image, reducing required spline coefficients from (n+2)³/p to (s+2b+2)³ floats per process. As s ≪ n, this is a significant savings. The total memory is thus O(2sn²) bytes. Our scheme will require less memory than loading the entire image for p < (6n)/(2s). Figure 5 shows how our memory requirements scale with image size, and compares this with computing the full image spline in parallel for various p. A variant of this scheme for a shared memory node would let cores within a node share a set of images, reducing memory requirements to 2(s+b)n² per node, instead of per process. This would require a change in the parallel algorithm, so that processors within a shared node are assigned correlation points in the same plane; however, we have not investigated this variation.

Our data storage scheme is implemented as a C++ class called Image3D. From a user's perspective, accessing voxels in an Image3D object operates much like a 3D array. To access the voxel with indices i, j, k, we use g = image(i, j, k), which returns the i, j entry of slice k. There is an additional function, image.load(k_begin, k_end), to read a contiguous range of image slices. Internally, Image3D maintains a vector of pointers to the 2D image slices. At any time, one contiguous block of these slices is loaded into memory, while pointers for all other slices are null, marking them as not loaded.
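A minimal sketch of an Image3D-style container along these lines follows; the class shape, member names, and the stubbed slice reader are assumptions for illustration rather than the actual implementation.

```cpp
// Sketch of an Image3D-style container: voxel access looks like a 3D
// array, but only one contiguous block of 2D slices is resident in
// memory at a time; a null pointer marks a slice as not loaded.
// The TIFF reader is reduced to a stub.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class Image3D {
public:
    Image3D(int nx, int ny, int nz) : nx_(nx), ny_(ny), slices_(nz) {}

    // load slices [k_begin, k_end), releasing any previously loaded block
    void load(int k_begin, int k_end) {
        for (auto& s : slices_) s.reset();      // mark all slices not loaded
        for (int k = k_begin; k < k_end; ++k)
            slices_[k] = read_slice(k);         // stub: real code reads one TIFF slice
    }

    // voxel access, like a 3D array; slice k must currently be loaded
    std::uint8_t operator()(int i, int j, int k) const {
        assert(slices_[k] && "slice not loaded");
        return slices_[k][(std::size_t)j * nx_ + i];
    }

private:
    std::unique_ptr<std::uint8_t[]> read_slice(int /*k*/) const {
        return std::make_unique<std::uint8_t[]>((std::size_t)nx_ * ny_);
    }
    int nx_, ny_;
    std::vector<std::unique_ptr<std::uint8_t[]>> slices_;  // one entry per z slice
};
```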

This data structure makes the DVC algorithm scalable to large problem sizes, even on a single processor, rather than being restricted by the amount of available RAM. For instance, to perform DVC on a 992 × 1013 × 994 image required less than 240 MB of memory per processor, even though two 3D images plus spline coefficients would take 5.6 GB. This data structure also has immediate benefits for parallel computing, as each processor can manage its own subset of image slices independently, simplifying the parallel data distribution and communication.

4. Coarse-grained parallel computing

For parallel implementation, a computation can be decomposed into tasks at several different granularities. Given multiple pairs of 3D images to analyze using DVC, the coarsest granularity for parallel computing would be to run multiple instances of DVC to analyze different images simultaneously; we will not deal with this trivially parallelizable case. More interestingly, within the DVC computation for a single pair of images, there are at least two additional levels of granularity: a coarse-grained decomposition computes multiple correlation points in parallel, while a fine-grained decomposition computes the objective function for a single correlation point in parallel. We address the coarse-grained parallelism in this section and discuss the fine-grained parallelism in Section 5.

To implement a coarse-grained parallel algorithm, we assign different correlation points to different processors. A key question is how to assign correlation points to specific processors. In our storage scheme for 3D images, each processor loads only the slices it needs for the current subset, as shown in Figure 4. Therefore, assigning correlation points in a plane with the same z coordinate to a single processor will maximize reuse of image data already in memory. This also makes the algorithm scalable to large data sets because each processor reads only a fraction of the data, instead of every processor reading and storing the entire data set. In addition, this provides the maximum benefit of using neighboring correlation points on the same processor to extrapolate a good initial guess for the next correlation point.

However, this simple 1D decomposition by planes may produce poor efficiency resulting from load imbalance, where some processors are assigned more planes than others, as shown in Figure 6(a). We improve on this by considering two other decompositions. One option is to group points into rows, where all points in a row have the same y and z coordinates, and divide rows evenly among processors, yielding a maximum difference of one row between different processors (Figure 6(b)).

Figure 4. Each processor reads only image slices required for its current subset. Different processors work on different subsets in parallel.

Figure 5. Growth in per-process memory requirements as a function of image size, for various numbers of processes, p. The full scheme loads the entire image into memory. Our scheme loads O(s) slices and is independent of p.


Another option, which is used for all static scheduling results here, is to divide the total number of points evenly among processors, yielding a maximum difference of one point between different processors (Figure 6(c)).
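The "by points" decomposition amounts to splitting a linear index over the N³ points as evenly as possible; a small sketch (with illustrative names) is:

```cpp
// Sketch of the "by points" static decomposition: the N^3 correlation
// points are indexed linearly and split as evenly as possible across p
// processes, so no process holds more than one extra point.
struct Range { long begin, end; };   // half-open [begin, end)

Range static_partition(long num_points, int p, int rank)
{
    long base = num_points / p;
    long rem  = num_points % p;      // the first 'rem' ranks get one extra point
    long begin = rank * base + (rank < rem ? rank : rem);
    long end   = begin + base + (rank < rem ? 1 : 0);
    return { begin, end };
}
```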

This static load balancing scheme assumes that each correlation point takes approximately the same amount of time to compute. However, variations in image quality (e.g. differing speckle pattern over regions of the sample) and the underlying deformation field can cause correlation points to take different amounts of time to compute, leading to a load imbalance. We used TAU (Shende and Malony, 2006) and Jumpshot (Chan et al., 2008) to trace and visualize the communication and idle time for the static load balancing scheme, as shown in Figure 7(a). Each horizontal bar represents one of three processes. The gray blocks show idle time at the end of the computation for processes 0 and 1 as they wait for process 2 to finish, demonstrating load imbalance.

To correct this imbalance, we developed a master–worker dynamic load balancing scheme, where we assign one process to be the master, and all other processes to be workers. The master process, shown in Algorithm 1, creates a set of tasks, with each task being a single row of the correlation grid. For each worker process, the master initially sends two tasks and sets up a non-blocking receive to wait for results back from the worker. The master then waits on the set of non-blocking receives for any incoming results from workers. For any results that come in, it records the results and sends a new row of correlation points to that worker process, or a flag telling the worker it is finished. Each worker, shown in Algorithm 2, receives a task, processes the correlation points in it, sends results back to the master, starts a non-blocking receive for a new task, and immediately starts on the next queued task. By initially assigning two tasks, workers always have a queued task to work on while waiting for the next assignment from the master.
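A condensed sketch of this exchange with plain MPI point-to-point calls is given below. It simplifies several details relative to Algorithms 1 and 2: payloads are reduced to a single row index and a flat array of doubles, each worker is primed with one task rather than two, the locality-aware assign() policy is omitted, and MPI_Waitany is used in place of MPI_Waitsome.

```cpp
// Simplified master-worker sketch with MPI; names and payloads are
// illustrative, not the paper's actual Algorithms 1 and 2.
#include <mpi.h>
#include <vector>

static const int TAG_TASK = 1, TAG_RESULT = 2, DONE = -1;

void master(int nworkers, int ntasks, int result_len)
{
    std::vector<std::vector<double>> buf(nworkers, std::vector<double>(result_len));
    std::vector<MPI_Request> req(nworkers, MPI_REQUEST_NULL);
    int next = 0, finished = 0;

    for (int w = 0; w < nworkers; ++w) {                // prime each worker with one task
        int task = (next < ntasks) ? next++ : DONE;
        MPI_Send(&task, 1, MPI_INT, w + 1, TAG_TASK, MPI_COMM_WORLD);
        if (task == DONE) { ++finished; continue; }
        MPI_Irecv(buf[w].data(), result_len, MPI_DOUBLE, w + 1, TAG_RESULT,
                  MPI_COMM_WORLD, &req[w]);
    }
    while (finished < nworkers) {
        int w;
        MPI_Waitany(nworkers, req.data(), &w, MPI_STATUS_IGNORE);
        // ... record buf[w] into the global displacement arrays ...
        int task = (next < ntasks) ? next++ : DONE;     // hand out the next row, or stop
        MPI_Send(&task, 1, MPI_INT, w + 1, TAG_TASK, MPI_COMM_WORLD);
        if (task == DONE) { ++finished; continue; }
        MPI_Irecv(buf[w].data(), result_len, MPI_DOUBLE, w + 1, TAG_RESULT,
                  MPI_COMM_WORLD, &req[w]);
    }
}

void worker(int result_len)
{
    std::vector<double> result(result_len);
    for (;;) {
        int task;
        MPI_Recv(&task, 1, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (task == DONE) break;
        // ... run DVC on the row of correlation points 'task', filling 'result' ...
        MPI_Send(result.data(), result_len, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
    }
}
```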

Figure 7(b) shows a trace of the communication and idle time for the dynamic load balancing scheme. The vertical orange lines show the master process receiving results from and sending a new task to each worker process. The master is nearly always idle in a MPI_Waitsome call (yellow), so it can run oversubscribed on the same core as a worker process. The gray blocks at the end are again idle time, but are significantly less than the idle time for the static load balancing scheme in Figure 7(a). The wall clock time is reduced by 9%, from 194 seconds to 176 seconds, showing that the added communication for dynamic load balancing is more than compensated by the improved load balancing.

Figure 6. Decomposition of 5 × 5 × 5 grid of correlation points onto four processors: (a) by planes (63% efficiency); (b) by rows (89% efficiency); (c) by points (98% efficiency). Parallel efficiency given for simple case of all points taking equal time.


We still want to be efficient with reusing the existing image slices that have been loaded by each process. Therefore, rather than assigning tasks in a round-robin fashion, with task 0 going to process 0, task 1 to process 1, etc., we instead make an initial distribution of tasks identical to the static load balancing "by rows" (Figure 6(b)). If a process finishes all of its initial distribution of tasks, the master will re-assign it tasks from other processes. This is managed by the assign function of Algorithm 1.

Similar to the serial complexity given in Equation (4), the complexity for the parallel algorithm with p workers is

$$ C_{\mathrm{par}} = O\!\left( \frac{N^3}{p} C_{\mathrm{pt}} \right) + \underbrace{O\!\left( \frac{N^2}{p}\, \mathrm{msg}(N k_{\mathrm{dof}}) \right)}_{\text{MPI messages}} + \underbrace{O\!\left( 2 \left\{ s + b + d \left( \left\lceil \frac{N}{p} \right\rceil - 1 \right) \right\} n^2 \right)}_{\text{read images}}, \qquad (5) $$

where msg(N) is the time to send a message of length N and k_dof is the number of DOFs (for affine, 12). Because the messages are short, approximately 8 kB for N = 100, the messages are latency bound instead of bandwidth bound, and minimal compared with the computational cost.

We implemented these parallel algorithms using MPI (MPI Forum, 2009), to run in a variety of parallel and distributed computing environments. To test the scalability of our algorithms, we perform DVC with varying numbers of processors on Keeneland, a cluster with 120 nodes, each with two 6-core Intel X5660 CPUs and three NVIDIA M2090 GPUs (Vetter et al., 2011). We test three different problem sizes, all of which cover a 552 × 552 × 552 voxel region of a 992 × 1013 × 994 image and use a 41³ subset size. The small problem size uses a 20-voxel grid spacing, yielding a 26 × 26 × 26 grid with 17,576 correlation points. The medium problem size uses a 10-voxel grid spacing, yielding a 51 × 51 × 51 grid with 132,651 correlation points. The large problem size uses a 5-voxel grid spacing, achieving our target resolution with a 101 × 101 × 101 grid with 1,030,301 correlation points.

In Figure 8(a) we plot in log–log scale the wall clock time versus number of processors. A slope of negative one, shown by the triangle and black lines, indicates linear speedup. Both the static (dashed blue lines) and dynamic (solid red lines) load balancing schemes come close to linear speedup. In all cases the dynamic scheme is faster than the static scheme.

Figure 8(b) plots the parallel efficiency, defined as

$$ e_p = \frac{\text{serial cost}}{\text{parallel cost}} = \frac{t_1}{p\, t_p}, $$

where t_p is the wall clock time with p processors. We estimate the serial time by summing the computation time for the static load balancing scheme, excluding any idle time. The static scheme gradually loses efficiency with more processors, while the dynamic scheme is better able to maintain a high efficiency. For the small problem size, both schemes start to deviate from a straight line in Figure 8(a), which is also seen in Figure 8(b) as a reduction in parallel efficiency, due to the higher proportion of parallel overhead compared with larger problems. For instance, processors assigned points in the same plane will read images redundantly, accounted for by the ceiling in Equation (5).

For the medium and large problem sizes, the dynamic scheme maintains 98% efficiency up to the maximum number of processors tested, while the efficiency with the static scheme drops to 80%. For the large problem with 96 processors, the dynamic scheme is 24% faster than the static scheme. The largest problem, which is our target million-point correlation grid, would take an estimated 77.6 hours on a single CPU core. The dynamic scheme solves it in 50 minutes with 96 processors; the static scheme in 62 minutes. Since a pair of CT scans typically takes 1–2 hours, this type of coarse-grained parallelism achieves our goal of computing high-resolution DVC in time commensurate with the image acquisition time. However, the performance can be improved even further as described in the next section.

Figure 7. Idle time and communication for DVC with three processes using (a) static and (b) dynamic load balancing schemes. Gray blocks at end are idle time, blue is evaluating objective function inside BFGS, red is reading images. Orange lines between master and another process indicate communication; master is nearly always idle in MPI_Waitsome (yellow).


5. Fine-grained parallel computing with GPU

A complementary parallelization is to compute the objective function c(u) itself in parallel. Recall that the objective function (1) evaluates the subset in the deformed image, g(x(u)), for all voxels x in the current subset. Computing the objective function and its gradient in parallel entails computing the spline interpolation of g(x(u)) at every voxel in parallel, then doing parallel reductions to sum the objective function and its derivatives. While the computation to communication ratio is insufficient to do this efficiently with loosely coupled distributed computing such as MPI, modern GPUs provide a lightweight thread model that is ideal for this type of data-parallel computation.

With the advent of the CUDA (NVIDIA, 2007) and OpenCL (Munshi, 2009) languages to allow general-purpose GPU programming, GPUs have been used to accelerate many applications, including image processing (Park et al., 2011), FFT (NVIDIA, 2013), linear algebra (Tomov et al., 2010), and other scientific applications (Lee et al., 2010; Hwu, 2011). In this work, we use the CUDA extensions to C++ developed by NVIDIA. In CUDA, a parallel computation is decomposed into a 3D grid of blocks, which are executed asynchronously on the GPU. Each block is further decomposed into a 3D grid of threads, all of which execute the same kernel function in lockstep on different pieces of data, in SIMD (single instruction, multiple data) fashion. A GPU uses multiple blocks to hide memory latency. While one block is waiting for a memory read to complete, another block can be executing. Having a sufficient number of blocks and amount of computation within each block to hide memory latency is therefore important to achieving high performance.

For DVC, we assign each thread to compute one point in the subset, and create blocks by tiling each slice of the subset with 2D tiles, as illustrated in Figure 9 and detailed in Algorithms 3 and 4. For instance, using 16 × 16 blocks to cover a 31³ subset results in a 2 × 2 × 31 grid of blocks, for a total of 124 blocks, each containing 256 threads. Since 31 is not evenly divisible by 16, tiles along two edges will have one row or column that is outside the subset; results for these points are set to zero. We copy the subset of the reference image f and the spline coefficients of the deformed image g, of size s³ and (s+2b+2)³ floats, respectively, to 3D textures on the GPU to take advantage of texture caching. These do not need to be copied for each objective function evaluation, but only when starting a new subset or if the spline is recomputed. The vector u is copied to the GPU's constant memory, which is also cached.

Figure 8. (a) Strong parallel scaling in log–log scale for three problem sizes (26³, 51³, and 101³ grids, with static and dynamic scheduling). Solid black lines show linear speedup. Weak scaling is also demonstrated by some data points, as 51³ ≈ 8(26³) and 101³ ≈ 8(51³). (b) Parallel efficiency for the same tests.

Figure 9. Tiles (small slabs) covering an image subset, showing the decomposition of computation on GPU. Since the subset size is not a multiple of the tile size, tiles extend beyond the subset boundary.


Using the B-spline basis, each m × n tile uses approximately 4(m+3)(n+3) coefficients, depending on the deformation u. For each point, evaluating the tricubic spline and its first partial derivatives requires 526 flops. Thus for an 8 × 8 tile, the computation-to-memory ratio is 69 flops/float. This large ratio allows for overlapping computation and memory reads to hide memory latency, yielding good performance on the GPU.

After computing the spline g(x(u)) and its derivatives at each point, each block does a series of standard parallel sum reductions (Kirk and Hwu, 2010) to compute partial sums of the objective function c(u) and its derivatives within each block. Because CUDA does not provide synchronization between blocks, a second CUDA kernel with only one thread block makes a final summation of the partial sums from each block, and the results are copied back to the CPU.

Currently, other portions of the DVC computation remain on the CPU. Particularly, the BFGS iteration itself is computed on the CPU, because the small problem size, only 12 DOFs, does not have sufficient parallelism to take advantage of a GPU. Over 99% of the time for BFGS is in the c(u) and ∇c(u) evaluations, which are GPU accelerated. The coarse search is also performed on the CPU, because it takes only 5% of the time, and has only a couple of operations per memory read, so the cost of transferring data to the GPU would outweigh any benefits. The FFT search could be accelerated on the GPU, but as it is invoked for only t = 1.7% of subsets, the benefits would be negligible. The best candidate for further acceleration using the GPU is computing the B-spline coefficients, which involves solving three systems with an (s+2b+2) × [(s+2b+2)²] RHS matrix and transposing the RHS after each solve. Benchmarking just these two operations using the MAGMA linear algebra library (Tomov et al., 2010), for s = 41, shows a potential speedup of 1.8, from 7.8 ms using the CPU to 4.3 ms using the GPU. Ruijters and Thevenaz (2010) also developed an algorithm to compute B-spline coefficients on the GPU.


Computing splines on the GPU would, however, also increase contention on the GPU, so the potential impact on overall performance is unclear. Leaving some computation on the CPU allows time for other processes to share the same GPU.

While this GPU code parallelizes the computation of the objective function for a single subset, we also combined it with our MPI-based DVC, making an implementation that exploits multiple levels of parallelism, simultaneously doing both coarse-grained parallelism across multiple CPU cores and, for each CPU core, fine-grained parallelism using a GPU.

We use single precision floating point with CUDA, since it is supported on all GPU cards and has twice the performance of double precision floating point on recent cards such as the M2090 (NVIDIA, 2011). The GPU algorithm also computes in a different order than the CPU algorithm: the parallel sum reduction adds terms together in a hierarchical tree fashion, which tends to yield more accurate results since numbers of similar magnitude are summed at each step, as compared with a serial implementation that sums terms into a single accumulator. To assess the impact of single precision and algorithmic changes, we compare DVC solutions computed using single and double precision. Global error properties are identical between the solutions: both have 1.45% error and a standard deviation in displacement of 0.042 voxels. Differences between the computed displacements, u_single − u_double, are small: 99.7% of points are within 0.002 voxels, much less than the standard deviation of 0.042 voxels, and all points are within 0.015 voxels. We conclude that single precision is sufficient to compute DVC to within its experimental accuracy limits.

Building on the parallel complexity in Equation (5), the complexity for the GPU parallel algorithm is

$$ C_{\mathrm{gpu}} = C_{\mathrm{par}} + \underbrace{O\!\left( \frac{N^3}{p} \left\{ s^3 + (s + 2b)^3 + k_{\mathrm{bfgs}} k_{\mathrm{dof}} \right\} \right)}_{\text{CPU} \leftrightarrow \text{GPU data transfer}}. \qquad (6) $$

Recall that the speedup of the GPU implementation over the CPU implementation of the objective function is already represented in C_par by a in (3).

To make a fair comparison, both CPU and GPU implementations should be optimized. In our case, the inner loops of the CPU and GPU implementations are identical. In both cases, the four summations for g, g_x, g_y, g_z, shown in Algorithm 4, are merged into a single set of triple nested loops over k, j, i. On the CPU, the Intel icc compiler with -xsse4 optimization is able to partially vectorize these loops. However, because of the slightly irregular memory access pattern due to the affine shape function (2), and loading data from unaligned addresses, the vectorization efficiency is limited. A single CPU core achieved 3.56 Gflop/s evaluating the objective function, independent of subset size. For comparison, the optimized matrix–vector product (sgemv) in Intel's MKL achieves 6.0 Gflop/s; sgemv has fewer operations than spline interpolation, but a more regular data access pattern that can be vectorized well.

We first test the performance of the GPU for computing the objective function itself, apart from other parts of our DVC code. Here we define speedup as the ratio of the computation time using only a CPU core to the computation time for the GPU algorithm (in both cases using single precision). We achieve a maximum speedup of 43.5 using a GPU, as shown in Figure 10. However, the speedup is variable and depends on the machine, tile size, and subset size. In general, larger subset sizes have a larger speedup. A decrease in performance is seen when a multiple of the tile size is exceeded, requiring another row or column of tiles. For instance, 8 × 8, 8 × 16, and 16 × 16 tile sizes all show a decrease in speedup when the subset size reaches 49³ voxels. Medium size tiles with 64 to 128 threads (4 × 16, 4 × 32, 8 × 8, 8 × 16) have the best performance, with speedups from 21.4 to 43.5, while small tiles and large tiles (4 × 4, 16 × 32) show lesser speedups.

Figure 10. GPU performance compared with one CPU core, for evaluating objective function with various subset and tile sizes (4 × 4 through 16 × 32). CPU achieves 3.56 Gflop/s, independent of subset size. We chose 8 × 8 as the best overall tile size.


For subsequent tests, we chose the 8 × 8 tile size because it performed well consistently, being the first or second fastest for all subset sizes. As we accounted for CPU-to-GPU communication separately in our performance model, and the frequency of communication depends on the number of function evaluations during BFGS, this speedup does not include communication. For instance, with s = 41, copying f and g to the GPU takes 0.71 ms (9.8 Gbit/s), while evaluating c(u) and ∇c(u) an average of 19 times during BFGS takes 5.47 ms on the GPU and 193.33 ms on the CPU. Thus including communication reduces the speedup from 35.3 to 31.5.

Because we are not fully utilizing the GPU, multiple CPU cores can share a single GPU and all gain a speedup. To illustrate, we perform a small weak scaling experiment with static scheduling, where each CPU core computes DVC for one slice of a 26 × 26 × p correlation grid. Profiles of process 0 are shown in Figure 11 for various numbers of processors and GPUs. Since process 0 performs the same computation in all cases, any increase in time can be attributed to resource contention and parallel overhead. For the CPU-only code, the time is dominated by BFGS, which incurs only a small increase in time with increasing cores. For more than one core, idle time is introduced as this process waits for other processes to finish. Reading images from disk, the coarse search, FFT search, and computing B-spline coefficients are all a small percentage of the overall time. For one GPU, BFGS is up to 24.9 times faster than the CPU-only version, while the overall computation is accelerated by 8.7 times, a consequence of Amdahl's law (Amdahl, 1967): we parallelize one portion of the application, but the maximum speedup achievable is limited by the amount of time spent in the remaining serial code. As mentioned previously, some of this serial code is also amenable to computation with the GPU, with the B-spline computation being the largest and best candidate. With one GPU, as the number of CPU cores increases, the BFGS time increases modestly up to four cores, then increases more sharply for larger numbers of cores, due to contention on the GPU. The time to copy data from the CPU to the GPU also increases, but is still a small percentage of the overall time, indicating that PCIe bandwidth is not a bottleneck. For two GPUs, at most six cores use each GPU, so the increase in time is much less significant, while for three GPUs, at most four cores use each GPU, so the increase is slight.

To investigate the scalability of our algorithm, we perform a strong scaling experiment using multiple nodes of Keeneland. As shown in Figure 12, for the 101³ grid size, we achieve speedups of 3.5, 6.5, and 8.0 times using one, two, and three GPUs, respectively, per 12-core node. Again, we see that adding a second GPU alleviates contention, increasing overall performance 47% compared with one GPU, while adding a third GPU per node has diminished returns, adding only 10%. We also observe near-perfect linear scaling for both the CPU-only and GPU codes. As before, the dynamic load balancing scheme achieves 98% parallel efficiency for CPU-only code, and 88% or better parallel efficiency using GPUs.

Figure 12. Strong parallel scaling in log–log scale for two problem sizes (51³ and 101³ grids, CPU only and one to three GPUs per node). Using one, two, and three GPUs per node achieves 3.5, 6.5, and 8.0 times speedup, respectively, compared with CPUs alone for the 101³ grid size.

6. Application

While this work is primarily concerned with the parallel implementation and performance, we also want to emphasize that our method maintains high accuracy. We demonstrate the effectiveness of our enhanced, high-performance DVC code to detect a 3D deformation field based on a particular elastic solution. We take two consecutive scans, called baseline images, using the same settings and without moving (other than the tomograph rotation) or deforming the sample. Ideally, the displacement field for a pair of baseline images is zero everywhere, but scanning will introduce experimental noise.

Figure 11. Time for one process of weak scaling experiment with static scheduling (CPU only and one, two, or three GPUs; time broken down into idle, read, coarse search, FFT, spline, GPU copy, and BFGS), in all cases performing exactly the same computation on slice z = 240. Changes in time with more processes are therefore due to resource contention and parallel overhead.


using a known analytical solution for this particular

instance. The sample material is a ceramic foam, pictured

in Figure 1. It is scanned with an Xradia MicroCT scanner

at 10 m m resolution, resulting in a 992� 1013� 994 voxel

3D image.

For strain-inducing deformations, we use the 3D defor-

mation field around a rigid spherical inclusion of radius R

in a linear elastic material under uniaxial tension. The displa-

cement field for this problem was derived by Goodier

(1933). We apply an artificial deformation to the first base-

line image using the analytical solution with a sphere radius

R of 50 voxels (0.5 mm), then compute DVC between this

artificially deformed image and the second baseline image.

By using an artificial deformation, we have the exact solu-

tion to compare with and are better able to assess the accu-

racy of the DVC method, apart from additional uncertainties

inherent in using a load frame to perform in situ load tests.

We compute the DVC solution over a 552³ voxel region with a 41³ subset using our target 101³ grid with over a million correlation points.
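Applying the artificial deformation amounts to resampling the baseline volume at coordinates shifted by the analytical displacement field. The sketch below illustrates one way to do this with cubic interpolation; goodier_displacement is a placeholder for the closed-form solution of Goodier (1933), which is not reproduced here, and the warping code actually used in this work may differ.

import numpy as np
from scipy.ndimage import map_coordinates

def goodier_displacement(z, y, x):
    """Placeholder for the analytical displacement field (u_z, u_y, u_x) around
    a rigid spherical inclusion under uniaxial tension (Goodier, 1933).
    The closed-form expressions are not reproduced here."""
    return np.zeros_like(z), np.zeros_like(y), np.zeros_like(x)

def apply_artificial_deformation(volume, center):
    """Warp a 3D image by sampling it at analytically displaced coordinates
    using cubic interpolation. For images as large as those used here, this
    would be done block-wise in practice to limit memory use."""
    grid = np.indices(volume.shape, dtype=np.float64)  # z, y, x voxel coordinates
    uz, uy, ux = goodier_displacement(grid[0] - center[0],
                                      grid[1] - center[1],
                                      grid[2] - center[2])
    # Sample at x - u(x); the sign convention depends on whether displacements
    # are expressed in the reference or the deformed configuration.
    coords = np.stack([grid[0] - uz, grid[1] - uy, grid[2] - ux])
    return map_coordinates(volume, coords, order=3, mode='nearest')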

Figure 13 shows a vector plot of the 3D displacement

field measured by DVC around the sphere, viewed from just

off the z-axis so that effectively we see just the u and v dis-

placements. Tension is applied parallel to the x-axis, and the

Poisson effect causes compression in the y and z directions.

The distortion near the sphere is evident: within each row,

displacements near the sphere are smaller than those farther

from the sphere, approaching zero displacement at the sur-

face of the sphere. Overall, we achieve a 1.11% relative L2

error in the u displacement, with a mean absolute error of

0.06 voxels in u. We also apply a least-squares fit to deter-

mine the applied tension T (Gates et al., 2011), which we are

able to measure with 0.06% error.
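The error measures reported above can be computed from the measured and exact displacement fields as follows. This is a minimal sketch of the standard definitions; the exact normalization used in this work may differ slightly.

import numpy as np

def relative_l2_error(u_measured, u_exact):
    """||u_measured - u_exact||_2 / ||u_exact||_2 over all correlation points."""
    diff = np.ravel(u_measured) - np.ravel(u_exact)
    return np.linalg.norm(diff) / np.linalg.norm(np.ravel(u_exact))

def mean_absolute_error(u_measured, u_exact):
    """Mean of |u_measured - u_exact|, in voxels, over all correlation points."""
    return np.mean(np.abs(np.ravel(u_measured) - np.ravel(u_exact)))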

7. Conclusion

We were able to achieve our goal of computing accurate,

high-resolution DVC in time commensurate with the image

acquisition time. Each scan can take up to an hour. We

performed DVC on a 101³ grid with a 41³ subset size in

50 minutes using 96 processors. Using significantly less

hardware (and energy), the GPU-accelerated version

completed the same computation in 50 minutes with just one node (12 cores and 3 GPUs), and in 6.2 minutes using 8

nodes (96 cores and 24 GPUs). Using GPUs accelerated the

computation by up to 8 times, compared to using CPUs

alone. Both the CPU and GPU versions are highly scalable,

work within memory constraints, and achieve 88% or better

parallel efficiency on up to 96 cores. This DVC algorithm

provides the capability to perform high-resolution DVC on

large 3D images obtained from modern µCT scanners.

Acknowledgments

We are grateful to Christian Espinoza and Professor W Kriven

of the Department of Materials Science and Engineering

at the University of Illinois for providing the ceramic foam

sample, and to Charles Mark Bee, Leilei Yin, and the Ima-

ging Technology Group at the Beckman Institute for use of

the Xradia MicroCT scanner.

Funding

This work was supported by the Center for Simulation of

Advanced Rockets under contract by the U.S. Department

of Energy (contract number B523819), the Institute for

Advanced Computing Applications and Technologies, and

the University of Illinois Campus Research Board (award

number 09084).

References

Amdahl GM (1967) Validity of the single processor approach to

achieving large scale computing capabilities. In: Proceedings

of the April 18–20, 1967, Spring Joint Computer Conference.

New York, NY: ACM Press, pp. 483–485.

Figure 12. Strong parallel scaling in log–log scale for two problem sizes. Using one, two, and three GPUs per node achieves 3.5, 6.5, and 8.0 times speedup, respectively, compared with CPUs alone for the 101³ grid size. (x-axis: CPU cores; y-axis: time (minutes); series: 101³ and 51³ grids with CPU only and one to three GPUs per node, plus a linear-speedup reference.)

Figure 13. Arrow plot of displacements measured with DVC around the spherical inclusion of radius R.


Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J,

et al. (1999) LAPACK Users’ Guide, 3rd edn. Philadelphia, PA:

SIAM Press. Available at: http://www.netlib.org/lapack/lug/

Bay BK, Smith TS, Fyhrie DP and Saad M (1999) Digital volume

correlation: Three-dimensional strain mapping using X-ray

tomography. Experimental Mechanics 39: 217–226.

Buis PE and Dyksen WR (1996) Efficient vector and parallel

manipulation of tensor products. ACM Transactions on Math-

ematical Software 22(1): 18–23.

Chan A, Gropp W and Lusk E (2008) An efficient format for

nearly constant-time access to arbitrary time intervals in large

trace files. Scientific Programming 16(2–3): 155–165.

Chu T, Ranson W, Sutton M and Peters W (1985) Applications of

digital-image-correlation techniques to experimental mechanics.

Experimental Mechanics 25: 232–244.

Dierckx P (1993) Curve and Surface Fitting with Splines. Oxford:

Oxford University Press.

Efstathiou C, Sehitoglu H and Lambros J (2010) Multiscale strain

measurements of plastically deforming polycrystalline tita-

nium: Role of deformation heterogeneities. International

Journal of Plasticity 26: 93–106.

Fletcher R (1987) Practical Methods of Optimization, 2nd edn.

New York: Wiley.

Fluck O, Vetter C, Wein W, Kamen A, Preim B and Westermann R (2011) A survey of medical image registration on graphics hardware. Computer Methods and Programs in Biomedicine 104: e45–e57.

Forsberg F, Sjodahl M, Mooser R, Hack E and Wyss P (2010) Full

three-dimensional strain measurements on wood exposed to

three-point bending: Analysis by use of digital volume corre-

lation applied to synchrotron radiation micro-computed tomo-

graphy image data. Strain 46: 47–60.

Franck C, Hong S, Maskarinec S, Tirrell D and Ravichandran G

(2007) Three-dimensional full-field measurements of large

deformations in soft materials using confocal microscopy

and digital volume correlation. Experimental Mechanics

47: 427–438.

Franck C, Maskarinec SA, Tirrell DA and Ravichandran G (2011)

Three-dimensional traction force microscopy: a new tool for

quantifying cell-matrix interactions. PLoS ONE 6: e17833.

Frigo M and Johnson S (2005) The design and implementation of

FFTW3. Proceedings of the IEEE 93(2): 216–231.

Gates M, Lambros J and Heath MT (2011) Towards high perfor-

mance digital volume correlation. Experimental Mechanics

51: 491–507.

Germaneau A, Doumalin P and Dupre JC (2007) 3D strain field mea-

surement by correlation of volume images using scattered light:

Recording of images and choice of marks. Strain 43: 207–218.

Glocker B, Sotiras A, Komodakis N and Paragios N (2011)

Deformable medical image registration: Setting the state of the

art with discrete methods. Annual Review of Biomedical Engi-

neering 13: 219–244.

Gonzalez RC and Woods RE (2007) Digital Image Processing, 3rd

edn. Englewood Cliffs, NJ: Prentice-Hall.

Goodier JN (1933) Concentration of stress around spherical and

cylindrical inclusions and flaws. Journal of Applied

Mechanics 55(7): 39–44.

Haldrup K, Nielsen S, Mishnaevsky L Jr, Beckman F and Wert J

(2006) 3-dimensional strain fields from tomographic measure-

ments. Proceedings of SPIE 6318: 63181B.

Hall SA, Bornert M, Desrues J, et al. (2010) Discrete and conti-

nuum analysis of localised deformation in sand using X-ray

µCT and volumetric digital image correlation. Geotechnique

60(5): 315–322.

Hild F, Roux S, Bernard D, Hauss G and Rebai M (2013) On the

use of 3D images and 3D displacement measurements for the

analysis of damage mechanisms in concrete-like materials. In:

VIII International Conference on Fracture Mechanics of Con-

crete and Concrete Structures, pp. 1–12.

Hwu W (2011) GPU Computing Gems Emerald Edition. San

Mateo, CA: Morgan Kaufmann.

Kirk DB and Hwu WMW (2010) Programming Massively Paral-

lel Processors: A Hands-on Approach. San Mateo, CA: Mor-

gan Kaufmann.

Klein S, Staring M, Murphy K, Viergever MA and Pluim JPW

(2010) elastix: A toolbox for intensity-based medical image

registration. IEEE Transactions on Medical Imaging 29:

196–205.

Leclerc H, Perie JN, Hild F and Roux S (2012) Digital volume

correlation: what are the limits to the spatial resolution?

Mechanics and Industry 13: 361–371.

Lee VW, Kim C, Chhugani J, et al. (2010) Debunking the 100x

GPU vs. CPU myth: An evaluation of throughput computing

on CPU and GPU. ACM SIGARCH Computer Architecture

News 38: 451–460.

Lenoir N, Bornert M, Desrues J, Besuelle P and Viggiani G (2007)

Volumetric digital image correlation applied to X-ray microto-

mography images from triaxial compression tests on argillac-

eous rock. Strain 43: 193–205.

Limodin N, Rethore J, Adrien J, Buffiere JY, Hild F and Roux S

(2011) Analysis and artifact correction for volume correlation

measurements using tomographic images from a laboratory X-

ray source. Experimental Mechanics 51: 959–970.

Liu C (2005) On the minimum size of representative volume ele-

ment: An experimental investigation. Experimental Mechanics

45(3): 238–243.

Liu L and Morgan EF (2007) Accuracy and precision of digi-

tal volume correlation in quantifying displacements and

strains in trabecular bone. Journal of Biomechanics 40:

3516–3520.

Modersitzki J (2009) FAIR: Flexible Algorithms for Image Regis-

tration. Philadelphia, PA: SIAM.

Morgeneyer T, Helfen L, Mubarak H and Hild F (2012) 3D digital

volume correlation of synchrotron radiation laminography

images of ductile crack initiation: An initial feasibility study.

Experimental Mechanics, in press.

Mostafavi M, McDonald S, Mummery P and Marrow T (2013)

Observation and quantification of three-dimensional crack

propagation in poly-granular graphite. Engineering Fracture

Mechanics, in press.

Moulart R, Rotinat R and Pierron F (2009) Full-field evaluation of

the onset of microplasticity in a steel specimen. Mechanics of

Materials 41: 1207–1222.

MPI Forum (2009) MPI: A message-passing interface standard.


Munshi A (ed.) (2009) The OpenCL Specification. Khronos

OpenCL Working Group.

NVIDIA (2007) NVIDIA CUDA C Programming Guide.

NVIDIA (2011) Tesla M-class GPU computing modules acceler-

ating science.

NVIDIA (2013) CUFFT library.

Park IK, Singhal N, Lee MH, Cho S and Kim CW (2011) Design

and performance evaluation of image processing algorithms

on GPUs. IEEE Transactions On Parallel and Distributed Sys-

tems 22: 91–104.

Rannou J, Limodin N, Rethore J, et al. (2010) Three dimensional

experimental and numerical multiscale analysis of a fatigue

crack. Computer Methods in Applied Mechanics and

Engineering 199: 1307–1325.

Rossi M and Pierron F (2011) Identification of plastic constitutive

parameters at large deformations from three dimensional dis-

placement fields. Computational Mechanics 49: 53–71.

Roux S, Hild F, Viot P and Bernard D (2008) Three-dimensional

image correlation from X-ray computed tomography of solid

foam. Composites: Part A 39: 1253–1265.

Ruijters D and Thevenaz P (2010) GPU prefilter for accurate

cubic B-spline interpolation. The Computer Journal 55:

15–20.

Saxena V, Rohrer J and Gong L (2010) A parallel GPU algorithm

for mutual information based 3D nonrigid image registration.

In: Euro-Par 2010 Parallel Processing. Berlin: Springer.

Shams R, Sadeghi P, Kennedy RA and Hartley RI (2010) A sur-

vey of medical image registration on multicore and the GPU.

IEEE Signal Processing Magazine 27: 50–60.

Shende S and Malony AD (2006) The TAU parallel performance

system. International Journal of High Performance Comput-

ing Applications 20: 287–311.

Sjodahl M, Siviour CR and Forsberg F (2012) Digital volume cor-

relation applied to compaction of granular materials. Procedia

IUTAM 4: 179–195.

Smith TS, Bay BK and Rashid MM (2002) Digital volume

correlation including rotational degrees of freedom during

minimization. Experimental Mechanics 42: 272–278.

Sutton MA, Orteu JJ and Schreier H (2009) Image Correlation for

Shape, Motion and Deformation Measurements. New York:

Springer.

Tomov S, Nath R, Ltaief H and Dongarra J (2010) Dense linear

algebra solvers for multicore with GPU accelerators. In:

2010 IEEE International Symposium on Parallel & Distribu-

ted Processing, Workshops and PhD Forum (IPDPSW).

Piscataway, NJ: IEEE, pp. 1–8.

Verhulp E, van Rietbergen B and Huiskes R (2004) A three-

dimensional digital image correlation technique for strain

measurements in microstructures. Journal of Biomechanics

37: 1313–1320.

Vetter J, Glassbrook R, Dongarra J, et al. (2011) Keeneland:

Bringing heterogeneous GPU computing to the computational

science community. IEEE Computing in Science and Engi-

neering 13: 90–95.

Yang Z, Ren W, Mostafavi M, Mcdonald SA and Marrow TJ

(2013) Characterisation of 3D fracture evolution in concrete

using in-situ X-ray computed tomography testing and digital

volume correlation. In: VIII International Conference on Frac-

ture Mechanics of Concrete and Concrete Structures, pp. 1–7.

Zauel R, Yeni YN, Bay BK, Dong XN and Fyhrie DP (2006)

Comparison of the linear finite element prediction of deforma-

tion and strain of human cancellous bone to 3D digital volume

correlation measurements. Journal of Biomechanical Engi-

neering 128: 1–6.

Zitova B and Flusser J (2003) Image registration methods: a sur-

vey. Image and Vision Computing 21: 977–1000.

Author biographies

Mark Gates is a research scientist at the University of

Tennessee, where he works on developing the PLASMA

and MAGMA software libraries for linear algebra on

multi-core and GPU-based computers. He received his

B.S., M.S., and Ph.D. in Computer Science from the Univer-

sity of Illinois at Urbana-Champaign, in 1998, 2007, and

2011, respectively. He also worked at NCSA from 1998 to 2000 on optimizing applications for high-performance

wide-area networks. His research interests are in high-

performance computing, particularly linear algebra and

GPU computing.

Michael T Heath is Professor and Fulton Watson Copp

Chair in the Department of Computer Science at the Uni-

versity of Illinois at Urbana-Champaign. He received his

Ph.D. in Computer Science from Stanford University in

1978. His research interests are in scientific computing,

particularly numerical linear algebra and optimization. He

is an ACM Fellow, SIAM Fellow, Associate Fellow of the

AIAA, and a member of the European Academy of

Sciences. He received the Taylor L. Booth Education

Award from the IEEE Computer Society in 2009, and is

author of the widely adopted textbook Scientific Comput-

ing: An Introductory Survey, 2nd edition, published by

McGraw-Hill in 2002.

John Lambros is Professor and Associate Head for the

Graduate Program of Aerospace Engineering at the Univer-

sity of Illinois at Urbana-Champaign, where he is also Princi-

pal Investigator and Director of the Stress Wave Mitigation

Center. He received a B.Eng. in Aeronautical Engineering

from Imperial College in 1988, and an M.S. and Ph.D. in Aero-

nautics from California Institute of Technology in 1989 and

1994. His research interests are in the mechanical characterization of material response across multiple time and length scales under static and dynamic impact or fatigue loading.
