International Journal of High Performance Computing Applications
Published online 17 January 2014. DOI: 10.1177/1094342013518807
Online version: http://hpc.sagepub.com/content/early/2014/01/15/1094342013518807
Article
High-performance hybrid CPU and GPU parallel algorithm for digital volume correlation
Mark Gates1, Michael T Heath2 and John Lambros3
Abstract
We present a hybrid Message Passing Interface (MPI) and graphics processing unit (GPU)-based parallel digital volume correlation (DVC) algorithm for measuring three-dimensional (3D) displacement and strain fields inside a material undergoing motion or deformation. Our algorithm achieves resolution comparable to that achieved in two-dimensional (2D) digital image correlation (DIC), in time that is commensurate with the image acquisition time, in this case using micro-computed tomography (μCT) for scanning images. For DVC, the volume of data and number of correlation points both grow cubically with the linear dimensions of the image. We turn to parallel computing to gain sufficient processing power to scale to high resolution, and are able to achieve more than an order-of-magnitude increase in resolution compared with previous efforts that are not based on a parallel framework.
Keywords
digital volume correlation, X-ray tomography, strain measurement, parallel computing, GPU computing, image registration
1. Introduction
Three-dimensional (3D) digital volume correlation (DVC)
is a technique used to measure 3D displacement and strain
fields throughout the interior of a material that has been
imaged using 3D techniques such as X-ray micro-computed
tomography (μCT) or confocal microscopy. DVC is an
extension of two-dimensional (2D) digital image correla-
tion (DIC) (Chu et al., 1985), which is widely used in
experimental mechanics to measure surface displacements
(Sutton et al., 2009). DVC provides a useful experimental
complement to 3D numerical simulations such as finite-
element analysis (FEA). For example, DVC can generate
the full-field 3D experimental results required to compare
with 3D finite-element simulations for validation purposes
(Zauel et al., 2006; Mostafavi et al., 2013), results that are
otherwise difficult or impossible to obtain. In addition to
validation, DVC can also play a vital role in determining
input parameters for simulations (Liu, 2005; Moulart
et al., 2009; Efstathiou et al., 2010; Rossi and Pierron,
2011). DVC has been used to analyze materials with com-
plex microstructures, such as bone and other biological
materials (Bay et al., 1999; Smith et al., 2002; Zauel
et al., 2006; Liu and Morgan, 2007; Franck et al., 2011),
rock (Lenoir et al., 2007), and concrete (Hild et al., 2013;
Yang et al., 2013); as well as complex behaviors, such as
material fatigue (Rannou et al., 2010; Mostafavi et al.,
2013), which can be difficult to simulate. Thus, although
only a relatively small number of researchers have studied
DVC, it is a powerful tool with significant future potential.
1.1. Background
The DVC process starts with a material sample possessing
a random internal pattern of features that are detectable by
the 3D imaging device. Such an internal ‘‘speckle’’ pattern
can arise either from inherent internal material microstruc-
ture, such as in bone, or by manufacturing samples with
embedded particles (Haldrup et al., 2006; Franck et al.,
2007; Germaneau et al., 2007). As depicted in Figure
1(a), a reference 3D image of the undeformed sample is
captured, then a motion or deformation is applied, and a
1 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, USA
2 Department of Computer Science, University of Illinois at Urbana-Champaign, USA
3 Department of Aerospace Engineering, University of Illinois at Urbana-Champaign, USA

Corresponding author:
Mark Gates, Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, 1122 Volunteer Boulevard, Suite 203, Knoxville, TN 37996, USA.
Email: [email protected]

The International Journal of High Performance Computing Applications 1–15. © The Author(s) 2014. DOI: 10.1177/1094342013518807
3D image of the resulting deformed sample is captured
either under load (in situ) or after unloading (ex situ), as
schematically illustrated in Figure 1 (b). In Figure 1, the
3D images captured are indicated by 3 slices, out of a total
of approximately 1000 such tomographic image slices
through the material interior, each one revealing internal
pattern features. The 3D data are represented digitally as
intensity values at voxel locations of a 3D grid. We define
a grid of correlation points on the reference image, with a
3D subset surrounding each point. For each correlation
point, DVC determines the deformation that maps the
subset in the reference image (Figure 1 (a)) to the subset
in the deformed image (Figure 1 (b)) with the best correla-
tion. In Figure 1 this happens to be a rigid rotation about the
z-axis, which is illustrated in the DVC result of Figure 1 (c)
superposed on the reference image.
This mapping between the reference image and the
deformed image is given by a shape function, x(u), which
defines the degrees of freedom (DOFs) to be determined at
each point. Common mappings include translation (3
DOFs), rotation (6 DOFs), and affine (12 DOFs), with more
DOFs being more expensive to compute but generally
yielding more accurate results.
To determine the deformation u, DVC seeks the best
match between the subset in the reference image, f(x), and
the subset in the deformed image, g(x(u)), by optimizing
an objective function that measures their similarity, com-
puted as a summation over all voxels x in the subset. We
use the least-squares objective function,

$$ c(\mathbf{u}) = \frac{\sum_{\mathbf{x}} \big( f(\mathbf{x}) - g(\mathbf{x}(\mathbf{u})) \big)^2}{\sum_{\mathbf{x}} f(\mathbf{x})^2} = \frac{\lVert f - g \rVert^2}{\lVert f \rVert^2} . \tag{1} $$
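As an illustrative sketch (not the authors' C++ implementation), the least-squares objective of equation (1) can be written directly for flat lists of voxel intensities:

```python
def least_squares_objective(f, g):
    """Least-squares correlation objective c = ||f - g||^2 / ||f||^2.

    f, g: flat lists of voxel intensities for the reference and
    deformed subsets, sampled at corresponding points x and x(u).
    Returns 0 for a perfect match; larger values for worse matches.
    """
    if len(f) != len(g):
        raise ValueError("subsets must contain the same number of voxels")
    num = sum((fi - gi) ** 2 for fi, gi in zip(f, g))
    den = sum(fi ** 2 for fi in f)
    return num / den
```

Identical subsets give an objective of exactly zero, and the normalization by ‖f‖² makes the value insensitive to the overall intensity scale of the reference subset.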
The normalized cross-correlation function (Chu et al.,
1985) has also been used for DVC and is amenable to the
same parallelization techniques presented here. In general,
f and g do not exactly match, but we seek the best possible
match by finding the deformation u that optimizes the
objective function c(u).

DVC achieves sub-voxel accuracy, resulting in a
deformation u with non-integer coordinates, requiring inter-
polation to evaluate the deformed image g(x(u)) between
voxels. To achieve high-accuracy DVC results, we imple-
ment C² tricubic B-spline interpolation (Dierckx, 1993).
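As a small illustration of the interpolation machinery (a sketch of the standard uniform cubic B-spline, not code from the paper), interpolation along each axis reduces to four weights determined by the fractional coordinate; the tricubic case is the tensor product of three such weight sets:

```python
def cubic_bspline_weights(t):
    """Weights of the four nearest samples for uniform cubic
    B-spline interpolation at fractional offset t in [0, 1).
    The weights are non-negative and sum to 1 (partition of
    unity); the underlying basis is C^2 continuous, which is
    the smoothness property cited above."""
    w0 = (1 - t) ** 3 / 6.0
    w1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0
    w2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0
    w3 = t ** 3 / 6.0
    return (w0, w1, w2, w3)
```

At t = 0 the weights are (1/6, 4/6, 1/6, 0), so a B-spline does not interpolate the raw intensities directly; this is why the coefficients must first be computed by the prefiltering step described in Section 2.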
1.2. Related work
DVC has been developed by various research groups since
1999, as summarized in Table 1. The evolution has moved
toward higher-order shape functions, starting from transla-
tion, then rotation, then an affine transformation. Image
interpolation has generally been cubic (C¹) or cubic spline
(C²), although some larger studies have used less expensive
linear (C⁰) interpolation. Most DVC studies have been
relatively modest in size, computing fewer than 16,000
points. Some groups have computed up to 60,000 points
with linear image interpolation and lower-order shape
functions. Execution times have varied significantly, from
0.47 to 40 seconds per subset, depending on the images,
computer, subset size, and complexity of the method.
As an alternative to the local DVC described above,
Roux et al. (2008) introduced global DVC, where a
finite-element mesh is used to define a global mapping that
is optimized over the entire image simultaneously, instead
of computing individual subsets independently. Since the
entire image is coupled together, the parallelization of glo-
bal DVC would necessarily be different than the approach
used here for local DVC. Recently, Leclerc et al. (2012)
used a GPU to accelerate a pixel-based global DVC, in which
each element is a single pixel. In their approach, the GPU
computation is primarily linear algebra, solving the result-
ing system, which has several million DOFs.
Outside mechanical engineering, DIC and DVC share
much in common with image registration algorithms,
which seek to align images that either partially overlap,
or are taken at different times or with different imaging
techniques (multi-modal). Medical imaging in particular
uses both 2D and 3D image registration (Shams et al.,
2010; Flucka et al., 2011), for instance, fusing a CT scan
with an MRI scan, or aligning CT scans taken at different
times. A wide variety of techniques are used for registration
(Zitova and Flusser, 2003; Modersitzki, 2009), from track-
ing features to similarity measures based on pixel intensity,
such as least squares (i.e. sum-of-squared differences) or
mutual information. Several medical imaging researchers
have used GPUs to accelerate registration, typically accel-
erating computation of the objective function (Klein et al.,
2010; Saxena et al., 2010; Glocker et al., 2011). These have
had moderate resolution on images from 3 to 26 megavox-
els. While using similar techniques, the goal of image reg-
istration is different than DVC. Registration seeks to align
two images, either to fuse them into a new image or to com-
pare them directly. As such, it requires accuracy of approx-
imately one voxel. The time constraint may also be several
minutes or less, while a patient is undergoing an operation.
In contrast, DVC is concerned with the actual deformation
field and its derivatives, which yield strain. Thus, DVC
requires higher accuracy, on the order of 1/10 voxel,
measured at a large number of correlation points to give
a densely sampled displacement field.
1.3. Major contributions
Our goal is to perform DVC with resolution (i.e. density of
correlation points in each dimension) and accuracy compa-
rable to that achieved in 2D DIC, with a correlation time
that is commensurate with the image acquisition time.
Depending on the resolution, each CT scan can take an
hour, plus the time needed to reconstruct the 3D image.
Because of the vastly increased volume of data associated
with the reference and deformed images in 3D over 2D,
DVC requires substantially more computation and storage
than DIC to achieve similar resolution.
The present work builds upon our previous work (Gates
et al., 2011), which focused on optimizing the serial algo-
rithm, as summarized in Section 2. In particular, we previ-
ously examined the coarse search strategy to find the initial
u0, the choice of optimization algorithm, spline basis func-
tion, and use of smoothing splines. A cursory explanation
of our statically scheduled parallel DVC algorithm was also
given, which has been refined and expanded in the current
work.
In this work, we explain in more detail our data storage
that facilitates the parallel DVC algorithm, and analyze the
cost and scalability of various implementations of a parallel
DVC algorithm. We observe that only a small portion of the
image data is in use at any one time, so by careful memory
management we have developed an out-of-core algorithm
that scales to much larger images, described in Section 3.
In contrast with the modest resolutions previously achieved
with DVC, summarized in Table 1, a typical 2D DIC grid
has up to 100² correlation points, so a comparable resolu-
tion in 3D requires a grid with 1,000,000 points. Such high
resolution is required to resolve high strain gradients within
a material effectively. Reaching our goal of a 100³-point
correlation grid, at 0.25 seconds per correlation point,
would require 3 days of serial computing time. We turn
to parallel computing to gain sufficient processing power
to scale to these large problem sizes. We explore two forms
of parallelism: coarse-grained parallelism using MPI to
compute different correlation points simultaneously, dis-
cussed in Section 4, and fine-grained parallelism using
GPUs to compute the objective function for each correla-
tion point in parallel, discussed in Section 5. In addition
to a statically scheduled parallel scheme, we propose a
Figure 1. DVC applied to a 3D image with 10° rotation about the z-axis, showing 3 out of approximately 1000 images: (a) reference image; (b) deformed image; (c) displacement field. The subset in reference image (a) is mapped to the subset in deformed image (b). The displacement field (c) is computed on a 5 × 5 × 2 grid of correlation points. The subset size shown is 41³ voxels. Scale bar: 1 mm.
Table 1. Selection of prior work on DVC. A dash (—) denotes entries not reported in papers; * denotes entries using global DVC instead of local DVC.

Reference                 Material    Image interpolation  Shape function  Correlation points/elements
Bay et al. (1999)         Bone        Cubic                Translation     5500
Smith et al. (2002)       Bone        Cubic                Rotation        125
Verhulp et al. (2004)     Al foam     Cubic                Affine          2130
Zauel et al. (2006)       Bone        Cubic                Translation     —
Franck et al. (2007)      Agarose     N/A                  Axial strain    3375
Lenoir et al. (2007)      Rock        Linear               Translation     60,000
Roux et al. (2008)        Solid foam  —                    Affine          1331 *
Forsberg et al. (2010)    Wood        Cubic spline         Affine          960
Hall et al. (2010)        Sand        Linear               Rotation        50,000
Rannou et al. (2010)      Cast iron   —                    Affine + crack  729 *
Limodin et al. (2011)     Cast iron   Cubic spline         Affine          15,625 *
Gates et al. (2011)       PDMS        Cubic spline         Affine          59,000
Sjödahl et al. (2012)     Sugar       Cubic spline         Affine          8,300
Morgeneyer et al. (2012)  Al          Cubic spline         Affine          16,000 *
This work                 Ceramic     Cubic spline         Affine          1,030,000
dynamically scheduled master–worker scheme, which
exhibits superior efficiency and scalability. For our GPU
implementation, we highlight the importance of tuning to find the
best algorithm parameters, and demonstrate how multiple
CPU cores can effectively share a single GPU.
2. DVC algorithm
Before looking at the parallel implementation, we summarize
the serial implementation of DVC as detailed in Gates et al.
(2011), and analyze its computational cost. For added robust-
ness, here we have added a fast Fourier transform (FFT)
coarse search. Some choices made in anticipation of our
parallel algorithm are noted. The pseudocode is given in
Algorithm 2, in the context of our parallel implementation.
For mapping the reference image to the deformed
image, we use an affine shape function (Chu et al.,
1985), which defines 12 DOFs, namely the displacements
in each dimension and their first derivatives,
$$ \mathbf{u} = \left[\, u \;\; v \;\; w \;\; \frac{\partial u}{\partial x} \;\; \frac{\partial u}{\partial y} \;\; \frac{\partial u}{\partial z} \;\; \frac{\partial v}{\partial x} \;\; \frac{\partial v}{\partial y} \;\; \frac{\partial v}{\partial z} \;\; \frac{\partial w}{\partial x} \;\; \frac{\partial w}{\partial y} \;\; \frac{\partial w}{\partial z} \,\right]^{\mathsf{T}} . $$
As illustrated in Figure 2, for each correlation point we let
f(x) be the subset in the reference image and g(x(u)) be the
subset in the deformed image, with a point x = [x, y, z] in
the reference subset related to the corresponding point x(u)
in the deformed subset by the affine shape function

$$ \mathbf{x}(\mathbf{u}) = \begin{bmatrix} x + u + \dfrac{\partial u}{\partial x} x + \dfrac{\partial u}{\partial y} y + \dfrac{\partial u}{\partial z} z \\[6pt] y + v + \dfrac{\partial v}{\partial x} x + \dfrac{\partial v}{\partial y} y + \dfrac{\partial v}{\partial z} z \\[6pt] z + w + \dfrac{\partial w}{\partial x} x + \dfrac{\partial w}{\partial y} y + \dfrac{\partial w}{\partial z} z \end{bmatrix} , \tag{2} $$
where x; y; z form a local coordinate system within each
subset.
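A direct transcription of equation (2) can be sketched as follows (an illustrative Python version; the function and variable names are ours, not from the paper):

```python
def affine_map(u, x):
    """Apply the 12-DOF affine shape function of equation (2).

    u: [u, v, w, du/dx, du/dy, du/dz,
        dv/dx, dv/dy, dv/dz, dw/dx, dw/dy, dw/dz]
    x: local subset coordinates (x, y, z).
    Returns the mapped point x(u) in the deformed image.
    """
    xl, yl, zl = x
    (tu, tv, tw,
     ux, uy, uz,
     vx, vy, vz,
     wx, wy, wz) = u
    return (xl + tu + ux * xl + uy * yl + uz * zl,
            yl + tv + vx * xl + vy * yl + vz * zl,
            zl + tw + wx * xl + wy * yl + wz * zl)
```

With all twelve DOFs zero the map is the identity; with only the first three nonzero it is a pure translation, recovering the simpler 3-DOF shape function mentioned earlier.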
For each correlation point, the DVC algorithm first
determines an initial u0 to start the optimization procedure
from. This initial deformation must be close enough to the
global optimum for the optimization algorithm to converge,
typically within a few pixels. If neighboring correlation
points have been previously computed, we start with a
first-order extrapolation from those results to determine a
starting u0. Otherwise, we start with a user-specified displa-
cement, often zero. Then a coarse search algorithm samples
the objective function at all voxels in a search region
around u0, seeking an improved u0.
To enhance robustness of the coarse search, when it can-
not find a u0 such that c(u0) is below a user-defined thresh-
old, in this work we augment it with a second coarse search
over a much larger region. This uses the normalized cross-
correlation objective function,

$$ \tilde{c}(\mathbf{u}) = \frac{\sum_{\mathbf{x}} f(\mathbf{x})\, g(\mathbf{x}(\mathbf{u}))}{\left( \sum_{\mathbf{x}} f(\mathbf{x})^2 \, \sum_{\mathbf{x}} g(\mathbf{x}(\mathbf{u}))^2 \right)^{1/2}} = \frac{\langle f, g \rangle}{\lVert f \rVert \cdot \lVert g \rVert} , $$
which can be evaluated efficiently for all voxels in a
k_fft × k_fft × k_fft search region using the FFT (Gonzalez and
Richard, 2007) as

$$ \langle f, g \rangle = \mathrm{IFFT}\big( \overline{\mathrm{FFT}(f_{\mathrm{pad}})} \cdot \mathrm{FFT}(g_{\mathrm{pad}}) \big) $$

and

$$ \lVert g \rVert = \Big( \mathrm{IFFT}\big( \overline{\mathrm{FFT}(1_{\mathrm{pad}})} \cdot \mathrm{FFT}(g^2_{\mathrm{pad}}) \big) \Big)^{1/2} , $$

where the subset f is of size s³, g is of size (s + k_fft)³, 1 is the
unit function of size s³; f_pad, g_pad, g²_pad, and 1_pad are suitably
zero-padded to size (2(s−1) + k_fft)³. We use the imple-
mentation in the FFTW library (Frigo and Johnson, 2005),
and ensure that 2(s−1) + k_fft has prime factors of only
2, 3, 5, and 7 for best efficiency. While c̃(u) differs from
c(u), both have an optimum value where the images are
aligned, so c̃(u) provides a good coarse search for c(u). This
combination of coarse search algorithms enhances robust-
ness of the overall DVC process, leveraging results from
previous correlation points when available, but not requiring
such results, which may be computed by another parallel
process and therefore be unavailable.
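The padded FFT size selection described above can be sketched as follows (our illustrative helper, not part of FFTW or the paper's code):

```python
def is_7smooth(m):
    """True if m has no prime factors other than 2, 3, 5, 7."""
    for p in (2, 3, 5, 7):
        while m % p == 0:
            m //= p
    return m == 1

def padded_fft_size(s, k_fft):
    """Smallest size >= 2(s-1) + k_fft whose prime factors all lie
    in {2, 3, 5, 7}, so the zero-padded transforms run efficiently."""
    m = 2 * (s - 1) + k_fft
    while not is_7smooth(m):
        m += 1
    return m
```

For the parameters used later in the paper (s = 41, k_fft = 40), the minimum size 2·40 + 40 = 120 = 2³·3·5 is already 7-smooth, so no extra padding is needed.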
Once a suitable u0 has been found, a standard optimiza-
tion algorithm refines the solution u to sub-pixel accuracy.
Based on previous investigation (Gates et al., 2011), we use
the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algo-
rithm with a cubic line search due to Fletcher (1987),
although other algorithms such as Levenberg–Marquardt
have also been used (Bay et al., 1999). BFGS requires
evaluating the objective function c(u) and its gradient
∇c(u), but not its Hessian.
To evaluate c(u) between integer voxels, we use tricubic
B-spline interpolation. First, the B-spline coefficients must
be computed from the pixel intensities, sometimes known
as prefiltering. One potential scheme computes coefficients
for the entire n × n × n image, which involves solving three
banded (n+2) × (n+2) systems with an (n+2) × [(n+2)²]
right-hand side (RHS) matrix, transposing the
RHS after each solve (Dierckx, 1993; Buis and Dyksen,
1996). Using a banded solver, such as LAPACK's gbsv
(Anderson et al., 1999), the cost to solve for all the RHS is
O(3n³). A parallel implementation could distribute the RHS
among p processors, avoiding communication during each
Figure 2. Representation of reference subset f and deformed subset g with linear shape function.
solve but incurring three all-to-all communications, one after
each solve for the transpose, involving p(p−1) messages
of n³/p² floats each. The actual cost depends on the MPI
implementation and underlying network topology. This
becomes a preprocessing step that must be finished before
any correlation points can be computed.
Instead of computing spline coefficients over the entire
image, our scheme computes a spline over each subset,
plus a small padding region of b voxels. Before each
correlation function evaluation, we check that the bound-
ing box (bbox) for the deformed g subset (dashed line
in Figure 3) is still contained within this spline region,
and re-compute the spline over a new region if necessary.
For an N × N × N grid of correlation points, the cost
is O(3N³(s+2b)³). For a spacing of d voxels between sub-
sets, N ≈ n/d, yielding a total cost of O(3n³((s+2b)/d)³).

Unlike the previous scheme, our scheme has no commu-
nication. Asymptotically, both schemes have an O(n³)
cost, so comparison would require experiments to deter-
mine the constants involved. In our scheme, instead of
being an expensive preprocessing step, the spline com-
putations now occur throughout the computation. This
is a significant benefit when computing a small number
of correlation points that do not cover the entire image,
for instance with a small-scale analysis before comput-
ing a full analysis.
Evaluating the objective function is a sum over all voxels
in the subset, so it has complexity O(s³), with the constant
accounting for 1 flop/voxel if no interpolation is needed (as
in the coarse search), to 278 flops/voxel for cubic B-spline
interpolation. Evaluating ∇c with a spline is an additional
248 flops/voxel.
The overall complexity for each correlation point is thus

$$ C_{\mathrm{pt}} = \underbrace{O\big(k_c^3 s^3\big)}_{\text{coarse search}} + \underbrace{O\big(t\,(s + k_{\mathrm{fft}})^3 \log_2(s + k_{\mathrm{fft}})\big)}_{\text{FFT search}} + \underbrace{O\big(3(s + 2b)^3\big)}_{\text{B-spline coefficients}} + \underbrace{O\big(k_{\mathrm{bfgs}}\, s^3 / a\big)}_{\text{BFGS}} , \tag{3} $$
where s is the subset width (e.g. 41), b is the padding
around the subset (set to 5), k_c is the width of the coarse
search volume (on average, 3), t is the percentage of points
that invoke the FFT search (on average, 1.7%), k_fft is the
width of the FFT search region (set to 40), and k_bfgs is the
number of c(u) evaluations during BFGS (on average, 21.5). On
average, BFGS takes 5.6 iterations and the cubic line search
adds 2.8 evaluations per BFGS iteration. When using the
GPU implementation introduced in Section 5, a is the
speedup factor of the GPU implementation over the CPU
implementation. The total serial complexity for an
N × N × N correlation grid is

$$ N^3 C_{\mathrm{pt}} + \underbrace{O\big(2\,\{ s + b + d(N-1) \}\, n^2\big)}_{\text{read images}} . \tag{4} $$

The number of slices read for each image is bounded by the
image size, n, in which case this becomes N³ C_pt + O(2n³).
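The serial cost quoted in Section 1.3 follows from quick arithmetic (our back-of-the-envelope check, not code from the paper):

```python
# A 100^3-point correlation grid at roughly 0.25 s per point,
# as quoted in Section 1.3, on a single core:
points = 100 ** 3
seconds_per_point = 0.25
days = points * seconds_per_point / 86400.0  # seconds in a day
# days is about 2.9, matching the "3 days of serial computing
# time" figure that motivates the parallel algorithm.
```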
3. Data storage
To achieve our goal of computing high-resolution DVC in
an acceptable time, two interrelated issues must be
addressed: the first issue is how to store data so that we can
scale the problem to a large image size within the amount
of memory available to each processor; the second issue
is how to parallelize the DVC code to reduce the required
wall clock time.
For 2D DIC, processing two n × n images in memory
requires 2n² bytes for image data and (n+2)² floats for the
B-spline coefficients used for image interpolation, which is
easily managed for typical resolutions. For 3D DVC, how-
ever, storing two n × n × n images can be a challenge. It
requires 2n³ bytes for image data and (n+2)³ floats for
spline coefficients. For n = 1024, this requires 6 GB total,
while at our scanner's maximum 4000 × 4000 × 2300 res-
olution it requires 206 GB total, exceeding the memory of
most shared-memory computers. A parallel implementation
could distribute the memory over p processors, requiring
O(6n³/p) bytes per process. However, examining how
DVC accesses images reveals that we can use much less
memory. 3D images are typically stored as a series of 2D
images (e.g. in TIFF format), each representing a single
slice with a constant value of the z coordinate (see Figure
1 for coordinate definition). Each subset intersects only a
small number of these slices, so rather than attempting to
load the entire 3D image, we developed a data structure that
loads only those slices that are currently in use, as shown in
Figure 4. Thus, we reduce the required image data from
Figure 3. Derivative terms in initial transformation u0 are zero, so initial subset is square. The outer gray region is padding added before computing the spline. After the optimization step, derivative terms in u1 are non-zero, introducing strain and rotation. Dashed lines show the bounding box, which here is still inside initial spline region.
2n³/p to 2(s+b)n² per process, independent of p. For the
spline coefficients, the spline is computed over a single sub-
set instead of the entire image, reducing the required spline coef-
ficients from (n+2)³/p to (s+2b+2)³ floats per process.
As s ≪ n, this is a significant savings. The total memory is
thus O(2sn²) bytes. Our scheme requires less memory
than loading the entire image for p < (6n)/(2s). Figure 5
shows how our memory requirements scale with image size,
and compares this with computing the full image spline in
parallel for various p. A variant of this scheme for a shared-
memory node would let cores within a node share a set of
images, reducing memory requirements to 2(s+b)n² per
node, instead of per process. This would require a change
in the parallel algorithm, so that processors within a shared
node are assigned correlation points in the same plane;
however, we have not investigated this variation.
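The memory figures above can be reproduced with simple arithmetic (our sketch; it assumes 1-byte voxels and 4-byte floats, consistent with the byte and float counts in the text, and reproduces the quoted figures to within rounding):

```python
def full_image_bytes(n):
    """Full-image scheme: two n^3 images (1 byte/voxel) plus
    (n+2)^3 spline coefficients stored as 4-byte floats."""
    return 2 * n ** 3 + 4 * (n + 2) ** 3

def sliced_bytes(n, s=41, b=5):
    """Our scheme: 2(s+b) slices of n^2 bytes of image data per
    process, plus a (s+2b+2)^3 per-subset spline (negligible)."""
    return 2 * (s + b) * n ** 2 + 4 * (s + 2 * b + 2) ** 3

gb = full_image_bytes(1024) / 1e9  # roughly 6 GB, as quoted above
mb = sliced_bytes(1024) / 1e6      # under 100 MB per process
```

The slice-based requirement grows only as n², so the gap between the two schemes widens rapidly with image size, which is what Figure 5 illustrates.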
Our data storage scheme is implemented as a C++ class
called Image3D. From a user's perspective, accessing vox-
els in an Image3D object operates much like a 3D array. To
access the voxel with indices i, j, k, we use g = image(i,
j, k), which returns the i, j entry of slice k. There is an
additional function, image.load(k_begin, k_end),
to read a contiguous range of image slices. Internally,
Image3D maintains a vector of pointers to the 2D image
slices. At any time, one contiguous block of these slices
is loaded into memory, while pointers for all other slices
are null, marking them as not loaded.
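The behavior of the Image3D interface described above can be sketched in a few lines (a Python illustration, not the actual C++ implementation; the read_slice callable is a stand-in for reading a 2D TIFF slice from disk):

```python
class Image3D:
    """Keeps one contiguous block of 2D image slices in memory,
    mirroring the lazy-loading behavior described in the text."""

    def __init__(self, num_slices, read_slice):
        self.read_slice = read_slice       # callable: k -> 2D slice
        self.slices = [None] * num_slices  # None marks "not loaded"

    def load(self, k_begin, k_end):
        """Load slices [k_begin, k_end) and drop all others."""
        for k in range(len(self.slices)):
            if k_begin <= k < k_end:
                if self.slices[k] is None:
                    self.slices[k] = self.read_slice(k)
            else:
                self.slices[k] = None      # release memory

    def __call__(self, i, j, k):
        """Voxel access, image(i, j, k), as in the C++ class."""
        if self.slices[k] is None:
            raise RuntimeError("slice %d not loaded" % k)
        return self.slices[k][i][j]
```

Only the loaded block consumes memory, so per-process usage is proportional to the subset depth rather than the full image depth, as quantified above.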
This data structure makes the DVC algorithm scalable to
large problem sizes, even on a single processor, rather than
being restricted by the amount of available RAM. For
instance, to perform DVC on a 992� 1013� 994 image
required less than 240 MB of memory per processor, even
though two 3D images plus spline coefficients would take
5.6 GB. This data structure also has immediate benefits for
parallel computing, as each processor can manage its own
subset of image slices independently, simplifying the paral-
lel data distribution and communication.
4. Coarse-grained parallel computing
For parallel implementation, a computation can be decom-
posed into tasks at several different granularities. Given
multiple pairs of 3D images to analyze using DVC, the
coarsest granularity for parallel computing would be to run
multiple instances of DVC to analyze different images
simultaneously; we will not deal with this trivially paralle-
lizable case. More interestingly, within the DVC computa-
tion for a single pair of images, there are at least two
additional levels of granularity: a coarse-grained decompo-
sition computes multiple correlation points in parallel,
while a fine-grained decomposition computes the objective
function for a single correlation point in parallel. We
address the coarse-grained parallelism in this section and
discuss the fine-grained parallelism in Section 5.
To implement a coarse-grained parallel algorithm, we
assign different correlation points to different processors.
A key question is how to assign correlation points to specific
processors. In our storage scheme for 3D images, each pro-
cessor loads only the slices it needs for the current subset, as
shown in Figure 4. Therefore, assigning correlation points in
a plane with the same z coordinate to a single processor will
maximize reuse of image data already in memory. This also
makes the algorithm scalable to large data sets because each
processor reads only a fraction of the data, instead of every
processor reading and storing the entire data set. In addition,
this provides the maximum benefit of using neighboring cor-
relation points on the same processor to extrapolate a good
initial guess for the next correlation point.
However, this simple 1D decomposition by planes may
produce poor efficiency resulting from load imbalance,
where some processors are assigned more planes than
others, as shown in Figure 6 (a). We improve on this by
considering two other decompositions. One option is to
group points into rows, where all points in a row have the
same y and z coordinates, and divide rows evenly among
processors, yielding a maximum difference of one row
between different processors (Figure 6 (b)). Another
option, which is used for all static scheduling results here,
is to divide the total number of points evenly among
Figure 4. Each processor reads only image slices required for its current subset. Different processors work on different subsets in parallel.
Figure 5. Growth in per-process memory requirements as a function of image size, for various numbers of processes, p. The full scheme loads the entire image into memory. Our scheme loads O(s) slices and is independent of p.
processors, yielding a maximum difference of one point
between different processors (Figure 6 (c)).
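The efficiencies quoted in Figure 6 can be reproduced by counting points (our arithmetic sketch, assuming every correlation point costs the same):

```python
import math

def efficiency(total_points, bottleneck_points, p):
    """Parallel efficiency when the most-loaded processor
    determines finish time: total work / (p * bottleneck work)."""
    return total_points / (p * bottleneck_points)

N, p = 5, 4                 # 5 x 5 x 5 grid on four processors
total = N ** 3              # 125 points

by_planes = efficiency(total, math.ceil(N / p) * N * N, p)   # 2 planes = 50 pts
by_rows   = efficiency(total, math.ceil(N * N / p) * N, p)   # 7 rows   = 35 pts
by_points = efficiency(total, math.ceil(total / p), p)       # 32 pts
# by_planes ~ 0.625, by_rows ~ 0.893, by_points ~ 0.977,
# matching the 63% / 89% / 98% figures in the caption of Figure 6.
```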
This static load balancing scheme assumes that each corre-
lation point takes approximately the same amount of time to
compute. However, variations in image quality (e.g. differing
speckle pattern over regions of the sample) and the underlying
deformation field can cause correlation points to take
different amounts of time to compute, leading to a load imbal-
ance. We used TAU (Shende and Malony, 2006) and Jump-
shot (Chan et al., 2008) to trace and visualize the
communication and idle time for the static load balancing
scheme, as shown in Figure 7 (a). Each horizontal bar repre-
sents one of three processes. The gray blocks show idle time at
the end of the computation for processes 0 and 1 as they wait
for process 2 to finish, demonstrating load imbalance.
To correct this imbalance, we developed a master–worker
dynamic load balancing scheme, where we assign one pro-
cess to be the master, and all other processes to be workers.
The master process, shown in Algorithm 1, creates a set of
tasks, with each task being a single row of the correlation
grid. For each worker process, the master initially sends two
tasks and sets up a non-blocking receive to wait for results
back from the worker. The master then waits on the set of
non-blocking receives for any incoming results from work-
ers. For any results that come in, it records the results and
sends a new row of correlation points to that worker process,
or a flag telling the worker it is finished. Each worker, shown
in Algorithm 2, receives a task, processes the correlation
points in it, sends results back to the master, starts a non-
blocking receive for a new task, and immediately starts on
the next queued task. By initially assigning two tasks, work-
ers always have a queued task to work on while waiting for
the next assignment from the master.
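The benefit of dynamic over static assignment can be illustrated with a small scheduling simulation (entirely our construction; the real code uses MPI non-blocking sends and receives as described above):

```python
import heapq

def makespan_static(task_times, p):
    """Static scheme: each worker gets a contiguous, evenly
    sized block of tasks up front."""
    n = len(task_times)
    spans = [sum(task_times[w * n // p:(w + 1) * n // p])
             for w in range(p)]
    return max(spans)

def makespan_dynamic(task_times, p):
    """Master-worker scheme idealized as greedy list scheduling:
    the next task goes to whichever worker frees up first."""
    finish = [0.0] * p
    heapq.heapify(finish)
    for t in task_times:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + t)
    return max(finish)
```

With uniform task times the two schemes tie, but when some rows take longer than others (as happens with varying image quality), the dynamic scheme's makespan is never worse and is often much better.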
Figure 7 (b) shows a trace of the communication and
idle time for the dynamic load balancing scheme. The ver-
tical orange lines show the master process receiving
results from and sending a new task to each worker pro-
cess. The master is nearly always idle in a MPI_Waitsome
call (yellow), so it can run oversubscribed on the same
core as a worker process. The gray blocks at the end are
again idle time, but are significantly less than the idle time
for the static load balancing scheme in Figure 7 (a). The
wall clock time is reduced by 9%, from 194 seconds to
176 seconds, showing that the added communication for
dynamic load balancing is more than compensated by the
improved load balancing.
Figure 6. Decomposition of 5 × 5 × 5 grid of correlation points onto four processors: (a) by planes (63% efficiency); (b) by rows (89% efficiency); (c) by points (98% efficiency). Parallel efficiency given for simple case of all points taking equal time.
We still want to be efficient with reusing the existing
image slices that have been loaded by each process. There-
fore, rather than assigning tasks in a round-robin fashion,
with task 0 going to process 0, task 1 to process 1, etc.,
we instead make an initial distribution of tasks identical
to the static load balancing ‘‘by rows’’ (Figure 6 (b)). If a
process finishes all of its initial distribution of tasks, the
master will re-assign it tasks from other processes. This
is managed by the assign function of Algorithm 1.
Similar to the serial complexity given in Equation (4), the complexity for the parallel algorithm with p workers is

$$
C_{\mathrm{par}} = O\!\left(\frac{N^3}{p}\,C_{\mathrm{pt}}\right)
+ \underbrace{O\!\left(\frac{N^2}{p}\,\mathrm{msg}(N k_{\mathrm{dof}})\right)}_{\text{MPI messages}}
+ \underbrace{O\!\left(2\left(s + b + \left\lceil \frac{N}{p} \right\rceil - 1\right) n^2\right)}_{\text{read images}},
\qquad (5)
$$
where msg(N) is the time to send a message of length N and k_dof is the number of DOFs (12 for the affine shape function). Because the messages are short, approximately 8 kB for N = 100, they are latency bound rather than bandwidth bound, and their cost is minimal compared with the computational cost.
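The latency-bound claim can be checked with the usual linear model msg(n) = latency + n/bandwidth (the latency and bandwidth figures below are assumed, typical cluster-interconnect values, not measurements from the paper):

```python
# Linear communication-cost model for short MPI result messages:
# msg(n_bytes) = LATENCY + n_bytes / BANDWIDTH.
LATENCY = 5e-6      # seconds per message; assumed, typical cluster value
BANDWIDTH = 5e9     # bytes/second; assumed, typical cluster value

def msg_time(n_bytes):
    return LATENCY + n_bytes / BANDWIDTH

N, k_dof = 100, 12                 # one row of results, affine shape function
n_bytes = N * k_dof * 8            # double precision: 9600 bytes
latency_part = LATENCY
bandwidth_part = n_bytes / BANDWIDTH
```

In double precision a row of N = 100 results with 12 DOFs is 9600 bytes, the same order as the approximately 8 kB quoted, and under these assumed values the transfer term is smaller than the per-message latency.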
We implemented these parallel algorithms using MPI
(MPI Forum, 2009), to run in a variety of parallel and
distributed computing environments. To test the scal-
ability of our algorithms, we perform DVC with varying
numbers of processors on Keeneland, a cluster with 120
nodes, each with two 6-core Intel X5660 CPUs and
three NVIDIA M2090 GPUs (Vetter et al., 2011). We
test three different problem sizes, all of which cover a 552 × 552 × 552 voxel region of a 992 × 1013 × 994 image and use a 41³ subset size. The small problem size uses a 20-voxel grid spacing, yielding a 26 × 26 × 26 grid with 17,576 correlation points. The medium problem size uses a 10-voxel grid spacing, yielding a 51 × 51 × 51 grid with 132,651 correlation points. The large problem size uses a 5-voxel grid spacing, achieving our target resolution with a 101 × 101 × 101 grid with 1,030,301 correlation points.
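The grid sizes follow directly from the spacing; the quoted numbers are consistent with a 500-voxel span between the first and last correlation point in each dimension (an inference from the figures above, leaving margin for the 41³ subsets):

```python
SPAN = 500  # voxels between first and last grid point per dimension (inferred)

def grid_points(spacing):
    """Correlation points per dimension for a given grid spacing."""
    assert SPAN % spacing == 0
    return SPAN // spacing + 1

# 20-voxel spacing -> 26^3 points; 10 -> 51^3; 5 -> 101^3
points = {spacing: grid_points(spacing) ** 3 for spacing in (20, 10, 5)}
```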
In Figure 8 (a) we plot in log–log scale the wall clock time
versus number of processors. A slope of negative one, shown
by the triangle and black lines, indicates linear speedup. Both
the static (dashed blue lines) and dynamic (solid red lines)
load balancing schemes come close to linear speedup. In all
cases the dynamic scheme is faster than the static scheme.
Figure 8 (b) plots the parallel efficiency, defined as

$$
e_p = \frac{\text{serial cost}}{\text{parallel cost}} = \frac{t_1}{p\, t_p},
$$
where tp is the wall clock time with p processors. We esti-
mate the serial time by summing the computation time for
the static load balancing scheme, excluding any idle time.
The static scheme gradually loses efficiency with more pro-
cessors, while the dynamic scheme is better able to maintain
a high efficiency. For the small problem size (lines with '×'), both schemes start to deviate from a straight line in
Figure 8 (a), which is also seen in Figure 8 (b) as a reduction
in parallel efficiency, due to the higher proportion of parallel
overhead compared with larger problems. For instance, pro-
cessors assigned points in the same plane will read images
redundantly, accounted for by the ceiling in Equation (5).
For the medium and large problem sizes, the dynamic
scheme maintains 98% efficiency up to the maximum num-
ber of processors tested, while the efficiency with the static
scheme drops to 80%. For the large problem with 96 proces-
sors, the dynamic scheme is 24% faster than the static
scheme. The largest problem—which is our target million
point correlation grid—would take an estimated 77.6 hours
on a single CPU core. The dynamic scheme solves it in 50
minutes with 96 processors; the static scheme in 62 minutes.
Since a pair of CT scans typically takes 1–2 hours, this type
of coarse-grained parallelism achieves our goal of comput-
ing high-resolution DVC in time commensurate with the
image acquisition time. However, the performance can be
improved even further as described in the next section.
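As a consistency check, applying the definition e_p = t_1/(p t_p) to the quoted timings for the million-point problem reproduces the reported efficiencies:

```python
def parallel_efficiency(t_serial, p, t_parallel):
    """e_p = t_1 / (p * t_p); all times in the same units."""
    return t_serial / (p * t_parallel)

t1 = 77.6 * 60                               # estimated serial time, minutes
e_dynamic = parallel_efficiency(t1, 96, 50)  # about 0.97, matching the ~98% quoted
e_static = parallel_efficiency(t1, 96, 62)   # about 0.78, matching the ~80% quoted
```

The 24% advantage of the dynamic scheme is likewise just the ratio 62/50 = 1.24.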
Figure 7. Idle time and communication for DVC with three processes using (a) static and (b) dynamic load balancing schemes. Gray blocks at end are idle time, blue is evaluating objective function inside BFGS, red is reading images. Orange lines between master and another process indicate communication; master is nearly always idle in MPI_Waitsome (yellow).
5. Fine-grained parallel computing with GPU
A complementary parallelization is to compute the objec-
tive function c(u) itself in parallel. Recall that the objective function (1) evaluates the subset in the deformed image, g(x(u)), for all voxels x in the current subset. Computing the objective function and its gradient in parallel entails computing the spline interpolation of g(x(u)) at every
voxel in parallel, then doing parallel reductions to sum the
objective function and its derivatives. While the computa-
tion to communication ratio is insufficient to do this effi-
ciently with loosely coupled distributed computing such
as MPI, modern GPUs provide a lightweight thread model
that is ideal for this type of data-parallel computation.
With the advent of the CUDA (NVIDIA, 2007) and
OpenCL (Munshi, 2009) languages to allow general-purpose GPU programming, GPUs have been used to accelerate many applications, including image processing (Park
et al., 2011), FFT (NVIDIA, 2013), linear algebra (Tomov
et al., 2010), and other scientific applications (Lee et al.,
2010; Hwu, 2011). In this work, we use the CUDA extensions to C++ developed by NVIDIA. In CUDA, a parallel
computation is decomposed into a 3D grid of blocks, which
are executed asynchronously on the GPU. Each block is
further decomposed into a 3D grid of threads, all of which
execute the same kernel function in lockstep on different
pieces of data, in SIMD (single instruction, multiple data)
fashion. A GPU uses multiple blocks to hide memory
latency. While one block is waiting for a memory read to
complete, another block can be executing. Having a suffi-
cient number of blocks and amount of computation within
each block to hide memory latency is therefore important to
achieving high performance.
For DVC, we assign each thread to compute one point in
the subset, and create blocks by tiling each slice of the subset
with 2D tiles, as illustrated in Figure 9 and detailed in Algo-
rithms 3 and 4. For instance, using 16 × 16 blocks to cover a 31³ subset results in a 2 × 2 × 31 grid of blocks, for a total of 124 blocks, each containing 256 threads. Since 31 is not
evenly divisible by 16, tiles along two edges will have one
row or column that is outside the subset; results for these
points are set to zero. We copy the subset of the reference
image f and the spline coefficients of the deformed image
g, of size s³ and (s + 2b + 2)³ floats, respectively, to 3D
textures on the GPU to take advantage of texture caching.
These do not need to be copied for each objective function
evaluation, but only when starting a new subset or if the
spline is recomputed. The vector u is copied to the GPU’s
constant memory, which is also cached.
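The block-grid arithmetic above can be sketched as a host-side helper (illustrative Python, not the authors' CUDA launch code):

```python
from math import ceil

def block_grid(subset, tile):
    """CUDA block grid for tiling each z-slice of a subset^3 volume
    with 2D tiles, one layer of blocks per slice."""
    tx, ty = tile
    return ceil(subset / tx), ceil(subset / ty), subset

gx, gy, gz = block_grid(31, (16, 16))
n_blocks = gx * gy * gz          # 2 x 2 x 31 = 124 blocks
threads_per_block = 16 * 16      # 256 threads
overhang = gx * 16 - 31          # one row/column of threads lands outside the subset
```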
Figure 8. (a) Strong parallel scaling in log–log scale for three problem sizes. Solid black lines show linear speedup. Weak scaling is also demonstrated by some data points, as 51³ ≈ 8(26³) and 101³ ≈ 8(51³). (b) Parallel efficiency for the same tests.

Figure 9. Tiles (small slabs) covering an image subset, showing the decomposition of computation on GPU. Since the subset size is not a multiple of the tile size, tiles extend beyond the subset boundary.
Using the B-spline basis, each m × n tile uses approximately 4(m + 3)(n + 3) coefficients, depending on the deformation u. For each point, evaluating the tricubic spline and its first partial derivatives requires 526 flops. Thus for an 8 × 8 tile, the computation-to-memory ratio
is 69 flops/float. This large ratio allows for overlapping
computation and memory reads to hide memory latency,
yielding good performance on the GPU.
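The 69 flops/float figure is just the ratio of these two counts (the per-point flop count and coefficient estimate are taken from the text):

```python
m, n = 8, 8                          # tile size
flops = m * n * 526                  # 526 flops per point (tricubic spline + gradient)
floats_read = 4 * (m + 3) * (n + 3)  # approximate spline coefficients touched per tile
ratio = flops / floats_read          # about 69 flops per float
```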
After computing the spline g(x(u)) and its derivatives at each point, each block does a series of standard parallel sum reductions (Kirk and Hwu, 2010) to compute partial sums of the objective function c(u) and its derivatives
within each block. Because CUDA does not provide syn-
chronization between blocks, a second CUDA kernel with
only one thread block makes a final summation of the par-
tial sums from each block, and the results are copied back
to the CPU.
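The reduction pattern can be modeled serially in a few lines (a Python stand-in for the shared-memory CUDA reduction, not the device code itself):

```python
def block_reduce(values):
    """Tree (strided pairwise) sum, mirroring a shared-memory block reduction:
    the number of active elements halves at each step."""
    vals = list(values)
    n, stride = len(vals), 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            if i + stride < n:
                vals[i] += vals[i + stride]
        stride *= 2
    return vals[0]

# Partial sums within each block, then a final single-block pass,
# mirroring the second kernel used because blocks cannot synchronize.
blocks = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0], [7.0]]
total = block_reduce([block_reduce(b) for b in blocks])
```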
Currently, other portions of the DVC computation
remain on the CPU. Particularly, the BFGS iteration itself
is computed on the CPU, because the small problem size,
only 12 DOFs, does not have sufficient parallelism to take
advantage of a GPU. Over 99% of the time for BFGS is in
the c(u) and ∇c(u) evaluations, which are GPU accelerated. The coarse search is also performed on the CPU,
because it takes only 5% of the time, and has only a couple
of operations per memory read, so the cost of transferring
data to the GPU would outweigh any benefits. The FFT
search could be accelerated on the GPU, but as it is
invoked for only 1.7% of subsets, the benefits would
be negligible. The best candidate for further acceleration
using the GPU is computing the B-spline coefficients,
which involves solving three systems with an (s + 2b + 2) × [(s + 2b + 2)²] right-hand side (RHS) matrix and transposing the RHS after each solve. Benchmarking just these two operations using the MAGMA linear algebra library (Tomov et al., 2010), for s = 41, shows a potential
speedup of 1.8, from 7.8 ms using the CPU to 4.3 ms using
the GPU. Ruijters and Thévenaz (2010) also developed an algorithm to compute B-spline coefficients on the GPU.
Computing splines on the GPU would, however, also
increase contention on the GPU, so the potential impact
on overall performance is unclear. Leaving some compu-
tation on the CPU allows time for other processes to share
the same GPU.
While this GPU code parallelizes the computation of the
objective function for a single subset, we also combined it with
our MPI-based DVC, making an implementation that exploits
multiple levels of parallelism, simultaneously doing both
coarse-grained parallelism across multiple CPU cores and, for
each CPU core, fine-grained parallelism using a GPU.
We use single precision floating point with CUDA, since
it is supported on all GPU cards and has twice the perfor-
mance of double precision floating point on recent cards
such as the M2090 (NVIDIA, 2011). The GPU algorithm
also computes in a different order than the CPU algorithm:
the parallel sum reduction adds terms together in a hierarch-
ical tree fashion, which tends to yield more accurate results
since numbers of similar magnitude are summed at each
step, as compared with a serial implementation that sums
terms into a single accumulator. To assess the impact of sin-
gle precision and algorithmic changes, we compare DVC
solutions computed using single and double precision. Glo-
bal error properties are identical between the solutions: both
have 1.45% error and a standard deviation in displacement
of 0.042 voxels. Differences between the computed displa-
cements, u_single − u_double, are small: 99.7% of points are
within 0.002 voxels, much less than the standard deviation
of 0.042 voxels, and all points are within 0.015 voxels.
We conclude that single precision is sufficient to compute
DVC to within its experimental accuracy limits.
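The accuracy advantage of hierarchical summation over a single serial accumulator is easy to reproduce; the example below is a generic single-precision illustration of the effect, not the DVC data:

```python
import numpy as np

# One large value plus many small ones, all in single precision.
vals = np.empty(1024, dtype=np.float32)
vals[0] = 2.0**20
vals[1:] = 0.0625
exact = 2.0**20 + 1023 * 0.0625          # 1048639.9375

# Serial accumulation: once the running sum is large, each small
# addend falls below the accumulator's precision and is lost.
serial = np.float32(0.0)
for v in vals:
    serial = np.float32(serial + v)

# Hierarchical (tree) summation: small values combine with values of
# similar magnitude before meeting the large accumulator.
tree = vals.copy()
while len(tree) > 1:
    half = len(tree) // 2
    tree = tree[:half] + tree[half:]
pairwise = float(tree[0])
```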
Building on the parallel complexity in Equation (5), the complexity for the GPU parallel algorithm is

$$
C_{\mathrm{gpu}} = C_{\mathrm{par}}
+ \underbrace{O\!\left(\frac{N^3}{p}\left\{ s^3 + (s+2b)^3 + k_{\mathrm{bfgs}}\, k_{\mathrm{dof}} \right\}\right)}_{\text{CPU} \leftrightarrow \text{GPU data transfer}}.
\qquad (6)
$$

Recall that the speedup of the GPU implementation over the CPU implementation of the objective function is already represented in C_par by the factor a in Equation (3).
To make a fair comparison, both CPU and GPU imple-
mentations should be optimized. In our case, the inner
loops of the CPU and GPU implementations are identical.
In both cases, the four summations for g, g_x, g_y, g_z, shown in Algorithm 4, are merged into a single set of triple nested loops over k, j, i. On the CPU, the Intel icc compiler with -xsse4 optimization is able to partially vectorize these loops.
However, because of the slightly irregular memory access
pattern due to the affine shape function (2), and loading
data from unaligned addresses, the vectorization efficiency
is limited. A single CPU core achieved 3.56 Gflop/s evalu-
ating the objective function, independent of subset size. For
comparison, the optimized matrix–vector product (sgemv)
in Intel’s MKL achieves 6.0 Gflop/s; sgemv has fewer
operations than spline interpolation, but a more regular data
access pattern that can be vectorized well.
We first test the performance of the GPU for computing
the objective function itself, apart from other parts of our
DVC code. Here we define speedup as the ratio of the com-
putation time using only a CPU core to the computation
time for the GPU algorithm (in both cases using single pre-
cision). We achieve a maximum speedup of 43.5 using a
GPU, as shown in Figure 10. However, the speedup is vari-
able and depends on the machine, tile size, and subset size.
In general, larger subset sizes have a larger speedup. A
decrease in performance is seen when a multiple of the tile
size is exceeded, requiring another row or column of tiles.
For instance, 8 × 8, 8 × 16, and 16 × 16 tile sizes all show a decrease in speedup when the subset size reaches 49³ voxels. Medium size tiles with 64 to 128 threads (4 × 16, 4 × 32, 8 × 8, 8 × 16) have the best performance, with speedups from 21.4 to 43.5, while small tiles and large tiles (4 × 4, 16 × 32) show lesser speedups. For subsequent tests, we chose the 8 × 8 tile size because it performed well
Figure 10. GPU performance compared with one CPU core, for evaluating objective function with various subset and tile sizes. CPU achieves 3.56 Gflop/s, independent of subset size. We chose 8 × 8 as the best overall tile size.
consistently, being the first or second fastest for all subset
sizes. As we accounted for CPU-to-GPU communication
separately in our performance model, and the frequency
of communication depends on the number of function
evaluations during BFGS, this speedup does not include
communication. For instance, with s = 41, copying f and g to the GPU takes 0.71 ms (9.8 Gbit/s), while evaluating c(u) and ∇c(u) an average of 19 times during BFGS takes
5.47 ms on the GPU and 193.33 ms on the CPU. Thus
including communication reduces the speedup from 35.3
to 31.5.
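The arithmetic behind these speedups is simply the ratio of the quoted timings (the copy-inclusive figure comes out near 31.3 with the times as quoted; the small gap from the stated 31.5 presumably reflects rounding of the reported timings):

```python
t_cpu = 193.33     # ms for 19 objective/gradient evaluations on one CPU core
t_gpu = 5.47       # ms for the same evaluations on the GPU
t_copy = 0.71      # ms to copy f and g to the GPU once per subset

speedup_kernel = t_cpu / t_gpu                 # about 35.3
speedup_with_copy = t_cpu / (t_gpu + t_copy)   # about 31.3
```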
Because we are not fully utilizing the GPU, multiple
CPU cores can share a single GPU and all gain a speedup.
To illustrate, we perform a small weak scaling experiment
with static scheduling, where each CPU core computes
DVC for one slice of a 26 × 26 × p correlation grid.
Profiles of process 0 are shown in Figure 11 for various
numbers of processors and GPUs. Since process 0 performs
the same computation in all cases, any increase in time can
be attributed to resource contention and parallel overhead.
For the CPU-only code, the time is dominated by BFGS,
which incurs only a small increase in time with increasing
cores. For more than one core, idle time is introduced as
this process waits for other processes to finish. Reading
images from disk, the coarse search, FFT search, and com-
puting B-spline coefficients are all a small percentage of
the overall time. For one GPU, BFGS is up to 24.9 times
faster than the CPU-only version, while the overall compu-
tation is accelerated by 8.7 times, a consequence of
Amdahl’s law (Amdahl, 1967): we parallelize one portion
of the application, but the maximum speedup achievable
is limited by the amount of time spent in the remaining
serial code. As mentioned previously, some of this serial
code is also amenable to computation with the GPU, with
the B-spline computation being the largest and best candi-
date. With one GPU, as the number of CPU cores increases,
the BFGS time increases modestly up to four cores, then
increases more sharply for larger numbers of cores, due
to contention on the GPU. The time to copy data from the
CPU to the GPU also increases, but is still a small percent-
age of the overall time, indicating that PCIe bandwidth is
not a bottleneck. For two GPUs, at most six cores use each
GPU, so the increase in time is much less significant, while
for three GPUs, at most four cores use each GPU, so the
increase is slight.
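The Amdahl's law relationship quoted above can be inverted to estimate the accelerated fraction: with the BFGS portion sped up 24.9 times and an overall speedup of 8.7 times, roughly 92% of the single-core runtime must have been in the accelerated portion (our back-of-envelope inference from the quoted speedups, not a figure from the paper):

```python
def amdahl_speedup(f, s):
    """Overall speedup when fraction f of the runtime is accelerated by s."""
    return 1.0 / ((1.0 - f) + f / s)

# Invert the law: solve 1/((1-f) + f/24.9) = 8.7 for f.
f = (1.0 - 1.0 / 8.7) / (1.0 - 1.0 / 24.9)   # about 0.92
```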
To investigate the scalability of our algorithm, we per-
form a strong scaling experiment using multiple nodes of
Keeneland. As shown in Figure 12, for the 101³ grid size,
we achieve speedups of 3.5, 6.5, and 8.0 times using one,
two, and three GPUs, respectively, per 12-core node.
Again, we see that adding a second GPU alleviates conten-
tion, increasing overall performance 47% compared with
one GPU, while adding a third GPU per node has diminishing returns, adding only 10%. We also observe near-perfect linear scaling for both the CPU-only and GPU
codes. As before, the dynamic load balancing scheme
achieves 98% parallel efficiency for CPU-only code, and
88% or better parallel efficiency using GPUs.
6. Application
While this work is primarily concerned with the parallel
implementation and performance, we also want to empha-
size that our method maintains high accuracy. We demon-
strate the effectiveness of our enhanced, high-performance
DVC code to detect a 3D deformation field based on a par-
ticular elastic solution. We take two consecutive scans,
called baseline images, using the same settings and without
moving (other than the tomograph rotation) or deforming
the sample. Ideally, the displacement field for a pair of
baseline images is zero everywhere, but scanning will
introduce experimental noise. We use a pair of baseline
images, in order to include a realistic amount of experimen-
tal noise, and apply an artificial deformation to one image
Figure 11. Time for one process of weak scaling experiment with static scheduling, in all cases performing exactly the same computation on slice z = 240. Changes in time with more processes are therefore due to resource contention and parallel overhead.
using a known analytical solution for this particular
instance. The sample material is a ceramic foam, pictured
in Figure 1. It is scanned with an Xradia MicroCT scanner
at 10 μm resolution, resulting in a 992 × 1013 × 994 voxel
3D image.
For strain-inducing deformations, we use the 3D defor-
mation field around a rigid spherical inclusion of radius R
in a linear elastic material under uniaxial tension. The displa-
cement field for this problem was derived by Goodier
(1933). We apply an artificial deformation to the first base-
line image using the analytical solution with a sphere radius
R of 50 voxels (0.5 mm), then compute DVC between this
artificially deformed image and the second baseline image.
By using an artificial deformation, we have the exact solu-
tion to compare with and are better able to assess the accu-
racy of the DVC method, apart from additional uncertainties
inherent in using a load frame to perform in situ load tests.
We compute the DVC solution over a 552³ voxel region with a 41³ subset using our target 101³ grid with over a million correlation points.
Figure 13 shows a vector plot of the 3D displacement
field measured by DVC around the sphere, viewed from just
off the z-axis so that effectively we see just the u and v dis-
placements. Tension is applied parallel to the x-axis, and the
Poisson effect causes compression in the y and z directions.
The distortion near the sphere is evident: within each row,
displacements near the sphere are smaller than those farther
from the sphere, approaching zero displacement at the sur-
face of the sphere. Overall, we achieve a 1.11% relative L2
error in the u displacement, with a mean absolute error of
0.06 voxels in u. We also apply a least-squares fit to deter-
mine the applied tension T (Gates et al., 2011), which we are
able to measure with 0.06% error.
7. Conclusion
We were able to achieve our goal of computing accurate,
high-resolution DVC in time commensurate with the image
acquisition time. Each scan can take up to an hour. We
performed DVC on a 101³ grid with a 41³ subset size in
50 minutes using 96 processors. Using significantly less
hardware (and energy), the GPU-accelerated version
achieved this computation also in 50 minutes with just 1
node (12 cores and 3 GPUs), and in 6.2 minutes using 8
nodes (96 cores and 24 GPUs). Using GPUs accelerated the
computation by up to 8 times, compared to using CPUs
alone. Both the CPU and GPU versions are highly scalable,
work within memory constraints, and achieve 88% or better
parallel efficiency on up to 96 cores. This DVC algorithm
provides the capability to perform high-resolution DVC on
large 3D images obtained from modern μCT scanners.
Acknowledgments
We are grateful to Christian Espinoza and Professor W Kriven
of the Department of Materials Science and Engineering
at the University of Illinois for providing the ceramic foam
sample, and to Charles Mark Bee, Leilei Yin, and the Ima-
ging Technology Group at the Beckman Institute for use of
the Xradia MicroCT scanner.
Funding
This work was supported by the Center for Simulation of
Advanced Rockets under contract by the U.S. Department
of Energy (contract number B523819), the Institute for
Advanced Computing Applications and Technologies, and
the University of Illinois Campus Research Board (award
number 09084).
References
Amdahl GM (1967) Validity of the single processor approach to
achieving large scale computing capabilities. In: Proceedings
of the April 18–20, 1967, Spring Joint Computer Conference.
New York, NY: ACM Press, pp. 483–485.
Figure 12. Strong parallel scaling in log–log scale for two problem sizes. Using one, two, and three GPUs per node achieves 3.5, 6.5, and 8.0 times speedup, respectively, compared with CPUs alone for the 101³ grid size.

Figure 13. Arrow plot of displacements measured with DVC around spherical inclusion of radius R.
Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J,
et al. (1999) LAPACK Users’ Guide, 3rd edn. Philadelphia, PA:
SIAM Press. Available at: http://www.netlib.org/lapack/lug/
Bay BK, Smith TS, Fyhrie DP and Saad M (1999) Digital volume
correlation: Three-dimensional strain mapping using X-ray
tomography. Experimental Mechanics 39: 217–226.
Buis PE and Dyksen WR (1996) Efficient vector and parallel
manipulation of tensor products. ACM Transactions on Math-
ematical Software 22(1): 18–23.
Chan A, Gropp W and Lusk E (2008) An efficient format for
nearly constant-time access to arbitrary time intervals in large
trace files. Scientific Programming 16(2–3): 155–165.
Chu T, Ranson W, Sutton M and Peters W (1985) Applications of
digital-image-correlation techniques to experimental mechanics.
Experimental Mechanics 25: 232–244.
Dierckx P (1993) Curve and Surface Fitting with Splines. Oxford:
Oxford University Press.
Efstathiou C, Sehitoglu H and Lambros J (2010) Multiscale strain
measurements of plastically deforming polycrystalline tita-
nium: Role of deformation heterogeneities. International
Journal of Plasticity 26: 93–106.
Fletcher R (1987) Practical Methods of Optimization, 2nd edn.
New York: Wiley.
Fluck O, Vetter C, Wein W, Kamen A, Preim B and Westermann R (2011) A survey of medical image registration on
graphics hardware. Computer Methods and Programs in Bio-
medicine 104: e45–e57.
Forsberg F, Sjodahl M, Mooser R, Hack E and Wyss P (2010) Full
three-dimensional strain measurements on wood exposed to
three-point bending: Analysis by use of digital volume corre-
lation applied to synchrotron radiation micro-computed tomo-
graphy image data. Strain 46: 47–60.
Franck C, Hong S, Maskarinec S, Tirrell D and Ravichandran G
(2007) Three-dimensional full-field measurements of large
deformations in soft materials using confocal microscopy
and digital volume correlation. Experimental Mechanics
47: 427–438.
Franck C, Maskarinec SA, Tirrell DA and Ravichandran G (2011)
Three-dimensional traction force microscopy: a new tool for
quantifying cell-matrix interactions. PLoS ONE 6: e17833.
Frigo M and Johnson S (2005) The design and implementation of
FFTW3. Proceedings of the IEEE 93(2): 216–231.
Gates M, Lambros J and Heath MT (2011) Towards high perfor-
mance digital volume correlation. Experimental Mechanics
51: 491–507.
Germaneau A, Doumalin P and Dupre JC (2007) 3D strain field mea-
surement by correlation of volume images using scattered light:
Recording of images and choice of marks. Strain 43: 207–218.
Glocker B, Sotiras A, Komodakis N and Paragios N (2011)
Deformable medical image registration: Setting the state of the
art with discrete methods. Annual Review of Biomedical Engi-
neering 13: 219–244.
Gonzalez RC and Woods RE (2007) Digital Image Processing, 3rd
edn. Englewood Cliffs, NJ: Prentice-Hall.
Goodier JN (1933) Concentration of stress around spherical and
cylindrical inclusions and flaws. Journal of Applied
Mechanics 55(7): 39–44.
Haldrup K, Nielsen S, Mishnaevsky L Jr, Beckman F and Wert J
(2006) 3-dimensional strain fields from tomographic measure-
ments. Proceedings of SPIE 6318: 63181B.
Hall SA, Bornert M, Desrues J, et al. (2010) Discrete and conti-
nuum analysis of localised deformation in sand using X-ray
μCT and volumetric digital image correlation. Géotechnique
60(5): 315–322.
Hild F, Roux S, Bernard D, Hauss G and Rebai M (2013) On the
use of 3D images and 3D displacement measurements for the
analysis of damage mechanisms in concrete-like materials. In:
VIII International Conference on Fracture Mechanics of Con-
crete and Concrete Structures, pp. 1–12.
Hwu W (2011) GPU Computing Gems Emerald Edition. San
Mateo, CA: Morgan Kaufmann.
Kirk DB and Hwu WMW (2010) Programming Massively Paral-
lel Processors: A Hands-on Approach. San Mateo, CA: Mor-
gan Kaufmann.
Klein S, Staring M, Murphy K, Viergever MA and Pluim JPW
(2010) elastix: A toolbox for intensity-based medical image
registration. IEEE Transactions on Medical Imaging 29:
196–205.
Leclerc H, Periea JN, Hild F and Roux S (2012) Digital volume
correlation: what are the limits to the spatial resolution?
Mechanics and Industry 13: 361–371.
Lee VW, Kim C, Chhugani J, et al (2010) Debunking the 100x
GPU vs. CPU myth: An evaluation of throughput computing
on CPU and GPU. ACM SIGARCH Computer Architecture
News 38: 451–460.
Lenoir N, Bornert M, Desrues J, Besuelle P and Viggiani G (2007)
Volumetric digital image correlation applied to X-ray microto-
mography images from triaxial compression tests on argillac-
eous rock. Strain 43: 193–205.
Limodin N, Rethore J, Adrien J, Buffiere JY, Hild F and Roux S
(2011) Analysis and artifact correction for volume correlation
measurements using tomographic images from a laboratory X-
ray source. Experimental Mechanics 51: 959–970.
Liu C (2005) On the minimum size of representative volume ele-
ment: An experimental investigation. Experimental Mechanics
45(3): 238–243.
Liu L and Morgan EF (2007) Accuracy and precision of digi-
tal volume correlation in quantifying displacements and
strains in trabecular bone. Journal of Biomechanics 40:
3516–3520.
Modersitzki J (2009) FAIR: Flexible Algorithms for Image Regis-
tration. Philadelphia, PA: SIAM.
Morgeneyer T, Helfen L, Mubarak H and Hild F (2012) 3D digital
volume correlation of synchrotron radiation laminography
images of ductile crack initiation: An initial feasibility study.
Experimental Mechanics, in press.
Mostafavi M, McDonald S, Mummery P and Marrow T (2013)
Observation and quantification of three-dimensional crack
propagation in poly-granular graphite. Engineering Fracture
Mechanics, in press.
Moulart R, Rotinat R and Pierron F (2009) Full-field evaluation of
the onset of microplasticity in a steel specimen. Mechanics of
Materials 41: 1207–1222.
MPI Forum (2009) MPI: A message-passing interface standard.
Munshi A (ed.) (2009) The OpenCL Specification. Khronos
OpenCL Working Group.
NVIDIA (2007) NVIDIA CUDA C Programming Guide.
NVIDIA (2011) Tesla M-class GPU computing modules acceler-
ating science.
NVIDIA (2013) CUFFT library.
Park IK, Singhal N, Lee MH, Cho S and Kim CW (2011) Design
and performance evaluation of image processing algorithms
on GPUs. IEEE Transactions On Parallel and Distributed Sys-
tems 22: 91–104.
Rannou J, Limodin N, Rethore J, et al. (2010) Three dimensional
experimental and numerical multiscale analysis of a fatigue
crack. Computer Methods in Applied Mechanics and
Engineering 199: 1307–1325.
Rossi M and Pierron F (2011) Identification of plastic constitutive
parameters at large deformations from three dimensional dis-
placement fields. Computational Mechanics 49: 53–71.
Roux S, Hild F, Viot P and Bernard D (2008) Three-dimensional
image correlation from X-ray computed tomography of solid
foam. Composites: Part A 39: 1253–1265.
Ruijters D and Thévenaz P (2010) GPU prefilter for accurate
cubic B-spline interpolation. The Computer Journal 55:
15–20.
Saxena V, Rohrer J and Gong L (2010) A parallel GPU algorithm
for mutual information based 3D nonrigid image registration.
In: Euro-Par 2010 Parallel Processing. Springer.
Shams R, Sadeghi P, Kennedy RA and Hartley RI (2010) A sur-
vey of medical image registration on multicore and the GPU.
IEEE Signal Processing Magazine 27: 50–60.
Shende S and Malony AD (2006) The TAU parallel performance
system. International Journal of High Performance Comput-
ing Applications 20: 287–311.
Sjodahl M, Siviour CR and Forsberg F (2012) Digital volume cor-
relation applied to compaction of granular materials. Procedia
IUTAM 4: 179–195.
Smith TS, Bay BK and Rashid MM (2002) Digital volume
correlation including rotational degrees of freedom during
minimization. Experimental Mechanics 42: 272–278.
Sutton MA, Orteu JJ and Schreier H (2009) Image Correlation for
Shape, Motion and Deformation Measurements. New York:
Springer.
Tomov S, Nath R, Ltaief H and Dongarra J (2010) Dense linear
algebra solvers for multicore with GPU accelerators. In:
2010 IEEE International Symposium on Parallel & Distribu-
ted Processing, Workshops and PhD Forum (IPDPSW).
Piscataway, NJ: IEEE, pp. 1–8.
Verhulp E, van Rietbergen B and Huiskes R (2004) A three-
dimensional digital image correlation technique for strain
measurements in microstructures. Journal of Biomechanics
37: 1313–1320.
Vetter J, Glassbrook R, Dongarra J, et al. (2011) Keeneland:
Bringing heterogeneous GPU computing to the computational
science community. IEEE Computing in Science and Engi-
neering 13: 90–95.
Yang Z, Ren W, Mostafavi M, McDonald SA and Marrow TJ
(2013) Characterisation of 3D fracture evolution in concrete
using in-situ X-ray computed tomography testing and digital
volume correlation. In: VIII International Conference on Frac-
ture Mechanics of Concrete and Concrete Structures, pp. 1–7.
Zauel R, Yeni YN, Bay BK, Dong XN and Fyhrie DP (2006)
Comparison of the linear finite element prediction of deforma-
tion and strain of human cancellous bone to 3D digital volume
correlation measurements. Journal of Biomechanical Engi-
neering 128: 1–6.
Zitova B and Flusser J (2003) Image registration methods: a sur-
vey. Image and Vision Computing 21: 977–1000.
Author biographies
Mark Gates is a research scientist at the University of
Tennessee, where he works on developing the PLASMA
and MAGMA software libraries for linear algebra on
multi-core and GPU-based computers. He received his
B.S., M.S., and Ph.D. in Computer Science from the University
of Illinois at Urbana-Champaign, in 1998, 2007, and
2011, respectively. He also worked at NCSA from 1998–
2000, on optimizing applications for high-performance
wide-area networks. His research interests are in high-
performance computing, particularly linear algebra and
GPU computing.
Michael T Heath is Professor and Fulton Watson Copp
Chair in the Department of Computer Science at the Uni-
versity of Illinois at Urbana-Champaign. He received his
Ph.D. in Computer Science from Stanford University in
1978. His research interests are in scientific computing,
particularly numerical linear algebra and optimization. He
is an ACM Fellow, SIAM Fellow, Associate Fellow of the
AIAA, and a member of the European Academy of
Sciences. He received the Taylor L. Booth Education
Award from the IEEE Computer Society in 2009, and is
author of the widely adopted textbook Scientific Comput-
ing: An Introductory Survey, 2nd edition, published by
McGraw-Hill in 2002.
John Lambros is Professor and Associate Head for the
Graduate Program of Aerospace Engineering at the Univer-
sity of Illinois at Urbana-Champaign, where he is also Princi-
pal Investigator and Director of the Stress Wave Mitigation
Center. He received a B.Eng. in Aeronautical Engineering
from Imperial College in 1988, and an M.S. and Ph.D. in
Aeronautics from the California Institute of Technology in 1989 and
1994. His research interests include the mechanical characterization
of material response across multiple time and length scales under
static and dynamic impact or fatigue loading.