Spatial interpolation of scattered geoscientific data

Florian Hanzer

February 3, 2012

1 Introduction

Most data for environmental variables (e.g. meteorological variables, soil properties etc.) are collected from point sources. For modeling and visualization purposes, the data often needs to be available on a regular grid, which requires spatial interpolation of the scattered point measurements.

Figure 1: Spatial prediction is the process of estimating the value of properties at an unvisited site within the area covered by existing observations. Figure source: [3]

A variety of interpolation methods is available for this purpose, for example inverse distance weighting (IDW), kriging, splines and polynomial regression. Depending on the number of prediction locations (i.e., the grid size) and the number of data points, interpolation can be a very time-consuming and computationally demanding task.

The subject of this work was to implement the inverse distance weighting algorithm, a very simple interpolation method that is widely used in geoscientific applications, on a GPU. The properties of the algorithm make it well suited to the GPU rather than the CPU: the value at each prediction point is computed independently of all other prediction points, so a significant performance gain can be expected. Similar studies transferring spatial interpolation algorithms onto the GPU [4, 5, 9] show promising results.

2 Inverse distance interpolation

The general interpolation problem is stated as follows: given a set of irregularly distributed points $x^{(i)}$, $i = 1, \dots, n$, and scalar values $z_i$ associated with each point, find an interpolating function $z$ such that $z(x^{(i)}) = z_i$.

Inverse distance interpolation, also known as inverse distance weighting (IDW) or Shepard's method [7], is one of the simplest spatial prediction techniques. The value of a target variable at some new location can be derived as a weighted linear combination of the data values:

$$z(x) = \frac{\sum_{i=1}^{n} W_i(x)\, z_i}{\sum_{i=1}^{n} W_i(x)}.$$

The weights $W_i$ are functions of the Euclidean distance between the data location $x^{(i)}$ and the prediction location $x$:

$$W_i(x) = \frac{1}{\|x - x^{(i)}\|^{p}},$$

where the parameter $p$ denotes the IDW power. It can be used to emphasize spatial similarity: increasing $p$ assigns greater influence to the data points closest to the interpolated point and less influence to distant ones. A common choice is $p = 2$.
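The two formulas above translate almost directly into code. The following is a minimal host-side C++ sketch, not the implementation benchmarked in this report; the `Point` struct, the function name `idw` and the handling of a prediction location that coincides with a data point (returning that point's value, since the weight is singular there) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A 2-D location; the struct name and layout are illustrative assumptions.
struct Point {
    double x;
    double y;
};

// Inverse distance weighted estimate at location q from data points pts
// with associated values z, using W_i = 1 / ||q - pts[i]||^p.
double idw(const std::vector<Point>& pts, const std::vector<double>& z,
           Point q, double p = 2.0) {
    assert(!pts.empty() && pts.size() == z.size());
    double num = 0.0;  // running sum of W_i * z_i
    double den = 0.0;  // running sum of W_i
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const double d = std::hypot(q.x - pts[i].x, q.y - pts[i].y);
        if (d == 0.0) {
            return z[i];  // q coincides with a data point (assumed convention)
        }
        const double w = 1.0 / std::pow(d, p);
        num += w * z[i];
        den += w;
    }
    return num / den;
}
```

For two data points with values 1 and 3 at (0, 0) and (2, 0), the estimate at (0.5, 0) with $p = 2$ is 1.2. With $p = 4$ it moves closer to 1, the value of the nearer point, which illustrates the effect of the IDW power.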

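An alternative formulation of the IDW problem as a matrix-vector multiplication is

$$\hat{\mathbf{z}} = W \mathbf{z},$$

with

$$\hat{\mathbf{z}} = \begin{pmatrix} z(x_1) \\ \vdots \\ z(x_m) \end{pmatrix}, \quad
\mathbf{z} = \begin{pmatrix} z_1 \\ \vdots \\ z_n \end{pmatrix}, \quad
W = \begin{pmatrix} W_1(x_1) & \cdots & W_n(x_1) \\ \vdots & & \vdots \\ W_1(x_m) & \cdots & W_n(x_m) \end{pmatrix},$$

where $x_1, \dots, x_m$ denote the prediction points. For the product to reproduce the weighted average of the data values, the weights in each row $j$ of $W$ are understood to be normalized by the row sum $\sum_{i=1}^{n} W_i(x_j)$.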
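The matrix formulation can be sketched in a few lines of C++. The example below is illustrative only: it works in one spatial dimension to keep the distance computation short, the names `Matrix`, `weight_matrix` and `multiply` are hypothetical, and it assumes that each row of the weight matrix is divided by its row sum and that no prediction location coincides with a data location.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // m rows of n entries each

// Build the m x n matrix of row-normalized IDW weights (1-D locations).
// xs: data locations, qs: prediction locations, p: IDW power.
Matrix weight_matrix(const std::vector<double>& xs,
                     const std::vector<double>& qs, double p = 2.0) {
    Matrix W(qs.size(), std::vector<double>(xs.size()));
    for (std::size_t j = 0; j < qs.size(); ++j) {
        double row_sum = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            const double d = std::abs(qs[j] - xs[i]);
            assert(d > 0.0);  // assumed: prediction and data locations differ
            W[j][i] = 1.0 / std::pow(d, p);
            row_sum += W[j][i];
        }
        for (double& w : W[j]) {
            w /= row_sum;  // after this loop the row sums to 1
        }
    }
    return W;
}

// Plain matrix-vector product, returning W z.
std::vector<double> multiply(const Matrix& W, const std::vector<double>& z) {
    std::vector<double> out(W.size(), 0.0);
    for (std::size_t j = 0; j < W.size(); ++j) {
        assert(W[j].size() == z.size());
        for (std::size_t i = 0; i < z.size(); ++i) {
            out[j] += W[j][i] * z[i];
        }
    }
    return out;
}
```

Storing $W$ explicitly costs $m \times n$ entries, so this form is mainly useful when the same set of data and prediction locations is reused for many value vectors $\mathbf{z}$.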

Due to its simplicity and robustness, the IDW method is widely used in geoscientific applications [8]. Geostatistical methods such as kriging tend to provide more accurate results [10, 2], but they require extensive statistical analysis of the data before they can be applied, are more difficult to implement, and have larger computational demands. Drawbacks of IDW are, among others, that it tends to perform poorly on large data sets (every data point enters every prediction), and that it gives too much influence to distant nodes (the total weight of distant nodes can be larger than the weight of nearby nodes).
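The last drawback can be made concrete with a small worked example. The configuration is hypothetical and chosen only for illustration: one data point at distance 1 from the prediction location, $N$ data points at distance 10, and $p = 2$.

```latex
\begin{align*}
W_{\text{near}} &= \frac{1}{1^{2}} = 1, &
W_{\text{far, total}} &= N \cdot \frac{1}{10^{2}} = \frac{N}{100}.
\end{align*}
```

As soon as $N > 100$, the distant points together outweigh the nearby one, even though each of them individually carries only one hundredth of its weight.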

The so-called Modified Shepard's method [6] tries to address some of these drawbacks. Here, the interpolant has the following form:

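$$z(x) = \frac{\sum_{i \in K} W_i(x)\, Q_i(x)}{\sum_{i \in K} W_i(x)}, \qquad |K| = N_w < N$$

$$W_i(x) = \left( \frac{R_x - \|x - x^{(i)}\|}{R_x\, \|x - x^{(i)}\|} \right)^{2}, \qquad R_x = \max_{i \in K} \|x - x^{(i)}\|$$

$$Q_i(x) = x^{T} A x + b^{T} x + c, \qquad A \in \mathbb{R}^{D \times D},\ b \in \mathbb{R}^{D},\ c \in \mathbb{R}$$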

The characteristics of this method are:

• It uses only $N_w$ points for the interpolation, i.e. the $N_w$ nearest neighbors of the prediction location $x$.

• The nodal functions $Q_i(x)$ replace the constant $z_i$ of the original Shepard's method. The functions are obtained by weighted least squares fitting on a set of $N_q$ nearest neighbors of $x^{(i)}$, with the constraint $Q_i(x^{(i)}) = z_i$.

• When combined with an efficient k-nearest neighbor (kNN) search algorithm, it achieves good performance even on large data sets.

• There are only two tunable parameters, $N_w$ and $N_q$.
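The weight formula of the Modified Shepard's method can be sketched as follows. This C++ fragment covers only the neighbor selection and the weights $W_i(x)$, not the least squares fit of the nodal functions. The function name and the brute-force neighbor selection are assumptions made for illustration; an efficient kNN structure would replace the selection step on large data sets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns (data index, weight) pairs for the nw nearest data points, given
// the distances dist[i] = ||x - x^(i)|| from the prediction location x.
// W_i = ((R - d_i) / (R * d_i))^2, with R the largest of the nw distances.
// Assumes all distances are strictly positive.
std::vector<std::pair<std::size_t, double>>
modified_shepard_weights(const std::vector<double>& dist, std::size_t nw) {
    assert(nw >= 1 && nw <= dist.size());
    std::vector<std::size_t> idx(dist.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    // Brute-force selection of the nw nearest neighbors, sorted by distance.
    std::partial_sort(idx.begin(),
                      idx.begin() + static_cast<std::ptrdiff_t>(nw),
                      idx.end(),
                      [&dist](std::size_t a, std::size_t b) {
                          return dist[a] < dist[b];
                      });
    const double R = dist[idx[nw - 1]];  // radius of influence R_x
    std::vector<std::pair<std::size_t, double>> out;
    out.reserve(nw);
    for (std::size_t k = 0; k < nw; ++k) {
        const double d = dist[idx[k]];
        assert(d > 0.0);
        const double t = (R - d) / (R * d);
        out.emplace_back(idx[k], t * t);
    }
    return out;
}
```

With $R_x$ defined as the largest neighbor distance, the farthest of the $N_w$ neighbors always receives weight zero, so the weights fall off smoothly to zero at the edge of the neighborhood. For $N_w = 1$ all weights vanish and the interpolant is undefined, so at least two neighbors are needed in practice.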

3 Implementation

Within the framework of this report, the standard IDW algorithm (Shepard's method) was ported to the GPU using three different approaches (Thrust, PGI Accelerator and OpenCL), and their performance was compared with that of CPU code. A next step would be the implementation of the Modified Shepard's method on the GPU.

3.1 Thrust

Thrust is an open-source template library built on top of Nvidia's CUDA, featuring an interface similar to the C++ STL. Its intent is to give developers access to GPU computing at a high level, rather than having to tune their code to specific GPU architectures: the decision of how to implement the computation is delegated to the library. This allows developers to benefit from GPU performance with minimal programming effort, while still providing full interoperability with CUDA C/C++ code.

Figure 2: Thrust is an abstraction layer on top of CUDA C/C++. Figure source: [1]

Thrust provides two vector containers: host_vector, which is stored in host memory, and device_vector, which is stored in device memory. Just like std::vector, they are generic containers able to store any data type and can be resized dynamically. Copying data from the host to the device and vice versa is simply done with the = operator, and individual elements of a vector can be accessed using the standard bracket notation.

Thrust incorporates a rich set of powerful algorithms for common patterns such as sorting or reducing. The following example, taken from the Thrust Quick Start Guide (http://code.google.com/p/thrust/wiki/QuickStartGuide), illustrates some of Thrust's built-in transformations:

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/sequence.h>
    #include <thrust/copy.h>
    #include <thrust/fill.h>
    #include <thrust/replace.h>
    #include <thrust/functional.h>
    #include <iostream>
    #include <iterator>

    int main() {
        // allocate three device_vectors with 10 elements
        thrust::device_vector<int> X(10);
        thrust::device_vector<int> Y(10);
        thrust::device_vector<int> Z(10);

        // initialize X to 0,1,2,3, ....
        thrust::sequence(X.begin(), X.end());

        // compute Y = -X
        thrust::transform(X.begin(), X.end(), Y.begin(), thrust::negate<int>());

        // fill Z with twos
        thrust::fill(Z.begin(), Z.end(), 2);

        // compute Y = X mod 2
        thrust::transform(X.begin(), X.end(), Z.begin(), Y.begin(),
                          thrust::modulus<int>());

        // replace all the ones in Y with tens
        thrust::replace(Y.begin(), Y.end(), 1, 10);

        // print Y
        thrust::copy(Y.begin(), Y.end(),
                     std::ostream_iterator<int>(std::cout, "\n"));
        return 0;
    }

As can be seen in the example, the user does not have to choose the launch parameters for the GPU kernel but delegates this task to Thrust. Currently, Thrust tries to find a launch configuration with the highest occupancy by comparing the resource usage of the kernel with the resources of the target GPU.

Besides Thrust's predefined algorithms, user-defined operations can be implemented in the form of C++ functors. As an example, the following code defines a functor for squaring the elements of a vector and illustrates its usage:

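    // functor returning the square of its argument, callable on host and device
    template <typename T>
    struct squareFunctor {
        __host__ __device__
        T operator()(T x) const {
            return x * x;
        }
    };

    // ...

    // square every element of weights in place
    thrust::transform(weights.begin(), weights.end(), weights.begin(),
                      squareFunctor<float>());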

In the Thrust implementation of the IDW method, a functor was defined for calculating the interpolated value at a single location. By use of Thrust's powerful zip_iterator, which basically takes a number of iterators and yields a virtual range of tuples, the interpolated values for all prediction points can be calculated with a single for_each statement.
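The structure described above can be illustrated with a host-side analog in standard C++. This is a sketch under stated assumptions, not the Thrust code used in this report: the names `IdwFunctor` and `interpolate_all` are hypothetical, and the binary form of `std::transform` stands in for the combination of zip_iterator and for_each, since it likewise walks the x and y coordinate ranges in lockstep.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Functor computing the IDW estimate at one prediction location (qx, qy).
// It holds read-only pointers to the data arrays, much as a device functor
// would hold raw device pointers. All names are illustrative assumptions.
struct IdwFunctor {
    const double* xs;  // data x coordinates
    const double* ys;  // data y coordinates
    const double* zs;  // data values
    std::size_t n;     // number of data points
    double p;          // IDW power

    double operator()(double qx, double qy) const {
        double num = 0.0;
        double den = 0.0;
        for (std::size_t i = 0; i < n; ++i) {  // loop over all data points
            const double d = std::hypot(qx - xs[i], qy - ys[i]);
            if (d == 0.0) {
                return zs[i];  // exact hit on a data point
            }
            const double w = 1.0 / std::pow(d, p);
            num += w * zs[i];
            den += w;
        }
        return num / den;
    }
};

// Apply the functor to every prediction location in a single call.
std::vector<double> interpolate_all(const std::vector<double>& xs,
                                    const std::vector<double>& ys,
                                    const std::vector<double>& zs,
                                    const std::vector<double>& qxs,
                                    const std::vector<double>& qys,
                                    double p = 2.0) {
    assert(!xs.empty() && xs.size() == ys.size() && xs.size() == zs.size());
    assert(qxs.size() == qys.size());
    std::vector<double> out(qxs.size());
    const IdwFunctor f{xs.data(), ys.data(), zs.data(), xs.size(), p};
    // One output element per prediction location; the calls are independent.
    std::transform(qxs.begin(), qxs.end(), qys.begin(), out.begin(), f);
    return out;
}
```

Because each call of the functor is independent of the others, the outer iteration over prediction locations is the part that maps onto parallel GPU threads, while the loop over data points runs sequentially inside each call.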

3.2 PGI Accelerator

The PGI Accelerator framework is an extension to the commercially distributed C and Fortran compilers by the Portland Group. It provides a set of OpenMP-like preprocessor directives that can be used to run standard C or Fortran code on the GPU (currently CUDA only). The advantage of this model is that existing code can easily be transferred to the GPU by adding simple preprocessor directives, while tasks such as memory allocation, data movement and kernel invocation are delegated to the compiler. Methods for fine-tuning the data movement between the host and the accelerator are also provided.

The basic implementation of the IDW method using PGI Accelerator was straightforward: the IDW code was ported to standard Fortran and surrounded with preprocessor directives, letting the compiler decide how best to transfer the code to the GPU.

3.3 OpenCL

OpenCL (Open Computing Language) is an open framework for parallel programming across heterogeneous platforms (CPUs, GPUs and others). Similar to CUDA C, OpenCL includes a C-based language for writing kernels to be executed on OpenCL devices. In contrast to CUDA, which is currently available only on Nvidia devices, OpenCL is available on various platforms, including Intel, AMD, Nvidia and ARM devices.

The implementation of the IDW method for OpenCL was done using the OpenCL C++ bindings, which provide an abstraction layer for the low-level OpenCL C API. The implementation is straightforward: for each prediction location, one thread is executed that calculates the interpolated value at that location.

4 Results

In two-dimensional space, the only parameters affecting the performance of the interpolation algorithm are the number of data points ($n$) and the grid size, i.e. the number of prediction points ($m$). To compare the different implementations of the IDW algorithm, in a first test the number of data points was held constant while the number of prediction points was varied, and in a second test the number of data points was varied while using a constant number of prediction points. As the locations and values of the data points have no influence on the performance of the IDW method, they were initialized with random values. All calculations were performed on an Intel Core i7-2600K 3.40 GHz CPU and an Nvidia GeForce GTX 580 GPU.
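A test setup of this kind can be sketched with a few helpers in standard C++. The code below is not the benchmark harness used for the measurements in this report; the function names, the fixed seed and the use of `std::chrono::steady_clock` for wall-clock timing are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// n values drawn uniformly from the unit interval; the fixed default seed
// (an assumption of this sketch) makes runs repeatable.
std::vector<double> random_values(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> v(n);
    for (double& x : v) {
        x = dist(gen);
    }
    return v;
}

// Problem sizes 10^lo, 10^(lo+1), ..., 10^hi for a sweep over m or n.
std::vector<std::size_t> powers_of_ten(int lo, int hi) {
    assert(0 <= lo && lo <= hi);
    std::vector<std::size_t> sizes;
    std::size_t s = 1;
    for (int e = 0; e <= hi; ++e) {
        if (e >= lo) {
            sizes.push_back(s);
        }
        s *= 10;
    }
    return sizes;
}

// Wall-clock time in seconds taken by one call of f().
template <typename F>
double time_seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A sweep would then call `time_seconds` once per entry of `powers_of_ten(2, 8)` for the first test and `powers_of_ten(1, 7)` for the second, passing a callable that runs the implementation under test on randomly initialized coordinates and values.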

Figure 3: Example of the interpolation of temperature measurements at the grid points of a digital elevation model. The locations of the stations taken into account for the interpolation (red triangles), with their respective altitudes, are displayed on the left, and the resulting temperature distribution (after taking into account the elevation dependency) on the right.

prediction points (m)          C    C + OpenMP    Thrust      PGI    OpenCL
10^2                        0.00          0.00      0.14     0.03      0.34
10^3                        0.02          0.00      0.14     0.03      0.33
10^4                        0.13          0.03      0.14     0.03      0.34
10^5                        1.19          0.26      0.15     0.04      0.37
10^6                       11.86          2.55      0.22     0.16      0.79
10^7                      118.79         26.71      0.98     1.33      4.48
10^8                     1189.85        276.70      8.79    12.96       n/a

Table 1: Execution time (seconds) of the different implementations of the IDW algorithm for a constant number of data points $n = 10^3$ and varying values for the number of prediction points.

data points (n)                C    C + OpenMP    Thrust      PGI    OpenCL
10^1                        0.00          0.00      0.13     0.03      0.31
10^2                        0.02          0.00      0.13     0.03      0.32
10^3                        0.12          0.03      0.14     0.03      0.35
10^4                        1.18          0.27      0.24     0.05      0.39
10^5                       11.78          2.62      1.20     0.22      0.84
10^6                      117.97         26.24     10.90     1.97      5.25
10^7                     1181.60        288.38    107.55      n/a       n/a

Table 2: Execution time (seconds) of the different implementations of the IDW algorithm for a constant number of prediction points $m = 10^4$ and varying values for the number of data points.

Figure 4: Execution time (seconds) of the different implementations of the IDW algorithm for a constant number of data points $n = 10^3$ and varying values for the number of prediction points. (Plot with logarithmic axes; x-axis: prediction points, $10^2$ to $10^8$; y-axis: time [s], $10^{-3}$ to $10^3$; series: C, C + OpenMP, Thrust, PGI, OpenCL.)

Figure 5: Speedup (compared to the single-core C implementation) of the different implementations of the IDW algorithm for a constant number of data points $n = 10^3$ and varying values for the number of prediction points. (Plot; x-axis: prediction points, $10^2$ to $10^8$; y-axis: speedup, 0 to 140; series: C, C + OpenMP, Thrust, PGI, OpenCL.)

Figure 6: Execution time (seconds) of the different implementations of the IDW algorithm for a constant number of prediction points $m = 10^4$ and varying values for the number of data points. (Plot with logarithmic axes; x-axis: data points, $10^1$ to $10^7$; y-axis: time [s], $10^{-3}$ to $10^3$; series: C, C + OpenMP, Thrust, PGI, OpenCL.)

Figure 7: Speedup (compared to the single-core C implementation) of the different implementations of the IDW algorithm for a constant number of prediction points $m = 10^4$ and varying values for the number of data points. (Plot; x-axis: data points, $10^1$ to $10^7$; y-axis: speedup, 0 to 60; series: C, C + OpenMP, Thrust, PGI, OpenCL.)

5 Conclusions

The results for the CPU implementations were as expected: calculation time for the single-core implementation increased linearly with $n$ and $m$, respectively, and the OpenMP-enabled CPU version gained a speedup of around 4 on the quad-core CPU. The results of the GPU implementations show that the computation time stays relatively constant for values of up to around $m = 10^6$ and $n = 10^3$, respectively, which implies that the data transfer between the CPU and the GPU is the limiting factor there. Starting from values around $m = 10^5$ and $n = 10^4$, respectively, the GPU implementations outperform the parallelized CPU implementation.

One notable conclusion is that for $n \ll m$, the Thrust-based implementation is the fastest method, while for $m \ll n$ it is only about twice as fast as the 4-core CPU implementation. This is probably due to the structure of the Thrust-based implementation, where a loop over all data points is performed within the IDW functor, which negatively impacts the performance when applied to a large number of data points. However, for geoscientific purposes this issue can largely be neglected, as the number of prediction points typically exceeds the number of data points by several orders of magnitude. For these applications, the GPU-accelerated implementations provide an enormous performance gain.
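As a check on these statements, the speedups for the largest completed runs follow directly from Tables 1 and 2. The symbol $S$ is introduced here only for this calculation and denotes the single-core C time divided by the time of the respective implementation, as in Figures 5 and 7.

```latex
\begin{align*}
n = 10^{3},\ m = 10^{8}: \quad
  & S_{\text{OpenMP}} = \frac{1189.85}{276.70} \approx 4.3, \quad
    S_{\text{Thrust}} = \frac{1189.85}{8.79} \approx 135, \quad
    S_{\text{PGI}} = \frac{1189.85}{12.96} \approx 92 \\
m = 10^{4},\ n = 10^{7}: \quad
  & S_{\text{OpenMP}} = \frac{1181.60}{288.38} \approx 4.1, \quad
    S_{\text{Thrust}} = \frac{1181.60}{107.55} \approx 11
\end{align*}
```

In the second case the Thrust version is faster than the OpenMP version by a factor of $288.38 / 107.55 \approx 2.7$. The tenfold growth of the single-core time with each tenfold increase in $m$ or $n$ (for example 118.79 s to 1189.85 s in Table 1) reflects the $m \times n$ weight evaluations that the algorithm performs.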

References

[1] BELL, Nathan; HOBEROCK, Jared: Thrust: A Productivity-Oriented Library for CUDA. In: GPU Computing Gems, Jade Edition. 2011. Version: October 2011. http://research.nvidia.com/sites/default/files/publications/Thrust%20-%20A%20Productivity-Oriented%20Library%20for%20CUDA.pdf

[2] FALIVENE, Oriol; CABRERA, Lluís; TOLOSANA-DELGADO, Raimon; SÁEZ, Alberto: Interpolation algorithm ranking using cross-validation and the role of smoothing effect. A coal zone example. In: Computers & Geosciences 36 (2010), No. 4, pp. 512–519. http://dx.doi.org/10.1016/j.cageo.2009.09.015. DOI 10.1016/j.cageo.2009.09.015

[3] HENGL, Tomislav: A practical guide to geostatistical mapping. Amsterdam: Hengl, 2009. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c1w.pdf. ISBN 9789090249810, 9090249818

[4] HENNEBÖHL, Katharina; APPEL, Marius; PEBESMA, Edzer: Spatial interpolation in massively parallel computing environments. http://plone.itc.nl/agile_old/Conference/2011-utrecht/contents/pdf/shortpapers/sp_157.pdf

[5] HURAJ, Ladislav; SILÁDI, Vladimír; SILÁCI, Jozef: Comparison of design and performance of snow cover computing on GPUs and multi-core processors. In: WSEAS Trans. Info. Sci. and App. 7 (2010), October, No. 10, pp. 1284–1294. http://dl.acm.org/citation.cfm?id=1973296.1973303. ISSN 1790-0832

[6] RENKA, Robert J.: Multivariate interpolation of large sets of scattered data. In: ACM Trans. Math. Softw. 14 (1988), June, No. 2, pp. 139–148. http://dx.doi.org/10.1145/45054.45055. DOI 10.1145/45054.45055. ISSN 0098-3500

[7] SHEPARD, Donald: A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM national conference. New York, NY, USA: ACM, 1968, pp. 517–524

[8] SLUITER, R.: Interpolation methods for climate data: literature review. 2009

[9] SRINIVASAN, Balaji V.; DURAISWAMI, Ramani; MURTUGUDDE, Raghu: Efficient kriging for real-time spatio-temporal interpolation. In: 20th Conference on Probability and Statistics in the Atmospheric Science, American Meteorological Society (2010)

[10] ZIMMERMAN, Dale; PAVLIK, Claire; RUGGLES, Amy; ARMSTRONG, Marc: An Experimental Comparison of Ordinary and Universal Kriging and Inverse Distance Weighting. In: Mathematical Geology 31 (1999), May, No. 4, pp. 375–390. http://dx.doi.org/10.1023/A:1007586507433. DOI 10.1023/A:1007586507433
