This article was downloaded by: [Florida State University] On: 21 October 2014, At: 13:01 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK International Journal of Geographical Information Science Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tgis20 Predictive modelling of seabed sediment parameters using multibeam acoustic data: a case study on the Carnarvon Shelf, Western Australia Zhi Huang a , Scott L. Nichol a , Justy P.W. Siwabessy a , James Daniell a & Brendan P. Brooke a a Marine and Coastal Environments Group, Geoscience Australia , Canberra , ACT , Australia Published online: 21 Oct 2011. To cite this article: Zhi Huang , Scott L. Nichol , Justy P.W. Siwabessy , James Daniell & Brendan P. Brooke (2012) Predictive modelling of seabed sediment parameters using multibeam acoustic data: a case study on the Carnarvon Shelf, Western Australia, International Journal of Geographical Information Science, 26:2, 283-307, DOI: 10.1080/13658816.2011.590139 To link to this article: http://dx.doi.org/10.1080/13658816.2011.590139 PLEASE SCROLL DOWN FOR ARTICLE Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. 
Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content. This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions

Predictive modelling of seabed sediment parameters using multibeam acoustic data: a case study on the Carnarvon Shelf, Western Australia



International Journal of Geographical Information Science
Vol. 26, No. 2, February 2012, 283–307

Predictive modelling of seabed sediment parameters using multibeam acoustic data: a case study on the Carnarvon Shelf, Western Australia

Zhi Huang*, Scott L. Nichol, Justy P.W. Siwabessy, James Daniell and Brendan P. Brooke

Marine and Coastal Environments Group, Geoscience Australia, Canberra, ACT, Australia

(Received 30 August 2010; final version received 18 May 2011)

Seabed sediment textural parameters such as mud, sand and gravel content can be useful surrogates for predicting patterns of benthic biodiversity. Multibeam sonar mapping can provide near-complete spatial coverage of high-resolution bathymetry and backscatter data that are useful in predicting sediment parameters. Multibeam acoustic data collected across a ∼1000 km² area of the Carnarvon Shelf, Western Australia, were used in a predictive modelling approach to map eight seabed sediment parameters. Four machine learning models were used for the predictive modelling: boosted decision tree, random forest decision tree, support vector machine and generalised regression neural network. The results indicate overall satisfactory statistical performance, especially for %Mud, %Sand, Sorting, Skewness and Mean Grain Size. The study also demonstrates that predictive modelling using the combination of machine learning models has provided the ability to generate prediction uncertainty maps. However, the single models were shown to have overall better prediction performance than the combined models. Another important finding was that choosing an appropriate set of explanatory variables, through a manual feature selection process, was a critical step for optimising model performance. In addition, machine learning models were able to identify important explanatory variables, which are useful in identifying underlying environmental processes and checking predictions against the existing knowledge of the study area. The sediment prediction maps obtained in this study provide reliable coverage of key physical variables that will be incorporated into the analysis of covariance of physical and biological data for this area.

Keywords: multibeam acoustic; predictive modelling; seabed sediment

1. Introduction

Seabed sediment textural parameters such as mud, sand and gravel content can be useful surrogates for predicting patterns of benthic biodiversity (Thouzeau et al. 1991, Kostylev et al. 2001, Post et al. 2006, Beaman and Harris 2007, Pitcher et al. 2007a,b, Degraer et al. 2008, Post 2008, McArthur et al. 2010). These surrogacy relationships indicate that some physical environmental parameters are linked to ecological processes that influence the distribution of seabed organisms.

Typically, sediment grain size data are only available from a limited number of sediment grab samples. To improve predictions of benthic biodiversity from these point data, reliable continuous layers of these parameters are needed. The aim of this study is to demonstrate the effectiveness of generating continuous layers of seabed sediment parameters using multibeam acoustic data and advanced spatial predictive modelling techniques.

*Corresponding author. Email: [email protected]

ISSN 1365-8816 print/ISSN 1362-3087 online
© 2012 Taylor & Francis
http://dx.doi.org/10.1080/13658816.2011.590139
http://www.tandfonline.com

Over the last few years, multibeam sonar systems have become the most efficient technique for seabed mapping as they provide near-complete spatial coverage of bathymetry and acoustic backscatter properties of the seabed (Wille 2005). Modern multibeam sonar systems, such as the Simrad EM 3002 manufactured by Kongsberg Gruppen ASA of Norway used in this study, transmit short pulses of several tens of microseconds at high frequencies (hundreds of kHz) and form hundreds of narrow receiving beams (about 1° wide). Hence, they are capable of resolving small features on the seafloor. Increasingly, multibeam sonar has become a fundamental tool for mapping seabed characteristics (Goff et al. 1999, Todd and Greene 2007). Acoustic data have often been used for seafloor classification or benthic habitat mapping (e.g. Herzfeld and Higginson 1996, Lundblad et al. 2006, Lanier et al. 2007, Erdey-Heydorn 2008, Lucieer 2008). In some cases, bathymetry has been used to predict the distribution of seabed sediment grain size (Leecaster 2003, Verfaillie et al. 2006, 2009, Li et al. 2010).

Despite their potential value, multibeam backscatter data have not been directly used as an explanatory variable for the prediction of seabed sediment grain size. The proportion of multibeam backscatter return from a seabed surface is governed by the acoustic impedance contrast and the surface roughness scale relative to acoustic wavelength, both of which are seabed habitat dependent. The former is sometimes referred to as ‘hardness’ and the latter implies the frequency dependence. Hence, a seabed surface which seems rough for higher frequency systems might appear flat in lower frequency systems. Goff et al. (2000), Kloser et al. (2001) and Siwabessy et al. (2006) have shown that sediment hardness and roughness correlate with backscatter returns. In general, the backscatter intensity increases with an increase in the acoustic impedance contrast or apparent surface roughness scale. Coarser grains are therefore associated with higher backscatter returns. However, in some areas, the differences in backscatter return are the result of different sedimentary processes such as deposition and erosion (Nitsche et al. 2004). In addition, most shallow water seabeds are not homogeneous as often assumed in statistical analysis of acoustic backscatter. They are patchy in space and changeable in time. In this study, we use multibeam backscatter data for an area of seabed that is highly varied in physical character as an additional continuous variable to bathymetry-derived parameters.

A predictive modelling approach is often used for generating continuous layers from point samples (e.g. McKenzie and Ryan 1999, Rigol et al. 2001, McBratney et al. 2003, Li et al. 2010). It is a two-step method. The first step is to model the relationship between a set of explanatory variables and a target variable from samples. The established model is then used to predict the value of the target variable at a location for which values of the explanatory variables are known, assuming that the established relationship holds true for the location of the prediction. For predictive modelling of complex data sets, machine learning models offer the most potential (Gahegan 2003, Li et al. 2010). Non-parametric machine learning models are able to handle both linear and non-linear relationships. They also have the advantages of handling high-dimensional inputs and being more error tolerant than traditional regression approaches.
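The two-step method described above can be illustrated with a minimal sketch. It uses a k-nearest-neighbour average as a stand-in model, which is not one of the four models used in this study, and the sample values, variable choices and function names (`fit_knn`, `predict_knn`) are hypothetical.

```python
import math


def fit_knn(samples, k=3):
    """Step 1: 'learn' from samples; a k-NN model simply stores them."""
    return {"samples": list(samples), "k": k}


def predict_knn(model, features):
    """Step 2: predict the target where only explanatory variables are known."""
    nearest = sorted(
        model["samples"], key=lambda s: math.dist(s[0], features)
    )[: model["k"]]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical grab samples: (depth in m, backscatter in dB) -> %Mud.
samples = [
    ((40.0, -20.0), 1.0),
    ((45.0, -22.0), 2.0),
    ((120.0, -35.0), 20.0),
    ((130.0, -36.0), 24.0),
]
model = fit_knn(samples, k=2)

# Grid cells with known explanatory variables but no sediment sample.
grid_cells = [(42.0, -21.0), (125.0, -35.5)]
predicted = [predict_knn(model, cell) for cell in grid_cells]
```

The prediction step relies on the same assumption stated in the text: the relationship learned at the sampled locations is taken to hold at the unsampled grid cells.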

This article reports the results of predictive modelling of eight seabed sediment parameters: %Mud, %Sand, %Gravel, Bulk Carbonate (%CaCO3), Mean Grain Size, Sorting, Skewness and Kurtosis using multibeam acoustic data for a ∼1000 km² area of the Carnarvon Shelf, Western Australia. Multiple machine learning models are used and combined because of the advantages noted above and because they provide maps of both prediction and an estimation of uncertainty in the prediction (Huang and Lees 2004, 2005). It should be noted that %Mud, %Sand and %Gravel are not independent parameters (i.e. they add up to 100%). In this study, we model them separately to test whether different environmental processes are responsible for their distributions. However, only two of these sediment parameters are actually needed for prediction.
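Because the three textural fractions add up to 100%, the third can be derived from any two. The sketch below illustrates that closure constraint and a simple check for fractions that were predicted separately; the function names and the numbers are hypothetical and are not taken from the study.

```python
def remaining_fraction(first_pct, second_pct, tolerance=1e-6):
    """Return the third textural fraction implied by the other two (all in %)."""
    third = 100.0 - first_pct - second_pct
    if third < -tolerance:
        raise ValueError("the two fractions already exceed 100%")
    return max(third, 0.0)


def closure_error(mud_pct, sand_pct, gravel_pct):
    """How far three separately predicted fractions are from summing to 100%."""
    return mud_pct + sand_pct + gravel_pct - 100.0


# Hypothetical predictions for one grid cell.
gravel_pct = remaining_fraction(5.0, 80.0)
error = closure_error(5.0, 82.0, 15.0)
```

A non-zero closure error is to be expected when each fraction is modelled on its own, since nothing forces the three predictions to sum to 100%.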

2. Materials and methods

2.1. Study area and data sets

The Carnarvon Shelf is located along the coast of Central Western Australia between Cape Cuvier (24°13.2′S, 113°23.4′E) and North West Cape (21°47.4′S, 114°9.6′E) (Figure 1). Shelf width decreases northward from 33 km to 7 km and the shelf edge is at ∼150 m water depth, grading to the continental slope. Ningaloo Reef extends along the inner shelf for 270 km as a fringing reef that is separated from the coast by a narrow (0.2–5 km) sandy lagoon. The oceanography of the shelf is influenced by a predominantly south-west swell (modal wave height 2 m), the south-flowing Leeuwin Current and a microtidal regime (mean spring tidal range 1.8 m). Cyclones also influence the Carnarvon Shelf, crossing North West Cape once a decade on average (Bureau of Meteorology 2010).

For this study, we use data and samples collected in 2008 during a marine survey of three representative areas of the Carnarvon Shelf, named Mandu, Point Cloates and Gnaraloo (Brooke et al. 2009). The purpose of this survey was to acquire high-resolution, continuous bathymetry data across the shelf and co-located samples of seabed sediments, biota and underwater video for use in surrogacy research. The three sampling areas lie seaward of Ningaloo Reef in water depths of 30–250 m and incorporate a range of seabed physical features, including extensive areas of sandy bedforms (e.g. sand waves, rippled scour depressions), flat sandy seabed and numerous reefs (ridges, mounds) with relief of 5–20 m (Brooke et al. 2009).

Sediment samples were collected using a Smith-McIntyre grab (10 L, 0.1 m² opening) from representative areas of seabed in each survey area (Mandu, n = 79; Point Cloates, n = 89; Gnaraloo, n = 91) identified from multibeam bathymetry maps. The textural properties of these samples were later analysed in the laboratory to determine the following: %Mud, %Sand and %Gravel by wet sieve separation (Lewis and McConchie 1994); Mean Grain Size, Sorting (standard deviation, SD), Skewness and Kurtosis on the %Sand and %Mud by laser granulometry (using a Malvern Mastersizer 2000 manufactured by Malvern Instruments Ltd in Worcestershire, UK) with summary statistics calculated using GRADISTAT (Blott and Pye 2001); and %CaCO3 by acid dissolution (Müller and Gastner 1971) (Table 1).

Multibeam acoustic data were the primary data inputs for this study. They were collected using a Simrad EM 3002 300 kHz sonar system operated in a single head configuration. The multibeam bathymetry data processing was undertaken using Caris Hips and Sips V6.1 software developed by CARIS in New Brunswick, Canada. The motion sensor, Differential Global Positioning System (DGPS) and heading data were cleaned using a filter that averaged adjacent data points. Different sound velocity profiles were used to correct the changes in the speed of sound through the water column. The remaining artefacts were then cleaned first automatically by applying several filters and then manually through visual inspection. Tidal information for the survey was obtained from WXTide32 software (a free program, http://www.wxtide32.com) with reference to the mean sea level and Coordinated Universal Time (UTC). Using this information, the tidal variations were accounted for in the processing. The multibeam backscatter data were processed using CMST-GA MB Process v8.11.02.1 software, a multibeam backscatter processing toolbox co-developed by Geoscience Australia and the Centre for Marine Science and Technology at Curtin University of Technology. The backscatter processing included correction for transmission loss and ensonification area, and removal of the system-implemented model and the angular dependence (Gavrilov et al. 2005a,b). The angularly equalised backscatter strengths were normalised to the backscatter strength at an angle of 25° (Heap et al. 2009, Nichol et al. 2009). The bathymetry and backscatter data sets were gridded at 3 m spatial resolution.

Figure 1. The study area showing generalised bathymetry and sample locations for Mandu Creek, Point Cloates and Gnaraloo survey sites.


Table 1. Summary statistics for the eight sediment parameters from 259 samples.

Variable          Mean    STD     Min.    Max.
%Mud              3.5     7.0     0.0     34.8
%Sand             80.2    21.3    3.2     100.0
%Gravel           16.4    20.2    0.0     96.2
%CaCO3            92.3    4.5     72.9    100.0
Mean Grain Size   392.4   195.2   70.5    995.3
Sorting           3.1     1.6     1.4     8.5
Skewness          −1.0    1.8     −5.2    3.6
Kurtosis          9.0     6.4     2.3     37.1

Notes: STD, standard deviation function; Min., minimum; Max., maximum.

The bathymetry grid was used to derive a suite of eight terrain and morphometric variables (secondary variables) based on a review of previous studies (Lundblad et al. 2006, Lanier et al. 2007, Wilson et al. 2007, Erdey-Heydorn 2008, Dunn and Halpin 2009, Verfaillie et al. 2009, Zieger et al. 2009) and expert knowledge of the study area (Table 2). These variables include Local Moran’s I (Moran 1950), which measures spatial autocorrelation within a neighbourhood. It was included because it can also indicate local rugosity or heterogeneity (Holmes et al. 2008).

Textural measures derived from acoustic data, especially the second-order textural analysis of backscatter data, have been shown to be useful in mapping seabed types (Cochrane and Lafferty 2002, Lucieer 2008). The secondary variables of grey-level co-occurrence matrix (GLCM) homogeneity and variance (Haralick et al. 1973) and Local Moran’s I were thus chosen and calculated from the backscatter data set (Table 3). As argued by Wilson et al. (2007), spatial scale is an important issue in habitat mapping, especially terrain analysis. Morphometric features have inherent fuzziness and a multi-scale approach is required to represent this (Fisher et al. 2004). Following Verfaillie et al. (2009), the secondary variables were calculated at four spatial scales: 9, 15, 33 and 93 m (Tables 2 and 3). These scales were selected from our knowledge of the study area and the data sets.
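To make the multi-scale calculation concrete, the sketch below assumes that each spatial scale is the side length of a square moving window on the 3 m grid, so that the four scales correspond to windows of 3, 5, 11 and 31 cells. That reading of the scales, the definition of relief as the maximum minus the minimum value in the window, and the function names are all assumptions made for illustration; the text does not state how the variables were computed.

```python
CELL_SIZE_M = 3                # grid resolution stated in the text
SCALES_M = (9, 15, 33, 93)     # spatial scales stated in the text


def window_cells(scale_m, cell_size_m=CELL_SIZE_M):
    """Window side length in cells, assuming scale = window width in metres."""
    cells, remainder = divmod(scale_m, cell_size_m)
    if remainder or cells % 2 == 0:
        raise ValueError("scale must give an odd, whole number of cells")
    return cells


def relief(grid, row, col, scale_m):
    """Assumed relief: max minus min of the values inside the window.

    The window is clipped at the grid edges.
    """
    half = window_cells(scale_m) // 2
    values = [
        grid[r][c]
        for r in range(max(0, row - half), min(len(grid), row + half + 1))
        for c in range(max(0, col - half), min(len(grid[0]), col + half + 1))
    ]
    return max(values) - min(values)


# Hypothetical 5 x 5 bathymetry patch (depths in m, negative downwards).
patch = [
    [-50.0, -50.0, -50.0, -50.0, -50.0],
    [-50.0, -48.0, -47.0, -48.0, -50.0],
    [-50.0, -47.0, -40.0, -47.0, -50.0],
    [-50.0, -48.0, -47.0, -48.0, -50.0],
    [-50.0, -50.0, -50.0, -50.0, -50.0],
]
sizes = [window_cells(s) for s in SCALES_M]
relief_9m = relief(patch, 2, 2, 9)
relief_15m = relief(patch, 2, 2, 15)
```

On this toy patch the small mound gives a larger relief at the 15 m scale than at the 9 m scale, which is the kind of scale dependence the multi-scale approach is meant to capture.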

Table 2. Secondary variables derived from the bathymetry data set.

Variable                     Description                                               Scales
Slope                        Slope gradient                                            9, 15, 33, 93 m
Relief                       Topographic relief                                        9, 15, 33, 93 m
Surface area                 ‘True’ surface area in relation to ‘planar’ surface       9, 15, 33, 93 m
                             area, an indicator of surface rugosity (Jenness 2004)
TPI (BPI)                    Topographic (Benthic) Position Index (Weiss 2001)         9, 15, 33, 93 m
Planar curvature             The curvature of the surface perpendicular to the         9, 15, 33, 93 m
                             slope direction
Profile curvature            The curvature of the surface in the direction of slope    9, 15, 33, 93 m
Fuzzy morphometric features  Peakness, pitness, passness, ridgeness, channelness       33, 93 m
                             and planarness (Wood 1996, Fisher et al. 2004)
Local Moran’s I              An indicator of spatial autocorrelation (Moran 1950)      9, 15, 33, 93 m


Table 3. Secondary variables derived from the backscatter data set.

Variable          Description                                               Scales
Local Moran’s I   An indicator of spatial autocorrelation                   9, 15, 33, 93 m
Homogeneity       GLCM homogeneity (Haralick et al. 1973); four             9, 15, 33, 93 m
                  directions (north, east, north-east and south-east)
Variance          GLCM variance (Haralick et al. 1973); one direction       9, 15, 33, 93 m
                  (north-east)

Note: GLCM, grey-level co-occurrence matrix.
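As an illustration of the directional GLCM homogeneity listed in Table 3, the sketch below counts co-occurring grey-level pairs within a window for a given direction and applies the commonly used homogeneity (inverse difference moment) weighting 1 / (1 + (i − j)²). The quantisation of backscatter into integer grey levels, the one-cell offset, the non-symmetric pair counting and the function name are assumptions; the software settings used in the study are not given in the text.

```python
from collections import Counter


def glcm_homogeneity(window, d_row, d_col):
    """Homogeneity of a grey-level window for one direction.

    window: 2D list of integer grey levels (already quantised).
    (d_row, d_col): offset to the neighbour, e.g. (0, 1) for east and
    (-1, 1) for north-east when row 0 is the northern edge.
    """
    pairs = Counter()
    rows, cols = len(window), len(window[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                pairs[(window[r][c], window[r2][c2])] += 1
    total = sum(pairs.values())
    # Normalised co-occurrence probability, weighted by 1 / (1 + (i - j)^2).
    return sum(n / (1 + (i - j) ** 2) for (i, j), n in pairs.items()) / total


uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
checker = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]

h_uniform_east = glcm_homogeneity(uniform, 0, 1)
h_checker_east = glcm_homogeneity(checker, 0, 1)
h_checker_ne = glcm_homogeneity(checker, -1, 1)
```

The checkerboard window shows why direction matters: neighbours to the east always differ by one grey level, whereas neighbours to the north-east are always identical.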

We found that the majority of fine-scale spatial patterns were well represented by the four spatial scales. In addition, two spatial variables, Easting and Northing, were also used. These spatial variables were not considered as direct drivers for sediment processes; however, their relationships with known and unknown drivers of sediment processes might be useful for the prediction. One aim of this study was to see whether or not using Easting and Northing on top of the environmental variables can further improve the prediction.

2.2. Modelling methods

Four non-parametric machine learning models were used to simulate the non-linear environment–sediment relationships and make predictions: boosted decision tree (BDT) (Friedman 2002), random forest decision tree (RFDT) (Breiman 2001), support vector machine (SVM) (Cortes and Vapnik 1995) and general regression neural network (GRNN) (Specht 1990). BDT consists of a series of single decision trees, the first of which is fitted to the data. The residuals (error values) from the first tree are then fed into the second tree, which attempts to reduce the error. This process is repeated through a series of successive trees. The final predicted value is formed by adding the weighted contribution of each tree. RFDT also consists of an ensemble (collection) of decision trees, but instead grows a number of independent trees in parallel that do not interact until all of them have been built. SVM performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. The vectors near the hyperplane are the support vectors. An SVM analysis finds the hyperplane that is oriented so that the margin between the support vectors is maximised. SVM has been modified to solve regression problems. GRNN has an input layer (one neuron for each predictor), a hidden layer (one neuron for each training sample), a pattern layer (two neurons; one is the denominator summation unit, the other is the numerator summation unit) and an output layer (one neuron). A hidden neuron computes the distance between the point being evaluated and a training sample, then applies a radial basis function (e.g. Gaussian function) using the sigma value to the distance to compute the weight (influence) for each point. The output layer divides the value accumulated in the numerator summation unit by the value in the denominator summation unit and uses the result as the predicted target value.

Decision trees were used because they typically yield better prediction performance for regression problems than traditional regression approaches, particularly the BDT and RFDT models (e.g. De’ath and Fabricius 2000, Leathwick et al. 2006, Francke et al. 2008). SVM and neural networks have also been demonstrated to perform better (e.g. Huang et al. 2004, Mohandes et al. 2004, Khan and Coulibaly 2006). However, we are aware that the machine learning models are not a magic bullet (Gahegan 2003). There are many issues such as parameter selection, the ‘black box’ problem and computational complexity associated with the machine learning models that still need further research.
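The GRNN prediction described above (one hidden neuron per training sample, a numerator and a denominator summation unit, and an output that divides one by the other) can be sketched in a few lines. The Gaussian kernel form exp(−d² / 2σ²), the use of a single sigma for all predictors and the function name are assumptions for illustration; this is not the DTREG implementation.

```python
import math


def grnn_predict(train_x, train_y, query, sigma):
    """Predict a target value as a kernel-weighted average of training targets."""
    numerator = 0.0     # numerator summation unit: sum of weight * target
    denominator = 0.0   # denominator summation unit: sum of weights
    for x, y in zip(train_x, train_y):
        # Hidden neuron: squared distance to one training sample ...
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, query))
        # ... turned into a weight by a Gaussian radial basis function.
        weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
        numerator += weight * y
        denominator += weight
    # Output neuron: ratio of the two summation units.
    return numerator / denominator


# Hypothetical one-predictor training set.
train_x = [(0.0,), (2.0,)]
train_y = [10.0, 20.0]

midway = grnn_predict(train_x, train_y, (1.0,), sigma=1.0)
near_first = grnn_predict(train_x, train_y, (0.0,), sigma=0.1)
```

A small sigma makes the prediction follow the nearest training sample, while a large sigma pulls it towards the mean of all samples, which is why sigma is one of the parameters that needs to be searched.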

The 259 sediment samples were randomly split into two sets: a training set (181 samples) and a test set (78 samples). The models were developed using the training set. The performance of each model was then statistically evaluated against the test set using R² (proportion of variance explained by model), root mean squared error and mean absolute error. The modelling results of SVM, GRNN and either BDT or RFDT (whichever performed better) were then combined. The reason only one of the decision trees was used in the combination is that they belong to the same modelling family, and thus likely have the same ‘blind spots’ (Huang and Lees 2004). Models from different modelling families have different modes for handling the same problem; therefore by combining them they can counteract one another and potentially produce better modelling results. For each sediment parameter, a weighted-average approach using the three individual R² values as weights was used to combine the three prediction maps into one prediction map for the parameter cell by cell using the following equation:

\[
P_c = \frac{R_1^2}{R_1^2 + R_2^2 + R_3^2} \times P_1 + \frac{R_2^2}{R_1^2 + R_2^2 + R_3^2} \times P_2 + \frac{R_3^2}{R_1^2 + R_2^2 + R_3^2} \times P_3 \tag{1}
\]

where $P_c$ is the combination map; $R_1^2$, $R_2^2$ and $R_3^2$ are R² values for individual models; and $P_1$, $P_2$ and $P_3$ are the three individual maps. Furthermore, an estimate of prediction uncertainty in terms of the level of agreement between the three models was created by calculating the SD of the three individual prediction maps cell by cell:

\[
U = \mathrm{STD}(P_1, P_2, P_3) \tag{2}
\]

where $U$ is the uncertainty map and STD() the SD function.

The four machine learning models were implemented through DTREG software (developed by Phillip H. Sherrod, Brentwood, TN, USA, http://www.dtreg.com). Both the selection of model parameters and selection of explanatory variables have an influence on prediction performance. This study used a set of steps to find a ‘best’ performing model for each model type and each sediment parameter. It should be noted that there are unlimited combinations of modelling parameters, and the final ‘best’ performing model approaches the optimal. The four main steps in the modelling procedure were as follows:

(1) Search model parameters. All explanatory variables were used, including the two primary variables (bathymetry and backscatter) and the secondary variables listed in Tables 2 and 3, except Easting and Northing. An intensive experimental process was implemented to search for the best combination of model parameters, that is, the one with the highest prediction R² (against the test set). For BDT, we varied the three model parameters as follows: ‘Maximum number of trees in series’ was varied between 400 and 1000 with an increment of 100; ‘Depth of individual trees’ and ‘Minimum size node to split’ were varied between 5 and 10 with an increment of 1. For RFDT, we varied the same three model parameters except that the ‘number of trees’ was varied between 100 and 1000 with an increment of 100. For GRNN, we varied two model parameters. Both the ‘Max sigma’ parameter and the ‘Search steps’ parameter were varied between 5 and 100 with an increment of 5 before 50 and an increment of 10 after 50, respectively. For SVM, the parameter selection process was more complicated. We varied three parameters: ‘C’, ‘Gamma’ and ‘P’. The optimal values of these parameters must be searched from a range. For ‘C’, the lower range was set at ‘0.1’ (default value) and the upper range was varied between 5 and 100,000 with an increment of either 5 or 10 or 100 or 1000 or 10,000 at various points. For ‘Gamma’, the lower range was set at ‘0.01’ (default value) and the upper range was varied between 5 and 500 with an increment of either 5 or 10 or 50 at various points. For ‘P’, the lower range was set at ‘0.0001’ (default value) and the upper range was varied as ‘Gamma’. Two mechanisms, a grid search and a pattern search, were used to help the search. For the ‘grid search’, we varied six ‘interval’ combinations: (10, 2), (10, 1), (5, 2), (5, 1), (20, 2) and (20, 1). For the ‘pattern search’, we varied the ‘interval’ between 5 and 10,000 with an increment of either 5 or 10 or 100 or 1000 at various points.

(2) Search explanatory variables. This was a manual feature selection process for which the best combination of model parameters from Step 1 was employed. The primary variables, bathymetry and backscatter, were always included as explanatory variables. The secondary variables (Tables 2 and 3) were split into 15 variable groups (Table 4). In the first iteration, only one variable group was added to the model. This, however, was repeated for all variable groups. The variable group for which we obtained the highest prediction R² (against the test set) was retained for the next iteration, and the R² value was recorded. Each of the following iterations selected a different variable group until all variable groups had been added to the model. The combination of the variable group(s) which achieved the highest prediction R² value was selected as the best combination of explanatory variables.

(3) Refine model parameters. This involved repeating the procedure in Step 1 using the best combination of explanatory variables provided by Step 2.

(4) Add Easting and Northing. The best combination of explanatory variables and the best combination of model parameters from Steps 2 and 3 were employed. The Easting and Northing variables were added to the model to see whether or not they further improved the model performance.
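The group-wise selection in Step 2 follows the pattern of a greedy forward selection, which can be sketched as below. The `score` callable stands for fitting a model with the primary variables plus the given groups and returning its test-set R²; it, the toy gains and the function names are hypothetical, since the study carried out this process manually in DTREG.

```python
def forward_select(groups, score):
    """Greedy forward selection over variable groups.

    Each iteration adds the remaining group that gives the highest score.
    After all groups have been added, the best-scoring combination seen
    along the way is returned as (groups, score).
    """
    selected = []
    remaining = list(groups)
    history = []
    while remaining:
        scored = [(score(selected + [g]), g) for g in remaining]
        best_score, best_group = max(scored, key=lambda item: item[0])
        selected.append(best_group)
        remaining.remove(best_group)
        history.append((list(selected), best_score))
    return max(history, key=lambda item: item[1])


# Toy stand-in for "fit the model and return test-set R2" (made-up numbers).
TOY_GAIN = {"slope": 0.10, "relief": 0.05, "tpi": -0.02}


def toy_r2(groups):
    return 0.40 + sum(TOY_GAIN[g] for g in groups)


best_groups, best_r2 = forward_select(["tpi", "slope", "relief"], toy_r2)
```

In this toy case the third group lowers the score, so the returned combination contains only the first two groups selected. The order in which groups enter also gives the selection order referred to later in the text.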

Table 4. Secondary variable groups.

Group name   Variables
slope        Slope at four scales
relief       Relief at four scales
surface      Surface area at four scales
tpi          TPI at four scales
planar       Planar curvature at four scales
profile      Profile curvature at four scales
moran1       Local Moran’s I for bathymetry at four scales
fuzzy33      Fuzzy morphometric features at the scale of 33 m
fuzzy93      Fuzzy morphometric features at the scale of 93 m
homse        Homogeneity of south-east direction at four scales
home         Homogeneity of east direction at four scales
homn         Homogeneity of north direction at four scales
homne        Homogeneity of north-east direction at four scales
var          Variance of north-east direction at four scales
moran2       Local Moran’s I for backscatter at four scales

Note: TPI, Topographic Position Index.


The three statistics generated from the limited number of test samples could not represent the full picture of model performance. In particular, they could not be used to evaluate the spatial patterns of the prediction maps. Visual assessment was therefore used as an additional criterion of prediction performance for its ability to evaluate spatial distributions. The spatial distribution patterns of sediment parameters on the prediction maps were checked against the local expert knowledge for their reasonableness. In this article, the detailed visual assessment of the prediction maps is reported only for the Point Cloates area (Figure 2) and for three of the sediment parameters: %Mud, %Sand and Sorting. The Point Cloates area has the most complex underwater physical terrain, ranging from ridges, mounds and reefs in shallow water to flat, sand-dominated sediment in deeper water (Figure 2a). The backscatter data set for this area is characterised by a sharp boundary, which is approximately aligned north–south in the centre of the survey area (Figure 2b). The area west of the boundary has high backscatter return and the area east of the boundary has low return. The prediction maps for the other five sediment parameters can be found in the Supplementary Material available online. A brief visual assessment for a stretch of the study area between Point Cloates and Mandu, which has no sediment samples, is also reported (Area 1, Figure 2). The purpose of this was to evaluate whether the models could make reasonable predictions immediately outside the sampled areas.

Figure 2. Physical data sets for the Point Cloates area and an area to the north (Area 1) showing (a) bathymetry and (b) acoustic backscatter at 3 m spatial resolution.

For R² values, there is no consistent evaluation standard. In this study, R² values between 0.3 and 0.5, between 0.5 and 0.7, between 0.7 and 0.9 and greater than 0.9 were regarded, respectively, as ‘acceptable’, ‘moderate’, ‘good’ and ‘excellent’. Importantly, the four machine learning models were able to rank the importance of explanatory variables. BDT and RFDT calculated the importance of each explanatory variable by adding up the improvement in classification gained by each split that used the explanatory variable. SVM and GRNN used a ‘sensitivity analysis’ approach to rank the variable importance, where the values of each explanatory variable were randomised and the effect on the accuracy of the model was measured. The modelling procedure detailed above (i.e. Step 2) also provided a selection order of the secondary variable groups for each model. These results enabled evaluation of the likely driving forces of the sedimentary processes producing the seabed spatial patterns.
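The accuracy statistics and the qualitative R² scale described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' code. The function names are hypothetical, R² is assumed here to be the squared Pearson correlation between observed and predicted values (this section does not restate the exact definition), the lower bound of each band is assumed to be inclusive, and the label for values below 0.3 is not taken from the text.

```python
import math

def rmse(obs, pred):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def r_squared(obs, pred):
    """Squared Pearson correlation (an assumed definition of R²)."""
    n = len(obs)
    mean_o, mean_p = sum(obs) / n, sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)

def label_r2(r2):
    """Map an R² value to the qualitative scale used in this study."""
    if r2 > 0.9:
        return "excellent"
    if r2 >= 0.7:
        return "good"
    if r2 >= 0.5:
        return "moderate"
    if r2 >= 0.3:
        return "acceptable"
    return "below acceptable"  # label not given in the text (assumption)

# R² values reported in Table 5 for GRNN (%Mud), BDT (Sorting) and the
# combined model (%CaCO3).
for value in (0.81, 0.61, 0.39):
    print(value, label_r2(value))
```

With these bands, the reported R² of 0.81 for %Mud falls in the ‘good’ category and 0.61 for Sorting in the ‘moderate’ category, in line with the results reported in Section 3.1.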

3. Results

3.1. Model statistical performance

The statistical results (Table 5) show that the prediction performance for %Mud was good (e.g. 81% variance explained for GRNN and the combined model), while moderate performance was obtained for Sorting, Skewness, %Sand and Mean Grain Size. For the other sediment parameters, the performance was acceptable. There was no single best machine learning model for all sediment parameters (Table 5, Figure 3). To evaluate the overall model statistical performance, for each sediment parameter we assigned a score of 0.5 to the model that ‘won the competition’ on the measures of R² and root mean squared error, and a separate score of 0.5 to the model that won on the measure of mean absolute error. The total scores for BDT, RFDT, SVM, GRNN and the combined models are 2.5, 1.5, 0.0, 2.0 and 2.0, respectively. Thus, the overall performance of BDT was better than that of the other individual models and the combined model. For most sediment parameters, the combined model achieved equal or second best statistical performance compared with the best individual machine learning model. The averaging method used to combine the individual models may have let it down, especially because one of the individual models was SVM, which performed relatively poorly. The overall poor performance of SVM may be due to its complex parameter selection process, which could hinder its chance of optimisation. SVM was originally designed for classification problems, and its modification for regression problems might not have been appropriate. Further testing of its performance against other machine learning models is needed. Many approaches exist for combining classification models (e.g. Huang and Lees 2005), but they are not appropriate for combining regression models. New and better approaches to combining regression models require further research.
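The scoring scheme above can be reproduced from the RMSE and MAE values in Table 5. The sketch below is illustrative only: the function and variable names are hypothetical, and it assumes that the R²/RMSE contest can be decided by RMSE alone (in Table 5 the lowest RMSE always belongs to a model with the highest R², and RMSE resolves the two R² ties).

```python
MODELS = ["BDT", "RFDT", "SVM", "GRNN", "Combined"]

# (RMSE row, MAE row) for each sediment parameter, copied from Table 5,
# with values listed in the same order as MODELS.
TABLE5 = {
    "%Mud": ([3.52, 4.23, 3.73, 3.18, 3.21], [2.09, 2.41, 2.24, 1.96, 1.97]),
    "%Sand": ([16.16, 16.86, 19.78, 15.55, 14.53], [11.39, 11.95, 14.26, 9.86, 10.43]),
    "%Gravel": ([16.61, 17.25, 18.69, 17.44, 16.72], [11.24, 12.72, 13.15, 11.50, 11.35]),
    "%CaCO3": ([3.49, 3.34, 3.48, 3.60, 3.26], [2.41, 2.33, 2.77, 2.72, 2.39]),
    "Mean Grain Size": ([139.59, 148.49, 158.38, 156.23, 137.62],
                        [113.96, 126.96, 130.80, 112.78, 114.04]),
    "Sorting": ([0.92, 0.94, 1.02, 0.99, 0.93], [0.67, 0.71, 0.79, 0.72, 0.70]),
    "Skewness": ([1.21, 1.19, 1.34, 1.38, 1.20], [0.94, 0.93, 1.13, 1.09, 1.00]),
    "Kurtosis": ([5.14, 5.42, 5.55, 5.23, 5.12], [3.82, 4.06, 4.37, 4.07, 4.02]),
}

def tally_scores(table):
    """Give 0.5 to the lowest-RMSE model and 0.5 to the lowest-MAE model
    for every sediment parameter, then return the totals per model."""
    scores = dict.fromkeys(MODELS, 0.0)
    for rmse_row, mae_row in table.values():
        for row in (rmse_row, mae_row):
            winner = MODELS[row.index(min(row))]
            scores[winner] += 0.5
    return scores

print(tally_scores(TABLE5))
# {'BDT': 2.5, 'RFDT': 1.5, 'SVM': 0.0, 'GRNN': 2.0, 'Combined': 2.0}
```

The totals match those reported in the text, which supports reading the scheme as two separate half-point contests per sediment parameter.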

3.2. Feature selection

During Step 2 of the modelling procedure, the general trend was that the highest statistical performance was achieved with only one or a few secondary variable groups being added to the models (%Mud, %Sand, Sorting; Figures 4–6). Adding more secondary variables actually decreased the R² values for all individual machine learning models. Using the spatial variables, Easting and Northing, improved the statistical performance in most cases (Table 5, Figure 7).
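The behaviour described above, where performance peaks after one or a few variable groups and then declines, is what a greedy forward selection would stop on. The sketch below is a generic forward selection over named variable groups and may differ from the Step 2 procedure used in this study. `forward_select`, `toy_score` and the gain values are hypothetical; in practice `score_fn` would train a model on the primary variables plus the selected groups and return a validation R².

```python
def forward_select(groups, score_fn):
    """Greedy forward selection over named variable groups.

    score_fn(selected) must return a validation score (higher is better)
    for a model that uses the groups listed in `selected`.
    """
    selected = []
    best = score_fn(selected)  # score with primary variables only
    remaining = list(groups)
    while remaining:
        # Try adding each remaining group and keep the best candidate.
        score, group = max((score_fn(selected + [g]), g) for g in remaining)
        if score <= best:
            break  # no group improves the score any further
        selected.append(group)
        remaining.remove(group)
        best = score
    return selected, best

# Toy stand-in for model training: invented R² gains per group (assumption).
TOY_GAIN = {"moran1": 0.10, "homse": 0.03, "var": -0.02, "moran2": -0.01}

def toy_score(selected):
    return 0.50 + sum(TOY_GAIN[g] for g in selected)

selected, best = forward_select(TOY_GAIN, toy_score)
print(selected, round(best, 2))  # ['moran1', 'homse'] 0.63
```

Stopping at the first non-improving step keeps the variable set small, which matches the observation that adding more secondary variables reduced R².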

The four machine learning models ranked the two primary variables as the most important explanatory variables for all sediment parameters (Table 6). Bathymetry was selected as the most important explanatory variable for %Mud; backscatter and the secondary variables (morphometric variables and texture measures) made only small contributions to the models. The backscatter data set was identified as the most important for %Sand; bathymetry and GLCM variance at the scale of 93 m were also important. For %Gravel, backscatter was the most important explanatory variable; GLCM variance at the scale of 93 m, GLCM homogeneity (from south-east direction) at the scale of 93 m and bathymetry also had significant contributions. In terms of %CaCO3, Mean Grain Size and Skewness, the most important explanatory variable identified was bathymetry; backscatter and Local Moran’s I for bathymetry, at the scale of 9 m, were also notable. Bathymetry was identified as the most important variable for Sorting; the contribution from backscatter was also notable. For Kurtosis, the most important explanatory variable was bathymetry; the surface area at the scales of 93 and 33 m and backscatter were also very important. In addition, the manual feature selection process (i.e. Step 2 of the modelling procedure) frequently chose ‘moran1’ as the first secondary variable group to add to the models (25 out of 32 times). The variable group of ‘homse’ was often chosen first when modelling %Sand and %Gravel. Furthermore, in 30 out of 32 cases, ‘moran1’ was included in the final ‘best’ model.

Table 5. Summary of models’ statistical performance.

Variable name (STD)            Statistic  BDT       RFDT      SVM       GRNN      Combined
%Mud (STD: 7.34)               R²         0.77      0.67      0.74      0.81*     0.81*
                               RMSE       3.52      4.23      3.73      3.18*     3.21
                               MAE        2.09      2.41      2.24      1.96*     1.97
                               Used EN?¹  Yes       Yes       No        No        NA
%Sand (STD: 22.98)             R²         0.51      0.46      0.32      0.54      0.60*
                               RMSE       16.16     16.86     19.78     15.55     14.53*
                               MAE        11.39     11.95     14.26     9.86*     10.43
                               Used EN?   Yes       Yes       No        Yes       NA
%Gravel (STD: 22.09)           R²         0.44*     0.39      0.28      0.38      0.43
                               RMSE       16.61*    17.25     18.69     17.44     16.72
                               MAE        11.24*    12.72     13.15     11.50     11.35
                               Used EN?   Yes       No        Yes       Yes       NA
%CaCO3 (STD: 4.19)             R²         0.31      0.37      0.31      0.26      0.39*
                               RMSE       3.49      3.34      3.48      3.60      3.26*
                               MAE        2.41      2.33*     2.77      2.72      2.39
                               Used EN?   Yes       Yes       No        Yes       NA
Mean Grain Size (STD: 200.23)  R²         0.51      0.45      0.37      0.39      0.53*
                               RMSE       139.59    148.49    158.38    156.23    137.62*
                               MAE        113.96    126.96    130.80    112.78*   114.04
                               Used EN?   Yes       Yes       Yes       Yes       NA
Sorting (STD: 1.49)            R²         0.61*     0.60      0.54      0.56      0.61*
                               RMSE       0.92*     0.94      1.02      0.99      0.93
                               MAE        0.67*     0.71      0.79      0.72      0.70
                               Used EN?   Yes       Yes       Yes       No        NA
Skewness (STD: 1.88)           R²         0.59      0.60*     0.49      0.46      0.59
                               RMSE       1.21      1.19*     1.34      1.38      1.20
                               MAE        0.94      0.93*     1.13      1.09      1.00
                               Used EN?   Yes       Yes       Yes       No        NA
Kurtosis (STD: 6.34)           R²         0.34      0.27      0.24      0.32      0.35*
                               RMSE       5.14      5.42      5.55      5.23      5.12*
                               MAE        3.82*     4.06      4.37      4.07      4.02
                               Used EN?   Yes       Yes       No        Yes       NA

Notes: The best statistics are marked with an asterisk. NA, not applicable; RMSE, root mean squared error; MAE, mean absolute error; STD, standard deviation function; BDT, boosted decision tree; RFDT, random forest decision tree; SVM, support vector machine; GRNN, general regression neural network; EN, Easting and Northing.
¹Whether or not the final model has included the spatial variables of Easting and Northing.

Figure 3. R² values for all sediment parameters as modelled by boosted decision tree (BDT), random forest decision tree (RFDT), support vector machine (SVM), general regression neural network (GRNN) and the combined model.

Figure 4. Progressive R² values for %Mud as calculated by the four machine learning models during the feature selection process.

Figure 5. Progressive R² values for %Sand as calculated by the four machine learning models during the feature selection process.

Figure 6. Progressive R² values for sediment sorting as calculated by the four machine learning models during the feature selection process.

Figure 7. Differences in R² values between models using and not using the Easting and Northing variables.

Table 6. The most important explanatory variable selected by the models.

Sediment parameter  BDT           RFDT          SVM           GRNN
%Mud                Bathymetry    Bathymetry    Bathymetry    Bathymetry
%Sand               Backscatter   Backscatter   Backscatter   Backscatter
%Gravel             Backscatter   Backscatter   Backscatter   Backscatter
%CaCO3              Bathymetry    Bathymetry    Bathymetry    Bathymetry
Mean Grain Size     Bathymetry    Bathymetry    Bathymetry    Bathymetry
Sorting             Bathymetry    Bathymetry    Bathymetry    Bathymetry
Skewness            Bathymetry    Bathymetry    Bathymetry    Bathymetry
Kurtosis            Bathymetry    surface_93m   Bathymetry    Bathymetry

Notes: BDT, boosted decision tree; RFDT, random forest decision tree; SVM, support vector machine; GRNN, general regression neural network.
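The ‘sensitivity analysis’ ranking used for SVM and GRNN, in which the values of one explanatory variable are randomised and the loss of accuracy is measured, can be sketched as a permutation test. This is an assumed reading of that approach and not the software routine used in the study: shuffling is only one way to randomise a variable, mean squared error is chosen here as the accuracy measure, and `permutation_importance` and the toy model are hypothetical.

```python
import random

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in mean squared error when one column is shuffled.

    predict: function taking a list of rows and returning predictions.
    X: list of rows, each row a list of explanatory variable values.
    y: list of observed values.
    """
    rng = random.Random(seed)

    def mse(rows):
        preds = predict(rows)
        return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

    baseline = mse(X)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(shuffled) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data: columns are [bathymetry, backscatter]; the toy model uses depth only.
X = [[10.0, -30.0], [20.0, -25.0], [40.0, -28.0],
     [60.0, -20.0], [80.0, -22.0], [100.0, -35.0]]
y = [0.2 * row[0] for row in X]

def toy_model(rows):
    return [0.2 * row[0] for row in rows]

print(permutation_importance(toy_model, X, y))
```

In this toy case, shuffling the second column leaves the error unchanged, so its importance is zero, while shuffling the first column increases the error; variables would then be ranked by these increases.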

3.3. Visual assessment

For the Point Cloates study area, the individual machine learning models produced broadly similar prediction maps for %Mud distribution (Figure 8). In all models, the predicted mud content is consistent with the pattern indicated by sediment samples, increasing in deeper water (Figure 9a). The range of mud content predicted by the GRNN model (0–32%) is equivalent to the range among the sediment samples, with other models predicting different ranges (Table 7). Where sediment sample data are not available, such as in the deepest water, the predicted mud content is noticeably different between the three individual models (SVM, BDT and GRNN) and is therefore associated with a high level of uncertainty (Figure 8e).

Figure 8. Prediction maps of %Mud for the Point Cloates area as generated by machine learning models: (a) boosted decision tree (BDT), (b) support vector machine (SVM), (c) general regression neural network (GRNN), (d) the combined model and (e) uncertainty map. Areas of hardground (reef, mounds and ridges) have been masked out and are labelled as rock.
[Map legends: STD, 0 to 20.6; %Mud, 0 to 60; Rock.]

Figure 9. Sample values for the Point Cloates area: (a) %Mud on the bathymetry data set, (b) %Sand on the backscatter data set and (c) Sorting on the bathymetry data set. Legends for the bathymetry and backscatter data sets are identical to those of Figure 2.

The spatial patterns of %Sand distribution are similar for all models, showing a decrease towards the deeper water (Figure 10). The north–south boundary seen in the backscatter data set is clearly visible in the four prediction maps of the machine learning models. Higher sand content generally corresponds with a weaker backscatter return, which is consistent with the backscatter values for sediment sample sites (Figure 9b). In this instance, only the range of values predicted by the GRNN model (0–100%) broadly matched that of the sample data (Table 7). The BDT, SVM and combined models failed to predict the lower range of sand values, truncating at 32%, 43% and 38%, respectively.

The predictions of the sediment sorting parameter generally fall within the range of sorting values measured in sediment samples (i.e. 1.4–8.5; moderately well sorted to very poorly sorted), with the exceptions of the SVM and GRNN models, which extend into the category of very well sorted (0.7 and 0.0, respectively). The spatial pattern for sorting predicted by the models depicts higher values (i.e. poorer sorting) in deeper water (Figure 11), as reflected by the samples (Figure 9c). The predicted pattern, however, is not maintained by all models in areas where samples are not available (far west) or are in low density (southern end). Consequently, uncertainty values are higher in these parts of the study area (Figure 11e).

Table 7. Summary statistics of the prediction maps.

Variable name     Statistic  BDT      SVM      GRNN     Combined
%Mud              Min.       0        0        0        0
                  Max.       17       59       32       29
                  Mean       2.4      4.0      3.8      3.4
                  STD        3.3      7.8      6.2      4.9
%Sand             Min.       32       43       0        38
                  Max.       100      100      100      100
                  Mean       81       76.7     75.5     78
                  STD        11.9     12.3     21.7     14.4
%Gravel           Min.       0        0        0        0
                  Max.       63       61       62       56
                  Mean       13.8     17.1     18.8     16.4
                  STD        9.1      12.1     10.6     9.0
%CaCO3            Min.       88       85       1        66
                  Max.       98       98       96       97
                  Mean       92.8     92.2     91.9     92.3
                  STD        2.1      2.1      1.5      1.7
Mean Grain Size   Min.       117      216      0        189
                  Max.       849      820      897      820
                  Mean       449.6    444.7    467.3    453.6
                  STD        137.3    142.2    191.7    141.7
Sorting           Min.       1.4      0.7      0        1.5
                  Max.       7.0      8.0      7.2      6.9
                  Mean       3.2      3.4      3.1      3.2
                  STD        1.3      1.5      1.4      1.3
Skewness          Min.       −4       −2.4     −4.5     −3.4
                  Max.       −0.2     1.4      3.3      0.3
                  Mean       −1.6     −1.8     −1.3     −1.6
                  STD        0.5      0.5      1.1      0.6
Kurtosis          Min.       1.0      3.2      2.5      3.7
                  Max.       23.6     17.7     17.4     17.6
                  Mean       9.0      8.6      9.4      9.0
                  STD        4.4      1.5      3.3      2.7

Notes: BDT, boosted decision tree; SVM, support vector machine; GRNN, general regression neural network; STD, standard deviation function; Min., minimum; Max., maximum.

Visual assessment of the prediction maps for %Mud, %Sand and Sorting at Area 1 (Figure 2) suggests that the combined model achieves a spatial pattern that is consistent with the observed pattern of sediment distribution throughout the study area (Figure 12).

4. Discussion

4.1. Advantages of using multibeam acoustic data

This study is among only a few (Leecaster 2003, Verfaillie et al. 2006, 2009, Li et al. 2010) that have directly used multibeam bathymetry data to predict and map continuous seabed sediment properties (rather than sediment classes) at a fine spatial scale. An additional primary variable used in this study was acoustic backscatter data, particularly for modelling %Sand and %Gravel across the study area (Figure 10 and supplementary figure S1 available online). Backscatter intensity has been shown previously to be potentially useful in interpreting sediment distributions (e.g. Goff et al. 2000, Kloser et al. 2001, Lamarche et al. 2011). In this study, the backscatter response has informed the model by providing distinctive signatures for sand-dominated areas of seabed and areas of mixed sand and gravel. In addition, this predictive approach of using multibeam acoustic data is fast and inexpensive compared with intensive seabed sampling. Significantly, this approach can be extended to other areas where such multibeam acoustic data exist, especially to large areas where predictions of biodiversity patterns are needed for marine planning and management.

Figure 10. Prediction maps of %Sand for the Point Cloates area as generated by machine learning models: (a) boosted decision tree (BDT), (b) support vector machine (SVM), (c) general regression neural network (GRNN), (d) the combined model and (e) uncertainty. Areas of hardground (reef, mounds and ridges) have been masked out and are labelled as rock.
[Map legends: STD, 0 to 20; %Sand, 0 to 100; Rock.]

Figure 11. Prediction maps of Sorting at the Point Cloates area: (a) boosted decision tree (BDT), (b) support vector machine (SVM), (c) general regression neural network (GRNN), (d) the combined model and (e) uncertainty.
[Map legends: STD, 0.26 to 1.53; Sorting, 0 to 8; Rock.]

Figure 12. Predictions of the combined model at Area 1: (a) %Mud, (b) %Sand and (c) Sorting.

4.2. Advantages of using multiple machine learning models

To our knowledge, this study is the first to map sediment parameters using multiple machine learning models. The results indicate that predictive machine learning models have several advantages. They easily handled a large number of explanatory variables. They were able to identify important explanatory variables. They could also make predictions for (nearby) areas where no sample data exist (Figure 12).

An estimation of prediction uncertainty is essential for an informed interpretation of the modelling results, and as an input to further analysis. Interpolation techniques such as ordinary kriging can produce a form of uncertainty based on distance measurement, with uncertainty increasing with distance from sample location. This form of uncertainty estimation relies on sample density and does not take into account the variation of the environmental complexity in the study area. An uncertainty map for a single model is also possible. However, this kind of uncertainty map has not been checked against other independent predictions; its reliability is thus uncertain. This study presents an alternative way of generating a prediction uncertainty map by combining multiple models. The uncertainty maps that are generated show low uncertainty estimates at locations where the independent models agree and high uncertainty estimates where the models lack agreement.
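The idea of deriving both a combined prediction and an uncertainty estimate from several models can be sketched as a per-cell mean and standard deviation. This is a minimal illustration under stated assumptions: the combined model is described in Section 3.1 as an average of individual models, and the uncertainty map legends are labelled STD, but the exact set of models and the form of standard deviation (population or sample) are not stated here. The function name and the toy values are hypothetical.

```python
from statistics import mean, pstdev

def combine_predictions(predictions):
    """Per-cell ensemble mean and spread.

    predictions: dict mapping a model name to a list of per-cell
    predictions, with all lists in the same cell order.
    Returns (combined, uncertainty) as two lists.
    """
    cells = list(zip(*predictions.values()))
    combined = [mean(cell) for cell in cells]
    # Population standard deviation across models (an assumption).
    uncertainty = [pstdev(cell) for cell in cells]
    return combined, uncertainty

# Toy %Mud predictions for two cells: the models agree on the first cell
# and disagree on the second.
toy = {
    "BDT": [2.0, 10.0],
    "SVM": [2.0, 40.0],
    "GRNN": [2.0, 25.0],
}
combined, uncertainty = combine_predictions(toy)
print(combined)     # [2.0, 25.0]
print(uncertainty)  # first value is 0.0; second is about 12.25
```

Cells where the models agree receive zero spread, and cells where they diverge receive a large spread, which is the behaviour described for the uncertainty maps.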

4.3. Importance of feature selection process

This study found that choosing an appropriate set of explanatory variables (e.g. multibeam bathymetry and backscatter parameters) was more critical than tuning model parameters. In all cases, too many input explanatory variables actually reduced prediction performance (Figures 4–6). This is shown by the lower R² values obtained when all explanatory variables were used. Among the models, GRNN was most sensitive to the selection of explanatory variables. The differences for the BDT and RFDT models were more moderate, which is likely because of the inherent advantage of the automatic feature selection attributed to decision trees. The manual feature selection process as implemented in this study, when guided by expert knowledge, is effective in finding the essential minimum set of explanatory variables.

The ability to check prediction against expert knowledge and explain the underlying environmental process is an important part of validation. For example, the prediction maps of %Mud and Sorting (Figures 8 and 11) display clear depth-related patterns consistent with bathymetry being identified as the most important explanatory variable for these two parameters (Table 6). At Point Cloates, the higher predicted mud content for the outer shelf is considered reliable on the basis that these sediments incorporate weathered (i.e. soft) relict carbonates that are beyond the depth of storm wave base, hence remain undisturbed. The outer shelf also preserves local concentrations of relict carbonate gravel that are also captured in the prediction maps (supplementary figure S1 available online). In the input data sets, this pattern is represented by much stronger backscatter return to the west of the north–south boundary on the mid-shelf at Point Cloates (Figure 2b). Together, the higher mud and gravel content in the deeper part of Point Cloates account for the prediction of poorly sorted sediment in these waters (Figure 11). Conversely, the sand-dominated sediments immediately to the east of the north–south boundary have contributed to lower backscatter return and predictions of a higher degree of sorting (Figures 2b, 10 and supplementary figure S2 available online). Again, this is consistent with the oceanographic setting, with wave-generated currents driving sediment transport and sorting on the inner to mid-shelf.

Apart from the obvious influence of water depth and backscatter patterns, the local topography (e.g. measured by Local Moran’s I of bathymetry data and surface area) and the textural properties of backscatter data (e.g. homogeneity and variance) have also played an important part in predicting the sediment distribution. For Point Cloates, the topographic complexity of the inner shelf in particular is evident in the spatial heterogeneity of the predictions for %Sand and Sorting. On the mid-shelf and outer shelf, the influence of local topography on the models is interpreted to be related to sediment bedforms that are also captured in the backscatter data. The fact that the secondary variables of different scales were identified as important drivers for the sediment distribution confirms the usefulness of the multi-scale approach. For %Sand and %Gravel, their local patterns are more likely influenced by textural properties of backscatter data at a relatively broad scale (e.g. 93 m). For %CaCO3, Mean Grain Size and Skewness, even a change in bathymetry data at the finest scale (9 m) could have significant influence on local patterns, such as would occur across a bedform field. For Kurtosis, the pattern of its local distribution is likely influenced by changes in surface rugosity at a relatively broad scale. In addition, Easting and Northing provide location information for data points (thus the distance measures between data points). This distance information is considered a surrogate measure of spatial autocorrelation (Tobler 1970). The usefulness of Easting and Northing (Figure 7) indicates the existence of spatial autocorrelation for the sediment parameters.
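Spatial autocorrelation of the kind inferred above can also be checked directly. The sketch below computes the global Moran's I (Moran 1950) for a gridded variable; it is not the Local Moran's I that was used as a predictor in this study, and the rook (edge-sharing) neighbourhood with equal weights is an assumption made for illustration. The function name is hypothetical.

```python
def morans_i(grid):
    """Global Moran's I for a 2-D grid (list of equal-length rows),
    using rook adjacency with unit weights."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    grid_mean = sum(sum(row) for row in grid) / n
    z = [[value - grid_mean for value in row] for row in grid]

    cross = 0.0  # sum of z_i * z_j over each neighbouring pair, counted once
    pairs = 0    # number of neighbouring pairs
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:  # neighbour to the east
                cross += z[i][j] * z[i][j + 1]
                pairs += 1
            if i + 1 < rows:  # neighbour to the south
                cross += z[i][j] * z[i + 1][j]
                pairs += 1

    sum_sq = sum(value * value for row in z for value in row)
    # I = (N / W) * sum_ij(w_ij z_i z_j) / sum_i(z_i^2); with symmetric unit
    # weights both W and the double sum are twice the per-pair totals.
    return n * cross / (pairs * sum_sq)

# A smooth west-to-east gradient is positively autocorrelated ...
gradient = [[0, 1, 2, 3] for _ in range(4)]
# ... while a checkerboard is perfectly negatively autocorrelated.
checkerboard = [[(i + j) % 2 for j in range(4)] for i in range(4)]

print(morans_i(gradient))      # about 0.667
print(morans_i(checkerboard))  # -1.0
```

Values near +1 indicate that neighbouring cells are similar and values near −1 that they alternate. The function is undefined for a constant grid because the variance term is zero.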

4.4. Applications of sediment predictions

An ability to generate accurate maps of seabed sediment properties is essential for describing the benthic physical environment, since such maps provide insights into sediment transport pathways and processes, and for understanding the relationship between seabed sediment properties and benthic biodiversity (e.g. Beaman and Harris 2007, Pitcher et al. 2007a, Post 2008). The sediment prediction maps obtained in this study provide reliable coverage of key physical variables that will be incorporated into the analysis of covariance of physical and biological data for this area. The results will provide a test of the degree to which these parameters are able to explain observed biodiversity patterns and thereby enable a better understanding of ecological processes that link physical habitat parameters and seabed biology.

5. Conclusion

This study presents a new approach to mapping seabed sediment properties that provides robust spatial layers and confidence metrics. The key findings are as follows:

• high-resolution multibeam acoustic data (bathymetry and backscatter) can be used to predict seabed sediment properties and generate continuous layers with satisfactory results;

• predictive modelling using machine learning models allows for a large number of input variables and prediction outside the sampled areas;

• combining multiple machine learning models provides the ability to generate useful prediction uncertainty maps;

• single machine learning models may perform better than combined models;

• choosing an appropriate set of explanatory variables, through a manual feature selection process (Step 2 of the modelling procedure), is a critical step for optimising model performance; and

• variable importance can be identified, which is useful in explaining underlying environmental process and checking prediction against the existing knowledge of the study area.

Acknowledgements

This work has been funded through the Commonwealth Environment Research Facilities (CERF) programme, an Australian Government initiative supporting world class, public good research. The CERF Marine Biodiversity Hub is a collaborative partnership between the University of Tasmania, CSIRO Wealth from Oceans Flagship, Geoscience Australia, Australian Institute of Marine Science and Museum Victoria. We thank Dr. Jin Li and Dr. Hideyasu Shimadzu of Geoscience Australia for their valuable comments on an earlier version of this article. A number of anonymous reviewers are acknowledged for their constructive comments and suggestions that have significantly improved the article. This work is published with permission of the Chief Executive Officer, Geoscience Australia.

References

Beaman, R.J. and Harris, P.T., 2007. Geophysical variables as predictors of megabenthos assemblages from the northern Great Barrier Reef, Australia. In: B.J. Todd and H.G. Greene, eds. Mapping the seafloor for habitat characterization, Geological Association of Canada, Special Paper 47, St Johns, Newfoundland, Canada.

Blott, S.J. and Pye, K., 2001. GRADISTAT: a grain size distribution and statistics package for the analysis of unconsolidated sediments. Earth Surface Processes and Landforms, 26, 1237–1248.

Breiman, L., 2001. Random forests. Machine Learning, 45, 5–32.

Brooke, B., et al., 2009. Carnarvon Shelf survey post-survey report [online]. Geoscience Australia, Record 2009/02, 90 pp. Available from: http://www.ga.gov.au/image_cache/GA13723.pdf [Accessed 10 December 2010].

Bureau of Meteorology, 2010. North West Cape project [online]. Available from: www.cawcr.gov.au/bmrc/wefor/research/nw_cape_project.htm [Accessed 14 April 2010].

Cochrane, G.R. and Lafferty, K.D., 2002. Use of acoustic classification of sidescan sonar data for mapping benthic habitat in the Northern Channel Islands, California. Continental Shelf Research, 22, 683–690.

Cortes, C. and Vapnik, V., 1995. Support-vector networks. Machine Learning, 20, 273–297.

De’ath, G. and Fabricius, K.E., 2000. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology, 81, 3178–3192.

Degraer, S., et al., 2008. Habitat suitability as a mapping tool for macrobenthic community: an example from the Belgian part of the North Sea. Continental Shelf Research, 28, 369–379.

Dunn, D.C. and Halpin, P.N., 2009. Rugosity-based regional modeling of hard-bottom habitat. Marine Ecology Progress Series, 377, 1–11.

Erdey-Heydorn, M.D., 2008. An ArcGIS seabed characterization toolbox developed for investigating benthic habitats. Marine Geodesy, 31, 318–358.

Fisher, P., Wood, J., and Cheng, T., 2004. Where is Helvellyn? Fuzziness of multi-scale landscape morphometry. Transactions of the Institute of British Geographers, 29, 106–128.

Francke, T., Lopez-Tarazon, J.A., and Schroder, B., 2008. Estimation of suspended sediment concentration and yield using linear models, random forests and quantile regression forests. Hydrological Processes, 22, 4892–4904.

Friedman, J.H., 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.

Gahegan, M., 2003. Is inductive machine learning just another wild goose (or might it lay the golden egg)? International Journal of Geographical Information Science, 17, 69–92.

Gavrilov, A.N., Siwabessy, P.J.W., and Parnum, I.M., 2005a. Multibeam echo sounder backscatter analysis: theory review, methods and application to Sydney Harbour swath data. CMST Report 2005-03, Cooperative Research Centre for Coastal Zone, Estuary and Waterway Management.

Gavrilov, A.N., et al., 2005b. Characterization of the seafloor in Australia’s coastal zone using acoustic techniques. In: Proceedings of the international conference in underwater acoustic measurements: technologies & results, 28 June to 1 July 2005, Heraklion, Crete, Greece.

Goff, J.A., Olson, H.C., and Duncan, C.S., 2000. Correlation of side-scan backscatter intensity with grain-size distribution of shelf sediments, New Jersey margin. Geo-Marine Letters, 20, 43–49.

Goff, J.A., et al., 1999. Detailed investigation of continental shelf morphology using a high-resolution swath sonar survey: the Eel margin, northern California. Marine Geology, 154, 255–269.

Haralick, R.M., Shanmugam, K., and Dinstein, I., 1973. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 3, 610–621.

Heap, A.D., et al., 2009. Seabed environments and subsurface geology of the Capel and Faust basins and Gifford Guyot, Eastern Australia – post survey report. Geoscience Australia, Record 2009/22, 166 pp.

Herzfeld, U.C. and Higginson, C.A., 1996. Automated geostatistical seafloor classification – principles, parameters, feature vectors, and discrimination criteria. Computers & Geosciences, 22, 35–41.

Holmes, K.W., et al., 2008. Modelling distribution of marine benthos from hydroacoustics and underwater video. Continental Shelf Research, 28, 1800–1810.

Huang, Z. and Lees, B.G., 2004. Combining non-parametric models for multisource predictive forestmapping. Photogrammetric Engineering and Remote Sensing, 70, 415–426.

Huang, Z. and Lees, B.G., 2005. Representing and reducing error in natural resource classification.International Journal of Geographical Information Science, 19, 603–621.

Huang, Z., et al., 2004. Estimating foliage nitrogen concentration from HYMAP data usingcontinuum removal analysis. Remote Sensing of Environment, 93, 18–29.

Jenness, J.S., 2004. Calculating landscape surface area from digital elevation models. Wildlife SocietyBulletin, 32, 829–839.

Khan, M.S. and Coulibaly, P., 2006. Application of support vector machine in lake water levelprediction. Journal of Hydrologic Engineering, 11, 199–205.

Kloser, R.J., et al., 2001. Remote sensing of seabed types in the Australian South East Fishery; devel-opment and application of normal incident acoustic techniques and associated ‘ground truthing’.Marine and Freshwater Research, 52, 475–489.

Kostylev, V.E., et al., 2001. Benthic habitat mapping on the Scotian Shelf based on multibeambathymetry, surficial geology and sea floor photographs. Marine Ecology Progress Series, 219,121–137.

Lamarche, G., et al., 2011. Quantitative characterisation of seafloor substrate and bedforms usingadvanced processing of multibeam backscatter – application to Cook Strait, New Zealand.Continental Shelf Research, 31, S93–S109.

Lanier, A., Romsos, C., and Goldfinger, C., 2007. Seafloor habitat mapping on the Oregon continen-tal margin: a spatially nested GIS approach to mapping scale, mapping methods, and accuracyquantification. Marine Geodesy, 30, 51–76.

Leathwick, J.R., et al., 2006. Variation in demersal fish species richness in the oceans surroundingNew Zealand: an analysis using boosted regression trees. Marine Ecology Progress Series, 321,267–281.

Leecaster, M., 2003. Spatial analysis of grain size in Santa Monica Bay. Marine EnvironmentalResearch, 56, 67–78.

Lewis, D.W. and McConchie, D., 1984. Analytical sedimentology. New York: Chapman & Hall.Li, J., et al., 2010. Predicting seabed mud content across the Australian margin: comparison of

statistical and mathematical techniques using a simulation experiment. Geoscience Australia,Record 2010/11, 146 pp.

Lucieer, V.L., 2008. Object-oriented classification of sidescan sonar data for mapping benthic marinehabitats. International Journal of Remote Sensing, 29, 905–921.

Lundblad, E.R., et al., 2006. A benthic terrain classification scheme for American Samoa. MarineGeodesy, 29, 89–111.

McArthur, M.A., et al., 2010. On the use of abiotic surrogates to describe marine benthic biodiversity.Estuarine, Coastal and Shelf Science, 88, 21–32.

McBratney, A.B., Mendonca Santos, M.L., and Minasny, B., 2003. On digital soil mapping.Geoderma, 117, 3–52.

McKenzie, N.J. and Ryan, P.J., 1999. Spatial prediction of soil properties using environmentalcorrelation. Geoderma, 89, 67–94.

Mohandes, M.A., et al., 2004. Support vector machines for wind speed prediction. RenewableEnergy, 29, 939–947.

Moran, P.A.P., 1950. Notes on continuous stochastic phenomena. Biometrica, 37, 17–33.Müller, G. and Gastner, M., 1971. The ‘Karbonat-Bombe’, a simple device for the determination of

the carbonate content in sediments, soils, and other materials. Neues Jahrbuch für Mineralogie –Monatshefte, 10, 446–469.

Nichol, S.L., et al., 2009. Southeast Tasmania temperate reef survey, post survey report [online]. Geoscience Australia, Record 2009/43, 73 pp. Available from: http://www.ga.gov.au/image_cache/GA16757.pdf [Accessed 10 December 2010].

Nitsche, F.O., et al., 2004. Process-related classification of acoustic data from the Hudson River Estuary. Marine Geology, 209, 131–145.

Pitcher, C.R., et al., 2007a. Seabed biodiversity on the continental shelf of the Great Barrier Reef World Heritage Area. AIMS/CSIRO/QM/QDPI CRC Reef Research Task Final Report, 320 pp.

Pitcher, C.R., et al., 2007b. Mapping and characterisation of key biotic & physical attributes of the Torres Strait ecosystem. CSIRO/QM/QDPI CRC Torres Strait Task Final Report, 145 pp.

Post, A.L., 2008. The application of physical surrogates to predict the distribution of marine benthic organisms. Ocean & Coastal Management, 51, 161–179.

Post, A.L., Wassenberg, T.J., and Passlow, V., 2006. Physical surrogates for macrofaunal distributions and abundance in a tropical gulf. Marine and Freshwater Research, 57, 469–483.

Rigol, J.P., Jarvis, C.H., and Stuart, N., 2001. Artificial neural networks as a tool for spatial interpolation. International Journal of Geographical Information Science, 15, 323–343.

Siwabessy, P.J.W., et al., 2006. Analysis of statistics of backscatter strength from different seafloor habitats. In: Acoustics 2006, New Zealand: Australian Acoustic Association, 20–22 November 2006, Christchurch, New Zealand, 507–514.

Specht, D.F., 1990. Probabilistic neural networks. Neural Networks, 3, 109–118.

Thouzeau, G., Robert, G., and Ugarte, R., 1991. Faunal assemblages of benthic megainvertebrates inhabiting sea scallop grounds from eastern Georges Bank, in relation to environmental factors. Marine Ecology Progress Series, 74, 61–82.

Tobler, W., 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234–240.

Todd, B.J. and Greene, H.G., 2007. Mapping the seafloor for habitat characterization. St Johns, Newfoundland, Geological Association of Canada Special Paper 47.

Verfaillie, E., Van Lancker, V., and Van Meirvenne, M., 2006. Multivariate geostatistics for the predictive modelling of the surficial sand distribution in shelf seas. Continental Shelf Research, 26, 2454–2468.

Verfaillie, E., et al., 2009. Geostatistical modeling of sedimentological parameters using multi-scale terrain variables: application along the Belgian Part of the North Sea. International Journal of Geographical Information Science, 23, 135–150.

Weiss, A.D., 2001. Topographic position and landforms analysis. In: ESRI international user conference, 9–13 July 2001, San Diego, CA.

Wille, P.C., 2005. Sound images of the ocean in research and monitoring. New York: Springer.

Wilson, M.F.J., et al., 2007. Multiscale terrain analysis of multibeam bathymetry data for habitat mapping on the continental slope. Marine Geodesy, 30, 3–35.

Wood, J., 1996. The geomorphological characterization of digital elevation models. Unpublished PhD thesis. Department of Geography, University of Leicester.

Zieger, S., Stieglitz, T., and Kininmonth, S., 2009. Mapping reef features from multibeam sonar data using multiscale morphometric analysis. Marine Geology, 264, 209–217.
