
Finer grain size increases effects of error and changes influence of environmental predictors on species distribution models


Ecological Informatics 15 (2013) 8–13

Contents lists available at SciVerse ScienceDirect

Ecological Informatics

journal homepage: www.elsevier.com/locate/ecolinf


Brice B. Hanberry ⁎
University of Missouri, 203 Natural Resources Building, Columbia, MO 65211, USA

⁎ Corresponding author. Tel.: +1 573 875 5341x230; fax: +1 573 882 1977. E-mail address: [email protected].

1574-9541/$ – see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.ecoinf.2013.02.003

Article info

Article history:
Received 14 May 2012
Received in revised form 21 February 2013
Accepted 22 February 2013
Available online 5 March 2013

Keywords:
Minimum mapping unit
Pseudoabsences
Resolution
Soil
Topographic
Zoning

Abstract

Spatial resolution and zoning affect models and predictions of species distribution models. I compared grain sizes of 90 m grid cells to ecological units of soil polygons (approximately 209 ha composed of discontinuous polygons of 16 ha), and then introduced error into samples and examined influence of topographic and soil variables. I used random forests, which is a machine learning classifier, and open access data. Predictions based on 90 m grid cells were slightly more accurate than coarser-sized polygons, particularly false positive rates (mean values of 0.11 and 0.16, respectively). The trade-off for accuracy was the number of mapping units required to increase resolution. Probability of presence decreased with resolution. Similarly to grain size comparisons, error affected probability of presence more than accuracy of prediction. Unlike grain size comparisons, the relationship between count of each species (i.e., relative abundance) and area predicted as present was lost with addition of error. Introduction of absences into the modeling sample of presences through plot location error increased probability of presence and introduction of presences into the modeling sample of absences through use of background pseudoabsences decreased probability of presence. Finer resolution amplified the effect of background absences; area predicted for presence was reduced by a factor of 5.4 for grid cells and 1.4 for soil polygons. The choice of fine resolution grid cells or coarser shaped polygons resulted in different models, due to varying influence of topographic variables on models. Use of coarser resolution (tens to hundreds of hectares) may be a worthwhile exchange for greater spatial extent of species distribution models and use of ecologically zoned polygons appeared to avoid the modifiable areal unit problem.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Species distribution models provide continuous maps of survey data. Choice of scale for both study extent and spatial resolution of analysis affects estimates and comparisons among studies (Dungan et al., 2002; Rahbek, 2005). For species distribution models, grain is the area of the spatial unit (or minimum mapping unit) for which probability of presence is predicted (McGeoch and Gaston, 2002). Decreasing grain to a finer area results in lower rates of presence and increasing the grain to a coarser area increases the rates of presence exponentially (He and Gaston, 2000).

Optimal grain size does not have a gold standard and depends on research objectives, taxa, study region, and perhaps most importantly, the grain size of environmental predictors (Guisan and Thuiller, 2005). Although it may appear that the finest grain size would provide the best information about species distributions, models are limited by quality and resolution of the raw survey data (Gottschalk et al., 2011; Lawes and Piper, 1998). In addition, data can contain locational error, which may be worsened by poor matches with fine resolution environmental predictors or corrected by coarser grain (Graham et al., 2004; Guisan et al., 2007). In contrast, too coarse a grain may produce disparities between species and vegetation types or smoothed topographic features, i.e., the oasis effect (Gottschalk et al., 2011; Lawes and Piper, 1998). Furthermore, grain may no longer be relevant if scaled up or down beyond reasonable scales for conservation or management goals (Huettmann and Diamond, 2006).

If spatial resolution is arbitrary, environmental variables will be aggregated into different sizes (scaling) or spatial arrangements (zoning), changing mean and/or variance (the Modifiable Areal Unit Problem; Jelinski and Wu, 1996; Openshaw and Taylor, 1979). Species distribution models are based on a variety of variable types, including topographic variables from digital elevation models and soil variables, with different grain sizes. Variables from digital elevation models appear to have unlimited and modifiable resolution that is systematic rather than ecologically meaningful (Dark and Bram, 2007; Hay et al., 2001). Conversely, discontinuous soil polygons based on similar soil characteristics (i.e., ecologically meaningful) produce soil map units (or basic entities) of large and varied shape and size. Converting soil variables to grid cells or topographic variables to soil polygons will produce distortion (i.e., change the mean and variance). Furthermore, changing the resolution of species distribution models affects importance of environmental predictors (Rahbek and Graves, 2001).
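The scaling and zoning effects described above can be illustrated with a small numerical sketch. The eight values and the three zone layouts below are hypothetical and are not taken from the study; they only show that the same fine-grained values yield different variances depending on how many units are merged (scaling) and which units are merged (zoning).

```python
from statistics import mean, pvariance

# Hypothetical values of one environmental variable in 8 fine-grained cells.
values = [1, 2, 9, 10, 3, 4, 11, 12]

def aggregate(values, zones):
    """Return the mean value of each zone; each zone is a list of cell indices."""
    return [mean(values[i] for i in zone) for zone in zones]

# Scaling: regular blocks of 2 cells, then 4 cells.
blocks_of_2 = [[0, 1], [2, 3], [4, 5], [6, 7]]
blocks_of_4 = [[0, 1, 2, 3], [4, 5, 6, 7]]

# Zoning: also two zones of 4 cells, but grouping cells with similar values.
similar_zones = [[0, 1, 4, 5], [2, 3, 6, 7]]

var_fine = pvariance(values)
var_blocks_2 = pvariance(aggregate(values, blocks_of_2))
var_blocks_4 = pvariance(aggregate(values, blocks_of_4))
var_similar = pvariance(aggregate(values, similar_zones))

print(var_fine, var_blocks_2, var_blocks_4, var_similar)
```

In this toy case the arbitrary blocks of four cells remove almost all of the variance, whereas zones of the same size built from similar cells retain most of it, which is the intuition behind using ecologically zoned units.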


Most research on grain has involved comparisons at very coarse km² scales (e.g., Araújo et al., 2005; Belmaker and Jetz, 2011; McPherson et al., 2006; Seo et al., 2009; van Rensburg et al., 2002), particularly for downscaling atlas distributions. Gottschalk et al. (2011) reviewed six studies of model comparisons at finer resolutions that found little difference in model performance, but in contrast, Gottschalk et al. (2011) themselves showed best model performance for bird species at 1 m grain size and decreased performance with increasing grain size. Although varying grain size alone may not affect all models, I developed modeling scenarios that incorporated changes in grain size, spatial arrangement (zoning), and locational error. I assessed effects of grain and zoning on species distribution models by comparing a grain of 90 m (90 m × 90 m; 0.81 ha) grid cells to ecological units of soil polygons (mean area of 16 ha, SD = 92, but with unique predictions for ecological zones of 209 ha). The absolute difference in spatial resolution was not great; however, use of shaped polygons instead of grid cells is uncommon. I introduced error into samples by 1) mismatching plot location to environmental variables and 2) drawing pseudoabsence samples at random from the background study extent (i.e., contamination of controls; Lancaster and Imbens, 1996). Lastly, in addition to standard accuracy metrics, I examined variation in probability of presence, i.e., area predicted as present, and importance of topographic and soil variables under the different modeling scenarios.

2. Methods

2.1. Grain and extent

Grids were 90 m (90 m × 90 m) cells, resampled from a 30 m (30 m × 30 m) digital elevation model. I used ArcGIS (ESRI, version 10.0, Redlands, CA, USA) and SAS (SAS software, version 9.1, Cary, North Carolina, USA) for all data processing. This grain size balanced accuracy with computational efficiency (i.e., minimal expenditure of memory and processing) because it allowed the shaped soil polygons to be relatively well represented while producing a manageable 6 million cells. In addition, one 90 m cell is slightly less than a hectare, which matches the scale of the survey area and likely would represent the finest scale for management activities. One caveat is that location of the surveyed plots is classified and the USDA Forest Service Forest Inventory and Analysis (FIA; see below for further details) extracted values from four 90 m raster cells around plots; thus, modeling of correctly located plots with a spatial unit of grid cells was based on a cell size of 180 m (180 m × 180 m).

Ecological units were Soil Survey Geographic (SSURGO) Database (USDA Natural Resources Conservation Service, http://soildatamart.nrcs.usda.gov, http://soildatamart.nrcs.usda.gov/SSURGOMetadata.aspx) polygons, attributes of which are grouped by map units (polygons with similar soil characteristics in a county) even though soil polygons are discontinuous. There were about 310,000 polygons with mean area of 16 ha (SD = 92) contained in 2364 map units with mean area of 2071 ha (SD = 4349). Because 2000 ha may be larger than many desired management activities, I reduced unique predictions by calculating the mean value for each topographic variable by an ecological zone based on soil map unit, land type association (an ecological classification), and bedrock geology, which then contained spatially distinct soil polygons that averaged about 209 ha (SD = 598).
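Averaging a topographic variable by ecological zone is a grouped mean. The sketch below assumes each raster cell carries a zone key made of soil map unit, land type association, and bedrock geology; the identifiers and elevation values are hypothetical, and an unweighted mean over cells is an assumption rather than a documented detail of the original processing.

```python
from collections import defaultdict

def zone_means(cells):
    """cells: iterable of (zone_key, value) pairs; returns {zone_key: mean value}."""
    total = defaultdict(float)
    count = defaultdict(int)
    for zone_key, value in cells:
        total[zone_key] += value
        count[zone_key] += 1
    return {key: total[key] / count[key] for key in total}

# Hypothetical cells; zone key = (soil map unit, land type association, bedrock geology).
cells = [
    (("mu_A", "lta_1", "granite"), 410.0),
    (("mu_A", "lta_1", "granite"), 430.0),
    (("mu_A", "lta_2", "granite"), 380.0),
    (("mu_B", "lta_2", "basalt"), 350.0),
    (("mu_B", "lta_2", "basalt"), 370.0),
    (("mu_B", "lta_2", "basalt"), 360.0),
]

# Every soil polygon in a zone would then receive that zone's mean elevation.
elevation_by_zone = zone_means(cells)
print(elevation_by_zone)
```

Splitting map unit mu_A by land type association gives two zones with separate means, which is how the additional keys reduce the area covered by a single prediction.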

The study extent included about 4.9 million ha of the Laurentian Mixed Forest province of Minnesota, which covers about 9.4 million ha (Fig. 1). I removed water and miscellaneous areas disturbed by human development (e.g., mines, pits, dumps). Also, soil surveys have not been completed in seven counties: Cook, Crow Wing, Isanti, Koochiching, Lake, Pine, and part of St. Louis.
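The grain and extent figures in this section fit together arithmetically. The sketch below uses only numbers quoted in the text (90 m cells, a 4.9 million ha extent, about 310,000 soil polygons); the variable names are illustrative.

```python
M2_PER_HA = 10_000

def cell_area_ha(side_m):
    """Area in hectares of a square cell with the given side length in metres."""
    return side_m * side_m / M2_PER_HA

cell_ha = cell_area_ha(90)           # one 90 m grid cell
block_side_m = 2 * 90                # four 90 m cells form a 2 x 2 block
block_ha = cell_area_ha(block_side_m)

extent_ha = 4.9e6                    # study extent quoted in the text
n_cells = extent_ha / cell_ha        # approximate number of 90 m cells
n_polygons = 310_000                 # soil polygons quoted in the text
polygon_share = n_polygons / n_cells # mapping units relative to grid cells

print(cell_ha, block_side_m, block_ha)
print(round(n_cells / 1e6, 2), round(polygon_share, 3))
```

A 90 m cell covers 0.81 ha, the extent holds roughly 6 million such cells, and the soil polygons amount to about 5% as many mapping units.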

2.2. Environmental variables

I used seven soil and seven topographic predictor variables that characterized tree occurrence. Seven variables were from soil tables by map unit for each county (i.e., polygons with similar soil characteristics in a county). Categorical soil variables were 1) drainage class (very poorly drained to excessively drained) and 2) hydric soil presence class. I then calculated five continuous soil variables, 1) mean water holding capacity (cm/cm), 2) pH, 3) organic matter (%), 4) clay (%), and 5) sand (%) to the depth, and weighted values by component percentage. From a 30 m DEM (digital elevation model; http://www.mngeo.state.mn.us/chouse/metadata/dem24ras.html), I calculated seven continuous terrain variables: 1) elevation (m), 2) slope (%), 3) transformed aspect (1 + sin(aspect / (180 / 3.14) + 0.79); Beers et al., 1966), 4) solar radiation (0700 to 1900 in 4 hour intervals on summer solstice for re-sampled 60 m DEM), 5) topographic roughness (Sappington et al., 2007), 6) wetness convergence (T. Dilts, http://arcscripts.esri.com), and 7) topographic position index.
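The transformed aspect can be written as a short function. This sketch assumes the standard form of the Beers et al. (1966) transformation, 1 + cos(45° − aspect), and shows that the sine expression with the aspect converted to radians and shifted by π/4 (about 0.79) is the same quantity; the function names are illustrative.

```python
import math

def beers_aspect(aspect_deg):
    """Transformed aspect, 1 + cos(45 deg - aspect): 2 at northeast, 0 at southwest."""
    return 1.0 + math.cos(math.radians(45.0 - aspect_deg))

def beers_aspect_sine_form(aspect_deg):
    """Equivalent sine form: aspect in radians plus pi/4 (about 0.79)."""
    return 1.0 + math.sin(math.radians(aspect_deg) + math.pi / 4)

for aspect in (0, 45, 135, 225, 315):
    print(aspect, round(beers_aspect(aspect), 3), round(beers_aspect_sine_form(aspect), 3))
```

The transformation turns circular aspect into a linear variable on a scale of 0 to 2, so that 359° and 1° receive nearly identical values.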

2.3. Tree surveys

The USDA Forest Service Forest Inventory and Analysis (FIA) surveys fixed plots, consisting of four subplots that are each 7.3 m in radius (i.e., each subplot is 167 m²), during a five year cycle. I used FIA plots from the latest complete cycle during 2004–2008. The USDA Forest Service joined our predictor variables to plots in a table based on accurate spatial locations but, to keep plot locations secure, was able to return information for only 1489 plots. I selected the most common overstory trees (count ≥ 250), about 35,000 trees for 15 species or genus groups. I grouped these species into the following categories: American basswood (Tilia americana); ashes (Fraxinus nigra, Fraxinus pennsylvanica, Fraxinus americana); aspens (Populus tremuloides, Populus grandidentata, Populus balsamifera); balsam fir (Abies balsamea); birch (Betula papyrifera); eastern white pine (Pinus strobus); elms (Ulmus americana, Ulmus rubra, Ulmus thomasii); jack pine (Pinus banksiana); maples (Acer rubrum, Acer saccharum, Acer saccharinum); northern white cedar (Thuja occidentalis); red oaks (Quercus nigra, Quercus ellipsoidalis, Quercus rubra); red pine (Pinus resinosa); spruces (Picea mariana, Picea glauca); tamarack (Larix laricina); and white oaks (Quercus alba, Quercus macrocarpa).

2.4. Error for present and absent cases

The available FIA plot locations are perturbed to protect landowner privacy. To introduce locational error, I used available plots downloaded from FIA DataMart (http://apps.fs.fed.us/fiadb-downloads/datamart.html, http://fia.fs.fed.us/library/database-documentation). There were 3994 plots that intersected with our spatial units. Although there were more trees available, I kept sample size for modeling the same as when modeling with reduced samples.

I generated pseudoabsences using two strategies. I selected pseudoabsences 1) from the background or entire study extent, without exclusion of polygons with tree presence (because presence was unknown for the entire extent), and 2) from plots surveyed without detection of species presence. Absences from the entire extent, i.e., background absences, are more likely to contaminate the pseudoabsent sample with present cases than sampling from surveyed plots.
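The difference between the two pseudoabsence strategies can be sketched as two sampling pools. Everything below is hypothetical: the units, the "present" and "surveyed" flags, and the 30% true prevalence are invented for illustration, and the true presence of unsurveyed units would be unknown in practice.

```python
import random

def draw_pseudoabsences(units, n, rng, strategy):
    """Draw n pseudoabsences from the background or from surveyed absences."""
    if strategy == "background":
        pool = list(units)  # entire extent, presences not excluded
    elif strategy == "surveyed":
        pool = [u for u in units if u["surveyed"] and not u["present"]]
    else:
        raise ValueError("strategy must be 'background' or 'surveyed'")
    return rng.sample(pool, n)  # sampling without replacement

def contamination(sample):
    """Fraction of the pseudoabsence sample where the species is truly present."""
    return sum(u["present"] for u in sample) / len(sample)

# Hypothetical extent: 1000 spatial units, 30% truly occupied, 20% surveyed.
units = [{"id": i, "present": i % 10 < 3, "surveyed": i % 5 == 0} for i in range(1000)]

rng = random.Random(42)
background = draw_pseudoabsences(units, 50, rng, "background")
surveyed = draw_pseudoabsences(units, 50, rng, "surveyed")

print(contamination(background), contamination(surveyed))
```

Surveyed absences contain no present cases by construction, whereas the expected contamination of the background sample equals the species' true prevalence in the extent.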

2.5. Modeling scenarios

Scenario 1: I separately joined soil and topographic variables, retaining intrinsic resolution and zoning, to plots for modeling and then predicted to a) 90 m grid cells with aggregated soil variables and b) soil polygons with aggregated topographic variables.

Scenario 2: I modeled and predicted species using soil polygons as the grain and aggregated topographic variables.

Scenario 3: I modeled and predicted species using available plots with perturbed location, creating mismatches between species occurrence and environmental variables, and with a) grid cells and b) soil polygons as the grain. Despite having a larger sample size from the available plots, I kept the sample size for modeling the same for all scenarios because increased sample size decreases values for predicted probabilities.

Fig. 1. Study area in black, a portion of Laurentian Mixed Forest (shaded) in Minnesota.

Scenario 4: I modeled and predicted species using pseudoabsences from the background extent, contaminating the absence sample, and with a) grid cells and b) soil polygons as the grain.

2.6. Modeling and prediction

I randomly selected 67% of spatial units (grid cells or polygons) with the species, up to 2500 polygons, for modeling, and held back the rest for validation. I used the same sample size based on the minimum number of species present (from the 1489 plots joined by the USDA Forest Service) for all modeling scenarios. For modeling, I used random forest (Breiman, 2001; Cutler et al., 2007; other algorithms are available) classification, which improves on decision trees by growing multiple trees in parallel and using random subsets of both predictor variables and training data. Classification results from bootstrap aggregation (bagging) by the majority vote of the many trees. I used the randomForest package (Liaw and Wiener, 2002) in R statistical software (R Development Core Team, 2010) with the sampsize option (sampling without replacement), and set the bag fraction, or subsampling rate, at 67% of the selected polygons with the species. To focus modeling on prediction of presence more than absence, I then specified a modeling prevalence, or ratio of present cases to total cases, of 0.8. I set the number of trees at 1000 and the number of variables randomly sampled at each split as the square root of the number of predictors.
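The random forest settings above imply per-tree sample sizes that can be worked out by hand. The sketch below is one plausible reading and not the author's script: it assumes the presence sample per tree is 67% of the modeling presences, that absences are added until presences make up 0.8 of the per-tree sample, and that the square root of the 14 predictors is rounded down (the randomForest default for classification). The function name, argument names, and the input of 300 presence units are hypothetical.

```python
import math

def rf_settings(n_presence_units, n_predictors=14, bag_fraction=0.67,
                prevalence=0.8, n_trees=1000):
    """Derive per-tree sample sizes and the number of variables tried per split."""
    n_pres = int(bag_fraction * n_presence_units)          # presences per tree
    n_abs = round(n_pres * (1 - prevalence) / prevalence)  # absences per tree
    mtry = int(math.sqrt(n_predictors))                    # floor of sqrt(p)
    return {"n_trees": n_trees, "sampsize_presence": n_pres,
            "sampsize_absence": n_abs, "mtry": mtry}

settings = rf_settings(n_presence_units=300)  # hypothetical count of presence units
achieved = settings["sampsize_presence"] / (
    settings["sampsize_presence"] + settings["sampsize_absence"])

print(settings, round(achieved, 3))
```

With 300 presence units this gives 201 presences and 50 absences per tree, and three candidate variables at each split.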

2.7. Validation and comparisons

I used the ROCR package (Sing et al., 2005) in R to calculate the true positive rate (or sensitivity) at a consistent 75% threshold for all species distribution models (i.e., statistical threshold close to the modeling prevalence (Liu et al., 2005); the threshold establishes the division between presence and absence of species). I also calculated false positive rate at a 75% threshold and AUC values using a reserved fraction of surveyed sites that did not contain a record of the species. I determined area predicted as present as a fraction of the total area. I quantified the mean of predicted probabilities and correlation (Proc Corr; SAS software, Version 9.1, Cary, North Carolina, USA) among predicted probabilities. I ran a regression (Proc Reg; SAS software, Version 9.1, Cary, North Carolina, USA) between the count of the tree species and area predicted as present. I found the mean rank and value of soil and topographic variables.
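The validation metrics can be sketched without any package. The code below is a minimal illustration in plain Python and is not the ROCR implementation; the probabilities and areas are hypothetical, and counting a probability equal to the threshold as present is an assumption.

```python
def rates_at_threshold(presence_probs, absence_probs, threshold=0.75):
    """True and false positive rates when probability >= threshold means present."""
    tpr = sum(p >= threshold for p in presence_probs) / len(presence_probs)
    fpr = sum(p >= threshold for p in absence_probs) / len(absence_probs)
    return tpr, fpr

def auc(presence_probs, absence_probs):
    """AUC as the share of presence-absence pairs ranked correctly (ties count half)."""
    wins = sum(1.0 if p > a else 0.5 if p == a else 0.0
               for p in presence_probs for a in absence_probs)
    return wins / (len(presence_probs) * len(absence_probs))

def area_fraction_present(probs, areas, threshold=0.75):
    """Area of units predicted present as a fraction of total area."""
    present = sum(a for p, a in zip(probs, areas) if p >= threshold)
    return present / sum(areas)

# Hypothetical predicted probabilities for reserved presence and absence sites.
presence_probs = [0.90, 0.80, 0.60, 0.95]
absence_probs = [0.10, 0.80, 0.30, 0.20, 0.40]

tpr, fpr = rates_at_threshold(presence_probs, absence_probs)
auc_value = auc(presence_probs, absence_probs)
area = area_fraction_present([0.9, 0.5, 0.8], [10.0, 30.0, 10.0])

print(tpr, fpr, auc_value, area)
```

Weighting by area matters for polygons, because units of unequal size make the fraction of units predicted present differ from the fraction of area predicted present.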

3. Results

3.1. Scenarios 1 and 2

Scaling up and changing the spatial arrangement of topographic variables affected prediction accuracy if grid cells were modified after modeling (Table 1). That is, after modeling species with non-aggregated variables (i.e., variables were joined separately to plots), predictions for ecological units (soil polygons) with aggregated topographic variables performed poorly (mean true positive rate of 0.79) compared to predictions for grid cells with aggregated soil variables (mean true positive rate of 0.95). However, if topographic variables were aggregated to ecological units prior to modeling, predictions to ecological units (soil polygons) with aggregated topographic variables performed well (mean true positive rate of 0.94), albeit slightly less well than grid cell predictions with aggregated soil variables (false positive rate of 0.16 compared to 0.11, respectively). In addition, as resolution became finer, area of predicted presence decreased (mean area of 0.14 for grid cells and mean area of 0.22 for soil polygons). Although there is no method to correlate models with different mapping units without aggregating predicted probabilities, mean correlation was 0.78 between predictions for grid cells and soil polygons.

3.2. Scenarios 3 and 4

Coarser grain did not resolve locational error of plots better than finer resolution. Error due to incorrect plot location slightly decreased accuracy (e.g., mean true positive rate of 0.90 for grid cells and 0.87 for soil polygons; mean false positive rate of 0.09 for grid cells and 0.21 for soil polygons). Error also increased area predicted for presence by a factor of 1.5 for grid cell grain (mean area of 0.21 compared to 0.14 for grid cells with correct plot location) and a factor of 1.3 for soil polygon grain (mean area of 0.29 compared to 0.22 for polygons with correct plot location). Mean correlation of predicted probabilities for all species between predictions with and without errors for grid cells and soil polygons ranged from 0.51 to 0.66.

Table 1
Area (fraction of total), true positive rate and false positive rate at a 75% threshold, and AUC values (using reserved plots without present cases as pseudoabsences), for species distribution models from grid cells that have predictions using aggregated soil variables, ecological units (soil polygons) that have predictions using aggregated topographic variables, and ecological units (soil polygons) that have models and predictions using aggregated topographic variables.

              Grid cells                 Ecological units
              Predictions with           Predictions with           Aggregated
              aggregated soil            aggregated topographic     topographic
              Area  TPR   FPR   AUC      Area  TPR   FPR   AUC      Area  TPR   FPR   AUC
Ashes         0.23  0.98  0.16  0.98     0.36  0.64  0.21  0.80     0.36  0.97  0.28  0.94
Aspens        0.55  0.98  0.24  0.98     0.65  0.92  0.21  0.90     0.58  0.98  0.21  0.95
Balsam fir    0.07  0.96  0.27  0.98     0.13  0.56  0.10  0.87     0.16  0.97  0.04  0.99
Basswood      0.11  0.97  0.08  0.99     0.27  0.89  0.18  0.91     0.20  0.97  0.14  0.97
Birch         0.16  0.91  0.15  0.96     0.36  0.73  0.29  0.80     0.32  0.91  0.30  0.89
Elms          0.23  0.88  0.17  0.94     0.36  0.81  0.30  0.87     0.30  0.88  0.24  0.92
Jack pine     0.09  0.97  0.08  0.99     0.14  0.93  0.10  0.97     0.11  0.96  0.06  0.99
Maples        0.12  0.95  0.06  0.99     0.23  0.71  0.14  0.89     0.22  0.94  0.17  0.95
Red oaks      0.11  0.90  0.02  1.00     0.29  0.93  0.06  0.93     0.21  0.92  0.04  0.99
Red pine      0.06  0.98  0.07  0.98     0.11  0.69  0.24  0.92     0.08  0.98  0.14  0.96
Spruces       0.07  0.96  0.11  0.98     0.14  0.73  0.16  0.89     0.14  0.94  0.16  0.95
Tamarack      0.09  0.98  0.14  0.99     0.19  0.90  0.13  0.93     0.19  0.98  0.13  0.96
White cedar   0.02  0.98  0.02  1.00     0.05  0.71  0.05  0.92     0.05  0.98  0.05  0.98
White oaks    0.15  0.90  0.03  0.98     0.41  0.86  0.26  0.89     0.33  0.90  0.27  0.94
White pine    0.08  0.87  0.09  0.96     0.16  0.85  0.16  0.92     0.11  0.85  0.11  0.94
Mean          0.14  0.95  0.11  0.98     0.26  0.79  0.17  0.89     0.22  0.94  0.16  0.95

Random absences from the background extent, as opposed to surveyed absences, potentially contaminate the absence sample for modeling. Although background absences did not affect accuracy of models, background absences reduced predicted probabilities (Table 2; for grid cells, mean predicted probabilities using background absences were 47% of values using surveyed absences and for soil polygons, mean predicted probabilities using background absences were 62% of values using surveyed absences). Consequently, area predicted for presence was reduced, to 18% of the area using surveyed absences for grid cells and to 73% of area using surveyed absences for soil polygons. In addition, relationship between area of predicted presence and count of each species (i.e., relative abundance) was lost. The R² values for a regression (y = a + b₁X + b₂X²) between count and area predicted as present were 0.81 (grid cell grain) and 0.76 (polygon grain) for surveyed absences and for background pseudoabsences, R² values were 0.11 (grid cell grain) and 0.25 (polygon grain).

3.3. Variable importance

Soil variables were more influential than topographic variables at finer resolutions, whereas at coarser resolutions, soil and topographic variables were almost equally important (Table 3). With the addition of locational error, variables became more important than for models without error and the overall values of variables increased. Also, for the polygon grain, soil became less influential than topographic variables. With the use of background absences for a grain of grid cells, the influence of environmental variables became similar and thus, less influential.

Change in importance of one topographic variable appeared to cause differences between models with grid cell and polygon grains. The greatest differences in false positive rate accuracy between grid cell and polygon grain predictions were for balsam fir and white oaks (grid cell predictions were 0.23 worse and 0.23 better, respectively). For balsam fir, solar radiation was the third most influential variable in the model with the polygon grain and the 10th most influential variable in the model with the grid cell grain. For white oaks, aspect was the third most influential variable in the model with the polygon grain and the 12th most influential variable in the model with the grid cell grain. Red oaks performed slightly better using the polygon grain; topographic convergence (wetness) was the third most influential variable in the model with the polygon grain and the 10th most influential variable in the model with the grid cell grain. Conversely, spruces performed slightly worse using the polygon grain; topographic roughness was the fifth most influential variable in the model with the polygon grain and the 14th most influential variable in the model with the grid cell grain.
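Mean rank by variable type, as summarized in Table 3, can be sketched for a single model as follows. The importance values are hypothetical and the variable names are shorthand for the 14 predictors; how ranks were averaged across species in the original analysis is not stated, so this only illustrates the ranking step.

```python
def mean_rank_by_group(importance, groups):
    """Rank variables by importance (1 = most influential) and average ranks per group."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    rank = {name: position for position, name in enumerate(ordered, start=1)}
    return {group: sum(rank[name] for name in names) / len(names)
            for group, names in groups.items()}

groups = {
    "soil": ["drainage", "hydric", "water_holding", "ph", "organic_matter", "clay", "sand"],
    "topo": ["elevation", "slope", "aspect", "solar", "roughness", "convergence", "tpi"],
}

# Hypothetical scaled importance values for one species model.
importance = {
    "drainage": 0.90, "hydric": 0.40, "water_holding": 0.70, "ph": 0.80,
    "organic_matter": 1.00, "clay": 0.50, "sand": 0.60,
    "elevation": 0.75, "slope": 0.30, "aspect": 0.20, "solar": 0.35,
    "roughness": 0.25, "convergence": 0.45, "tpi": 0.15,
}

mean_ranks = mean_rank_by_group(importance, groups)
print(mean_ranks)
```

Because the ranks 1 to 14 average 7.5, the soil and topographic mean ranks of two equal-sized groups always sum to 15, which is also true of every rank pair in Table 3 (for example, 6.24 and 8.76).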

Table 2
Area (fraction of total) and true positive rate at a 75% threshold for species distribution models from grid cells and ecological units (soil polygons) using background pseudoabsences. Mean predicted probability for species distribution models from grid cells and ecological units (soil polygons) using background and surveyed pseudoabsences.

              Grid cells                                   Ecological units
              Background                 Surveyed          Background                 Surveyed
              Area  TPR   Probability    Probability       Area  TPR   Probability    Probability
Ashes         0.01  0.98  0.14           0.56              0.15  0.97  0.28           0.62
Aspens        0.02  0.98  0.16           0.71              0.25  0.96  0.33           0.73
Balsam fir    0.01  0.99  0.14           0.46              0.17  0.98  0.27           0.45
Basswood      0.04  0.97  0.26           0.38              0.14  0.97  0.31           0.47
Birch         0.03  0.97  0.21           0.56              0.25  0.94  0.34           0.60
Elms          0.06  0.96  0.41           0.56              0.23  0.91  0.49           0.62
Jack pine     0.01  0.98  0.17           0.27              0.07  0.98  0.18           0.27
Maples        0.04  0.97  0.19           0.40              0.15  0.96  0.27           0.48
Red oaks      0.01  0.96  0.25           0.37              0.19  0.92  0.33           0.45
Red pine      0.08  0.99  0.14           0.27              0.07  0.99  0.17           0.26
Spruces       0.00  0.98  0.10           0.36              0.19  0.98  0.18           0.30
Tamarack      0.00  0.99  0.06           0.29              0.17  0.99  0.14           0.24
White cedar   0.00  0.99  0.08           0.22              0.11  0.99  0.13           0.18
White oaks    0.04  0.91  0.31           0.49              0.18  0.90  0.41           0.61
White pine    0.05  0.89  0.36           0.42              0.11  0.84  0.36           0.45
Mean          0.03  0.97  0.20           0.42              0.16  0.95  0.28           0.45
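The reductions reported in Section 3.2 can be approximately recovered from the tabulated values. The sketch below copies the per-species area columns from Tables 1 and 2 and the mean probabilities from Table 2; because the tables are rounded to two decimals, the results match the reported 18%, 73%, 47%, and 62% only approximately.

```python
# Area predicted as present (fraction of total), per species, in table order.
grid_surveyed = [0.23, 0.55, 0.07, 0.11, 0.16, 0.23, 0.09, 0.12,
                 0.11, 0.06, 0.07, 0.09, 0.02, 0.15, 0.08]    # Table 1, grid cells
grid_background = [0.01, 0.02, 0.01, 0.04, 0.03, 0.06, 0.01, 0.04,
                   0.01, 0.08, 0.00, 0.00, 0.00, 0.04, 0.05]  # Table 2, grid cells
poly_surveyed = [0.36, 0.58, 0.16, 0.20, 0.32, 0.30, 0.11, 0.22,
                 0.21, 0.08, 0.14, 0.19, 0.05, 0.33, 0.11]    # Table 1, polygons modeled with aggregated topography
poly_background = [0.15, 0.25, 0.17, 0.14, 0.25, 0.23, 0.07, 0.15,
                   0.19, 0.07, 0.19, 0.17, 0.11, 0.18, 0.11]  # Table 2, polygons

def reduction(surveyed, background):
    """Return (background area as a share of surveyed area, reduction factor)."""
    share = sum(background) / sum(surveyed)
    return share, 1.0 / share

grid_share, grid_factor = reduction(grid_surveyed, grid_background)
poly_share, poly_factor = reduction(poly_surveyed, poly_background)

# Mean predicted probabilities from the Mean row of Table 2.
grid_prob_share = 0.20 / 0.42
poly_prob_share = 0.28 / 0.45

print(round(grid_share, 2), round(grid_factor, 1))
print(round(poly_share, 2), round(poly_factor, 1))
print(round(grid_prob_share, 2), round(poly_prob_share, 2))
```

The much larger reduction factor for grid cells than for polygons is the amplification of background-absence effects at finer resolution described in the text.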


4. Discussion

I expected little difference in measured accuracy between the two grain sizes despite zoning given that the absolute resolution difference was relatively small (unlike comparisons between 50 m × 50 m and 50 km × 50 km, which smoothed montane features; Trivedi et al., 2008 or similarly Randin et al., 2009). Both of the grain sizes and arrangements produced accurate predictions; however, predictions based on 90 m grid cells were slightly more accurate. By taxa, Guisan et al. (2007) documented no loss of performance when increasing grain size from combined 100 m or 1 km grid cells to 1 km or 10 km, thus loss of performance possibly arose from incorrect ecological zoning (i.e., shape) of the soil polygons. The trade-off for slightly greater accuracy was computational efficiency; ecological units of polygons had about 5% of the mapping units compared to grid cells, easing the burden on computer processing and memory limits. In addition, it may be difficult to represent rare groups of categorical variables randomly in modeling samples (limit of about 5000) and sets of predictions (limit of about 1.5 million) when sample size is in the millions. I also expected that probability of presence would decrease for the smaller grain size of 90 m grid cells compared to ecological units composed of discontinuous soil polygons and indeed, probability of presence decreased with resolution. Presence of a species in 58% of the mapping units would be nonsensical if grain size were approximately the size of a tree growing space (i.e., 1 to 10 m²).

Table 3. Importance of soil and topographic (topo) variables by rank and scaled value (values closer to 1 are more influential) for species distribution models from grid cells and ecological units (soil polygons) using correct plot locations and surveyed absences, incorrect plot locations, and random background pseudoabsences.

| Modeling sample | Grid cells, rank: soil | Grid cells, rank: topo | Grid cells, value: soil | Grid cells, value: topo | Ecological units, rank: soil | Ecological units, rank: topo | Ecological units, value: soil | Ecological units, value: topo |
|---|---|---|---|---|---|---|---|---|
| Correct location, surveyed absences | 6.24 | 8.76 | 0.53 | 0.38 | 7.70 | 7.30 | 0.48 | 0.51 |
| Incorrect plot location | 6.82 | 8.18 | 0.64 | 0.54 | 8.60 | 6.40 | 0.51 | 0.65 |
| Background absences | 7.66 | 7.34 | 0.28 | 0.33 | 7.41 | 7.59 | 0.49 | 0.49 |

Introduced error affected area of predicted presence and importance of environmental variables more than accuracy, similarly to grain size effects; nonetheless, correlation among predictions with and without error from plot location or pseudoabsence generation strategy was low. Coarser resolution did not improve the mismatch between species occurrence and environmental variables. Error from plot location effectively contaminated the present cases with absent cases, by including environmental variables where species should be absent in the sample of presences. Error made the presence of species more probable, inflating predicted area of presence. However, increased commission error, or predicted presence when there should be absence, was not clearly shown by false positive rates. In contrast, use of background absences that contained the species reduced the area predicted for presence, even though this also did not decrease true positive rates. Furthermore, finer grain size magnified this effect compared to coarser grain size. Background absences also disrupted the relationship between relative abundance and area. Models using background absences at a finer resolution particularly will produce this problem, because the most common species (aspen) had a smaller distribution than seven other species and the next two most common species (spruces and tamarack) had the smallest distributions of all species. Contamination of absences appeared to be more severe than contamination of presences, but this may reflect the amount of exposure to error rather than a property of the type of contamination.
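The opposite directions of the two contamination types can be shown with simple arithmetic. The Python sketch below is an illustration under strong assumptions, not the random forests workflow of this study: a hypothetical landscape has a fraction `h` of suitable habitat, the species occupies a fraction `q` of that habitat and none elsewhere, and the share of the modeling sample labeled present within each class stands in for what a classifier would learn. All names and parameter values are hypothetical.

```python
def presence_fraction(n_presences, n_absences):
    """Share of sample records labeled present (0 if the class is empty)."""
    total = n_presences + n_absences
    return n_presences / total if total else 0.0


def sample_fractions(h, q, n=1000, location_error=0.0, background=False):
    """Return (fraction present in habitat, fraction present outside habitat).

    h: fraction of the landscape that is suitable habitat (hypothetical)
    q: true prevalence inside habitat; prevalence outside is zero (hypothetical)
    n: number of presence records and, equally, of absence records
    location_error: share of presence records displaced to a random cell
    background: draw absences from all cells rather than surveyed absent cells
    """
    # Correctly located presences fall in habitat; displaced ones land anywhere.
    pres_in = n * (1 - location_error) + n * location_error * h
    pres_out = n * location_error * (1 - h)
    if background:
        # Background points ignore true occupancy, so a share h falls in habitat.
        abs_in = n * h
    else:
        # Surveyed absences come only from unoccupied cells:
        # unoccupied habitat h(1 - q) out of all unoccupied cells 1 - hq.
        abs_in = n * h * (1 - q) / (1 - h * q)
    abs_out = n - abs_in
    return presence_fraction(pres_in, abs_in), presence_fraction(pres_out, abs_out)


if __name__ == "__main__":
    print("surveyed absences:  ", sample_fractions(0.3, 0.6))
    print("background absences:", sample_fractions(0.3, 0.6, background=True))
    print("location error 20%: ", sample_fractions(0.3, 0.6, location_error=0.2))
```

With these illustrative values, background absences lower the presence share inside suitable habitat, while location error raises the presence share outside it from zero, matching the direction of the effects described above.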

Topography became less influential with finer grain size, apparently due to the decreased influence of one topographic variable, which varied among models. It is not possible to determine whether the one topographic variable is a genuine influence or a spurious effect due to aggregation of spatial units. Even at much coarser and smoothed scales, the influence of topography also decreased with finer grain size (Rahbek and Graves, 2001). Minnesota has little topographic relief and it is possible that areas with greater heterogeneity in topography may produce a different relationship between topography and grain size.

Unlike Belmaker and Jetz (2011) at much coarser scales, I did not find that environmental variables became less important at finer resolutions. Instead, error from background absences decreased the importance of variables at finer resolutions. Nonetheless, influence by type of variables varies with large changes in scale, and even fine scale climate data, modified by topography, produce different models than coarse scale climate data (Franklin et al., 2013).
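Variable importance reported as a rank and a scaled value, as in Table 3, can be summarized along the following lines. This Python sketch is an assumption about the summary, not the procedure used here: it scales raw importance by the largest value in a model so the top variable equals 1, ranks variables from most to least important, and averages within soil and topographic groups. The variable names and raw importance values are hypothetical.

```python
def scale_and_rank(importance):
    """Map each variable to its rank (1 = most important) and importance / max."""
    top = max(importance.values())
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: {"rank": position + 1, "scaled": importance[name] / top}
            for position, name in enumerate(ordered)}


def group_means(summary, groups):
    """Average rank and scaled value of the variables in each group."""
    means = {}
    for group, names in groups.items():
        count = len(names)
        means[group] = {
            "rank": sum(summary[name]["rank"] for name in names) / count,
            "scaled": sum(summary[name]["scaled"] for name in names) / count,
        }
    return means


if __name__ == "__main__":
    # Hypothetical raw importance values for a single species model.
    raw = {"drainage": 40.0, "texture": 25.0, "slope": 30.0, "aspect": 10.0}
    groups = {"soil": ["drainage", "texture"], "topo": ["slope", "aspect"]}
    print(group_means(scale_and_rank(raw), groups))
```

A lower mean rank and a scaled value closer to 1 both indicate a more influential group, which is how the two columns of Table 3 are read.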

5. Conclusion

Grain and zoning differences produced different models rather than affecting performance. Therefore, use of coarser resolution (tens to hundreds of hectares instead of fractions of hectares) may be a worthwhile trade in exchange for analyzing a greater spatial extent given limited computer resources, and use of ecologically zoned polygons appeared to avoid the modifiable areal unit problem. Grain and zoning affected species distribution models by varying the influence of topographic variables, and perhaps specifically just one topographic variable per model. Error increased (by plot location error) or decreased (by background absences for fine resolution grid cells) influence of environmental variables, and presumably known error produced poorer models even though error did not affect performance. Statistical methods, or at least random forests, appeared to handle inclusion of error robustly.

Maps of finer resolution decreased area of predicted presence as a consequence of reduced mapping unit size. Error had a greater effect on area of predicted presence than resolution because effect by error was not monotonic. Use of background absences decoupled the relationship between relative abundance and area, which may produce distribution maps that appear similar among species. Finer resolution intensified reduction of area and disruption of correlation between abundance and area.

Acknowledgments

I thank R. McCullough and the USDA Forest Service, National Fire Plan.

References

Araújo, M.B., Thuiller, W., Williams, P.H., Reginster, I., 2005. Downscaling European species atlas distributions to a finer resolution: implications for conservation planning. Global Ecology and Biogeography 14, 17–30.

Beers, T.W., Dress, P.E., Wensel, L.C., 1966. Aspect transformation in site productivity research. Journal of Forestry 64, 691–692.

Belmaker, J., Jetz, W., 2011. Cross-scale variation in species richness–environment associations. Global Ecology and Biogeography 20, 464–474.

Breiman, L., 2001. Random forests. Machine Learning 45, 5–32.

Cutler, D.R., Edwards Jr., T.C., Beard, K.H., Cutler, A., Hess, K.T., Gibson, J., Lawler, J.J., 2007. Random forests for classification in ecology. Ecology 88, 2783–2792.

Dark, S.J., Bram, D., 2007. The modifiable areal unit problem (MAUP) in physical geography. Progress in Physical Geography 31, 471–479.

Dungan, J.L., Perry, J.N., Dale, M.R.T., Legendre, P., Citron-Posey, S., Fortin, M.-J., Jakomulska, A., Miriti, M., Rosenberg, M.S., 2002. A balanced view of scale in spatial statistical analysis. Ecography 25, 626–640.

Franklin, J., Davis, F.W., Ikegami, M., Syphard, A.D., Flint, L.E., Flint, A.L., Hannah, L., 2013. Modeling plant species distributions under future climates: how fine scale do climate projections need to be? Global Change Biology 19, 473–483.

Gottschalk, T.K., Aue, B., Hotes, S., Ekschmitt, K., 2011. Influence of grain size on species–habitat models. Ecological Modelling 222, 3403–3412.

Graham, C.H., Ferrier, S., Huettman, F., Moritz, C., Peterson, A.T., 2004. New developments in museum-based informatics and applications in biodiversity analysis. Trends in Ecology & Evolution 19, 497–503.

Guisan, A., Thuiller, W., 2005. Predicting species distribution: offering more than simple habitat models. Ecology Letters 8, 993–1009.

Guisan, A., Graham, C.H., Elith, J., Huettmann, F., NCEAS Species Distribution Modelling Group, 2007. Sensitivity of predictive species distribution models to change in grain size. Diversity and Distributions 13, 332–340.

Hay, G.J., Marceau, D.J., Dubé, P., 2001. A multiscale framework for landscape analysis: object-specific analysis and upscaling. Landscape Ecology 16, 471–490.

He, F., Gaston, K.J., 2000. Occupancy–abundance relationships and sampling scales. Ecography 23, 503–511.

Huettmann, F., Diamond, A.W., 2006. Large-scale effects on the spatial distribution of seabirds in the Northwest Atlantic. Landscape Ecology 21, 1089–1108.

Jelinski, D.E., Wu, J., 1996. The modifiable areal unit problem and implications for landscape ecology. Landscape Ecology 11, 129–140.

Lancaster, T., Imbens, G., 1996. Case–control studies with contaminated controls. Journal of Econometrics 71, 145–160.

Lawes, M.J., Piper, S.E., 1998. There is less to binary maps than meets the eye: the use of species distribution data in the southern African sub-region. South African Journal of Science 94, 207–210.

Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22.

Liu, C., Berry, P.M., Dawson, T.P., Pearson, R.G., 2005. Selecting thresholds of occurrence in the prediction of species distributions. Ecography 28, 385–393.

McGeoch, M.A., Gaston, K.J., 2002. Occupancy frequency distributions: patterns, artefacts and mechanisms. Biological Reviews 77, 311–331.

McPherson, J., Jetz, W., Rogers, D., 2006. Using coarse-grained occurrence data to predict species distributions at finer spatial resolutions – possibilities and limitations. Ecological Modelling 192, 499–522.

Openshaw, S., Taylor, P., 1979. A million or so correlation coefficients: three experiments on the modifiable areal unit problem. In: Wrigley, N. (Ed.), Statistical Applications in the Spatial Sciences. Pion, London, pp. 127–144.

Rahbek, C., 2005. The role of spatial scale and the perception of large-scale species-richness patterns. Ecology Letters 8, 224–239.

Rahbek, C., Graves, G.R., 2001. Multiscale assessment of patterns of avian species richness. Proceedings of the National Academy of Sciences of the United States of America 98, 4534–4539.

Randin, C.F., Engler, R., Normand, S., Zappa, M., Zimmermann, N.E., Pearman, P.B., Vittoz, P., Thuiller, W., Guisan, A., 2009. Climate change and plant distribution: local models predict high-elevation persistence. Global Change Biology 15, 1557–1569.

Sappington, J.M., Longshore, K.M., Thompson, D.B., 2007. Quantifying landscape ruggedness for animal habitat analysis: a case study using bighorn sheep in the Mojave Desert. Journal of Wildlife Management 71, 1419–1426.

Seo, C., Thorne, J.H., Hannah, L., Thuiller, W., 2009. Scale effects in species distribution models: implications for conservation planning under climate change. Biology Letters 5, 39–43.

Sing, T., Sander, O., Beerenwinkel, N., Lengauer, T., 2005. ROCR: visualizing classifier performance in R. Bioinformatics 21, 3940–3941.

Trivedi, M.R., Berry, P.M., Morecroft, M.D., Dawson, T.P., 2008. Spatial scale affects bioclimate model projections of climate change impacts on mountain plants. Global Change Biology 14, 1089–1103.

Van Rensburg, B.J., Chown, S.L., Gaston, K.J., 2002. Species richness, environmental correlates, and spatial scale: a test using South African birds. American Naturalist 159, 566–577.