
A bootstrap estimation scheme for chemical compositional data with nondetects


Javier Palarea-Albaladejo (a,*), Josep Antoni Martín-Fernández (b) and Ricardo Antonio Olea (c)

The bootstrap method is commonly used to estimate the distribution of estimators and their associated uncertainty when explicit analytic expressions are not available or are difficult to obtain. It has been widely applied in environmental and geochemical studies, where the data generated often represent parts of a whole, typically chemical concentrations. This kind of constrained data is generically called compositional data, and they require specialised statistical methods to properly account for their particular covariance structure. On the other hand, it is not unusual in practice that those data contain labels denoting nondetects, that is, concentrations falling below detection limits. Nondetects impede the implementation of the bootstrap and represent an additional source of uncertainty that must be taken into account. In this work, a bootstrap scheme is devised that handles nondetects by adding an imputation step within the resampling process and conveniently propagates their associated uncertainty. In doing so, it considers the constrained relationships between chemical concentrations originated from their compositional nature. Bootstrap estimates using a range of imputation methods, including new stochastic proposals, are compared across scenarios of increasing difficulty. They are formulated to meet compositional principles following the log-ratio approach, and an adjustment is introduced in the multivariate case to deal with nonclosed samples. Results suggest that nondetect bootstrap based on model-based imputation is generally preferable. A robust approach based on isometric log-ratio transformations appears to be particularly suited in this context. Computer routines in the R statistical programming language are provided. Copyright © 2014 John Wiley & Sons, Ltd.

Keywords: bootstrap; nondetects; compositional data; log-ratio analysis; imputation

Research article

Received: 27 November 2013, Revised: 10 March 2014, Accepted: 12 March 2014, Published online in Wiley Online Library: 2 April 2014

(wileyonlinelibrary.com) DOI: 10.1002/cem.2621

J. Chemometrics 2014; 28: 585–599 Copyright © 2014 John Wiley & Sons, Ltd.

* Correspondence to: Javier Palarea-Albaladejo, Biomathematics and Statistics Scotland, JCMB, The King’s Buildings, Edinburgh, EH9 3JZ, UK. E-mail: [email protected]

a J. Palarea-Albaladejo, Biomathematics and Statistics Scotland, JCMB, The King’s Buildings, Edinburgh, EH9 3JZ, UK

b J. A. Martín-Fernández, Dept. Informàtica, Matemàtica Aplicada i Estadística, UdG, Campus Montilivi, Edifici P-IV, E-17071, Girona, Spain

c R. A. Olea, US Geological Survey, Reston, VA 20192, USA

1. INTRODUCTION

The presence of trace elements in concentrations below a certain method or instrument detection limit (DL) is common when analysing experimental samples. They are encountered in many subfields within the earth sciences such as air quality [1], water chemistry [2], geochemistry [3] and many others. Usually called nondetects or simply less-thans, those unobserved values are reported by the laboratories as indicators of limiting concentrations at which a certain compound may be present, but a particular analytical technique is not capable of detecting them. Depending on the severity of the DL, estimates from the observed data may well not reflect the characteristics of the true underlying data distribution. Simply leaving nondetects out from the calculations is generally regarded as suboptimal. It may lead to biased estimates, as a result of preferentially discarding low values, and wrong conclusions as the influence of the values actually detected is overstated. Reference [4] provides a comprehensive treatment of nondetects in environmental data, but it does not go into the compositional nature of data involving, for example, chemical concentrations or similar units. Particular approaches for this latter case in chemistry have been recently reviewed in [5].

Compositional data consist of vectors of positive components that are intrinsically related to each other as they represent parts of some whole, for example, chemical concentrations, abundance of species, time spent in different activities and so on. They typically come up in chemistry when raw data are normalised and are commonly represented as closed data summing up to a constant value, for example, 100 for percentages, although currently, compositional data are understood in a broader sense [6]. Technical difficulties are expected, and misleading conclusions may be drawn when standard data analysis techniques are applied on closed data. The fundamental problem is that compositional data are defined on a sample space, the simplex, which is equipped with a particular geometry different from that of the ordinary real space [7]. A recent monograph [8] provides a detailed account of these issues. Three main approaches have been proposed to overcome them: (i) Box–Cox-type transformations [9]; (ii) hyperspherical transformation [10,11]; and (iii) log-ratio transformations [12]. In this work, we will stick to the log-ratio approach as it satisfies basic compositional principles, such as scale invariance and subcompositional coherence [12], and is nowadays regarded as the mainstream methodology.

Motivated by a real data analysis involving bootstrapping on trace metal concentrations in coal ash, the nondetect problem is depicted here in the context of bootstrap inference of distributional statistics from compositional data sets. The bootstrap method [13, 14] facilitates statistical inference on estimators and their associated accuracy (e.g. [15, 16]). We use the standard nonparametric bootstrap framework to devise a bootstrap resampling scheme that handles nondetects by taking into account the restricted nature of the relationships between the elements of the composition. The presence of nondetects implies that they are randomly spread throughout the bootstrap resamples. Given a data set $X = [x_{ij}]$ of size $n \times d$ consisting of $n$ independent samples of a $d$-part composition $x = [x_1, \ldots, x_d]$ containing nondetects, we add a compositional nondetect replacement step within the bootstrap scheme to replace them in each bootstrap resample $X_b$. An estimate $\hat{\theta}_b$ of the distributional statistic of interest $\theta$ (e.g. lower percentiles and geometric mean) is computed from each imputed resample $\hat{X}_b$. After iteration, the estimates $\hat{\theta}_b$ are used to compute a final bootstrap summary estimate $\hat{\theta}$ and its associated variability. This process can be sketched as follows (Figure 1):

1. Randomly sample rows with replacement from $X$ to produce a bootstrap resample $X_b$ of size $n \times d$.
2. Replace nondetects in $X_b$ by imputed values → $\hat{X}_b$.
3. Compute and save statistics of interest from $\hat{X}_b$ → $\hat{\theta}_b$.
4. Repeat 1–3 $B$ times.
5. Compute bootstrap summary estimates → $\hat{\theta}$.
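The five steps above can be sketched in a few lines of Python. This is only an illustrative outline and not the authors' R routines: the function name, the use of `None` to mark a nondetect, and the `impute` and `statistic` callbacks are all assumptions made for the example. The mean and standard deviation of the replicates stand in for whichever summary estimates are of interest.

```python
import random
import statistics


def nondetect_bootstrap(X, DL, impute, statistic, B=200, seed=0):
    """Nondetect bootstrap sketch (steps 1-5).

    X         : list of rows; None marks a nondetect (assumed convention)
    DL        : list of rows of detection limits, same shape as X
    impute    : hypothetical callback (rows, dls) -> rows without None
    statistic : callback rows -> float
    """
    rng = random.Random(seed)
    n = len(X)
    estimates = []
    for _ in range(B):                                  # step 4: repeat B times
        idx = [rng.randrange(n) for _ in range(n)]      # step 1: resample rows
        Xb = [list(X[i]) for i in idx]
        DLb = [list(DL[i]) for i in idx]
        Xb_hat = impute(Xb, DLb)                        # step 2: impute nondetects
        estimates.append(statistic(Xb_hat))             # step 3: save the estimate
    # step 5: summary estimate and its variability across resamples
    return statistics.mean(estimates), statistics.stdev(estimates)
```

Because the imputation runs inside the loop, each resample gets its own replacement values, which is how the uncertainty due to nondetects reaches the final summary.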

In this way, nondetects in each bootstrap resample are imputed coherently with their compositional nature, and the uncertainty associated with them is consistently incorporated to the bootstrap estimates providing valid inferences [17]. In a related work, although for the particular case of bootstrap percentile estimation, a number of bootstrap methods for estimating the confidence intervals of percentiles from data containing nondetects are examined [18]. Bootstrap percentiles are obtained by least squares based on the lognormal and truncated lognormal distributions. Unfortunately, the compositional aspect of the data is not considered.

In the following, we describe a bundle of seven approaches to replace nondetects within the preceding bootstrap plan, from a simple ad hoc imputation to more sophisticated model-based methods. Popular approaches are redefined to meet with the compositional aspect of the data. A final adjustment is introduced for multivariate compositional methods to provide results on the original scale without altering their relative structure, and some new variants featuring a random component are also proposed. They all are then demonstrated and compared on the basis of a set of scenarios of increasing complexity created from a real data set. A package of functions written for the R statistical language [19], zCompositions [20], implements most of the imputation methods described in the following sections under a unified approach and facilitates the exploration of nondetect patterns. Note, however, that by the time that the comparative study of Section 6 was performed, a version of the robust isometric log-ratio (ilr)-expectation–maximisation (EM) algorithm (Section 3.4) was only available in the R package robCompositions [21]. MATLAB counterparts can also be obtained either from the website www.compositionaldata.com or by request from the authors.

Figure 1. Nondetect bootstrap resampling and estimation scheme (see text for details).


2. UNIVARIATE NONDETECT REPLACEMENT

This section comprises those methods working on a component-wise basis, which disregard any information on possible relationships among variables. Even so, the components not affected by nondetects will still require a multiplicative adjustment to meet compositional requirements. These replacement procedures are generally easier to implement than the multivariate approaches we will discuss later on. Besides, they work on the raw data, and no data transformation is required.

2.1. Simple multiplicative replacement

One of the most used procedures to elude the nondetect problem in environmental studies is simple substitution [4], that is, replacing them with a constant fraction of the DL, usually between 50% and 70%. Using the same idea but on compositional data, reference [22] shows that multiplicative adjustments on the observed components of a sample are required to fulfil both the constant-sum constraint and desirable basic properties, such as maintaining the ratios between the observed components. The multiplicative replacement formula then replaces the components $x_{ij}$ of a compositional sample containing nondetects with an imputed value $\delta$ as

$$\hat{x}_{ij} = \begin{cases} \delta_{ij} & \text{if } x_{ij} < DL_{ij} \\ x_{ij}\left(1 - \dfrac{\sum_{k \mid x_{ik} < DL_{ik}} \delta_{ik}}{c_i}\right) & \text{if } x_{ij} \geq DL_{ij} \end{cases} \tag{1}$$

where $c_i$ is the sum of the observed components (e.g. 1 if they are proportions). Note that two subscripts are explicitly used for $\delta$ and $DL$, allowing for both possible multiple DLs on the same component and different DLs along components. In [22], the authors found that imputing nondetects with $\delta = 0.65\,DL$ (i.e. 65% of the corresponding limit of detection) minimises the distortion of the covariance structure when the proportion of nondetects is less than 10%. The replacement (1) will provide results on the same scale as the original values, and the multiplicative adjustments for the detected components (i.e. those satisfying $x_{ij} \geq DL_{ij}$) guarantee that they will be equivalent to those that would be obtained on another scale (e.g. working on percentages).
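As a minimal sketch of replacement (1) for a single sample, the following Python function imputes each nondetect by a fixed fraction of its DL and shrinks the detected parts accordingly. The function name and the use of `None` to mark a nondetect are assumptions for illustration; $c_i$ is taken as the sum of the observed components, following the definition above.

```python
def multiplicative_replacement(x, dl, frac=0.65):
    """Simple multiplicative replacement of nondetects in one sample.

    x    : list of parts; None marks a nondetect (assumed convention)
    dl   : list of detection limits, one per part
    frac : fraction of the DL used as the imputed value delta
    """
    c = sum(v for v in x if v is not None)            # c_i: sum of observed parts
    delta = [frac * d if v is None else 0.0 for v, d in zip(x, dl)]
    shrink = 1.0 - sum(delta) / c                     # common multiplicative factor
    # Nondetects take delta; detected parts are all scaled by the same factor,
    # so the ratios between detected parts are unchanged.
    return [dj if v is None else v * shrink for v, dj in zip(x, delta)]
```

For instance, `multiplicative_replacement([None, 0.4, 0.6], [0.1, 0.1, 0.1])` imputes 0.065 for the first part and scales the other two by 0.935, so the result still adds up to 1 and the ratio 0.4/0.6 is preserved.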

2.2. Random uniform multiplicative replacement

An alternative to simple substitution that can be found in the literature consists of considering that nondetects in a sample are distributed according to a uniform distribution below the limit of detection [23]. This proposal can be fitted into the multiplicative replacement scheme (1) by simply drawing the imputed value $\delta$ at random from a uniform model in $(0, DL)$:

$$\delta_{ij} \sim \mathrm{Uniform}_j\left(0, DL_{ij}\right)$$

The components taking values over the limit of detection are then adjusted as stated in (1). As with the simple substitution (simple multiplicative replacement) method, a main advantage of this procedure is simplicity and easy implementation. However, giving any value under the DL equal chance to occur does not seem a reasonable conjecture in most practical settings.
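The random uniform variant changes only how $\delta$ is obtained. A minimal sketch, under the same assumed conventions as before (`None` marks a nondetect, and the function name is hypothetical):

```python
import random


def uniform_multiplicative_replacement(x, dl, rng):
    """Multiplicative replacement with delta drawn uniformly between 0 and the DL.

    x   : list of parts; None marks a nondetect (assumed convention)
    dl  : list of detection limits, one per part
    rng : random.Random instance, passed in for reproducibility
    """
    c = sum(v for v in x if v is not None)            # sum of observed parts
    delta = [rng.uniform(0.0, d) if v is None else 0.0 for v, d in zip(x, dl)]
    shrink = 1.0 - sum(delta) / c
    # Detected parts are adjusted exactly as in the deterministic version.
    return [dj if v is None else v * shrink for v, dj in zip(x, delta)]
```

Within a bootstrap, passing the same `rng` to every resample gives different imputed values each time while keeping the whole run reproducible.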

2.3. Random censored lognormal multiplicative replacement

This approach exploits the statistical characteristics of the data distribution assuming a certain probabilistic model for the complete data, and not only for the tail under the DL. In particular, the lognormal distribution has been traditionally used to model right-skewed data distributions, as is typically the case when dealing with chemical concentrations or similar [4]. A lognormal variate $x$ relates with a normally distributed variate $y$ by $y = \ln x$. Their respective parameters are also closely related as

$$\mu_{LN} = e^{\mu_N + \frac{1}{2}\sigma_N^2} \quad \text{and} \quad \sigma_{LN}^2 = e^{2\mu_N + \sigma_N^2}\left(e^{\sigma_N^2} - 1\right)$$

where $\mu$ and $\sigma^2$ stand for the corresponding means and variances. Therefore, the estimated lognormal parameters $\hat{\mu}_{LN}$ and $\hat{\sigma}_{LN}^2$ are usually obtained from the estimates $\hat{\mu}_N$ and $\hat{\sigma}_N^2$ of the associated normal distribution. Unbiased estimates require a generalisation of the maximum likelihood (ML) approach. Thus, the likelihood function $L$ is the product of two components, one depending on the DL and the other involving the detected values (ordered data):

$$L\left(\mu_N, \sigma_N^2\right) = \frac{n!}{(n - n_c)!\, n_c!} \left[\Phi\left(\frac{\ln(DL) - \mu_N}{\sigma_N}\right)\right]^{n_c} \frac{1}{\sqrt{\left(2\pi\sigma_N^2\right)^{n}}} \exp\left(\sum_{i=n_c+1}^{n} \frac{(y_i - \mu_N)^2}{-2\sigma_N^2}\right)$$

where $n_c$ is the number of nondetects and $\Phi$ the cumulative distribution function of a standard normal variate (see e.g. [24] for more details). Following the idea introduced in [5], once the ML estimates have been computed, values of the lognormal distribution are randomly generated from the left tail under the DL to impute nondetects as

$$\delta_{ij} \sim \mathrm{LogNormal}_j\left(\hat{\mu}_{LN_j}, \hat{\sigma}_{LN_j}^2\right)$$

The remaining components are again adjusted by (1) in accordance with their compositional nature. For picking out lognormal values under DL, a simple rejection scheme is enough in this case. But other strategies such as using a truncated lognormal distribution would be more efficient (Section 3.2).
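The parameter relations and the simple rejection scheme mentioned above can be sketched as follows. The sketch assumes that estimates of $\mu_N$ and $\sigma_N$ on the log scale are already available (the censored ML fit itself is not shown), and both function names are hypothetical.

```python
import math
import random


def lognormal_moments(mu_n, sigma_n):
    """Mean and variance of a lognormal variate from its log-scale parameters."""
    mean = math.exp(mu_n + 0.5 * sigma_n ** 2)
    var = math.exp(2.0 * mu_n + sigma_n ** 2) * (math.exp(sigma_n ** 2) - 1.0)
    return mean, var


def draw_below_dl(mu_n, sigma_n, dl, rng, max_tries=100_000):
    """Rejection sampling: return the first lognormal draw that falls below dl."""
    for _ in range(max_tries):
        value = rng.lognormvariate(mu_n, sigma_n)     # parameters on the log scale
        if value < dl:
            return value
    raise RuntimeError("no draw fell below the DL; the left tail may be too thin")
```

The expected number of tries per accepted value is the reciprocal of the probability mass below the DL, which is why rejection is acceptable here but becomes slow when that mass is small.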

3. MULTIVARIATE COMPOSITIONAL REPLACEMENT

Next, we present some methods, including new proposals, based on compositional models that make use of the relationship structure of the data. As mentioned earlier, proper compositional methods are based on log-ratio transformations. The additive log-ratio (alr) transformation [12] has been successfully used for statistical modelling (e.g. [25, 26, 27, 28]). It is defined for a $d$-part composition $x = [x_1, \ldots, x_d]$ considering the ratios of the components $x_i$, $i \neq j$, to a chosen $x_j$ and generating an alr data vector $y$ in the $(d-1)$-dimensional real space as

$$y = \mathrm{alr}(x) = \left[\ln\frac{x_1}{x_j}, \ldots, \ln\frac{x_{j-1}}{x_j}, \ln\frac{x_{j+1}}{x_j}, \ldots, \ln\frac{x_d}{x_j}\right] \tag{2}$$


Note that one component $x_j$ must be chosen as divisor. Hence, whenever a procedure is based on the alr transformation, one is required to check whether the results will depend on the divisor. From this transformation, the additive logistic normal (ALN) distribution is defined, which implies that the familiar multivariate normal model with mean vector $\mu$ and covariance matrix $\Sigma$ is assumed for the alr-transformed data. Among other appealing properties, reference [29] shows that the ALN distribution is invariant to permutation of components. That is, ALN-based inference is consistent regardless of the component used as alr divisor. Without loss of generality, we will consider $x_d$ as the alr divisor in the following formulas.
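A minimal sketch of the alr transformation (2) for one sample, with the last part as default divisor as in the convention just adopted. The function name and signature are illustrative assumptions.

```python
import math


def alr(x, divisor=None):
    """Additive log-ratio transform of one composition.

    x       : list of strictly positive parts
    divisor : index of the part used as divisor (defaults to the last part)
    """
    j = len(x) - 1 if divisor is None else divisor
    # One log-ratio per part other than the divisor, in the original order.
    return [math.log(v / x[j]) for k, v in enumerate(x) if k != j]
```

Because only ratios enter the transform, `alr([0.2, 0.3, 0.5])` and `alr([20.0, 30.0, 50.0])` give the same vector, which is the scale invariance that the log-ratio approach relies on.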

It is known that the alr is not an isometric transformation between the simplex and the real space [12, 30]. In [30], a new class of ilr transformations is introduced, although an orthonormal basis must be taken by the practitioner in order to determine the transformation. When it is chosen using a convenient sequential binary partition based on expert knowledge, interpretable ilr-transformed variates can be obtained [31]. The ilr transformation is becoming increasingly popular and has been also revealed as convenient when adopting a robust approach [32].

In this work, we will adhere to the alr transformation for data modelling except when using robust methods, in which case we will require the ilr (Section 3.4). First, the alr is simple and allows for the DL to be easily moved into the alr-transformed space. Second, assuming a normal distribution for the transformed data, either through alr or ilr, the conditional expected values of the transformed nondetects adopt the form of censored regression equations. Unlike the ilr, the alr provides an immediate one-to-one mapping between original and transformed components with nondetects (e.g. $x_1$ is properly represented by the alr variate $\ln(x_1/x_d)$) that facilitates regression estimation and transformation back to the original space. Something analogous under the ilr approach actually involves the specification of different ilr transformations [33] and an extra computing burden within an already computationally intensive bootstrap context. On top of that, reference [34] proves that imputation based on classical censored regression provides equivalent results regardless of which transformation, alr or ilr, is used.

3.1. alr-EM algorithm

This method [35, 36] is an adaptation of the well-known EM algorithm [37] to deal with nondetects in a compositionally sound way. Assuming an ALN model, starting values for the parameters $\mu$ and $\Sigma$ in the alr space are obtained from the completely detected samples. They are then iteratively updated approaching their ML estimates under mild conditions. In the $k$th iteration, an alr-transformed nondetect $y_{ij}$ is imputed by its estimated conditional expected value $\hat{y}_{ij}$, given the current estimates $\hat{\mu}^{(k)}$ and $\hat{\Sigma}^{(k)}$, as

$$\hat{y}_{ij}^{(k)} = y_{i,-j}^{t}\hat{\beta}_j^{(k)} - \hat{\sigma}_j^{(k)}\, \frac{\phi\left(\dfrac{\psi_{ij} - y_{i,-j}^{t}\hat{\beta}_j^{(k)}}{\hat{\sigma}_j^{(k)}}\right)}{\Phi\left(\dfrac{\psi_{ij} - y_{i,-j}^{t}\hat{\beta}_j^{(k)}}{\hat{\sigma}_j^{(k)}}\right)} \tag{3}$$

where $y_{i,-j}^{t}$ denotes the transposed vector of observed alr variates, $\hat{\beta}_j$ is the estimated vector of regression coefficients, $\hat{\sigma}_j^2$ is the estimated conditional variance of the variate $y_j$, $\psi_{ij} = \ln(DL_{ij}/x_{id})$ stands for the corresponding alr-transformed DL, and $\phi$ and $\Phi$ refer to the standard normal density and cumulative distribution functions, respectively. After convergence, a filled-in alr data set $\hat{Y}$ is obtained. It is then transformed back into a replaced data set $\hat{X}$ in the original space by the inverse alr transformation:

$$\hat{x}_i = \mathrm{alr}^{-1}(\hat{y}_i) = \begin{cases} \hat{x}_{ij} = \dfrac{e^{\hat{y}_{ij}}}{\sum_{k \neq d} e^{\hat{y}_{ik}} + 1}, & j \neq d \\[2ex] \hat{x}_{id} = \dfrac{1}{\sum_{k \neq d} e^{\hat{y}_{ik}} + 1} \end{cases} \tag{4}$$
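Returning to the imputation step (3), its right-hand side is the mean of a normal variate conditional on lying below a threshold. The sketch below evaluates that expression for given values of the regression prediction, the conditional standard deviation and the alr-transformed DL. The function and argument names are assumptions for illustration, and the regression fit that would supply these inputs is not shown.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: supplies phi (pdf) and Phi (cdf)


def censored_conditional_mean(prediction, sigma, psi):
    """E[y | y < psi] for y ~ Normal(prediction, sigma^2).

    prediction : regression prediction for the alr nondetect
    sigma      : conditional standard deviation
    psi        : alr-transformed detection limit
    """
    z = (psi - prediction) / sigma
    # The pdf/cdf ratio is the inverse Mills ratio; Phi(z) can underflow to
    # zero for extremely negative z, which this sketch does not guard against.
    return prediction - sigma * _STD_NORMAL.pdf(z) / _STD_NORMAL.cdf(z)
```

The correction term is always positive, so the imputed value always falls below the threshold; for example, a standard normal censored at 0 has conditional mean of about -0.798.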

Note that practitioners are sometimes working with a nonclosed subset (subcomposition) of a full composition. However, through the alr (2) and inverse alr (4) transformations, the samples are inevitably closed, adding up to a constant ($c = 1$ in this case). We introduce a simple adjustment that gets the imputed subcomposition expressed back in the original nonclosed form while preserving its structure. Let $\hat{x}_{ij}$ be the estimated nondetect as returned by the inverse alr transformation (4) after imputation and $\hat{x}_{ik}$, with $k \neq j$, the value given for any other component that was originally observed. Then, making use of the fact that the relative ratios between components must be preserved, the estimated nondetect in the original scale $x_{ij}^{*}$ can be recovered as

$$x_{ij}^{*} = \hat{x}_{ij}\, \frac{x_{ik}}{\hat{x}_{ik}} \tag{5}$$

where $x_{ik}$ is the value of the originally observed component $k$. Importantly, it can be checked that by subsequently closing the imputed data set resulting from (5), we obtain the same output as if we instead had applied the replacement method on a closed version of the original subcomposition (principle of scale invariance).
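A minimal sketch of the back-transformation (4) followed by the adjustment (5) for one sample. It assumes the last part is the alr divisor, that `None` marks the positions that were nondetects, and that at least one part was observed; the function names are hypothetical.

```python
import math


def alr_inv(y):
    """Inverse alr with the last part as divisor; the result adds up to 1."""
    e = [math.exp(v) for v in y]
    denom = sum(e) + 1.0
    return [v / denom for v in e] + [1.0 / denom]


def rescale_to_original(x_hat, x_obs):
    """Express imputed values in the original (nonclosed) units.

    x_hat : closed sample returned by alr_inv after imputation
    x_obs : original sample; None marks a nondetect (assumed convention)
    """
    # Any originally observed part k can serve as the reference.
    k = next(i for i, v in enumerate(x_obs) if v is not None)
    factor = x_obs[k] / x_hat[k]
    # Observed parts keep their recorded values; only nondetects are rescaled.
    return [x_hat[i] * factor if v is None else v for i, v in enumerate(x_obs)]
```

As a usage example, if the observed sample is `[None, 30.0, 50.0]` in ppm and the imputed alr vector corresponds to the ratios 2/50 and 30/50, the closed result of `alr_inv` is rescaled so that the nondetect is returned as 2 ppm alongside the untouched 30 and 50.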

3.2. Random censored ALN

The proposal introduced here may be understood as a multivariate version of the approach in Section 2.3. In fact, reference [12] shows that a composition is distributed as an ALN model if it builds up from a basis of lognormal positive variates. We thus apply the same idea, but conditioning to the observed components through an ALN model. As estimated parameters $\hat{\mu}$ and $\hat{\Sigma}$ of the ALN model, we use the approximate ML estimates from the alr-EM algorithm.

In the alr space, values are drawn from the conditional normal distributions of every alr nondetect on the observed alr variates. Unlike in the random censored lognormal case, a simple rejection scheme from the conditionals is not efficient enough in this case and produces a very slow routine. Alternatively, we simulate values of truncated normal distributions in the interval $(-\infty, \psi)$ following the algorithms described in [38]. This implies that conditional normal distributions are used to generate values explicitly restricted to lie below the threshold $\psi$. Hence, estimated nondetect alr values $\hat{y}_{ij}$ are obtained as

$$\hat{y}_{ij} \sim \mathrm{TruncNormal}_j\left(y_{i,-j}^{t}\hat{\beta}_j,\ \hat{\sigma}_j^2;\ y_{ij} < \psi_{ij} \mid \hat{\mu}, \hat{\Sigma}\right) \tag{6}$$

then transformed back by inverse alr (4) and expressed in original units by (5).
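One way to draw from a normal distribution truncated to $(-\infty, \psi)$ without rejection is the inverse-CDF method, sketched below. This is a swapped-in textbook technique chosen for brevity, not necessarily one of the algorithms of [38], and the function name is an assumption.

```python
import random
from statistics import NormalDist


def draw_truncated_normal(mean, sigma, psi, rng):
    """Draw from Normal(mean, sigma^2) restricted to values below psi."""
    dist = NormalDist(mean, sigma)
    p_below = dist.cdf(psi)                 # probability mass below the threshold
    # 1 - rng.random() lies in (0, 1], so u lies in (0, p_below] and inv_cdf
    # never receives 0. If p_below underflows to zero (threshold far out in
    # the left tail), inv_cdf raises an error; that case is not handled here.
    u = p_below * (1.0 - rng.random())
    return dist.inv_cdf(u)
```

Every call returns a usable value, so the cost does not grow as the mass below the threshold shrinks, in contrast with rejection sampling from the untruncated conditional.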


3.3. alr-DA algorithm

This new proposal is a data augmentation (DA) algorithm [39] based on the truncated ALN model within a Bayesian framework. The DA algorithm consists of an iterative simulation-based procedure that eventually yields draws from the joint posterior distribution of the nondetected components and the ALN parameters. We will use it for generating plausible nondetects from their predictive distributions. Under ALN modelling, the distributions adopt the form of regression equations in the alr space. We add the extra information about the DL as in Eqn (6). Thus, in the $k$th iteration, alr nondetects are randomly imputed by

$$y_{ij}^{(k)} \sim \mathrm{TruncNormal}_j\left(y_{i,-j}^{t}\beta_j^{(k)},\ \sigma_j^{2(k)};\ y_{ij} < \psi_{ij} \mid \mu^{(k)}, \Sigma^{(k)}\right)$$

This scheme closely resembles the EM-based procedure (Section 3.1). The main difference is that DA nondetects and parameter estimates are updated by simulation and not deterministically. Unlike the EM algorithm, estimates of the covariance matrix from the imputed data can be computed without corrections to the variances. We assume the usual conjugate normal inverted-Wishart family with noninformative prior for the model parameters. Assessing convergence of the DA algorithm usually requires intervening methods that are not feasible in our bootstrap context. But, on the one hand, the choice of EM estimates as initial parameters for the DA algorithm assures starting near the centre of the target distribution [40]. On the other hand, convergence of this type of algorithms using the normal family should be fast [38]. Given the above, we assume 1500 iterations as enough for convergence. Finally, imputed data are back-transformed in original units by (4) and (5).

3.4. Robust ilr-EM algorithm

This method is a recent modification of the alr-EM algorithm introduced in [34] to downgrade the influence of extreme observations by using robust censored regression. As mentioned earlier, robust statistics requires ilr-type transformations [32], but linking components and their log-ratio counterparts is not straightforward. It requires tailored ilr transformations for each component to be imputed, which are picked out from a family proposed by [33]. In particular, given a permutation $x^{(j)} = [x_j, x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d]$ of a composition $x = [x_1, \ldots, x_d]$, where $x_j$ contains nondetects, we define an ilr transformation $z = \mathrm{ilr}(x)$ on $x^{(j)}$ with components

$$z_i = \sqrt{\frac{d - i}{d - i + 1}}\, \ln\frac{x_i^{(j)}}{\sqrt[d-i]{\prod_{k=i+1}^{d} x_k^{(j)}}}, \quad i = 1, \ldots, d - 1 \tag{7}$$

By (7), the ilr variate $z_1$ fully represents $x_1^{(j)} = x_j$ in the ilr space. From here, the EM iterations operate on $z_1$, analogous to the alr-EM algorithm described in Section 3.1 (Eqn (3)), but now based on robust EM estimates of the censored regression parameters:

$$\hat{z}_{i1}^{(k)} = z_{i,-1}^{t}\hat{\beta}_1^{(k)} - \hat{\sigma}_1^{(k)}\, \frac{\phi\left(\dfrac{\psi_{i1} - z_{i,-1}^{t}\hat{\beta}_1^{(k)}}{\hat{\sigma}_1^{(k)}}\right)}{\Phi\left(\dfrac{\psi_{i1} - z_{i,-1}^{t}\hat{\beta}_1^{(k)}}{\hat{\sigma}_1^{(k)}}\right)}$$

where the ilr-transformed DL is obtained as

$$\psi_{i1} = \sqrt{\frac{d - 1}{d}}\, \ln\frac{DL_{i1}^{(j)}}{\sqrt[d-1]{\prod_{k=2}^{d} x_{ik}^{(j)}}}$$

After convergence, the filled-in ilr data set $\hat{Z}$ is back-transformed into a replaced data set $\hat{X}$ by applying inverse ilr transformation and adjustment (5). Note that the above must be repeated for every component $x_j$ with nondetects within every bootstrap resample, so the procedure involves a significant computational burden. On the other hand, the transformation (7) actually incorporates all components other than the target one into the log-ratio divisor. An initial estimate of nondetects in any other component is then required (e.g. by simple multiplicative replacement as shown in Section 2.1).
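The transformation (7) can be sketched for one fully positive sample as follows. The function name and the zero-based index argument are assumptions for illustration; the robust regression and the EM iterations are not shown.

```python
import math


def pivot_ilr(x, j):
    """ilr coordinates of one composition after moving part j to the front.

    x : list of strictly positive parts
    j : zero-based index of the part to be represented by the first coordinate
    """
    xp = [x[j]] + [v for k, v in enumerate(x) if k != j]   # permuted sample
    d = len(xp)
    z = []
    for i in range(1, d):                                  # i = 1, ..., d - 1
        rest = xp[i:]                                      # the d - i parts after part i
        log_gmean = sum(math.log(v) for v in rest) / (d - i)
        scale = math.sqrt((d - i) / (d - i + 1))
        z.append(scale * (math.log(xp[i - 1]) - log_gmean))
    return z
```

Only the first coordinate involves the target part, which is what lets the censored regression operate on a single variate; for `pivot_ilr([4.0, 1.0, 1.0], 0)` the first coordinate equals sqrt(2/3) times ln 4 and the second is 0.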

4. THE FORT UNION DATA SET

The data set was kindly provided by scientists at the US Geological Survey and can be accessed at http://energy.er.usgs.gov/coalqual.htm. It comprises 229 samples with five metal trace elements (Cr, Cu, Hg, U and V, reported in parts per million) obtained from coal ash samples from the Fort Union formation in the Montana, USA, side of the Powder River Basin. The formation is mostly Palaeocene in age, and the coal is the result of deposition in conditions ranging from fluvial to lacustrine. All samples were taken from the same seam at different sites within an area of 430 km by 300 km, which implies that, on average, the sampling spacing is 24 km. Using the spatial coordinates of the data, a semivariogram analysis was conducted for each chemical element in order to check for a potential spatial dependence structure in the data (not shown here). No spatial dependence patterns were observed for any component, which allowed us to assume independence of the chemical samples at different locations. This is not uncommon for sparsely drilled reconnaissance studies of coal resources, as spatial correlation usually extends only over short distances. For example, in a portion of the Blue Gem coal in Kentucky drilled at an average spacing of 4 km, the authors in [41] found spatial correlation within distances varying between 8 and 20 km. Also note that, in the case of the Fort Union coal, preferential directions in the fluctuations of any property are not present as a result of the geological fact that deposition of sediments and organic material is controlled by the always-changing directions of meanders, which, on average, make all variables directionally isotropic. The trace elements we use in this work are a product of coal combustion, primarily found in ashes originating from coal-fired power stations. They are typically classified according to their volatility. The higher the volatility, the higher the amount of element released into the atmosphere with the flue gases. Among them, Hg is by far the most volatile element and the one that ends up settled in soil or forming harmful chemical compounds in water.
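The average sampling spacing quoted above can be checked with a short calculation. The sketch below assumes that the 229 samples are spread evenly over the 430 km by 300 km area, so that each sample occupies a square cell; this even-grid reading is an assumption made here for illustration, not a description of the actual sampling design.

```python
import math

# Figures quoted in the text
area_km2 = 430 * 300   # extent of the sampled area in square kilometres
n_samples = 229

# Assumption: samples evenly spread, each one occupying a square cell
spacing_km = math.sqrt(area_km2 / n_samples)
print(round(spacing_km, 1))  # 23.7, i.e. roughly 24 km
```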

The aforementioned chemical components actually represent a fully observed subcomposition of a much larger chemical composition. The five elements are not closed to a constant sum. Note that, as the samples are expressed in parts per million and all concentrations were originally measured, a residual element could be defined to fill up the gap to 10^6. Table I summarizes the common descriptive univariate statistics for the preceding chemical elements. It can be seen that Cr, Hg and U have the lowest concentrations, while V exhibits the highest.

Bootstrap compositional data with nondetects

J. Chemometrics 2014; 28: 585–599 Copyright © 2014 John Wiley & Sons, Ltd. wileyonlinelibrary.com/journal/cem


This pattern is corroborated by the negative sign of the expected value E(ln(xi/xj)) when the divisor is V. These expected values appear in Table II, in the lower triangle of the so-called variation array of the data [12]. In Figure 2, a biplot suitably based on the centred log-ratio (clr) transformation [42] reveals that clr(Hg) and clr(U) have the highest relative variability, as they show the longest rays. The covariance structure of the data is determined by the log-ratio variances Var(ln(xi/xj)) populating the upper triangle of the variation array (the minimum value is 0.30 for Var(ln(Cu/V))). Their sum divided by d provides a single measure of total variability (1.22). The separation between vertices in the compositional biplot and the log-ratio variances (overall high) reflect no strong chemical associations between components. This is expected to have a negative effect on the results from multivariate methods based on the ALN model.

From the original Fort Union coal ash data, a collection of eight scenarios with different distributions of nondetects was created. Geochemists at the US Geological Survey assisted us in designing realistic situations according to their practical experience. They provided us with a range of plausible DLs for each component, along with the effective DLs reported when recording the samples (DL range and FU DLs columns in Table III). From this, four different levels of nondetects were considered: low (<5%), moderate (5–15%), medium (15–25%) and medium–high (25–50%). DLs were accordingly established for every component at each level as shown in Table III. Note that vanadium was left free of nondetects so that it could be used as the alr divisor. Otherwise, a preliminary simple multiplicative replacement or a sequential replacement strategy as described in [36] may solve the problem.

As a result, eight data sets {X1, …, X8} with artificially forced nondetects were generated. The resulting percentage distribution of nondetects by chemical element is shown in Table IV. In scenario 1, 26 out of 229 (11.35%) samples contain at least one nondetect. In scenario 8, that amount rises to 140 samples (61.13%).

Recall that the data set consists of a nonclosed subcomposition of concentrations in parts per million including nondetects, and we will provide results in that form by using the adjustment (5). Alternatively, the scientists might have decided to work in, for example, percentages by closing the subcomposition to 100. It is important to note that in such a case, the corresponding DLs would have to be transformed accordingly. In any case, the principle of scale invariance would be satisfied as detailed in Section 3.1.

5. RESULTS

The nondetect bootstrapping scheme outlined in Section 1 with B = 1000 resamples was applied in each scenario {X1, …, X8}, considering the seven replacement methods described in Sections 2 and 3. We focus on bootstrap inference on the fifth percentile (P5) and geometric mean (GeoM), as distributional statistics of common interest in environmental and geological studies which are directly affected by the presence of nondetects. Such statistics will be useful in the assessment of the distortion introduced by nondetects in data location. On the other hand, compositional variability measures such as the

Table I. Fort Union coal ash univariate descriptive statistics (in ppm): minimum, 5th percentile, 25th percentile, geometric mean, median, 75th percentile and maximum

      Min     P5      P25     Geometric mean   Median   P75      Max
Cr    0.72    1.24    2.35    4.07             3.71     7.07     28.80
Cu    16.00   26.40   37.00   49.54            47.00    67.00    203.00
Hg    0.14    0.26    0.48    0.74             0.71     1.17     5.77
U     0.21    0.50    0.91    1.53             1.36     2.36     17.40
V     10.00   25.00   49.00   70.05            70.00    130.00   500.00

Table II. Fort Union coal ash data: variation array and total variability

      Cr      Cu      Hg      U       V
Cr    0.00    0.55    1.08    0.47    0.35
Cu    −2.50   0.00    0.52    0.49    0.30
Hg    1.71    4.21    0.00    0.89    0.95
U     0.98    3.48    −0.73   0.00    0.51
V     −2.84   −0.35   −4.56   −3.82   0.00

Total variability: 1.22.
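The figures in Table II can be cross-checked against each other and against Table I. The sketch below sums the upper-triangle log-ratio variances and divides by 5, on the assumption that the divisor d in the text equals the number of components here (that choice reproduces the reported 1.22). It also uses the identity that the mean of a log-ratio equals the log of the ratio of geometric means, so the lower triangle can be compared with the geometric means of Table I up to rounding.

```python
import math

parts = ["Cr", "Cu", "Hg", "U", "V"]

# Upper triangle of Table II: Var(ln(x_i/x_j)) for each pair of components
log_ratio_var = {
    ("Cr", "Cu"): 0.55, ("Cr", "Hg"): 1.08, ("Cr", "U"): 0.47, ("Cr", "V"): 0.35,
    ("Cu", "Hg"): 0.52, ("Cu", "U"): 0.49, ("Cu", "V"): 0.30,
    ("Hg", "U"): 0.89, ("Hg", "V"): 0.95,
    ("U", "V"): 0.51,
}

# Assumption: the divisor is the number of components (5)
total_variability = sum(log_ratio_var.values()) / len(parts)
print(round(total_variability, 2))

# Geometric means from Table I (ppm)
geo_mean = {"Cr": 4.07, "Cu": 49.54, "Hg": 0.74, "U": 1.53, "V": 70.05}

def expected_log_ratio(numerator, denominator):
    """E[ln(x_num/x_den)] equals ln of the ratio of the two geometric means."""
    return math.log(geo_mean[numerator] / geo_mean[denominator])

# Compare with the V row of Table II (differences come from rounded inputs)
for part in ["Cr", "Cu", "Hg", "U"]:
    print(part, round(expected_log_ratio(part, "V"), 2))
```

The negative values in the V row simply restate that V has the largest geometric mean of the five elements.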

Figure 2. clr biplot of Fort Union coal ash data set: arrows represent relative relationships among chemical elements (rays for clr(Cr), clr(Cu), clr(Hg), clr(U) and clr(V)). Points reflect relative position of samples with respect to each clr component (76% variation explained by the first two axes).

J. Palarea-Albaladejo, J. A. Martín-Fernández and R. A. Olea


variation matrix, total variability and relative difference in covariance matrix (as introduced in [34]) will be used in accounting for distortion in data variability.

The methods will be denoted in short by MR65% (simple multiplicative replacement using 65%DL), R-Unif (random uniform multiplicative replacement), R-LogN (random censored lognormal multiplicative replacement), alrEM (alr-EM algorithm), R-ALN (random censored ALN), alrDA (alr-DA algorithm) and Rob-ilrEM (robust ilr-EM algorithm). Table V summarises these abbreviations and also provides information about the data representation used for each method: raw data (no data transformation applied), alr transformation (oblique coordinates representation) or ilr transformation (orthogonal coordinates representation).

5.1. Bootstrap inference on the fifth percentile

5.1.1. Nondetect bootstrap distribution of the fifth percentile

Kernel density estimation of the sample distribution of the P5 was performed for each chemical element including nondetects. A Gaussian kernel with bandwidth determined by the common Silverman rule [43] was considered. Results by component, scenario and univariate replacement method (Section 2) are shown in Figure 3. The vertical dotted line refers to the true P5 value.

For the MR65% method, the distribution of P5 is centred at 65%DL, so it shifts rightwards with increasing DLs. The P5 value is here based on 11–12 samples (out of 229). When the percentage of nondetects is lower than 5% (Table IV), the estimated P5 is not affected. However, our resampling process generates resamples with both lower and higher percentages of nondetects, hence the observed variability (scenarios 1 and 2). At the other extreme, when the DL is high enough to produce more than 5% nondetects in any resample, all of them yield exactly the same estimated P5 at 0.65DL. Such a point estimate, although perhaps accidentally close to the true P5, will tend to overestimate it. Note that no kernel density can then be generated from MR65% in those cases (from scenario 3 or 4 in Figure 3).
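The collapse of the MR65% bootstrap distribution described above can be reproduced with a small univariate experiment. The sketch below is an illustration only: the data are synthetic lognormal draws rather than the Fort Union samples, the function name `bootstrap_p5_mr65` is hypothetical, and the compositional multiplicative adjustment applied in the paper is deliberately left out, so only the substitution by 0.65DL inside each resample is shown.

```python
import random
import statistics

def p5(values):
    """5th percentile: first of the 19 cut points splitting the data into 20 groups."""
    return statistics.quantiles(values, n=20, method="inclusive")[0]

def bootstrap_p5_mr65(data, dl, n_boot=1000, seed=0):
    """Resample with replacement, replace values below dl by 0.65*dl, record P5."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=len(data))
        imputed = [0.65 * dl if x < dl else x for x in resample]
        estimates.append(p5(imputed))
    return estimates

# Synthetic stand-in for one chemical element (assumption, not the real data)
rng = random.Random(1)
data = [rng.lognormvariate(0.0, 1.0) for _ in range(229)]

dl_low = statistics.quantiles(data, n=100, method="inclusive")[0]   # about 1% nondetects
dl_high = statistics.quantiles(data, n=5, method="inclusive")[0]    # about 20% nondetects

p5_low = bootstrap_p5_mr65(data, dl_low)
p5_high = bootstrap_p5_mr65(data, dl_high)

print(len({round(v, 9) for v in p5_low}))    # many distinct values
print(len({round(v, 9) for v in p5_high}))   # a single value, 0.65 * dl_high
```

With the low DL the bootstrap P5 varies from resample to resample, whereas with the high DL every resample returns the same number, which is why no kernel density can be drawn in that situation.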

Table III. Fort Union coal ash data: detection limits (DL) provided (left) and detection limits imposed by level of nondetects (right)

      DL range   FU DLs       Low (<5%)      Moderate (5–15%)   Medium (15–25%)   Medium–High (25–50%)
Cr    0.1–9      0.10         1.00 (3.10)    1.50 (8.70)        2.00 (20.50)      2.50 (28.80)
Cu    10–500     20.00        25.00 (3.90)   30.00 (8.70)       35.00 (16.60)     45.00 (45.00)
Hg    0.05–2     0.10         0.25 (4.80)    0.35 (14.80)       0.40 (18.80)      0.60 (38.00)
U     0.1–1      0.15; 1.00   0.40 (2.60)    0.70 (9.60)        0.85 (20.10)      1.00 (30.10)

Resulting percentages of nondetects appear in parentheses. FU, Fort Union.

Table IV. Eight synthetic nondetect scenarios from the Fort Union coal ash data: percentage of nondetects by chemical element and overall percentage by scenario

Scenario                            Cr     Cu     Hg     U      V    Total
1. Some low                         3.1    3.9    4.8    0.0    0    2.4
2. All low                          3.1    3.9    4.8    2.6    0    2.9
3. Some moderate, remaining low     8.7    8.7    14.8   2.6    0    7.0
4. All moderate                     8.7    8.7    14.8   9.6    0    8.4
5. Some medium                      20.5   16.6   14.8   2.6    0    10.9
6. Some medium, rest moderate       20.5   16.6   18.8   9.6    0    13.1
7. All medium                       20.5   16.6   18.8   20.1   0    15.2
8. Some medium–high                 28.8   45.0   18.8   9.6    0    20.4

Table V. Methods reference table: abbreviations refer to the different replacement methods applied within the bootstrap scheme and corresponding data representations

Method                                                Shortened form   Data representation
1. Simple multiplicative replacement (65%DL)          MR65%            raw
2. Random uniform multiplicative replacement          R-Unif           raw
3. Random lognormal multiplicative replacement        R-LogN           raw
4. alr-EM algorithm                                   alrEM            alr
5. Random censored ALN                                R-ALN            alr
6. alr-DA algorithm                                   alrDA            alr
7. Robust ilr-EM algorithm                            Rob-ilrEM        ilr


The nondetect bootstrap procedure based on R-Unif leads to increasing underestimation with higher scenarios. The centre of the R-Unif density for a particular component depends on the percentage of nondetects and the percentile of the uniform distribution in (0, DL) that corresponds to the 11–12 lower values among those nondetects. For example, for Cu in scenarios 5–7

Figure 3. Nondetect bootstrap distributions of the fifth percentile from univariate replacement methods: simple substitution by 65%DL, random uniform imputation and random lognormal imputation. Vertical dotted line represents the true sample value (P5Cr = 1.24, P5Cu = 26.4, P5Hg = 0.26 and P5U = 0.50).


(DLCu = 35, 16.6% nondetects), 38 uniform values are generated on average for each resample. The P5 estimation then corresponds (on average) with the 11/38 = 29% percentile of a uniform distribution in (0, 35), that is, 10.13.

Finally, the R-LogN method, although showing an underestimation pattern, gives rise to the best estimates of P5 among the univariate proposals in terms of bias and variance stability, especially for Cu. This suggests that the lognormal model fits reasonably well with the univariate distributions. Normality tests on the original log-transformed components provide evidence along these lines (p-values > 0.1784 from Shapiro–Wilk tests). Only U shows some departure from normality, likely because of the presence of some outliers at the upper tail of the distribution.
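The arithmetic behind the R-Unif example for Cu given above can be laid out step by step. The sketch uses only the figures quoted in the text and takes 11 as the rank of the value that determines P5 (the lower end of the 11–12 range mentioned); the variable names are illustrative.

```python
n_samples = 229
share_nondetects = 0.166   # Cu in scenarios 5-7
dl_cu = 35.0               # detection limit imposed on Cu (ppm)

# Expected number of uniform draws per resample
n_nondetects = round(n_samples * share_nondetects)   # 38

# P5 sits at about the 11th smallest value, all of which are imputed draws
rank_p5 = 11
quantile_level = rank_p5 / n_nondetects              # about 0.29

# The q-quantile of a Uniform(0, DL) distribution is q * DL
expected_p5 = quantile_level * dl_cu
print(n_nondetects, round(quantile_level, 2), round(expected_p5, 2))  # 38 0.29 10.13
```

Since the true P5 for Cu is 26.4 ppm, this value of about 10 ppm illustrates how strongly uniform imputation pulls the estimate downwards once nondetects exceed 5% of the data.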


Figure 4. Nondetect bootstrap distributions of the fifth percentile: multivariate methods (alr-EM algorithm, random ALN imputation, alr-DA algorithm and robust ilr-EM algorithm) and univariate random lognormal imputation. Vertical dotted line represents the true sample value (P5Cr = 1.24, P5Cu = 26.4, P5Hg = 0.26 and P5U = 0.50).


Kernel density estimation results for all scenarios using multivariate methods are compared in Figure 4 with the results from R-LogN. It can be seen that R-ALN and alrDA mostly generate comparable results. All methods tend to underestimate as the portion of samples below DL increases, but that is more pronounced for alrEM, R-ALN and alrDA. The most stable pattern and best overall performance are shown by Rob-ilrEM, although R-LogN provides comparable estimates in some cases (see Cu).

An investigation of the adequacy of the ALN model and the relationships between the alr variates sheds some light on these results. The pairwise scatterplot matrix of the original, uncensored alr variates in Figure 5 reveals relatively weak pairwise linear correlations, with that between ln(Cu/V) and ln(Hg/V) being the highest one (0.68 using Pearson's coefficient). Note that these correlation values and scatterplots depend on the alr divisor, as the alr transformation leads to an oblique basis. On the diagonal, qq-plots show a noticeable departure from univariate normality for ln(Cu/V) and ln(U/V), possibly because of outliers at the tails of the distribution. A chi-square plot [44] (not shown here) helped us to identify three samples, #43, #134 and #167, as potential multivariate outliers. When omitting these samples, a generalized Shapiro–Wilk test supports multivariate normality for the uncensored alr-transformed data (MVW = 0.9927, p-value = 0.2838). That is, the ALN turns out to be a plausible model for the Fort Union coal ash subcomposition, in accordance with the relation between univariate lognormal distributions and the ALN model stressed in Section 3.2.

5.1.2. Bootstrap 95% confidence intervals for the fifth percentile

Bootstrap confidence intervals using the usual 2.5 and 97.5 percentiles as limits are displayed by chemical element and scenario in Figure 6.
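A percentile bootstrap interval of the kind described above can be sketched in a few lines. The example is a simplified illustration: it uses synthetic lognormal data, has no nondetects and therefore no imputation step, and the helper names `p5` and `percentile_ci` are hypothetical. In the scheme studied in the paper, an imputation would be applied to each resample before the statistic is computed.

```python
import random
import statistics

def p5(values):
    """5th percentile of a sample (linear interpolation between order statistics)."""
    return statistics.quantiles(values, n=20, method="inclusive")[0]

def percentile_ci(data, statistic, n_boot=1000, seed=0):
    """95% percentile bootstrap interval: 2.5th and 97.5th percentiles of the replicates."""
    rng = random.Random(seed)
    replicates = [statistic(rng.choices(data, k=len(data))) for _ in range(n_boot)]
    # n=40 gives cut points every 2.5%; the first and last are the interval limits
    cuts = statistics.quantiles(replicates, n=40, method="inclusive")
    return cuts[0], cuts[-1]

# Synthetic stand-in for one fully observed chemical element (assumption)
rng = random.Random(42)
sample = [rng.lognormvariate(0.0, 0.8) for _ in range(229)]

lower, upper = percentile_ci(sample, p5)
print(round(lower, 3), round(p5(sample), 3), round(upper, 3))
```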

Appropriate results for MR65% can only be obtained for the very first scenarios, as the same value (0.65DL) is generated for lower and upper limits once nondetects far exceed 5% (e.g. for Cr or Cu from scenario 3; Table IV). R-Unif provides highly misleading intervals already from the intermediate scenarios. R-LogN and, especially, Rob-ilrEM yield the best results. Rob-ilrEM covers the true value in most scenarios, providing the most accurate confidence intervals overall.

5.2. Bootstrap inference on the geometric mean

Figure 7 displays the kernel distribution of the bootstrap geometric means. Note that having all values participating in the computation of GeoM, as well as the relatively large size of the data set, eases any effect of the imputation method. The bootstrap distributions are more stable here than in the P5 case, and all methods show similar spread within a particular scenario. From scenario 3, the multivariate methods show better behaviour than the univariate, with Rob-ilrEM standing out. For the others, systematic underestimation is generally observed.

Bootstrap confidence intervals for GeoM are displayed in Figure 8. They show good coverage of the true GeoM in most scenarios by all the replacement methods, R-Unif being the worst case. Rob-ilrEM remains noticeably stable over scenarios, although it produces an exceptionally wide interval for Cu in Scenario 8. That can also be observed in Figure 6 in relation to P5. Note that the associated kernel densities show a long erratic lower tail (Figures 4 and 7). Although not so markedly, that pattern is also present for the other components in the more difficult scenarios with more nondetects.

We computed $\sqrt{d_a^2(\mathrm{GeoM}_b, \mathrm{GeoM})/d}$ as an overall nondetect bootstrap estimation error. The vector $\mathrm{GeoM}_b$ denotes the

Figure 5. Scatterplot matrix of alr-transformed Fort Union data (V as alr divisor): off-diagonal plots show pairwise scatterplots for alr variates containing nondetects. Normal qq-plots for each one are displayed on the diagonal.


estimated geometric means across components, and GeoM corresponds to the real ones. The Aitchison distance, $d_a$, is used here as a suitable measure of difference between compositions rather than the ordinary Euclidean distance [45]. Figure 9a shows that the multivariate methods provide lower errors with more difficult nondetect patterns, with Rob-ilrEM surpassing any other.
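The error measure just described can be written out as a short function. The sketch below computes the Aitchison distance as the Euclidean distance between clr-transformed vectors, which is a standard equivalent form. The true geometric means are taken from Table I, whereas the estimated vector is a made-up illustration, and the divisor d is passed as a parameter because its exact definition is given elsewhere in the paper (the value 5 used here is an assumption).

```python
import math

def clr(x):
    """Centred log-ratio transform: logs minus their mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

def geom_error(estimated, true, d):
    """Overall estimation error: sqrt(d_a^2(estimated, true) / d)."""
    return math.sqrt(aitchison_distance(estimated, true) ** 2 / d)

true_geom = [4.07, 49.54, 0.74, 1.53, 70.05]   # Cr, Cu, Hg, U, V from Table I
est_geom = [3.80, 48.90, 0.70, 1.50, 70.05]    # hypothetical bootstrap estimate

error = geom_error(est_geom, true_geom, d=5)   # d=5 is an assumption
print(round(error, 4))
```

Because the distance depends only on ratios between components, rescaling either vector by a constant (for instance moving from ppm to percentages) leaves the error unchanged, which is one reason it is preferred to the ordinary Euclidean distance for compositions.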

5.3. Compositional variability measures

Distortion in data variability is measured from four components of the variation matrix (Var(ln(Cu/V)), Var(ln(Cr/Cu)), Var(ln(Hg/U)) and Var(ln(Cr/Hg))) and the total variability. We focus on the absolute differences between their bootstrap estimates and the true values (Table II). Figures 9b and c show the respective results along with a bootstrap estimate of the relative difference in covariance matrix in Figure 9d.
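The variability measures referred to above can be computed from data as follows. This is a generic sketch: the function names are hypothetical, the sample variance with n − 1 in the denominator is an assumption, the total variability divides by the number of components as in the check against Table II, and the bootstrap estimate used in the last lines is a made-up number for illustration.

```python
import math
import statistics

def log_ratio_variance(part_a, part_b):
    """Sample variance of ln(a/b) across samples: one entry of the variation matrix."""
    return statistics.variance([math.log(a / b) for a, b in zip(part_a, part_b)])

def total_variability(columns):
    """Sum of log-ratio variances over all pairs, divided by the number of parts."""
    names = list(columns)
    pair_sum = sum(
        log_ratio_variance(columns[p], columns[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    )
    return pair_sum / len(names)

# Tiny toy data set with three parts (illustrative values only)
toy = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],    # always twice A, so Var(ln(A/B)) = 0
    "C": [5.0, 3.0, 8.0, 2.0],
}
print(log_ratio_variance(toy["A"], toy["B"]))
print(round(total_variability(toy), 4))

# Absolute error of a bootstrap estimate against the true value in Table II
true_var_cu_v = 0.30
bootstrap_estimate = 0.34        # hypothetical value
abs_error = abs(bootstrap_estimate - true_var_cu_v)
```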

As can be seen in Figure 9, the overall picture is quite similar in all cases, and the ranking of the methods is roughly replicated. In scenarios 1 and 2, all methods produce similar errors, with the exception of R-Unif standing out for its poor performance over all the scenarios. As the number of nondetects increases, the Rob-ilrEM, followed by the R-LogN, provides the most reliable results when it comes to estimating compositional variability. Note, however, that Rob-ilrEM produces disproportionate errors in scenario 8 (not plotted to avoid vertical scale distortion). Figure 9b confirms that these problems appear again where Cr is involved.

Figure 6. Bootstrap 95% confidence intervals for the fifth percentile: horizontal segments indicate bootstrap confidence intervals by scenario. Vertical dotted line represents the true sample value (P5Cr = 1.24, P5Cu = 26.4, P5Hg = 0.26, P5U = 0.50).


6. DISCUSSION

Bootstrap estimation of distributional statistics was reviewed for the case of chemical samples of compositional nature containing nondetects. The uncertainty associated with them was incorporated into the estimates by including an imputation step. Results from a collection of imputation methods satisfying compositional principles were contrasted on the basis of ordinary summary statistics.

The multiplicative adjustment used for univariate methods (MR65%, R-Unif and R-LogN) preserves the sum of the components and the ratios between those components without nondetects. Importantly, it guarantees equivalent results regardless of the measuring scale. The MR65% and R-Unif are compositional counterparts of popular approaches used in the wider earth sciences community. For point estimation of the lowest percentiles, results from nondetect bootstrap based on simple multiplicative replacement are completely determined by the fraction of the DL chosen to impute (0.65DL in our case). That choice does not actually rely on any data modelling, so its accuracy may be rather arbitrary. Besides, the results for P5 indicated that bootstrap distribution and confidence intervals for

Figure 7. Nondetect bootstrap distributions of geometric mean: univariate imputation methods (solid lines) versus multivariate methods (discontinuous lines). Vertical dotted line represents the true sample value (GeoMCr = 4.07, GeoMCu = 49.54, GeoMHg = 0.74 and GeoMU = 1.53). Panels show scenarios 1 to 8 for each chemical element; the legend lists MR65%, R-Unif, R-LogN, alrEM, R-ALN, alrDA and Rob-ilrEM.


such low percentiles would be only feasible for very low DLs. However, the method may provide comparable results for low levels of nondetects when the estimates involve the whole range of values, such as the geometric mean, or compositional variability. When based on R-Unif, the bootstrap provided left-biased estimates even from the very first scenarios. It performed particularly poorly for variability measures. Hence, the R-Unif was revealed as an inadvisable method in any case. Among the univariate methods, R-LogN came out as the most reliable, assuming that the lognormal is a sensible model for the data. In many cases, it even behaved better than multivariate alternatives such as alrEM, R-ALN or alrDA in relation to P5 and the variability measures. Such a performance encourages further investigation of this approach in future developments. In particular, the authors in [46], based on the principle of working on log-ratio coordinates, discuss the modelling of positive variates using a normal distribution defined on the positive real space.

From our experiment, we can conclude that, overall, the multivariate methods provided better results than the univariate ones. This was so in spite of the fact that the alr correlations were low in our case study. However, a method such as alrEM, successful in nonbootstrap contexts, did not perform as competently as we would have expected in comparison with its robust counterpart Rob-ilrEM. The R-ALN and alrDA proposals did not seem to cope well with the challenges of a bootstrap scheme. The results from these new approaches were very close to each other in most cases, and also to those from alrEM,

Figure 8. Bootstrap 95% confidence intervals for geometric mean: horizontal segments indicate bootstrap confidence intervals by scenario. Vertical dotted line represents the true sample value (GeoMCr = 4.07, GeoMCu = 49.54, GeoMHg = 0.74 and GeoMU = 1.53).


especially when the geometric mean was the target. Both R-ALN and alrDA used ALN parameter estimates from alrEM as the starting point. However, they did not improve the results significantly. Besides, the alrDA involved an important computational burden as DA iterations add up to bootstrap iterations, which may compromise its use in highly multivariate studies. However, its possibilities outside the bootstrap framework, especially in small-sample situations, may still be worth some additional investigation.

We could argue that the weak alr correlation structure of the data set used did not favour any multivariate method. But an even more relevant feature of this bootstrap framework is the variety of multivariate patterns of nondetects that can be generated from the random resampling process on its own. The complexity of these patterns will increase with higher scenarios, and the amount of data available within each one will decrease. The global correlation structure will also be increasingly distorted. In consequence, the ability of those methods to extract reliable interrelationship information from certain resamples can drop dramatically. The Rob-ilrEM, however, by making use of a more stable core of data as a result of robust methodologies, better managed these situations and turned out to be less sensitive to departures from model assumptions or extreme values. The nondetect bootstrap based on it stood out as the most reliable and accurate approach in our study. On the downside, it must be mentioned that the algorithm is quite computationally demanding. We have pointed out above some numerical instability issues in the highest scenario. In our trials, we also found convergence problems with more intricate scenarios. Thus, further refinements are required on that procedure and its computer implementation for highly complex settings. In those cases, the simpler R-LogN can still be a reliable alternative for the practitioner.

Finally, given the preceding results, we emphasize two lines of work that would represent natural extensions of the study presented here. On the one hand, a number of nondetect scenarios were carefully designed to represent realistic conditions, following guidance from applied scientists in the field. However, a comprehensive simulation study covering, for example, different correlation structures and sample sizes, may provide further insight into the performance of the proposals. On the other hand, note that spatial correlation can be a common issue in some scientific fields within the natural sciences. The standard bootstrap method, however, assumes independent and identically distributed data. Further developments should include ways to handle spatial correlation in the case that independence is not a sensible assumption. Along these lines, reference [47] provides a procedure for applying the bootstrap on correlated observations, which may be adapted to accommodate nondetects in chemical data of compositional nature according to our proposal.

Acknowledgements

This research has been supported by the Scottish Government's Rural and Environment Science and Analytical Services Division

Figure 9. Distortion in basic compositional statistics: (a) estimation error in compositional geometric mean, (b) estimation error in some components of variation matrix, (c) estimation error in total variability and (d) relative difference in covariance matrix.


(RESAS), the Spanish Ministry of Economy and Competitiveness under the project 'METRICS' Ref. MTM2012-33236 and the Agència de Gestió d'Ajuts Universitaris i de Recerca of the Generalitat de Catalunya under the project Ref. 2009SGR424.

REFERENCES

1. Hopke PK, Xie Y, Paatero P. Mixed multiway analysis of airborne particle composition data. J. Chemometrics 1999; 13: 343–352.
2. Buccianti A, Nisi B, Martín-Fernández JA, Palarea-Albaladejo J. Methods to investigate the geochemistry of groundwaters with values for nitrogen compounds below the detection limit. J. Geochem. Explor. DOI: 10.1016/j.gexplo.2014.01.014 [January 2014].
3. Montero-Serrano JC, Palarea-Albaladejo J, Martín-Fernández JA, Martínez M, Gutiérrez J. Sedimentary chemofacies characterization by means of multivariate analysis. Sediment. Geol. 2010; 228: 218–228.
4. Helsel DR. Statistics for Censored Environmental Data using Minitab and R (2nd edn). John Wiley & Sons: Hoboken, USA, 2012.
5. Palarea-Albaladejo J, Martín-Fernández JA. Values below detection limit in compositional chemical data. Anal. Chim. Acta 2013; 764: 32–43.
6. Egozcue JJ, Pawlowsky-Glahn V. Basic concepts and procedures. In Compositional Data Analysis: Theory and Applications, Pawlowsky-Glahn V, Buccianti A (eds.). John Wiley & Sons: Chichester, UK, 2011; 12–28.
7. Pawlowsky-Glahn V, Egozcue JJ. Geometric approach to statistical analysis on the simplex. Stoch. Env. Res. Risk A. 2001; 15: 384–398.
8. Pawlowsky-Glahn V, Buccianti A (eds). Compositional Data Analysis: Theory and Applications. John Wiley & Sons: Chichester, UK, 2011.
9. Rayens WS, Srinivasan C. Box–Cox transformations in the analysis of compositional data. J. Chemometrics 1991; 5: 227–239.
10. Wang H, Liu Q, Mok HMK, Fu L, Tse WM. A hyperspherical transformation forecasting model for compositional data. Eur. J. Oper. Res. 2007; 179: 459–468.
11. Lemberge P, De Raedt I, Janssens KH, Wei F, Van Espen PJ. Quantitative analysis of 16–17th century archeological glass vessels using PLS regression of EPXMA and μ-XRF data. J. Chemometrics 2000; 14: 751–763.
12. Aitchison J. The Statistical Analysis of Compositional Data. Chapman & Hall: London, UK, 1986.
13. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Chapman & Hall: Boca Raton, USA, 2003.
14. Chernick MR. Bootstrap Methods: A Guide for Practitioners and Researchers. Wiley-Interscience: Hoboken, USA, 2008.
15. Wehrens R, van der Linden WE. Bootstrapping principal component regression models. J. Chemometrics 1997; 11: 157–171.
16. Kiers HAL. Bootstrap confidence intervals for three-way methods. J. Chemometrics 2004; 18: 22–36.
17. Efron B. Missing data, imputation, and the bootstrap. J. Am. Stat. Assoc. 1994; 89: 463–478.
18. Imaizumi Y, Suzuki N, Shiraishi H. Bootstrap methods for confidence intervals of percentiles from dataset containing nondetected observations using lognormal distribution. J. Chemometrics 2006; 20: 68–75.
19. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. http://www.R-project.org/ [November 2013].
20. Palarea-Albaladejo J, Martín-Fernández JA, Buccianti A. Compositional methods for estimating elemental concentrations below the limit of detection in practice using R. J. Geochem. Explor. DOI: 10.1016/j.gexplo.2013.09.003 [November 2013].
21. Templ M, Hron K, Filzmoser P. robCompositions: an R-package for robust statistical analysis of compositional data. In Compositional Data Analysis: Theory and Applications, Pawlowsky-Glahn V, Buccianti A (eds.). John Wiley & Sons: Chichester, UK, 2011; 341–355.
22. Martín-Fernández JA, Barceló-Vidal C, Pawlowsky-Glahn V. Dealing with zeros and missing values in compositional data sets using nonparametric imputation. Math. Geo. 2003; 35: 253–278.
23. Dorey FJ, Little RJ, Schenker N. Multiple imputation for threshold-crossing data with interval censoring. Stat. Med. 1993; 12: 1589–1603.
24. Huybrechts T, Dewulf OTJ, van Langenhove H. How to estimate moments and quantiles of environmental data sets with non-detect observations? A case study of volatile organic compounds in marine water samples. J. Chromatogr. A 2002; 975: 123–133.
25. Billheimer D. Compositional receptor modeling. Environmetrics 2001; 12: 451–467.
26. Kolb C, Martín-Fernández JA, Abart R, Lammer H. The chemical variability at the surface of Mars: implication for sediment formation and rock weathering. Icarus 2006; 183: 10–29.
27. Howel D. Multivariate data analysis of pollutant profiles: PCB levels across Europe. Chemosphere 2007; 67: 1300–1307.
28. Kacjan-Marsic N, Sircelj H, Kastelec D. Lipophilic antioxidants and some carpometric characteristics of fruits of ten processing tomato varieties grown in different climatic conditions. J. Agric. Food Chem. 2012; 58: 390–397.
29. Aitchison J, Shen SM. Logistic-normal distributions. Some properties and uses. Biometrika 1980; 67: 261–272.
30. Egozcue JJ, Pawlowsky-Glahn V, Mateu-Figueras G, Barceló-Vidal C. Isometric logratio transformations for compositional data analysis. Math. Geo. 2003; 35: 279–300.
31. Egozcue JJ, Pawlowsky-Glahn V. Groups of parts and their balances in compositional data analysis. Math. Geo. 2005; 37: 795–828.
32. Filzmoser P, Hron K. Robust statistical analysis. In Compositional Data Analysis: Theory and Applications, Pawlowsky-Glahn V, Buccianti A (eds.). John Wiley & Sons: Chichester, UK, 2011; 59–72.
33. Hron K, Templ M, Filzmoser P. Imputation of missing values for compositional data using classical and robust methods. Comput. Stat. Data An. 2010; 54: 3095–3107.
34. Martín-Fernández JA, Hron K, Templ M, Filzmoser P, Palarea-Albaladejo J. Model-based replacement of rounded zeros in compositional data: classical and robust approaches. Comput. Stat. Data An. 2012; 56: 2688–2704.
35. Palarea-Albaladejo J, Martín-Fernández JA, Gómez-García J. A parametric approach for dealing with compositional rounded zeros. Math. Geo. 2007; 39: 625–645.
36. Palarea-Albaladejo J, Martín-Fernández JA. A modified EM alr-algorithm for replacing rounded zeros in compositional data sets. Comput. Geosci. 2008; 34: 902–917.
37. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977; 39: 1–38.
38. Robert CP. Simulation of truncated normal variables. Stat. Comput. 1995; 5: 12–15.
39. Tanner MA, Wong WH. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 1987; 82: 528–540.
40. Schafer JL. Analysis of Incomplete Multivariate Data. Chapman & Hall: London, UK, 1997.
41. Geboy NJ, Olea RA, Engle MA, Martín-Fernández JA. Using simulated maps to interpret the geochemistry, formation, and quality of the Blue Gem coal bed, Kentucky, USA. Int. J. Coal Geol. 2013; 112: 26–35.
42. Aitchison J, Greenacre M. Biplots for compositional data. J. R. Stat. Soc. Ser. C 2002; 51: 375–392.
43. Silverman BW. Density Estimation. Chapman & Hall: London, UK, 1986.
44. Garrett RG. The chi-square plot: a tool for multivariate outlier recognition. J. Geochem. Explor. 1989; 32: 319–341.
45. Palarea-Albaladejo J, Martín-Fernández JA, Soto JA. Dealing with distances and transformations for fuzzy c-means clustering of compositional data. J. Classif. 2012; 29: 144–169.
46. Mateu-Figueras G, Pawlowsky-Glahn V, Egozcue JJ. The normal distribution in some constrained sample spaces. SORT 2013; 37: 29–56.
47. Solow AR. Bootstrapping correlated data. Math. Geo. 1985; 17: 769–775.
