arXiv:submit/0813239 [q-bio.QM] 1 Oct 2013

Waste Not, Want Not: Why Rarefying Microbiome Data is …statweb.stanford.edu/~susan/papers/WNWNarxiv10_01.pdf · 2013-10-01 · that use proportions directly [34]. Statistical motivation


Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible.

October 1, 2013

Paul J. McMurdie 1, Susan Holmes 1*
1 Statistics Department, Stanford University, Stanford, CA, USA
* E-mail: [email protected]

Abstract

The interpretation of count data originating from the current generation of DNA sequencing platforms requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude from the same sequencing run, and the counts are overdispersed relative to a simple Poisson model [1]. These challenges can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability. This approach is already well-characterized and implemented for RNA-Seq data in R packages such as edgeR [2] and DESeq [3]. Unfortunately, this class of techniques has been overlooked in the microbiome literature [4, 5], where the currently-recommended library size normalization is done by computing proportions or by rarefying counts through random subsampling until each library has the same sum [6]. We use statistical theory, extensive simulations, and empirical data to show that variance stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. In simulations detecting differential abundance, normalization procedures based on a Gamma-Poisson mixture model provided systematic improvement in performance over crude proportions or rarefied counts – both of which led to a high rate of false positives. In simulations evaluating clustering accuracy, we found that the rarefying procedure discarded samples that were nevertheless accurately clustered by alternative methods, and that the choice of minimum library size threshold was critical in some settings, but with an optimum that is unknown in practice. Techniques that use variance stabilizing transformations by modeling microbiome count data with a mixture distribution, such as those implemented in edgeR and DESeq, substantially improved upon techniques that attempt to normalize by rarefying or crude proportions. Based on these results and well-established statistical theory, we advocate that investigators avoid rarefying altogether. We have provided microbiome-specific extensions to these tools in the R package, phyloseq.

Introduction

Modern, massively parallel DNA sequencing technologies have changed the scope and technique of investigations across many fields of biology [7, 8]. In gene expression studies the standard measurement technique has shifted away from microarray hybridization to direct sequencing of cDNA, a technique often referred to as RNA-Seq [9]. Even though the statistical methods available for analyzing microarray data have matured to a high level of sophistication [10], these methods are not directly applicable to RNA-Seq data because it consists of discrete counts (DNA sequencing reads) rather than continuous values derived from fluorescence intensity. In high throughput sequencing, the total reads per sample (library size; sometimes referred to as depth of coverage) can vary by orders of magnitude within a single sequencing run. Comparison across samples with different library sizes requires more than a simple linear scaling adjustment because it also implies different levels of uncertainty, as measured by the sampling variance of the proportion estimate for each gene.

Variation in read counts among technical replicates with comparable library sizes is adequately modeled by a Poisson random variable in most cases [11]. However, we are usually interested in understanding variation among biological replicates, wherein a mixture model is necessary to account for the added uncertainty [12]. Taking a hierarchical model approach with the Gamma-Poisson has provided a satisfactory fit to RNA-Seq data [13], as well as a valid regression framework that uses generalized linear models [14]. A Gamma mixture of Poisson variables gives the negative binomial (NB) distribution [12, 13] and several RNA-Seq analysis packages now model the counts, $K$, for gene $i$, in sample $j$ according to:

$$K_{ij} \sim \mathrm{NB}(s_j \mu_i, \phi_i) \tag{1}$$

where $s_j$ is a linear scaling factor for sample $j$ that accounts for its library size, $\mu_i$ is the mean proportion for gene $i$, and $\phi_i$ is the dispersion parameter for gene $i$. The variance is $\nu_i = s_j \mu_i + \phi_i s_j \mu_i^2$, with the NB distribution becoming Poisson when $\phi = 0$. Recognizing that $\phi > 0$ and estimating its value is important in gene-level tests. This reduces false positive genes that appear significant under a Poisson distribution, but not after accounting for non-zero dispersion.

The uncertainty in estimating $\phi_i$ for every gene when there is a small number of samples (or a small number of biological replicates) can be mitigated by sharing information across the thousands of genes in an experiment, leveraging a systematic trend in the mean-dispersion relationship [13]. This approach substantially increases the power to detect differences in proportions (differential expression) while still adequately controlling for false positives [3]. Many R packages implementing this model of RNA-Seq data are now available, differing mainly in their approach to modeling dispersion across genes [15].

Analogous to the development of gene expression research, culture independent [16] microbial ecology research has migrated away from detection of species (or Operational Taxonomic Units, OTUs) through microarray hybridization of rRNA gene PCR amplicons [17] to direct sequencing of highly-variable regions of these amplicons [18], or even direct shotgun sequencing of microbiome metagenomic DNA [19]. Although these latter microbiome investigations use the same DNA sequencing platforms and represent the processed sequence data in the same manner (a feature-by-sample contingency table where the features are OTUs instead of genes), the modeling and normalization methods just described for RNA-Seq analysis have not been transferred to microbiome research [4, 5].

Standard microbiome analysis workflows begin with an ad hoc library size normalization by random subsampling without replacement, or so-called rarefying [6]. Rarefying is most often defined by the following steps.

1. Select a minimum library size, $N_L$.

2. Discard libraries (samples) that are smaller than $N_L$ in size.

3. Subsample the remaining libraries without replacement such that they all have size $N_L$.
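The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by QIIME, mothur, or phyloseq; the function name `rarefy`, the dictionary layout, and the example counts are assumptions made for this sketch.

```python
import random

def rarefy(libraries, n_min, seed=0):
    """Rarefy OTU count libraries to a common size n_min.

    libraries: dict mapping sample name -> list of per-OTU counts (hypothetical layout).
    Returns only the samples with at least n_min reads, each subsampled
    without replacement to exactly n_min reads.
    """
    rng = random.Random(seed)  # fixed seed so the subsample can be reproduced
    rarefied = {}
    for name, counts in libraries.items():
        if sum(counts) < n_min:
            continue  # step 2: discard libraries smaller than n_min
        # Expand counts into one OTU index per read, then draw n_min reads.
        reads = [otu for otu, k in enumerate(counts) for _ in range(k)]
        kept = rng.sample(reads, n_min)  # step 3: sampling without replacement
        rarefied[name] = [kept.count(otu) for otu in range(len(counts))]
    return rarefied

# Hypothetical libraries; "C" is smaller than the chosen minimum and is dropped.
libraries = {"A": [62, 38], "B": [500, 500], "C": [10, 5]}
result = rarefy(libraries, n_min=100)
print(result)
```

Passing the seed explicitly is a design choice of this sketch: it makes the random step repeatable and easy to record alongside the results.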

Often $N_L$ is chosen to be equal to the size of the smallest library that is not considered an artifact, though in experiments with large variation in library size identifying artifact samples can be subjective. In many cases researchers have also failed to repeat the random subsampling step or record the pseudorandom number generation seed/process – both of which are essential for reproducibility. To our knowledge, rarefying was first recommended for microbiome counts in order to moderate the sensitivity of the UniFrac distance [20] to library size, especially differences in the presence of rare OTUs attributable to library size [21]. In these and similar studies the principal objective is to compare microbiome samples from different sources, a research task that is increasingly accessible with declining sequencing costs and the ability to sequence many samples in parallel using barcoded primers [22, 23]. Rarefying is now an exceedingly common precursor to microbiome multivariate workflows that seek to relate sample covariates to sample-wise distance matrices [6, 24, 25]; for example, integrated as a recommended option in QIIME's [26] beta-diversity-through-plots.py workflow, in Sub.sample in the mothur software library [27], and in daisychopper.pl [28]. This perception in the microbiome literature of “rarefying to even sampling depth” as a standard normalization procedure appears to explain why rarefied counts are also used in studies that attempt to detect differential abundance of OTUs between predefined classes of samples [29–33], in addition to studies that use proportions directly [34].

Statistical motivation

Unfortunately, rarefying biological count data is unjustified despite its current ubiquity in microbiome analyses. The following is a minimal example to explain why rarefying is theoretically inadmissible, especially with regards to variance stabilization. A mathematical proof of the sub-optimality of the subsampling approach is presented in the supplementary material (Supporting Information File SA).

Suppose we want to compare two different samples, called A and B, comprised of 100 and 1000 reads, respectively. In these hypothetical communities only two types of microbes have been observed, OTU1 and OTU2, according to Table 1, Left.

Table 1. A minimal example of the effect of rarefying on power.

Original Abundance       Rarefied Abundance       Standard Tests for Difference
         A      B                 A      B        P-value    χ2       Prop     Fisher
OTU1    62    500        OTU1    62     50        Original   0.0290   0.0290   0.0272
OTU2    38    500        OTU2    38     50        Rarefied   0.1171   0.1171   0.1169
Total  100   1000        Total  100    100

Hypothetical abundance data in its original (Left) and rarefied (Middle) form, with corresponding formal test results for differentiation (Right).

Formally comparing the two proportions according to a standard test is done either using a $\chi^2$ test (equivalent to a two sample proportion test here) or a Fisher exact test. By rarefying (Table 1, middle) so that both samples have the same number of counts, we are no longer able to differentiate between them (Table 1, right). This loss of power is completely attributable to reducing the size of B by a factor of 10. Decreasing the number of observations in B widens the confidence intervals corresponding to each proportion, such that the proportions are no longer distinguishable from those in A, even though they are distinguishable in the original data.
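The comparison in Table 1 can be checked with a short, dependency-free sketch of the $\chi^2$ test. It assumes a Yates continuity correction (the text does not say whether one was applied) and covers only the $\chi^2$ column, not the Fisher exact test; the function name `chi2_yates_pvalue` is hypothetical.

```python
import math

def chi2_yates_pvalue(a, b, c, d):
    """P-value of a chi-squared test with Yates' continuity correction
    for the 2x2 table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    rows = (a + b) * (c + d)
    cols = (a + c) * (b + d)
    stat = n * max(abs(a * d - b * c) - n / 2, 0) ** 2 / (rows * cols)
    # With 1 degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(stat / 2))

# Rows are OTU1 and OTU2; columns are samples A and B, as in Table 1.
p_original = chi2_yates_pvalue(62, 500, 38, 500)
p_rarefied = chi2_yates_pvalue(62, 50, 38, 50)
print(f"original: {p_original:.4f}, rarefied: {p_rarefied:.4f}")
```

Under that assumption the two p-values should land close to the $\chi^2$ column of Table 1, with the original data below the conventional 0.05 level and the rarefied data above it.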

The variance of the proportion's estimate $\hat{p}$ is multiplied by 10 when the total count is divided by 10. In this binomial example the variance of the proportion estimate is $\mathrm{Var}(X/n) = pq/n = \frac{q}{n} E(X/n)$, with $q = 1 - p$, a function of the mean. This is a common occurrence and one that is traditionally dealt with in statistics by applying variance-stabilizing transformations. We show in Supporting Information File SA that the relation between the variance and the mean for microbiome count data can be estimated and the model used to find the optimal variance-stabilizing transformation. Calling the proportions of taxa $i$, $p_i = K_{ij}/s_j$, we can see from the above example that it is inappropriate to compare them without accounting for the variability in the denominators, because of their unequal variances (known as heteroscedasticity). In other words, the uncertainty associated with each value in the table is fundamentally linked to the total number of observations, which varies widely across samples, often much more than the 10-fold difference in our example. This uncertainty must be considered when testing for a difference between proportions (an OTU), or sets of proportions (a microbial community).

Although rarefying does equalize variances, it does so only by inflating the variances in all samples to the largest (worst) value among them at the cost of discriminating power (increased uncertainty). Rarefying adds additional uncertainty through the random subsampling step, such that Table 1 shows the best-case, approached only with a sufficient number of repeated rarefying trials (see Supporting Information File S3, minimal example). In this sense alone, the random step in rarefying is unnecessary. Each count value can be transformed to a common scale by rounding $K_{ij} s_{min}/s_j$. Although this common-scale approach is an improvement over the rarefying method here defined, both methods suffer from the same problems related to lost data.
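As a small worked illustration of the two points above, the following sketch computes the binomial variance of a proportion at two library sizes and applies the deterministic common-scale rounding to sample B of Table 1. The function names are hypothetical, and the rounding uses Python's built-in `round` (ties go to the even integer), which is an assumption of this sketch rather than something specified in the text.

```python
def proportion_variance(p, n):
    """Binomial variance of an estimated proportion: Var(X/n) = p * q / n."""
    return p * (1 - p) / n

def common_scale(counts, s_min):
    """Scale one library to size s_min by rounding K * s_min / s (no random draw).

    The rounded counts need not sum exactly to s_min in general.
    """
    s = sum(counts)
    return [round(k * s_min / s) for k in counts]

# Sample B from Table 1: 1000 reads split evenly between two OTUs.
inflation = proportion_variance(0.5, 100) / proportion_variance(0.5, 1000)
scaled_b = common_scale([500, 500], s_min=100)
print(inflation, scaled_b)
```

The variance ratio reflects the tenfold loss of reads, and the scaled counts match the rarefied column of Table 1 without any random subsampling.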

In this article we demonstrate the applicability of a variance stabilization technique based on a mixture model of microbiome count data. This approach simultaneously addresses both problems of (1) DNA sequencing libraries of widely different sizes, and (2) count data that varies more than expected under a Poisson model. We utilize the most popular implementations of this approach currently used in RNA-Seq analysis, namely edgeR [2] and DESeq [3], adapted here for microbiome data. This approach allows valid comparison across OTUs while substantially improving both power and accuracy in the detection of differential abundance of microbes. We also compare the Gamma-Poisson mixture model performance against a method that models OTU proportions using a zero-inflated Gaussian distribution, implemented in a recently-released package called metagenomeSeq [35].

Results

[Figure 1 plot: “Common-Scale Variance versus Mean, multiple studies/replicates”. Panel columns: common_scale, rarefied. Panel rows (study and replicate): DietPatterns 2004--HighFat; DietPatterns 2008--LowFat; GlobalPatterns freshwater biome--None; GlobalPatterns human-associated habitat--oral; GlobalPatterns marine biome--None. Horizontal axis: Mean (log scale, 1 to 10000). Vertical axis: Variance (log scale, 1e+00 to 1e+09).]

Figure 1. Variance versus mean for OTU counts. Each point in each panel represents a different OTU's mean/variance estimate for a biological replicate and study. The data in this figure come from the “Global Patterns” survey [36] and the “Long-Term Dietary Patterns” study [37], with results from many more studies included in Supporting Information File S3. (Right) Variance versus mean abundance for rarefied counts. (Left) Common-scale variances and common-scale means, estimated according to Equations 7 and 6 from Anders and Huber [3], implemented in the DESeq package (Supporting Information File SA). The solid gray line denotes the $\sigma^2 = \mu$ case (Poisson; $\phi = 0$). The red curve denotes the fitted variance estimate using DESeq [3], with method='pooled', sharingMode='fit-only', fitType='local'. The distribution of library sizes (total counts per sample) from these studies spans several orders of magnitude.

We first demonstrate through a survey of publicly-available empirical microbiome data that the parameters of a NB model, especially $\phi_i$, can be adequately estimated among biological replicates in microbiome data (Figure 1), despite a weak assertion to the contrary in White et al. [38]. This also demonstrates that overdispersion is present ($\phi > 0$) in virtually all cases, before and after rarefying, as expected. In order to precisely quantify the negative effects of rarefying, and the relative benefits of an appropriate mixture model, we created two microbiome simulation workflows based on repeated subsampling from empirical data organized according to Figure 2. Simulation A represents a hypothetical experiment in which the main goal is to distinguish microbiome samples by distance measures and clustering. Simulation B represents a hypothetical experiment in which the goal is to detect microbes that are differentially abundant between two known classes of samples.

Overdispersion in Microbiome Data

We performed a survey of publicly available microbiome count data, to evaluate the variance-mean relationship for OTUs among sets of biological replicates (Figure 1). In every instance, the variances were larger than could be expected under a Poisson model (overdispersed), especially at larger values of the common-scale mean. By definition, these OTUs are the most abundant, and often receive the greatest interest in many studies. After rarefying, the absolute scales are decreased because of the omitted data, but overdispersion is still present (Figure 1, right).
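A rough way to see whether a single OTU is overdispersed is a method-of-moments estimate of the dispersion. The sketch below assumes the replicate counts are already on a common scale (all $s_j = 1$), so that the variance relation reduces to $\nu = \mu + \phi \mu^2$; it is not the shared-information estimator used by DESeq or edgeR, and the function name and example counts are hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion for one OTU across biological replicates.

    Solves variance = mean + phi * mean**2 for phi, truncated at zero.
    """
    m = mean(counts)
    v = variance(counts)  # unbiased sample variance
    if m == 0:
        return 0.0
    return max((v - m) / m ** 2, 0.0)

replicates = [10, 20, 30, 40]  # hypothetical common-scale counts for one OTU
phi = moment_dispersion(replicates)
print(f"mean={mean(replicates)}, variance={variance(replicates):.2f}, phi={phi:.3f}")
```

A value of zero corresponds to the Poisson line in Figure 1, and positive values to points above it. With only a handful of replicates such per-OTU estimates are noisy, which is the motivation for sharing information across OTUs.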

Simulation Strategy

We performed extensive simulations of microbiome data followed by commonly used statistical analysis from the RNA-Seq and microbiome literature. Because the correct answer in every simulation is known, we are able to evaluate the resulting power and accuracy of each statistical method, and thus quantify the improvements one method provides over another in a given set of conditions. Our simulations include two major approaches intended to (1) assess the accuracy of clustering results when differentiating microbiome samples; and (2) detect microbial species (OTUs) that are differentially abundant between two classes of microbiome samples. In both simulation types we vary the library and effect sizes across a range of levels that are relevant for recently-published microbiome investigations (Figure 2).

[Figure 2 diagram. (A) Microbiome Clustering Simulation: microbiome count data from the Global Patterns dataset; sum across samples for each environment (Ocean, Feces); simulate mixing by adding total/m counts from Ocean to Feces, and vice versa; simulate microbiome counts by repeated sampling from these multinomials; repeat mixture and count simulation for each replicate (5X); perform clustering, evaluate accuracy. (B) Differential Abundance Simulation: sum across samples for each environment; simulate microbiome counts by repeated sampling from multinomial (74 samples drawn from same multinomial: 37 simulated test samples, 37 simulated null samples); randomly select 30 OTUs, multiply the counts by the effect-size, m, in test samples only; repeat simulation of samples and random effect for each replicate (5X); repeat for each environment type in Global Patterns dataset; perform differential abundance tests, evaluate performance.]

Figure 2. Overview of our simulation strategies. Both clustering (A) and differential abundance (B) simulations are represented. All simulations begin with real microbiome count data from a survey experiment referred to here as “the Global Patterns dataset” [36]. A rectangle with tick marks and index labels represents an abundance count matrix (“OTU table”), while a much thinner rectangle with only OTU tick marks represents a multinomial of OTU counts/proportions. In both simulation designs, the variable m is used to refer to the effect size, but its meaning is different in each simulation. Small stars emphasize a multinomial or sample in which a perturbation (our effect) has been applied.

[Figure 3 plot: “Clustering Accuracy”. Panel columns (distance): Bray-Curtis, Euclidean, PoissonDist, UniFrac-u, UniFrac-w, top-MSD. Panel rows: 1000, 2000, 10000. Horizontal axis: Effect Size (1.5 to 3.5). Vertical axis: Accuracy (0.6 to 1.0). Legend (Normalization): DESeqVS, None, Proportion, Rarefy, UQ-logFC.]

Figure 3. Clustering accuracy in simulated two-class mixing. The clustering accuracy (vertical axis) that results following different normalization and distance methods, represented by color shading and panel columns, respectively. The horizontal axis is the effect size, which in this context is an “unmixed factor”, the ratio of target to non-target simulated counts between two microbiomes that effectively have no overlapping OTUs (Fecal and Ocean microbiomes in the Global Patterns dataset [36]). Higher values of effect size indicate an easier clustering task. For further details see Methods section and Figure 2. Partitioning around medoids (PAM) was used as the clustering algorithm in all evaluations of accuracy. The solid path lines represent the mean values, with observations at the joints, and with a vertical bar representing one standard deviation above and below each observation.
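The building blocks of the two simulation designs in Figure 2 can be sketched as follows. This is an illustrative reading of the figure, not the authors' simulation code: the function names, the toy templates, and the interpretation of "add total/m counts" (the added counts are distributed like the other environment and sum to the target total divided by m) are all assumptions.

```python
import random
from collections import Counter

def mix_templates(target, other, m):
    """Simulation A: add sum(target)/m counts, distributed like `other`, to `target`."""
    scale = sum(target) / (m * sum(other))
    return [t + o * scale for t, o in zip(target, other)]

def sample_library(template, size, rng):
    """Draw one simulated library of `size` reads from a multinomial template."""
    draws = rng.choices(range(len(template)), weights=template, k=size)
    tally = Counter(draws)
    return [tally[i] for i in range(len(template))]

def apply_effect(sample, otu_indices, m):
    """Simulation B: multiply the counts of the selected OTUs by effect size m."""
    return [k * m if i in otu_indices else k for i, k in enumerate(sample)]

rng = random.Random(42)          # seed chosen arbitrarily for reproducibility
ocean = [90, 10, 0, 0]           # hypothetical summed Ocean template
feces = [0, 0, 30, 70]           # hypothetical summed Feces template

mixed_ocean = mix_templates(ocean, feces, m=2)
simulated = sample_library(mixed_ocean, size=1000, rng=rng)
perturbed = apply_effect(simulated, otu_indices={0}, m=5)
print(mixed_ocean, simulated, perturbed)
```

In this reading, a larger m in Simulation A means less cross-contamination between environments and therefore an easier clustering task, whereas in Simulation B a larger m means a stronger differential abundance signal.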

Sample Clustering

In simulations evaluating clustering accuracy, we found that rarefying undermined the performance of downstream clustering methods by discarding samples with small library sizes that nevertheless were accurately clustered by alternative procedures on the same simulated data (Figure 3). The extent to which rarefying procedures performed worse depended on the effect-size (difficulty of the clustering task), the typical library size of the samples in the simulation, and most of all, the choice of threshold for the minimum library size (Figure 4).

We further investigated this dependency on minimum library threshold in additional simulations. We found that samples were trivial to cluster for the largest library sizes using any distance method, even with the threshold set to the smallest sample size in the simulation (no samples discarded, all correctly clustered). However, at more modest sample sizes typical of highly-parallel experimental designs the optimum choice of library size threshold is less clear. A small threshold implies retaining more samples but with a smaller number of reads (less information) per sample; whereas a larger threshold implies more discarded samples, but with more reads/information in the samples that remain. In our simulations the optimum choice of threshold hovered around the 15th-percentile of sample sizes for most simulations and normalization/distance procedures (Figure 4). Regions within Figure 4 in which all distances have converged to the same negatively-sloped line are regions for which the minimum library threshold completely controls clustering accuracy (all samples not discarded are accurately clustered). Sections to the left of this convergence are regions in which there is a positive or neutral compromise between discarding fewer samples and retaining enough counts per sample for accurate clustering.
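The trade-off described above can be made concrete with a small sketch that picks the minimum library size as a quantile of the observed library sizes and reports the resulting ceiling on accuracy, assuming every discarded sample counts as misclustered. The function names, the simple index-based quantile, and the example sizes are assumptions of this sketch.

```python
def threshold_at_quantile(library_sizes, q):
    """Library size at (approximately) the q-th quantile, used as the minimum threshold."""
    ordered = sorted(library_sizes)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

def max_achievable_accuracy(library_sizes, threshold):
    """Upper bound on accuracy when every discarded sample counts as misclustered."""
    discarded = sum(1 for size in library_sizes if size < threshold)
    return 1 - discarded / len(library_sizes)

sizes = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]  # hypothetical
for q in (0.0, 0.2, 0.4):
    t = threshold_at_quantile(sizes, q)
    print(q, t, max_achievable_accuracy(sizes, t))
```

Raising the quantile increases the reads retained per sample but lowers this ceiling, which is the negatively-sloped bound visible in Figure 4.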

Differential Abundance Detection

Our simulations demonstrate improvement in sensitivity and specificity when normalization and subsequent tests are based upon a relevant mixture model (Figure 5). Multiple t-tests with correction for multiple inference did not perform well on this data, whether on rarefied counts or on crude proportions. A direct comparison of the performance of more sophisticated parametric methods applied to both original and rarefied counts demonstrates strong potential of these methods and large improvements if rarefying is not used at all.

[Figure 4 plot: “Clustering Accuracy, rarefied counts only”. Panel columns (median library size): 1000, 2000, 5000, 10000. Panel rows (effect size): 1.15, 1.25, 1.5. Horizontal axis: Library Size Minimum Quantile (0.0 to 0.4). Vertical axis: Accuracy (0.6 to 1.0). Legend (Distance): Bray-Curtis, DESeq-VS, edgeR, PoiClaClu, UniFrac-u, UniFrac-w.]

Figure 4. Normalization by rarefying only, dependency on library size threshold. Unlike the analytical methods represented in Figure 3, here rarefying is the only normalization method used, and with different values of the minimum library size threshold, indicated as sample-size quantile (horizontal axis). Panel columns, panel rows, and point/line shading indicate median library size, effect size, and distance method applied after rarefying, respectively. A thin guide line demarcates the maximum achievable accuracy (because discarded samples cannot be accurately clustered), $y = 1 - x$.

In general, the proportion of false positives from tests based on crude proportions or rarefied counts was unacceptably high, and increased with the effect size. This is an undesirable phenomenon in which the increased relative abundance of the true-positive OTUs (the effect) is large enough that the null (unmodified) OTUs appear significantly more abundant in the null samples than in the test samples. This explanation is easily verified by the sign of the test statistics of the false positive OTU abundances, which was uniformly positive (Supporting Information File S3). Importantly, this side-effect of a strong differential abundance was observed rarely in edgeR performance results, and was essentially absent in DESeq results – which had a false positive rate near-zero under most conditions and no correlation between false positive rate and effect size (Supporting Information File S1). In most simulations crude proportions outperformed rarefied counts (Figure 5), but also suffered from a higher rate of false positives at larger values of effect size (Supporting Information File S1).
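The mechanism behind these false positives can be illustrated with a toy example in which a single OTU is multiplied by the effect size and all other counts are left untouched. The counts and names below are hypothetical and serve only to show how proportions shift.

```python
def proportions(counts):
    """Convert one sample's OTU counts to relative abundances."""
    total = sum(counts)
    return [k / total for k in counts]

effect_size = 10
null_sample = [100, 100, 100, 100]   # hypothetical counts; OTU 0 is the true positive
test_sample = [null_sample[0] * effect_size] + null_sample[1:]

null_props = proportions(null_sample)
test_props = proportions(test_sample)
for otu in range(1, 4):
    # The raw counts of these OTUs are identical in both samples,
    # yet their proportions are lower in the test sample.
    print(otu, null_props[otu], round(test_props[otu], 4))
```

Because proportions must sum to one, inflating one OTU deflates all the others, so a test on proportions can flag unmodified OTUs as more abundant in the null samples.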

[Figure 5 plot: “Differential Abundance Detection Performance”. Panel columns: DESeq, DESeq2, edgeR, metagenomeSeq, mt. Panel rows: 2000, 50000. Horizontal axis: Effect Size (2.5 to 10.0). Vertical axis: AUC (0.6 to 1.0). Legends: Samples/Class (3, 6, 10, 17); Normalization (Model/None, Proportion, Rarefied).]

Figure 5. Performance of differential abundance detection with and without rarefying. Performance is summarized here by the “Area Under the Curve” (AUC) metric of a Receiver Operating Characteristic (ROC) curve [39] (vertical axis). Briefly, the AUC value varies from 0.5 (random) to 1.0 (perfect), and incorporates both sensitivity and specificity. The horizontal axis indicates the effect size, shown as the actual multiplication factor applied to the OTU abundances. Each curve traces the respective normalization method's mean performance of that panel, with a vertical bar indicating a standard deviation in performance across all replicates and microbiome templates. The right-hand side of the panel rows indicates the median number of reads per sample (left), and the number of samples per simulated experiment (right). Colors indicate the type of normalization applied. The green “Model” color indicates that normalization is incorporated within the model used by the analysis method labeled above the first three panel columns. The right-most panel column, “mt”, shows the results using a two-sample Welch t-statistic (unequal variances). All P-values were adjusted for multiple hypotheses using the Benjamini-Hochberg (B-H) procedure, with a detection significance threshold of 0.05.
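For readers unfamiliar with the B-H adjustment mentioned in the caption, a minimal sketch of the standard step-up procedure follows. The function name and the example p-values are hypothetical; in practice an established routine (for example `p.adjust` in R) would normally be used.

```python
def benjamini_hochberg(p_values):
    """Return B-H adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical per-OTU p-values
adjusted = benjamini_hochberg(p_values)
detected = [i for i, p in enumerate(adjusted) if p < 0.05]
print(adjusted, detected)
```

An OTU is then called differentially abundant when its adjusted p-value falls below the chosen threshold, 0.05 in Figure 5.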


Discussion

Historically, the ecological community has favored the use of repeated subsampling without replacement to estimate sequencing/observation coverage and species richness [40]. This non-parametric approach, called rarefaction, can be justified for the generation and interpretation of accumulation curves. However, an analysis based upon a single subsampling of the data (rarefying) cannot be justified in a statistical framework because it requires the omission of available valid data. Some of the justification for the rarefying procedure appears to originate from microbiome-wide exploratory comparisons for which it was believed that a larger library size also captures more noise in the form of rare species, leading to a depth-dependent increase in both alpha-diversity measures and beta-diversity dissimilarities [21, 41], especially UniFrac [42]. As we demonstrate here, it is more data-efficient to model the noise and address extra species using statistical normalization methods based on variance stabilization and robustification methods. Though beyond the scope of this work, a Bayesian approach to species abundance estimation would allow the inclusion of pseudo-counts from a Dirichlet prior that should also substantially decrease this sensitivity.

We enumerate the following statistical costs associated with the rarefying procedure:

1. Rarefied counts represent only a small fraction of the original data, implying an increase in Type-II error (Table 1, Supporting Information File S2).

2. Rarefied counts remain overdispersed relative to a Poisson model, implying an increase in Type-I error. Overdispersion is theoretically expected for counts of this nature, and we unambiguously detected overdispersion in our survey of publicly available microbiome counts (Figure 1). Estimating overdispersion is also more difficult after rarefying because of the lost information (Figure 5). In our simulations, Type-I error was much worse for rarefied counts than original counts (Supplementary Information File S1).

3. Rarefying counts requires an arbitrary selection of a library size minimum threshold that affects downstream inference (Figure 4), but for which an optimal value is not known for new empirical data.

4. The random aspect of subsampling is unnecessary and adds artificial uncertainty (Supporting Information File S3, minimal example, bottom). Subsampling could be achieved instead by rounding the expected value of each count in the new smaller library size ($K_{ij} s_{min}/s_j$), avoiding the additional sampling error as well as the often-ignored need to record/publish the random seed and repeat the random step [21].
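The contrast drawn in item 4 can be sketched in a few lines. This is a minimal illustration only, not the code from Supporting Information File S3: the function names `rarefy_random` and `scale_and_round`, the toy counts, and the threshold are all hypothetical.

```python
import random

def rarefy_random(counts, s_min, seed=None):
    """Rarefying: draw s_min reads without replacement from one sample."""
    rng = random.Random(seed)
    # Expand the count vector into one OTU label per read, then subsample.
    reads = [otu for otu, k in enumerate(counts) for _ in range(k)]
    picked = rng.sample(reads, s_min)
    out = [0] * len(counts)
    for otu in picked:
        out[otu] += 1
    return out

def scale_and_round(counts, s_min):
    """Deterministic alternative: round the expected value K_ij * s_min / s_j."""
    s_j = sum(counts)
    return [round(k * s_min / s_j) for k in counts]

sample = [620, 250, 100, 24, 6]   # hypothetical OTU counts, library size 1000
s_min = 100                       # hypothetical minimum library size

print(scale_and_round(sample, s_min))        # identical on every call
print(rarefy_random(sample, s_min, seed=1))  # changes with the seed
print(rarefy_random(sample, s_min, seed=2))
```

The rounded vector is reproducible without a seed, although in general it need not sum exactly to the target library size. Neither variant recovers the reads that were discarded.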

Microbiome Sample Comparisons

The costs listed above may be tolerable for experiments in which the primary goal is to interpret distances between microbiome samples (such as clustering, principal coordinate analysis, multivariate regression [43], etc.), provided that the effect-size(s) distinguishing pattern(s) of interest or the original library sizes are large enough to compensate for the lost data. For novel or subtle effects, it is unknown if either the minimum library or effect sizes are large enough to adequately compensate. Sizes large enough to tolerate rarefied counts were evident in the right-hand side of the plots in Figure 3, especially the larger library sizes in the bottom panels. Even in these large-size scenarios there remains a practical cost of rarefying, because most or all discarded samples would have been accurately clustered were they not discarded (Figure 4). This latter point might also be tolerable if there is also an excess of biological replicates. However, in complex experimental settings with subtle effects or a restriction on library sizes, the cost of rarefying should not be considered tolerable. Furthermore, clustering on rarefied counts was matched or outperformed by clustering on sample proportions, particularly when Bray-Curtis or weighted-UniFrac was used as distance (Figure 3).

Microbial Differential Abundance

In clinical settings and other hypothesis-driven experiments, we are typically interested in detecting differential abundance of species/OTUs between two or more sample classes. In this setting we find that the costs of rarefying are substantial and intolerable in essentially all scenarios. Both rarefied counts and sample proportions result in an unacceptably high rate of false positive OTUs. As we described theoretically in the introduction, this is explained by differences among biological replicates that manifest as overdispersion, leading to a subsequent underestimate of the true variance if a relevant mixture model is not used. We detected overdispersion among biological replicates in all publicly available microbiome count datasets that we surveyed (Figure 1). Failure to account for this overdispersion – by using crude proportions or rarefied counts – results in a systematic bias that increases the Type-I error rate even after correcting for multiple hypotheses (e.g. Benjamini-Hochberg [44]). In other words, we predict that many previously reported differentially abundant OTUs are false positives attributable to a noise model that underestimates uncertainty.

In our simulations this propensity for Type-I error increased with the effect size, e.g. the fold-change in OTU abundance among the true-positive OTUs. For rarefied counts, we also detected a simultaneous increase in Type-II error attributable to the forfeited data (Supporting Information File S2), consistent with the loss of power described in the Introduction’s minimal example (see Table 1). It may be tempting to imagine that the increased variance estimate due to rarefying could be counterbalanced by the variance underestimate that results from omitting a relevant mixture model. However, such a scenario constitutes an unlikely special case, and false positives will not compensate for the false negatives in general. In our simulations both Type-I and Type-II error increased for rarefied counts (Figure 5).

Fortunately, we have demonstrated that strongly-performing alternative methods for normalization and inference are already available. In particular, an analysis that models counts with the Negative Binomial implemented in DESeq [3] was able to accurately and specifically detect differential abundance over the full range of effect sizes, replicate numbers, and library sizes that we simulated (Figure 5). DESeq-based analyses are routinely applied to more complex tests and experimental designs using the generalized linear model interface in R [45], and so are not limited to a simple two-class design.

Based on our simulation results and the widely enjoyed success for highly similar RNA-Seq data, we recommend using DESeq to perform analysis of differential abundance in microbiome experiments. It should be noted that we did not comprehensively explore all available RNA-Seq analysis methods, which is an active area of research. Comparisons of many of these methods on empirical [46, 47] and simulated [15, 48, 49] data find consistently effective performance for detection of differential expression. One minor exception is an increased Type-I error for edgeR compared to later methods [46], which was also detected in our results relative to DESeq (Supplementary Information File S1). Generally speaking, the reported performance improvements between these methods are incremental relative to the large gains attributable to applying a relevant mixture model of the noise with shared-strength across OTUs (features). However, some of these alternatives from the RNA-Seq community may outperform DESeq on microbiome data meeting special conditions, for example a large proportion of true positives and sufficient replicates [50], small sample sizes [15], or extreme values [51].

Although we did not explore the topic in the simulations here described, a procedure for further improving differential expression detection performance, called Independent Filtering [52], theoretically applies to microbial differential abundance as well. Some heuristics for filtering low-abundance OTUs are already described in the documentation of various microbiome analysis workflows [26, 27], and in many cases these are a form of Independent Filtering. More effort is needed to optimize Independent Filtering for differential abundance detection, and rigorously define the theoretical basis and heuristics applicable to microbiome data. Ideally a formal application of Independent Filtering of OTUs would replace the current ad hoc approach, which includes a troubling amount of flexibility, poor justification, and the opportunity to introduce bias.

Implications

Our results have substantial implications for past and future microbiome analyses, particularly regarding the interpretation of differential abundance. Most microbiome studies utilizing high-throughput DNA sequencing to acquire culture-independent counts of species/OTUs have used either proportions or rarefied counts to address widely varying library sizes. The theoretical and simulated results presented here suggest that findings of differential abundance based on rarefied counts or proportions are biased toward false positives, and might warrant re-evaluation. Current and future investigations into microbial differential abundance should avoid using rarefied counts. Instead, uncertainty should be modeled using a hierarchical mixture model, such as the Poisson-Gamma or Binomial-Beta models, and normalization should be done using the relevant variance-stabilizing transformations. This can easily be put into practice using powerful implementations in R, like edgeR and DESeq, that performed well on our simulated microbiome data. We have provided convenient wrappers for edgeR and DESeq that are tailored for microbiome count data, and these wrappers are included in the most recent release of the phyloseq package [53] with corresponding tutorials. We have also provided the complete code and documentation necessary to exactly reproduce the simulations, analyses, surveys, and examples shown here, including all figures (Supplementary Information File S3). This example of fully reproducible research can and should be applied to future publication of microbiome analyses [54–56].

Materials and Methods

We have included in Supporting Information File S3 the complete source code for computing the survey, simulations, normalizations, and performance assessments described in this article. Where applicable, this code includes the RNG seed so that the simulations and random resampling procedures can be reproduced exactly. Interested investigators can inspect and modify this code, change the random seed and other parameters, and observe the results (including figures). For ease of inspection, we have authored the source code in R flavored markdown [57], through which we have generated HTML5 files for each simulation that include our extensive comments interleaved with code and results. Our simulation output can be optionally modified and re-executed using the knit2html function in the knitr package. This function will take the location of the simulation source files as input, evaluate the R code in sequence, generate graphics and markdown, and produce the complete HTML5 output file that can be viewed in any modern web browser. These simulations, analyses, and graphics rely upon the cluster [58], foreach [59], ggplot2 [60], phyloseq [53], plyr [61], reshape2 [62], and ROCR [39] R packages; in addition to the DESeq [3], edgeR [2], and PoiClaClu [63] R packages for RNA-Seq data, and tools available in the standard R distribution [64]. The Global Patterns [36] dataset included in phyloseq was used as empirical microbiome template data. The code to perform the survey and generate Figure 1 is also included as an R Markdown source file in Supporting Information File S3, and includes the code to acquire the data using the phyloseq interface to the microbio.me/qiime server.

Acknowledgments

We would like to thank the developers of the open source packages leveraged here for improved insights into microbiome data, in particular Gordon Smyth and his group for edgeR [2], and Wolfgang Huber and his team for DESeq [3], whose useful documentation and continued support have been invaluable. The Bioconductor and R teams [64, 65] have provided valuable support for our development and release of code related to microbiome analysis in R. We would also like to thank Rob Knight and his lab for QIIME [26], which has drastically decreased the time required to get from raw phylogenetic sequence data to OTU counts. Hadley Wickham created and continues to support the ggplot2 [60] and reshape [62]/plyr [61] packages that have proven useful for graphical representation and manipulation of data, respectively. RStudio and GitHub have provided immensely useful and free applications that were used in the respective development and versioning of the source code published with this manuscript.


References

[1] Robinson MD, Smyth GK (2007) Moderated statistical tests for assessing differences in tag abundance. Bioinformatics (Oxford, England) 23: 2881–2887.

[2] Robinson MD, McCarthy DJ, Smyth GK (2009) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (Oxford, England) 26: 139–140.

[3] Anders S, Huber W (2010) Differential expression analysis for sequence count data. Genome biology 11: R106.

[4] Di Bella JM, Bao Y, Gloor GB, Burton JP, Reid G (2013) High throughput sequencing methods and analysis for microbiome research. Journal of microbiological methods.

[5] Segata N, Boernigen D, Tickle TL, Morgan XC, Garrett WS, et al. (2013) Computational meta’omics for microbial community studies. Molecular systems biology 9: 666.

[6] Koren O, Knights D, Gonzalez A, Waldron L, Segata N, et al. (2013) A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS computational biology 9: e1002863.

[7] Shendure J, Lieberman Aiden E (2012) The expanding scope of DNA sequencing. Nature biotechnology 30: 1084–1094.

[8] Shendure J, Ji H (2008) Next-generation DNA sequencing. Nature biotechnology 26: 1135–1145.

[9] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature methods 5: 621–628.

[10] Allison DB, Cui X, Page GP, Sabripour M (2006) Microarray Data Analysis: from Disarray to Consolidation and Consensus. Nat Rev Genet 7: 55–65.

[11] Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research 18: 1509–1517.

[12] Lu J, Tomfohr JK, Kepler TB (2005) Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics 6: 165.

[13] Robinson MD, Smyth GK (2007) Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics (Oxford, England) 9: 321–332.

[14] Cameron AC, Trivedi P (2013) Regression analysis of count data.

[15] Yu D, Huber W, Vitek O (2013) Shrinkage estimation of dispersion in Negative Binomial models for RNA-seq experiments with small sample size. Bioinformatics (Oxford, England) 29: 1275–1282.

[16] Pace NR (1997) A molecular view of microbial diversity and the biosphere. Science 276: 734–740.

[17] Wilson KH, Wilson WJ, Radosevich JL, DeSantis TZ, Viswanathan VS, et al. (2002) High-Density Microarray of Small-Subunit Ribosomal DNA Probes. Appl Environ Microbiol 68: 2535–2541.

[18] Huse SM, Dethlefsen L, Huber JA, Mark Welch D, Welch DM, et al. (2008) Exploring microbial diversity and taxonomy using SSU rRNA hypervariable tag sequencing. PLoS genetics 4: e1000255.

[19] Riesenfeld CS, Schloss PD, Handelsman J (2004) Metagenomics: genomic analysis of microbial communities. Annu Rev Genet 38: 525–552.

[20] Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71: 8228–8235.

[21] Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R (2010) UniFrac: an effective distance metric for microbial community comparison. The ISME Journal.

[22] Hamady M, Walker JJ, Harris JK, Gold NJ, Knight R (2008) Error-correcting barcoded primers for pyrosequencing hundreds of samples in multiplex. Nature Methods 5: 235–237.


[23] Liu Z, DeSantis TZ, Andersen GL, Knight R (2008) Accurate taxonomy assignments from 16S rRNA sequences produced by highly parallel pyrosequencers. Nucleic Acids Research 36: e120.

[24] Hamady M, Lozupone C, Knight R (2009) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. The ISME Journal.

[25] Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, et al. (2012) Human gut microbiome viewed across age and geography. Nature 486: 222–227.

[26] Caporaso J, Kuczynski J, Stombaugh J, Bittinger K, Bushman F, et al. (2010) QIIME allows analysis of high-throughput community sequencing data. Nature methods 7: 335–336.

[27] Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities. Applied and Environmental Microbiology 75: 7537–7541.

[28] Gilbert JA, Field D, Swift P, Newbold L, Oliver A, et al. (2009) The seasonal structure of microbial communities in the Western English Channel. Environmental Microbiology 11: 3132–3139.

[29] Charlson ES, Chen J, Custers-Allen R, Bittinger K, Li H, et al. (2010) Disordered microbial communities in the upper respiratory tract of cigarette smokers. PLoS ONE 5: e15216.

[30] Price LB, Liu CM, Johnson KE, Aziz M, Lau MK, et al. (2010) The effects of circumcision on the penis microbiome. PLoS ONE 5: e8422.

[31] Kembel SW, Jones E, Kline J, Northcutt D, Stenson J, et al. (2012) Architectural design influences the diversity and structure of the built environment microbiome. The ISME Journal 6: 1469–1479.

[32] Flores GE, Bates ST, Caporaso JG, Lauber CL, Leff JW, et al. (2013) Diversity, distribution and sources of bacteria in residential kitchens. Environmental Microbiology 15: 588–596.

[33] Kang DW, Park JG, Ilhan ZE, Wallstrom G, Labaer J, et al. (2013) Reduced incidence of Prevotella and other fermenters in intestinal microflora of autistic children. PLoS ONE 8: e68322.

[34] Segata N, Haake SK, Mannon P, Lemon KP, Waldron L, et al. (2012) Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome biology 13: R42.

[35] Paulson JN, Stine OC, Bravo HC, Pop M (2013) Robust statistical methods for differential abundance analysis of marker gene survey data. In submission.

[36] Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, et al. (2011) Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108: 4516–4522.

[37] Wu GD, Chen J, Hoffmann C, Bittinger K, Chen YY, et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science 334: 105–108.

[38] White JR, Nagarajan N, Pop M (2009) Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS computational biology 5: e1000352.

[39] Sing T, Sander O, Beerenwinkel N, Lengauer T (2005) ROCR: visualizing classifier performance in R. Bioinformatics (Oxford, England) 21: 3940–3941.

[40] Sanders HL (1968) Marine benthic diversity: A comparative study. The American Naturalist 102: 243–282.

[41] Chao A, Chazdon RL, Colwell RK, Shen TJ (2005) A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters.


[42] Schloss PD (2008) Evaluating different approaches that test whether microbial communities have the same structure. The ISME Journal 2: 265–275.

[43] Zapala MA, Schork NJ (2006) Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables. Proceedings of the National Academy of Sciences of the United States of America 103: 19430–19435.

[44] Benjamini Y, Hochberg Y (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 57: 289–300.

[45] Hastie TJ, Pregibon D (1992) Generalized linear models. In: Chambers JM, Hastie TJ, editors, Statistical Models in S, Chapman & Hall/CRC, chapter 6.

[46] Nookaew I, Papini M, Pornputtapong N, Scalcinati G, Fagerberg L, et al. (2012) A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Research 40: 10084–10097.

[47] Bullard J, Purdom E, Hansen K, Dudoit S (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC bioinformatics 11: 94.

[48] Sun J, Nishiyama T, Shimizu K, Kadota K (2013) TCC: an R package for comparing tag count data with robust normalization strategies. BMC Bioinformatics 14: 219.

[49] Soneson C, Delorenzi M (2013) A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 14: 91.

[50] Hardcastle TJ, Kelly KA (2010) baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics 11: 422.

[51] Ozer HG, Parvin JD, Huang K (2012) DFI: gene feature discovery in RNA-seq experiments from multiple sources. BMC genomics 13 Suppl 8: S11.

[52] Bourgon R, Gentleman R, Huber W (2010) Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences of the United States of America 107: 9546–9551.

[53] McMurdie PJ, Holmes S (2013) phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS ONE 8: e61217.

[54] Gentleman R, Temple Lang D (2004) Statistical analyses and reproducible research. Bioconductor Project Working Papers: 2.

[55] Peng RD (2011) Reproducible research in computational science. Science 334: 1226–1227.

[56] Donoho DL (2010) An invitation to reproducible computational research. Biostatistics (Oxford, England) 11: 385–388.

[57] Allaire J, Horner J, Marti V, Porte N. The markdown package: Markdown rendering for R. Accessed 2013 March 22. URL http://CRAN.R-project.org/package=markdown. R package version 0.5.4.

[58] Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2013) cluster: Cluster Analysis Basics and Extensions. R package version 1.14.4. For new features, see the ’Changelog’ file (in the package source).

[59] Analytics R (2011) foreach: Foreach looping construct for R. R package version 1.3.2.

[60] Wickham H (2009) ggplot2: elegant graphics for data analysis. Springer New York.

[61] Wickham H (2011) The split-apply-combine strategy for data analysis. Journal of Statistical Software 40: 1–29.

[62] Wickham H (2007) Reshaping data with the reshape package. Journal of Statistical Software 21: 1–20.


[63] Witten DM (2011) Classification and clustering of sequencing data using a Poisson model. The Annals of Applied Statistics 5: 2493–2518.

[64] R Development Core Team (2011) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

[65] Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biology 5: R80.

[66] Lu J, Tomfohr J, Kepler T (2005) Identifying differential expression in multiple SAGE libraries: an overdispersed log-linear model approach. BMC bioinformatics 6: 165.

[67] Rice JA (2007) Mathematical statistics and data analysis. Cengage Learning.

[68] Anscombe FJ (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35: 246–254.

[69] Berger JO (1985) Statistical decision theory and Bayesian analysis. Springer.

[70] Zhou YH, Xia K, Wright FA (2011) A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics (Oxford, England) 27: 2672–2678.


A Mathematical Supplement

In this supplementary material we go over some of the statistical details pertaining to the use of hierarchical mixture models such as the Negative Binomial and the Beta Binomial, which are appropriate for addressing additional sources of variability inherent to microbiome experimental data, while still retaining statistical power. We have concentrated our comparison efforts on the Gamma-Poisson mixture model as some authors [66] have remarked that this approach seems to be the most statistically robust approach in the sense that the presence of outliers and model misspecification does not over-perturb the results. We show how a Negative Binomial distribution can occur in different ways leading to different parameterizations. We then show that there are transformations we can apply to these random variables, such that the transformed data have a variance which is much closer to constant than the original. These variance stabilizing transformations lead to more efficient estimators and give better decision rules than those obtained via the normalization-through-subsampling method known as rarefying.

Two parameterizations of the negative binomial

In classical probability, the negative binomial is often introduced as the distribution of the number of successes in a sequence of Bernoulli trials with probability of success $p$ before $r$ failures occur. Thus with the two parameters $r$ and $p$, the probability distribution for the negative binomial is given as

$$X \sim NB(r; p)$$

$$P(X = k) = \binom{k + r - 1}{k} (1 - p)^r p^k = \frac{\Gamma(k + r)}{k!\,\Gamma(r)} (1 - p)^r p^k$$

The mean of the distribution is $m = \frac{pr}{1-p}$ and the variance is $Var(X) = \frac{pr}{(1-p)^2}$. Sometimes the distribution is given a different parameterization, which we use here. This takes as the two parameters the mean $m$ and $r = \frac{1-p}{p} m$; the probability mass distribution is then rewritten:

$$X \sim NB(m; r)$$

$$P(X = k) = \binom{k + r - 1}{k} \left(\frac{r}{r + m}\right)^r \left(\frac{m}{r + m}\right)^k = \frac{\Gamma(k + r)}{k!\,\Gamma(r)} \left(\frac{r}{r + m}\right)^r \left(\frac{m}{r + m}\right)^k$$

The variance is $Var(X) = \frac{m(m+r)}{r} = m + \frac{m^2}{r}$. We will also use $\phi = \frac{1}{r}$ and call this the overdispersion parameter, giving $Var(X) = m + \phi m^2$. In the limit $\phi \to 0$ the distribution of $X$ is Poisson($m$). This is the (mean = $m$, overdispersion = $\phi$) parametrization we will use from now on.
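As a quick numerical check of the (mean, overdispersion) form, the probability mass function can be summed directly to recover the stated mean and variance. This is a minimal sketch; the helper name `nb_pmf`, the parameter values, and the truncation point are illustrative assumptions.

```python
from math import lgamma, log, exp

def nb_pmf(k, m, phi):
    """Negative binomial pmf in the (mean = m, overdispersion = phi) form, phi > 0."""
    r = 1.0 / phi
    # Work on the log scale to avoid overflow in the Gamma functions.
    log_p = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
             + r * log(r / (r + m)) + k * log(m / (r + m)))
    return exp(log_p)

m, phi = 10.0, 0.5        # illustrative values only
ks = range(0, 2000)       # truncation chosen so the upper tail is negligible
probs = [nb_pmf(k, m, phi) for k in ks]

total = sum(probs)
mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

# Compare with 1, m, and m + phi * m**2 respectively.
print(total, mean, var)
```

With these values the variance is six times the mean, whereas a Poisson model with the same mean would force the two to be equal.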

Negative Binomial as a hierarchical mixture for read counts

In biological contexts such as RNA-seq and microbial count data the negative binomial distribution arises as a hierarchical mixture of Poisson distributions. This is due to the fact that if we had technical replicates with the same read counts, we would see Poisson variation with a given mean. However, the variation among biological replicates and library size differences both introduce additional sources of variability.

To address this, we take the means of the Poisson variables to be random variables themselves having a Gamma distribution with (hyper)parameters shape $r$ and scale $p/(1 - p)$. We first generate a random mean, $\lambda$, for the Poisson from the Gamma, and then a random variable, $k$, from the Poisson($\lambda$). The marginal distribution is:

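$$
\begin{aligned}
P(X = k) &= \int_0^\infty Po_\lambda(k) \times \gamma\left(r, \frac{p}{1-p}\right) d\lambda \\
&= \int_0^\infty \frac{\lambda^k}{k!} e^{-\lambda} \times \frac{\lambda^{r-1} e^{-\lambda \frac{1-p}{p}}}{\left(\frac{p}{1-p}\right)^r \Gamma(r)} \, d\lambda \\
&= \frac{(1 - p)^r}{p^r \, k! \, \Gamma(r)} \int_0^\infty \lambda^{r+k-1} e^{-\lambda/p} \, d\lambda \\
&= \frac{(1 - p)^r}{p^r \, k! \, \Gamma(r)} \, p^{r+k} \, \Gamma(r + k) \\
&= \frac{\Gamma(r + k)}{k! \, \Gamma(r)} \, p^k (1 - p)^r
\end{aligned}
$$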
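The derivation can be checked numerically by integrating the Poisson probability against the Gamma density and comparing with the closed form on the last line. This is only a sketch under stated assumptions: the function names, the hyperparameter values, the integration limit, and the midpoint rule are all illustrative choices.

```python
from math import lgamma, log, exp

def poisson_pmf(k, lam):
    return exp(k * log(lam) - lam - lgamma(k + 1))

def gamma_pdf(lam, shape, scale):
    return exp((shape - 1) * log(lam) - lam / scale
               - shape * log(scale) - lgamma(shape))

def nb_pmf(k, r, p):
    """Closed form reached at the end of the derivation."""
    return exp(lgamma(r + k) - lgamma(k + 1) - lgamma(r)
               + k * log(p) + r * log(1 - p))

def mixture_pmf(k, r, p, upper=60.0, steps=50_000):
    """Midpoint-rule approximation of the Gamma-Poisson integral over lambda."""
    h = upper / steps
    scale = p / (1 - p)
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h          # midpoint of the i-th subinterval
        total += poisson_pmf(k, lam) * gamma_pdf(lam, r, scale)
    return total * h

r, p = 3.0, 0.6   # illustrative hyperparameters only
for k in (0, 1, 5, 12):
    # The two columns should agree to several decimal places.
    print(k, mixture_pmf(k, r, p), nb_pmf(k, r, p))
```

The upper limit of 60 is adequate for these hyperparameters because the integrand decays like $e^{-\lambda/p}$; other values of $r$ and $p$ may need a different limit.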


Variance Stabilization

Statisticians usually prefer to deal with errors across samples or in regression situations which are independent and identically distributed. In particular there is a strong preference for homoscedasticity (equal variances) across all the noise levels. This is not the case when we have unequal sample sizes and variations in the accuracy across instruments. A standard way of dealing with heteroscedastic noise is to try to decompose the sources of heterogeneity and apply transformations that make the noise variance almost constant. These are called variance stabilizing transformations.

Take for instance different Poisson variables with mean $\mu_i$. Their variances are all different if the $\mu_i$ are different. However, if the square root transformation is applied to each of the variables, then the transformed variables will have approximately constant variance¹. More generally, choosing a transformation that makes the variance constant is done by using a Taylor series expansion, called the delta method. We will not give the complete development of variance stabilization in the context of mixtures but point the interested reader to the standard texts in theoretical statistics such as [67] and one of the original articles on variance stabilization [68]. Anscombe showed that there are several transformations that stabilize the variance of the Negative Binomial depending on the values of the parameters $m$ and $r$, where $r = \frac{1}{\phi}$ is sometimes called the exponent of the Negative Binomial. For large $m$ and constant $m\phi$, the transformation

$$\sinh^{-1} \sqrt{\left(\frac{1}{\phi} - \frac{1}{2}\right) \frac{x + \frac{3}{8}}{\frac{1}{\phi} - \frac{3}{4}}}$$

gives a constant variance around $\frac{1}{4}$. Whereas for $m$ large and $\frac{1}{\phi}$ not substantially increasing, the following simpler transformation is preferable

$$\log\left(x + \frac{1}{2\phi}\right)$$

These two transformations are actually used in what is often known as a generalized logarithmic transformation applied in microarray variance stabilizing transformations and RNA-seq normalization [3].

¹ Actually, if we take the transformation $x \to 2\sqrt{x}$ we obtain a variance approximately equal to 1.
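The Poisson square-root example can be verified by computing the variance of the transformed variable directly from the probability mass function. This is a minimal sketch; the helper names, the chosen means, and the truncation point `k_max` are illustrative assumptions.

```python
from math import lgamma, log, exp, sqrt

def poisson_pmf(k, mu):
    return exp(k * log(mu) - mu - lgamma(k + 1))

def variance_of(transform, mu, k_max=2000):
    """Variance of transform(X) for X ~ Poisson(mu), by direct summation of the pmf."""
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, mu) for k in ks]
    mean = sum(transform(k) * p for k, p in zip(ks, probs))
    return sum((transform(k) - mean) ** 2 * p for k, p in zip(ks, probs))

for mu in (5, 20, 100, 500):   # illustrative means only
    raw = variance_of(lambda k: k, mu)                  # grows with mu
    root = variance_of(sqrt, mu)                        # stays near 1/4
    two_root = variance_of(lambda k: 2 * sqrt(k), mu)   # stays near 1
    print(mu, round(raw, 3), round(root, 3), round(two_root, 3))
```

The untransformed variance tracks the mean, while the transformed variances settle close to their limiting constants as the mean grows. The approximation is looser for small means.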

Modeling read counts

If we have technical replicates with the same number of reads $s_j$, we expect to see Poisson variation with mean $\mu = s_j u_i$, for each taxon $i$ whose incidence proportion we denote by $u_i$. Thus the number of reads for sample $j$ and taxon $i$ would be

$$K_{ij} \sim \text{Poisson}(s_j u_i)$$

We use the notational convention that lower case letters designate fixed or observed values whereas upper case letters designate random variables.

For biological replicates within the same group – such as treatment or control groups or the same environments – the proportions $u_i$ will be variable between samples. A flexible model that works well for this variability is the Gamma distribution, as it has two parameters and can be adapted to many distributional shapes. Call the two parameters $r_i$ and $\frac{p_i}{1-p_i}$, so that $U_{ij}$, the proportion of taxon $i$ in sample $j$, is distributed according to Gamma($r_i$, $\frac{p_i}{1-p_i}$). Thus the read counts $K_{ij}$ follow a Poisson-Gamma mixture of different Poisson variables. As shown above we can use the Negative Binomial with parameters ($m = u_i s_j$) and $\phi_i$ as a satisfactory model of the variability.

Now we can add to this model the fact that the samples belong to different conditions such as treatment and control or different environments. This is done by separately estimating the values of the parameters for each of the different biological replicate conditions/classes. We will use the index $c$ for the different conditions; we then have the counts for taxon $i$ and sample $j$ in condition $c$ having a Negative Binomial distribution with $m_c = u_{ic} s_j$ and $\phi_{ic}$, so that the variance is written

$$u_{ic} s_j + \phi_{ic} s_j^2 u_{ic}^2 \qquad (2)$$
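To see how the quadratic term in equation (2) behaves as library size changes, the variance can be tabulated for a fixed taxon. This is a small sketch; the function name `nb_variance`, the proportion, the overdispersion, and the library sizes are hypothetical values chosen for illustration.

```python
def nb_variance(u_ic, s_j, phi_ic):
    """Equation (2): variance of the count for taxon i, sample j, condition c."""
    mean = u_ic * s_j
    return mean + phi_ic * mean ** 2

u_ic, phi_ic = 0.01, 0.2              # hypothetical proportion and overdispersion
for s_j in (2_000, 8_000, 50_000):    # hypothetical library sizes
    mean = u_ic * s_j
    var = nb_variance(u_ic, s_j, phi_ic)
    # Under a pure Poisson model this ratio would equal 1 at every library size.
    print(s_j, mean, var, var / mean)
```

The variance-to-mean ratio is $1 + \phi_{ic} u_{ic} s_j$, so it grows with library size. The departure from a Poisson model is therefore most visible in the deepest libraries.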

We can estimate the parameters $u_{ic}$ and $\phi_{ic}$ from the data for each OTU and sample condition. This is usually best accomplished by leveraging information across OTUs, taking advantage of a systematic relationship between the observed variance and mean, to obtain high quality shrunken estimates. The end result provides a variance stabilizing transformation of the data that allows statistically efficient comparisons between conditions. This application of a hierarchical mixture model is very similar to the random effects models used in the context of analysis of variance. A very complete comparison of this particular choice of Gamma-Poisson mixture to the Beta-Binomial and nonparametric approaches can be found in [15].
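To illustrate what variance stabilization means for the Negative Binomial variance function $m + \phi m^2$, consider the simplified case of a single known dispersion $\phi$. The transformation $g(m) = (2/\sqrt{\phi})\,\mathrm{asinh}(\sqrt{\phi m})$ has derivative $1/\sqrt{m + \phi m^2}$, so by the delta method the transformed counts have approximately unit standard deviation at every mean. The Python sketch below checks this numerically. It is a textbook simplification under the stated assumption, not the procedure implemented in DESeq or edgeR, which estimate dispersions from the data; the function names and values are hypothetical.

```python
import math

def nb_vst(m, phi):
    """Closed-form variance stabilizing transform for variance m + phi*m^2."""
    return 2.0 / math.sqrt(phi) * math.asinh(math.sqrt(phi * m))

def nb_sd(m, phi):
    """Standard deviation implied by the variance function m + phi*m^2."""
    return math.sqrt(m + phi * m * m)

def approx_transformed_sd(m, phi, rel_step=1e-4):
    """Delta-method sd of nb_vst(K): g'(m) * sd(K), with g' by central difference."""
    h = rel_step * m
    slope = (nb_vst(m + h, phi) - nb_vst(m - h, phi)) / (2.0 * h)
    return slope * nb_sd(m, phi)

phi = 0.5  # hypothetical dispersion
for m in (5.0, 50.0, 500.0, 5000.0):
    print(f"mean={m:>7.1f}  raw sd={nb_sd(m, phi):>9.2f}  "
          f"transformed sd={approx_transformed_sd(m, phi):.6f}")
```

On the raw scale the standard deviation grows roughly in proportion to the mean, whereas on the transformed scale it is approximately constant, which is what makes comparisons across library sizes and abundance levels better behaved.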

By comparison, procedures involving systematic downsampling (rarefying) are inadmissible in the statistical sense, because there is another procedure that dominates them under a mean squared error loss function. With a Bayesian formalism we can show that the hierarchical Bayes model gives a Bayes rule that is admissible [69].
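The idea of dominance under squared error loss can be illustrated in a deliberately simplified setting. Assume a single taxon with true proportion $p$ whose count in a library of $n$ reads is Binomial$(n, p)$, and compare the proportion estimated from all $n$ reads with the proportion estimated after rarefying to $n_{sub}$ reads drawn without replacement. The Python sketch below computes both mean squared errors exactly. The binomial setting, the small library sizes, and the function names are assumptions made for illustration; this is not the decision-theoretic argument of [69].

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def hypergeom_pmf(j, n, k, n_sub):
    """P(j taxon reads in a subsample of n_sub reads drawn without
    replacement from n reads, k of which belong to the taxon)."""
    return comb(k, j) * comb(n - k, n_sub - j) / comb(n, n_sub)

def mse_full(n, p):
    """Exact MSE of K/n using all n reads."""
    return sum(binom_pmf(k, n, p) * (k / n - p) ** 2 for k in range(n + 1))

def mse_rarefied(n, n_sub, p):
    """Exact MSE of the proportion after rarefying n reads down to n_sub."""
    total = 0.0
    for k in range(n + 1):
        w = binom_pmf(k, n, p)
        # Only subsample counts j that are feasible given k taxon reads.
        for j in range(max(0, n_sub - (n - k)), min(k, n_sub) + 1):
            total += w * hypergeom_pmf(j, n, k, n_sub) * (j / n_sub - p) ** 2
    return total

n, n_sub, p = 40, 10, 0.1  # hypothetical library size, rarefied depth, proportion
print(f"MSE using all reads:   {mse_full(n, p):.6f}")
print(f"MSE after rarefying:   {mse_rarefied(n, n_sub, p):.6f}")
print(f"p(1-p)/n, p(1-p)/n_sub: {p*(1-p)/n:.6f}, {p*(1-p)/n_sub:.6f}")
```

Both estimators are unbiased in this setting, so the comparison reduces to variance: discarding reads inflates the mean squared error from $p(1-p)/n$ to $p(1-p)/n_{sub}$ for every value of $p$ strictly between 0 and 1.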

Other mixture models

If instead of modeling the read counts one uses the proportions as the random variables, with differing variances due to different library sizes, the Beta-Binomial model is the standard approach. This has also been used for RNA-seq data [70], and the package metaStats [38] uses this model, although it does not use variance stabilizing transformations of the data.
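The dependence of the variance of a proportion on library size under the Beta-Binomial model can be sketched as follows. For $K \sim$ Beta-Binomial$(n, \alpha, \beta)$ with $p = \alpha/(\alpha+\beta)$ and $\rho = 1/(\alpha+\beta+1)$, the standard result is $\mathrm{Var}(K/n) = p(1-p)\,[1 + (n-1)\rho]/n$. The Python sketch below compares this closed form with a direct summation over the probability mass function. The shape parameters and library sizes are hypothetical, and the code is not taken from metaStats.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for K ~ Beta-Binomial(n, alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta)
                    - log_beta(alpha, beta))

def proportion_variance(n, alpha, beta):
    """Closed-form Var(K/n) under the Beta-Binomial model."""
    p = alpha / (alpha + beta)
    rho = 1.0 / (alpha + beta + 1.0)  # overdispersion (intra-sample correlation)
    return p * (1.0 - p) * (1.0 + (n - 1) * rho) / n

def proportion_variance_by_sum(n, alpha, beta):
    """Var(K/n) obtained by summing over the pmf, as a check on the closed form."""
    probs = [beta_binomial_pmf(k, n, alpha, beta) for k in range(n + 1)]
    mean = sum(k / n * q for k, q in enumerate(probs))
    return sum((k / n - mean) ** 2 * q for k, q in enumerate(probs))

alpha, beta = 2.0, 8.0  # hypothetical shape parameters (mean proportion 0.2)
print(f"check at n=50: closed form={proportion_variance(50, alpha, beta):.6f}, "
      f"summed={proportion_variance_by_sum(50, alpha, beta):.6f}")
for n in (2000, 8000, 50000):  # hypothetical library sizes
    print(f"n={n}: Var(K/n)={proportion_variance(n, alpha, beta):.6f}")
```

As $n$ grows the variance approaches $p(1-p)\rho$ rather than zero, so the biological variability remains even for deeply sequenced samples, while shallow libraries carry additional sampling variance.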


Supporting Information Legends


[Figure: Differential Abundance Detection Performance, False Positives. Panels by method: DESeq, DESeq2, edgeR, metagenomeSeq, mt; additional facet labels: 2000, 8000, 50000. x-axis: Effect Size; y-axis: False Positive Rate; legend (Normalization): Model/None, Proportion, Rarefied.]

Supporting Information File S1 False positives during detection of differentially abundant microbes.


[Figure: Differential Abundance Detection Performance, Power. Panels by method: DESeq, DESeq2, edgeR, metagenomeSeq, mt; additional facet labels: 2000, 8000, 50000. x-axis: Effect Size; y-axis: Power; legend (Normalization): Model/None, Proportion, Rarefied.]

Supporting Information File S2 Power results for detection of differentially abundant microbes.


Supporting Information File S3 A zip file with all Rmd source code and HTML output.
