Statistical Inference for Biodiversity

John Bunge
Department of Statistical Science, Cornell University


2

Thanks to:
• Amy Willis
• David Mark Welch
• Kathryn Barger
• Colleagues too numerous to mention

3

Bioinformaticists

4

Statisticians

5

The fundamental idea of statistics
• Distinction between observed, finite sample (data) and underlying, true reality or population
• Population = target of inference
  • Q: What is the population in microbial ecology?
  • A? Carry out the operative procedure to ∞ effort
• Inferential statistics addresses questions:
  • What can we say about reality, based on an observed sample?
  • What can we not say?
  • With what degree of certainty?

6

Mathematics: Deductive
• All men are mortal
• Socrates is a man
• Therefore Socrates is mortal

Statistics: Inductive
• "Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." – Sherlock Holmes

Inferential statistics vs. exploratory data analysis & descriptive statistics

Given the distinction between sample & population, how do we make inference from sample to population?

7

Statistics = guessing
• BUT: informed guessing
• Inference takes place in a framework = set of assumptions
• Major distinction: Bayesian vs. frequentist

Bayesian statistics
• Prior = state of belief before data collection
• Posterior = state of belief after data analysis
• Subjective vs. objective (noninformative) priors
• Objective prior requires theory of information & minimization of same
• E.g.: Estimating species richness C: objective prior = 1/√C, not "flat" (or anything else)

8

Frequentist statistics
• Reality is fixed (during the experiment)
• Sample is one of ∞ collection of possible samples
• Individual sample has meaning only as a representative of this collection
• This is the basic epistemology of Western culture

Notion of error:
• Sample varies relative to reality in repeated sampling
• Estimate varies relative to estimand in repeated sampling
• Bias, variance characterize this variation
• E.g.: the estimate $\hat{C}$ varies relative to C in repeated sampling

9

Plato’s Republic, VII,7

Behold! human beings living in an underground den, which has a mouth open towards the light and reaching all along the den; here they have been from their childhood […] Above and behind them a fire is blazing at a distance, […] you will see, if you look, a low wall built along the way, like the screen which marionette players have in front of them, over which they show the puppets. […] They see only their own shadows, or the shadows of one another, which the fire throws on the opposite wall of the cave […] To them, I said, the truth would be literally nothing but the shadows of the images.

10

Old Testament

Ecclesiastes 1:15

What is crooked cannot be straightened; what is lacking cannot be counted.

New Testament

1 Corinthians 13:12

For now we see through a glass, darkly, but then face to face: now I know in part; but then shall I know even as also I am known.

11

Questions:
• How to construct an estimate from a sample?
• Given the form of an estimate (an estimator), how does it behave relative to the truth?
  • What is its bias (if any)?
  • What is its variance?
  • What happens when the sample becomes very large (asymptotics)?
  • What optimality criteria does it satisfy?
• These are questions for mathematical analysis (not simulation)
• But for analysis, need more framework (assumptions)

Optimality theory:
• Parametric/nonparametric, unbiasedness, maximum likelihood

12

Parametric vs. nonparametric
• Parametric models: determined by a small # of parameters (numbers). Smooth, restrictive and (typically) unrealistic, but useful.
  • E.g.: Negative binomial model for species abundances, 2 parameters. Almost never fits well.
• Nonparametric models: cannot be determined by a small # of parameters. More general and unrestricted, but less informative.
  • E.g.: Nonparametric estimators of C have “arbitrarily bad informativity” (& essentially ∞ bias).

13

Unbiasedness
• Unbiased estimates do not always exist.
• E.g. Species problem: unbiased estimator exists only when size of population is known & sample ≥ ½ of population
• E.g. General estimator for species richness

$\hat{C} = \dfrac{n}{1 - p_0}$

would be unbiased, except impossible: n = # of observed species is known, but p0 = P(missing a species) is unknown

14

Maximum Likelihood
• Usually exists in parametric or nonparametric case.
• Based on maximizing likelihood function
• Requires numerical optimization (search) – may be complex
• Typically biased in small samples but extensive asymptotic optimality including unbiasedness
• Standard errors & confidence intervals follow from asymptotics
• E.g.: MLE

$\hat{C} = \dfrac{n}{1 - \hat{p}_0}$

Empirical version of unbiased estimator. Estimate p0 based on sample counts.
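To make the plug-in idea concrete, here is a minimal sketch under a strong assumption that the slides do not make: every species has the same Poisson mean λ, so observed counts follow a zero-truncated Poisson and p0 = exp(−λ). The function names and the frequency counts are hypothetical. The search step is a simple bisection on the likelihood equation, standing in for the numerical optimization mentioned above.

```python
import math

def ztp_mle(freq_counts):
    """MLE of lambda for a zero-truncated Poisson, given {j: f_j}."""
    n = sum(freq_counts.values())                        # observed species
    total = sum(j * f for j, f in freq_counts.items())   # observed individuals
    mean = total / n                                     # mean of the positive counts
    if mean <= 1.0:
        raise ValueError("no finite MLE: every observed species is a singleton")
    # Likelihood equation: lambda / (1 - exp(-lambda)) = mean.
    # The left side increases from 1 upward, so bisection on (0, mean] works.
    lo, hi = 1e-12, mean
    for _ in range(200):
        mid = (lo + hi) / 2
        if mid / -math.expm1(-mid) < mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def richness_estimate(freq_counts):
    """Return (C_hat, p0_hat) with C_hat = n / (1 - p0_hat)."""
    lam = ztp_mle(freq_counts)
    p0_hat = math.exp(-lam)            # estimated P(missing a species)
    n = sum(freq_counts.values())
    return n / (1 - p0_hat), p0_hat

counts = {1: 30, 2: 15, 3: 5}          # hypothetical frequency counts
c_hat, p0_hat = richness_estimate(counts)
```

The equal-λ model is the "point mass" case and is rarely realistic for microbial data; it is used here only because its likelihood equation fits in one line.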

15

Some considerations
• Confidence intervals. Often associated with “error bars.”
  • Considerable logical jump from standard error (SE) (of estimate) to confidence interval. Different theoretical bases. CI not necessarily = estimate ± 2 SE.
  • E.g. CI for # of species is asymmetric
• Hypothesis testing. Null and alternative hypotheses, H0 and HA, are statements about underlying reality, NOT about data or samples.
  • p-value = strength of evidence against H0.
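The asymmetry of a CI for the number of species can be illustrated with one widely used construction, which treats the estimated number of unseen species as log-normal. This is an assumption for illustration; the slides do not say which interval is intended. The inputs are the apple orchard numbers quoted elsewhere in these slides (n = 1,187 observed OTUs, estimate 1,824, SE 122), and the function name is hypothetical.

```python
import math

def lognormal_ci(n_obs, c_hat, se, z=1.96):
    """Interval n + f0/d .. n + f0*d, where f0 = c_hat - n_obs is the unseen part."""
    f0_hat = c_hat - n_obs
    d = math.exp(z * math.sqrt(math.log1p((se / f0_hat) ** 2)))
    return n_obs + f0_hat / d, n_obs + f0_hat * d

lower, upper = lognormal_ci(1187, 1824, 122)
```

The lower limit can never fall below the observed count n, and the interval (roughly 1,626 to 2,111 here) extends farther above the estimate than below it, unlike estimate ± 2 SE.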

16

How to do statistical analysis
1. Talk to an expert. Do not reinvent the wheel! You will invent a square wheel.

Fact: Universities/institutions have statistics departments.

17

Statistics Departments

Random sample of n = 10 STAMPS participants’ home institutions, with links to statistics departments:

• University of California Davis: http://anson.ucdavis.edu/
• University of California San Francisco: http://www.epibiostat.ucsf.edu/biostat/
• Baylor College of Medicine: http://statistics.rice.edu/Content.aspx?id=1305
• University of Victoria: http://www.uvic.ca/science/math-statistics/research/statistics/index.php
• Harvard Medical School: http://statistics.fas.harvard.edu/
• University of Hawaii at Manoa: http://math.hawaii.edu/wordpress/
• Tufts University Sackler School of Biomedical Sciences: http://www.tuftsctsi.org/Services-and-Consultation/Biostatistics-Epidemiology-Research-Design.aspx?c=130833317201605859
• H. Lee Moffitt Cancer Center: https://www.moffitt.org/clinical-trials-research/research-science/academics/population-science/biostatistics-and-bioinformatics/
• University of Maryland: http://www.math.umd.edu/
• MIT/WHOI: http://statistics.mit.edu/

18

2. Statistical computer packages are tools: R, SAS, WinBUGS, etc. Programming ≠ statistical analysis.

3. Choose your methods.
  • Inferential vs. descriptive statistics
  • Frequentist vs. Bayesian
  • Parametric/nonparametric, unbiasedness, maximum likelihood, asymptotics, etc.

4. Try different things: don’t get stuck on one solution.

19

What is α-diversity?
• C := True but unknown total # of taxa in population (or community), observed + unobserved. Issues: definition of population, % OTU cutoff, etc.
• NOT observed # of taxa n
• “Indices” (Simpson, Shannon, etc.) not considered here.
  • Indices can be inferred from sample data (not computed from sample data)
• C is not an “index,” and n is not an estimate of C.

Review: “Estimating the number of species in microbial diversity studies,” Bunge/Willis/Walsh (2014) http://www.annualreviews.org/doi/full/10.1146/annurev-statistics-022513-115654

20

How to estimate C from an observed sample?
• Typically: given OTU frequency table
• Sufficient statistic (data summary): f1, f2, f3, …
  • C = f0 + f1 + f2 + f3 + …
  • n = f1 + f2 + f3 + …
  • f0 = missing data
  • f1 = # singletons
  • etc.
• Here: consider frequentist methods only.
• Two principal methods: Estimate C 1) based on counts fj; 2) based on ratios of successive counts fj+1/fj.
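The data summary described above can be sketched in a few lines. The abundance vector and the function name are hypothetical; each entry stands for the number of sequences assigned to one OTU in a sample.

```python
from collections import Counter

def frequency_counts(abundances):
    """Return {j: f_j}, the number of taxa observed exactly j times (j >= 1)."""
    counts = Counter(a for a in abundances if a > 0)   # zeros are unobserved taxa
    return dict(sorted(counts.items()))

otu_abundances = [5, 1, 1, 2, 0, 1, 3, 2, 1, 0]   # hypothetical per-OTU counts
f = frequency_counts(otu_abundances)               # {1: 4, 2: 2, 3: 1, 5: 1}
n = sum(f.values())                                # observed taxa: 8
reads = sum(j * fj for j, fj in f.items())         # total sequences: 16
```

Note that f0 never appears in the output: OTUs with zero count in the sample cannot be tabulated, which is exactly why C has to be estimated rather than computed.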

21

The “rank abundance curve” is not a graph and has no statistical application or interpretation

22

Frequency count data example

Singletons ≈ 2x doubletons – may be 10x!

11,338 sequences grouped into n = 1,187 OTUs

Apple orchard soil data from Walsh et al. (doi:10.3389/fmicb.2013.00255)

  j    fj         j   fj
  1   317       124    1
  2   179       128    1
  3   127       133    1
  4    77       134    1
  5    66       149    1
  6    61       159    1
  7    39       170    1
  8    42       184    1
  9    29       195    1
 10    24       208    1
 11    12       232    1
 12    27   …   262    1

23

High diversity typical of microbial data

Data acquisition / bioinformatic issues
• Spurious singletons?
• Correct at what stage? Statistical approach?

[Figure: Apple orchard data, original scale. x-axis: frequency; y-axis: count.]

[Figure: Apple orchard data, log scale. x-axis: frequency; y-axis: count.]

[Figure: Observed frequency counts with fitted models. Legend: Observed; Best: ThreeMixedExp, Tau 184; Other 1: ThreeMixedExp, Tau 118; Other 2: ThreeMixedExp, Tau 262; Other 3: TwoMixedExp, Tau 23. x-axis: Frequency; y-axis: Counts.]

24

CatchAll fitted models for apple orchard data

τ = 184

25

Essentials of method:
• Fit curve to frequency count graph
• Project curve upwards to left, to estimate (predict) f0: the “extrapolation step”

$\hat{f}_0 = \dfrac{n\,\hat{p}_0}{1 - \hat{p}_0}$

$\hat{C} = n + \hat{f}_0 = \dfrac{n}{1 - \hat{p}_0}$

Questions:
• Which curve?
• How to fit?
• How to obtain SE, CI, goodness-of-fit assessment?

Soil example: $\hat{f}_0$ = 637, $\hat{C}$ = 1,824 (SE 122)
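As a check on the arithmetic, the two formulas on this slide can be applied to the soil numbers, taking n = 1,187 observed OTUs from the frequency count example. The implied value of the estimated miss probability is not stated in the slides; it is derived here from the quoted estimates.

```latex
\begin{align*}
\hat{f}_0 &= \hat{C} - n = 1824 - 1187 = 637, \\
\hat{p}_0 &= \frac{\hat{f}_0}{n + \hat{f}_0} = \frac{637}{1824} \approx 0.349, \\
\frac{n\,\hat{p}_0}{1 - \hat{p}_0} &= \frac{1187 \times 637/1824}{1187/1824} = 637, \\
\frac{n}{1 - \hat{p}_0} &= \frac{1187}{1187/1824} = 1824 .
\end{align*}
```

In words: under the fitted model roughly a third of the taxa in the community are estimated to be missing from the sample.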

26

The “rarefaction curve” is not a statistical analysis and cannot estimate total taxonomic richness

27

Some statistical theory

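MIXED POISSON MODEL
• C classes/taxa/species in population. Each species independently contributes Poisson-distributed # of representatives to the sample.
• Counts ~ zero-truncated mixed Poisson.

[Diagram: each species feeds into the sample, with $X_1 \sim \mathrm{Poisson}(\lambda_1)$, $X_2 \sim \mathrm{Poisson}(\lambda_2)$, $X_3 \sim \mathrm{Poisson}(\lambda_3)$, …, $X_C \sim \mathrm{Poisson}(\lambda_C)$.]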

28

The mixed-Poisson model

Species (taxon) i contributes a Poisson-distributed number Xi of replicates to the sample – i.e., taxon i appears in the sample Xi times.

Units appear independently in the sample

Fundamental problem: heterogeneity, i.e., unequal Poisson means λi

• Standard approach: model the λi's as i.i.d. replicates from some mixing distribution F

• The species counts Xi are then marginally i.i.d. F-mixed Poisson random variables

• Zero-truncated since zero counts Xi are unobservable
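The model above can be simulated directly, which also shows the frequentist picture of an estimate varying over repeated samples. This sketch assumes an exponential mixing distribution F with mean 2 and C = 1,000 species; both choices, and all function names, are illustrative only. With exponential F the chance of missing a species is 1/(1 + mean), so about a third of the species go unseen in each sample.

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (fine for small means)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_sample(C, mean_intensity, rng):
    """One sample from a mixed-Poisson model with exponential mixing F."""
    lambdas = [rng.expovariate(1.0 / mean_intensity) for _ in range(C)]
    counts = [poisson(lam, rng) for lam in lambdas]
    return [x for x in counts if x > 0]      # zero counts are unobservable

rng = random.Random(1)
C = 1000                                     # true richness, known only in simulation
observed = [len(simulate_sample(C, 2.0, rng)) for _ in range(200)]
mean_n = sum(observed) / len(observed)       # average observed richness n
```

Every simulated n is at most C, and the average sits well below C, which is the sense in which the observed number of taxa is not an estimate of C.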

29

The mixed-Poisson model cont’d

Mixing distribution F, i.e., distribution of sampling intensities λ, is also called species abundance distribution
• Probably a misnomer
• Mathematical treatment (marginalization) implies that each species' contribution to the sample is independent and identically distributed
• Both assumptions are certainly wrong
• How to account for dependent or differently distributed species counts? Not in standard model.

30

Mixing distributions F

Parametric, low-dimensional parameter vector
• None ≡ point mass at λ ≡ all equal species sizes
• Gamma (Fisher, 1943)
• Lognormal
• Inverse Gaussian, generalized inverse Gaussian (Sichel)
• Pareto
• Log-t
• Stable

Finite mixture of exponentials - semiparametric

Basically:
• Fit mixed-Poisson model to frequency count data by maximum likelihood
• Can be computationally intensive (numerical search required)
• Yields estimates of all parameters plus SE, CI, and goodness-of-fit assessment.

Soil example: Model = mixture of 3 exponential distributions, excellent fit (p = 0.6), estimate 1,824 (SE 122).

Data-analytic issue: Upper frequency cutoff τ

Soil example: τ = 184 < 262 = fmax

Second approach: ratios of frequency counts

Transform data to ratios of successive counts fj+1/fj. Why?

Reason #1: Data regularity.
• Original idea: look at (j+1)fj+1/fj vs. j

  j    fj   (j+1)fj+1/fj
  1   317    1.13
  2   179    2.13
  3   127    2.43
  4    77    4.29
  5    66    5.55
  6    61    4.48
  7    39    8.62
  8    42    6.21
  9    29    8.28
 10    24    5.50
 11    12   27.00
 12    27    7.70
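The ratio column can be reproduced from the frequency counts. This sketch uses the first twelve apple orchard counts as given in these slides; the function name is hypothetical. The slide's last row (7.70) needs f13, which is not listed in the frequency table, so the code stops at j = 11.

```python
# First twelve frequency counts of the apple orchard data, as listed above.
f = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
     7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}

def scaled_ratios(freq_counts):
    """Return {j: (j+1) * f_{j+1} / f_j} wherever both counts are present."""
    return {j: (j + 1) * freq_counts[j + 1] / freq_counts[j]
            for j in sorted(freq_counts) if j + 1 in freq_counts}

ratios = scaled_ratios(f)
for j, r in ratios.items():
    print(f"{j:>3} {r:6.2f}")
```

Plotting these values against j gives the ratio plot; the roughly increasing trend is the regularity that the ratio approach exploits.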

[Figure: Ratio plot, apple orchard data. x-axis: j; y-axis: (j+1) f_(j+1)/f_j.]

Reason #2: Probability theory

Ratio function:

$r(j) := (j+1)\,\dfrac{f_{j+1}}{f_j}$

Katz (1945): The ratio function r(j) is linear in j if and only if the distribution is (i) Poisson, (ii) negative binomial, or (iii) binomial.

Vast extension due to Kemp (1968 & later):

$\dfrac{f_{j+1}}{f_j} = \dfrac{\beta_0 + \beta_1 j + \beta_2 j^2 + \cdots + \beta_p j^p}{\alpha_0 + \alpha_1 j + \alpha_2 j^2 + \cdots + \alpha_q j^q}$

Kemp & others characterized probability distributions for which the above equation holds.

Ratio approach not restricted to mixed Poisson
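The "if" direction of Katz's result can be checked by hand. The block below writes the ratio in terms of probabilities p_j (the frequency counts f_j are proportional to these in expectation). The parametrizations, with symbols λ, θ, k, N and π, are chosen here for illustration and do not come from the slides.

```latex
\begin{align*}
\text{Poisson:}\quad p_j &= e^{-\lambda}\frac{\lambda^j}{j!}
  &\Longrightarrow\quad (j+1)\frac{p_{j+1}}{p_j} &= \lambda, \\
\text{Negative binomial:}\quad p_j &= \frac{\Gamma(j+k)}{j!\,\Gamma(k)}(1-\theta)^k\theta^j
  &\Longrightarrow\quad (j+1)\frac{p_{j+1}}{p_j} &= \theta k + \theta j, \\
\text{Binomial:}\quad p_j &= \binom{N}{j}\pi^j(1-\pi)^{N-j}
  &\Longrightarrow\quad (j+1)\frac{p_{j+1}}{p_j} &= \frac{\pi}{1-\pi}\,(N - j).
\end{align*}
```

The three cases correspond to zero, positive and negative slope, so the sign of the slope in a ratio plot already hints at which family is plausible.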

Willis, A. & Bunge, J. (2015). Estimating diversity via frequency ratios. Biometrics. DOI: 10.1111/biom.12332

Idea: Fit ratio-of-polynomials function to ratio plot fj+1/fj

Ratio plot of soil dataset due to Schuette et al. (2010)

Essentials of method:
• Project curve left-ward to j = 0 (y-axis): y-intercept is $\hat{\beta}_0$
• Then $\hat{f}_0 = f_1/\hat{\beta}_0$ and $\hat{C} = n + \hat{f}_0 = n + f_1/\hat{\beta}_0$.

Same questions arise:
• Which curves?
• How to fit?

Fit by nonlinear regression, not maximum likelihood.
• ML intractable here due to complexity or non-existence of likelihood
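The extrapolation step can be sketched with a deliberately simplified stand-in: an unweighted straight line fitted by ordinary least squares to the ratios of the first twelve apple orchard counts. This is not the method described in these slides, which fits a ratio of polynomials with weights and model selection; the helper names are hypothetical, and the resulting number should not be read as a serious estimate.

```python
# First twelve apple orchard frequency counts and observed richness, from the slides.
f = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
     7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}
n = 1187

def ols_line(xs, ys):
    """Ordinary least-squares fit y = b0 + b1 * x; returns (b0, b1)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return my - b1 * mx, b1

js = [j for j in sorted(f) if j + 1 in f]
ratios = [f[j + 1] / f[j] for j in js]      # unscaled ratios f_{j+1} / f_j

b0_hat, b1_hat = ols_line(js, ratios)       # b0_hat = fitted value at j = 0
f0_hat = f[1] / b0_hat                      # predicted number of unseen taxa
c_hat = n + f0_hat                          # C_hat = n + f1 / b0_hat
```

Because the intercept sits in a denominator, small changes in it move the richness estimate a great deal, which is one reason the fitting scheme and its weights matter so much in practice.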

36

Statistical issues

Nonlinear regression y = f(x) + ε

Heteroscedastic (changing variance)

Autocorrelated: f2/f1 is correlated with f3/f2, etc. (tridiagonal)

Collinear: parameter estimates of α’s and β’s highly correlated unless corrected

Nontrivial numerical challenges

Essentially: Complex iteratively reweighted least squares scheme required.

37

Statistical issues

Model selection: algorithm simultaneously fits, (re-)weights, and selects lowest-order acceptable model.

Standard errors: computed via asymptotic theory (delta method).

Confidence intervals are again asymmetric.

Asymptotic theory: Consistency and asymptotic normality verified in canonical cases.

Data-analytic issue: Current method can only use contiguous frequencies, so stops at first gap between frequencies. τ = last contiguous frequency.
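The cutoff rule in the last point is simple to state in code. The function name and the frequency counts below are hypothetical.

```python
def last_contiguous_frequency(freq_counts):
    """Largest tau such that f_1, f_2, ..., f_tau are all nonzero (0 if f_1 is missing)."""
    tau = 0
    while freq_counts.get(tau + 1, 0) > 0:
        tau += 1
    return tau

f = {1: 40, 2: 18, 3: 9, 4: 3, 6: 2, 9: 1}   # hypothetical: first gap is at j = 5
tau = last_contiguous_frequency(f)           # 4
```

Only the counts up to τ enter the ratio fit; the more abundant taxa beyond the gap still count toward n.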

38

Results from ratio-based analysis program breakaway
• Schuette soil data: 5008 (SE 689), 95% CI (2717, 216,157)

Compare results from ML finite-mixture-based analysis program CatchAll
• Schuette soil data: 4891 (SE 199.2), 95% CI (4534, 5317)

Uncertainty of low-frequency/high-diversity data:
• Ratio method can ignore (omit) singleton count f1.
• Results in lower estimate
• E.g. Schuette soil data: $\hat{C}$ = 3445 (SE 963)

39

NEVER throw away data when doing statistical inference

“Not even wrong” – Wolfgang Pauli
