John [email protected]
Department of Statistical Science
Cornell University
1
Statistical Inference for Biodiversity
2
Thanks to:
• Amy Willis
• David Mark Welch
• Kathryn Barger
• Colleagues too numerous to mention
3
Bioinformaticists
4
Statisticians
5
The fundamental idea of statistics
• Distinction between the observed, finite sample (data) and the underlying, true reality or population
• Population = target of inference
• Q: What is the population in microbial ecology?
• A? Carry out the operative procedure to ∞ effort
Inferential statistics addresses questions:
• What can we say about reality, based on an observed sample?
• What can we not say?
• With what degree of certainty?
6
Mathematics: Deductive
• All men are mortal
• Socrates is a man
• Therefore Socrates is mortal
Statistics: Inductive
• "Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." – Sherlock Holmes
Inferential statistics vs. exploratory data analysis & descriptive statistics
Given the distinction between sample & population, how to make inference from sample to population?
7
Statistics = guessing
• BUT: informed guessing
• Inference takes place in a framework = set of assumptions
• Major distinction: Bayesian vs. frequentist
Bayesian statistics
• Prior = state of belief before data collection
• Posterior = state of belief after data analysis
• Subjective vs. objective (noninformative) priors
• Objective prior requires theory of information & minimization of same
• E.g.: Estimating species richness C: objective prior = 1/√C, not “flat” (or anything else)
8
Frequentist statistics
• Reality is fixed (during the experiment)
• Sample is one of ∞ collection of possible samples
• Individual sample has meaning only as a representative of this collection
• This is the basic epistemology of Western culture
Notion of error:
• Sample varies relative to reality in repeated sampling
• Estimate varies relative to estimand in repeated sampling
• Bias, variance characterize this variation
• E.g.: $\hat{C}$ varies relative to C in repeated sampling
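Repeated sampling can be made concrete with a small simulation. The sketch below assumes a hypothetical community of C = 1000 equally abundant taxa, each detected independently with probability 1 − e^(−λ) (Poisson sampling with mean λ), and uses the observed count n as a naive estimator of C. The function names and parameter values are illustrative only.

```python
import math
import random
import statistics


def observed_richness(C, lam, rng):
    """One simulated sample: count the taxa that appear at least once.

    Under Poisson sampling with mean lam, a taxon is missed with
    probability exp(-lam), so it is observed with probability 1 - exp(-lam).
    """
    p_seen = 1.0 - math.exp(-lam)
    return sum(1 for _ in range(C) if rng.random() < p_seen)


def repeated_sampling(C=1000, lam=1.0, reps=500, seed=1):
    """Repeat the experiment and summarize how the naive estimate n varies."""
    rng = random.Random(seed)
    draws = [observed_richness(C, lam, rng) for _ in range(reps)]
    bias = statistics.mean(draws) - C      # negative: n underestimates C
    variance = statistics.variance(draws)  # spread across repeated samples
    return bias, variance


bias, variance = repeated_sampling()
print(f"bias of n as an estimator of C: {bias:.1f}, variance: {variance:.1f}")
```

Under these assumptions the bias is close to −C·e^(−λ), which is why n by itself is not treated as an estimate of C.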
9
Plato’s Republic, VII,7
Behold! human beings living in an underground den, which has a mouth open towards the light and reaching all along the den; here they have been from their childhood […] Above and behind them a fire is blazing at a distance, […] you will see, if you look, a low wall built along the way, like the screen which marionette players have in front of them, over which they show the puppets. […] They see only their own shadows, or the shadows of one another, which the fire throws on the opposite wall of the cave […] To them, I said, the truth would be literally nothing but the shadows of the images.
10
Old Testament
Ecclesiastes 1:15
What is crooked cannot be straightened; what is lacking cannot be counted.
New Testament
1 Corinthians 13:12
For now we see through a glass, darkly, but then face to face: now I know in part; but then shall I know even as also I am known.
11
Questions:
• How to construct an estimate from a sample?
• Given the form of an estimate (an estimator), how does it behave relative to the truth?
• What is its bias (if any)?
• What is its variance?
• What happens when the sample becomes very large (asymptotics)?
• What optimality criteria does it satisfy?
These are questions for mathematical analysis (not simulation)
But for analysis, need more framework (assumptions)
Optimality theory:
• Parametric/nonparametric, unbiasedness, maximum likelihood
12
Parametric vs. nonparametric
Parametric models: determined by a small # of parameters (numbers). Smooth, restrictive and (typically) unrealistic, but useful.
• E.g.: Negative binomial model for species abundances, 2 parameters. Almost never fits well.
Nonparametric models: cannot be determined by small # of parameters. More general and unrestricted, but less informative.
• E.g.: Nonparametric estimators of C have “arbitrarily bad informativity” (& essentially ∞ bias).
13
Unbiasedness
Unbiased estimates do not always exist.
E.g. Species problem: unbiased estimator exists only when size of population is known & sample ≥ ½ of population
E.g. General estimator for species richness
$\hat{C} = \frac{n}{1 - p_0}$
would be unbiased, except impossible:
• n = # of observed species, known, but
• p0 = P(missing a species), unknown
14
Maximum Likelihood
• Usually exists in parametric or nonparametric case.
• Based on maximizing likelihood function
• Requires numerical optimization (search) – may be complex
• Typically biased in small samples but extensive asymptotic optimality including unbiasedness
• Standard errors & confidence intervals follow from asymptotics
E.g.: MLE
$\hat{C} = \frac{n}{1 - \hat{p}_0}$
Empirical version of unbiased estimator. Estimate p0 based on sample counts.
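As a minimal sketch of the plug-in idea, the code below estimates p0 under the simplest possible model, a homogeneous zero-truncated Poisson (every species has the same mean λ), and then forms n / (1 − p̂0). The model choice, the function names, and the toy frequency counts are assumptions made for illustration; they are not taken from the talk, and real microbial data are rarely this homogeneous.

```python
import math


def truncated_poisson_mle(freq_counts, tol=1e-10):
    """MLE of lam for a zero-truncated Poisson, from {j: f_j}.

    The likelihood equation is lam / (1 - exp(-lam)) = mean observed count,
    solved here by bisection.
    """
    n = sum(freq_counts.values())                       # observed taxa
    total = sum(j * f for j, f in freq_counts.items())  # individuals
    mean = total / n
    if mean <= 1.0:
        raise ValueError("all singletons: MLE is at the boundary lam = 0")
    lo, hi = 1e-12, mean  # the root lies in (0, mean)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid / (1.0 - math.exp(-mid)) < mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def richness_estimate(freq_counts):
    """Plug-in estimate C_hat = n / (1 - p0_hat) with p0_hat = exp(-lam_hat)."""
    n = sum(freq_counts.values())
    lam_hat = truncated_poisson_mle(freq_counts)
    p0_hat = math.exp(-lam_hat)
    return n / (1.0 - p0_hat), p0_hat, lam_hat


toy_counts = {1: 30, 2: 15, 3: 5}  # hypothetical f_1, f_2, f_3
C_hat, p0_hat, lam_hat = richness_estimate(toy_counts)
print(f"lam_hat={lam_hat:.3f}, p0_hat={p0_hat:.3f}, C_hat={C_hat:.1f}")
```

The numerical search here is one-dimensional; richer models need a search over several parameters, which is where the computational cost mentioned above comes from.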
15
Some considerations
Confidence intervals. Often associated with “error bars.”
• Considerable logical jump from standard error (SE) (of estimate) to confidence interval. Different theoretical bases. CI not necessarily = estimate ± 2 SE.
• E.g. CI for # of species is asymmetric
Hypothesis testing.
• Null and alternative hypotheses, H0 and HA, are statements about underlying reality, NOT about data or samples.
• p-value = strength of evidence against H0.
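One common way an asymmetric interval for species richness arises is to build the interval on the log scale of the unobserved part, f̂0 = Ĉ − n, so that the lower limit can never fall below the observed count n. The sketch below uses that log-normal construction. It is an assumption that this matches any particular program's interval, and the input values (n = 1,187, Ĉ = 1,824, SE 122) are the ones quoted for the apple orchard soil example elsewhere in this talk.

```python
import math


def lognormal_ci(n, c_hat, se, z=1.96):
    """Asymmetric interval for C built on the log scale of f0_hat = c_hat - n.

    Assumption: the uncertainty in c_hat is treated as uncertainty in the
    unobserved part f0_hat, which is taken to be approximately log-normal.
    """
    f0_hat = c_hat - n
    if f0_hat <= 0:
        raise ValueError("c_hat must exceed the observed count n")
    d = math.exp(z * math.sqrt(math.log(1.0 + (se / f0_hat) ** 2)))
    return n + f0_hat / d, n + f0_hat * d


lower, upper = lognormal_ci(n=1187, c_hat=1824, se=122)
print(f"approx. 95% interval: ({lower:.0f}, {upper:.0f})")
```

The upper arm is longer than the lower arm, which is the asymmetry referred to above; estimate ± 2 SE would instead give a symmetric interval.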
16
How to do statistical analysis
1. Talk to an expert. Do not reinvent the wheel! You will invent a square wheel.
Fact: Universities/institutions have statistics departments.
17
Statistics Departments
• University of California Davis
• University of California San Francisco
• Baylor College of Medicine
• University of Victoria
• Harvard Medical School
• University of Hawaii at Manoa
• Tufts University Sackler School of Biomedical Sciences
• H. Lee Moffitt Cancer Center
• University of Maryland
• MIT/WHOI
Random sample of n = 10 STAMPS participants’ home institutions, with links to statistics departments
18
2. Statistical computer packages are tools: R, SAS, WinBUGS, etc. Programming ≠ statistical analysis.
3. Choose your methods.
• Inferential vs. descriptive statistics
• Frequentist vs. Bayesian
• Parametric/nonparametric, unbiasedness, maximum likelihood, asymptotics, etc.
4. Try different things: don’t get stuck on one solution.
19
What is α-diversity?
C := True but unknown total # of taxa in population (or community), observed + unobserved.
• Issues: definition of population, % OTU cutoff, etc.
• NOT observed # of taxa n
“Indices” (Simpson, Shannon, etc.) not considered here.
• Indices can be inferred from sample data (not computed from sample data)
C is not an “index,” and n is not an estimate of C.
Review: “Estimating the number of species in microbial diversity studies,” Bunge/Willis/Walsh (2014) http://www.annualreviews.org/doi/full/10.1146/annurev-statistics-022513-115654
20
How to estimate C from an observed sample?
Typically: given OTU frequency table
Sufficient statistic (data summary): f1, f2, f3, …
• C = f0 + f1 + f2 + f3 + …
• n = f1 + f2 + f3 + …
• f0 = missing data
• f1 = # singletons
• etc.
Here: consider frequentist methods only.
Two principal methods: Estimate C
1) based on counts fj;
2) based on ratios of successive counts fj+1/fj.
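The data summary above can be computed directly from a vector of per-OTU abundances. The sketch below uses a made-up OTU table (the OTU names and counts are hypothetical) to show how f1, f2, … and n are obtained; f0 is by definition not computable from the sample.

```python
from collections import Counter


def frequency_counts(otu_abundances):
    """Return {j: f_j}, where f_j = number of OTUs observed exactly j times."""
    return dict(sorted(Counter(a for a in otu_abundances if a > 0).items()))


# Hypothetical OTU table: abundance of each OTU in one sample.
otu_table = {"otu_a": 1, "otu_b": 1, "otu_c": 2, "otu_d": 5, "otu_e": 1, "otu_f": 0}

f = frequency_counts(otu_table.values())
n = sum(f.values())                               # observed taxa: f1 + f2 + ...
total_reads = sum(j * fj for j, fj in f.items())  # individuals in the sample
print(f, n, total_reads)
```

An OTU with abundance 0 in this sample contributes to the unknown f0, not to any observed fj, which is why it is dropped from the summary.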
21
The “rank abundance curve” is not a graph and has no statistical application or interpretation
22
Frequency count data example
Singletons ≈ 2x doubletons – may be 10x!
11,338 sequences grouped into n = 1,187 OTUs
Apple orchard soil data from Walsh et al. (doi:10.3389/fmicb.2013.00255)
 j    fj           j    fj
 1   317         124     1
 2   179         128     1
 3   127         133     1
 4    77         134     1
 5    66         149     1
 6    61         159     1
 7    39         170     1
 8    42         184     1
 9    29         195     1
10    24         208     1
11    12         232     1
12    27    …    262     1
23
High diversity typical of microbial data
Data acquisition / bioinformatic issues
Spurious singletons?
Correct at what stage? Statistical approach?
[Figure: Apple orchard data, original scale. x-axis: frequency; y-axis: count.]
[Figure: Apple orchard data, log scale. x-axis: frequency; y-axis: count.]
[Figure: Observed counts with fitted curves. Legend: Observed; Best (ThreeMixedExp, Tau 184); Other 1 (ThreeMixedExp, Tau 118); Other 2 (ThreeMixedExp, Tau 262); Other 3 (TwoMixedExp, Tau 23). x-axis: Frequency; y-axis: Counts.]
24
CatchAll fitted models for apple orchard data
τ = 184
25
Essentials of method:
• Fit curve to frequency count graph
• Project curve upwards to left, to estimate (predict) f0
• “Extrapolation step”
$\hat{f}_0 = \frac{n \hat{p}_0}{1 - \hat{p}_0}$
$\hat{C} = n + \hat{f}_0 = \frac{n}{1 - \hat{p}_0}$
Questions:
• Which curve?
• How to fit?
• How to obtain SE, CI, goodness-of-fit assessment?
Soil example: $\hat{f}_0$ = 637, $\hat{C}$ = 1,824 (SE 122)
The “rarefaction curve” is not a statistical analysis and cannot estimate total taxonomic richness
27
Some statistical theory
MIXED POISSON MODEL
• C classes/taxa/species in population. Each species independently contributes Poisson-distributed # of representatives to the sample.
• Counts ~ zero-truncated mixed Poisson.
$X_1 \sim \mathrm{Poisson}(\lambda_1)$, $X_2 \sim \mathrm{Poisson}(\lambda_2)$, $X_3 \sim \mathrm{Poisson}(\lambda_3)$, …, $X_C \sim \mathrm{Poisson}(\lambda_C)$ → sample
28
The mixed-Poisson model
Species (taxon) i contributes a Poisson-distributed number Xi of replicates to the sample – i.e., taxon i appears in the sample Xi times.
Units appear independently in the sample
Fundamental problem: heterogeneity, i.e., unequal Poisson means λi
• Standard approach: model λi's as i.i.d. replicates from some mixing distribution F
• Counts Xi are then marginally i.i.d. F-mixed Poisson random variables
• Zero-truncated since zero counts Xi are unobservable
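The model can be simulated in a few lines, which also shows where the missing count f0 comes from. This is a sketch under assumed settings: a gamma mixing distribution with hypothetical shape and scale, a hypothetical community size C = 2000, and a simple Poisson sampler written out by hand so the example needs only the standard library.

```python
import math
import random
from collections import Counter


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_frequency_counts(C=2000, shape=0.5, scale=4.0, seed=7):
    """Simulate one sample from a gamma-mixed Poisson community."""
    rng = random.Random(seed)
    lams = [rng.gammavariate(shape, scale) for _ in range(C)]  # heterogeneous means
    counts = [poisson_draw(lam, rng) for lam in lams]          # X_i for each taxon
    freq = Counter(x for x in counts if x > 0)                 # observable part
    f0 = sum(1 for x in counts if x == 0)                      # unobservable zeros
    return dict(sorted(freq.items())), f0


freq, f0 = simulate_frequency_counts()
n = sum(freq.values())
print(f"observed n = {n}, unobserved f0 = {f0}, singletons f1 = {freq.get(1, 0)}")
```

In a simulation f0 is known, so an estimator applied to `freq` alone can be checked against the truth; with real data only the zero-truncated counts are available.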
29
The mixed-Poisson model cont’d
Mixing distribution F, i.e., distribution of sampling intensities λ, is also called species abundance distribution
• Probably a misnomer
Mathematical treatment (marginalization) implies that each species' contribution to the sample is independent and identically distributed
• Both assumptions are certainly wrong
How to account for dependent or differently distributed species counts?
• Not in standard model.
30
Mixing distributions F
Parametric, low-dimensional parameter vector
• None ≡ point mass at λ ≡ all equal species sizes
• Gamma (Fisher, 1943)
• Lognormal
• Inverse Gaussian, generalized inverse Gaussian (Sichel)
• Pareto
• Log-t
• Stable
Finite mixture of exponentials - semiparametric
Basically:
• Fit mixed-Poisson model to frequency count data by maximum likelihood
• Can be computationally intensive (numerical search required)
• Yields estimates of all parameters plus SE, CI, and goodness-of-fit assessment.
Soil example: Model = mixture of 3 exponential distributions, excellent fit (p = 0.6), $\hat{C}$ = 1,824 (SE 122).
Data-analytic issue:
• Upper frequency cutoff τ
Soil example: τ = 184 < 262 = fmax
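The simplest member of the exponential-mixture family, a single exponential mixing distribution, is one of the few cases where maximum likelihood needs no numerical search, so it makes a compact illustration. The sketch below is not the three-component model reported for the soil example and it ignores the cutoff τ, so its output should not be expected to match the figure above. The function name is hypothetical; n and the sequence total are the values quoted for the apple orchard data.

```python
def geometric_richness(n, total):
    """Closed-form ML estimate of C under a single-exponential mixed Poisson.

    With lam ~ Exponential, counts are geometric: P(X = j) = p (1 - p)^j,
    so p0 = P(X = 0) = p. The zero-truncated likelihood p^n (1 - p)^(total - n)
    is maximized at p_hat = n / total, giving C_hat = n / (1 - p_hat).
    """
    if total <= n:
        raise ValueError("need more individuals than observed taxa")
    p0_hat = n / total
    return n / (1.0 - p0_hat), p0_hat


# n and total are the OTU and sequence counts quoted for the apple orchard data.
C_hat, p0_hat = geometric_richness(n=1187, total=11338)
print(f"p0_hat = {p0_hat:.4f}, C_hat = {C_hat:.0f}")
```

Adding mixture components lets the fitted curve bend more sharply near j = 1, which changes the extrapolated f0; that flexibility is what the numerical search pays for.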
Second approach: ratios of frequency counts
Transform data to ratios of successive counts fj+1/fj. Why?
Reason #1: Data regularity.
Original idea: look at (j+1)fj+1/fj vs. j

 j    fj    (j+1)fj+1/fj
 1   317     1.13
 2   179     2.13
 3   127     2.43
 4    77     4.29
 5    66     5.55
 6    61     4.48
 7    39     8.62
 8    42     6.21
 9    29     8.28
10    24     5.50
11    12    27.00
12    27     7.70
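The ratio column can be reproduced from the listed counts. The sketch below uses f1 through f12 as given for the apple orchard data; the function name is hypothetical, and the ratio for j = 12 is omitted because f13 is not listed on the slide.

```python
# Frequency counts f_1..f_12 as listed for the apple orchard data.
f = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
     7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}


def katz_ratios(freq):
    """Return {j: (j + 1) * f_{j+1} / f_j} for every j with both counts present."""
    return {j: (j + 1) * freq[j + 1] / freq[j]
            for j in sorted(freq) if j + 1 in freq and freq[j] > 0}


ratios = katz_ratios(f)
for j, r in ratios.items():
    print(j, round(r, 2))
```

A roughly straight-line trend in these values against j is the "data regularity" that motivates working with ratios.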
[Figure: Ratio plot - apple orchard data. x-axis: j; y-axis: (j+1)f_(j+1)/f_j.]
Reason #2: Probability theory

$r(j) := \frac{(j+1)\, f_{j+1}}{f_j}$

Katz (1945): Above ratio function is linear if and only if distribution is (i) Poisson, (ii) negative binomial, or (iii) binomial.

Vast extension due to Kemp (1968 & later):

$\frac{f_{j+1}}{f_j} = \frac{\beta_0 + \beta_1 j + \beta_2 j^2 + \cdots + \beta_p j^p}{\alpha_0 + \alpha_1 j + \alpha_2 j^2 + \cdots + \alpha_q j^q}$

Kemp & others characterized probability distributions for which above equation holds.

Ratio approach not restricted to mixed Poisson
Willis, A. & Bunge, J. (2015). Estimating diversity via frequency ratios. Biometrics. DOI: 10.1111/biom.12332
Idea: Fit ratio-of-polynomials function to ratio plot fj+1/fj
Ratio plot of soil dataset due to Schuette et al. (2010)
Essentials of method:
Project curve leftward to j = 0 (y-axis): y-intercept is $\hat{\beta}_0$, an estimate of f1/f0.

Then $\hat{f}_0 = \frac{f_1}{\hat{\beta}_0}$ and $\hat{C} = n + \hat{f}_0 = n + \frac{f_1}{\hat{\beta}_0}$.

Same questions arise:
• Which curves?
• How to fit?

Fit by nonlinear regression, not maximum likelihood.
• ML intractable here due to complexity or non-existence of likelihood
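The extrapolation step can be sketched with the crudest possible curve: an ordinary, unweighted straight line through the ratios fj+1/fj, projected back to j = 0. This is a deliberate simplification for illustration. It is not the weighted nonlinear ratio-of-polynomials fit described in this talk, and the function names are hypothetical. The counts and n = 1,187 are those listed for the apple orchard data.

```python
def fit_line(xs, ys):
    """Ordinary (unweighted) least-squares line; returns (intercept, slope)."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def ratio_richness(freq, n):
    """Extrapolate f_{j+1}/f_j to j = 0 and return (C_hat, f0_hat, intercept)."""
    js = [j for j in sorted(freq) if j + 1 in freq]
    ratios = [freq[j + 1] / freq[j] for j in js]
    intercept, _ = fit_line(js, ratios)
    if intercept <= 0:
        raise ValueError("non-positive intercept: cannot extrapolate f0")
    f0_hat = freq[1] / intercept  # intercept estimates f1 / f0
    return n + f0_hat, f0_hat, intercept


# f_1..f_12 and n = 1,187 as listed for the apple orchard data in this talk.
f = {1: 317, 2: 179, 3: 127, 4: 77, 5: 66, 6: 61,
     7: 39, 8: 42, 9: 29, 10: 24, 11: 12, 12: 27}
C_hat, f0_hat, b0 = ratio_richness(f, n=1187)
print(f"intercept = {b0:.3f}, f0_hat = {f0_hat:.0f}, C_hat = {C_hat:.0f}")
```

Because the high-j ratios are noisy (for example f12/f11 = 27/12), an unweighted fit lets a few unstable points pull the intercept around, which is one motivation for a weighted scheme.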
36
Statistical issues
Nonlinear regression y = f(x) + ε
Heteroscedastic (changing variance)
Autocorrelated: f2/f1 is correlated with f3/f2, etc. (tridiagonal)
Collinear: parameter estimates of α’s and β’s highly correlated unless corrected
Nontrivial numerical challenges
Essentially: Complex iteratively reweighted least squares scheme required.
37
Statistical issues
Model selection: algorithm simultaneously fits, (re-)weights, and selects lowest-order acceptable model.
Standard errors: computed via asymptotic theory (delta method).
Confidence intervals are again asymmetric.
Asymptotic theory: Consistency and asymptotic normality verified in canonical cases.
Data-analytic issue: Current method can only use contiguous frequencies, so stops at first gap between frequencies. τ = last contiguous frequency.
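As a rough illustration of the delta-method step, consider only the extrapolated term $\hat{f}_0 = f_1/\hat{\beta}_0$, where $\hat{\beta}_0$ denotes the fitted y-intercept of the ratio plot, and treat f1 as fixed. That is a simplifying assumption made here; a full treatment also has to account for the randomness of the counts and for the weighting. A first-order Taylor expansion of $g(\beta) = f_1/\beta$ around $\hat{\beta}_0$ then gives:

```latex
\operatorname{Var}\!\left(\hat{f}_0\right)
  \approx \left[g'(\hat{\beta}_0)\right]^2 \operatorname{Var}(\hat{\beta}_0)
  = \frac{f_1^{2}}{\hat{\beta}_0^{4}}\,\operatorname{Var}(\hat{\beta}_0),
\qquad
\operatorname{SE}\!\left(\hat{f}_0\right)
  \approx \frac{f_1}{\hat{\beta}_0^{2}}\,\operatorname{SE}(\hat{\beta}_0).
```

Under this approximation a small intercept inflates the standard error quickly, since $\hat{\beta}_0$ enters squared in the denominator.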
38
Results from ratio-based analysis program breakaway
• Schuette soil data: $\hat{C}$ = 5008 (SE 689), 95% CI (2717, 216,157)
Compare results from ML finite-mixture-based analysis program CatchAll
• Schuette soil data: $\hat{C}$ = 4891 (SE 199.2), 95% CI (4534, 5317)
Uncertainty of low-frequency/high-diversity data:
• Ratio method can ignore (omit) singleton count f1.
• Results in lower estimate
• E.g. Schuette soil data: $\hat{C}$ = 3445 (SE 963)
39
NEVER throw away data when doing statistical inference
“Not even wrong” – Wolfgang Pauli