Using model-based statistical inference to learn about evolution


Frederick “Erick” Matsen
http://matsen.fredhutch.org/
@ematsen

My group develops mathematical and computational tools for model-based statistical inference on continuous and discrete mathematical objects motivated by evolutionary sequence analysis of microbes and the immune system.

What is model-based statistical inference?

Modern technology gives us the ability to observe in great detail.

But very detailed observation is not the same as understanding.

To understand we need to simplify and abstract.

What abstractions do we have at our disposal?

3

$x$

$x$ is useful and we love it dearly!

$x$ allows us to describe knowledge in an implicit way: if $f(x) = y$, then we can work towards solving for $x$.

Alternatively, one might be interested in taking the average of $f(x)$ between two values $a$ and $b$.

Define $\int_a^b f(x)\,dx$ as the area under $f$ between $a$ and $b$.

Then $\frac{1}{b - a} \int_a^b f(x)\,dx$ is the average of $f$ on $(a, b)$.
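The average defined above can be approximated numerically. The sketch below is illustrative only: the function name `average`, the midpoint rule, and the choice of `sin` on `(0, pi)` are assumptions made for the example, not part of the talk.

```python
import math

def average(f, a, b, n=100_000):
    """Approximate 1/(b - a) * integral_a^b f(x) dx with the midpoint rule."""
    width = (b - a) / n
    # Area under f: sum of f at each sub-interval midpoint times the width.
    area = sum(f(a + (i + 0.5) * width) for i in range(n)) * width
    return area / (b - a)

# Example: the average of sin on (0, pi); the exact value is 2 / pi.
avg_sin = average(math.sin, 0.0, math.pi)
```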

Variables allow us to solve

[Figure: cannon and castle, with unknown $x$ and observed $y$.]

Problem 1: given $y$, solve for $x$.

Problem 2: predict if a 10% bigger charge will hit the castle. Say the answer to this is $\text{hit}_{10}(x)$, such that $\text{hit}_{10}(x)$ is 1 if that will make the cannonball hit the castle, and 0 otherwise.

… in a deterministic framework.
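A minimal sketch of the two deterministic problems. The talk does not give $f$, so everything concrete here is an assumption: an idealized projectile whose range is `charge * sin(2 * angle)` in arbitrary units, a castle occupying the hypothetical interval `[0.52, 0.58]`, and bisection as the solver.

```python
import math

def distance(angle, charge=1.0):
    # Hypothetical f: idealized projectile range, assumed proportional to charge.
    return charge * math.sin(2.0 * angle)

def solve_angle(y, charge=1.0, tol=1e-12):
    """Bisection for the low-trajectory angle x in [0, pi/4] with distance(x) = y."""
    lo, hi = 0.0, math.pi / 4.0  # distance is increasing on this interval
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if distance(mid, charge) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def hit10(x, castle_near=0.52, castle_far=0.58):
    """1 if a 10% bigger charge at angle x lands on the (made-up) castle, else 0."""
    landing = distance(x, charge=1.1)
    return 1 if castle_near <= landing <= castle_far else 0

x = solve_angle(0.5)  # Problem 1: given y = 0.5, solve for x
answer = hit10(x)     # Problem 2: does the bigger charge hit?
```

With one fixed $y$ there is one $x$ and one yes-or-no answer, which is exactly what stops working once the measurement is noisy.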

Life is a probabilistic process.

How do we abstract probabilistic quantities?

Random variables $X$ abstract variables

$X$ doesn’t have a fixed value: we have to “ask” it for a value.

Random variables are capricious, but they are well defined behind their stochastic exterior.

Random variable sampling determined by distributions

Sometimes discrete:

$P(\text{heads}) = 0.51$, $P(\text{tails}) = 0.49$

Sometimes continuous:
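As a sketch of what “asking” a random variable for a value looks like in both cases: the coin probabilities below are the ones on the slide, while the standard normal used for the continuous case, the function names, and the seed are assumptions chosen for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def ask_coin():
    """Discrete random variable: heads with probability 0.51, tails otherwise."""
    return "heads" if rng.random() < 0.51 else "tails"

def ask_continuous():
    """Continuous random variable; a standard normal is assumed here."""
    return rng.gauss(0.0, 1.0)

# Each call gives a different value, but long-run frequencies follow the distribution.
flips = [ask_coin() for _ in range(100_000)]
heads_fraction = flips.count("heads") / len(flips)

draws = [ask_continuous() for _ in range(100_000)]
mean_draw = sum(draws) / len(draws)
```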

Working with random variables $X$:

We can solve for $X$ in “equations” like $f(X) \sim Y$, obtaining expressions such as $P(X \mid Y)$; this is called inference.

We can also average with respect to $X$:

$\int f(X)\,dP(X \mid Y)$

where now we are averaging out with respect to a probability.

Probabilistic approach to prediction

[Figure: cannon and castle, with unknown $X$ and observed $Y$.]

$Y$: horizontal distance traveled by a cannonball (random variable)
$X$: cannon angle (inferred random variable)

Problem 1: given observed distribution $Y$, infer distribution of $X$.
Problem 2: find probability that a 10% bigger charge will hit castle.

1. Solve $f(X) = Y$ to get $P(X \mid Y)$.
2. Integrate $\int \text{hit}_{10}(X)\,dP(X \mid Y)$.
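The two steps can be sketched on a grid of candidate angles. None of the specifics come from the talk: the range formula `charge * sin(2 * angle)`, the uniform prior on `[0, pi/4]`, the Gaussian measurement noise `SIGMA`, and the castle interval are all assumptions for illustration.

```python
import math

SIGMA = 0.02  # assumed standard deviation of the distance measurement

def f(angle, charge=1.0):
    # Hypothetical model: idealized projectile range, proportional to charge.
    return charge * math.sin(2.0 * angle)

def posterior_on_grid(y_obs, n=2001):
    """Step 1: approximate P(X | Y = y_obs) on a grid, with a uniform prior on X."""
    grid = [i * (math.pi / 4.0) / (n - 1) for i in range(n)]
    # Gaussian likelihood of the observed distance at each candidate angle.
    weights = [math.exp(-0.5 * ((y_obs - f(x)) / SIGMA) ** 2) for x in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]

def hit10(x, castle_near=0.52, castle_far=0.58):
    """1 if a 10% bigger charge at angle x lands on the (made-up) castle, else 0."""
    return 1 if castle_near <= f(x, charge=1.1) <= castle_far else 0

grid, post = posterior_on_grid(0.5)
posterior_mean_angle = sum(x * p for x, p in zip(grid, post))

# Step 2: integrate hit10 against the posterior to get a hit probability.
p_hit = sum(hit10(x) * p for x, p in zip(grid, post))
```

The answer to Problem 2 is now a probability rather than a 0 or 1, because uncertainty about the angle has been carried through the calculation.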

Biological experiments are measurements with uncertainty

[Figure: unknown $X$ and observed $Y$, where the observations are sequences such as CATTCTTGTACG, GTTCGGCGAAGA, GCGTAAAATAGG, AGGGGTTGCATG, CTTCACTGGCAT. Labels: expression level of certain genes; risk.]

Model-based statistical inference ✓

We can solve for $X$ in “equations” like $f(X) \sim Y$, inferring an unknown distribution for $X$ (what can we learn about the angle of the cannon).

We can push uncertainty through an analysis using integrals like

$\int_a^b f(X)\,dP(X \mid Y)$.

(We don’t care what the angle of the cannon is really, we just want to know with what probability the shot is going to hit the castle!)

Now, what is model-based statistical inference on discrete mathematical objects?

Motivation: we would like to decide whether an individual has been superinfected, i.e. infected with a second viral variant in a separate event.

[Figure: two trees, labeled “single infection” and “superinfection”.]

Integrate out phylogenetic uncertainty

[Figure: unknown $X$ and observed sequences $Y$: CATTCTTGTACG, GTTCGGCGAAGA, GCGTAAAATAGG, AGGGGTTGCATG, CTTCACTGGCAT.]

To decide superinfection, we would like to calculate

$\int_S f(X)\,dP(X \mid Y)$

where $X$ is now a phylogenetic-tree-valued random variable.

Time to count your blessings.

■ Real numbers are equipped with a total order. ($3 < 4$)
■ Real numbers are equipped with a simply-computed distance that is compatible with the total order. ($|7 - 3| = 4$)
■ Real numbers form a continuum. ($2.9 < 2.95 < 3$)

We can thus define the integrals $\int_a^b f(x)\,dx$ and $\int_a^b f(X)\,dP(X \mid Y)$ in the real-valued setting.

Integrating over phylogenetic trees?

Phylogenetic trees have discrete topologies: there is no canonical distance between them, nor a natural total order.

But we still want to do inference and integration in this setting!
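To give a sense of how large this discrete setting is, the sketch below counts rooted binary tree topologies on $n$ labeled leaves using the standard formula $(2n-3)!!$. The function name is an assumption, and the count is offered as background rather than something stated in the talk.

```python
def count_rooted_topologies(n_leaves):
    """Number of rooted binary tree topologies on n labeled leaves: (2n - 3)!!."""
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_leaves - 2, 2):
        count *= k
    return count

# Six leaves already give 945 topologies; ten leaves give tens of millions.
six_leaf_count = count_rooted_topologies(6)
ten_leaf_count = count_rooted_topologies(10)
```

An “integral” over trees is therefore a sum over a set that is far too large to enumerate for realistic data sets, which is why sampling is used instead.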

[Figure: sequence data: ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ...]

Joint work with postdoc Chris Whidden.

Notion of proximity of trees?

Subtree-prune-regraft (rSPR) definition

[Figure: a tree on leaves 1 to 6; the subtree containing leaves 2 and 3 is pruned and regrafted elsewhere to give a second tree.]

These trees are then distance 1 apart.

Tree graph connected by rSPR moves

Tree inference bounces around graph

Probability is # of visits to nodes
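The idea that probability is estimated by visit counts can be sketched with a Metropolis-Hastings random walk on a toy graph. This is not the method of any particular phylogenetics package: the four-node graph, the node names, and the weights are made up, with nodes standing in for tree topologies and edges for single rSPR moves.

```python
import random

# Hypothetical toy graph: nodes stand in for trees, edges for single rSPR moves.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
# Made-up unnormalized posterior weights for each "tree".
weight = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}

def run_chain(steps, seed=0):
    """Random walk on the graph; returns how often each node was visited."""
    rng = random.Random(seed)
    current = "A"
    visits = {node: 0 for node in neighbors}
    for _ in range(steps):
        proposal = rng.choice(neighbors[current])
        # Metropolis-Hastings ratio; the degree terms correct for nodes
        # having different numbers of neighbors.
        ratio = (weight[proposal] / len(neighbors[proposal])) / (
            weight[current] / len(neighbors[current])
        )
        if rng.random() < ratio:
            current = proposal
        visits[current] += 1
    return visits

visits = run_chain(200_000)
total = sum(visits.values())
estimate = {node: count / total for node, count in visits.items()}
```

The visit fractions in `estimate` approach the normalized weights as the chain runs longer. On a real tree graph, how quickly that happens depends on how the high-probability nodes are connected.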

Subset to high probability nodes

[Figure: node size proportional to posterior probability; color shows distance to highest probability tree.]

The top 4096 trees for a data set

Graph effects matter

For more details:

Chris Whidden and FM. Quantifying MCMC exploration of phylogenetic tree space. Systematic Biology 2015.

… so what do we know about this graph?

Is the tree graph positively curved?

Is it flat?

Is it negatively curved?

[Figure: curvature versus SPR distance, for imbalanced and balanced trees.]

Model-based statistical inference on discrete and continuous mathematical objects ✓

When we perform inference on $f(X) \sim Y$, we can have $X$ be something continuous, discrete, or continuous and discrete.

Discreteness brings special challenges; graphs are helpful.

Next: use model-based statistical inference to learn about adaptive immunity

Joint with Trevor Bedford (VIDD), Connor McCoy (now at Google), Vladimir Minin (UW Statistics), and Duncan Ralph (postdoc).

Data from Harlan Robins (PHS/Adaptive).

Jenner’s 1796 vaccine

A revolutionary advance.

Where are we 200 years later?

[Cartoon speech bubble: “Just invented vaccines. I rock. LOL”]

Vaccine trials still take a long time and are very costly.

Vaccines manipulate the adaptive immune system

Current practice for trials:

1. Stimulate immune system
2. Battle-test immune system via pathogen exposure

What can we learn from antibody-making B cells without battle-testing?

Antibodies bind antigens

B cell diversification process

[Figure: V, D, and J genes undergo VDJ rearrangement, including erosion and non-templated insertion, giving a naive B cell; after encountering antigen, somatic hypermutation and affinity maturation give an experienced B cell.]

Overall goal: reconstruct process

[Figure: sequences ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ... with two panels labeled “reality” and “inference”.]

Why reconstruct B cell lineages?

1. Vaccine design (“This one is really good. How can we elicit it?”)
2. Vaccine assay
3. Evolutionary analysis to learn about underlying mechanisms

Goal 1: how are antibodies “drafted”?

[Figure: sequences ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ... labeled “reality”, grouped into rearrangement groups.]

“Solve” $f(X) \sim Y$, where

■ $f$ is a statistical model of recombination and maturation
■ $X$ are parameters of that model (including clusters)
■ $Y$ are antibody repertoire sequences

[Figure: the B cell diversification process, from VDJ rearrangement through somatic hypermutation and affinity maturation.]

VDJ annotation problem: from where did each nucleotide come?

[Figure: an annotated sequence. Biological process: 3’ V deletion, VD insertion, 5’ D deletion, 3’ D deletion, 5’ J deletion, DJ insertion, somatic hypermutation. Sequencing: sequencing primer, sequencing error. Stages shown: biological process, sequencing, inference.]

This is a key first step in BCR sequence analysis.

Rich probabilistic models work

[Figure: frequency (0.0 to 0.3) versus Hamming distance (0 to 15) for partis (k=5), partis (k=1), ighutil, iHMMunealign, igblast, and imgt; panel label HTTN.]

Integrate out annotation uncertainty for better clustering

Goal 2: how are antibodies “revised”?

Estimate per-residue level of natural selection on receptor sequences from healthy individuals.

$\omega = dN/dS$

■ Large $\omega$: diversifying sites
■ $\omega$ near 1: neutral sites
■ Small $\omega$: purifying sites

For selection:

■ CCA, CCT (Pro, Pro): synonymous
■ ACC, ATC (Thr, Ile): nonsynonymous

In antibodies:

■ AAC, AAG: more likely
■ GTG, GTC: less likely
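The synonymous versus nonsynonymous distinction on the slide can be checked against the standard genetic code. This is a small sketch: the helper name `classify` is an assumption, and it only labels a single codon change. It does not estimate $\omega$, which also requires a model of the neutral mutation process.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify(codon_before, codon_after):
    """Label a codon change by whether the encoded amino acid changes."""
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    return "synonymous" if before == after else "nonsynonymous"

# The two examples from the slide.
pro_pair = classify("CCA", "CCT")      # Pro and Pro
thr_ile_pair = classify("ACC", "ATC")  # Thr and Ile
```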

Solution: use “out-of-frame” sequences to determine neutral mutation rate.

[Figure: antibody structure showing antigen and light chain, with sites colored as purifying, neutral, or diversifying.]

Conclusion

■ We like to “solve equations” like $f(X) \sim Y$, where $X$ and $Y$ are random variables.
■ We especially like the case when $Y$ is sequence data and $X$ is something weird.
■ We can use these tools to learn about B cell receptor sequence evolution.

Next steps: phylogenetics

■ Understand the impact of data on curvature
■ Extend work to other models of tree space
■ Use understanding to design biased proposals that don’t get stuck
■ Implement phylogenetic algorithms that can update trees given more sequences
■ Continue building community with phyloseminar.org and phylobabble.org

Next steps: B cells

[Figure: sequences ACATGGCTC..., ATACGTTCC..., TTACGGTTC..., ATCCGGTAC..., ATACAGTCT..., ... with two panels labeled “reality” and “inference”.]

■ Learn more about the mutation process in B cell maturation to better reconstruct ancestral sequences; evolutionary dynamics
■ Etiology of Burkitt’s lymphoma
■ Origin of protective antibodies; optimization of vaccination strategies
■ Watching immune repertoires evolve through time

Wish I had time to talk about

■ Evolution of innate immunity & viral antagonists; origin of SIVcpz
■ Founder HIV sequence identification for sieve analysis
■ Human microbiome
■ Simian foamy virus variation; innate immune defense
■ HIV superinfection
■ Drug resistance mutations

Thank you to my group members

Thank you to the Fred Hutch community

■ Brilliant students, postdocs, and staff scientist collaborators
■ Computational biology program, esp. “scouts” and Marty
■ Fantastic admin support: Sara, Melissa, and Anissa
■ Fantastic computing support: esp. Dirk, Carl, Erik, and Michael
■ fredhutch.io supporters: Katie P, Dan G, and Garnet
■ Patience with my meddling: Larry, Myra, Jon C
