95
TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles Tandy Warnow Department of Computer Science The University of Illinois at Urbana-Champaign

TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP and SEPP: Metagenomic Analysis using Phylogeny-Aware Profiles

Tandy Warnow Department of Computer Science

The University of Illinois at Urbana-Champaign

Page 2: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Metagenomic taxonomic identification and phylogenetic profiling

Metagenomics, Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Page 3: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

1. What is this fragment? (Taxonomic classification)

2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)

3. What are the organisms in this metagenomic sample doing together?

Basic Questions

Page 4: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

1. What is this fragment? (Taxonomic classification)

2. What is the taxonomic distribution in the dataset? (Note: helpful to use marker genes.)

3. What are the organisms in this metagenomic sample doing together?

THIS TALK

Page 5: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

https://www.slideshare.net/EastBayWPMeetup/custom-post-types-and-custom-taxonomies

Page 6: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP and SEPP

TIPP (Bioinformatics 2014) performs

Taxonomic identification

Abundance profiling

SEPP (Pacific Symposium on Biocomputing 2012):

Adds reads into phylogenies (can be used for taxonomic identification)

Both available at https://github.com/smirarab/sepp

Page 7: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP for Taxonomic ID – output file

Page 8: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU
Page 9: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: The true distribution of the sample at the species-level is:

50% species A

20% species B

15% species C

14% species D

1% species E

Abundance Profiling

Page 10: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: distributions at the species-level

True Distribution Estimated Distribution

50% species A 42% species A

20% species B 18% species B

15% species C 11% species C

14% species D 10% species D

1% species E 0% species E

19% unclassified

Abundance Profiling

Page 11: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

We compared TIPP to

PhymmBL (Brady & Salzberg, Nature Methods 2009)

NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)

MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland

MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard

mOTU (Bork et al., Nature Methods 2013)

MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).

Marker gene are single-copy, universal, and resistant to horizontal transmission.

Testing TIPP in Bioinformatics 2014

Page 12: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

High indel datasets containing known genomes

Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

Page 13: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

“Novel” genome datasets

Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

Page 14: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP vs. other abundance profilers

TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.

The other tested methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).

Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

Page 15: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

This talk

Page 16: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

https://www.slideshare.net/EastBayWPMeetup/custom-post-types-and-custom-taxonomies

Page 17: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phylogenies and Taxonomies

Taxonomies:

Rooted, labels at every node for each taxonomic level

More or less based on phylogenies

Phylogenies:

Estimated from sequences (usually)

Branch lengths reflect amount of change

Edges/nodes sometimes given with support (typically bootstrap)

Phylogenies are usually unrooted (because root can be difficult to infer)

Page 18: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

https://www.researchgate.net/Rooted-neighbor-joining-16S-rRNA-gene-based-phylogenetic-tree-of-uncultured-bacteria-The_fig1_279565247 [accessed 26 Jul, 2018]

Rooted neighbor-joining 16S rRNA gene-based phylogenetic tree of uncultured bacteria.

Page 19: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

How are Phylogenies Estimated?

Input: Sequences (DNA, RNA, or Aminoacid), unaligned

Output: Tree on the sequences

Notes:

Two steps: first align, then compute a tree on the alignment

Many different techniques for each step

Page 20: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

DNA Sequence Evolution

AAGACTT

TGGACTTAAGGCCT

-3 mil yrs

-2 mil yrs

-1 mil yrs

today

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

TAGCCCA TAGACTT AGCGCTTAGCACAAAGGGCAT

AGGGCAT TAGCCCT AGCACTT

AAGACTT

TGGACTTAAGGCCT

AGGGCAT TAGCCCT AGCACTT

AAGGCCT TGGACTT

AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Page 21: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

…ACGGTGCAGTTACCA…

MutationDeletion

…ACCAGTCACCA…

Indels (insertions and deletions)

Page 22: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

…ACGGTGCAGTTACC-A… …AC----CAGTCACCTA…

The true multiple alignment Reflects historical substitution, insertion, and deletion events Defined using transitive closure of pairwise alignments computed on edges of the true tree

…ACGGTGCAGTTACCA…

SubstitutionDeletion

…ACCAGTCACCTA…

Insertion

Page 23: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

AGAT TAGACTT TGCACAA TGCGCTTAGGGCATGA

U V W X Y

U

V W

X

Y

Phylogeny Estimation

Page 24: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Input: unaligned sequences

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Page 25: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phase 1: Alignment

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

Page 26: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phase 2: Construct tree

S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA

S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA

S1

S4

S2

S3

Page 27: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Multiple Sequence Alignment

Figure from Braun et al., BMC Evol. Biol., 2011. “Homoplastic Microinversions and the Avian Tree of Life”

Page 28: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Figure from Braun et al., BMC Evol. Biol., 2011. “Homoplastic Microinversions and the Avian Tree of Life”

Page 29: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phylogeny estimation

First compute a multiple sequence alignment Computationally challenging, accuracy on large divergent datasets can be low Best current methods for moderate-sized DNA datasets: MAFFT Best current methods for large datasets: PASTA and UPP

Then compute a tree Typically using maximum likelihood heuristics (e.g., RAxML, IQTree, PhyML, and FastTree) Computationally challenging, accuracy on large divergent datasets can be low Produces trees with branch lengths and other numeric model parameters

Page 30: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

This talk

Page 31: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phylogenetic Placement

Page 32: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

https://www.researchgate.net/Rooted-neighbor-joining-16S-rRNA-gene-based-phylogenetic-tree-of-uncultured-bacteria-The_fig1_279565247 [accessed 26 Jul, 2018]

Rooted neighbor-joining 16S rRNA gene-based phylogenetic tree of uncultured bacteria.

Page 33: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Marker-based Taxon Identification

ACT..TAGA..AAGC...ACATAGA...CTTTAGC...CCAAGG...GCAT

ACCG CGAG CGG GGCT TAGA GGGGG TCGAG GGCG GGG •. •. •. ACCT

Fragmentary sequences from some gene

Full-length sequences for same gene, and an alignment and a tree

Page 34: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Input

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

Page 35: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Align Sequence

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

Page 36: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence

S1

S4

S2

S3Q1

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

Page 37: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phylogenetic Placement in 2011

Align each query sequence to backbone alignment HMMER (Finn et al., NAR 2011) PaPaRa (Berger and Stamatakis, Bioinformatics 2011)

Place each query sequence into backbone tree pplacer (Matsen et al., BMC Bioinformatics, 2011) EPA (Berger and Stamatakis, Systematic Biology 2011)

Note: pplacer and EPA solve same problem (maximum likelihood placement under standard sequence evolution models)

Page 38: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

HMMER vs. PaPaRa Alignments

Increasing rate of evolution

0.0

Page 39: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Phylogenetic Placement in 2011

Page 40: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Input

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

Page 41: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Align Sequence using HMMER

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

Page 42: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1

S4

S2

S3Q1

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

Page 43: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

What is HMMER+pplacer?

HMMER (Finn et al., NAR 2011) is used to add the read s into the backbone alignment, thus producing an “extended alignment”. HMMER is based on profile Hidden Markov Models (profile HMMs).

pplacer (Matsen et al. BMC Bioinformatics 2010) is used to add read s into the best location in the tree T. pplacer is based on phylogenetic sequence evolution models (e.g., GTR), and uses maximum likelihood.

Page 44: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU
Page 45: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

A general topology for a profile HMM

Page 46: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Profile Hidden Markov Models

Profile HMMs are probabilistic generative models to represent multiple sequence alignments.

HMMER software suite can

Build a profile HMM given a multiple sequence alignment A

Use the profile HMM to add a sequence s into A, and return the “probability” that the HMM generated s (the “score”)

Page 47: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

INPUT

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = TAAAAC

Page 48: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Align Sequence using HMMER

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

Page 49: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Align Sequence using HMMER

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. Build a profile HMM for the backbone alignment 2. Add Q1 to backbone alignment: Compute a

maximum likelihood path through the profile HMM for Q1 and use it to compute the extended alignment.

3. Record the score for the alignment!

Page 50: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

What is pplacer?

pplacer: software developed by Erick Matsen and colleagues. See http://matsen.fhcrc.org/pplacer/ Input: read s, alignment A (on S and s), tree on S Output:

“Best” location to add s in T (under maximum likelihood).

For every edge e in T, the value p(e) for the probability for s being placed on e (these probabilities add up to 1)

Page 51: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1

S4

S2

S3

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)

2. Return Te that has the best ML score.

Page 52: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)

S1

S4

S2

S30.5

0.4

0.05

0.03

0.02

Page 53: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)

2. Find edge e* producing the best ML score

S1

S4

S2

S30.5

0.4

0.05

0.03

0.02

Page 54: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)

2. Find edge e* producing the best ML score

S1

S4

S2

S30.5

Page 55: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)

2. Add Q1 into edge e*

S1

S4

S2

S30.5

Page 56: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Place Sequence using pplacer

S1 = -AGGCTATCACCTGACCTCCA-AA S2 = TAG-CTATCAC--GACCGC--GCA S3 = TAG-CT-------GACCGC--GCT S4 = TAC----TCAC--GACCGACAGCT Q1 = -------T-A--AAAC--------

1. For every edge e in T, let Te be the tree created by adding Q1 to that edge. Compute the maximum likelihood (ML) score of the tree Te for the extended alignment. (Use the ML scores to assign probabilities p(e) to all edges e!)

2. Add Q1 into edge e*

S1

S4

S2

S3Q1

Page 57: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

HMMER vs. PaPaRa Alignments

Increasing rate of evolution

0.0

Page 58: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

One Hidden Markov Model for the entire alignment?

HMM 1

Page 59: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

One HMM works beautifully for small-diameter trees

HMM!1!

Page 60: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

One HMM works poorly for large-diameter trees

HMM!1!

Page 61: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

One Hidden Markov Model for the entire alignment?

Page 62: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Or 2 HMMs?

HMM 1

HMM 2

Page 63: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

HMM 1

HMM 3 HMM 4

HMM 2

Or 4 HMMs?

Page 64: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

This talk

Page 65: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Ensemble of HMMs

Basic idea

Represent a multiple sequence alignment A on S with a collection of HMMs

Construct the eHMM by dividing S into subsets of taxa, and build an HMM on each subset (using HMMER)

SEPP (PSB 2012): SATé-enabled Phylogenetic Placement

TIPP (Bioinformatics 2014): Applications of the eHMM technique to (a) taxonomic identification and (b) metagenomic abundance classification

UPP (Genome Biology 2015): constructs ultra-large alignments

HIPPI (BMC Genomics 2016): protein family classification

All available at https://github.com/smirarab/sepp:

Page 66: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

SEPP Design

To insert query sequence Q1 into backbone tree T

Represent the backbone MSA with a collection of profile HMMs, based on maximum ”alignment subset size”

Score Q1 against every profile HMM in the collection

The best scoring HMM is used to compute the extended alignment

Use pplacer to add Q1 into tree T (restricted to ”subtree” based on maximum ”placement subset size”)

Page 67: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

One Hidden Markov Model for the entire alignment?

Page 68: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Or 2 HMMs?

HMM 1

HMM 2

Page 69: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

HMM 1

HMM 3 HMM 4

HMM 2

Or 4 HMMs?

Page 70: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

SEPP Parameter Exploration

Alignment subset size and placement subset size impact the accuracy, running time, and memory of SEPP:

Small alignment subset sizes best

Large placement subset size best

10% rule (both subset sizes 10% of backbone) had best overall performance

Page 71: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

SEPP (10%-rule) on simulated data

0.00.0

Increasing rate of evolution

Page 72: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

SEPP and eHMMs

An ensemble of HMMs provides a better model of a multiple sequence alignment than a single HMM, and is better able to

(a) detect homology between full length sequences and fragmentary sequences and (b) add fragmentary sequences into an existing alignment

especially when there are many indels and/or substitutions.

Page 73: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Metagenomic Taxon Identification

Objective: classify short reads in a metagenomic sample

Page 74: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

SEPP does Phylogenetic Placement

Page 75: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

SEPP can also be used for Taxon Identification

Page 76: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP can also be used for Taxon Identification

Page 77: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes

TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and

placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA (most recent common ancestor) of all selected placements for all selected extended alignments

Page 78: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP for Taxonomic ID – output file

Page 79: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: The distribution of the sample at the species-level is:

50% species A

20% species B

15% species C

14% species D

1% species E

Abundance Profiling

Page 80: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

For example: The distribution of the sample at the species-level is:

50% species A

20% species B

15% species C

14% species D

1% species E

TIPP can be used for Abundance Profiling

Page 81: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes

TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and

placement within the tree. Can consider more than one extended alignment Can consider more than optimal placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignment

Page 82: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes

TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and

placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA of all selected placements for all selected extended alignments

Page 83: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP (https://github.com/smirarab/sepp)

TIPP (Nguyen, Mirarab, Liu, Pop, and Warnow, Bioinformatics 2014), marker-based method that only characterizes those reads that map to the Metaphyler’s marker genes

TIPP pipeline 1. Uses BLAST to assign reads to marker genes 2. Computes UPP/PASTA reference alignments 3. Uses reference taxonomies, refined to binary trees using reference alignment 4. Modifies SEPP by considering statistical uncertainty in the extended alignment and

placement within the tree. Can consider more than one extended alignment Can consider more than one placement in the tree for each extended alignment Assign taxonomic label based on MRCA (most recent common ancestor) of all selected placements for all selected extended alignments

Page 84: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP Design (Step 4)Input: marker gene reference alignment (computed using PASTA), species taxonomy, alignment support threshold (default 95%) and placement support threshold (default 95%)

For each marker gene and its associated bin of reads:

Builds eHMM to represent the MSA

For each read:

(1) Use the eHMM to produce a set of extended MSAs that include the read, sufficient to reach the specified alignment support threshold.

(2) For each extended MSA, use pplacer to place the read into the taxonomy (optimizing maximum likelihood) and identify the clade in the tree with sufficiently high likelihood to meet the specified placement support threshold. Note: This gives the taxonomic information — but may not classify the read if the threshold is high, because that will move the read towards the root! Taxonomically characterize each read at the MRCA of these clades.

Page 85: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Objective: Distribution of the species (or genera, or families, etc.) within the sample.

We compared TIPP to

PhymmBL (Brady & Salzberg, Nature Methods 2009)

NBC (Rosen, Reichenberger, and Rosenfeld, Bioinformatics 2011)

MetaPhyler (Liu et al., BMC Genomics 2011), from the Pop lab at the University of Maryland

MetaPhlAn (Segata et al., Nature Methods 2012), from the Huttenhower Lab at Harvard

mOTU (Bork et al., Nature Methods 2013)

MetaPhyler, MetaPhlAn, and mOTU are marker-based techniques (but use different marker genes).

Marker gene are single-copy, universal, and resistant to horizontal transmission.

Abundance Profiling

Page 86: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

High indel datasets containing known genomes

Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets, and mOTU terminates with an error message on all the high indel datasets.

Page 87: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

“Novel” genome datasets

Note: mOTU terminates with an error message on the long fragment datasets and high indel datasets.

Page 88: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP vs. other abundance profilers

TIPP is highly accurate, even in the presence of high indel rates and novel genomes, and for both short and long reads.

All other methods have some vulnerability (e.g., mOTU is only accurate for short reads and is impacted by high indel rates).

Improved accuracy is due to the use of eHMMs; single HMMs do not provide the same advantages, especially in the presence of high indel rates.

Page 89: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Using SEPP

SEPP algorithmic parameters: Alignment subset size (how many sequences for each profile HMM in the ensemble?) Placement subset size (how much of the tree to search for optimal placement?)

Default settings are acceptable, but you can improve accuracy (but increase running time) by:

increasing placement subset size and decreasing alignment subset size

Page 90: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Using TIPP

TIPP algorithmic parameters (other than SEPP parameters) Reference markers, alignments, and refined taxonomy Alignment threshold (default 95%) Placement threshold (default 95%)

Note: The default alignment and placement thresholds were optimized for abundance profiling, not for Taxon ID. Reducing the placement threshold will increase probability of taxonomic classification at the species level (but could also increase the false positive rate)

Page 91: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Our Publications using eHMMsS. Mirarab, N. Nguyen, and T. Warnow. "SEPP: SATé-Enabled Phylogenetic Placement." Proceedings of the 2012 Pacific Symposium on Biocomputing (PSB 2012) 17:247-258. N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow "TIPP:Taxonomic Identification and Phylogenetic Profiling." Bioinformatics (2014) 30(24):3548-3555. N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow, "Ultra-large alignments using phylogeny aware profiles". Proceedings RECOMB 2015 and Genome Biology (2015) 16:124 N. Nguyen, M. Nute, S. Mirarab, and T. Warnow, HIPPI: Highly accurate protein family classification with ensembles of HMMs. BMC Genomics (2016): 17 (Suppl 10):765

All software are available in open source form at https://github.com/smirarab/sepp

Page 92: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

TIPP is under development!

We are modifying TIPP’s design to improve taxonomic identification and abundance profiling on shotgun sequencing data Stay tuned!

Developers: Erin Molloy, Mike Nute, Nidhi Shah, Mihai Pop, and Tandy Warnow

Page 93: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

The Tutorial (by Mike Nute)

SEPP for taxon ID Will show you how to run SEPP Will show you how to use branch lengths in SEPP’s placement of reads to get interesting insights

TIPP for taxon ID and abundance profiling

You may also wish to look at last year’s tutorial by Erin Molloy, at https://github.com/ekmolloy/stamps-tutorial/

Page 95: TIPP and SEPP: Metagenomic Analysis using Phylogeny …tandy.cs.illinois.edu/stamps-warnow-2018.pdfshort and long reads. The other tested methods have some vulnerability (e.g., mOTU

Acknowledgments

PhD students: Nam Nguyen (now postdoc at UCSD), Siavash Mirarab (now faculty at UCSD), Bo Liu (now at Square), Erin Molloy, and Mike Nute Mihai Pop, University of Maryland

NSF grants to TW: DBI:1062335, DEB 0733029, III:AF:1513629 NIH grant to MP: R01-A1-100947 Also: Guggenheim Foundation Fellowship (to TW), Microsoft Research New England (to TW), David Bruton Jr. Centennial Professorship (to TW), Grainger Foundation (to TW), HHMI Predoctoral Fellowship (to SM) TACC, UTCS, and UIUC computational resources