
Supplementary Text

Detection of simulated reticulate events

To illustrate how persistent homology captures reticulate events, we first modeled a

population undergoing clonal evolution for 30 generations with random admixture

through reassortment at time step 15. Figure S4A-B shows a representative simulation

and highlights the ancestral lineages of reassorting sequences as well as their closely

related isolates. When we apply persistent homology to cumulative datasets from time

steps 1 to t such that ancestral sequences are included (from 30 to 900 sequences), no

higher-dimensional structure (b1 > 0) can be observed until the reassortment at t=15

(Figure S4C). However, 2-D topology remains trivial over the entire time and genome.

We can also detect the admixture event by analyzing sequences at time step 15 without

ancestral sequences. In this example, the one-dimensional generator positively identifies

all reassortants (Figure S4D).

We next considered evolution over a range of values for µ and r. Here, we considered

two different indicators of high-dimensional homology: the size of the longest non-zero

dimensional bar (TOP) (Figure S5A-B, E-F), and the rate of non-trivial homology

generators in a higher dimension (ICR) (Figure S5C-D, G-H).

In nature, complex reassortments may arise between multiple parental strains. To

simulate this phenomenon, we considered populations with a variable number of

segments S in their genomes evolving under a constant molecular clock. At time step 15,

we performed a single reassortment such that each segment reassorts independently

among half the population. We plotted the average higher-dimensional TOP and ICR as

the number of segments S increases (Figure S6). Once again, we see substantial one-

dimensional structure for any type of reassortment (S>1). Interestingly, we see two-

dimensional, and at times three-dimensional, topology only for more complex

reassortments of S>2. This result demonstrates persistent homology’s ability to detect

reassortments between multiple parental strains, a feature not shared by NSS, PHI,

MaxChi, or any other method. As we will see, determining the generators that produce

these 2-D and 3-D holes promises to identify the different strains involved in these

complex reassortments.

Comparative analysis with other methods for detecting recombination

To benchmark the performance of persistent homology, we also investigated alternative

methods specifically tailored towards the detection of recombination: the pairwise

homoplasy index (PHI)(1), neighbor similarity score (NSS)(2), and MaxChi(3), all of

which produce p-values for the significance of observing a reticulate event (see

Methods). We first considered the simulation of a single population admixture at time

step 15 following the procedure described above (Figure S7A). Interestingly, TOP and

ICR remain zero prior to t=15 when only clonal evolution has taken place; all other

methods vacillate between 0 and 1, with PHI even reaching significance level < 0.05

during t=[3,5]. For t>15, both TOP and ICR become non-zero and detect reassortment

correctly. Of the other methods, only MaxChi detects reassortment. NSS and PHI have

low sensitivity when analyzing a large number of sequences (>100). These

results indicate that TOP and ICR have higher specificity than all other methods to detect

a single admixture event. This finding relates to Supplementary Theorem 2.1, which

states that non-zero TOP implies a non-additive tree.

We next compared the statistical power of persistent homology and other methods to

detect reticulate events under 1) variable µ; 2) variable r; and 3) including and excluding

ancestral sequences (300 and 100 sequences respectively) (Figure S7B-E). Here, we

show that overall the sensitivity of persistent homology is higher than that of NSS and

PHI, and nearly as high as MaxChi. It should be noted that sensitivity of persistent

homology can be increased even further with a greater number of “witness points” at the

expense of increasing computational cost (see Methods). Moreover, persistent homology

has very high specificity at r=0 except in settings of high µ (>0.025 substitutions per site

per generation), leading to homoplasy.

Rare recombination in influenza evolution

Despite sporadic reports(4), homologous recombination in influenza plays a much

smaller role in the evolution of the virus compared to reassortment(5). Again, our results

reflect this notion; one-dimensional topology of the concatenated genome outstrips that

of individual segments. In support, b1 = 0 in H3N2 and H1N1pdm (ignoring homoplasy

events at ε ≤ 2). However, some small one-dimensional intervals in swine H1N1 and

avian influenza do exist. Swine H1N1 displays irreducible cycles in PB1 and NS1

alignments (Table S7). To a greater extent, non-homoplasy events in avian influenza

number 5, 5, 2, 2, 4, and 1 for PB2, PB1, PA, NP, M1, and NS1 respectively (Table S3).

Two of these recombination events in PB2 have been detected in previous studies, while

two irreducible cycles in PB1 most likely result from poor annotation or sequencing

error(5). Our analysis therefore indicates little homologous recombination in influenza,

with most events explained by homoplasy (small bars) and sequence errors.

Methods

Sequence collection, annotation, and alignment. Complete coding sequences of

influenza segments were collected from NCBI(6), downloaded as of August 2012.

Complete coding sequences of env, gag, and pol were downloaded from the LANL HIV

database(7) as of September 2012. Complete coding sequences of hepatitis C, dengue,

WNV, rabies, and Newcastle virus were downloaded from the Virus Pathogen Database

and Analysis Resource (ViPR) database(8) as of September 2012 (Table S1). Sequences

were aligned using ClustalW2 v. 2.0.12 using default parameters and manually curated.

For avian, H3N2, and H1N1pdm influenza, we first performed hierarchical clustering

using complete linkage, creating a representative, non-redundant dataset of 1000

sequences each. All alignments and code are available upon request.

Implementation of persistent homology. To characterize vertical and horizontal

evolution of viral genomes, we employed a persistent homology technique called

barcoding(9): a method of identifying topological invariants in cloud data. First,

sequenced isolates are transformed in high-dimensional space using pairwise genetic

distance. Distances were calculated by considering all sites or only informative sites (at

least two sequences with a nucleotide different from consensus). All analysis was

performed using Nei-Tamura (shown in this paper) and Hamming genetic distance, with

both producing similar results. From this cloud data, points can be linked based on certain

criteria to form a simplicial complex: a structure of points, line segments, triangles, and

even n-dimensional counterparts. When this criterion is a distance less than some ε, a

filtered simplicial complex called a Vietoris-Rips stream is created.
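
The construction above can be sketched in a few lines of Python. This is a minimal brute-force illustration rather than the Javaplex implementation used in this study; the function name `vietoris_rips` and the toy distance matrix `D` are hypothetical, and the strict inequality follows the wording "a distance less than some ε".

```python
from itertools import combinations

def vietoris_rips(dist, eps, max_dim=2):
    """Simplices (tuples of point indices) of the Vietoris-Rips complex at
    scale eps, up to dimension max_dim. Brute force: fine for toy inputs only."""
    n = len(dist)
    simplices = [(i,) for i in range(n)]          # every point is a 0-simplex
    for size in range(2, max_dim + 2):            # `size` points span a (size-1)-simplex
        for combo in combinations(range(n), size):
            # a simplex is present when all of its vertices are pairwise closer than eps
            if all(dist[a][b] < eps for a, b in combinations(combo, 2)):
                simplices.append(combo)
    return simplices

# Hypothetical pairwise distances between four isolates
D = [[0, 2, 5, 9],
     [2, 0, 4, 9],
     [5, 4, 0, 9],
     [9, 9, 9, 0]]

small = vietoris_rips(D, eps=3)   # only isolates 0 and 1 are linked
large = vietoris_rips(D, eps=6)   # isolates 0, 1, 2 form a filled triangle
```

Sweeping `eps` from small to large values yields the nested family of complexes that makes up the stream.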

An objective in the study of topology is to identify features of filtered simplicial

complexes that persist across all values of ε. A useful characteristic is the Betti number,

which in dimension 0 is b0, the number of connected components in a particular

simplicial complex. In the context of viral genomics, b0 at a given ε can define the

number of clades or subspecies in a viral population. In addition, the minimum ε that

produces b0 can be a measure of surveillance completeness(10). On the other hand,

higher-dimensional Betti number can detect reticulate evolution, such as recombination

or reassortment.
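
As an illustration of b0, the number of connected components at a given ε can be counted with a union-find structure. The sketch below is an assumption-laden toy (hypothetical function name `betti0`, hypothetical distance matrix), not the code used for the analyses reported here.

```python
def betti0(dist, eps):
    """Number of connected components when points closer than eps are linked."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # follow parent pointers to the component representative (with path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < eps:
                parent[find(i)] = find(j)   # merge the two components
    return len({find(i) for i in range(n)})

# Hypothetical pairwise distances between four isolates
D = [[0, 2, 5, 9],
     [2, 0, 4, 9],
     [5, 4, 0, 9],
     [9, 9, 9, 0]]

counts = [betti0(D, eps) for eps in (1, 3, 6, 10)]
```

Here `counts` tracks how the four isolates progressively merge into a single component as ε grows.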

Calculating Vietoris-Rips streams for large datasets can be computationally expensive,

and an alternative method called lazy witness streams can be used that produces a smaller

number of simplicial complexes from a subset of “landmark” points(11). Indeed, de Silva

et al. assert that in many cases, these “witness complexes” are capable of robustly

reproducing the homology groups of the Vietoris-Rips stream with even less noise(12).

Suppose a landmark subset of points L is pre-selected from point cloud data either at

random or using a max-min scheme. Let d(S,p) be the vector of distances between point p

and all points in set S. In dimension 0, a pair of points a and b is then linked if there

exists a witness point z such that the maximum of d(a,z) and d(b,z) is less than the sum of

ε and the υ’th smallest element in d(L,z). In this study, we used υ= 2. Higher-dimensional

simplices exist if all pairs of points are linked as well.
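
The landmark selection and linking rule described above can be sketched as follows. This is a hedged illustration, not the Javaplex code: the function names are hypothetical, every point (landmark or not) is allowed to act as a witness, and the strict inequality follows the wording of the text.

```python
def maxmin_landmarks(dist, k, seed=0):
    """Max-min selection: repeatedly add the point farthest from the chosen landmarks."""
    landmarks = [seed]
    while len(landmarks) < k:
        candidates = [p for p in range(len(dist)) if p not in landmarks]
        best = max(candidates, key=lambda p: min(dist[p][l] for l in landmarks))
        landmarks.append(best)
    return landmarks

def lazy_witness_edges(dist, landmarks, eps, nu=2):
    """Edges between landmarks a, b that some witness z supports at scale eps."""
    n = len(dist)
    # m[z]: nu-th smallest distance from witness z to the landmark set
    m = [sorted(dist[z][l] for l in landmarks)[nu - 1] for z in range(n)]
    edges = []
    for i, a in enumerate(landmarks):
        for b in landmarks[i + 1:]:
            if any(max(dist[a][z], dist[b][z]) < eps + m[z] for z in range(n)):
                edges.append((a, b))
    return edges

# Toy data: six points on a line (hypothetical coordinates)
xs = [0, 1, 2, 3, 4, 10]
dist = [[abs(a - b) for b in xs] for a in xs]

L = maxmin_landmarks(dist, k=3)
edges_at_0 = lazy_witness_edges(dist, L, eps=0)
edges_at_1 = lazy_witness_edges(dist, L, eps=1)
```

Higher-dimensional simplices would then be filled in whenever all of their edges are present, as stated above.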

Using the Javaplex software package at http://code.google.com/p/javaplex, we

implemented lazy witness stream to filter different viral strains with the following

parameters:

Fig. 3A: 20% of dataset as landmarks, max scale ε of 1000 nucleotides, and max dimension of 0.

Fig. 4C: 20% landmarks, max scale ε of 500 nucleotides, and max dimension of 2.

Fig. 4D: one-third landmarks, max scale ε of 1000 nucleotides, and max dimension of 1.

Fig. 5D: one-third landmarks, max scale ε of 1000 nucleotides, and max dimension of 1.

Fig. 5E: 20% landmarks, max scale ε of 150 nucleotides, and max dimension of 2.

Fig. S10: one-third landmarks, max scale ε of 1000 nucleotides, and max dimension of 1.

Fig. S11: one-third landmarks, max scale ε of 1000 nucleotides, and max dimension of 1.

All supplementary tables used one-third of sequences as landmarks except for 20% landmarks in Table S3, S5, and S13.

While lazy witness streams can largely recapitulate the homology groups of Vietoris-Rips

at a lower computational cost, the sensitivity to detect irreducible cycles can drop if the

number of landmark points is insufficient. However, through our evolutionary

simulations, we show that even with a landmark set of 20% of the sequences, our

sensitivity surpasses other popular methods of recombinant detection (see Simulations).

Mathematical underpinnings of persistent homology. For ordinary topological

spaces X, the (mod-2) homology of X in dimension k is a Boolean vector space whose

dimension counts the number of independent occurrences of certain patterns thought of as

k-dimensional holes. It is computed using matrix operations applied to a k-dimensional

boundary matrix. Elements in this vector space can be represented by cycles, by which

we mean formal sums of k-dimensional simplices, i.e. (k+1)-tuples of points spanning a

higher dimensional analogue of segments, triangles, tetrahedra, etc. We are dealing with

Boolean vector spaces, so we may think of the sums as exclusive unions of k-simplices.

The cycle condition means that if we take the exclusive unions of all boundary faces we

obtain the empty collection of (k-1)-simplices. In dimension one, for example, this

means that we have a collection of edges that form a closed loop. The choice of cycle

representative is not unique, but can be varied by any cycle, which is the boundary of a

collection of (k+1)-dimensional simplices.
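
To make the matrix computation concrete, the sketch below computes mod-2 Betti numbers as dim C_k − rank ∂_k − rank ∂_(k+1), with Gaussian elimination over the Boolean field. It is an illustrative toy with hypothetical function names; it assumes the input is a genuine simplicial complex (closed under taking faces) and that simplices of dimension k+1 are supplied whenever b_k is requested.

```python
from itertools import combinations

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are stored as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot                       # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(simplices, max_dim):
    """Mod-2 Betti numbers b_0..b_max_dim of a simplicial complex."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))

    def boundary_rank(k):
        # rank of the boundary map from k-simplices to (k-1)-simplices
        if k == 0 or k not in by_dim:
            return 0
        index = {face: i for i, face in enumerate(by_dim[k - 1])}
        rows = []
        for s in by_dim[k]:
            mask = 0
            for face in combinations(s, k):        # each face drops one vertex
                mask ^= 1 << index[face]
            rows.append(mask)
        return rank_gf2(rows)

    return [len(by_dim.get(k, [])) - boundary_rank(k) - boundary_rank(k + 1)
            for k in range(max_dim + 1)]

hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
filled = hollow + [(0, 1, 2)]
b_hollow = betti_numbers(hollow, max_dim=1)   # one component, one loop
b_filled = betti_numbers(filled, max_dim=2)   # the triangle fills the loop
```

The hollow triangle carries a one-dimensional hole; adding the 2-simplex makes its boundary cycle a boundary, so the hole disappears.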

Persistent homology is a methodology that adapts homology to the study of point clouds

or finite metric spaces. It produces a nested family of simplicial complexes, parametrized by a

scale parameter. Applying k-dimensional homology to this family yields a family of vector spaces,

likewise parametrized by the scale parameter, equipped with linear transformations from each vector space to any other with

a higher value of the scale parameter. Just as vector spaces are determined up to

isomorphism by their dimension, so these algebraic structures can be determined up to

isomorphism by a barcode, a collection of intervals, each one specified by a left hand and

a right hand endpoint. Each interval in a barcode will be called a “bar”. A bar with left

and right hand endpoints (l,r) corresponds to a homology element which first occurs at

the scale parameter value l, and which becomes zero exactly when the scale parameter

reaches r. Such a homology element can also be represented by a cycle as above, in this

case in the complex with scale parameter value l.
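
In dimension 0 the barcode can be computed directly, which gives a small worked example of the bars described above: every point starts its own bar at scale 0, and a bar ends at the scale at which its component merges into another. The sketch below is illustrative only (hypothetical function name and data) and treats an edge as present once the scale reaches its length.

```python
def h0_barcode(dist):
    """Dimension-0 barcode of a finite metric space as (left, right) endpoints."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # process candidate edges from shortest to longest
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components merge: one bar ends at scale d
            parent[ri] = rj
            bars.append((0, d))
    bars.append((0, float("inf")))   # the last component never dies
    return bars

# Hypothetical pairwise distances between four isolates
D = [[0, 2, 5, 9],
     [2, 0, 4, 9],
     [5, 4, 0, 9],
     [9, 9, 9, 0]]

bars = h0_barcode(D)
```

Bars in dimension 1 and above require the boundary-matrix reduction implemented in packages such as Javaplex.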

Principal Coordinate Analysis. Genomic sequences can be separated by genetic

distance and embedded in a high-dimensional space. Although the resulting pairwise

distance matrix provides input for persistent homology, the relationship between

sequences is not easily visualized without dimensionality reduction. Principal coordinate

analysis (PCoA), otherwise known as classical multi-dimensional scaling, transforms a

dissimilarity matrix into a coordinate matrix that minimizes a loss function, called

strain(13). The first p columns of the resulting coordinate matrix provide an

approximation of the data points separated by Euclidean distance in dimension p. We use

PCoA only to visualize a dataset (Figure 2A) or a simplicial complex (Figure 3E, 5E).
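
For readers unfamiliar with PCoA, a dependency-free sketch is given below: the squared distances are double-centered and the leading eigenvectors are extracted by power iteration with deflation. This is an illustration under stated assumptions (symmetric input, positive leading eigenvalues), with hypothetical function names; it is not the routine used to draw the figures, and a linear-algebra library would normally be used instead.

```python
import math
import random

def matvec(B, v):
    return [sum(b * x for b, x in zip(row, v)) for row in B]

def pcoa(dist, p=2, iters=500, seed=0):
    """Classical multidimensional scaling of a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # double centering: B = -1/2 * J * D^2 * J
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * p for _ in range(n)]
    for axis in range(p):
        v = [rng.gauss(0, 1) for _ in range(n)]
        norm = 0.0
        for _ in range(iters):                      # power iteration
            w = matvec(B, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                        # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(B, v)))   # Rayleigh quotient
        if norm < 1e-12 or lam <= 0:
            break                                   # remaining coordinates stay 0
        scale = math.sqrt(lam)
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # deflate so the next axis finds the next eigenvector
        B = [[B[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Three points on a line (hypothetical): one coordinate recovers all distances
xs = [0, 3, 4]
dist = [[abs(a - b) for b in xs] for a in xs]
c1 = pcoa(dist, p=1)
c2 = pcoa(dist, p=2)
```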

Evolutionary Simulations. Stochastic evolution through antigenic drift was modeled as

a neutral Wright-Fisher model with a constant haploid population size N of dinucleotide

sequences of genomic length L base pairs, with each sequence evolving at a mutation rate

of µ nucleotides per site per generation. Each simulation began with an initial population

of 40% nucleotide diversity. We also performed reassortment for S number of segments

each of 500 nucleotides at a given rate r, which governs the proportion of individuals

exchanging segments at each generation. To simulate recombination, we randomized the

length and coordinates of the genomic segment exchanged among a proportion of

individuals parameterized by r for each generation.
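
A stripped-down version of such a simulation might look as follows. It is a sketch under simplifying assumptions that differ from the study: the population starts from a single random founder rather than from 40% nucleotide diversity, reassorting individuals are paired and swap one randomly chosen segment, and recombination is omitted. All names and default values are hypothetical.

```python
import random

def wright_fisher(N=30, S=2, seg_len=500, mu=0.005, r=0.0, generations=30, seed=0):
    """Neutral Wright-Fisher evolution of segmented genomes; returns one list of
    concatenated sequences per generation."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    founder = [[rng.choice(alphabet) for _ in range(seg_len)] for _ in range(S)]
    pop = [[seg[:] for seg in founder] for _ in range(N)]
    history = []
    for _ in range(generations):
        # resampling: each offspring copies a uniformly chosen parent
        pop = [[seg[:] for seg in rng.choice(pop)] for _ in range(N)]
        # mutation: each site changes with probability mu
        for genome in pop:
            for seg in genome:
                for pos in range(seg_len):
                    if rng.random() < mu:
                        seg[pos] = rng.choice([b for b in alphabet if b != seg[pos]])
        # reassortment: a proportion r of individuals swap one segment in pairs
        k = int(r * N) // 2 * 2
        chosen = rng.sample(range(N), k)
        for a, b in zip(chosen[::2], chosen[1::2]):
            s = rng.randrange(S)
            pop[a][s], pop[b][s] = pop[b][s], pop[a][s]
        history.append(["".join("".join(seg) for seg in g) for g in pop])
    return history

clonal = wright_fisher(N=10, S=2, seg_len=20, mu=0.0, r=0.0, generations=5, seed=1)
mixed = wright_fisher(N=10, S=2, seg_len=20, mu=0.01, r=0.5, generations=5, seed=3)
```

Concatenating the generations in `history` gives a cumulative dataset that includes ancestral sequences, while a single entry corresponds to one time step.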

To characterize the evolutionary topology for a given set of conditions, we considered

datasets that either included or excluded ancestral sequences. To simulate with ancestral

sequences, we analyzed cumulative datasets with multiple time steps, or generations, of

evolved sequences. Otherwise, we analyzed only a single generation of sequences. For

each dataset, we then determined a metric space of pairwise genetic distances (Jukes-

Cantor). Based on these distances, we applied Javaplex to perform barcoding analysis at

different dimensions based on a lazy witness complex constructed with 20% of sequences

serving as landmarks. Such analysis can determine the Betti number across all filtration

genetic distances and generations. Alternatively, for a given time step t, we can examine

how the topological properties—topological obstruction (TOP) and irreducible cycle rate

(ICR)—change as we modulate µ and r. Of course, sensitivity to detect irreducible cycles

can be improved by increasing the number of landmark points.
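
The Jukes-Cantor distance used for the simulated datasets follows the standard correction d = −(3/4) ln(1 − 4p/3), where p is the proportion of differing sites. A minimal sketch, with hypothetical function names and toy sequences, and assuming aligned sequences of equal length:

```python
import math

def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jukes_cantor(a, b):
    """Jukes-Cantor corrected distance; infinite when the correction saturates."""
    p = p_distance(a, b)
    if p >= 0.75:
        return float("inf")
    return -0.75 * math.log(1 - 4 * p / 3)

def distance_matrix(seqs):
    """Pairwise Jukes-Cantor distances, the input expected by barcoding."""
    return [[jukes_cantor(a, b) for b in seqs] for a in seqs]

seqs = ["AAAA", "AAAT", "AATT"]   # toy sequences
M = distance_matrix(seqs)
```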

One consideration is the ability of persistent homology to differentiate between real

reticulate events and homoplasies. When r=0, evolution is purely clonal; therefore, any

deviations from trivial homology derive from homoplasy. By estimating the bar length

distribution in simulations with r=0, we found that all non-trivial homology is

represented by bars within 1-D or 2-D TOP ≤ 4 (99% of simulations). For analysis of

statistical power during simulations, we therefore considered 1-D or 2-D TOP > 4 genetic

distance to be a reticulate event.
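
The decision rule can be written compactly. In this sketch a barcode is assumed to be a list of (left, right) endpoint pairs per dimension, TOP is taken to be the length of the longest bar, and the function names are hypothetical.

```python
def top(bars):
    """Length of the longest bar; 0 when there are no bars."""
    return max((right - left for left, right in bars), default=0)

def is_reticulate(bars_by_dim, threshold=4):
    """True when the 1-D or 2-D TOP exceeds the homoplasy threshold."""
    return any(top(bars_by_dim.get(dim, [])) > threshold for dim in (1, 2))

# Hypothetical barcodes: short bars are attributed to homoplasy
homoplasy_only = {1: [(3, 5), (6, 9)], 2: []}
with_reassortment = {1: [(3, 5), (10, 17)], 2: []}

calls = (is_reticulate(homoplasy_only), is_reticulate(with_reassortment))
```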

Reassortment patterns. To determine patterns of reassortment, we calculate the

probability pij that two segments co-segregate given that we observe #b1ij events in a total

of #b1, where #b1ij is the number of non-trivial cycles between segments i and j, and #b1 is

the number of non-trivial cycles in all genomes. Assuming that segments do not have

preferential co-segregation, pij follows a binomial distribution, pij ~ bino(#b1ij, #b1, 1/2). Network figures

were generated using Cytoscape v2.8.3(14).
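
The binomial model above can be evaluated with the standard library alone. The sketch below shows the point probability of k events in n trials with success probability 1/2, together with an upper-tail probability; whether a tail or a point probability is the more appropriate summary is an assumption here, and the function names and counts are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_upper_tail(k, n, p=0.5):
    """Probability of at least k successes in n independent trials."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Hypothetical counts: 4 of 4 non-trivial cycles involve segments i and j
point = binom_pmf(2, 4)
tail = binom_upper_tail(4, 4)
```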

Phylogenetic analysis. Phylogeny for avian HA in Figure 4 was constructed using

neighbor joining, ignoring all gap sites using SeaView, version 4.3.3. Phylogenies for

H3N2 and H1N1pdm influenza were constructed using maximum likelihood using the

generalized time-reversible (GTR) model with an across site rate variation model of four

different rate classes using the nearest-neighbor interchange (NNI) tree searching with

bootstrap values calculated after 100 replicates using SeaView, version 4.3.3.

Comparison to methods for detecting recombination. NSS, PHI, and MaxChi are

examples of methods for detecting recombination. NSS considers neighboring pairs of

sites and detects regions where incompatibilities cluster together(2). PHI scores the

likelihood of observing the minimum number of homoplasies for every pair of sites in the

alignment based on a parsimony criterion(1). For each position in an alignment, MaxChi

compares the distribution of polymorphic sites to the left and right of a putative

breakpoint and scores the site based on the chi-squared test(3). Calculation of NSS, PHI,

and MaxChi was performed using the PhiPack software package provided by Bruen, et

al.(15) using default parameters. We used a window size of 100 base pairs for PHI and

2/3 the number of polymorphic sites for MaxChi. All p-values were assessed using a

permutation test of 1000 permutations. P-values less than 0.05 were considered

significant for detecting recombination.

Computation time and scalability. The memory required by Javaplex for the

computation of persistent homology depended on several factors beyond the input dataset

itself, particularly the number of dimensions analyzed, the number of landmarks chosen

(when using lazy witness stream), and the maximum filtration value, all of which

influence the number of possible simplices. For this reason, we optimized these

parameters such that analysis was completed within computational limits. As the input

into the specific persistent homology program is a pairwise distance matrix, genome

length does not affect computation time. On a single computational node, for analyzing

simulation data of 300 sequences with variable mutation and recombination rate,

computational time for a Javaplex-implemented lazy witness stream with 20% of the

dataset as landmarks took on average 70.58±50.66 seconds (mean ± standard deviation), compared

to PhiPack taking 689.63±24.78 seconds on the same datasets. For a fixed number of

landmark points, the computation of the witness complexes can in principle be

parallelized readily. This feature is not built into existing software but can readily be

carried out, and will be a focus of future software development.

References

1. Bruen TC, Philippe H, & Bryant D (2006) A Simple and Robust Statistical Test for Detecting the Presence of Recombination. Genetics 172(4):2665–2681.

2. Jakobsen IB & Easteal S (1996) A program for calculating and displaying compatibility matrices as an aid in determining reticulate evolution in molecular sequences. Comput Appl Biosci 12(4):291–295.

3. Smith JM (1992) Analyzing the mosaic structure of genes. J Mol Evol 34(2):126–129.

4. He CQ, et al. (2009) Homologous recombination as an evolutionary force in the avian influenza A virus. Mol Biol Evol 26(1):177–187.

5. Krasnitz M, Levine AJ, & Rabadan R (2008) Anomalies in the influenza virus genome database: new biology or laboratory errors? J Virol 82(17):8947–8950.

6. Bao Y, et al. (2008) The Influenza Virus Resource at the National Center for Biotechnology Information. J Virol 82(2):596–601.

7. Anonymous. The Los Alamos HIV sequence database. http://www.hiv.lanl.gov/.

8. Pickett BE, et al. (2012) ViPR: an open bioinformatics database and analysis resource for virology research. Nucleic Acids Res 40(D1):D593–D598.

9. Carlsson G, Zomorodian A, Collins A, & Guibas L (2004) Persistence barcodes for shapes. in Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing (ACM, Nice, France), pp 124–135.

10. Chan JM & Rabadan R (2013) Quantifying Pathogen Surveillance Using Temporal Genomic Data. mBio 4(1).

11. Carlsson G & de Silva V (2004) Topological estimation using witness complexes. in Symposium on Point-Based Graphics, ETH (Zurich, Switzerland).

12. Silva Vd & Carlsson G (2004) Topological estimation using witness complexes. Eurographics Symposium on Point-Based Graphics.

13. Cox T & Cox M (2001) Multidimensional Scaling.

14. Smoot ME, Ono K, Ruscheinski J, Wang PL, & Ideker T (2011) Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27(3):431–432.

15. Bruen T (2005) PhiPack: PHI test and other tests of recombination. McGill University, Montreal, Quebec.

Supplementary Information: Theorems on Topological Obstruction to Phylogeny

We refer the reader to [1] for basic material on simplicial complexes, including barycentric subdivisions, as well as the homology groups $H_i$ of simplicial complexes.

1 Definitions

Definition 1.1 By a tree, we will mean a finite connected one-dimensional simplicial complex with no simple cycles, i.e. cycles in which no edge or vertex appears more than once. By an additive tree, we will mean a tree equipped with a real-valued function (called the weight) on the edges, taking values in the positive real numbers. By an edge path in a tree $T$, we will mean a sequence of vertices $\{v_0, v_1, \ldots, v_k\}$ so that each pair $(v_i, v_{i+1})$ is an edge in $T$. If we have an additive tree, the length of an edge path is the sum of the weights of the edges appearing in the edge path. The edge-path distance between two vertices in an additive tree is the length of the shortest edge path connecting the two vertices. An additive forest is a disjoint union of additive trees.

Given any additive tree, we have now defined the structure of a finite metric space on its set of vertices. Not all finite metric spaces can be realized as coming from additive trees. We will refer to metric spaces which do arise from additive trees as tree-like. As we shall see, persistent homology gives a useful measure of the extent to which a finite metric space is close to one which is tree-like.

Definition 1.2 By an additive graph, we will mean a graph $\Gamma$ in which each edge $(v, v')$ is assigned a positive length $\lambda_\Gamma(v, v')$. There is an associated edge-path metric on the set of vertices, with the distance from $v$ to $w$ given by the minimum of $\sum_i \lambda_\Gamma(v_i, v_{i+1})$ over the set of all edge paths $\{(v_0, v_1), (v_1, v_2), \ldots, (v_{n-2}, v_{n-1}), (v_{n-1}, v_n)\}$ with $v_0 = v$ and $v_n = w$. Of course the $n$ can vary in the set of edge paths. We will denote this metric by $d_\Gamma$, and write $M(\Gamma)$ for the metric space $(V(\Gamma), d_\Gamma)$. If the underlying graph of $\Gamma$ is a tree, $d_\Gamma$ satisfies the four point condition, see [4]. It is also shown in [4] that the converse holds, in the sense that any finite metric space satisfying the four point condition can be embedded isometrically in the metric space of vertices of an additive tree, i.e. an additive graph whose underlying graph is a tree. We will refer to a finite metric space as tree-like if it satisfies the four point condition, or equivalently is isometrically embeddable in $M(T)$ for some additive tree $T = (T, \lambda_T)$.
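
The four point condition of [4] states that, for any four points, the two largest of the three pairwise sums $d(x,y) + d(z,w)$, $d(x,z) + d(y,w)$ and $d(x,w) + d(y,z)$ are equal. A brute-force check is sketched below; the function name and the two toy metrics are hypothetical, and the input is assumed to be a genuine metric, so that only quadruples of distinct points need testing.

```python
from itertools import combinations

def is_tree_like(dist, tol=1e-9):
    """Check the four point condition on every quadruple of distinct points."""
    n = len(dist)
    for x, y, z, w in combinations(range(n), 4):
        sums = sorted([dist[x][y] + dist[z][w],
                       dist[x][z] + dist[y][w],
                       dist[x][w] + dist[y][z]])
        if sums[2] - sums[1] > tol:     # the two largest sums must coincide
            return False
    return True

# Leaves of the tree ((a,b),(c,d)) with all edge weights equal to 1
quartet = [[0, 2, 3, 3],
           [2, 0, 3, 3],
           [3, 3, 0, 2],
           [3, 3, 2, 0]]

# Shortest-path metric on a 4-cycle with unit edges: not tree-like
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]

results = (is_tree_like(quartet), is_tree_like(cycle))
```

The 4-cycle is the simplest metric that fails the condition, which matches the intuition that a loop in the underlying graph obstructs a tree representation.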

Definition 1.3 For any finite metric space $M = (M, d)$, we construct the Vietoris-Rips complexes $V(M, r)$ of $M$ as follows. The vertex set of $V(M, r)$ is the set $M$, and we declare that a $(k+1)$-tuple $\{m_0, m_1, \ldots, m_k\}$ spans a $k$-simplex of $V(M, r)$ if and only if $d(m_i, m_j) \le r$ for all $0 \le i, j \le k$. Note that if $r \le r'$, then there is a natural inclusion $V(M, r) \hookrightarrow V(M, r')$ which is the identity on the vertex set.

2 Theorems

Our goal is to prove that for any $r$, the complex $V(M, r)$ for a tree-like metric space is a disjoint union of acyclic complexes, i.e. that the groups $H_k(V(M, r))$ vanish for all $k > 0$. We begin with a lemma. For any graph $\Gamma$, the set of leaves of $\Gamma$ will be the set of vertices which are contained in exactly one edge.

Lemma 2.1 If $M$ is a tree-like finite metric space, then $M$ can be isometrically embedded in $M(T)$ for some additive tree $T$ in such a way that all the leaves of $T$ are included in the image of $M$ in the vertex set of $T$.

Proof: Let $T$ be any additive tree, and $i : M \to V(T)$ be an isometric embedding in the vertex set of $T$. If $e$ is any leaf which is not contained in the image of $i$, then $M$ is included in $T - \{e\}$, and it is clear that any minimal edge path from $m_0$ to $m_1$, where $m_0, m_1 \in M$, will not include $e$. The reason is that if $(f, e)$ is the unique edge containing $e$, then any occurrence of $e$ in an edge path is of the form $\ldots (f, e)(e, f) \ldots$, and can therefore be deleted. Thus, $M$ embeds isometrically in $M(T - \{e\}, \lambda)$, and we can proceed to "prune" leaves until we arrive at a situation in which all leaves are in the image of $M$. $\square$

We will also need the following proposition which describes the relationship between the edge-path metrics on additive trees and edge-path metrics on additive subtrees.

Proposition 2.1 Let $T$ denote an additive tree, and suppose that $T' \subseteq T$ is a sub-additive tree obtained by removing a set of leaves $\{e_1, \ldots, e_n\}$. Then we have the equality

$$d_{T'} = d_T|_{V(T') \times V(T')}$$

so $M(T')$ is a metric subspace of $M(T)$ in the sense that the distance function on $M(T')$ is just the restriction of the distance function on $M(T)$.

Proof: It suffices to consider the case where one leaf is removed, since one can then proceed by induction. Let $e$ denote the one leaf, and suppose $v_0$ and $v_1$ are vertices in $T'$. Let $f$ denote the unique vertex of $T$ so that $(e, f)$ is an edge in $T$. Any edge path between $v_0$ and $v_1$ which contains $e$ may be shortened by removing the occurrence of $e$, since the edge path will be of the form $\ldots (f, e)(e, f) \ldots$, and the occurrence of $(f, e)(e, f)$ can simply be removed. This means that the edge path of minimal length from $v_0$ to $v_1$ does not contain any occurrence of the vertex $e$, which is what we need to show. $\square$

The following proposition will also be required.

Proposition 2.2 Let $M$ be any finite metric space. Let $m_0, m_1 \in M$ be two points which maximize distance, i.e. so that $d(m_0, m_1) \ge d(m, m')$ for any $m, m' \in M$. Let $M_0 = M - \{m_0\}$ and $M_1 = M - \{m_1\}$, regarded as metric subspaces. Then for any $r < d(m_0, m_1)$, we have that

$$V(M, r) = V(M_0, r) \cup V(M_1, r)$$

and that $V(M_0, r) \cap V(M_1, r) = V(M_0 \cap M_1, r)$. We have $M_0 \cap M_1 = M - \{m_0, m_1\}$.

Proof: This follows easily since no simplex of $V(M, r)$ contains both $m_0$ and $m_1$, because of the inequality on $r$. $\square$

The value of this proposition is that it will permit us to perform an induction on the number of points within a tree-like metric space. The important homological result is the following. We say a topological space $W$ is acyclic if it is connected and $H_i(W) = \{0\}$ for all $i \ge 1$.

Proposition 2.3 Let $X$ be a simplicial complex, with two subcomplexes $U$ and $V$ so that $X = U \cup V$. If $U$, $V$, and $U \cap V$ are all acyclic, then so is $X$.

Proof: This is a direct application of the Mayer-Vietoris exact sequence (see [5] for details of this construction). $\square$
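
For completeness, the argument can be spelled out as follows. This is a sketch of the standard reasoning, assuming the usual form of the Mayer-Vietoris sequence given in [5] and that $U \cap V$ is non-empty (which is implicit in its being connected).

```latex
\[
\cdots \to H_i(U \cap V) \to H_i(U) \oplus H_i(V) \to H_i(X)
       \to H_{i-1}(U \cap V) \to \cdots
\]
% Case i >= 2: H_i(U) \oplus H_i(V) = 0 and H_{i-1}(U \cap V) = 0,
% so exactness forces H_i(X) = 0.
%
% Case i = 1: H_1(U) \oplus H_1(V) = 0, so H_1(X) injects into
% H_0(U \cap V), with image equal to the kernel of
\[
H_0(U \cap V) \to H_0(U) \oplus H_0(V).
\]
% Because U \cap V is connected and non-empty, this map is injective,
% hence its kernel is 0 and H_1(X) = 0.
%
% Connectivity: X is the union of two connected complexes with
% non-empty intersection, hence connected.
```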

Theorem 2.1 Let $M$ be any tree-like finite metric space, and let $r \ge 0$. Then the complex $V(M, r)$ is a disjoint union of acyclic complexes. In particular, $H_i(V(M, r)) = \{0\}$ for $i \ge 1$.

Proof: We will proceed by induction on the cardinality of $M$. Let $\simeq_r$ denote the equivalence relation generated by the relation $\sim_r$, defined by $m \sim_r m'$ if and only if $d(m, m') \le r$. The metric space $M$ now breaks up into equivalence classes $M = \coprod_\alpha M_\alpha$ under $\simeq_r$, and it is clear that

$$V(M, r) = \coprod_\alpha V\big((M_\alpha, d_M|_{M_\alpha \times M_\alpha}), r\big)$$

Each of the metric spaces $(M_\alpha, d_M|_{M_\alpha \times M_\alpha})$ is also tree-like. We now perform the induction, and suppose that the result of the theorem holds for all metric spaces of cardinality $< n$. Consequently, we may suppose that $V(M, r)$ is connected, since otherwise its Vietoris-Rips complex is a disjoint union of complexes which are acyclic by the inductive hypothesis. We suppose next that $M$ is embedded as a metric subspace of $M(T)$, with all leaves being included in the image of $M$, which can be done by Lemma 2.1. Next, we select two points of maximum distance in $M$, say $m_0$ and $m_1$. We first observe that the $m_i$'s are leaves. From Proposition 2.2 above, we see that it will suffice to prove that the metric spaces $M - \{m_0\}$, $M - \{m_1\}$, and $M - \{m_0, m_1\}$ all have acyclic Vietoris-Rips complexes. By induction on the cardinality, and using Proposition 2.1 above, it will suffice to prove that they are all connected.

By an $r$-path from $x$ to $x'$ in a finite metric space $X$, we will mean a sequence of elements $x_0, x_1, \ldots, x_s \in X$ such that $x_0 = x$, $x_s = x'$, and $d(x_i, x_{i+1}) \le r$ for $i = 0, 1, \ldots, s - 1$. Such an $r$-path corresponds exactly to an edge path from $x$ to $x'$ in $V(X, r)$. We will say a finite metric space $X$ is $r$-connected if every pair of points in $X$ can be connected by an $r$-path. We have seen that in our induction on the cardinality $n$ of a tree-like finite metric space $M$, we may assume that $M$ is $r$-connected. Making the choices of $m_0$ and $m_1$ as above, it will now suffice to prove that $M - \{m_0\}$, $M - \{m_1\}$, and $M - \{m_0, m_1\}$ are all $r$-connected.

Given any additive tree, we say a vertex is linear if it is contained in exactly two edges. Otherwise, we say it is a junction. For any leaf $e$, there is a unique nearest junction $j(e)$, unless there are no junctions in the tree. In the case where there are no junctions, the graph is simply a line, for which the result is easy to check. For any pair $(v, e)$, where $v$ is a vertex in a tree $T$ and $e$ is an edge of $T$ containing $v$, we will denote by $\mathrm{Br}(v, e)$ the branch of $T$ through $v$ and $e$, defined to be the subtree on all vertices $v'$ so that the unique shortest edge path from $v'$ to $v$ contains $e$. We consider the set of nodes $\{v_0, v_1, \ldots, v_n\}$ so that each $\{j(m_0), v_i\}$ is an edge in $T$, and construct all the branches $B_i = \mathrm{Br}(j(m_0), \{j(m_0), v_i\})$. The union of the $B_i$'s is all of $T$, and we may assume that $m_i \in B_i$ for $i = 0, 1$. It is clear that $B_0$ is simply a linear tree, beginning at $j(m_0)$ and ending at $m_0$. It is also clear that for any vertex $v \in B_i$, with $i \ge 2$,

$$d(v, j(m_0)) \le d(m_0, j(m_0)) \tag{2-1}$$

for if $d(v, j(m_0)) > d(m_0, j(m_0))$, then $d(m_1, v) = d(m_1, j(m_0)) + d(j(m_0), v) > d(m_1, j(m_0)) + d(j(m_0), m_0) = d(m_1, m_0)$, contradicting the maximality of $d(m_0, m_1)$. It follows that

$$d(w, m_0) \ge d(w, v) \tag{2-2}$$

where $v \in B_i$ for $i \ge 2$ and $w \in B_i$ for $i \ge 1$.

We now prove the connectivity of $V(M - \{m_0\}, r)$. Suppose that we are given elements $m, m'$ of $M - \{m_0\}$. $M$ is known to be $r$-connected, so there is an $r$-path $v_0, v_1, \ldots, v_k$ from $m$ to $m'$ in $M$. If $v_s \ne m_0$ for all $s$, the $r$-path lies in $M - \{m_0\}$, so we may assume that for some $v_s$, $v_s = m_0$. We have $d(v_{s-1}, m_0) \le r$ and $d(m_0, v_{s+1}) \le r$. We will need to do a case by case analysis. First, suppose that the set $M \cap (B_0 - \{m_0\})$ is non-empty. Let $\bar{m}$ be the point of $M \cap (B_0 - \{m_0\})$ which is closest to $m_0$. Then it is clear that replacing the segment $v_{s-1} m_0 v_{s+1}$ of the $r$-path by $v_{s-1} \bar{m} v_{s+1}$ produces an $r$-path from $m$ to $m'$ with the given occurrence of $m_0$ removed. In the case where $M \cap (B_0 - \{m_0\})$ is empty, we select any leaf $m^*$ in $B_i$, for some $i \ge 2$. It is now clear that we may replace the fragment $v_{s-1} m_0 v_{s+1}$ by $v_{s-1} m^* v_{s+1}$ to obtain an $r$-path from $m$ to $m'$, also with the given occurrence of $m_0$ removed. This follows from the inequality 2-2 above, since $v_{s-1}$ and $v_{s+1}$ lie in some $B_i$, with $i \ge 1$. This means that we have replaced the given $r$-path with one with a smaller number of occurrences of $m_0$. Proceeding this way, we obtain an $r$-path from $m$ to $m'$ lying entirely in $M - \{m_0\}$, proving that $V(M - \{m_0\}, r)$ is connected. The result for $M - \{m_1\}$ follows identically, as it does for $M - \{m_0, m_1\}$, since we can remove the occurrences of $m_0$ and $m_1$ independently, because they never appear adjacently in an $r$-path. This works since $m_0$ is never replaced by $m_1$ and vice versa. $\square$

We also describe a result which allows us to accommodate noise or errors in the computations of the metrics. M. Gromov has introduced a metric on the set of all compact metric spaces, called the Gromov-Hausdorff metric (see [3]). There is also a metric called the bottleneck distance on the set of all persistence barcodes (see [2]). We let $\mathcal{M}_{fin}$ and $\mathcal{B}$ denote the families of finite metric spaces and persistence barcodes respectively, and let $d_{GH}$ and $d_B$ denote the Gromov-Hausdorff and bottleneck distances respectively. For each non-negative integer $k$, we have the assignment $H_k : \mathcal{M}_{fin} \to \mathcal{B}$ which assigns to each finite metric space its associated persistence barcode. The following is proved in [2].

Theorem 2.2 Each of the assignments $H_k$ is distance non-increasing, so for any two compact metric spaces $X, Y$, we have

$$d_B(H_k(X), H_k(Y)) \leq d_{GH}(X, Y).$$

References

[1] Armstrong, M.A. Basic Topology, Undergraduate Texts in Mathematics, Springer, 1983

[2] Chazal, F., D. Cohen-Steiner, L. Guibas, F. Memoli, and S. Oudot, Gromov-Hausdorff stable signatures for shapes using persistence, Eurographics Symposium on Geometry Processing 2009, M. Alexa and M. Kazhdan, editors, The Eurographics Association and Blackwell Publishing Ltd.

[3] Gromov, M., Metric Structures for Riemannian and Non-Riemannian Spaces, Progress in Mathematics 152, Birkhäuser Boston, 1999.

[4] Buneman, P., A Note on the Metric Properties of Trees, Journal of Combinatorial Theory (B), 17, 48-50, 1974.

[5] Hatcher, A., Algebraic Topology, Cambridge University Press, Cambridge, 2002, ISBN: 0-521-79160-X; 0-521-79540-0.

Supplementary Figure Legends

Figure S1. Reticulate structure representing the reassortment of three parental strains.

The reticulate structure (right) results from merging the three parental phylogenetic trees

(left).

Figure S2. Approximating the unobserved topological space of evolution E. We observe

a sample of data points. One way to approximate the topology of E is to consider the union of

balls centered at each data point with a radius of ε.
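As a rough illustration of the ball construction in Figure S2, the sketch below counts the connected components of a union of balls of radius ε around sampled points. It assumes Euclidean points and uses the fact that two closed balls of radius ε intersect exactly when their centers are at most 2ε apart. The function name `count_components` and the sample coordinates are hypothetical.

```python
import math

def count_components(points, eps):
    """Number of connected components of the union of closed balls of
    radius eps centered at the given Euclidean points (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            # Balls of radius eps overlap when centers are within 2 * eps.
            if math.dist(points[i], points[j]) <= 2 * eps:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})

# Hypothetical sample: two pairs of nearby points, far from each other.
sample = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 1.0)]
for eps in (0.25, 0.5, 3.0):
    print(eps, count_components(sample, eps))
```

As ε grows, the four isolated points first merge into two clusters and then into a single component, which is the zero-dimensional information that a barcode records across scales.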

Figure S3. Homology groups of different dimensions in simplicial complexes.

Figure S4. Representative simulation of a single admixture event. The simulation

followed a Wright-Fisher model with a constant population size of 30 sequences of 1,000

nucleotides each for 15 generations with µ=0.005 substitutions per site per generation. In

addition, we simulated a single recombination event at time step 15 such that four

random sequences exchanged half their genomes between themselves randomly. A)

Representation of the evolutionary model of a single admixture across generations. The

ancestral lineages of the four reassortants and two closely related isolates are highlighted.

B) An equivalent depiction of these lineages. C) Persistent homology analysis for

cumulative datasets from time step 1 to t (including ancestral sequences) reveals no one-

dimensional topological structure until t=15. D) Persistent homology of generation 15

(without ancestral sequences) identified a one-dimensional generator [96.0, 126.0): [3,6]

+ [3,4] + [5,6] + [1,4] + [1,2] + -[5,2]. This generator of sequences represents an

irreducible cycle of reassortants 1, 3, 4, and 5, as well as closely related isolates 2 and 6.

This generator is depicted via principal coordinate analysis.
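The simulation described in this legend can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the mutation model (uniform substitution to a different base), the single-founder start, the way the selected sequences exchange halves (each keeps its first half and receives the second half of another selected sequence), and all function names are assumptions. The default parameters (30 sequences, 1,000 nucleotides, 15 generations, µ=0.005, four exchanging sequences) are taken from the legend.

```python
import random

ALPHABET = "ACGT"

def mutate(seq, mu, rng):
    """Substitute each site independently with probability mu (assumed model)."""
    out = []
    for base in seq:
        if rng.random() < mu:
            out.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            out.append(base)
    return "".join(out)

def wright_fisher(pop_size=30, length=1000, generations=15, mu=0.005, seed=0):
    """Constant-size Wright-Fisher evolution from one random founder sequence."""
    rng = random.Random(seed)
    founder = "".join(rng.choice(ALPHABET) for _ in range(length))
    population = [founder] * pop_size
    history = [population]
    for _ in range(generations - 1):
        # Each child copies a uniformly chosen parent, then mutates.
        population = [mutate(rng.choice(population), mu, rng)
                      for _ in range(pop_size)]
        history.append(population)
    return history, rng

def reassort(population, n_reassortants, rng):
    """Assumed exchange scheme: each chosen sequence keeps its first half and
    takes the second half of the next chosen sequence (cyclically)."""
    half = len(population[0]) // 2
    chosen = rng.sample(range(len(population)), n_reassortants)
    new_pop = list(population)
    for k, recipient in enumerate(chosen):
        donor = chosen[(k + 1) % n_reassortants]
        new_pop[recipient] = population[recipient][:half] + population[donor][half:]
    return new_pop, chosen

def hamming(a, b):
    """Number of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

history, rng = wright_fisher()
final_generation, reassortants = reassort(history[-1], 4, rng)
print("reassortant indices:", sorted(reassortants))
print("largest pairwise distance:",
      max(hamming(a, b) for a in final_generation for b in final_generation))
```

The pairwise Hamming distances of the final generation, or of the cumulative history, would then serve as the input metric for a persistent homology computation.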

Figure S5. Topological obstruction (TOP) and irreducible cycle rate (ICR) of simulated

reticulate events. TOP and ICR during simulated reassortment and recombination.

Simulations were performed at different mutation and reassortment/recombination rates.

TOP was calculated as the maximum barcode length in each of dimensions 1 and 2. ICR was

calculated as the number of barcodes in each of dimensions 1 and 2, normalized by the

maximum genetic distance creating a barcode. A) TOP during reassortment. B) ICR

during reassortment. C) TOP during recombination. D) ICR during recombination.

Figure S6. Persistent homology of reassortments between multiple parental strains. We

considered different populations of dinucleotide genomes of 5,000 bases composed of S

segments. We evolved these sequences at a constant mutation rate of 0.03

substitutions per site per time step. At time step 15, we performed a complex

reassortment such that for half the population, each segment reassorts independently. We

performed 1000 trials for each number of segments S from 1 to 4. We plotted the average

TOP and ICR for sequences isolated from time steps 14 and 15.

Figure S7. Comparative analysis of methods detecting reticulate evolution. We

investigated the performance of persistent homology compared to NSS, PHI, and

MaxChi. (A) Simulation was performed with a single reassortment event occurring at

t=15. For comparison, we normalized 1-D TOP and ICR to one. Power plots were also

calculated based on fifty trials for a given mutation rate and a given (B, C) reassortment

rate or (D, E) recombination rate throughout generations. These simulations were also

performed with (B, D) and without (C, E) ancestral sequences. Datasets with ancestral

sequences spanned 10 generations of 30 sequences each. Datasets without ancestral

sequences comprised a single generation of 100 sequences. For each method tested, red

indicates greater confidence in detecting horizontal evolution.

Figure S8. Persistent homology prediction of an H3N2 reassortment supported by

phylogenetic analysis. Barcoding analysis was performed for concatenated PB2 and HA

segments of H3N2. A one-dimensional generator G3 indicated an H3N2 reassortment

defined by sequences ABG80094, ABI21430, AFG72106, AFG72679, AFH00097,

AFH00229, AFH00427, AFH00801, and AFJ74406. Phylogenetic analysis for A) PB2

and B) HA separately produced incongruent tree structures, corroborating the occurrence

of a reassortment. C) Reticulate cycles in the NeighborNet phylogenetic network of

concatenated PB2 and HA confirm the existence of incompatibilities. The network was

created using SplitsTree v. 4.12.6.

Figure S9. Persistent homology prediction of an H1N1pdm reassortment supported by

informative sites. Barcoding analysis was performed for the concatenated genome of

H1N1pdm. A one-dimensional generator G2 nominates a candidate H1N1pdm

reassortment defined by sequences ADA86070, ADD75067, ADH01927, ADI99739,

ADK33740, ADM31737, and AFN18786. A) Informative sites in which at least two

sequences differed from the major allele are depicted. Two sets of segments display two

allelic patterns. Given two alleles A and B, PB2, M1, and NS1 conform to the allelic

pattern AABBBBB. PB1, PA, and HA, on the other hand, follow AAAABBA, with only

a few exceptions for HA. B) NeighborNet phylogenetic network of these informative

sites confirms these incompatibilities through two reticulate cycles. The network was created

using SplitsTree v. 4.12.6. These results support a potential reassortment pattern that

occurred during H1N1pdm evolution.
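The selection of informative sites described in this legend (sites at which at least two sequences differ from the major allele) can be sketched as below. The function name `informative_sites`, the `min_minor` parameter, and the toy alignment are hypothetical. The sketch assumes an alignment given as equal-length strings.

```python
from collections import Counter

def informative_sites(alignment, min_minor=2):
    """Return the 0-based positions at which at least `min_minor`
    sequences carry an allele other than the most common one."""
    sites = []
    for pos, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        major_count = counts.most_common(1)[0][1]
        if len(column) - major_count >= min_minor:
            sites.append(pos)
    return sites

# Toy alignment of five sequences (hypothetical data).
toy = ["AAGTC",
       "AAGTC",
       "ACGAC",
       "ACGAT",
       "ACGAC"]
print(informative_sites(toy))  # [1, 3]
```

Position 4 is excluded because only one sequence differs from the major allele there, so it cannot distinguish between competing allelic patterns.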

Figure S10. Quantifying topological features. A) Topological obstruction is calculated

using the maximum barcode length in non-zero dimensions. B) The irreducible cycle rate

is the number of higher-dimensional barcodes normalized by the time span of sequence

collection. For specific parameters, see Methods.
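The two quantities in this legend can be computed from a barcode along the following lines. This is a minimal sketch that assumes the barcode is given as `(dimension, birth, death)` triples with finite endpoints in non-zero dimensions. The function names, the triple representation, and the example time span are assumptions. The dimension-1 bar [96.0, 126.0) is the generator reported in Figure S4D, and the remaining bars are made up.

```python
def topological_obstruction(bars):
    """Length of the longest bar in dimension >= 1 (0.0 if there is none)."""
    lengths = [death - birth for dim, birth, death in bars if dim >= 1]
    return max(lengths, default=0.0)

def irreducible_cycle_rate(bars, time_span):
    """Number of bars in dimension >= 1 divided by the sampling time span."""
    if time_span <= 0:
        raise ValueError("time_span must be positive")
    n_cycles = sum(1 for dim, _, _ in bars if dim >= 1)
    return n_cycles / time_span

bars = [
    (0, 0.0, 40.0),     # hypothetical zero-dimensional bar, ignored by both measures
    (1, 96.0, 126.0),   # one-dimensional generator from Figure S4D
    (1, 50.0, 60.0),    # hypothetical shorter one-dimensional bar
]
print(topological_obstruction(bars))        # 30.0
print(irreducible_cycle_rate(bars, 10.0))   # 0.2, for a hypothetical 10-unit time span
```

Keeping the normalization as an explicit argument lets the same function serve simulated data, where time is counted in generations, and sequence collections, where it is a span of sampling dates.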

Figure S11. Barcoding plots of dengue virus polyprotein show no evidence of

recombination. Analyzing A) DENV-1, B) DENV-2, C) DENV-3, and D) DENV-4

subtypes uncovered topological structure at dimension 0 only. For specific parameters,

see Methods.

[Figure S1. Labels: Gene A, Gene B, Gene C; Parental Strains; Reassortant Strain; R.]

[Figure S2. Label: ε.]

[Figure S3. Labels: Two 0-D Components; 1-D Hole; 2-D Hole.]
