Estimation of the Duplication History under a Stochastic ...ffh8x/d/smtd.pdf · duplication and substitution mutations. The starting point is a short sequence which we refer to as

Estimation of the Duplication History

under a Stochastic Model for Tandem Repeats

Farzad Farnoud Hassanzadeh†, Moshe Schwartz‡, Jehoshua Bruck�

†University of Virginia, ‡Ben-Gurion University of the Negev, �California Institute of [email protected], [email protected], [email protected]

Abstract. We present a stochastic model for tandem duplication and substitution mutations thatcan be used to estimate relative mutation rates and the total number of mutations from a singlesequence. Important parameters of the model include the probability of a substitution mutation andthe probabilities of tandem duplications of various lengths. Our model indicates that if the probability ofsubstitution mutations is insigni�cant, little information can be obtained from a single sequence (whichhas undergone only tandem duplication). On the other hand, in the presence of both substitution andtandem duplication, one can estimate the parameters of the model, using which we can estimate thetotal number of mutations of each type. We validate our estimation method via Monte Carlo simulationand show that it outperforms the state of the art algorithm. We also apply our method to the tandemrepeat regions in the human genome, where it demonstrates the di�erent behavior of micro- and mini-satellites and can be used to compare mutation rates across chromosomes.

1 Introduction

Tandem repeats, which form about 3% of the human genome [1], are segments of DNA that primarily consistof repeats of a certain subsequence. The number of copies in tandem repeats is highly variable and is proneto change due to tandem duplication mutations. Furthermore, tandem repeats are also subject to pointmutations [2]. The variability of tandem repeats enables them to be used in population genetics [3] as wellas human identity testing [4]. Tandem repeats are also important since they may cause expansion diseases,gene silencing [5], and rapid morphological variation [6].

In this work, we present and analyze a model of the evolution of tandem repeat sequences via tandemduplication and substitution mutations. The starting point is a short sequence which we refer to as the seed.At each mutation step, either a tandem duplication or a substitution mutation occur, each with a givenprobability. Furthermore, the probability of a tandem duplication depend on the length of the duplicationevent. We show analytically that certain statistical features of the sequence converge as the number ofmutations increases. This in turn allows us to i) predict the behavior of the sequence after a large number ofsteps if we have the parameters of the model, or ii) estimate the parameters of the model given the sequenceafter a large number of mutations. In other words, given a sequence that is the result of the aforementionedprocess, we can estimate mutation probabilities without any other information.

We study two separate cases in the evolution of tandem repeats. First, we consider the case in whichsubstitution mutations do not occur and the only type of mutation is tandem duplication. We show thatin this case, while the prediction of evolutionary behavior is easy, there is not enough information in thesequence, after a large number of mutations, for accurate estimation of model parameters, including theprobabilities of tandem duplications of given lengths. This is because as the number of mutations increases,the sequence demonstrates periodic behavior, lacking features that can be leveraged for estimation. Perhapssurprisingly, the period of this sequence is not necessarily the most common or the shortest possible tandemduplication length.

We then consider the more interesting case in which both tandem duplication and substitution mutationsoccur. In this case, substitutions disrupt the periodic pattern that would arise from tandem duplications. Asa result, after a large number of mutations, the resulting sequence is more complex and informative, allowingus to estimate the model parameters. Speci�cally, from this sequence we can estimate the probability of asubstitution in each step, as well as the probabilities of tandem duplications of di�erent lengths. Furthermore,we can estimate the total number of mutations that gave rise to the sequence under study. We will applythis method to the tandem repeats in the human genome, which enables us to investigate the prevalence of

0 2 4 6 8 10

0

0.2

0.4

0.6

0.8

Fig. 1: An example of estimating q as q̂.

substitutions in repeats of di�erent lengths and to compare the average number of mutations across chromo-somes. We demonstrate that longer repeats have a relatively higher number of substitution mutations andthat the average numbers of mutations in some chromosomes is higher than others. Interestingly, this agreeswith another measure of mutation activity, i.e., comparison with the chimpanzee genome: The chromosomeswith higher number of mutations in repeated regions are the same as the ones that have diverged most fromchimpanzee chromosomes.

In previous work on modeling tandem duplication and substitution mutations, it is often assumed thatat each step, the length of the sequence grows by at most one repeat unit, see e.g., [7], [8], and referencesin the latter. This assumption appears to be more based on its simplifying e�ect rather than evidencesupporting it. Our model however allows duplications of length longer than one repeat unit at a timeand makes estimating the probabilities of such duplications possible. Note that models that do not allowlonger duplications may underestimate the probability of substitution and overestimate the probability oftandem duplication since more duplication events are needed to account for the observed copy number.Furthermore, unlike [7] and [8] that use Markov chains and branching processes for modeling, our analysisis based on stochastic approximation, which enables the description of new aspects of the problem. Inparticular, we see that the observed period in a tandem repeat sequence is not necessarily the most commonduplication length (Theorem 2) and that the presence of substitutions allows the estimation of mutationprobabilities (Equation (14)).

The Model and Examples of Estimation

Let s be a circular sequence over some alphabet A that evolves over time through tandem duplication andsubstitution mutations. The reason that we choose s to be a circular string, and not a linear one, is to avoidthe technical di�culties of dealing with its boundaries. The starting sequence, called the seed, is denoted bys(0). At each step i, a mutation converts the sequence s(i−1) to s(i). In our notation s(i) is the instance of sat time i. However, if it causes no ambiguity, we may use s instead of s(i). We use Li to denote the lengthof s(i).

If the mutation occurring at time i is a substitution, its position is chosen at random among all symbolsof s(i). That symbol is then changed randomly to one of the other symbols of A. If the mutation is atandem duplication, the length ` of the substring that is duplicated is chosen according to a predetermineddistribution and its position is chosen randomly and uniformly. We use q0 to denote the probability thatthe event at any given step is a substitution and use q`, ` > 0, to denote the probability that a substring of

length ` is duplicated in tandem. We assume that there exists K such that q` = 0 for ` > K. Finally, we letq = (q0, q1, . . . , qK) where

∑Ki=0 qi = 1. Note that q represents di�erent mutation probabilities given that a

mutation occurs and not the mutation probabilities per generation.Let A = {A,C,G,T}. As an example, suppose that s(i−1) = TCACAGT, and that at time i the underlined

substring of length ` = 3 is duplicated in tandem. The result equals s(i) = TCACAACAGT, where the insertedcopy of ACA is over-lined. If in the next step a substitution occurs, the result may be s(i+1) = TGACAACAGT.

In the estimation task, one is given the sequence s(n) and from this sequence only, she must recoverthe values of q and n. Postponing the details to Section 4.1, we present two example of estimating theseparameters to make the task more clear. In the �rst example, the sequence s(n) is obtained via simulationby letting s(0) = ACAAGATGC, q0 = 0.1533, qd = 0.7212, q2d = 0.0784, q3d = 0.0471 for d = 9 (thesevalues are chosen randomly as described in Section 4.1). We randomly apply n = 100 mutations to s(0). Theresult s(100) = ACAAGATGCACAAGA · · · along with d is then passed to the estimator (described is Section 4and referred to as SEMR). This provides us with the estimate n̂ of n as 64.53 and the estimate q̂ of q asin Figure 1. Note that the only information given to the estimator is the sequence s(100) and d. It can beobserved that n̂ and q̂ are reasonable estimates of n and q. Furthermore, as we show later, the accuracy ofthe estimates increases as the number n of mutations grows.

The next example illustrates the estimation of parameters for a segment of the human genome, namelyChromosome 1: 933,911�935,015 (Ensembl), given in Figure 2. For this sequence, SEMR gives the proba-

bility of substitution as q̂0 = 0.3687, and the probabilities of duplications of length i as q̂28i as (q̂28i)5i=1 =

(0.1499, 0.1257, 0.0232, 0.2656, 0.0669). All other q̂i are estimated to be 0. Furthermore, the number of mu-tations is estimated as n̂ = 17.67 out of which 11.15 are estimated to be tandem duplications and 6.51substitutions.

GCTCCGTTACAGGTGGGCAGGGGAGGCG GCTGCGTTACAGGTGGGCAGGGGAGGCGGCTGCGTTACAGGTGGGCAGGGGAGGCG GCTGCGTTACAGGTGGGCAGGGGAGGCGGCTGCGTTACAGGTGGGCAGGGGAGGCG GCTCCGTTACAGGTGGGCAGGGGAGGCGGCTGCGTTACAGGTGGGCAGGGGAGGCG GCTGCGTTACAGGTGGGCAGGGGAGGCGGCTGCGTTACAGGTGGGCAGGGGAGGCG GCTCCGTTACAGGTGGGCAGGGGAGGCGGCTCCGTTACAGGTGGGCAGGGGAGGCG GCTGCGTTACAGGTGGGCAGGGGAGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCG GCTGCGTTACAGGTGGGCGGGGGAGGCGGCTGCGTTACAGGTGGGCAGGGGGGGCG GCTGCGTTACAGGTGGGCGGGGGAGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCG GCTCCGTTACAGGTGGGCGGGGGAGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCG GCTGCGTTACAGGTGGGCGGGGGGGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCT GCTCCGTTACAGGTGGGCGGGGGAGGCTGCTCCGTTACAGGTGGGCGGGGGGGGCG GCTGCGTTACAGGTGGGCGGGGGGGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCG GCTGCGTTACAGGTGGGCGGGGGAGGCGGCTCCGTTACAGGTGGGCGGGGGAGGCG GCTGCGTTACAGGTGGGCGGGGGAGGCGGCTGCGTTACAGGTGGGCAGGGGAGGCG GCTGCGTTACAGGTGGGCAGGGGAGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCG GCTCCGTTACAGGTGGGCGGGGGAGGCGGCTGCGTTACAGGTGGGCGGGGGAGGCG GCTGCGTTACAGGTGGGCGGGGGAGGCGGCTGCGTTACAGGTGGGCGGGGGGGGCG GCTGCGTTACAGGTGGGCGGGCGG

Fig. 2: A tandem repeat segment of the human genome.

The general analysis of the model is presented in the next section. This analysis is specialized to thecases with and without substitutions in Sections 3 and 4, respectively. Simulation results of our estimationmethod, as well as its application to the human genome, are also given in Section 4. The paper is concludedin Section 5.

2 General Analysis via Stochastic Approximation

In this section, we �rst present a general framework for the problem of analyzing the evolution of sequencesunder duplication and substitution mutation processes using stochastic approximation, which relates theproblem to an ordinary di�erential equation (ode). We then specialize this method to the analysis of theautocorrelation function in systems with tandem duplication and substitution.

For an ordered set U, let Rn = (Run)u∈U be a vector representing the number of appearances of objects

u ∈ U in the sequence s at time n and let ρn = Rn

Lnbe the normalized version of Rn. For example, U can

be the set of all strings over A with length at most three. Our goal is to �nd out how ρn changes with n by�nding a di�erential equation whose solution approximates ρn.

De�ne Fn to be the �ltration generated by the random variables {ρn, Ln}. Furthermore, let E`[ · ]denote the expected value conditioned on the fact that the length of the duplicated substring is ` and letδ` = E`[Rn+1|Fn]−Rn. Recall that q0 is the probability of a substitution and qi, i > 0 is the probability ofthe event that a sequence of length ` = i is duplicated. The following set of conditions must be satis�ed forour analysis. Among them, we take (A1) to be given. It is easy to see that (A2)-(A3) are true, and (A4)-(A5)will be evident after we �nd δ`.

A 1. There exists K ∈ N such that qi = 0 for i > K.

A 2. Rn+1 −Rn is bounded and, thus, so is δ`.

A 3. ρn is bounded.

A 4. For each `, δ` is a function of ρn only, so we can write δ` = δ`(ρn).

A 5. The function δ`(ρn) is Lipschitz.

To understand how ρn varies, our starting point is the di�erence sequence ρn+1 − ρn. We note that

ρn+1 − ρn = E[ρn+1 − ρn

∣∣Fn

]+(ρn+1 − E

[ρn+1

∣∣Fn

]). (1)

For the �rst term of the right side of (1), we have

E[ρn+1 − ρn

∣∣Fn

]=

K∑`=0

q`(E`

[ρn+1

∣∣Fn

]− ρn

)=

K∑`=0

q`

(Rn + δ`(ρn)

Ln + `− Rn

Ln

)

=1

Ln

K∑`=0

q`h`(ρn)(1 +O

(L−1n

))=

1

Lnh(ρn)

(1 +O

(L−1n

)), (2)

where h`(ρ) = δ`(ρ)− `ρ, h(ρ) =∑K

`=0 q`h`(ρ), and where we have used 1/(Ln + `) =(1 +O

(L−1n

))/Ln

which follows from the boundedness of ` (see (A1)).Furthermore, for the second term of the right side of (1), we have

ρn+1 − E[ρn+1

∣∣Fn

]=Rn+1

Ln+1− E

[Rn+1

Ln+1

∣∣∣∣Fn

]=

1 +O(L−1n

)Ln

(Rn+1 − E[Rn+1|Fn])

=1

Ln

(1 +O

(L−1n

))Mn+1 (3)

where Mn+1 = Rn+1 − E[Rn+1|Fn]. Note that Mn is a bounded martingale di�erence sequence.From (1), (2), and (3), we �nd ρn+1 = ρn+

1Ln

(h(ρn) +Mn+1 +O

(L−1n

)), where we have used the fact

that h(ρn)(1 +O

(L−1n

))= h(ρn) + O

(L−1n

). This follows from the boundedness of h(ρn), which in turn

follows from the boundedness of δ(ρn).The following theorem relates the discrete system describing ρn to a continuous system.

Theorem 1 ([9, Theorem 2]). The sequence {ρn} converges almost surely to a compact connected internally

chain transitive invariant set of the ode

dρtdt

= h(ρt). (4)

While di�erent properties of the sequence can be analyzed via the aforementioned method, for ourpurpose, the autocorrelation of the sequence is the most suitable. The autocorrelation function Rr of asequence s = s1 · · · s|s|, si ∈ A, at lag r, is de�ned as

Rr =

|s|∑i=1

〈si, si+r〉,

where each index of s should be replaced by its representative (modulo |s|) in the residue system {1, . . . , |s|},and where 〈α, β〉 = 1 i� α = β and 〈α, β〉 = 0 otherwise.

Let Rrn denote the autocorrelation of function after n mutations starting from the seed sequence and

let ρrn =Rr

n

Ln. We study the vectors Rn =

(R0

n, R1n, . . . , R

m̂−1n

)and ρn = Rn

Ln, for a constant m̂. Note that

R0n = Ln and ρ0n = 1.

Let s = s(n) = s1 · · · s|s| be the sequence at time n. At time n+ 1, either a substitution or a duplicationhas happened. In the former case, suppose the symbol at position i is changed to another symbol of thealphabet, and in the latter case, suppose that the substring si+1 · · · si+` is duplicated in a tandem manner;after duplication the sequence becomes

s1 · · · sisi+1 · · · si+`si+1 · · · si+`si+`+1 · · · s|s|.

Fix the value of i. For ` = 0, i.e., the case of a substitution,

Rrn+1 = Rr

n − 〈si, si+r〉 − 〈si, si−r〉+ 〈s′i, si+r〉+ 〈s′i, si−r〉,

where s′i denote the new (mutated) symbol.

Now we consider the case of ` > 0, which corresponds to tandem duplications. For 0 < ` ≤ r, we have

Rrn+1 = Rr

n −i∑

j=i+`−r+1

〈sj , sj+r〉+i+∑̀

j=i+`−r+1

〈sj , sj+r−`〉.

The conditions resulting in the �rst summation j ≤ i and j + r > i + ` and those resulting in the secondsummation are i < j ≤ i+ ` or i+ ` < j + r ≤ i+ 2`. For ` > r,

Rrn+1 = Rr

n +

i+`−r∑j=i+1

〈sj , sj+r〉+i+∑̀

j=i+`−r+1

〈sj , sj+r−`〉.

Let h`(ρn) =(h0`(ρn), . . . , h

m̂` (ρn)

)= E`[Rn+1 −Rn|Fn]− `ρn. To compute h`, we �rst �nd the following

expected values, where r > 0 and where i is randomly and uniformly distributed among the Ln positions:

E0[〈si, si+r〉|Fn] = E0[〈si, si−r〉|Fn] = ρrn,

E0[〈s′i, si+r〉|Fn] = E0[〈s′i, si−r〉|Fn] =1− ρrn

3,

E`

i∑j=i+`−r+1

〈sj , sj+r〉

∣∣∣∣∣∣Fn

=1

Ln

Ln∑i=1

i∑j=i+`−r+1

〈sj , sj+r〉

= (r − `)ρrn,

E`

i+∑̀j=i+`−r+1

〈sj , sj+r−`〉

∣∣∣∣∣∣Fn

=1

Ln

Ln∑i=1

i+∑̀j=i+`−r+1

〈sj , sj+r−`〉

= rρr−`n ,

E`

i+`−r∑j=i+1

〈sj , sj+r〉

∣∣∣∣∣∣Fn

=1

Ln

Ln∑i=1

i+`−r∑j=i+1

〈sj , sj+r〉

= (`− r)ρrn.

From these we �nd that

hr`(ρ) =

{− 8

3ρr + 2

3 , ` = 0

rρr−` − rρr, ` > 0(5)

and

d

dtρrt = q0

(−8

3ρrt +

2

3

)+ r

∑`>0

q`ρr−`t − (1− q0)rρrt (6)

for r > 0. We thus see that the set of equations governing ρ are linear.Suppose that m̂ ≥ K and recall that ρ =

(ρ0, . . . , ρm̂−1

). For 1 ≤ r ≤ m̂ − 1, the maximum value for j

such that, in the equation for ρr, the coe�cient of ρj is non-zero is max1≤r≤m̂{r,K − r} ≤ m̂− 1. We thushave

d

dtρt = Aρt, (7)

where A is the m̂× m̂ matrix whose rows and columns are indexed by {0, 1, . . . , m̂− 1} and its elements aregiven as

Arj =

2q0/3 + rqr, if j = 0

rqr−j + rqr+j , if 0 < j < r

q0(r − 8

3

)+ rq2r − r, if j = r

rqr+j , if j > r

(8)

for r > 0 and as A0j = 0 for all j. For example, if m̂ = 7,K = 4, the system (7) is given by

ρ̇0

ρ̇1

ρ̇2

ρ̇3

ρ̇4

ρ̇5

ρ̇6

=

0 0 0 0 0 0 02q03 + q1 − 5q0

3 + q2 − 1 q3 q4 0 0 02q03 + 2q2 2q1 + 2q3 − 2q0

3 + 2q4 − 2 0 0 0 02q03 + 3q3 3q2 + 3q4 3q1

q03 − 3 0 0 0

2q03 + 4q4 4q3 4q2 4q1

4q03 − 4 0 0

2q03 5q4 5q3 5q2 5q1

7q03 − 5 0

2q03 0 6q4 6q3 6q2 6q1

10q03 − 6

ρ0

ρ1

ρ2

ρ3

ρ4

ρ5

ρ6

. (9)

To determine the stability of (7) we use the Gershgorin circle theorem. We note that∑

j Arj = −2q0 andthat Arr = − 8q0

3 − r(1− q0 − q2r) for r > 0. The circles centered at Arr and with radius∑

j Arj − Arr =

2q03 + r(1− q0 − q2r) in the complex plane either do not intersect the right half of the plane or they intersectit only at 0. Hence, by the Gershgorin circle theorem, the eigenvalues of A are either 0 or have negative realparts. Let (λj)

mj=0 denote the eigenvalues of A, with λ0 = 0 and λj = aj + ıbj for j > 0, where aj < 0 and ı

denotes√−1. For 0 ≤ r ≤ m̂− 1, we have

ρrt = cr0(t) +

m∑j=1

cjk(t)eajt+ıbjt,

where crj(t) are polynomials in t of degree at most m̂. Since ρr(t) is bounded between 0 and 1, it is evidentthat cr0(t) is in fact a constant. Let this constant be denoted by ρr∞. We thus have

ρrt = ρr∞ +

m∑j=1

cjk(t)eajt+ıbjt, (10)

which implies that(ρ0t , . . . , ρ

m̂−1t

)converges to ρ∞ =

(ρ0∞, . . . , ρ

m̂−1∞

). Note that ρ0∞ = 1.

From (10), we have limt→∞ddtρ

rt = 0. By taking the limit of the equation d

dtρt = Aρt as t→∞, it followsthat

Aρ∞ = 0,

implying that ρ∞ is in the null space of A. Furthermore similar to [10], one can use (10) to show that thenull space of A is in fact a set that satis�es the conditions of Theorem 1 for the di�erential equation (7).Thus ρn converges almost surely to a point in the null space of A.

In the following sections, we consider the null space of A in two cases. First, we assume q0 = 0, that is,there are no substitutions. Next we study the case with positive probability of substitutions, i.e., q0 > 0.

3 Tandem Duplication

In this section, we consider the case in which the only type of occurring mutations is tandem duplication.We show that in this case the null space of A is surprisingly simple. In particular, every vector in the nullspace is periodic with period d, where d is the gcd of all length i for which the probability of duplication qiis positive. An interesting fact is that d only depends on qis being positive or not, but it does not furtherdepend on their values. The following theorem, which is the main result of this section, describes how theautocorrelation behaves as a result of the periodicity property of the null space.

Theorem 2. Suppose q0 = 0. Let P = {i : i > 0, qi > 0} and d = gcdP . The normalized autocorrelation

ρn =(ρ0n, . . . , ρ

m̂−1n

)converges almost surely to a vector ρ∞ =

(ρ0∞, . . . , ρ

m̂−1∞

), where ρj∞ is periodic in j

with period d, ρj∞ = 1 if j ≡ 0 mod d, and ρj∞ = ρd−j∞ . In particular, every pair of symbols at distance d in

s(n) are, with high probability, the same.

The fact that the limit of ρn is periodic with period d enables us to easily predict its behavior for largevalues of n. On the other hand, since given P , d does not depend on the values of qi, observing d does notenable us to estimate q.

To prove Theorem 2, we �rst prove the following lemma.

Lemma 3. Let q0 = 0, P = {i > 0 : qi > 0}, and d = gcdP . Furthermore, let S(t) = Span{v0, . . . , vbt/2c

},

where vi = (vi,0, . . . , vi,m−1)T, with

vi,j =

{1, j ≡ ±i mod t,

0, else.

We have Null(A) = S(d).

Proof. Let B = (Brj) be an m̂× m̂ matrix, with rows and columns indexed by 0, . . . , m̂− 1, de�ned as

Brj =

qr, if j = 0

qr−j + qr+j , if 0 < j < r

q2r − 1, if j = r

qr+j , if j > r

(11)

Since q0 = 0, we have Null(B) = Null(A). We further recall that qi ∈ R, qi ∈ [0, 1], and∑

i qi = 1.Additionally, assume i1 < i2 < · · · < ik are the only indices for which qij > 0. Finally, we assume n is largeenough to enable us to see all the nonzero qi's in the matrix, or more formally, we require m̂ ≥ ik.

We are interested in �nding the null-space of B. Instead of doing this directly, we consider the matrix

A′ = I +B.

The goal now is to �nd the right eigenspace of A′ for the eigenvalue 1.

First we prove S(d) ⊆ Null(B). We do this by showing that for all vi ∈ S(d), vi is in the right eigenspaceof A′ corresponding to the eigenvalue 1, i.e., A′vi = vi. This is immediate when we note that when i ≡ ±a(mod d), then in the ith row of A′ (numbering of rows and columns starts from 0), coordinates j ≡ ±a(mod d) contain all the elements qmd, and in particular, qi1 , qi2 , . . . , qik .

It is obvious that the vectors in S(d) are linearly independent. To complete the proof we need to showthat the geometric multiplicity of the eigenvalue 1 of A′ is |S(d)| = bd/2c+ 1.

The matrix A′ is stochastic, and therefore, its spectral radius is ρ(A′) = 1. Let GA′ be the (weighted)directed graph whose adjacency matrix is given by A′. By Perron-Frobenius theory, it is well known thatthe eigenvalues of A′ are the union (in the multiset sense) of the eigenvalues of the irreducible componentsof GA′ . Additionally, the geometric multiplicity of ρ(A′) (also called the Perron-Frobenius (PF) eigenvalueof A′) is 1 for each irreducible component.

Combining the above, and remembering the PF eigenvalue of an irreducible graph is a weighted average ofthe out-degree of its vertices, we obtain that the geometric multiplicity of ρ(A′) = 1 is exactly the number ofirreducible sink components of GA′ . Thus, as a �nal step in the proof, we show that the number of irreduciblesink components of GA′ is exactly bd/2c+ 1.

Let us denote the vertices of the graph GA′ by w0, w1, . . . , wm̂−1. From each w`, ` > 0, we have k out-going edges corresponding to i1, i2, . . . , ik. The edge corresponding to ij is directed from w` to w`−ij when` ≥ ij , and otherwise to wij−`. When describing a path we shall refer to this edge as �taking ij from w`�.Finally, vertex w0 has a single out-going edge which is also a self-loop.

By construction, all vertices w` for ` ≥ ik have incoming edges from vertices w`′ with `′ > `. Thus, they are

certainly not part of an irreducible sink component. We therefore concentrate on vertices w0, w1, . . . , wik−1only. We now look at the irreducible components over these vertices.

For each ij , starting from w`, 0 < ` < ik, we can take a path representing the orbit of −ij modulo ik. If` ≥ ij , then we can move from w` to w`−ij . If ` < ij we can also do this but we require two steps: from w`

to wij−` by taking an ij step, and then from wij−` to wik−(ij−`) by taking an ik step. Indeed

ik − (ij − `) ≡ `− ij (mod ik).

Since we can do this for every ij every node w` is connected to every node w`′ with ` ≡ `′ (mod d). Also,by taking the ik edge from each node, we can see that from every node w` we can reach every node w`′ with` ≡ ±`′ (mod d).

The only exception to the above are nodes w` with ` ≡ 0 (mod d) since they get �stuck� at v0, which isan irreducible sink component on its own. We therefore reach the conclusion that there are exactly bd/2c+1irreducible sink components which are of the form

Va = {w` | 0 < ` < ik, ` ≡ ±a (mod d)},

for 0 < a ≤ d/2, as well as V0 = {w0}.

Proof of Theorem 2. Since ρ∞ is in the null space of A, where the null space of A is given by Lemma 4, ρ∞is a linear combination of the vectors S(d). Furthermore, by de�nition we know that ρ0∞ = 1. In the basis ofS(d) given in Lemma 4, the only vector that has a nonzero element in the 0th coordinate is

v0,j =

{1, j ≡ 0 mod d,

0, else.

So the coe�cient of v0,j in the linear combination describing ρ0∞ is 1 and thus ρ0∞ = 1 if j ≡ 0 mod d. Wehence have Theorem 2.

4 Tandem Duplication and Substitution

Lemma 4. Let q0 > 0, P = {i > 0 : qi > 0}, and d = gcdP . We have Null(A) = Span(v), where v =

(v0, . . . , vm−1)Tis a vector satisfying v0 = 1 and vj = 1/4 for j 6≡ 0 mod d.

For example, for d = 3, v is of the form (1, 1/4, 1/4, v3, 1/4, 1/4, v6, 1/4, . . . , vm−1)T.

Proof. Checking that Av = 0 is not di�cult. Using (12), we can see that for Av = 0 to hold, the conditionsthat vr for r 6≡ 0 mod d must satisfy are

vr = q0

(2

3r(1− 4vr) + vr

)+

∑k:1≤kd≤m̂

v|kd−r|qkd,

which are satis�ed if we let vr = 1/4 for all r 6≡ 0 mod d.

We now consider both tandem duplication and substitution mutations and describe how the parametersof the model, as well as the number of mutations of each type, may be estimated. In this setting, while theparameters of the model are unknown, we have access to the sequence s(n) for some large value of n.

First, note that we can rewrite the equation Aρ∞ = 0, where A is the matrix given in (8), as

C

q0q1...qm̂

=

ρ1∞2ρ2∞...

(m̂− 1)ρm̂−1∞

, (12)

where C = (Cri) is a (m̂− 1)× (m̂+ 1) matrix whose elements are

Cri =

{23 +

(r − 8

3

)ρr, i = 0

rρi−r, else,

where r ∈ {1, . . . , m̂− 1} and i ∈ {0, 1, . . . , m̂}. For example, for m̂ = 7,

C =

23 −

53ρ

1 ρ0 ρ1 ρ2 ρ3 ρ4 ρ5 ρ623 −

23ρ

2 2ρ1 2ρ0 2ρ1 2ρ2 2ρ3 2ρ4 2ρ523 + 1

3ρ3 3ρ2 3ρ1 3ρ0 3ρ1 3ρ2 3ρ3 3ρ4

23 + 4

3ρ4 4ρ3 4ρ2 4ρ1 4ρ0 4ρ1 4ρ2 4ρ3

23 + 7

3ρ5 5ρ4 5ρ3 5ρ2 5ρ1 5ρ0 5ρ1 5ρ2

23 + 10

3 ρ6 6ρ5 6ρ4 6ρ3 6ρ2 6ρ1 6ρ0 6ρ1

Given the sequence s(n), we can now estimate the values of qi. To do so, from s(n) we obtain ρn =(

ρ0n, . . . , ρm̂−1n

). Since n is large, we can approximate ρ∞ by ρn. In our model, there exists K such that

qi = 0 for i > K. However, the value of K is unknown to us. We thus choose some m̂′ and assume thatqi = 0 for i > m̂′. The value of m̂′ can be chosen for example based on our knowledge of the underlyingbiological processes, such as slipped strand mispairings [11], that leads to tandem repeats. Furthermore, the

value of m̂′ should be chosen large enough so that m̂′ ≥ K with a high degree of con�dence. The equality(12) becomes

C ′

q0q1...qm̂′

=

ρ1n2ρ2n...

(m̂− 1)ρm̂−1n

, (13)

where C ′ is the matrix containing the �rst m̂′ + 1 columns of C. Now to obtain an estimate of q =(q0, q1, . . . , qm̂′) we can solve the least-square curve �tting problem

q̂ = argminq‖C ′q− ρn‖

22

s.t. 1qT = 1 (14)

qi ≥ 0, for 0 ≤ i ≤ m̂′.

The solution q̂ of this problem contains an estimate of the substitution probability q0 and the probabilitiesq` of duplications of lengths `. Furthermore, since s(n) is a sequence of tandem repeats, if we assume that itis generated from one copy of its repeated substring, we can estimate the total number n of mutations thathave occurred as

n̂ =

∣∣s(n)∣∣∑m̂′

i=1 iq̂i, (15)

where we have assumed the length of the seed s(0) is small and where the denominator represents the expectedgrowth in the length of the sequence at each step. The estimator based on (12) and (15) is referred to asSEMR.

In tandem repeat sequences observed in genomes, such as the one given in Figure 2, it is clear thatduplication events have lengths that are multiples of a certain value, leading to a pattern of that lengthappearing many times. We refer to this length as the period and the number of times that the patternappears is referred to as the copy number. While in general SEMR does not need to know the period toprovide estimations of the mutation probabilities, if it is known, we can set qi = 0 for i not divisible by theperiod.

Since our method relies on asymptotic approximation, for short sequences, speci�cally those with copynumber ≤ 3, we provide an alternative estimation algorithm. In such sequences there is either 1 or 2 duplica-tion events of length d and 0 or more substitutions. The number of duplication can be found easily from thelength of the sequence. Let ai be the number of distinct symbols appearing at the ith position (relative tothe start of the pattern) of di�erent copies minus 1. For example, for ACTGCTACT, we have a1 = 1 since twosymbols, A and T, appear in the �rst position of di�erent copies. The ai can be used to infer the number ofsubstitutions. A substitution will contribute to ai if it occurs before the �rst duplication event. To account

for hidden substitutions, we estimate the number of substitutions as (∑

i ai)(r+1)

r , where r is the number ofduplication events. So we have both estimates of the number of substitutions and duplications. Note that inthis simple analysis, we have assumed that each substitution results in a symbol that has not been observedin that position in any of the copies, which is a reasonable assumption for a small number of mutations.

4.1 Simulation Results

In this section, we use simulation to evaluate the performance of SEMR by comparing its estimates of themodel parameters with the true values. We also compare SEMR to DTSCORE introduced by Elemento andGascuel [12], which was shown to outperform other similar methods [13]. We show that SEMR provides moreaccurate estimates and is signi�cantly faster compared to DTSCORE.

In our simulation set up, we �rst generate a random seed s(0) that then undergoes n random substitutionsand tandem duplications, where the probabilities of these events are given by q, itself randomly generated.The resulting sequence s(n) is then passed to estimators based on (14) and (15), which of course do notknow s(0), n, or q. We evaluate the performance by �nding the L2 error in estimating q̂, ‖q̂−q‖2, averaged

across N experiments for each value of n. We also �nd the normalized root mean square (NRMS) error inestimating n. For a given value of n, NRMS error is de�ned as

NRMSE(n, n̂) =1

n

√√√√ 1

N

N∑i=1

(n̂i − n)2 ,

where N is the number of experiments with n mutations and ni is the estimate for n in the ith experiment.

We �nd the errors for two di�erent cases: for a pair of given values for n and q̂, we estimate n̂ andq̂ based on 1) a single sequence and 2) ns sequences generated with parameters q and n. In the lattercase, estimates are obtained for each sequence individually and then averaged. The multiple-sample caseis intended to show that performance improves, as expected, with more data. Due to the large number oftandem repeat sequences in many genomes, it is reasonable to expect that for a set of factors a�ectingduplication probabilities, a given set of values for these factors may arise multiple times, and so, in suchcases, we may expect a similar performance improvement in aggregate results.

50 100 150 200 250 300 350 400 450 5000.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

(a)

50 100 150 200 250 300 350 400 450 5000.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

0.6

(b)

Fig. 3: Errors of the estimate q̂ of q, (a), and the estimate n̂ of n, (b) .

More detail on the simulation setup is given below. The results are given in Figure 3 where n ranges from10 to 500, with step size equal to 10. For each value of n, the experiment is performed N = 500 times, andin each of these N trials, estimates are obtained based on a single sequence and based on ns = 5 sequencesdrawn for the same seed, q, and n̂. We observe that as n increases, the errors sharply decrease. For a singlesequence and a small number of mutations, the estimation algorithm relies on a very limited amount of data.As the number n of mutations increases the sequence becomes longer, providing more data in the form ofthe autocorrelation function, and simultaneously asymptotic approximations become more accurate. It isalso observed that as expected with more than one sample for the same set of parameters, more accurateestimates are obtained.

We now describe the process of sequence generation and estimation in more detail. The seed sequences(0) is a random sequence over the alphabet {A,C,G,T} of a random length that is chosen uniformly fromthe set {4, 5, . . . , 10}.

{(α1, α2, α3, α4) | α1 + · · ·+ α4 = 1, α1, . . . , α4 ≥ 0}.

All other values of q are set to 0. We then perform n mutation steps, each a substitution with probability q0or a tandem duplication of length id with probability qid, for i ∈ {1, 2, 3}. If in a tandem duplication step,the length chosen for duplication based on distribution q is larger than the length of the sequence, the wholesequence is duplicated. Note that since the length of the sequence grows, such an event may only happena few times at the beginning of the process. A set of parameters and their estimates are given in Figure 1,which was discussed earlier.

20 40 60 80 100 1200.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

0.6

0.65

0.7

(a)

20 40 60 80 100 1200

5

10

15

(b)

Fig. 4: Comparison of SEMR (SE) and DTSCORE (DT): Error of the estimate q̂ of q, (a), and the averageexecution time for an instance of the problem, (b).

In the results in this paper (both in this section and the next), we set the computation parameters as

follows. First, we �nd ρ for {0, 1, . . . , b |s|2 c}, to ensure that each value of the autocorrelation function is the

average of at least |s|/2 values. Furthermore, we let m̂′ = min(max(10d, 5r∗),

⌊|s|2

⌋), where r∗ = argmaxi ρ

i.

The max here is intended to ensure that m̂′ is large enough, while the min ensures that all needed values of ρare available. The value of m̂ is set as m̂ = m̂′− 1. We note that in (15), if q̂0 is close to 1, then the estimaten̂ for n may be very large. It is reasonable to expect that n̂ is not larger than the length of the sequence.Thus, we heuristically add the constraint (d, 2d, . . . , m̂′d)(qd, q2d, . . . , qm̂′d)

T ≥ 1 to (14), which ensures thatin each mutation contributes at least 1 to the length of the sequence.

We now compare the performance of SEMR with DTSCORE [12]. DTSCORE is a distance-based algo-rithm designed to given a sequence �nd the duplication history in the form of a tree, thus providing estimatesfor the number of duplications of various lengths. In [13], it was shown that DTSCORE performs better thanother algorithms for identifying the duplication three, including TRHIST [14] and WINDOWS [15]. Dueto the slower speed of DTSCORE (the worst-case time complexity is O(L4), where L is the copy numberof the sequence), we restrict the range of the number of mutations n to {10, 20, . . . , 120} and also reduceN = 200 but maintain ns = 5. The comparison is given in Figure 4. Since from DTSCORE, we can onlyderive estimates for counts of duplications but not substitutions, we compare the accuracy of estimatingq′ = (q′1, q

′2, . . . ) where q

′i for i ≥ 1 is de�ned as

q′i =qi

1− q0·

From Figure 4.(a), it is clear that SEMR estimates q′ with signi�cantly higher accuracy than DTSCORE.Furthermore, if multiple samples from the same distribution are available, the improvement for SEMR islarger than for DTSCORE. Finally, the execution time of SEMR is faster than DTSCORE. In particular,for n = 120, on average, SEMR needs 0.015 seconds to compute the estimate, while DTSCORE needs 14.11seconds, almost 3 orders of magnitude longer.

4.2 Tandem Repeats in the Human Genome

In this section, we apply SEMR to tandem repeats in the human genome to estimate the number of sub-stitution and tandem duplication mutations for each. We use these estimates to explore the relationship ofmutation rates for minisatellite and microsatellites and across chromosomes.

We use the Tandem Repeats Database (TRDB) [16], which provides the set of tandem repeats in eachchromosome as identi�ed by the Tandem Repeat Finder (TRF) algorithm and related information suchas the length of the repeat unit (base duplication length) and indel (insertion/deletion) percentage. As apreprocessing step, among overlapping repeats, we keep only one. We also remove repeats with unknown(N) bases. Finally, we discard repeats whose indel percentage is nonzero, as our model does not includeinsertion and deletion mutations. As an example, the number of repeat sequences in chromosome 1 reducesfrom 93, 649 to 44, 837 as a result of preprocessing.

We applied the SEMR algorithm to repeats in each chromosome. The results for chromosome X are givenin Figure 5. Each point in this plot corresponds to a tandem repeat sequence. The position of each point isdetermined by the estimated number of tandem duplications and substitutions that occurred to create thesequence. It can be observed in this �gure that tandem repeat sequences can be divided into two classeswith di�erent behaviors: one dominated by tandem duplication mutations and the other by substitutionmutations. It appears that this di�erence in behavior can be explained by the length of the repeat pattern.Tandem repeats are categorized based on their lengths: Microsatellites are repeats whose repeat patterns are≥ 1 and ≤ 9 bases long. Minisatellites, on the other hand, are repeated sequences whose pattern is between10- and 100-bases long [?]. Our estimation results indicate that microsatellites are less prone to substitutionmutations, which agrees with previous �ndings [], and so this observation acts as a validation of the proposedmethod. Other chromosomes exhibit behavior similar to chromosome X illustrated here.

0 20 40 60 80 1000

10

20

30

40

50

60

70

80

90

100

Fig. 5: The mutation pro�le of tandem repeats in chromosome X.

Through comparison with the chimpanzee genome [17,18], it is known that mutation rates vary acrosschromosomes. To see this variation can be observed also in the repeated region, we study the number ofmutations in each tandem repeat sequence across chromosomes. The results are illustrated in Figure 6,

where we have estimated the number of mutations of each type for each tandem repeat sequence and theplotted the median for each chromosome. It can be observed that chromosomes 16, 19, 21, 22, and Y havethe highest number of mutations per repeated sequence, which is in agreement with [17]. On the otherhand, chromosome X has the smallest divergence from chimpanzee but here chromosome 14 has the smallestnumber of mutations. This suggests that while not exactly aligned, mutation rates in repeated regions arerepresentative of overall rates to a certain extent. It is important to note that the method proposed here,unlike prior work, relies on a single genome to make inferences about mutational activity.

1 2 3 4 5 6 7 8 9 10111213141516171819202122 X Y

Chromosome

0

1

2

3

4

5

6

7

Num

ber

of m

utat

ions

Tandem duplicationsSubstitutions

Fig. 6: Mutation variation across chromosomes: median of the total number of mutations per tandem repeatsequence for each chromosome.

5 Conclusion

In this paper, we introduced a new stochastic model for tandem duplication and substitution mutations, andanalyzed it via stochastic approximation. In particular, we fully characterized the limit set of the stochas-tic process described by the model. In addition to enabling us to predict the behavior of a sequence thatundergoes tandem duplication and substitution mutations, this characterization allowed us to derive a min-imization problem whose solutions are estimates of the mutation probabilities for tandem duplication andsubstitution. We showed further that it is possible to estimate the total number of mutations. Finally, weevaluated the estimation method via simulation by generating random sequences and comparing the esti-mated probabilities with the true values and also applied it to the human genome, where it demonstratedthe di�ering behavior of micro- and mini-satellites as well as the variability of mutation activity acrosschromosomes.

Acknowledgement

The authors would like to thank Han Mao Kiah for helpful discussions related to Lemma 4.

References

1. E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle,W. FitzHugh, et al., �Initial sequencing and analysis of the human genome,� Nature, vol. 409, no. 6822, pp. 860�921, 2001.

2. D. Pumpernik, B. Oblak, and B. BorÅ½tnik, �Replication slippage versus point mutation rates in short tandemrepeats of the human genome,� Molecular Genetics and Genomics, vol. 279, no. 1, pp. 53�61, 2008.

3. T. B. Sonay, T. Carvalho, M. D. Robinson, M. P. Greminger, M. KrÃ×tzen, D. Comas, G. Highnam, D. Mittel-man, A. Sharp, T. Marques-Bonet, and A. Wagner, �Tandem repeat variation in human and great ape populationsand its impact on gene expression divergence,� Genome Research (Accepted Manuscript), Aug. 2015.

4. J. M. Butler, �Genetics and Genomics of Core Short Tandem Repeat Loci Used in Human Identity Testing,�Journal of Forensic Sciences, vol. 51, no. 2, pp. 253�265, 2006.

5. K. Usdin, �The biological e�ects of simple tandem repeats: Lessons from the repeat expansion diseases,� GenomeResearch, vol. 18, pp. 1011�1019, July 2008.

6. J. W. Fondon and H. R. Garner, �Molecular origins of rapid and continuous morphological evolution,� Proceedingsof the National Academy of Sciences, vol. 101, no. 52, pp. 18058�18063, 2004.

7. S. Kruglyak, R. T. Durrett, M. D. Schug, and C. F. Aquadro, �Equilibrium distributions of microsatellite re-peat length resulting from a balance between slippage events and point mutations,� Proceedings of the NationalAcademy of Sciences, vol. 95, no. 18, pp. 10774�10778, 1998.

8. Y. Lai and F. Sun, �The Relationship Between Microsatellite Slippage Mutation Rate and the Number of RepeatUnits,� Molecular Biology and Evolution, vol. 20, no. 12, pp. 2123�2131, 2003.

9. V. S. Borkar, �Stochastic approximation,� Cambridge Books, 2008.10. F. Farnoud, M. Schwartz, and J. Bruck, �A stochastic model for genomic interspersed duplication,� in Proceedings

of the 2015 IEEE International Symposium on Information Theory (ISIT2015), Hong Kong, China SAR, pp. 904�908, June 2015.

11. G. Levinson and G. A. Gutman, �Slipped-strand mispairing: a major mechanism for DNA sequence evolution,�Molecular Biology and Evolution, vol. 4, no. 3, pp. 203�221, 1987.

12. O. Elemento and O. Gascuel, �An e�cient and accurate distance based algorithm to reconstruct tandem dupli-cation trees,� Bioinformatics, vol. 18, pp. S92�S99, Oct. 2002.

13. Olivier Gascuel, Denis Bertrand, and Olivier Elemento, �Reconstructing the duplication history of tandemlyrepeated sequences,� in Mathematics of Evolution and Phylogeny (Olivier Gascuel, ed.), ch. 8, Oxford, New York:Oxford University Press, May 2005.

14. G. Benson and L. Dong, �Reconstructing the duplication history of a tandem repeat.,� in ISMB, pp. 44�53, 1999.15. M. Tang, M. Waterman, and S. Yooseph, �Zinc �nger gene clusters and tandem gene duplication,� Journal of

Computational Biology, vol. 9, no. 2, pp. 429�446, 2002.16. Y. Gelfand, A. Rodriguez, and G. Benson, �TRDB�the Tandem Repeats Database,� Nucleic Acids Research,

vol. 35, pp. D80�87, Jan. 2007.17. A. Hodgkinson and A. Eyre-Walker, �Variation in the mutation rate across mammalian genomes,� Nature Reviews

Genetics, vol. 12, pp. 756�766, Nov. 2011.18. I. Ebersberger, D. Metzler, C. Schwarz, and S. Pääbo, �Genomewide Comparison of DNA Sequences between

Humans and Chimpanzees,� American Journal of Human Genetics, vol. 70, pp. 1490�1497, June 2002.

Documents

Estimation of the Duplication History under a Stochastic ...ffh8x/d/smtd.pdf · duplication and substitution mutations. The starting point is a short sequence which we refer to as