
MUMmer. 游騰楷, 杜海倫, 王慧芬, 曾俊雄. 2007/01/02. Outline: Suffix Tree, MUMmer 1.0, MUMmer 2.1, MUMmer 3.0, Conclusion


MUMmer

游騰楷 杜海倫 王慧芬 曾俊雄

2007/01/02

Outline

• Suffix Tree
• MUMmer 1.0
• MUMmer 2.1
• MUMmer 3.0
• Conclusion


Suffix Tree

游騰楷

Trie

• A trie, also called a prefix tree, is a tree that stores a set of strings.
• It works like a dictionary of those strings.
• Advantages compared to a BST:
– Search time.
– Space.
– Prefix matching.
– Balance.
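The trie described above can be sketched in a few lines of Python. This is only an illustration, not code from the slides: the class name `Trie`, its methods, and the `END` marker key are all hypothetical. Lookup time depends on the length of the key rather than on the number of stored strings, and prefix matching falls out of the same walk.

```python
class Trie:
    """A minimal trie stored as nested dictionaries (illustrative sketch)."""

    END = "$end"  # hypothetical marker key: "a stored string ends at this node"

    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            # Create the child for ch if it does not exist yet.
            node = node.setdefault(ch, {})
        node[Trie.END] = True

    def _walk(self, s):
        """Follow s from the root; return the node reached, or None."""
        node = self.root
        for ch in s:
            if ch not in node:
                return None
            node = node[ch]
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and Trie.END in node

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for word in ("tea", "ten", "to"):
    trie.insert(word)
print(trie.contains("tea"), trie.contains("te"), trie.has_prefix("te"))
```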


Suffix Trie and Suffix Tree

• A suffix tree, Tree(T), is a compact trie that represents all the suffixes of a string T.

Suffix Trie and Suffix Tree (cont.)

[Figure: the suffix trie (left) and the suffix tree (right) of abaab.]

Suffix Trie and Suffix Tree (cont.)

• Let |T| = n.
• A suffix trie needs $O(n^2)$ space.
– Consider $T = a^{n/2} b^{n/2}$.
• A suffix tree needs $O(n)$ space.
– Since each suffix can cause only one branch, the total number of nodes (edges) cannot exceed $O(n)$.
– In a node we record only the starting position of the corresponding suffix.
– On an edge we need not record the whole substring, only its starting and ending positions.
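The quadratic example can be checked numerically. The sketch below (the function name `suffix_trie_node_count` is made up for illustration) builds the uncompacted suffix trie naively and counts its nodes. For $T = a^{n/2} b^{n/2}$ every string $a^i b^j$ is a distinct substring, so the count comes out as $(n/2 + 1)^2$.

```python
def suffix_trie_node_count(text):
    """Count the nodes (root included) of the uncompacted suffix trie of text."""
    root = {}
    count = 1  # the root
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node:
                node[ch] = {}
                count += 1
            node = node[ch]
    return count


for half in (2, 4, 8, 16):
    text = "a" * half + "b" * half
    # The node count grows quadratically in len(text).
    print(len(text), suffix_trie_node_count(text))
```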

Suffix Trie and Suffix Tree (cont.)

[Figure: the suffix tree of abaab, with the suffixes numbered 1:abaab, 2:baab, 3:aab, 4:ab, 5:b. Each edge label is stored as a pair of positions, e.g. a = (1,1), b = (2,2), baab = (2,5), aab = (3,5), ab = (4,5).]

Suffix Tree: Full Text Index

• P occurs in T
⇔ P is a prefix of some suffix of T
⇔ a path for P exists in Tree(T)

Linear Time Construction

• In 1973, Weiner gave the first linear-time algorithm.
• In 1976, McCreight gave a more readable algorithm.
• Both process the string from right to left.
• In 1992, Ukkonen gave a left-to-right on-line algorithm.

On-line Construction of Suffix Trie

• If $O(n^2)$ time is allowed, this is quite easy.
• We construct $Trie_i$ from $Trie_{i-1}$.
• We need to record the current end point of every suffix.
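A minimal sketch of this quadratic on-line construction, assuming a nested-dictionary trie (the names `online_suffix_trie` and `contains` are illustrative only): the list `end_points` plays the role of the recorded end points, and each new character extends every one of them.

```python
def online_suffix_trie(text):
    """Build the suffix trie of text one character at a time."""
    root = {}
    # End points of all suffixes of the prefix read so far.
    # The root stands for the empty suffix.
    end_points = [root]
    for ch in text:
        # Trie_i is obtained from Trie_{i-1} by extending every end point by ch.
        extended = [node.setdefault(ch, {}) for node in end_points]
        end_points = [root] + extended
    return root


def contains(trie, pattern):
    """True if pattern labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = online_suffix_trie("abaaba")
print(contains(trie, "aab"), contains(trie, "bb"))
```

Since there are i + 1 end points after i characters, the total work is quadratic, which is what the suffix-tree ideas on the following slides avoid.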

[Figure: on-line construction of Trie(abaaba), showing the trie after each appended character.]

Two Lemmas

• Let $C_i$ denote the current end point of suffix i.

1. Once some $C_i$ branches out, it never branches again.

2. If $C_i$ does not branch, then no $C_j$ with j > i branches either.

Ideas to Find Suffix Tree

• Thus
– Once a suffix has branched, we never consider it again.
– Branches occur in order, from suffixes with small indices to suffixes with large indices.
• Each suffix goes through 3 phases:
1. Going along the original tree.
2. Branching a new leaf.
3. Growing the leaf.

Ideas to Find Suffix Tree (cont.)

• 3 phases:
1. Going along the original tree.
2. Branching a new leaf.
3. Growing the leaf.
• We only record the longest suffix that has not branched yet.
• When it branches, we look for the next suffix.
• To do this, we need suffix links.

Suffix Links

• A suffix link is a pointer from an internal node xS to another internal node S, where $x \in \Sigma$ and $S \in \Sigma^*$.
• Suffix tree of “abaabab$”:

[Figure: the suffix tree of abaabab$ with its suffix links; the leaves are abaabab$, baabab$, aabab$, abab$, bab$, ab$, b$ and $.]

Using Suffix Link to Find Common Substring (1)-(7)

[Figure sequence, 7 slides: the string abaaaba is matched step by step against the suffix tree of abaabab$, following suffix links to continue the search.]

Construct Suffix Tree

• As in matching, we use suffix links to find the next suffix.
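Both uses of suffix links, construction and matching, can be illustrated on the uncompacted suffix trie, where they are easier to code than on the compact suffix tree. The sketch below is an assumption-laden simplification, not the linear-time algorithm of the slides: it takes $O(n^2)$ space, and the names `Node`, `build_suffix_trie` and `longest_common_substring` are invented for the example.

```python
class Node:
    def __init__(self, depth):
        self.children = {}
        self.link = None    # suffix link: the node for xS points to the node for S
        self.depth = depth  # length of the string spelled from the root


def build_suffix_trie(text):
    """On-line suffix trie construction that follows suffix links."""
    root = Node(0)
    deepest = root  # node spelling the whole text read so far
    for ch in text:
        node = deepest
        previous_new = None
        # Walk the suffix-link chain, adding ch until a node already has it.
        while node is not None and ch not in node.children:
            new = Node(node.depth + 1)
            node.children[ch] = new
            if previous_new is not None:
                previous_new.link = new
            previous_new = new
            node = node.link
        previous_new.link = root if node is None else node.children[ch]
        deepest = deepest.children[ch]
    return root


def longest_common_substring(text, query):
    """Stream query against the trie of text, using suffix links on mismatch."""
    root = build_suffix_trie(text)
    node, best_len, best_end = root, 0, 0
    for i, ch in enumerate(query):
        while node is not root and ch not in node.children:
            node = node.link  # drop the first character of the current match
        if ch in node.children:
            node = node.children[ch]
        if node.depth > best_len:
            best_len, best_end = node.depth, i + 1
    return query[best_end - best_len:best_end]


print(longest_common_substring("abaabab", "abaaaba"))
```

The strings are the ones used in the slides (text abaabab, query abaaaba); on ties the first longest match found in the query is returned.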

Suffix Tree(abaabab$)

[Figure sequence: step-by-step construction of the suffix tree of abaabab$. Edges are labeled with (start, end) position pairs, with an open end written as "-" (e.g. 1,- and 2,-). When a branch is created, the suffix link of the nearest ancestor internal node is used to find the next suffix, and the walk then goes down by the remaining characters (“bab” in the example). The final tree has the leaves abaabab$, baabab$, aabab$, abab$, bab$, ab$, b$ and $.]

Time Complexity

• Phase 3 requires no work.
• Phase 1 takes O(n) time in total.
• Phase 2 branches O(n) times. The only bottleneck is how fast we can find the next position.

Time Complexity (cont.)

• Suppose we branch at a point (the red circle in the figure) at distance t from its parent.
• Finding the next suffix passes through at most t internal nodes.
• By amortized analysis, the total number of internal nodes the algorithm passes through is O(n).

Time Complexity (cont.)

• Thus, constructing a suffix tree takes O(n) time and O(n) space.

Applications

• Longest common substring.
• Repeats of a pattern in a string.
• Approximate matching.
• etc.

Suffix Array

Suffixes of hattivatti, by starting position (1 to 11): hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε

Sorted suffix    Starting position
ε                11
atti             7
attivatti        2
hattivatti       1
i                10
ivatti           5
ti               9
tivatti          4
tti              8
ttivatti         3
vatti            6

Suffix array of hattivatti: (11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6)

• A pattern such as att is located by binary search.
• The suffix array can be constructed from the suffix tree in linear time (DFS).
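The hattivatti example can be reproduced with a short sketch. Note the assumption: the array is built here by naively sorting the suffixes rather than by the linear-time DFS over a suffix tree mentioned above, and the function names are illustrative. Positions are 1-based, with the empty suffix ε at position n + 1, to match the slide.

```python
def suffix_array(text):
    """Naive construction: sort 1-based suffix start positions (ε included)."""
    n = len(text)
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Binary search for all suffixes that start with pattern."""
    # Lower bound: first suffix that is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid] - 1:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: the suffixes starting with pattern are contiguous from first.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid] - 1:].startswith(pattern):
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "hattivatti"
sa = suffix_array(text)
print(sa)                                 # [11, 7, 2, 1, 10, 5, 9, 4, 8, 3, 6]
print(find_occurrences(text, sa, "att"))  # [2, 7]
```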

Special Thanks to

• Esko Ukkonen
• Hsueh-I Lu
• Wikipedia
• Some figures in these slides are based on their slides.


MUMmer 1.0: Alignment of whole genomes

D95922019 杜海倫 2007/01/02

About The System - MUMmer

• For rapidly aligning whole genome sequences
– Assumption: the sequences are closely related
– Output:
  • Alignment of the input sequences
  • Highlighting of the exact differences between the genomes
    – SNPs, insertions, significant repeats, tandem repeats, reversals
– Main ideas
  • Suffix tree
  • Longest increasing subsequence (LIS)
  • Smith-Waterman alignment


Alignment step

• Perform a maximal unique match (MUM) decomposition of the two genomes -> Suffix tree

• Sort the MUMs, and extract the longest possible set of matches in the same order -> LIS

• Close gaps -> Smith-Waterman alignment

• Output!
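The gap-closing step relies on Smith-Waterman local alignment. As a rough illustration only, here is the plain textbook recurrence with a linear gap penalty; the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) and the function name are assumptions for the example, not MUMmer's actual parameters or implementation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (textbook Smith-Waterman)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Two short sequences that differ by a single substitution.
print(smith_waterman_score("cgtcataaagt", "cgtcctaaagt"))
```

Only two rows of the dynamic-programming table are kept, which is enough for the score; recovering the alignment itself would need the full table or a traceback structure.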

Step 1

• Suffix tree

Step 1 (cont.)

• Construct a suffix tree T for genome A.
• Add the suffixes of genome B (implementation: A + dummy character + B).
• Find unique matching sequences: an internal node with exactly two child nodes, both of which are leaf nodes, one from each genome.
• Find MUMs:
– For highly similar genomes, require MUM length >= 50 bp.
– For more distantly related genomes, require MUM length >= 20 bp.

Step 2

• Sorting and LIS take $O(K \log K)$ time, which is $O(N)$
– K: the number of MUMs
– $K \ll N / \log N$
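A sketch of this step, under the assumption that each MUM is represented as a pair (position in A, position in B): sort by position in A, then take the longest increasing subsequence of the B positions with the standard $O(K \log K)$ patience-sorting method. The function names are illustrative, not taken from MUMmer.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return one longest strictly increasing subsequence in O(K log K)."""
    tails = []        # tails[k]: smallest tail of an increasing run of length k + 1
    tails_index = []  # index in values of that tail
    parent = [-1] * len(values)
    for i, value in enumerate(values):
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
            tails_index.append(i)
        else:
            tails[k] = value
            tails_index[k] = i
        parent[i] = tails_index[k - 1] if k > 0 else -1
    # Walk the parent pointers back from the end of the longest run.
    result = []
    i = tails_index[-1] if tails_index else -1
    while i != -1:
        result.append(values[i])
        i = parent[i]
    return result[::-1]


def order_preserving_mums(mums):
    """Keep the largest set of MUMs that appear in the same order in A and B."""
    mums = sorted(mums)  # sort by position in genome A
    keep = set(longest_increasing_subsequence([b for _, b in mums]))
    return [m for m in mums if m[1] in keep]


print(order_preserving_mums([(5, 50), (1, 10), (3, 30), (2, 90), (7, 70)]))
```

The set lookup in `order_preserving_mums` assumes that positions in B are distinct, which holds for MUMs because each one occurs exactly once in B.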

Step 3

• Classify each gap into one of four classes
– SNP
  • Genome A: cgtcataaagt
  • Genome B: cgtcctaaagt
– Insertion
  • Genome A: cgtctaaagtggggaaaactctgg
  • Genome B: cgtctaaagt........ctctgg
  • Transpositions or simple insertions
– Polymorphic region
  • Genome A: cgtctaaagtggggaaaactctgg
  • Genome B: cgtctaaagtatgacaggctctgg
  • Should be aligned
– Repeat
  • Genome A: aaggaaggaaggagct
  • Genome B: aaggaagg....agct

Result and Discussion

• Comparing two strains of tuberculosis
– H37Rv and CDC1551
– >99% identical
– Able to catalog
  • all SNPs
  • all insertions of every length
  • all tandem repeats with different copy numbers
– Performance (DEC Alpha 4100)
  • 5s for step 1
  • 45s for step 2
  • 5s for step 3


Result and Discussion (cont.)

• Comparing two Mycoplasma genomes
– M. genitalium (580074 nt) and M. pneumoniae (226000 nt)
– Performance (DEC Alpha 4100)
  • 6.5s for step 1
  • 0.02s for step 2
  • 116s for step 3
– FASTA: many hours

Result and Discussion (cont.)

[Figure: three panels labeled FASTA, 25mers and MUMmer.]

Result and Discussion (cont.)

• Comparing human and mouse
– 222930 bp of human chromosome 12 (accession no. U47924) and 227538 bp of mouse chromosome 6 (accession no. AC002397)
– Performance: 29s
  • 1.6s for step 1


MUMmer 2.1

王慧芬

MUMmer 2.1

• Fast algorithms for large-scale genome alignment and comparison
• by Delcher, Phillippy, Carlton and Salzberg
• Nucleic Acids Research 2002
• http://www.tigr.org/software/mummer/MUMmer2.pdf

Agenda

• Key in MUMmer 1
• Improvements in MUMmer 2.1
• Technical improvements in MUMmer 2.1
• Application to DNA sequence alignment
– Alignment of incomplete genomes

MUMmer 1

• Key
– Builds a suffix tree containing both input sequences
– Finds all maximal unique matches (MUMs) between them
• MUM (Maximal Unique Match)
– A subsequence that occurs in exactly two matching copies, once in each input sequence
– Cannot be extended in either direction
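The MUM definition can be turned into a brute-force check, useful only for small inputs. This sketch deliberately does not use a suffix tree: it tries every pair of start positions, and the names `find_mums`, `count_occurrences` and the `min_length` parameter are assumptions made for the example. Positions are 0-based.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_length=1):
    """Return (start in a, start in b, length) for every MUM of a and b."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # extendable to the left, so not maximal
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1  # extend to the right as far as possible
            match = a[i:i + length]
            unique = count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1
            if length >= min_length and unique:
                mums.append((i, j, length))
    return mums


print(find_mums("aacgtgcatt", "ggcgtgcacc", min_length=4))
```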

Improvements in MUMmer 2.1

• Faster and uses less memory, each by a factor of nearly three
• Able to align DNA or protein sequences

          MUMmer 1        MUMmer 2
Time      74s (1GHz)      27s (1GHz)
Memory    293MB           100MB

Measured when aligning the 4.7 Mb genome of E. coli with the 3.0 Mb large chromosome of V. cholerae.

Technical improvements in MUMmer 2.1

• A reduction in the amount of memory used to store suffix trees
– The technique of Kurtz (1999) is used
• An alternative algorithm to find initial exact matches
• Cluster matches

Alternative to find initial exact matches

• MUMmer 1:
– Builds a suffix tree containing both input sequences
• MUMmer 2:
– Uses the Chang-Lawler (1994) method
  • running time is reduced
– Builds a suffix tree storing only one sequence (the reference)
– The second sequence (the query) is streamed against the suffix tree
  • memory usage is reduced by at least half
  • once the suffix tree is built, arbitrarily long or multiple queries can be streamed

Alternative to find initial exact matches (cont.)

• MUMmer 2 (cont.):
– Identify where the query sequence would branch off from the tree, to find all matches
– Unique match
  • Wherever a branch occurs at a tree position with just a single leaf beneath it
– Maximal match
  • Suffix links are used to find the next match (extended match)
  • Checking the character immediately preceding the start of a match determines whether it is a maximal match
  • All maximal matches are found in time proportional to the length of the query


Unique match

Maximal match

• Suffix links are used to find extended matches

Cluster matches

• MUMmer 1:
– Aligns 2 complete sequences, assuming no rearrangement.
– That is, it finds a single longest alignment.
• MUMmer 2:
– After matches are identified, the interval length between matches is checked.
– If the interval length between matches is less than a user-defined gap length, the matches are joined into a cluster.
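The clustering rule can be sketched as follows. This is a simplified reading of the rule stated above, not MUMmer's actual clustering code: matches are assumed to be (reference start, query start, length) tuples, only consecutive matches in reference order are compared, overlapping matches are not joined, and `cluster_matches` and `max_gap` are hypothetical names.

```python
def cluster_matches(matches, max_gap):
    """Group matches whose gaps in both sequences are below max_gap."""
    clusters = []
    for match in sorted(matches):  # sort by reference start
        if clusters:
            prev_ref, prev_qry, prev_len = clusters[-1][-1]
            ref_gap = match[0] - (prev_ref + prev_len)
            qry_gap = match[1] - (prev_qry + prev_len)
            # Join only if the match follows the previous one closely
            # in both the reference and the query.
            if 0 <= ref_gap < max_gap and 0 <= qry_gap < max_gap:
                clusters[-1].append(match)
                continue
        clusters.append([match])
    return clusters


matches = [(0, 0, 10), (15, 14, 10), (100, 200, 20), (125, 226, 5)]
for cluster in cluster_matches(matches, max_gap=10):
    print(cluster)
```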

Alignment of incomplete genomes

• DNA sequencing
• Shotgun sequencing
• Terms
• Finishing
• NUCmer (NUCleotide MUMmer)

DNA sequencing

– Human genomes are approx. 3 billion bases long.
– A sequencing machine can read fragments only 500-600 bp long.
– To read the DNA, the genome is broken up into tiny pieces (reads), and each piece is read individually.
– After all pieces are read, they are assembled in the correct order.

Shotgun sequencing

• Extract DNA
• Fragment DNA
• Clone DNA
• Sequence both ends of clones
– 500-600 bp per read
• Assemble
– reads are assembled to reconstruct the genome
• Finish sequencing (close gaps)

Terms

• Raw sequence
– Unassembled sequence reads
• Contig
– Overlapping reads are joined into longer composite sequences, called contigs
• Finished sequence
– Complete sequence of a genome with no gaps and an accuracy of >99.9%
• Full shotgun coverage
– Genome coverage in random raw sequence required to produce finished sequence, 8-10 fold (8-10X)

Finishing

• The process of
– Determining the order and orientation of all the contigs, and
– Generating additional sequence to fill in all the gaps between them (closing all the gaps)
• The finishing phase is dispensed with in many projects.
• However, MUMmer 2.1 is used as a base to build a program for the “finishing phase”
– Align multiple contigs to a completed reference genome, and
– Align one set of contigs to another set of contigs

NUCmer (NUCleotide MUMmer)

• A multiple-contig alignment program built on MUMmer 2.1
• 3 steps
• Outputs

NUCmer – 3 steps

• Step 1
– Input: 2 sets of DNA sequences in 2 multi-FASTA files representing partial or complete assemblies
– Each DNA sequence is a contig sequence
– Creates a map of all contig positions within each of the multi-FASTA files
– Concatenates the contigs of each file separately
– Runs MUMmer to find all exact matches between the two genomes
– Maps these matches back to the separate contigs
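The position map of step 1 can be illustrated as below. This is a sketch under stated assumptions, not NUCmer's code: contigs are given as (name, sequence) pairs already parsed from a multi-FASTA file, the separator character `x` is an arbitrary choice, and `concatenate_contigs` and `map_back` are hypothetical helper names. Positions are 0-based.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="x"):
    """Join contigs into one string and record where each contig starts."""
    starts, names, parts = [], [], []
    offset = 0
    for name, sequence in contigs:
        starts.append(offset)
        names.append(name)
        parts.append(sequence)
        offset += len(sequence) + len(separator)
    return separator.join(parts), starts, names


def map_back(position, starts, names):
    """Translate a position in the concatenation to (contig name, offset)."""
    k = bisect_right(starts, position) - 1  # last contig starting at or before position
    return names[k], position - starts[k]


contigs = [("c1", "acgt"), ("c2", "ggg"), ("c3", "tt")]
text, starts, names = concatenate_contigs(contigs)
print(text, starts)
print(map_back(6, starts, names))
```

A match found in the concatenated sequence at some global position is reported per contig by looking up its start with a binary search over `starts`.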

NUCmer – 3 steps (cont.)

• Step 2: Clustering
– MUMs (output of step 1) are clustered together if they are separated by less than a user-defined distance
• Step 3
– Runs a modified Smith-Waterman DP alignment to align the sequence between the MUMs (output of step 2)
• Result
– Alignment of “every sequence contig in the 1st file” to “every sequence contig in the 2nd file”
– Order, orientation, coverage and percent identity of contigs


Improvements in MUMmer 3.0

d95922026 曾俊雄


Overview

• Functionality vs. Modularity

What’s New?

• Optimized suffix-tree library (rewrite)
• Non-unique maximal matches (New!)
• Distant matches (2.1)

Optimized Suffix-Tree

• The most significant improvement.
– multi-contig query against multi-contig reference (continued)
– rewrite, more compact (continued)


MUMmer 3.0, page 4

• multi-contig query against multi-contig reference
– already in MUMmer 2.1 through the NUCmer package
– now imported into the core


• more compact suffix-tree

• Some previous work:
– Manber & Myers: 18.8n~22.4n bytes (DNA)
– Kärkkäinen: 15n~18n (?)
– Crochemore and Vérin: 32.7n
– The strmat software package by Knight, Gusfield and Stoye: 24n~28n (string length at most 2^23)

• Basic idea:
– Use different data structures for different kinds of tree nodes
– Duplicated information should be removed

• Terms:
– depth
– head position

[Figure: illustration of the head position on the string abab, with a substring w starting at position i, annotated "the longest w, the smallest i".]

• Implementation by Stefan Kurtz:
– n + 5q (6n) integers
  • for leaf nodes, the depth, the head position and some other fields are not required
– (3 + 1/16)n integers
  • after more observations

[Figure: a node aW, reached by the character a followed by W.]

If the head position of aW is known, the head position of W satisfies the constraint:

aW.headposition + 1 >= W.headposition

• Case 1: aW.headposition + 1 = W.headposition
• Case 2: aW.headposition + 1 > W.headposition

[Figure: the nodes aW and W in each of the two cases.]

In case 1, the internal node representing aW does not even need to store the head position information.

Non-unique Maximal Matches

• 1.0: matches must be unique
– may miss matches
• 2.0: uniqueness is required in the reference sequence
– may still miss matches
• Now (3.0): a command-line option generates all maximal matches, regardless of uniqueness
– at the cost of a very large output file

Distant Matches

• 1.0: only 100% matches are allowed
– lower sensitivity
• 2.1: distant (not 100%) matches are allowed
– through extension packages
• 3.0: improved


Conclusion

• MUMmer is good at aligning closely related species

• Support for distant matches was added as MUMmer improved

• The open-source license may be an advantage

• The space problem remains