Page 1: Bioinformatics PhD. Course

Bioinformatics PhD. Course

Summary (approximate)

• 1. Biological introduction

• 2. Comparison of short sequences (<10,000 bps)

• 3. Comparison of large sequences (up to 250,000,000)

• 4. Sequence assembly

• 5. Efficient data search structures and algorithms

• 6. Proteins...

Page 2: Bioinformatics PhD. Course

3. Comparison of large sequences

Summary (more or less)

• 3.1 Overview

• 3.2 Suffix trees

• 3.3 MUMs

Page 3: Bioinformatics PhD. Course

Suffix trees

Dan Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press.

http://sequence.rutgers.edu/st/

Page 4: Bioinformatics PhD. Course

Suffix trees

Given the string ababaas:

Suffixes:

1: ababaas
2: babaas
3: abaas
4: baas
5: aas
6: as
7: s

Suffix tree (each line is an edge label; a leaf also carries the start position of its suffix):

root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7

What kind of queries can we do?

Page 5: Bioinformatics PhD. Course

Applications of Suffix trees

1. Exact string matching

[Figure: suffix tree of ababaas]

• Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
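The question can be checked with a small sketch. It assumes an uncompressed suffix trie (one node per character) rather than the compact suffix tree drawn on the slide, because the trie is shorter to write; the names `build_suffix_trie` and `occurrences` are hypothetical. A pattern occurs in the text exactly when it can be spelt from the root.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each node records the 1-based start
    positions of the suffixes that pass through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root


def occurrences(trie, pattern):
    """Spell pattern from the root; return where it occurs (empty list if nowhere)."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, occurrences(trie, pattern))
```

With this sketch abab is found at position 1, aab is not found, and ab is found at positions 1 and 3. The trie needs quadratic space, so it only illustrates the query; the compact tree answers the same question in time proportional to the pattern length.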

Page 6: Bioinformatics PhD. Course

Applications of Suffix trees

2. Finding the repeats within a sequence.

[Figure: suffix tree of ababaas]
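A repeat is any string spelt by a path that has at least two leaves below it. The sketch below illustrates this on an uncompressed suffix trie, which is an assumption made for brevity (the slide uses the compact tree); the function name `repeated_substrings` is hypothetical.

```python
def repeated_substrings(text):
    """Map every substring that occurs at least twice to its 1-based start positions."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)

    repeats = {}
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            # A node reached by two or more suffixes spells a repeated substring.
            if len(child["starts"]) >= 2:
                repeats[prefix + ch] = child["starts"]
                stack.append((child, prefix + ch))
    return repeats


repeats = repeated_substrings("ababaas")
print(repeats)
print(max(repeats, key=len))
```

For ababaas the repeats are a, ab, aba, b and ba, and the longest one is aba (positions 1 and 3).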

Page 7: Bioinformatics PhD. Course

Quadratic Insertion algorithm

Given the string ababaabbs, the suffixes are inserted one at a time, from the longest to the shortest. Each suffix is spelt from the root as far as the tree allows; where the match stops, the edge is split if necessary and a new leaf is added.

1. ababaabbs: new leaf ababaabbs,1 below the root.
2. babaabbs: new leaf babaabbs,2 below the root.
3. abaabbs: matches aba on the edge ababaabbs,1. The edge is split into aba, with leaves baabbs,1 and abbs,3.
4. baabbs: matches ba on the edge babaabbs,2. The edge is split into ba, with leaves baabbs,2 and abbs,4.
5. aabbs: matches a on the edge aba. The edge is split into a followed by ba, and the leaf abbs,5 is added below a.
6. abbs: matches a and then b on the edge ba below a. That edge is split into b followed by a, and the leaf bs,6 is added below b.
7. bbs: matches b on the edge ba below the root. That edge is split into b followed by a, and the leaf bs,7 is added below b.
8. bs: matches b; the leaf s,8 is added below that node.
9. s: new leaf s,9 below the root.

Resulting suffix tree:

root
  a
    b
      a
        baabbs,1
        abbs,3
      bs,6
    abbs,5
  b
    a
      baabbs,2
      abbs,4
    bs,7
    s,8
  s,9
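The insertion procedure can be sketched as follows. This is a minimal illustration, not the course's implementation: the names `Node`, `insert_suffix`, `build_suffix_tree` and `render` are hypothetical, and the code assumes the last character of the string is a terminator that occurs nowhere else (the s in the example). Each insertion may scan a whole suffix, which is where the quadratic cost comes from.

```python
class Node:
    def __init__(self, start=None):
        self.children = {}  # first character of edge label -> (edge label, child)
        self.start = start  # 1-based suffix position for a leaf, None otherwise


def insert_suffix(root, suffix, start):
    node = root
    while True:
        first = suffix[0]
        if first not in node.children:
            # Nothing to follow: hang the rest of the suffix as a new leaf.
            node.children[first] = (suffix, Node(start))
            return
        label, child = node.children[first]
        k = 0
        while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
            k += 1
        if k == len(label):
            # Whole edge matched: continue below it.
            node, suffix = child, suffix[k:]
            continue
        # Mismatch inside the edge: split it and add the new leaf.
        middle = Node()
        middle.children[label[k]] = (label[k:], child)
        middle.children[suffix[k]] = (suffix[k:], Node(start))
        node.children[first] = (label[:k], middle)
        return


def build_suffix_tree(text):
    # Assumption: the final character is a unique terminator.
    assert text and text.count(text[-1]) == 1
    root = Node()
    for i in range(len(text)):
        insert_suffix(root, text[i:], i + 1)
    return root


def render(node, indent=0):
    """Return the tree as indented lines, children in alphabetical order."""
    lines = []
    for first in sorted(node.children):
        label, child = node.children[first]
        tag = label if child.start is None else f"{label},{child.start}"
        lines.append(" " * indent + tag)
        lines.extend(render(child, indent + 2))
    return lines


print("\n".join(render(build_suffix_tree("ababaabbs"))))
```

Printing the rendered tree for ababaabbs gives nine leaves, numbered 1 to 9, under the internal nodes a, ab, aba, b and ba.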

Page 22: Bioinformatics PhD. Course

Generalized suffix tree

A suffix tree of many strings is called a generalized suffix tree, and it is the suffix tree of the concatenation of the strings, each followed by its own terminator symbol.

For instance, the generalized suffix tree of ababaabb and aabaat is the suffix tree of ababaabbαaabaatβ.

Page 23: Bioinformatics PhD. Course

Generalized suffix tree

Construction of the suffix tree of ababaabbαaabaaβ, given the suffix tree of ababaabbα:

root
  a
    b
      a
        baabbα,1
        abbα,3
      bα,6
    abbα,5
  b
    a
      baabbα,2
      abbα,4
    bα,7
    α,8
  α,9

The suffixes of aabaaβ are inserted into this tree one at a time; their leaves are numbered from 1 again.

1. aabaaβ: matches a and then ab on the edge abbα,5. That edge is split into ab, with leaves bα,5 and aaβ,1.
2. abaaβ: matches a, b, a and then a on the edge abbα,3. That edge is split into a, with leaves bbα,3 and β,2.
3. baaβ: matches b, a and then a on the edge abbα,4. That edge is split into a, with leaves bbα,4 and β,3.
4. aaβ: matches a and then a on the edge ab. That edge is split into a followed by b, and the leaf β,4 is added below the new node.
5. aβ: matches a; the leaf β,5 is added below that node.
6. β: new leaf β,6 below the root.

Generalized suffix tree of ababaabbαaabaaβ:

root
  a
    b
      a
        baabbα,1
        a
          bbα,3
          β,2
      bα,6
    a
      b
        bα,5
        aaβ,1
      β,4
    β,5
  b
    a
      baabbα,2
      a
        bbα,4
        β,3
    bα,7
    α,8
  α,9
  β,6

What kind of queries can we do?

Page 37: Bioinformatics PhD. Course

Applications of Suffix trees

1. The substring problem for a database of patterns DB

[Figure: generalized suffix tree of ababaabbαaabaaβ]

• Does the DB contain any occurrence of the patterns abab, aab, and ab?
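A hedged sketch of this query, taking the two example strings ababaabb and aabaa as the database. Instead of concatenating the strings with terminators, it keeps an uncompressed trie in which every node remembers which strings pass through it; this is a simplification chosen for brevity, and the names `build_generalized_trie` and `strings_containing` are hypothetical.

```python
def build_generalized_trie(strings):
    """Insert every suffix of every string; each node records the indices
    of the strings whose suffixes pass through it."""
    root = {"children": {}, "owners": set()}
    for owner, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "owners": set()})
                node["owners"].add(owner)
    return root


def strings_containing(trie, pattern):
    """Indices of the database strings that contain the (non-empty) pattern."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["owners"])


db = build_generalized_trie(["ababaabb", "aabaa"])
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(db, pattern)))
```

Here abab is found only in the first string, while aab and ab are found in both.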

Page 38: Bioinformatics PhD. Course

Applications of Suffix trees

2. The longest common substring of two strings

[Figure: generalized suffix tree of ababaabbαaabaaβ]
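In the generalized tree, a common substring is a path that has leaves of both strings below it, and the longest common substring is the deepest such path. The sketch below applies the same idea to an uncompressed trie with per-node owner sets, an assumption made to keep the code short; `longest_common_substring` is a hypothetical name.

```python
def longest_common_substring(first, second):
    # Build a trie of all suffixes of both strings, tagging each node
    # with the strings (0 or 1) that reach it.
    root = {"children": {}, "owners": set()}
    for owner, text in enumerate((first, second)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "owners": set()})
                node["owners"].add(owner)

    # The deepest node reached by both strings spells the answer.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            if len(child["owners"]) == 2:
                candidate = prefix + ch
                if len(candidate) > len(best):
                    best = candidate
                stack.append((child, candidate))
    return best


print(longest_common_substring("ababaabb", "aabaa"))
```

For ababaabb and aabaa the result is abaa, which corresponds to the node with leaves bbα,3 and β,2 in the generalized suffix tree.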

Page 39: Bioinformatics PhD. Course

Applications of Suffix trees

3. Finding MUMs (maximal unique matches).

[Figure: generalized suffix tree of ababaabbαaabaaβ]

Page 40: Bioinformatics PhD. Course

Linear Insertion algorithm:

Given the string … and the suffix tree …:

P1: the leaves of the suffixes from … have been inserted.

P2: the string … is the longest string that can be spelt through the tree.

Page 41: Bioinformatics PhD. Course

Insertion algorithm: example

Given the string ababaababb..., the tree after inserting suffixes 1 to 5 is:

root
  a
    ba
      baababb...,1
      ababb...,3
    ababb...,5
  ba
    baababb...,2
    ababb...,4

Page 42: Bioinformatics PhD. Course

Linear Insertion algorithm:

Given the string …:

P1: the leaves of the suffixes from … have been inserted.

P2: the string … is the longest string that …

P3: there is a pointer, called a “suffix pointer”, between any node and the node of its longest proper suffix.

Page 43: Bioinformatics PhD. Course

Insertion algorithm: example

Given the string ababaababb..., the next suffixes to insert start at positions 6, 7, 8, ...

Suffix 6 (ababb...) is spelt through a, ba and the first b of the edge baababb...,1, where the match stops. The edge is split into b, with leaves aababb...,1 and b...,6.

Suffix 7 (babb...) is spelt through ba and the first b of the edge baababb...,2, where the match stops. The edge is split into b, with leaves aababb...,2 and b...,7.

Tree after inserting suffixes 6 and 7:

root
  a
    ba
      b
        aababb...,1
        b...,6
      ababb...,3
    ababb...,5
  ba
    b
      aababb...,2
      b...,7
    ababb...,4

The next suffix to insert starts at position 8.

Page 49: Bioinformatics PhD. Course

Insertion algorithm: improving time

Summary: given the string ababaababb..., while inserting the suffixes that start at positions 6, 7, 8, ... we have pointed to the following nodes:

[Figure: the tree after the first five insertions, with the path a, ba, baababb...,1 and the path ba, baababb...,2 marked]

Page 51: Bioinformatics PhD. Course

Suffix tree implementation: suffix links

Given the sequence ababaas:

root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7

Page 52: Bioinformatics PhD. Course

Suffix links

Given the suffix tree of ababaas:

[Figure: suffix tree of ababaas, with suffix links between its internal nodes]
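A suffix link goes from the internal node that spells a string to the internal node that spells the same string without its first character. The sketch below only illustrates which links exist for a small string; it is a brute-force enumeration, not how a linear-time construction maintains them. It assumes the last character is a unique terminator, and the names `internal_nodes` and `suffix_links` are hypothetical.

```python
from collections import defaultdict


def internal_nodes(text):
    """Strings spelt by the internal nodes of the suffix tree: substrings
    that are followed by at least two different characters in text."""
    followers = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            followers[text[i:j]].add(text[j])  # text[i:j] is followed by text[j]
    return {s for s, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(text):
    """Map each internal node to the node spelling it minus its first
    character; the empty string stands for the root."""
    nodes = internal_nodes(text)
    links = {}
    for s in nodes:
        target = s[1:]
        # The target of a suffix link is always the root or an internal node.
        assert target == "" or target in nodes
        links[s] = target
    return links


print(suffix_links("ababaas"))
```

For ababaas the internal nodes are a, aba and ba, and the links are aba to ba, ba to a, and a to the root.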

Page 53: Bioinformatics PhD. Course

Insertion algorithm

Given the string ababaabbs, the first five suffixes are inserted as follows:

1. ababaabbs: new leaf ababaabbs,1 below the root.
2. babaabbs: new leaf babaabbs,2 below the root.
3. abaabbs: the edge ababaabbs,1 is split into aba, with leaves baabbs,1 and abbs,3.
4. baabbs: the edge babaabbs,2 is split into ba, with leaves baabbs,2 and abbs,4.
5. aabbs: the edge aba is split into a followed by ba, and the leaf abbs,5 is added below a.

Tree after five insertions:

root
  a
    ba
      baabbs,1
      abbs,3
    abbs,5
  ba
    baabbs,2
    abbs,4