Genomic Repeat Visualisation Using Suffix Arrays

Nava Whiteford

Department of Chemistry, University of Southampton

new@soton.ac.uk

Repeat Visualisation Using Suffix Arrays

• The Analysis

• Artificial Sequences

• Genomic Sequences

• The Algorithm

• Larger Sequences

• Non-genomic sequences

The repeatscore plot

A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case, 2).

ATGCATATA

AT TG GC CA AT TA AT TA

AT occurs 3 times
TG occurs 1 time
GC occurs 1 time
CA occurs 1 time
TA occurs 2 times
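
The counting step can be sketched in a few lines of Python. This is only an illustration of the sliding window on the example sequence; the function name `window_counts` is made up for this sketch and is not taken from the original software.

```python
from collections import Counter


def window_counts(sequence, length):
    """Count every substring of the given length using a sliding window."""
    # One window starts at every position that leaves room for a full substring.
    windows = [sequence[i:i + length] for i in range(len(sequence) - length + 1)]
    return Counter(windows)


counts = window_counts("ATGCATATA", 2)
for substring, n in counts.items():
    print(f"{substring}: {n}")
```

For the example sequence this reproduces the counts listed above (AT three times, TA twice, the rest once).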

No. of occurrences (r)    No. of substrings that occur r times
1                         3
2                         1
3                         1
4                         0

The repeat-score plot

Number of        Sub-string length
occurrences      1    2    3    4    5
1                2    3    5    6    5
2                0    1    1    0    0
3                1    1    0    0    0
4                1    0    0    0    0
5                0    0    0    0    0
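
A minimal sketch of how such a matrix could be built by brute force, repeating the sliding-window count for each substring length. The name `repeat_score_matrix` and its parameters are assumptions for illustration; this is the slow approach, not the suffix-array method.

```python
from collections import Counter


def repeat_score_matrix(sequence, max_length, max_occurrences):
    """matrix[r - 1][k - 1] = number of distinct substrings of length k seen exactly r times."""
    matrix = [[0] * max_length for _ in range(max_occurrences)]
    for k in range(1, max_length + 1):
        # Sliding-window count of all substrings of length k.
        counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
        for r in counts.values():
            if r <= max_occurrences:
                matrix[r - 1][k - 1] += 1
    return matrix


matrix = repeat_score_matrix("ATGCATATA", max_length=5, max_occurrences=5)
for r, row in enumerate(matrix, start=1):
    print(r, *row)
```

Run on ATGCATATA, the rows match the table above. Each cell becomes one pixel of the plot.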

The resulting matrix is then plotted as an image:

Repeatscore plots of Artificial Sequences

Small repeats

Reverse strand is also included
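
The slides do not say how the reverse strand is combined with the forward strand. One plausible reading, sketched here as an assumption, is that windows are counted over both the sequence and its reverse complement. The helper names are hypothetical.

```python
from collections import Counter

# Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def both_strand_counts(sequence, length):
    """Count substrings of the given length on the forward and reverse strands together."""
    counts = Counter()
    for strand in (sequence, reverse_complement(sequence)):
        counts.update(strand[i:i + length] for i in range(len(strand) - length + 1))
    return counts


print(reverse_complement("ATGCATATA"))
print(both_strand_counts("ATGCATATA", 2))
```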

Random Sequences

DNA Sequences

• “The language of life”

• Composed of four different bases A, T, G and C

• Sequences range in size from 2000bp to 670 billion bp.

Small Genomic Sequences

Lambda Phage

Lambda Phage compared with a Random Sequence

E. coli

Sequences coding for rRNA

Known inter-genic repeat elements

Repeats in Genomic Sequences

A linear-time algorithm

• The plots shown would take hours to construct using traditional methods.

• The algorithms used would not scale linearly.

• It is not feasible to create these plots for large sequences unless more advanced algorithms are used.

The suffix array

Original string: banana$

All suffixes:

• banana$
• anana$
• nana$
• ana$
• na$
• a$

In sorted order:

• a$
• ana$
• anana$
• banana$
• na$
• nana$
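
A suffix array stores the start positions of the suffixes in sorted order rather than the suffixes themselves. The sketch below builds one naively by sorting. It is not linear time and is shown only to make the data structure concrete; the function name is illustrative.

```python
def suffix_array(text):
    """Return the start positions of the suffixes of text in sorted order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


text = "banana$"
for start in suffix_array(text):
    print(start, text[start:])
```

Unlike the list above, this also includes the one-character suffix "$", which sorts first because "$" compares lower than any letter.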

Generating the repeatscore plot

a$

ana$

anana$

banana$

na$

nana$
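
The slides do not spell out how the plot is read off the sorted suffixes. One way to see why sorting helps, sketched here as an assumption: suffixes that share a prefix of length k sit next to each other, so repeat counts come from a single pass over the array. The suffix array is again built naively, so this sketch is not linear time overall.

```python
from itertools import groupby


def counts_from_suffix_array(text, length):
    """Count substrings of a given length by grouping sorted suffixes on their prefix."""
    starts = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    # Suffixes shorter than the window cannot contribute a full substring.
    prefixes = [text[i:i + length] for i in starts if i + length <= len(text)]
    # Equal prefixes are adjacent because the suffixes are sorted.
    return {prefix: sum(1 for _ in group) for prefix, group in groupby(prefixes)}


print(counts_from_suffix_array("banana$", 2))
```

For banana$ with k = 2, the adjacent pairs ana$/anana$ and na$/nana$ give "an" and "na" a count of two each.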

Whole human genome

Human Chromosome 18

Arabidopsis thaliana chromosome 1, coding region

Fibonacci derived sequences

Gallus gallus chromosome 20

Application to other sequences

• Analysing writing styles

• Finding plagiarised text

• Any sequence that may contain motif-based, language-like structure.

Shakespeare

Text document containing the text “The quick brown fox jumped over the lazy dog” 16 times.

“On the Economy of Machinery and Manufactures” by Charles Babbage with an artificial repeat inserted 16 times.

Conclusion

• This new visualisation technique can highlight repeat structure in sequences.

• In genomic sequences this may be useful in generating annotation.

• There are applications in other areas worth pursuing.

• Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.