Annotation and Alignment of the Drosophila Genomes

Page 1: Annotation and Alignment of the Drosophila Genomes

Annotation and Alignment of the Drosophila Genomes

Page 2: Annotation and Alignment of the Drosophila Genomes

One (possibly wrong) alignment is not enough: the history of parametric inference

• 1992: Waterman, M., Eggert, M. & Lander, E. Parametric sequence comparisons, Proc. Natl. Acad. Sci. USA 89, 6090-6093.
• 1994: Gusfield, D., Balasubramanian, K. & Naor, D. Parametric optimization of sequence alignment, Algorithmica 12, 312-326.
• 2003: Wang, L. & Zhao, J. Parametric alignment of ordered trees, Bioinformatics 19, 2237-2245.
• 2004: Fernández-Baca, D., Seppäläinen, T. & Slutzki, G. Parametric multiple sequence alignment and phylogeny construction, Journal of Discrete Algorithms 2, 271-287.

XPARAL by Kristian Stevens and Dan Gusfield

Page 3: Annotation and Alignment of the Drosophila Genomes

Whole Genome Parametric Alignment
Colin Dewey, Peter Huggins, Lior Pachter, Bernd Sturmfels and Kevin Woods

Mathematics and Computer Science

• Parametric alignment in higher dimensions.
• Faster new algorithms.
• Deeper understanding of alignment polytopes.

Biology

• Whole genome parametric alignment.
• Biological implications of alignment parameters.
• Alignment with biology rather than for biology.

Page 4: Annotation and Alignment of the Drosophila Genomes

[Figure: excerpt of a gapped sequence alignment, labeled "analysis"]


Page 6: Annotation and Alignment of the Drosophila Genomes

computational geometry


Page 8: Annotation and Alignment of the Drosophila Genomes

A Whole Genome Parametric Alignment of D. melanogaster and D. pseudoobscura

• Divided the genomes into 1,116,792 constrained and 877,982 unconstrained segment pairs.
  This is an orthology map of the two genomes.

• 2d, 3d, 4d, and 5d alignment polytopes were constructed for each of the 877,982 unconstrained segment pairs.
  For each segment pair, obtain all possible optimal summaries for all parameters in a Needleman-Wunsch scoring scheme.

• Computed the Minkowski sum of the 877,982 2d polytopes.
  There are only 838 optimal alignments of the two Drosophila genomes if the same match, mismatch and gap parameters are used for all the segment pair alignments.
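The Minkowski sum step can be illustrated on two small convex polygons. The following is a minimal sketch, not the method used for the whole-genome computation: it forms all pairwise vertex sums and takes their convex hull, which is simple but far slower than an edge-merging approach. The function names `convex_hull` and `minkowski_sum` and the example polygons are hypothetical.

```python
from itertools import product

def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def minkowski_sum(P, Q):
    """Hull of all pairwise sums of vertices of P and Q (brute force)."""
    return convex_hull([(p[0] + q[0], p[1] + q[1]) for p, q in product(P, Q)])

triangle = [(0, 0), (1, 0), (0, 1)]
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(minkowski_sum(triangle, square))
# [(0, 0), (2, 0), (2, 1), (1, 2), (0, 2)]
```

Summing many polygons amounts to folding `minkowski_sum` over the list, for example with `functools.reduce`. The number of vertices of the result grows much more slowly than the number of summands, which is consistent with the small vertex count reported on the slide.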

Page 14: Annotation and Alignment of the Drosophila Genomes

>mel
CTGCGGGATTAGGGGTCATTAGAGTGCCGAAAAGCGAGTTTATTCTATGGAC
>pse
CTGGAAGAGTTTTGATTAGTAGGGGATCCATGGGGGCGAGGAGAGGCCATCATCGTGTAC

How do we build the polytope for this pair of sequences?

Page 15: Annotation and Alignment of the Drosophila Genomes

Alignment polytopes are small

Theorem: The number of vertices of an alignment polytope for two sequences of length n and m is $O\big((n+m)^{d(d-1)/(d+1)}\big)$, where d is the number of free parameters in the scoring scheme.

Examples:

Parameters             Model                                   Vertices
M, X, S                Jukes-Cantor with linear gap penalty    O((n+m)^(2/3))
M, X, S, G             Jukes-Cantor with affine gap penalty    O((n+m)^(3/2))
M, X_TS, X_TV, S, G    K2P with affine gap penalty             O((n+m)^(12/5))

L. Pachter and B. Sturmfels, Parametric inference for biological sequence analysis, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p. 16138-16143.
L. Pachter and B. Sturmfels, Tropical geometry of statistical models, Proceedings of the National Academy of Sciences, Volume 101, Number 46 (2004), p. 16132-16137.
L. Pachter and B. Sturmfels (eds.), Algebraic Statistics for Computational Biology, Cambridge University Press.
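As a quick arithmetic check, the exponent d(d-1)/(d+1) from the theorem can be evaluated exactly. The three exponents in the examples match d = 2, 3 and 4, that is, one fewer than the number of parameters listed in each row. That reading of d is an inference from the numbers, not something stated on the slide, and `vertex_exponent` is a hypothetical helper name.

```python
from fractions import Fraction

def vertex_exponent(d):
    """Exponent d(d-1)/(d+1) from the vertex bound, as an exact fraction."""
    return Fraction(d * (d - 1), d + 1)

for d in (2, 3, 4):
    print(d, vertex_exponent(d))
# 2 2/3
# 3 3/2
# 4 12/5
```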

Page 16: Annotation and Alignment of the Drosophila Genomes

The algebraic statistical model for sequence alignment, known as the pair hidden Markov model, is the image of the map

The logarithms of the parameters give the edge lengths for the shortest path problem on the alignment graph.
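The path formulation can be sketched as a dynamic program over the alignment graph. This is a minimal illustration under assumptions: it uses a three-parameter score (M per match, X per mismatch, S per space) in place of log-probabilities, it maximizes rather than minimizes, and the function name `align_summary` is hypothetical. Alongside the best score it tracks the summary (matches, mismatches, spaces) of one optimal path.

```python
def align_summary(a, b, M, X, S):
    """Best global alignment score and the summary of one optimal path."""
    n, m = len(a), len(b)
    # best[i][j] = (score, (matches, mismatches, spaces)) for a[:i] vs b[:j]
    best = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = (0, (0, 0, 0))
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:  # vertical edge: a[i-1] aligned to a space
                score, (w, x, s) = best[i - 1][j]
                candidates.append((score + S, (w, x, s + 1)))
            if j > 0:  # horizontal edge: b[j-1] aligned to a space
                score, (w, x, s) = best[i][j - 1]
                candidates.append((score + S, (w, x, s + 1)))
            if i > 0 and j > 0:  # diagonal edge: match or mismatch
                score, (w, x, s) = best[i - 1][j - 1]
                if a[i - 1] == b[j - 1]:
                    candidates.append((score + M, (w + 1, x, s)))
                else:
                    candidates.append((score + X, (w, x + 1, s)))
            best[i][j] = max(candidates)
    return best[n][m]

print(align_summary("ACGT", "AGT", M=1, X=-1, S=-2))
# (1, (3, 0, 1))
```

Every path from corner to corner satisfies 2(matches + mismatches) + spaces = n + m, so the summary has one fewer degree of freedom than it has coordinates.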

Page 17: Annotation and Alignment of the Drosophila Genomes

Newton Polytope of a Polynomial

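Definition: The Newton polytope of a polynomial

$$f(x_1,\dots,x_d) = \sum_{i=1}^{N} c_i \, x_1^{a_{i1}} x_2^{a_{i2}} \cdots x_d^{a_{id}}$$

is defined to be the convex hull of the lattice points in $\mathbb{R}^d$ corresponding to monomials in f:

$$\mathrm{NP}(f) = \operatorname{conv}\{(a_{i1},\dots,a_{id}) : c_i \neq 0\}$$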
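For a polynomial in two variables the definition reduces to a planar convex hull of exponent pairs. The sketch below assumes a hypothetical representation of the polynomial as a dictionary from exponent pairs to coefficients. The example polynomial f = 1 + 2x + 3y + xy + x^2 y^2 is made up for illustration.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def newton_polytope(poly):
    """Vertices of the Newton polygon; poly maps (a, b) -> coefficient of x^a y^b."""
    support = [exponent for exponent, coeff in poly.items() if coeff != 0]
    return convex_hull(support)

# f = 1 + 2x + 3y + xy + x^2 y^2
f = {(0, 0): 1, (1, 0): 2, (0, 1): 3, (1, 1): 1, (2, 2): 1}
print(newton_polytope(f))
# [(0, 0), (1, 0), (2, 2), (0, 1)]
```

The monomial xy contributes the lattice point (1, 1), which lies inside the hull and therefore is not a vertex.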

Page 18: Annotation and Alignment of the Drosophila Genomes

Polytope propagation

$$NP_{i,j} = S * NP_{i-1,j} + S * NP_{i,j-1} + (X \text{ or } M) * NP_{i-1,j-1}$$

• $NP_{i,j}$: Newton polytope for positions [1,i] and [1,j] in each sequence
• +: convex hull of union
• *: Minkowski sum

[Figure: alignment graph for the sequences AACATTAG and AAGATTACCACA]
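The recurrence can be sketched in two dimensions by recording each alignment's summary as (mismatches, spaces). Under that assumed coordinate choice, multiplying a polytope by S, X or M is a translation by (0, 1), (1, 0) or (0, 0), and "+" is the convex hull of the union. The function names below are hypothetical and the code is an illustration, not the authors' implementation.

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def shift(vertices, step):
    """Minkowski sum of a polytope with a single point (a translation)."""
    return [(v[0] + step[0], v[1] + step[1]) for v in vertices]

def alignment_polytope(a, b):
    """Vertices of the (mismatches, spaces) alignment polygon of a and b."""
    n, m = len(a), len(b)
    NP = [[None] * (m + 1) for _ in range(n + 1)]
    NP[0][0] = [(0, 0)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            pts = []
            if i > 0:
                pts += shift(NP[i - 1][j], (0, 1))        # S * NP[i-1][j]
            if j > 0:
                pts += shift(NP[i][j - 1], (0, 1))        # S * NP[i][j-1]
            if i > 0 and j > 0:
                step = (0, 0) if a[i - 1] == b[j - 1] else (1, 0)
                pts += shift(NP[i - 1][j - 1], step)      # (X or M) * NP[i-1][j-1]
            NP[i][j] = convex_hull(pts)                   # "+" = hull of the union
    return NP[n][m]

print(sorted(alignment_polytope("AC", "AG")))
# [(0, 2), (0, 4), (1, 0), (1, 2)]
```

Only hull vertices are stored in each cell, which is what keeps the tables small compared with enumerating every alignment.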

Page 19: Annotation and Alignment of the Drosophila Genomes

BP England, U Heberlein, R Tjian. Purified Drosophila transcription factor, Adh distal factor-1 (Adf-1), binds to sites in several Drosophila promoters and activates transcription, J Biol Chem 1990.

Back to Adf1

Page 20: Annotation and Alignment of the Drosophila Genomes

Drosophila DNase I Footprint Database (v2.0)

Target Gene    Chromosome Arm  Start     Stop      Transcription Factor  Pubmed ID (PMID)  Footprint ID (FPID)  Footprint Alignment
ems (CG2988)   3R              9723806   9723816   Abd-B (CG11648)       9491376           003205               Abd-B->ems:003205
ems (CG2988)   3R              9723843   9723853   Abd-B (CG11648)       9491376           003206               Abd-B->ems:003206
ems (CG2988)   3R              9723998   9724008   Abd-B (CG11648)       9491376           003208               Abd-B->ems:003208
ems (CG2988)   3R              9724091   9724102   Abd-B (CG11648)       9491376           003209               Abd-B->ems:003209
ems (CG2988)   3R              9724526   9724536   Abd-B (CG11648)       9491376           003211               Abd-B->ems:003211
ems (CG2988)   3R              9724557   9724567   Abd-B (CG11648)       9491376           003213               Abd-B->ems:003213
ems (CG2988)   3R              9724614   9724624   Abd-B (CG11648)       9491376           003214               Abd-B->ems:003214
dpp (CG9885)   2L              2454657   2454685   Adf1 (CG15845)        7791801           003665               Adf1->dpp:003665
Adh (CG3481)   2L              14615472  14615509  Adf1 (CG15845)        2105454           005046               Adf1->Adh:005046
Ddc (CG10697)  2L              19116303  19116321  Adf1 (CG15845)        2318884           005464               Adf1->Ddc:005464
Antp (CG1028)  3R              2825018   2825059   Adf1 (CG15845)        2318884           006446               Adf1->Antp:006446
Adh (CG3481)   2L              14616171  14616209  Adf1 (CG15845)        2105454           005059               Adf1->Adh:005059
Antp (CG1028)  3R              2825117   2825144   Adf1 (CG15845)        2318884           006447               Adf1->Antp:006447
Antp (CG1028)  3R              2825151   2825174   Adf1 (CG15845)        2318884           006448               Adf1->Antp:006448

Page 21: Annotation and Alignment of the Drosophila Genomes

Back to Adf1

mel TGTGCGTCAGCGTCGGCCGCAACAGCG
pse TGT-----------------GACTGCG
    ***                  ** ***

BLASTZ alignment


Page 23: Annotation and Alignment of the Drosophila Genomes

Back to Adf1

mel TGTGCGTCAGCGTCGGCCGCAACAGCG
pse TGT-----------------GACTGCG
    ***                  ** ***

mel TGTG----CGTCAGC--G----TCGGCC---GC-AACAG-CG
pse TGTGACTGCG-CTGCCTGGTCCTCGGCCACAGCCAAC-GTCG
    ****    ** * **  *    ******   ** *** * **

mel TGTGCGTCAGC------GTCGGCCGCAACAGCG
pse TGTGACTGCGCTGCCTGGTCCTCGGCCACAGC-
    ****  *  **      ***  * ** *****


Page 25: Annotation and Alignment of the Drosophila Genomes

Per site analysis
Group 1 mean per site % identity           51.3%   51.3%   47.8%
Group 2 mean per site % identity           47.8%   42.9%   42.9%
Difference of means (group 1 – group 2)    3.6%    8.4%    4.9%
Difference of means resampling p-value     0.05    0.003   1E-5
Distribution comparison KS p-value         0.026   0.0016  2E-6

Per base analysis
Group 1 mean per base % identity           47.8%   47.8%   46.3%
Group 2 mean per base % identity           46.3%   42.4%   42.4%
Difference of means (group 1 – group 2)    1.5%    5.4%    3.9%
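A resampling p-value for a difference of means can be computed along the following lines. The slides do not say which resampling scheme was used, so this sketch assumes a one-sided permutation test: group labels are shuffled and the fraction of shuffles with a difference at least as large as the observed one is reported. The function name and the per-site identity values are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def resampling_p_value(group1, group2, trials=10000, seed=0):
    """One-sided permutation p-value for mean(group1) - mean(group2)."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    observed = mean(group1) - mean(group2)
    pooled = list(group1) + list(group2)
    k = len(group1)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)             # random relabeling of the sites
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (trials + 1)

# Hypothetical per-site % identity values for two groups of footprints
group1 = [0.80, 0.90, 0.85, 0.95, 0.90, 0.88]
group2 = [0.10, 0.20, 0.15, 0.05, 0.12, 0.18]
print(resampling_p_value(group1, group2, trials=2000))
```

The Kolmogorov-Smirnov row of the table compares the two distributions as a whole rather than their means and is not covered by this sketch.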

Page 26: Annotation and Alignment of the Drosophila Genomes

80.4%

Page 27: Annotation and Alignment of the Drosophila Genomes

85.1%

Page 28: Annotation and Alignment of the Drosophila Genomes

86.5%

Page 29: Annotation and Alignment of the Drosophila Genomes

79.1%

Page 30: Annotation and Alignment of the Drosophila Genomes

Applications

• Conservation of cis-regulatory elements

• Phylogenetics: branch length estimation

This is the expected number of mutations per site in an alignment with summary (x,s).

Jukes-Cantor correction:
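The formula on this slide did not survive extraction. As an illustration only, the standard Jukes-Cantor correction d = -(3/4) ln(1 - (4/3) p) can be applied to a summary (x, s). The sketch assumes that p is the number of mismatches x divided by the number of ungapped columns, (n + m - s)/2, for sequences of length n and m. That mapping from the summary to p, and the function name, are assumptions and may differ from what the slide showed.

```python
import math

def jukes_cantor_distance(x, s, n, m):
    """Expected substitutions per site from a summary (x, s), Jukes-Cantor model.

    x: mismatches, s: spaces, n and m: lengths of the two sequences.
    """
    aligned = (n + m - s) / 2           # columns pairing two bases (assumed)
    if aligned <= 0:
        raise ValueError("no ungapped columns in the alignment")
    p = x / aligned                     # observed fraction of mismatches
    if p >= 0.75:
        raise ValueError("correction undefined for p >= 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)

print(round(jukes_cantor_distance(x=2, s=0, n=10, m=10), 4))
# 0.2326
```

The corrected value exceeds the raw mismatch fraction (0.2 in the example) because it accounts for sites that changed more than once.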
