Sequence Analysis with Artemis and Artemis Comparison Tool (ACT): Caribbean Bioinformatics Workshop

Sequence Analysis with Artemis and Artemis Comparison Tool (ACT)

Caribbean Bioinformatics Workshop

18th-29th January, 2010

atcttttacttttttcatcatctatacaaaaaatcatagaatattcatcatgttgtttaaaataatgtattccattatgaactttattacaaccctcgtttttaattaattcacattttatatctttaagtataatatcatttaacattatgttatcttcctcagtgtttttcattattatttgcatgtacagtttatcatttttatgtaccaaactatatcttatattaaatggatctctacttataaagttaaaatctttttttaattttttcttttcacttccaattttatattccgcagtacatcgaattctaaaaaaaaaaataaataatatataatatataataaataatatataataaataatatataatatataataaataatatataatatataatatataataaataatatataatatataatatataataaataatatataataaataatatataatatataatatataatactttggaaagattatttatatgaatatatacacctttaataggatacacacatcatatttatatatatacatataaatattccataaatatttatacaacctcaaataaaataaacatacatatatatatataaatatatacatatatgtatcattacgtaaaaacatcaaagaaatatactggaaaacatgtcacaaaactaaaaaaggtattaggagatatatttactgattcctcatttttataaatgttaaaattattatccctagtccaaatatccacatttattaaattcacttgaatattgttttttaaattgctagatatattaatttgagatttaaaattctgacctatataaacctttcgagaatttataggtagacttaaacttatttcatttgataaactaatattatcatttatgtccttatcaaaatttattttctccatttcagttattttaaacatattccaaatattgttattaaacaagggcggacttaaacgaagtaattcaatcttaactccctccttcacttcactcattttatatattccttaatttttactatgtttattaaattaacatatatataaacaaatatgtcactaataatatatatatatatatatatatatatatatattataaatgttttactctattttcacatcttgtccttttttttttaaaaatcccaattcttattcattaaataataatgtattttttttttttttttttttttttattaattattatgttactgttttattatatacactcttaatcatatatatatatttatatatatatatatatatatatatatattattcccttttcatgttttaaacaagaaaaaaaactaaaaaaaaaaaaaataataaaatatatttttataacatatgtattattaaaatgtatatataaaaatatatattccatttattattatttttttatatacattgttataagagtatcttctcccttctggtttatattactaccatttcactttgaacttttcataaaaattaatagaatatcaaatatgtataatatataacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatatatatatatatatatatacatataatatatatttcatctaatcatttaaaattattattatatattttttaaaaaatatatttatgataacataaaaagaatttaattttaattaaatatatataattacatacatctaatattattatatatatataataagttttccaaatagaatacttatatattatatatatatatatatatatatatattcttccataaaaagaataaaataaaataaaaacaccttaaaagtatttgtaaaaaattccccacattgaatatatagttgtatttataaaattaaagaaaaagcataaagttaccatttaatagtggagattagtaacattttcttcattatcaaaaatatttatttcctaattttttttttttg
taaaatatatttaaaaatgtaatagattatgtattaaataatataaatatagcaaaatgttcaattttagaaatttgcctctttttgacaaggataattcaaaagatacaggtaaaaaaaaaaaaataaagtaaaacaaaacaaaacaaaaaacaaaaaaaaaaaaaaaaaaaaaaatgacatgttataatataatataataaataaaaattatgtaatatatcataatcgaagaaacatatatgaaaccaaaaagaaacagatcttgatttattaatacatatataactaacattcatatctttatttttgtagatgatataaaaaattttataaactcttatgaagggatatatttttcatcatccaataaatttataaatgtatttctagacaaaattctgatcattgatccgtcttccttaaatgttattacaataaatacagatctgtatgtagttgatttcctttttaatgagaaaaataagaatcttattgttttagggtaatgaaatatatatagatttatatttttatttatttattatatattattttttaatttttcttttatatatttattttatttagtgtataaaatgatatcctttatatttatatttacatgggatattcaaataataacaaaaatgagtatacacatatatatatatatatatatatatatgtatattttttttttttttttatgttcctataggaaagggaagaattcactgatttgtagtgtttacaatattagggaatgcaactttacacttttgaaaaaaattcagttaagcaaaaatattaataacattaaaaagacactgatagcaaaatgtaatgaatatataataacattagaaaataagaaaattactttttatttcttaaataaagattatagtataaatcaaagtgaattaatagaagacggaaaagaacttattgaaaatatctatttgtcaaaaaatcatatcttgttagtaataaaaaattcatatgtatatatataccaattagatattaaaaattcccatattagttatacacttattgatagtttcaatttaaatttatcctacctcagagaatctataaataataaaaaaaagcatataaataaaataaatgatgtatcaaataatgacccaaaaaaggataataatgaaaaaaatacttcatctaataatataa

Sequencing is just the beginning of the process

Extracting information & interpreting

What's there? Where are the genes? Which genes? How to find them?

SEQUENCE ANNOTATION

Strategies for sequence annotation

• Predictive methods: interpretation of the DNA sequence into genes according to rules

• Comparative methods: interpretation of the DNA sequence into genes according to similarities with other sequences

• Experimental methods: interpretation of the DNA sequence into genes according to experimental results (e.g. cDNA)

EST Blast Hit

Gene prediction programs: ORFs and CDSs

ORFs are not equivalent to CDSs

Not all open reading frames are coding sequences
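
To make the ORF/CDS distinction concrete, here is a minimal sketch that lists every stop-free stretch in all six reading frames. It assumes the standard stop codons and a hypothetical length cut-off (`min_codons`); the function names are illustrative and not part of Artemis. Everything it reports is only an ORF, i.e. a candidate that still needs evidence before it can be called a CDS.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"taa", "tag", "tga"}
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end) for each stop-free stretch.

    An ORF here is a run of at least `min_codons` codons bounded by stop
    codons (or by the sequence ends) in one reading frame. Coordinates are
    0-based offsets on the strand that was scanned; `end` is exclusive.
    """
    seq = seq.lower()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - orf_start) // 3 >= min_codons:
                        yield strand, frame, orf_start, pos
                    orf_start = pos + 3
            # Open stretch that runs to the end of the sequence.
            last = frame + ((len(s) - frame) // 3) * 3
            if (last - orf_start) // 3 >= min_codons:
                yield strand, frame, orf_start, last


orfs = list(find_orfs("atgaaacccgggtaacc", min_codons=4))
```

On a real genome this returns many more ORFs than there are genes, which is why length alone is a poor criterion for calling a CDS.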

Gene prediction

Gene finder

Glimmer

Orpheus

PHAT

GeneMark

Gene finding programs

• Gene-finding software packages typically use Hidden Markov Models or related Markov models.

• Predict coding, intergenic and intron sequences

• Need to be trained on a specific organism.

• Never perfect!
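
The need for organism-specific training can be illustrated with a much simpler model than a real gene finder uses. The sketch below is a fixed-order Markov chain, not a Hidden Markov Model: it learns base-transition frequencies from a set of known coding sequences and from a background set, then scores a new sequence by log-odds. The function names, the model order and the toy training sets are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "acgt"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.lower()
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx][b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        model[ctx] = {b: (counts[ctx][b] + pseudocount) / total for b in ALPHABET}
    return model


def log_odds(seq, coding_model, background_model, order=2):
    """Sum of log2(P_coding / P_background); a score above 0 favours coding."""
    seq = seq.lower()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        # Skip ambiguous bases such as 'n'.
        if ctx in coding_model and base in coding_model[ctx]:
            score += math.log2(coding_model[ctx][base] / background_model[ctx][base])
    return score


# Toy training sets (hypothetical); real training uses curated genes
# from the organism being annotated.
coding_model = train_markov(["gcgcgcgcgcgcgcgc"])
background_model = train_markov(["atatatatatatat"])
```

A model trained on one organism's genes will score another organism's genes poorly if their composition differs, which is one reason predictions are never perfect.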

Gene prediction programs: Problems

• ORFs are not equivalent to CDSs

• Gene prediction programs find new genes that share properties with a given set of genes.

• They can be confounded by:

– Sequence constraints (ribosomal proteins etc.)

– Sequence biases

– Different sets of genes

– Horizontal gene transfer

– Non-coding DNA

Gene prediction programs: Problems

Different gene training sets: Plasmodium falciparum

Original annotation

Updated annotation

Gene prediction programs: Problems

Non-protein coding regions: S. typhi ribosomal RNA genes

[Figure tracks: glimmer, genefinder, orpheus, final]

Gene prediction programs: Problems

Non-protein coding regions: N. meningitidis DNA repeats

[Figure tracks: glimmer, orpheus, final]

Gene prediction programs: Problems

Pseudogenes: M. leprae

[Figure panels: Glimmer; ORPHEUS; WUBLASTX vs. M. tuberculosis; final annotation]

The Gene Prediction Process

DNA SEQUENCE

ANALYSIS SOFTWARE

• AT content

• Gene finders

• Codon usage

• BlastX

• FASTA

• ESTs

Annotator

Useful CDS prediction

Eukaryotic gene

[Figure: eukaryotic gene model, 5' to 3': 5' UTR, Exon I (starting at ATG), intron (GT ... AG), Exon II, intron (GT ... AG), Exon III, stop, 3' UTR; shown with the mRNA (CAP ... AAAAAAAAAA), the cDNA (TTTTTTTTT) and ESTs]

AT content

• Coding regions have higher GC content in AT-rich genomes
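
A minimal sketch of how such a composition plot can be computed: GC fraction in sliding windows along the sequence. The window and step sizes are arbitrary assumptions and the function name is illustrative; AT content is simply one minus the GC fraction when the window contains only a, c, g and t.

```python
def gc_content_windows(seq, window=120, step=30):
    """Return a list of (window_start, gc_fraction) along the sequence."""
    seq = seq.lower()
    results = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        gc = chunk.count("g") + chunk.count("c")
        results.append((start, gc / len(chunk) if chunk else 0.0))
    return results


profile = gc_content_windows("ggggaaaa", window=4, step=4)
```

In an AT-rich genome, windows with a noticeably higher GC fraction are worth inspecting as candidate coding regions.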

CODON USAGE

• Codon bias is different for each organism.

• DNA content in coding regions is restricted, but it is not restricted in non-coding regions.

• The codon usage for any particular gene can influence expression.

Codon usage

• All organisms have a preferred set of codons.

Codon   Malaria   Trypanosoma
GUU     0.41      0.28
GUC     0.06      0.19
GUA     0.42      0.14
GUG     0.11      0.39
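
The values above are fractions within one synonymous set (the four valine codons), so each column sums to 1. A minimal sketch of how such fractions could be computed from an in-frame coding sequence follows; the function name and the toy input are assumptions for illustration.

```python
from collections import Counter

VALINE_CODONS = ("GUU", "GUC", "GUA", "GUG")


def synonymous_fractions(cds, codons=VALINE_CODONS):
    """Fraction of each synonymous codon among all occurrences of the set.

    `cds` is an in-frame coding sequence (DNA or RNA); trailing bases that
    do not fill a codon are ignored.
    """
    rna = cds.upper().replace("T", "U")
    counts = Counter(rna[i:i + 3] for i in range(0, len(rna) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


fractions = synonymous_fractions("GTTGTTGTAGTG")
```

For a genome-wide table the same counting would be applied to all annotated CDSs of the organism rather than to a single gene.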

Codon Usage

• http://www.kazusa.or.jp/codon/

Codon Usage in Artemis

Forward frames

Reverse frames

Codon usage & gene finding in Leishmania

GC frame plot

• Plots the third position GC content of each frame of a DNA sequence.

• In coding DNA the GC content of the 3rd base is often higher.

• Good prediction of coding in malaria and trypanosomes.
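
The idea behind the plot can be sketched as follows: for each sliding window, take the bases at the third codon position of each of the three forward frames and compute their GC fraction. This is a simplified illustration under assumed window and step sizes, not the Artemis implementation; the function name is hypothetical.

```python
def gc3_by_frame(seq, window=120, step=30):
    """For each window, GC fraction at codon position 3 in frames 0, 1, 2.

    Frame f treats offsets f, f+3, f+6, ... as first codon positions, so
    its third positions are offsets f+2, f+5, f+8, ...
    """
    seq = seq.lower()
    rows = []
    for start in range(0, max(len(seq) - window + 1, 1), step):
        fractions = []
        for frame in range(3):
            # First third-position offset of this frame inside the window.
            first = start + (frame + 2 - start) % 3
            thirds = seq[first:start + window:3]
            gc = thirds.count("g") + thirds.count("c")
            fractions.append(gc / len(thirds) if thirds else 0.0)
        rows.append((start, tuple(fractions)))
    return rows


# Toy sequence in which only frame 0 has G at every third position.
rows = gc3_by_frame("aag" * 10, window=12, step=6)
```

Where one frame's curve rises clearly above the other two over a stretch of sequence, that frame is a candidate for the coding frame.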

GC frame plot of tubulin gene cluster on T. brucei Chr 1

Homology Data

• Coding regions are more conserved than non-coding regions due to selective pressure.

• Comparing all possible translations against all known proteins will give clues to known genes.

• BLASTX
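
"All possible translations" means the six reading frames: three on each strand. The sketch below produces them, assuming the standard genetic code; the function names are illustrative, and a tool such as BLASTX performs this translation internally before searching a protein database.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code in the conventional TCAG ordering; '*' marks stops.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("acgt", "tgca")


def translate(seq):
    """Translate a lowercase DNA string; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Return {frame_label: protein} for frames +1..+3 and -1..-3."""
    seq = seq.lower()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


frames = six_frame_translations("atggcctaa")
```

Each of the six protein strings can then be compared against known proteins; a strong hit in one frame suggests both the presence of a gene and its reading frame.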

Gene finding: using ACT

TBLASTX comparisons

P. knowlesi

P. falciparum

P. yoelii

Gene finding by RNA-Seq (transcriptional landscape of Neospora caninum tachyzoites)

[Figure: RNA-Seq tracks for Day 3 and Day 4 tachyzoites on N. caninum Chr08, compared with T. gondii Chr08; 5' UTR and 3' UTR marked; TBLASTX matches visualised in ACT]

Transcriptome sequencing in Neospora (RNA-Seq is useful for predicting/confirming UTR boundaries)

RNA-Seq: correcting gene models

[Figure: gene model before and after correction, each with a %GC plot; legend: 16hr, 32hr, 48hr]
