marina-manrique
Slides from the talk presented at the conference "Applied Bioinformatics and Public Health" at Cambridge during 1-3 June 2011
BG7: A new system for bacterial genome annotation designed for NGS data
www.ohnosequences.com www.era7bioinformatics.com
Motivation
Features
How it works?
Comparisons
Motivation
The need for a system specially designed for NGS data annotation, with a pipeline unbiased by existing annotation systems designed for Sanger sequences
The need for a versatile system able to annotate genes even at the preliminary assembly stage of the genome
Special focus is given to the detection of “unexpected proteins” without orthologs in closely related genomes (horizontally acquired genes, phage genes, plasmid genes…)
A fast, automated and scalable process to face the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies
Features
1. A new approach
2. It’s tolerant to NGS errors
3. It’s based on cloud computing
4. It uses bio4j
Features: Approach
ORF prediction is based on protein similarity
Features: Approach
Use as much information as you can (not just start/stop signals)
[Figure: the sequence TGGATGTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCTGA with regions labeled A, B, C, D, E]
Features: Approach
Standard: Sequence -> ORF prediction (Glimmer) -> Function prediction (Blast)
BG7: Sequence -> Protein searching (Blast) -> CDS prediction -> RNA searching (Blast)
Features: NGS errors
Issue                                    Technology
Genomes in several contigs               All
Sequencing errors in start/stop codons   Illumina substitutions, 454 indels
Frameshifts                              454 indels
Horizontal gene transfer                 None
The BG7 system is tolerant to all these issues
Features: Cloud computing
AWS (Amazon Web Services)
Completely scalable
On demand
Cheap
Fast
1 genome in ~2 hours
100 genomes in ~2 hours once you’ve got the reference proteins
Useful in tracking outbreaks
Features: bio4j
It uses bio4j (www.bio4j.com)
Much richer annotations
How it works?
1. Expert manual selection of reference sequences
2. Protein search
- Blast
3. CDS definition
- HSPs merge
- Extension of the similarity region searching for start/stop signals
4. Solving conflicts
- Solving duplicates
- Solving overlaps
5. RNA search
- Blast
6. Incorporation of RNA genes
- Definition of RNA genes
- Conflicts with previously annotated protein coding genes are solved
Step 2: Protein search with tBlastn
Input contigs (aa)
Reference proteins (aa) are searched in the contig sequences
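A minimal sketch of how such a search could be driven and its hits read back, assuming the NCBI BLAST+ `tblastn` program and its tabular output (`-outfmt 6`). The slides do not say which BLAST release or options BG7 uses, so the file names, the e-value cutoff and the helper names `tblastn_command`, `Hit` and `parse_hit` are all hypothetical.

```python
from typing import List, NamedTuple


def tblastn_command(proteins_fasta: str, contigs_fasta: str,
                    evalue: float = 1e-5) -> List[str]:
    """Build a BLAST+ tblastn call: protein queries against nucleotide contigs.

    The cutoff is an illustrative assumption, not a BG7 setting.
    """
    return ["tblastn", "-query", proteins_fasta, "-subject", contigs_fasta,
            "-evalue", str(evalue), "-outfmt", "6"]


class Hit(NamedTuple):
    protein_id: str
    contig_id: str
    contig_start: int  # 1-based, always <= contig_end
    contig_end: int
    strand: str        # "+" or "-"
    evalue: float


def parse_hit(line: str) -> Hit:
    """Parse one line of default -outfmt 6 output.

    Columns: qseqid sseqid pident length mismatch gapopen
             qstart qend sstart send evalue bitscore
    """
    fields = line.rstrip("\n").split("\t")
    sstart, send = int(fields[8]), int(fields[9])
    # tblastn reports sstart > send when the hit lies on the reverse strand.
    strand = "+" if sstart <= send else "-"
    return Hit(fields[0], fields[1], min(sstart, send), max(sstart, send),
               strand, float(fields[10]))
```

The command list can be passed to `subprocess.run(..., capture_output=True, text=True)` on a machine where BLAST+ is installed, and each line of the captured output fed to `parse_hit`.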
Step 3: CDS definition
Merging HSPs
[Figure: several HSPs between the input contigs (aa) and a reference protein]
We merge the HSPs to form a single similarity region
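The merge can be pictured as a standard interval merge. The following is only a sketch under assumptions: each HSP is reduced to its coordinates on one contig and strand for one reference protein, and the names `HSP`, `merge_hsps` and the `max_gap` tolerance are hypothetical, since the slides do not describe the actual merging rule.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """An HSP reduced to its contig coordinates (hypothetical representation)."""
    start: int  # inclusive
    end: int    # exclusive


def merge_hsps(hsps: List[HSP], max_gap: int = 0) -> List[HSP]:
    """Merge HSPs that overlap or lie within max_gap positions of each other."""
    merged: List[HSP] = []
    for hsp in sorted(hsps):
        if merged and hsp.start <= merged[-1].end + max_gap:
            # Close enough to the previous region: grow it instead of adding a new one.
            last = merged[-1]
            merged[-1] = HSP(last.start, max(last.end, hsp.end))
        else:
            merged.append(hsp)
    return merged
```

With a large enough `max_gap`, all HSPs of one protein on a contig collapse into a single similarity region, which is the situation the slide describes.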
Step 3: CDS definition
Search for start/stop signals
We then search for start/stop signals upstream and downstream of the region with high similarity to the protein
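A rough sketch of this extension step, under several assumptions that are not in the slides: the region is on the forward strand and already in frame, the nearest upstream start codon is taken, and ATG/GTG/TTG are treated as start codons. The names `CDS` and `extend_region` are hypothetical. A region with no start or stop codon is still returned, only flagged.

```python
from typing import NamedTuple

START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


class CDS(NamedTuple):
    start: int       # inclusive, 0-based
    end: int         # exclusive
    has_start: bool  # False: kept, but marked as lacking a start codon
    has_stop: bool   # False: kept, but marked as lacking a stop codon


def extend_region(contig: str, start: int, end: int) -> CDS:
    """Extend an in-frame similarity region [start, end) on the forward strand."""
    contig = contig.upper()

    # Walk upstream codon by codon until a start codon, a stop codon
    # or the contig edge is reached.
    new_start, has_start = start, False
    pos = start
    while pos >= 0:
        codon = contig[pos:pos + 3]
        if codon in START_CODONS:
            new_start, has_start = pos, True
            break
        if codon in STOP_CODONS:
            break  # the open frame ends here without a start codon
        new_start = pos
        pos -= 3

    # Walk downstream until the first in-frame stop codon or the contig edge.
    new_end, has_stop = end, False
    pos = end
    while pos + 3 <= len(contig):
        new_end = pos + 3
        if contig[pos:pos + 3] in STOP_CODONS:
            has_stop = True
            break
        pos += 3

    return CDS(new_start, new_end, has_start, has_stop)
```

Returning the flags instead of discarding incomplete regions matters for genomes split into several contigs, where a gene can run off the end of a contig.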
Step 3: CDS definition
Even if we don’t find a start/stop codon for a given CDS, we keep it
We just mark it accordingly
Step 4: Solving conflicts
Duplicates
Step 4: Solving conflicts
Overlapping CDS
Step 5: RNA search
Blastn
Input contigs (nt)
Reference RNAs (nt) are searched in the contigs
Step 6: Incorporation of RNA genes
Definition of RNA genes
Input contigs (nt)
Step 6: Incorporation of RNA genes
Conflicts with protein coding genes are solved
If we find both a protein coding gene and an RNA gene in a particular region, the RNA gene is selected over the protein coding one
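The priority rule can be sketched as a simple filter. This assumes that any coordinate overlap on the same contig counts as a conflict and ignores strand. The slides do not give the exact overlap criterion, and the names `Gene`, `overlaps` and `resolve_rna_conflicts` are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    contig: str
    start: int  # inclusive
    end: int    # exclusive
    kind: str   # "protein" or "rna"


def overlaps(a: Gene, b: Gene) -> bool:
    """True if both genes share at least one position on the same contig."""
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def resolve_rna_conflicts(genes: List[Gene]) -> List[Gene]:
    """Keep every RNA gene; drop protein coding genes that overlap an RNA gene."""
    rnas = [g for g in genes if g.kind == "rna"]
    return [g for g in genes
            if g.kind == "rna" or not any(overlaps(g, r) for r in rnas)]
```

For example, a protein coding gene at 100-400 on a contig is dropped when an RNA gene is annotated at 350-500 on the same contig, while a protein coding gene at 600-900 is kept.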
Finally
[Figure: the sequence TGGATGTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCTGA with regions labeled A, B, C, D, E]
Comparisons
We’ve compared the NCBI annotations for Escherichia coli str. K-12 substr. MG1655 (RefSeq ID NC_000913) with the BG7 annotations
Comparisons
The results we got were:
Feature                NCBI   BG7
Protein coding genes   4145   4370 (1), 4951 (2)
RNA                    175    156
(1) Selected genes
(2) All detected genes: selected + dismissed
Comparisons
Conclusions
Even in a disadvantageous situation (not an NGS project, and a very well annotated genome), we got in a single annotation round:
- ~95% of the NCBI protein coding genes
- ~89% of the NCBI RNA genes
- 419 new proteins detected
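These percentages can be related to the counts in the comparison table. The arithmetic below is one reading that reproduces them, not something the slides state: it assumes the 419 new proteins are BG7-selected genes with no NCBI counterpart, and that all 156 BG7 RNA genes match NCBI RNA genes.

```python
# Counts taken from the comparison table.
ncbi_protein, bg7_selected, bg7_all = 4145, 4370, 4951
ncbi_rna, bg7_rna = 175, 156
new_proteins = 419

# Genes detected by BG7 but dismissed during conflict solving.
dismissed = bg7_all - bg7_selected

# Assumption: selected genes that are not "new" correspond to NCBI genes.
shared_protein = bg7_selected - new_proteins

protein_recovery = shared_protein / ncbi_protein
# Assumption: every BG7 RNA gene corresponds to an NCBI RNA gene.
rna_recovery = bg7_rna / ncbi_rna

print(f"dismissed genes: {dismissed}")
print(f"protein coding genes recovered: {protein_recovery:.1%}")
print(f"RNA genes recovered: {rna_recovery:.1%}")
```

Under these assumptions the figures come out at about 95.3% and 89.1%, in line with the rounded values on the slide.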
Upcoming features
Improvements now focused on:
- Overlap solving phase
- Detection of very small proteins
And any new need we find while using it
Thanks:
Oh no sequences! team
Raquel Tobes: Bioinformatician, main advisor
Pablo Pareja: Main developer
Eduardo Pareja: Scientific advisor
Eduardo Pareja-Tobes: Mathematician, advisor
Carmen Torrecillas: Junior Bioinformatician
Marina Manrique: Bioinformatician
Thanks for your attention!