
BG7, a new system for bacterial genome annotation designed for NGS data


Slides from the talk presented at the conference "Applied Bioinformatics and Public Health" at Cambridge during 1-3 June 2011


Page 1: BG7, a new system for bacterial genome annotation designed for NGS data

BG7: A new system for bacterial genome annotation designed for NGS data

www.ohnosequences.com www.era7bioinformatics.com

Page 2: BG7, a new system for bacterial genome annotation designed for NGS data

Motivation

Features

How it works?

Comparisons

Upcoming features

Motivation

The need for a system designed specifically for annotating NGS data, with a pipeline unbiased by existing annotation systems designed for Sanger sequences

The need for a versatile system able to annotate genes even at the preliminary assembly stage of the genome

Special focus is given to the detection of “unexpected proteins” without orthologs in closely related genomes (horizontally acquired genes, phage genes, plasmid genes…)

A fast, automated, and scalable process to face the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies

Page 3: BG7, a new system for bacterial genome annotation designed for NGS data

Features

1. A new approach

2. It’s tolerant to NGS errors

3. It’s based on cloud computing

4. It uses bio4j

Page 4: BG7, a new system for bacterial genome annotation designed for NGS data

Features: Approach

ORF prediction is based on protein similarity

Page 5: BG7, a new system for bacterial genome annotation designed for NGS data

Features: Approach

Use as much information as you can (not just start/stop signals)

TGGATGTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCTGA

A B C D E

Page 6: BG7, a new system for bacterial genome annotation designed for NGS data

Features: Approach

Standard: Sequence → ORF prediction (Glimmer) → Function prediction (Blast)

BG7: Sequence → Protein searching (Blast) → CDS prediction → RNA searching (Blast)

Page 7: BG7, a new system for bacterial genome annotation designed for NGS data

Features: NGS errors

Issue | Technology
Genomes in several contigs | All
Sequencing errors in start/stop codons | Illumina substitutions, 454 indels
Frameshifts | 454 indels
Horizontal gene transfer | None

The BG7 system is tolerant to all these issues

Page 8: BG7, a new system for bacterial genome annotation designed for NGS data

Features: Cloud computing

AWS (Amazon Web Services)

Completely scalable

On demand

Cheap

Fast

1 genome in ~2 hours; 100 genomes in ~2 hours once you’ve got the reference proteins

Useful in tracking outbreaks

Page 9: BG7, a new system for bacterial genome annotation designed for NGS data

Features: bio4j

It uses bio4j (www.bio4j.com)

Much richer annotations

Page 10: BG7, a new system for bacterial genome annotation designed for NGS data

How it works?

Page 11: BG7, a new system for bacterial genome annotation designed for NGS data

1. Expert manual selection of reference sequences

2. Protein search
   • Blast

3. CDS definition
   • HSPs merge
   • Extension of the similarity region, searching for start/stop signals

4. Solving conflicts
   • Solving duplicates
   • Solving overlaps

5. RNA search
   • Blast

6. Incorporation of RNA genes
   • Definition of RNA genes
   • Conflicts with previously annotated protein-coding genes are solved

Page 12: BG7, a new system for bacterial genome annotation designed for NGS data

Step 2: Protein search with tBlastn

Input contigs (aa)

Reference proteins (aa) are searched in the contig sequences
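As a minimal sketch of how the hits from this step could be read, assume the search was run with a tab-separated output of the twelve standard BLAST+ columns (the `-outfmt 6` layout). The slides do not say which output format BG7 uses, so the `Hsp` record and `parse_hsp` function below are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hsp:
    """One high-scoring segment pair (hypothetical record, not BG7's own type)."""
    protein_id: str  # reference protein (the query)
    contig_id: str   # contig where the protein was found (the subject)
    start: int       # leftmost contig coordinate, 1-based inclusive
    end: int         # rightmost contig coordinate, 1-based inclusive
    strand: str      # "+" or "-"
    evalue: float


def parse_hsp(line: str) -> Hsp:
    """Parse one line of 12-column tabular BLAST output (assumed format)."""
    fields = line.rstrip("\n").split("\t")
    protein_id, contig_id = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    evalue = float(fields[10])
    # For translated searches, a hit on the reverse strand is reported
    # with the subject start greater than the subject end.
    strand = "+" if sstart <= send else "-"
    return Hsp(protein_id, contig_id, min(sstart, send), max(sstart, send),
               strand, evalue)


example = "P1\tcontig_1\t98.5\t200\t3\t0\t1\t200\t900\t301\t1e-50\t350"
hit = parse_hsp(example)
print(hit.contig_id, hit.start, hit.end, hit.strand)
```

Normalizing the coordinates so that `start <= end` and keeping the strand separately makes later interval operations (merging, overlap checks) simpler.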

Page 13: BG7, a new system for bacterial genome annotation designed for NGS data

Step 3: CDS definition

Merging HSPs

Input contigs (aa)

Several HSPs

Protein

Page 14: BG7, a new system for bacterial genome annotation designed for NGS data

Step 3: CDS definition

Merging HSPs

Input contigs (aa)

Several HSPs

Protein

We merge the HSPs to form a single similarity region
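The merge can be pictured as ordinary interval merging. The sketch below is an illustration under stated assumptions, not BG7's actual implementation: all HSPs are assumed to come from the same reference protein on the same contig and strand, coordinates are 1-based and inclusive, and the `max_gap` parameter (how far apart two HSPs may be and still be joined) is a hypothetical knob that the slides do not mention.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the contig, 1-based inclusive


def merge_hsps(hsps: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge overlapping or nearby HSP intervals into similarity regions.

    Two intervals are joined when the gap between them is at most
    `max_gap` positions (0 means they must overlap or be adjacent).
    """
    merged: List[Interval] = []
    for start, end in sorted(hsps):
        if merged and start <= merged[-1][1] + max_gap + 1:
            # Extend the current region instead of opening a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


regions = merge_hsps([(100, 250), (240, 400), (900, 1000)])
print(regions)
```

With a large enough `max_gap`, fragments separated by a frameshift or a low-similarity stretch would collapse into one region, which is one plausible way to tolerate the 454 indels mentioned earlier.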

Page 15: BG7, a new system for bacterial genome annotation designed for NGS data

Step 3: CDS definition

Search for start/stop signals

We then search for start/stop signals upstream and downstream of the region with high similarity to the protein
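A minimal sketch of this extension, assuming a forward-strand region given as 0-based half-open coordinates that is already in the coding frame. The function name `extend_to_signals`, the choice of the nearest upstream start codon, and the restriction to the three most common bacterial start codons are all assumptions made for illustration; the slides do not specify these details.

```python
from typing import Optional, Tuple

START_CODONS = {"ATG", "GTG", "TTG"}  # common bacterial starts (assumption)
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_signals(contig: str, start: int, end: int
                      ) -> Tuple[Optional[int], Optional[int]]:
    """Extend a similarity region [start, end) to start/stop codons.

    Returns (cds_start, cds_end): the index of the first base of the
    start codon and the index just past the stop codon. Either value is
    None when no signal is found, so the caller can still keep the region.
    """
    seq = contig.upper()

    # Walk upstream codon by codon until a start codon is found.
    cds_start = None
    pos = start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # reached the end of an upstream ORF: no start in frame
        if codon in START_CODONS:
            cds_start = pos
            break
        pos -= 3

    # Walk downstream codon by codon until a stop codon is found.
    cds_end = None
    pos = end
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3] in STOP_CODONS:
            cds_end = pos + 3
            break
        pos += 3

    return cds_start, cds_end


contig = "CCCATGAAAGGGCCCTAAGG"
print(extend_to_signals(contig, 9, 12))
```

Returning `None` instead of raising an error means a region that runs off the end of a contig, or whose start codon was damaged by a sequencing error, is reported as incomplete rather than lost.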

Page 16: BG7, a new system for bacterial genome annotation designed for NGS data

Step 3: CDS definition

Even if we don’t find a start/stop codon for a given CDS, we keep it

We just mark it accordingly

Page 17: BG7, a new system for bacterial genome annotation designed for NGS data

Step 4: Solving conflicts

Duplicates

Page 18: BG7, a new system for bacterial genome annotation designed for NGS data

Step 4: Solving conflicts

Duplicates

Page 19: BG7, a new system for bacterial genome annotation designed for NGS data

Step 4: Solving conflicts

Overlapping CDS

Page 20: BG7, a new system for bacterial genome annotation designed for NGS data

Step 5: RNA search

Blastn

Input contigs (nt)

Reference RNAs (nt) are searched in the contigs

Page 21: BG7, a new system for bacterial genome annotation designed for NGS data

Step 6: Incorporation of RNA genes

Definition of RNA genes

Input contigs (nt)

Page 22: BG7, a new system for bacterial genome annotation designed for NGS data

Step 6: Incorporation of RNA genes

Conflicts with protein-coding genes are solved

If we find both a protein-coding gene and an RNA gene in a particular region, the RNA gene is selected over the protein-coding one
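The priority rule can be sketched as a simple filter. This is an illustration only: it assumes that any coordinate overlap counts as a conflict and ignores strand, neither of which the slides confirm, and the `Gene` record and function names are hypothetical.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # 1-based inclusive
    end: int    # 1-based inclusive
    kind: str   # "rna" or "protein"


def overlaps(a: Gene, b: Gene) -> bool:
    """True when the two genes share at least one position."""
    return a.start <= b.end and b.start <= a.end


def resolve_rna_conflicts(genes: List[Gene]) -> List[Gene]:
    """Keep every RNA gene; drop protein-coding genes that overlap one."""
    rnas = [g for g in genes if g.kind == "rna"]
    return [
        g for g in genes
        if g.kind == "rna" or not any(overlaps(g, r) for r in rnas)
    ]


genes = [
    Gene("cdsA", 100, 400, "protein"),
    Gene("rrnX", 350, 900, "rna"),
    Gene("cdsB", 1000, 1300, "protein"),
]
kept = resolve_rna_conflicts(genes)
print([g.name for g in kept])
```

A stricter variant could require a minimum overlap length before dismissing the protein-coding gene; that threshold would be a further assumption.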

Page 23: BG7, a new system for bacterial genome annotation designed for NGS data

Finally

TGGATGTGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCTGA

A B C D E

Page 24: BG7, a new system for bacterial genome annotation designed for NGS data

Comparisons

We’ve compared the NCBI annotations for Escherichia coli str. K-12 substr. MG1655 (RefSeq ID NC_000913) with the BG7 annotations

Page 25: BG7, a new system for bacterial genome annotation designed for NGS data

Comparisons

The results we got were:

Feature | NCBI | BG7
Protein coding genes | 4145 | 4370 (1), 4951 (2)
RNA | 175 | 156

(1) Selected genes
(2) All detected genes: selected + dismissed

Page 26: BG7, a new system for bacterial genome annotation designed for NGS data

Comparisons

Page 27: BG7, a new system for bacterial genome annotation designed for NGS data

Comparisons

Conclusions

Even in a disadvantageous situation (not an NGS project, and a very well annotated genome), in a single annotation round we got:

- ~95% of the NCBI protein coding genes
- ~89% of the NCBI RNA genes
- 419 new proteins detected
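These percentages can be roughly cross-checked against the counts in the comparison table, reading the BG7 entry for selected protein coding genes as 4370 with a footnote marker. The check below rests on two assumptions that the slides do not state: that the 419 new proteins are part of the 4370 selected genes, and that all 156 BG7 RNA genes correspond to NCBI RNA genes.

```python
# Counts taken from the comparison table.
ncbi_protein = 4145
ncbi_rna = 175
bg7_selected_protein = 4370
bg7_rna = 156
bg7_new_protein = 419

# Assumption: selected BG7 genes minus the new ones are shared with NCBI.
bg7_shared_protein = bg7_selected_protein - bg7_new_protein

protein_recovery = 100 * bg7_shared_protein / ncbi_protein
rna_recovery = 100 * bg7_rna / ncbi_rna

print(f"protein coding genes recovered: {protein_recovery:.1f}%")
print(f"RNA genes recovered: {rna_recovery:.1f}%")
```

Under those assumptions the figures come out at about 95.3% and 89.1%, which agrees with the rounded values quoted on the slide.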

Page 28: BG7, a new system for bacterial genome annotation designed for NGS data

Upcoming features

Improvements are now focused on:

- Overlap-solving phase
- Detection of very small proteins

And any new need we find while using it

Page 29: BG7, a new system for bacterial genome annotation designed for NGS data

Thanks:

Oh no sequences! team

Raquel Tobes: Bioinformatician, main advisor

Pablo Pareja: Main developer

Eduardo Pareja: Scientific advisor

Eduardo Pareja-Tobes: Mathematician, advisor

Carmen Torrecillas: Junior Bioinformatician

Marina Manrique: Bioinformatician

Page 30: BG7, a new system for bacterial genome annotation designed for NGS data

Thanks for your attention!
