15 January 2006, PAG XIV, San Diego

Rémy Bruggmann, MIPS/IBI, GSF

A Bioinformatic Framework to Unravel the Secrets of the Tomato Genome


Page 2

Outline

Introduction

Data management

Annotation

Training/Test gene set

Summary

Page 3

MIPS' Look at the Green Side of Life

– genome projects and database activities –

Arabidopsis thaliana

Arabidopsis lyrata *

Capsella rubella *

Maize

Rice

Medicago

Lotus

Solanum lycopersicum

Page 4

MIPS' Look at the Green Side of Life

– genome projects and database activities –

Need to streamline and unify databases as well as analytical schemas and operation routines

Strong synergism and very robust

Risk of losing flexibility and "custom-tailored" attractiveness

Awareness that not every genome and every community "is just the same"

Page 5

From Center-Centric Strategies to Distributed Approaches

Typically, genome projects undergo particular phases:

Sequenced BACs are annotated

Gene models are published to the community

Potentially generates competition rather than collaboration among groups

Page 6

From Center-Centric Strategies to Distributed Approaches

Consequences can be:

Underlying analytical procedures are not always tested, trained and evaluated

More or less pronounced differences exist between groups, leading to differing, contradictory and conflicting data

Page 7

Aim of all groups:

"information-enriched, high-quality genome backbone to address genome-scale biological questions"

Page 8

From Center-Centric Strategies to Distributed Approaches

An example ...

International Medicago Genome Annotation Group

Consists of groups participating in the annotation/bioinformatics programs of either the International or the European Medicago Genome Initiative

Agreement on common annotation standards, data exchange formats and naming conventions

Aims to produce and provide a unified, high-quality Medicago data set

Page 9

From Center-Centric Strategies to Distributed Approaches

Advantages of sharing efforts in genome annotation within a common annotation pipeline

Page 10

From Center-Centric Strategies to Distributed Approaches

A common annotation pipeline prevents:

(i) duplicated effort

(ii) conflicts resulting from different annotation "standards"

Ensures high-quality annotation standards

Ensures common (gene) naming and a common dataset

Integrates and profits from the knowledge and expertise of the individual groups

Page 11

Data management

All data should be organized in a genome database

Page 12

Wishlist for a modern genome db

Complete

Comprehensive

Up-to-date

Integrated

User interface

Application interface

State-of-the-art automatic analysis

Adaptable

Cross-genome comparison

... low cost, low manpower ...

Page 13

PlantsDB Philosophy

Plants Genome Resource: provides and integrates sequence data from European plant sequencing consortia along with publicly available data from the international initiative

PlantsDB communicates bioinformatic analysis data (visualization, genetic elements, structural data, ontologies, domains, ...; BLAST, browse and search, ...; comparative analysis)

Integration: provides a distributed network to integrate and retrieve data from heterogeneous resources using BioMOBY (connection to other plant DBs, PlaNET)

Page 14

Preliminary Annotation Pipeline

Towards a preliminary annotation

Page 15

Repeat Detection

[Diagram: Repeat Ontology and Repeat Database feed RepeatMasker, which yields masked sequences (passed on to gene prediction) and a repeat annotation (GAME XML)]

Page 16

Gene Prediction

External databases: EST DB (EST assemblies), protein DB (e.g. SwissProt)

Gene prediction programs:

► GenomeThreader

► FGenesH++/ProtMap

► GeneMarkHMM

GAME XML: document of computational results

Manual annotation in Apollo Genome Viewer

PlantsDB

Web access: GBrowse

Page 17

First Results

Page 18

RepeatMasker

5.8 Mb analysed (48 BACs)

~6.7% repetitive elements (<0.2% to 23% per BAC)

~1 min/100 kb

Whole genome (euchromatic part): ~2 days

[Figure: repeat content [%] per BAC, axis 0 to 25]

State: December 2005
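The per-BAC repeat content reported above can be recomputed from masked sequences. The following is a minimal sketch, not the pipeline's actual code. It assumes repeats are marked either as N/X (hard masking) or as lowercase bases (soft masking, e.g. RepeatMasker's -xsmall option). The function name and the toy sequences are hypothetical, and genuine N bases in a BAC would be counted as repeats by this simple approach.

```python
def repeat_content_percent(masked_seq: str) -> float:
    """Return the percentage of masked bases in a sequence.

    Assumption: a base counts as masked if it is N or X (hard masking)
    or lowercase (soft masking).
    """
    seq = "".join(masked_seq.split())  # drop whitespace and line breaks
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base in "NX" or base.islower())
    return 100.0 * masked / len(seq)


# Toy sequences for illustration only (not real tomato data)
bacs = {
    "BAC_A": "ACGTACGTNNNNACGTACGT",  # 4 of 20 bases hard-masked
    "BAC_B": "ACGTACGTACGTACGTACGT",  # nothing masked
    "BAC_C": "ACGTacgtacACGTACGTAC",  # 6 of 20 bases soft-masked
}

per_bac = {name: repeat_content_percent(seq) for name, seq in bacs.items()}

# Pooled value over all BACs, weighted by sequence length
pooled = repeat_content_percent("".join(bacs.values()))
```

The pooled value is weighted by length, so it can differ from the plain average of the per-BAC percentages when BAC sizes vary.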

Page 19

Preliminary Results

Comparison of different gene finders

Page 20

ab initio predictions

[Figure with tracks labelled EST/TC, FGeneSH, GeneMark, EST/TC]

Page 21

ab initio predictions

Page 22

ab initio predictions

At present, FGeneSH++ and GeneMarkHMM often generate incomplete or wrong gene models

No matrices trained for tomato are available

Tomato matrices will increase prediction quality dramatically

Collection of annotated high-quality genes as a training/test set for EuGene, FGeneSH, GeneMarkHMM, ...

Page 23

Training/Test Gene Set

How can we get a training/test set?

Map available tomato cDNA/ESTs to the BACs (use only high-confidence matches)

Link experimental data to the gene models

Use this gene set for ab initio gene finder training
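The "high-confidence matches only" step could be sketched as a simple filter over spliced alignments. This is an illustration under stated assumptions, not the criteria actually used: the record fields, the function name and both thresholds (95% coverage, 98% identity) are hypothetical, since the slides do not give cut-off values.

```python
from dataclasses import dataclass


@dataclass
class SplicedAlignment:
    est_id: str
    bac_id: str
    coverage: float  # fraction of the cDNA/EST length that is aligned (0 to 1)
    identity: float  # fraction of identical bases in the aligned part (0 to 1)


def high_confidence(alignments, min_coverage=0.95, min_identity=0.98):
    """Keep only alignments that pass both (assumed) thresholds."""
    return [
        a for a in alignments
        if a.coverage >= min_coverage and a.identity >= min_identity
    ]


# Toy records for illustration only
alignments = [
    SplicedAlignment("est1", "bacA", coverage=0.99, identity=0.99),
    SplicedAlignment("est2", "bacA", coverage=0.80, identity=0.99),  # partial match
    SplicedAlignment("est3", "bacB", coverage=0.99, identity=0.90),  # diverged match
]

kept = high_confidence(alignments)
kept_ids = [a.est_id for a in kept]
```

Stricter thresholds give a smaller but cleaner gene set, which matters because errors in the training genes propagate into the trained gene finder.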

Page 24

GenomeThreader

GenomeThreader used for EST/cDNA mapping:

Similarity-based approach: ESTs/proteins are used to predict gene structure via optimal spliced alignments

Offers many options (full user control)

Incremental updates (avoid a lot of duplicated computation)

Improved GeneSeqer

Page 25

GenomeThreader - calculations

DB                   Entries   Size [MB]   Calc time/100 kb   Whole genome
Tomato                 32401          27   27 s               ~2.8 days (*)
MicroTom               26363          21   22 s               (*)
Potato                 38239          34   23 s               (*)
Tobacco                28661          20   39 s               (*)
Arabidopsis cDNAs      31939          45   10 s               0.3 days
Dicots                404822         311   170 s              4.3 days
rice cds               15639          21   8 s                0.2 days
Uni_trembl Plants     185564          74   38 s               1.0 day
Uniprot_swissprot     181571          82   8 s                0.2 days
Nonred               1675230         662   437 s              11.1 days
Total                2834224        1433   14 min             22 days

(*) ~2.8 days for Tomato, MicroTom, Potato and Tobacco combined

(single CPU, euchromatic part)
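The whole-genome column follows from the per-100 kb timings by simple scaling. The sketch below assumes a euchromatic portion of about 220 Mb. That size is not stated on the slides; it is an assumption chosen because it approximately reproduces the listed values. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def whole_genome_days(seconds_per_100kb: float, genome_mb: float = 220.0) -> float:
    """Scale a per-100 kb runtime to the whole (euchromatic) genome.

    genome_mb is an assumed size in megabases, not a value from the slides.
    """
    units_of_100kb = genome_mb * 1000 / 100  # 1 Mb = 1000 kb
    return seconds_per_100kb * units_of_100kb / SECONDS_PER_DAY


dicots_days = whole_genome_days(170)              # slide lists 4.3 days
nonred_days = whole_genome_days(437)              # slide lists 11.1 days
sol_days = whole_genome_days(27 + 22 + 23 + 39)   # four SOL sets, slide lists ~2.8 days
```

Because the runs are independent per database and per BAC, the single-CPU totals would shrink roughly in proportion to the number of CPUs used.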

Page 26

Example

[Figure with tracks labelled Tomato, MicroTom, Potato, Tobacco]

Page 27

Examples - UK

Page 28

Example

Page 29

Number of high quality genes

Number of genes: 164 (covered completely by cDNA/ESTs)

~3.4 genes/BAC (range: 0 - 9 genes/BAC)

These genes can be used to train gene finders

(Only very good alignments considered)

[Figure: number of genes per BAC, axis 0 to 10]

Page 30

Gene Finder

Which program can be trained for tomato?

One possibility is EuGene (VIB Gent)

- performed well e.g. for Arabidopsis and Medicago

- available as soon as test/training gene set is large enough

Page 31

EuGene - overview

[Diagram: information sources combined by EuGene through plugins]

Statistical contents: DNA Markov, AA Markov

Splice sites: NetGene2, SplicePredictor, SpliceMachine, GeneSplicer

Start sites: NetStart, SpliceMachine, ATRPred

Similarities: protein similarities, EST similarities, FL cDNA, exon conservation, repeats

Plugin training (TRAINING): needs one dataset

Optimize plugin combination (OPTIM): needs one dataset

Test (TEST): needs one new dataset
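Since training, optimization and testing each need a dataset of their own, the collected gene set has to be divided into three non-overlapping parts. A minimal sketch of such a split follows. The split fractions, the fixed random seed and the gene identifiers are assumptions for illustration and are not taken from the EuGene documentation or the slides.

```python
import random


def split_gene_set(gene_ids, fractions=(0.5, 0.25), seed=0):
    """Split gene IDs into disjoint training, optim and test sets.

    fractions gives the (assumed) shares for training and optim;
    whatever remains becomes the test set.
    """
    ids = list(gene_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = int(len(ids) * fractions[0])
    n_optim = int(len(ids) * fractions[1])
    return {
        "training": ids[:n_train],
        "optim": ids[n_train:n_train + n_optim],
        "test": ids[n_train + n_optim:],
    }


# 164 hypothetical gene identifiers, matching the number of genes found so far
genes = [f"gene_{i:03d}" for i in range(164)]
sets = split_gene_set(genes)
sizes = {name: len(members) for name, members in sets.items()}
```

Keeping the test set disjoint from the other two is what makes the final accuracy estimate meaningful.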

Page 32

EuGene

First round training:

- 500 high quality tomato genes

- statistical models on codon usage and splice sites of Arabidopsis will be used

Second round training:

- 2000 high quality tomato genes

- Build a tomato-only version of EuGene

Approx. 150 BACs needed for first round training
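The estimate of roughly 150 BACs can be reproduced from the gene yield observed so far. The sketch below assumes that the 164 high-quality genes come from the same 48 BACs analysed with RepeatMasker, which is consistent with the reported ~3.4 genes/BAC, and that future BACs yield genes at the same rate. Variable and function names are hypothetical.

```python
import math

genes_found = 164    # genes covered completely by cDNA/ESTs
bacs_analysed = 48   # assumption: the same 48 BACs as in the RepeatMasker analysis
genes_per_bac = genes_found / bacs_analysed  # about 3.4


def bacs_needed(target_genes: int, yield_per_bac: float) -> int:
    """Number of BACs required to collect target_genes at a constant yield."""
    return math.ceil(target_genes / yield_per_bac)


first_round_bacs = bacs_needed(500, genes_per_bac)
additional_bacs = first_round_bacs - bacs_analysed
```

Under these assumptions the first round needs about 147 BACs, that is about 99 more than analysed so far, in line with the "another 100 BACs" stated in the summary.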

Page 33

Current state of sequenced BACs

Total number of BACs:

- unfinished: 71

- finished: 87

- available: 52

Page 34

Summary

ab initio gene finders are not yet calibrated to tomato

A test/training gene set is needed to calibrate the gene finders

We need another 100 BACs to get enough genes for a first round training of EuGene

GenomeThreader produces good alignments only with ESTs from SOL species (tomato, potato, tobacco)

More repeats will be detected (and will be included in the RepeatMasker library)

Page 35

Acknowledgments

Automated annotation

MIPS: Heidrun Gundlach, Georg Haberer, Manuel Spannagl, Klaus F.X. Mayer

Manual annotation/curation/web site (Chromosome 4)

Imperial College: Daniel Buchan, James Abbot, Sarah Butcher, Gerard Bishop

Sequencing & assembly (Chromosome 4)

Sanger Institute: Christine Nicholson, Sean Humphray

MPIZ Köln: Heiko Schoof

EuGene

VIB Gent: Stephane Rombauts

GenomeThreader

University of Hamburg: Gordon Gremme, Stefan Kurtz, Volker Brendel
