BRC6 28th October 2008 Collective annotation of the Ixodes scapularis genome: VectorBase, MSCs and the tick community. Daniel Lawson, VectorBase

BRC6 28th October 2008

Collective annotation of the Ixodes scapularis genome: VectorBase, MSCs and the tick community.

Daniel Lawson, VectorBase

Arthropod vectors of human pathogens

• Lutzomyia
• Phlebotomus
• Culex
• Rhodnius
• Anopheles
• Glossina
• Aedes
• Pediculus
• Ixodes

Deer tick Ixodes scapularis

• Vector of Lyme disease (spirochete Borrelia burgdorferi)
• Estimated genome size of 2.1 Gb
• Sequenced strain: Wikel
  • 12th generation from ticks sourced from New York, Oklahoma & Connecticut
• First chelicerate genome to be sequenced

Genome annotation cycle

Automatic gene build

Assembly

Community annotations

Manual annotations

Other genomes, gene sets

Repeat library (TEs etc.)

ESTs, cDNAs

Protein domains

Generating sequence

• Sequencing undertaken by established sequencing centres (e.g. Broad, JCVI)
• Initial assembly annotated in collaboration with the sequencing centre(s)
• 19,300,000 trace reads generated
• Approx. 6x WGS
• 570K BAC end sequences
• Assembly produced at JCVI
• 194K EST sequences
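
The read count and coverage quoted above can be cross-checked with simple arithmetic. The sketch below assumes that all 19.3 million trace reads contribute to the quoted 6x figure and that the 2.1 Gb estimate from the earlier slide is the genome size used for it; neither assumption is stated on the slide.

```python
# Figures quoted on the slides.
TRACE_READS = 19_300_000
WGS_COVERAGE = 6.0        # "Approx. 6x WGS"
GENOME_SIZE_BP = 2.1e9    # estimated genome size, 2.1 Gb


def implied_mean_read_length(reads, coverage, genome_size_bp):
    """Mean read length (bp) implied by coverage = reads * length / genome size."""
    return coverage * genome_size_bp / reads


mean_len = implied_mean_read_length(TRACE_READS, WGS_COVERAGE, GENOME_SIZE_BP)
print(f"Implied mean read length: {mean_len:.0f} bp")
```

Under these assumptions the implied mean read length is in the region of 650 bp.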

Assembly statistics

• This WGS project has the project accession ABJB000000000. The current version of the project (01) has the accession number ABJB010000000, and consists of 1,141,594 scaffolds (ABJB010000001-ABJB011141594).

• Released assembly IscaW1
• 570,637 contigs
• 369,495 supercontigs
• Assembled coverage of 3.8x
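
The accession range quoted above can be checked against the stated record count. This is a minimal sketch that assumes the accessions follow the pattern seen in the slide: a four-letter project prefix, a two-digit version, and a seven-digit serial number. The function names are illustrative only.

```python
import re

# Assumed layout: 4-letter prefix, 2-digit version, 7-digit serial.
ACCESSION_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{7})$")


def parse_wgs_accession(accession):
    """Split an accession such as ABJB010000001 into (prefix, version, serial)."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, version, serial = match.groups()
    return prefix, int(version), int(serial)


def records_in_range(first, last):
    """Number of records in an inclusive accession range of one project version."""
    prefix_a, version_a, serial_a = parse_wgs_accession(first)
    prefix_b, version_b, serial_b = parse_wgs_accession(last)
    if (prefix_a, version_a) != (prefix_b, version_b):
        raise ValueError("accessions belong to different projects or versions")
    return serial_b - serial_a + 1


n_records = records_in_range("ABJB010000001", "ABJB011141594")
print(n_records)
```

The result agrees with the 1,141,594 records quoted for version 01 of the project.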

Preparing for gene build

• Repeatmasking
  • Analyses to identify repeat elements
    • RepeatScout
    • RECON
  • Standard tandem-repeat & low-complexity filtering
• Collate data sets
  • Transcripts (cDNA & EST data)
  • Peptides (taxonomic groupings, inc. Daphnia pulex)
• Train gene predictors, mainly Augustus (JCVI)
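
As an illustration of what repeatmasking does to the sequence before gene prediction, the sketch below masks a set of repeat intervals in two common ways: soft-masking (lowercase) and hard-masking (replacement with N). The slide does not say which form was used for Ixodes, and the interval list and function names here are hypothetical.

```python
def soft_mask(sequence, intervals):
    """Lowercase bases inside each (start, end) interval (0-based, end-exclusive)."""
    bases = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def hard_mask(sequence, intervals):
    """Replace bases inside each (start, end) interval with 'N'."""
    bases = list(sequence.upper())
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = "N"
    return "".join(bases)


# Hypothetical repeat annotation on a toy sequence.
toy_sequence = "ACGTACGTAC"
toy_repeats = [(2, 5)]
print(soft_mask(toy_sequence, toy_repeats))
print(hard_mask(toy_sequence, toy_repeats))
```

Soft-masking keeps the underlying bases available to downstream tools, while hard-masking removes them from consideration entirely.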

Annotation plan

• First-pass gene prediction
  • Focused on protein-coding genes (CDSs)
• Semi-automated approach
  • This is not manual curation
• Involvement of community where possible
• Timely delivery of gene set

Gene Prediction

• Each group/centre has its own gene prediction pipeline/protocol
• Each group produces a first-pass 'best guess' set of predictions
  • 0.5 sets, public release
• These sets are merged into a single set
  • 1.0 set, not released
• Quality control activities
  • 1.1 set, public release
• Which is annotated with protein features
• ... and released to the wider world

Merging gene predictions

Reduce to single predictions per locus

Compare exon/intron structures

Gene set #1 vs. Gene set #2

• Identical structures
• Compatible structures
• Different structures
• Merge/Split structures
• Complex
• No Map

Add isoform predictions based on EST/Peptide data

Canonical gene set
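
The comparison of exon/intron structures between two gene sets can be sketched as below. The category definitions are assumptions made for illustration, not the rules used in the actual merge: "compatible" is taken to mean the same intron chain with different terminal boundaries, "merge/split" means one model overlaps several models in the other set, and the "complex" category and strand are not modelled. Exons are (start, end) pairs, 0-based and end-exclusive.

```python
def introns(exons):
    """Intron chain implied by a list of exons."""
    ordered = sorted(exons)
    return [(ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1)]


def spans_overlap(exons_a, exons_b):
    """True if the genomic spans of the two models overlap."""
    start_a, end_a = min(s for s, _ in exons_a), max(e for _, e in exons_a)
    start_b, end_b = min(s for s, _ in exons_b), max(e for _, e in exons_b)
    return start_a < end_b and start_b < end_a


def classify_pair(exons_a, exons_b):
    """Classify one model against one model from the other gene set."""
    if not spans_overlap(exons_a, exons_b):
        return "no map"
    if sorted(exons_a) == sorted(exons_b):
        return "identical"
    chain_a, chain_b = introns(exons_a), introns(exons_b)
    if chain_a and chain_a == chain_b:
        return "compatible"  # same introns, different terminal exon ends
    return "different"


def classify_locus(model, other_set):
    """Classify one model against every model in the other gene set."""
    hits = [other for other in other_set if spans_overlap(model, other)]
    if not hits:
        return "no map"
    if len(hits) > 1:
        return "merge/split"
    return classify_pair(model, hits[0])


# Hypothetical models on one supercontig.
model = [(100, 200), (300, 400), (500, 600)]
shifted_ends = [(150, 200), (300, 400), (500, 650)]
print(classify_pair(model, shifted_ends))
print(classify_locus(model, [[(100, 200)], [(500, 600)]]))
```

A production merge would also need to account for strand, partial models at contig edges, and the isoform evidence mentioned on the slide.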

Merge annotation comparisons

Examples

Isoform-compat

Isoform-diff

Examples

Merge/Splits

Difficult

GBrowse viewer

VectorBase browser

Final gene set (IscaW1.1)

• 20,486 protein-coding genes
  • 48% have Pfam domain
  • 40% have supporting EST evidence
• 8,138 tRNAs
  • Over-prediction of Ser (4425) and Thr (1527) predictions
• 301 ncRNA

• Submitted to GenBank last week, release to be coordinated in the next couple of weeks
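
The percentages and tRNA counts above translate into approximate absolute numbers as follows. Because the slide gives rounded percentages, the derived gene counts are estimates only.

```python
# Figures quoted on the slide.
PROTEIN_CODING_GENES = 20_486
PFAM_FRACTION = 0.48
EST_FRACTION = 0.40
TRNA_TOTAL = 8_138
TRNA_SER = 4_425
TRNA_THR = 1_527

# Approximate gene counts implied by the rounded percentages.
genes_with_pfam = round(PROTEIN_CODING_GENES * PFAM_FRACTION)
genes_with_est = round(PROTEIN_CODING_GENES * EST_FRACTION)

# Share of tRNA predictions attributed to the two over-predicted classes.
ser_thr_total = TRNA_SER + TRNA_THR
ser_thr_share = ser_thr_total / TRNA_TOTAL
other_trnas = TRNA_TOTAL - ser_thr_total

print(f"Genes with a Pfam domain: ~{genes_with_pfam}")
print(f"Genes with EST support:   ~{genes_with_est}")
print(f"Ser + Thr tRNAs: {ser_thr_total} ({ser_thr_share:.1%} of all tRNA predictions)")
print(f"Remaining tRNA predictions: {other_trnas}")
```

Ser and Thr together account for nearly three quarters of the tRNA predictions, which is why the slide flags them as over-predicted.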

Genome annotation cycle

Automatic gene build

Assembly

Community annotations

Manual annotations

Other genomes, gene sets

Repeat library (TEs etc.)

ESTs, cDNAs

Protein domains

Community annotation

Web submission

CHADO

Researcher

Community representative

Appraisal

Approval

GFF3

Gene Build

Total: 13,339 entries

An. gambiae 9,423

Cx. quinquefasciatus 2,598

Ae. aegypti 1,281

Ix. scapularis 37
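
The community annotation workflow passes approved annotations on as GFF3. As a rough illustration of what such a record looks like, the sketch below formats a gene, its mRNA, and two exons as nine-column, tab-separated GFF3 lines with 1-based inclusive coordinates. The sequence name, source label, and feature IDs are hypothetical, and attribute values are assumed not to need escaping.

```python
def gff3_line(seqid, source, ftype, start, end, strand, attributes,
              score=".", phase="."):
    """Format one GFF3 feature line (9 tab-separated columns)."""
    attribute_text = ";".join(f"{key}={value}" for key, value in attributes.items())
    columns = [seqid, source, ftype, str(start), str(end),
               score, strand, phase, attribute_text]
    return "\t".join(columns)


def gene_to_gff3(seqid, source, gene_id, strand, exons):
    """Emit gene, mRNA and exon lines for a single-transcript gene model."""
    start = min(s for s, _ in exons)
    end = max(e for _, e in exons)
    mrna_id = f"{gene_id}-RA"  # hypothetical transcript naming
    lines = [
        "##gff-version 3",
        gff3_line(seqid, source, "gene", start, end, strand, {"ID": gene_id}),
        gff3_line(seqid, source, "mRNA", start, end, strand,
                  {"ID": mrna_id, "Parent": gene_id}),
    ]
    for number, (exon_start, exon_end) in enumerate(sorted(exons), start=1):
        lines.append(gff3_line(seqid, source, "exon", exon_start, exon_end, strand,
                               {"ID": f"{mrna_id}-E{number}", "Parent": mrna_id}))
    return lines


# Hypothetical submission on a hypothetical supercontig.
example_lines = gene_to_gff3("supercontig_1", "community", "GENE0001", "+",
                             [(1001, 1200), (1501, 1800)])
print("\n".join(example_lines))
```

The Parent attributes are what link exons to their transcript and the transcript to its gene, so a downstream gene build can reassemble the model from flat lines.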

Community annotation track in browser

Lessons

• Annotation plan for sequencing and annotation of new genomes is well established (MSC & BRC)
• Clearly defining the data release strategies (0.5, 1.0 & 1.1)
• Monthly conference calls
• Face-to-face meeting when merging 0.5 gene predictions
• Coordinated release between MSC, VectorBase and GenBank

But we can always improve

• Agreement on project/public identifiers at the start of the project
  • Primarily contigs and supercontigs
  • Overall nomenclature applied as final step in annotation
• More QC before the major milestones
• Better communication

Acknowledgements

• EMBL-EBI: Ewan Birney, Martin Hammond, Daniel Lawson, Karyn Megy
• Harvard: Bill Gelbart, Kathy Campbell
• IMBB: Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas
• Imperial: Fotis Kafatos, George Christophides, Bob MacCallum, Seth Redmond
• Notre Dame: Frank Collins, Greg Madey, Scott Emrich, Ryan Butler, Katie Cybulski, Nate Konopinski, Rob Bruggner (alumni), E.O. Stinson (alumni)

• Aedes: Dave Severson, Neil Lobo
• Anopheles: Frank Collins, Neil Lobo
• Culex: Peter Atkinson, Peter Arensburger
• Ixodes: Catherine Hill, Jason Meyer

Colleagues

Ensembl { Genebuilders, Web, Compara, Core, Outreach }

BRCs { Pathema, ApiDB }

Sequencers { JCVI & Broad Institute }