BRC6 28th October 2008
Collective annotation of the Ixodes scapularis genome: VectorBase, MSCs and the tick community.
Daniel Lawson, VectorBase
Arthropod vectors of human pathogens
Lutzomyia, Phlebotomus, Culex, Rhodnius, Anopheles, Glossina, Aedes, Pediculus, Ixodes
Deer tick Ixodes scapularis
• Vector of Lyme disease (spirochete Borrelia burgdorferi)
• Estimated genome size of 2.1 Gb
• Sequenced strain: Wikel
• 12th generation from ticks sourced from New York, Oklahoma & Connecticut
• First Chelicerate genome to be sequenced
Genome annotation cycle
Automatic gene build
Assembly
Community annotations
Manual annotations
Other genomes, gene sets
Repeat library (TEs etc)
ESTs, cDNAs
Protein domains
Generating sequence
• Sequencing undertaken by established sequencing centres (e.g. Broad, JCVI)
• Initial assembly annotated in collaboration with the sequencing centre(s)
• 19,300,000 trace reads generated
• Approx. 6x WGS
• 570K BAC end sequencing
• Assembly produced at JCVI
• 194K EST sequences
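As a rough sanity check on the figures above, the usual coverage relation (coverage = number of reads × mean read length / genome size) can be rearranged to give the mean read length that the quoted numbers imply. This is only a sketch: it assumes the 19.3 million trace reads all count towards the approx. 6x WGS figure, and that the 2.1 Gb genome size estimate quoted earlier is the denominator. The function name is illustrative.

```python
# Figures quoted on the slides
GENOME_SIZE_BP = 2.1e9      # estimated genome size (2.1 Gb)
TRACE_READS = 19_300_000    # trace reads generated
WGS_COVERAGE = 6.0          # approx. 6x WGS


def implied_read_length(coverage, genome_size_bp, n_reads):
    """Rearrange coverage = n_reads * read_length / genome_size for read_length."""
    return coverage * genome_size_bp / n_reads


mean_read_bp = implied_read_length(WGS_COVERAGE, GENOME_SIZE_BP, TRACE_READS)
print(f"Implied mean read length: {mean_read_bp:.0f} bp")
```

Under these assumptions the implied mean read length is a little over 650 bp.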
Assembly statistics
• This WGS project has the project accession ABJB000000000. The current version of the project (01) has the accession number ABJB010000000, and consists of 1,141,594 scaffolds (ABJB010000001-ABJB011141594).
• Released assembly IscaW1
• 570,637 contigs
• 369,495 supercontigs
• Assembled coverage of 3.8x
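The size of the accession range quoted above can be checked directly from the identifiers. The sketch below assumes the layout visible in these accessions (a four-letter project prefix, a two-digit version, and a seven-digit serial number); the helper names are illustrative and not part of any GenBank tooling.

```python
import re

# Assumed layout for this project: 4-letter prefix, 2-digit version, 7-digit serial
WGS_RE = re.compile(r"^([A-Z]{4})(\d{2})(\d{7})$")


def parse_wgs(accession):
    """Split a WGS accession into (prefix, version, serial)."""
    match = WGS_RE.match(accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, version, serial = match.groups()
    return prefix, int(version), int(serial)


def range_size(first, last):
    """Count the records in an inclusive accession range within one project version."""
    prefix1, version1, serial1 = parse_wgs(first)
    prefix2, version2, serial2 = parse_wgs(last)
    if (prefix1, version1) != (prefix2, version2):
        raise ValueError("range spans different projects or versions")
    return serial2 - serial1 + 1


n_records = range_size("ABJB010000001", "ABJB011141594")
print(f"{n_records:,} records in range")
```

The count agrees with the 1,141,594 figure given in the project description.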
Preparing for gene build
• Repeatmasking
• Analyses to identify repeat elements
• RepeatScout
• RECON
• Standard tandem-repeat & low-complexity filtering
• Collate data sets
• Transcripts (cDNA & EST data)
• Peptides (taxonomic groupings, inc. Daphnia pulex)
• Train gene predictors, mainly Augustus (JCVI)
Annotation plan
• First-pass gene prediction
• Focused on protein-coding genes (CDSs)
• Semi-automated approach
• This is not manual curation
• Involvement of community where possible
• Timely delivery of gene set
Gene Prediction
• Each group/centre has its own gene prediction pipeline/protocol
• Each group produces a first-pass ‘best guess’ set of predictions
• 0.5 sets, public release
• These sets are merged into a single set
• 1.0 set, not released
• Quality control activities
• 1.1 set, public release
• Which is annotated with protein features
• ... and released to the wider world
Merging gene predictions
Reduce to single predictions per locus
Compare exon/intron structures
Gene set #1 Gene set #2
Identical structures
Compatible structures
Different structures
Merge/Split structures
Complex
No Map
Add isoform predictions based on EST/Peptide data
Canonical gene set
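The comparison step in this flow can be sketched as a small classifier over exon coordinates. This is an illustration only, not the pipeline actually used for the merge: the slides do not define the categories, so the rules here are assumptions. Two predictions are called "identical" if their exon lists match exactly, "compatible" if they agree on every intron that overlaps their shared span, "different" if they overlap but disagree on an intron, and "no map" if they do not overlap at all. The merge/split and complex cases are not handled, and the function names and coordinates are hypothetical.

```python
def introns(exons):
    """Return the introns implied by a list of (start, end) exons (inclusive coordinates)."""
    ordered = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


def classify(exons_a, exons_b):
    """Classify two predictions at one locus (assumed rules, see text above)."""
    a, b = sorted(exons_a), sorted(exons_b)
    if a == b:
        return "identical"
    # Shared genomic span of the two predictions
    lo = max(a[0][0], b[0][0])
    hi = min(a[-1][1], b[-1][1])
    if lo > hi:
        return "no map"

    def spanning(exons):
        # Introns that overlap the shared span
        return {i for i in introns(exons) if i[0] <= hi and i[1] >= lo}

    return "compatible" if spanning(a) == spanning(b) else "different"


# Hypothetical predictions from two gene sets
set1 = [(100, 200), (300, 400), (500, 600)]
set2_same_introns = [(150, 200), (300, 400), (500, 650)]
set2_shifted_exon = [(100, 200), (320, 400), (500, 600)]

print(classify(set1, set1))
print(classify(set1, set2_same_introns))
print(classify(set1, set2_shifted_exon))
print(classify(set1, [(1000, 1100)]))
```

Comparing intron boundaries rather than exon ends means that predictions differing only in their terminal exon extents are still treated as compatible, which is one plausible reading of the slide.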
Final gene set (IscaW1.1)
• 20,486 protein-coding genes
• 48% have Pfam domain
• 40% have supporting EST evidence
• 8,138 tRNAs
• Over-prediction of Ser (4425) and Thr (1527) predictions
• 301 ncRNA
• Submitted to GenBank last week, release to be coordinated in the next couple of weeks
Genome annotation cycle
Automatic gene build
Assembly
Community annotations
Manual annotations
Other genomes, gene sets
Repeat library (TEs etc)
ESTs, cDNAs
Protein domains
Community annotation
Web submission
CHADO
Researcher
Community representative
Appraisal
Approval
GFF3
Gene Build
Total: 13,339 entries
An. gambiae 9,423
Cx. quinquefasciatus 2,598
Ae. aegypti 1,281
Ix. scapularis 37
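Since community annotations pass through GFF3 in this flow, a minimal reader for a single GFF3 feature line is sketched below. It follows the general GFF3 layout (nine tab-separated columns, with URL-style escaping in the attributes column) and is not VectorBase's own submission code. The sequence ID, source and gene name in the example record are hypothetical.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, with attributes as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"], record["end"] = int(record["start"]), int(record["end"])
    if record["start"] > record["end"]:
        raise ValueError("start must not exceed end")
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


# Hypothetical community-submitted gene feature
example = "\t".join([
    "hypothetical_supercontig_1", "community", "gene", "1000", "2500",
    ".", "+", ".", "ID=hypothetical_gene_1;Name=example%20gene",
])
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"], record["attributes"]["Name"])
```

A real submission check would also validate feature types and parent/child relationships; this sketch only covers the column structure.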
Lessons
• Annotation plan for sequencing and annotation of new genomes is well established (MSC & BRC)
• Clearly defining the data release strategies (0.5, 1.0 & 1.1)
• Monthly conference calls
• Face to face meeting when merging 0.5 gene predictions
• Coordinated release between MSC, VectorBase and GenBank
But we can always improve
• Agreement on project/public identifiers at the start of the project
• Primarily contigs and supercontigs
• Overall nomenclature applied as final step in annotation
• More QC before the major milestones
• Better communication
Acknowledgements
• Kitsos Louis, Pantelis Topalis, Emmanuel Dialynas
• Ewan Birney, Martin Hammond, Daniel Lawson, Karyn Megy
• Bill Gelbart, Kathy Campbell
• Fotis Kafatos, George Christophides, Bob MacCallum, Seth Redmond
• Peter Atkinson, Peter Arensburger
• Catherine Hill, Jason Meyer
• Frank Collins, Greg Madey, Scott Emrich, Ryan Butler, Katie Cybulski, Nate Konopinski, Rob Bruggner (alumni), E.O. Stinson (alumni)
• Dave Severson, Neil Lobo
• Frank Collins, Neil Lobo
Aedes Anopheles Culex Ixodes
EMBL-EBI Harvard IMBB Imperial Notre Dame
Colleagues
Ensembl { Genebuilders, Web, Compara, Core, Outreach }
BRCs { Pathema, ApiDB }
Sequencers { JCVI & Broad Institute }