Annotator Interface
Sharon Diskin
GUS 3.0 Workshop
June 18-21, 2002
Outline
Current annotation efforts
Motivation for new annotation tool
Requirements for new annotation tool
Thoughts on design and implementation
Future plans
Current Annotation Efforts
Overview of Current Efforts
Automated annotation has been applied to the DoTS transcripts
– Predicted gene ownership (clustering of assemblies)
– BlastX against NR
  • Automated assignment of descriptions based on similarity
– BlastX against ProDom and RPS-Blast against CDD
  • Predicted GO Functions
– Framefinder
  • Predicted Protein Sequences
– Blat alignments
– EPCR, Index Words, etc…
Manual annotation efforts have focused on
– validating the automated annotation
– adding additional information at the central dogma level
Manual annotation of the gene index utilizes an annotation tool, the GUS Annotator Interface, which directly updates the GUSdev database.
[Diagram: DoTS assembly pipeline, stages and the steps between them]
Incoming Sequences (EST/mRNA)
• GenBank, dbEST sequences
• Make Quality (remove vector, polyA, NNNs)
“Quality” sequences
• Block with RepeatMasker
Blocked sequences
• Blastn to cluster sequences
“Unassembled” clusters
• Assemble sequences with CAP4
CAP4 assemblies (generate consensus sequences)
DoTS consensus sequences
DoTS RNA transcripts
• BLASTn DoTS consensus sequences (98% identity, 150 bp)
Gene Cluster (RNAs in the Gene)
The assembly of sequences generates a consensus sequence or DoTS transcript
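The gene-clustering step (BLASTn between DoTS consensus sequences at 98% identity over 150 bp) can be illustrated with a small single-linkage clustering sketch. This is not the DoTS build code: the hit tuple layout, the function name `cluster_by_similarity`, the example identifiers, and the reading of both thresholds as minimums are assumptions made for illustration.

```python
from collections import defaultdict

MIN_IDENTITY = 98.0  # percent identity threshold from the slide
MIN_LENGTH = 150     # aligned length threshold in bp from the slide


def cluster_by_similarity(seq_ids, hits):
    """Group sequences connected by hits that pass both thresholds.

    hits: iterable of (query_id, subject_id, percent_identity, aligned_length).
    Every id used in hits must also appear in seq_ids.
    Returns a sorted list of sorted clusters (single linkage via union-find).
    """
    parent = {s: s for s in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        if identity >= MIN_IDENTITY and length >= MIN_LENGTH:
            parent[find(query)] = find(subject)

    clusters = defaultdict(list)
    for s in seq_ids:
        clusters[find(s)].append(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical identifiers and hits, for illustration only.
example_hits = [
    ("DT.1", "DT.2", 99.1, 400),  # passes both thresholds
    ("DT.2", "DT.3", 98.0, 150),  # passes exactly at the thresholds
    ("DT.3", "DT.4", 97.5, 900),  # identity too low
    ("DT.4", "DT.5", 99.9, 120),  # alignment too short
]
clusters = cluster_by_similarity(
    ["DT.1", "DT.2", "DT.3", "DT.4", "DT.5"], example_hits
)
```

Single linkage means two transcripts can share a cluster without a direct hit between them, which is one reason the manual validation of gene membership described later is needed.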
Current Efforts: Gene Annotation (1)
Task 1: Validation of Gene Membership
[Diagram: Generate DoTS transcripts. Shows Gene_A with RNA_1 to RNA_5, Instance_1 to Instance_5, Feature_1 to Feature_5, and Assembly_1 to Assembly_5, under the table labels Gene, RNA, RNAInstance, RNAFeature, and Assembly.]
Current Efforts: Gene Annotation (2)
[Diagram: Generate DoTS transcripts. Shows Gene_A and Gene_B with RNA_1 to RNA_5, Instance_1 to Instance_5, Feature_1 to Feature_5, and Assembly_1 to Assembly_5, under the table labels Gene, RNA, RNAInstance, RNAFeature, and Assembly.]
– Removing RNAs from the cluster results in the creation of a new Gene
– An entry is made in the MergeSplit table for tracking purposes
– A similar process is followed when an RNA is added to a Gene
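The split behaviour described on this slide (removed RNAs form a new Gene, and the change is tracked in MergeSplit) can be sketched as follows. The class names, fields, and the `split_gene` function are hypothetical stand-ins; they do not reproduce the actual GUS tables or the Annotator Interface code.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    name: str
    rna_ids: set = field(default_factory=set)


@dataclass(frozen=True)
class MergeSplitEntry:
    """Hypothetical tracking record; the real MergeSplit columns are not shown in the slides."""
    old_gene: str
    new_gene: str
    operation: str
    rna_ids: tuple


def split_gene(gene, rna_ids, new_gene_name, merge_split_log):
    """Move the given RNAs out of `gene` into a newly created gene and log the split."""
    moved = set(rna_ids)
    if not moved or not moved <= gene.rna_ids:
        raise ValueError("RNAs to remove must be a non-empty subset of the gene")
    gene.rna_ids -= moved
    new_gene = Gene(new_gene_name, moved)
    merge_split_log.append(
        MergeSplitEntry(gene.name, new_gene.name, "split", tuple(sorted(moved)))
    )
    return new_gene


# Illustration using the labels from the diagram; which RNAs move is arbitrary here.
log = []
gene_a = Gene("Gene_A", {"RNA_1", "RNA_2", "RNA_3", "RNA_4", "RNA_5"})
gene_b = split_gene(gene_a, {"RNA_4", "RNA_5"}, "Gene_B", log)
```

Adding an RNA to a gene would follow the same pattern in reverse, again with a tracking entry.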
Current Efforts: Gene Annotation (3)
Task 2: Assign Reference RNA
– will be annotated further
– RNA table
Task 3: Assign Approved Gene Name/Symbol
– Gene Table
– Evidence: Comment (specifies database link)
Task 4: Assign Gene Description
– Gene Table
– Evidence: Comment
Task 5: Associate known Gene synonyms
– GeneSynonym table
– Evidence: Comment
Current Efforts: RNA Annotation
Annotation of “Reference Sequence”
Task 1: Assign/Confirm Description of assembly
– RNA table
Task 2: Confirm/Add/Delete GO Functions
– ProteinGOFunction (in GUSdev; GO tables have been re-designed in GUS3.0)
– Evidence: Comments or Similarity (ProDom, CDD-Pfam, CDD-Smart, or NR)
Current Annotator Interface Architecture
[Diagram. Components: Annotator Interface, Java Servlet, “XML” file, AnnotatorInterfaceSubmitter, GA-Plugin, Perl Object Layer, GUSdev. Arrow labels: writes, reads, executes. Database access: JDBC (Query Only); DBI (Insert/Update/Delete).]
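One reading of this architecture is that queries go straight to the database while edits are handed off through a file that a separate submitter reads and applies. The sketch below illustrates only that hand-off pattern. The element names, column names, and function names are hypothetical, and the real “XML” file format used by the Annotator Interface is not described in these slides.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_edits(path, edits):
    """Interface side: serialize pending edits instead of modifying the database."""
    root = ET.Element("edits")
    for op, table, values in edits:
        item = ET.SubElement(root, "edit", {"op": op, "table": table})
        for column, value in values.items():
            ET.SubElement(item, "value", {"column": column}).text = str(value)
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)


def read_edits(path):
    """Submitter side: parse the file back into (op, table, values) tuples."""
    edits = []
    for item in ET.parse(path).getroot().findall("edit"):
        values = {v.get("column"): v.text for v in item.findall("value")}
        edits.append((item.get("op"), item.get("table"), values))
    return edits


def submit(path, apply_edit):
    """Apply each edit through a callback that stands in for the plugin."""
    applied = 0
    for op, table, values in read_edits(path):
        apply_edit(op, table, values)
        applied += 1
    return applied


# Hypothetical edit; the column names are placeholders, not the GUS schema.
pending = [("update", "Gene", {"gene_id": "1", "name": "example"})]
applied_log = []
with tempfile.TemporaryDirectory() as tmp:
    handoff = os.path.join(tmp, "edits.xml")
    write_edits(handoff, pending)
    n_applied = submit(
        handoff, lambda op, table, values: applied_log.append((op, table, values))
    )
```

Keeping the interactive client query-only and routing all writes through one submitter gives a single place to validate and track changes.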
Current Annotator Interface
Current Gene Annotation
Validate Cluster and Assign Reference RNA/Assembly
Current Gene Annotation (cont.)
Assign Gene Name/Symbol
Assign Gene Description
Assign Gene Synonym(s)
Evidence
Current RNA (and Protein) Annotation
RNA Description
GO Functions
Evidence
Allgenes Display of Gene Annotation
Allgenes Display of RNA Annotation
(Confirmed or manually added GO Functions)
RNA Description
Status of Current Annotation (as of June 20, 2002)
1289 manually reviewed genes
– 1003 with gene name
– 697 with gene synonyms
– 1046 with description
6146 manually reviewed RNAs/DoTS assemblies
949 ‘proteins’ with reviewed GO function
Motivation for new tool
Want to annotate using genomic sequence
• Create “curated” gene models specifying structure
• Increase structure of annotation in GUS
• Annotation of proteins
• Redefinition of annotation tasks
• Current interface not designed for this purpose
Some Other Annotation Tools
• Artemis
  • Developed and used at Sanger
  • Reads and writes flat files
  • Supports rich set of annotations
  • Save as EMBL format
• Apollo
  • Combined effort including members from Sanger and Berkeley
  • Flat files (CORBA access to ENSEMBL)
  • 2 versions, currently being merged
    • Sanger: annotation viewer
    • Berkeley: focus on editing
No Existing Tool To Meet All of Our Needs
Requirements At a High Level
Requirements: Graphical View
Provide alignment of features on genomic sequence
– could potentially display any feature type currently stored in GUS3.0
– features can be selected and used to generate “curated” features
– similar to display and functionality in Apollo
Toggle (or configure) the display of each feature type
Zoom to sequence level, with links to functionality relevant to the highlighted feature
Also support creation of features “from scratch”
– based on literature, etc.
Detail editors provide ability to change endpoints, etc.
Gene Annotation
Create curated gene model
– specify gene boundaries
– specify location of exons (and thus introns)
  • 5' exon boundary (putative transcription start site)
  • 3' exon boundary (include polyadenylation signal)
– automatic creation of Gene entry
– merge with existing gene instances through GeneInstance table
– tables/views affected:
  • GeneFeature
  • ExonFeature
  • GeneInstance
  • Gene
  • MergeSplit
– evidence: features used to create model, PubMed ID
– should be as easy as clicking on existing features and saying “make curated” (then can modify endpoints, etc. if needed)
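Specifying exon locations determines the introns and the gene boundaries, which a short sketch can make concrete. It assumes 1-based inclusive coordinates on a single strand; the function names and example coordinates are hypothetical and not part of GUS.

```python
def gene_bounds(exons):
    """Return (start, end) spanning all exons; exons are (start, end) pairs."""
    if not exons:
        raise ValueError("a gene model needs at least one exon")
    return min(s for s, _ in exons), max(e for _, e in exons)


def introns_from_exons(exons):
    """Derive introns as the gaps between consecutive, non-overlapping exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start <= prev_end:
            raise ValueError("exons overlap")
        if next_start > prev_end + 1:  # directly adjacent exons leave no intron
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical curated exons, listed out of order on purpose.
curated_exons = [(301, 400), (100, 200), (551, 700)]
bounds = gene_bounds(curated_exons)
introns = introns_from_exons(curated_exons)
```

Because introns are derived rather than stored, changing an exon endpoint in a detail editor only requires recomputing them.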
Gene Annotation (2)
Assign (HUGO or MGI approved) abbreviated gene name/symbol
– Gene Table
– Evidence: ExternalDatabaseLink
Assign full gene name (MGI or HUGO full gene name)
– Gene Table
– Evidence: ExternalDatabaseLink
Assign abbreviated gene name/symbol synonyms (non-approved gene symbols)
– GeneSynonym Table
– Evidence: ExternalDatabaseLink
Assign full gene name aliases
– GeneAlias Table
– Evidence: ExternalDatabaseLink
Gene Annotation (3)
Assign gene category (e.g. non-coding)
– Gene Table
– Evidence:
  • ExternalDatabaseLink/Literature Reference
  • Similarity (e.g. to known non-coding RNA)
Confirm/assign gene chromosomal location
– GeneChromosomalLocation
– Evidence:
  • ExternalDatabaseLink/Literature Reference
  • RH mapping data
  • Alignments/Features
OMIM Link assignment (verification if computationally determined)
– ExternalDatabaseLink
RNA Annotation (1)
Create “curated RNAs”
– Define RNA transcript forms of gene (create RNAs)
– Using exons defined by curated gene
– 5' and 3' UTRs
– Automatic creation of RNA entry
– Merge existing RNA instances
– Tables affected:
  • RNAFeature
  • UTRFeature
  • RNAInstance
  • RNA
– Evidence: Features used to create
Assign RNA categories to created RNAs (e.g. alternative form)
– RNARNACategory Table
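As an illustration of how 5' and 3' UTRs could follow from a curated RNA's exons, the sketch below splits the exons around a coding region. The idea of supplying coding start and end coordinates, the forward-strand 1-based inclusive convention, and the name `split_utrs` are assumptions; the slides do not say how UTRFeature entries are computed.

```python
def split_utrs(exons, cds_start, cds_end):
    """Return (five_prime_utr, three_prime_utr) as lists of (start, end) segments.

    exons: (start, end) pairs for one transcript on the forward strand.
    cds_start, cds_end: first and last coding positions (assumed inputs).
    """
    if cds_start > cds_end:
        raise ValueError("cds_start must not exceed cds_end")
    five_prime, three_prime = [], []
    for start, end in sorted(exons):
        if start < cds_start:
            # Exon portion upstream of the coding region.
            five_prime.append((start, min(end, cds_start - 1)))
        if end > cds_end:
            # Exon portion downstream of the coding region.
            three_prime.append((max(start, cds_end + 1), end))
    return five_prime, three_prime


# Hypothetical transcript built from three curated exons.
rna_exons = [(100, 200), (301, 400), (551, 700)]
five_utr, three_utr = split_utrs(rna_exons, cds_start=150, cds_end=600)
```

A reverse-strand transcript would need the two results swapped, which this sketch leaves out.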
RNA Annotation (2)
Assign (or confirm computed) RNA description
– RNA table
– Evidence: Gene from which it is derived
Anatomy expression assignment(s)
– RNAAnatomy
– RNAAnatomyLOE
– Evidence:
  • ExternalDatabaseLink/Literature references
  • Assembly anatomy percent from DoTS
  • RAD experiments
Assign GO terms to curated RNA (non-coding RNAs, e.g. small RNA involved in splicing)
– GOTermAssociation
– GOTermAssociationEvid
– Evidence: ExternalDatabaseLink, Literature References
Computational analysis performed on curated RNA sequences
– Annotation workflow
  • Framefinder translation, GO terms, Similarities, etc.
Requirements: Protein Annotation
Confirm/assign GO Function
– GOTermAssociation, GOTermAssociationEvid
– Evidence: ExternalDatabaseLink and/or Literature References
Confirm/assign GO Biological Process
– GOTermAssociation, GOTermAssociationEvid
– Evidence: ExternalDatabaseLink and/or Literature References
Confirm/assign GO Cellular Component
– GOTermAssociation, GOTermAssociationEvid
– Evidence: ExternalDatabaseLink and/or Literature References
Assign protein name
– Protein Table
– Evidence: ExternalDatabaseLink, Literature Ref, Similarities
Assign protein name synonyms
– Protein Table
– Evidence: ExternalDatabaseLink, Literature Ref, Similarities
Protein Annotation (2)
Assign protein category (post-translational modifications)
– ProteinProteinCategory
– Evidence: ExternalDatabaseLink, Literature References
Protein-protein interactions assigned
– Interaction
– InteractionInteractionLOE
– Evidence: PubMed ID, etc.
Protein pathway assignments
– PathwayInteraction (for newly created interactions)
– Still under consideration: what is the best way to link with an existing pathway?
  • for example, a Pathway is represented in DoTS, and we want to say that this curated Protein is really the same as a protein in a pathway.
Assign post-translational modification category
Assign interactions involving this protein
Assign pathway the protein is known to be involved in
Assign protein family
Ability to modify and/or delete curated protein
Evidence will be associated with all annotation
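The rule that every annotation carries evidence can be enforced at the data-structure level, as in the sketch below. The class names, fields, and the set of evidence kinds (collected from the evidence types named on the requirement slides) are illustrative assumptions, not the GUS evidence tables or the planned Java Object Layer.

```python
from dataclasses import dataclass

# Evidence kinds mentioned across the requirement slides (illustrative grouping).
EVIDENCE_KINDS = {
    "ExternalDatabaseLink",
    "LiteratureReference",
    "Similarity",
    "Feature",
}


@dataclass(frozen=True)
class Evidence:
    kind: str
    reference: str

    def __post_init__(self):
        if self.kind not in EVIDENCE_KINDS:
            raise ValueError(f"unknown evidence kind: {self.kind}")


@dataclass(frozen=True)
class Annotation:
    table: str
    attribute: str
    value: str
    evidence: tuple

    def __post_init__(self):
        # Refuse to create an annotation that has no supporting evidence.
        if not self.evidence:
            raise ValueError("annotation requires at least one piece of evidence")


# Placeholder values; no real identifiers are implied.
protein_name = Annotation(
    table="Protein",
    attribute="name",
    value="example protein name",
    evidence=(Evidence("LiteratureReference", "placeholder reference"),),
)
```

Making the check part of object construction means no code path in the interface can save an annotation without evidence.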
Next Steps/Open Issues
Completion of Java Object Layer
Decision regarding BioJava wrappers
– What exactly will this give us to aid in interface development (e.g. FeatureRenderer, etc.)?
Discussion on layout of interface
– Joan’s input after experimentation with other tools
Depending on the above:
– Client-side portion which communicates with remote GUS Server
– Interface implementation