Making GO Annotations For Fungal Genomes
A brief overview
Outline of Topics
• Intro
• Overview of Overall Annotation Pipeline
• Introduction to the Gene Ontologies (GO)
• Making GO Annotations
• Submitting GO Annotations
• GO Tool - AmiGO
Intro & Overview of Overall Sequence and
Annotation Pipeline
Karen Christie
Saccharomyces Genome Database
Stanford University
[Figure: Total Nucleotides at GenBank/EMBL/DDBJ, including Whole Genome Shotgun. Y-axis from 1.00E+05 to 1.00E+12 in powers of ten; x-axis from Dec-82 to Dec-06.]
Total as of Dec 2006: 1.52E+11
Growth in 2006: 3.50 x 10^10 nucs (23.2% of total)
Events marked on the chart: NCBI created by Congress; WGS section started; EBI created at Hinxton
Genomes marked on the chart: Saccharomyces cerevisiae, Haemophilus influenzae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens
Fungal Genomes being sequenced at Broad Institute
Fungal Genomes being sequenced by JGI
Published Literature
PubMed: over 15 million citations
Basic search: secondary metabolism → 109580
Limit search: secondary metabolism (published in the last 1 year) → 5440
Boolean operators: secondary metabolism AND Aspergillus → 479
Numbers as of 3/21/2007
Gene Ontology Objectives
• GO represents categories used to classify specific parts of our biological knowledge:
– Biological Process
– Molecular Function
– Cellular Component
• GO develops a common language applicable to any organism
• GO terms can be used to annotate gene products from any species, allowing comparison of information across species
My genome is sequenced!
ATGTCTTTTTTAAGTGCATCGATGTCCTGGGGGCTTAGTATAATGCTCCCCGAGCTTCCTAGCGCTTAGTGCATTAGACTAGGGCCAAAATGACTACTGTTCTTAAAGTACTAGTACTTACTACGCCCTGTTTCTTTCTTCTTCTAAAAGACTAACTAAGTGCTAGTCTAGATCTACTATTACTACCCTACCTACTATACTAGACTAATTACCAACCCCTAGGGTACTAAATTTGCCTAGTTTACGTAGCGTTCTTAAAACGTACTAGATTACCGTACTAGGGACGTACTAAGGTACTAG…
What do I do now?
• Sequence of genes/genome
• Primary Annotation - the location and structure of genes
• Secondary Annotation - the functions of the genes
Overview of Sequencing/Annotation Pipeline
ATGCTTCCTGATTTTGCCCTGGACTTCGCTTGTATAAATTCATTGCACC…
GO process: terrequinone A biosynthesis
GO function: methyltransferase activity
Enzyme Commission: 2.1.1.-
alcohol dehydrogenase
Who will be annotating?
• Just you?
• A single group?
• A consortium of groups?
The number of people and groups participating
and the funding will affect some decisions on
whether to set up a database or use flatfiles.
Do you (or your group) have gene calls for your sequence?
– No: make automated or manual gene calls (TIGR’s Eukaryotic Annotation course is very useful)
– Yes: go to the next question
Are the protein predictions submitted to GenBank/DDBJ/EMBL?
– No: submit gene/protein calls to GenBank/DDBJ/EMBL
– Yes: UniProtKB contains translations of all coding regions in GenBank/DDBJ/EMBL, and GOA will make GO annotations (IEA) using automated methods
Resources to make functional annotations?
– No: GOA will collect all GO annotations and submit them to GOC; GOA will maintain the annotation file
– Yes:
• Contact GO Consortium for advice, training, help with coordination, etc.
• Set up pipeline for any automated annotations not being done by GOA
• Make manual GO annotations from literature, or from sequence similarity methods
• Decide who will collate all GO annotations into one file: either GOA collects all GO annotations, submits them to GOC and maintains the annotation file, or you (or your group) collect all GO annotations, submit them to GOC and maintain the annotation file
Automated Eukaryotic Gene Annotation
Genome Sequence
Repeat masked sequence
Gene finders Database comparisons
Combined consensus prediction
EST based refinement(adjust exons, UTRs, alternative splicing)
Automated Gene Annotation
EST Database
Develop a training set
Twinscan, GeneZilla, glimmerHMM, Augustus, Fgenesh, etc.
AAT_aa, AAT_na, tRNA Scan, GMAP, Sim4, etc.
Gene predictions
Repeat masker
Genome alignments
Based on TIGR course
Manual Gene Annotation?
1st Question - Is it in the budget?
Manual annotation can be a lot better
than automated, but is a lot more
expensive and time consuming!
Based on TIGR Eukaryotic Annotation course
Manual Gene Annotation Tools
• Viewer only
– Gbrowse
• Editors
– Apollo (requires a database)
– Manatee (requires a database)
– Artemis (runs on flat files)
Eukaryotic Gene Annotation
At the end of the procedure, you’ll have:
• Gene calls
• Protein predictions
• Unique IDs for your genes
This last is important. Gene IDs are unambiguous. Gene names are frequently ambiguous. You’ll also need IDs in order to submit GO annotations.
Example:
Gene Name: SP1 → 19242 hits in Entrez nucleotide
Gene ID: NM_138473 → 1 hit
Ready to make Functional Annotations!
• Questions
– What’s your budget?
– How much literature is available?
• Automated annotations
– Faster, cheaper
– Often less specific
• Manual annotations
– Time consuming & more expensive
– Precise and accurate
Introduction to GO
Rama Balakrishnan
Saccharomyces Genome Database
Stanford University, CA
A Common Language for Annotation of Genes from
Yeast, Flies and Mice
The Gene Ontologies
…and Plants and Worms
…and Humans
…and anything else!
http://www.geneontology.org/
What’s in a name?
• What is a cell?
Cell
Cell
Cell
Cell
Cell
Image from http://microscopy.fsu.edu
What’s in a name?
• The same name can be used to describe different concepts
What’s in a name?
What’s in a name?
• Glucose synthesis
• Glucose biosynthesis
• Glucose formation
• Glucose anabolism
• Gluconeogenesis
• All refer to the process of making glucose from simpler components
What’s in a name?
• The same name can be used to describe different concepts
• A concept can be described using different names
Comparison is difficult – in particular across species or across databases
What’s in a name?
• Rad54 (S. cerevisiae)
• Okra (D. melanogaster)
• Rhp54 (S. pombe)
What do these gene products have in common?
ATP-dependent helicase involved in DNA recombination and repair
What is the Gene Ontology?
A (part of the) solution:
- A controlled vocabulary that can be applied to all organisms
- Used to describe gene products (proteins and RNA) in any organism
What is Ontology?
• Dictionary: A branch of metaphysics concerned with the nature and relations of being.
• Barry Smith: The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality.
1606 1700s
So what does that mean?
From a practical view, an ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things.
Ontology
Includes:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships between the terms
How does GO work?
• What does the gene product do?
– Molecular Function
• Why does it perform these activities?
– Process
• Where does it act?
– Location in the cell, cellular component
What information might we want to capture about a gene product?
• Molecular Function = elemental activity/task
– the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
• Biological Process = biological goal or objective
– broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
– subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
The 3 Gene Ontologies
Cellular Component: where a gene product acts
Molecular Function: activities or “jobs” of a gene product
glucose-6-phosphate isomerase activity
insulin binding
insulin receptor activity
drug transporter activity
Molecular Function
• A gene product may have several functions; a function term refers to a single reaction or activity, not a gene product.
• Sets of functions make up a biological process.
Biological Process
cell division
transcription
limb development
courtship behavior
Function (what) Process (why)
Drive nail (into wood) Carpentry
Drive stake (into soil) Gardening
Smash roach Pest Control
Clown’s juggling object Entertainment
Example: Gene Product = hammer
term: gluconeogenesis
id: GO:0006094
definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
Synonym: glucose biosynthesis
What’s in a GO term?
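The four parts of a GO term shown on this slide can be held in a small record. The sketch below is only an illustration: the class name `GOTerm` and the `matches` helper are hypothetical, and the field values are copied from the gluconeogenesis example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    """One ontology term: a stable ID, a name, a definition, and synonyms."""
    go_id: str
    name: str
    definition: str
    synonyms: tuple = ()

    def matches(self, text: str) -> bool:
        """True if text equals the term name or a synonym (case-insensitive)."""
        wanted = text.strip().lower()
        names = {self.name.lower()} | {s.lower() for s in self.synonyms}
        return wanted in names


gluconeogenesis = GOTerm(
    go_id="GO:0006094",
    name="gluconeogenesis",
    definition=("The formation of glucose from noncarbohydrate precursors, "
                "such as pyruvate, amino acids and glycerol."),
    synonyms=("glucose biosynthesis",),
)

print(gluconeogenesis.matches("Glucose biosynthesis"))
```

Keeping synonyms next to the ID is what lets different names for one concept resolve to the same term.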
No GO Areas
• GO covers ‘normal’ functions and processes
– No pathological processes
– No experimental conditions
• NO evolutionary relationships
• NO gene products
• NOT a system of nomenclature for genes
Ontology Structure
• The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG)
• Terms can have more than one parent and zero, one or more children
• Terms are linked by two relationships
– is-a
– part-of
Chromosome
Cytoplasmic chromosome
Mitochondrialchromosome
Plastid chromosome
Nuclear chromosome
A child is a subset or an instance of its parent’s elements
Parent-Child Relationships
One-to-many parental relationship: each child has only one parent
Many-to-many parental relationship: each child may have one or more parents
Parent-Child Relationships
DAG: Directed Acyclic Graph
[other organelles]
chromosome
Intracellular organelle
nucleus
nuclear chromosome
cell part
cellular_component
mitochondrial chromosome
[Other types of chromosomes]
is_a
part_of
A Sample DAG
True Path Rule
• The path from a child term all the way up to its top-level parent(s) must always be true
cell cytoplasm
chromosome nuclear chromosome cytoplasmic chromosome mitochondrial chromosome
nucleus nuclear chromosome
is-a
part-of
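A DAG like the sample one can be stored as a mapping from each term to its parents, and the True Path Rule then says an annotation to a term implies every ancestor. The sketch below is a minimal illustration: the edge list is an assumed reading of the sample slide, not the real ontology file, and `PARENTS` and `ancestors` are hypothetical names.

```python
# Each term maps to a list of (relationship, parent) pairs.
# These edges are an illustrative guess at the sample DAG, not real GO data.
PARENTS = {
    "cellular_component": [],
    "cell part": [("is_a", "cellular_component")],
    "intracellular organelle": [("is_a", "cell part")],
    "nucleus": [("is_a", "intracellular organelle")],
    "chromosome": [("is_a", "intracellular organelle")],
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "mitochondrial chromosome": [("is_a", "chromosome")],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("nuclear chromosome")))
```

Because "nuclear chromosome" has two parents, its ancestor set includes both "chromosome" and "nucleus", which a simple tree could not express.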
•Terms become obsolete when they are removed or redefined
•GO IDs are never deleted from the ontologies
•For every obsoleted term, a comment is added to explain why the term is now obsolete
Ensuring Stability in a Dynamic Ontology
Biological Process
Obsolete Biological Process
Molecular Function
Obsolete Molecular Function
Cellular Component
Obsolete Cellular Component
Why modify the GO?
• GO reflects current knowledge of biology
• Biology drives changes to the ontologies
term: MAPKKK cascade (mating sensu Saccharomyces)
goid: GO:0007244
definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.
definition_reference: PMID:9561267
comment: This term was made obsolete because it is a gene product specific term. To update annotations, use the biological process term 'signal transduction during conjugation with cellular fusion ; GO:0000750'.
Obsolete terms
definition: MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces
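Because obsolete IDs are kept and their comments name a suggested replacement, existing annotations can be updated with a lookup table. The sketch below assumes such a table has been built by hand from the comments. `REPLACEMENTS` and `update_obsolete` are hypothetical names, and the single entry comes from the comment on GO:0007244 above.

```python
# Obsolete ID -> replacement suggested in the obsolete term's comment.
# Hand-built for illustration; a curator should confirm each replacement.
REPLACEMENTS = {"GO:0007244": "GO:0000750"}


def update_obsolete(go_ids, replacements=REPLACEMENTS):
    """Return go_ids with obsolete IDs swapped for their suggested terms."""
    return [replacements.get(go_id, go_id) for go_id in go_ids]


print(update_obsolete(["GO:0007244", "GO:0005813"]))
```

IDs that are not in the table pass through unchanged, so the function is safe to run over a whole annotation set.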
• Access gene product functional information
• Do cross species comparison
• Find how much of a proteome is involved in a process/ function/ component
•Provide a link between biological knowledge and …
• gene expression profiles
• proteomics data
What can scientists do with GO?
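As a rough illustration of the "how much of a proteome" question, the sketch below counts the distinct gene products annotated to one term. The gene names and proteome size are made-up toy data, and `fraction_annotated` is a hypothetical helper. It only counts direct annotations, whereas a real analysis would also include annotations to the term's descendants.

```python
def fraction_annotated(annotations, go_id, proteome_size):
    """Fraction of the proteome with a direct annotation to go_id."""
    genes = {gene for gene, term in annotations if term == go_id}
    return len(genes) / proteome_size


# Toy (gene, GO ID) pairs; geneA is listed twice to show de-duplication.
toy_annotations = [
    ("geneA", "GO:0006094"),
    ("geneB", "GO:0006094"),
    ("geneA", "GO:0006094"),
    ("geneC", "GO:0005813"),
]

print(fraction_annotated(toy_annotations, "GO:0006094", proteome_size=4))
```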
Whole genome analysis (J. D. Munkvold et al., 2004)
Microarray analysis
Using GO to Aid Microarray Analysis
Orthogonal to existing ontologies to facilitate combinatorial approaches
- Share unique identifier space
- Include definitions
• Anatomies
• Cell Types
• Sequence Attributes (SO)
• Temporal Attributes
• Phenotypes
• Diseases
• More….
http://obo.sourceforge.net
Beyond GO – Open Biomedical Ontologies
GO Annotations: What are they and how are they made?
Maria Costanzo
Saccharomyces and Candida Genome Databases
Stanford University
Let’s Get Started!
• What is an annotation?
• Annotation approaches
• Strategies for identifying literature to annotate
• Strategies for reading a paper for annotation
• Strategies for annotating a gene and a genome
What is a GO annotation?
• An annotation is a piece of information associated with a gene product
• A gene product is usually a protein but can be a functional RNA
• A GO annotation is a Gene Ontology term associated with a gene product
Anatomy of a GO annotation
Gene Product
GO Term
IMP, IGI, IPI, ISS, IDA, IEP, TAS, NAS, ND, RCA, IC, IEA
Evidence Code
Reference
IMP inferred from mutant phenotype
IGI inferred from genetic interaction
IPI inferred from physical interaction
ISS inferred from sequence similarity
IDA inferred from direct assay
IEP inferred from expression pattern
IC inferred by curator
TAS traceable author statement
NAS non-traceable author statement
ND no biological data available
RCA inferred from reviewed computational analysis
IEA inferred from electronic annotation
http://www.geneontology.org/doc/GO.evidence.html
Evidence Codes for GO Annotations
Additional annotation information
• WITH/FROM: supporting info for the evidence code
– IPI, IGI, ISS, IEA, IC
– Contains the interacting or similar gene product
• QUALIFIER: describes the GO term
– NOT
– contributes to (used with Molecular Function terms)
– colocalizes with (used with Cellular Component terms)
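The pieces of an annotation (gene product, GO term, evidence code, reference, plus optional WITH/FROM and qualifier) map naturally onto a record with a few sanity checks. The sketch below is one possible shape, not a GO Consortium tool. The class and method names are hypothetical, the underscore spellings of the qualifiers are an assumption about how they are written in files, and the WITH/FROM check simply mirrors the list of codes on this slide.

```python
from dataclasses import dataclass

EVIDENCE_CODES = {"IMP", "IGI", "IPI", "ISS", "IDA", "IEP",
                  "TAS", "NAS", "ND", "RCA", "IC", "IEA"}
# Codes the slide lists as using the WITH/FROM field.
WITH_FROM_CODES = {"IPI", "IGI", "ISS", "IEA", "IC"}
# Assumed file spellings of the qualifiers; "" means no qualifier.
QUALIFIERS = {"", "NOT", "contributes_to", "colocalizes_with"}


@dataclass
class Annotation:
    gene_product: str
    go_id: str
    evidence: str
    reference: str
    with_from: str = ""
    qualifier: str = ""

    def problems(self):
        """Return a list of human-readable issues (empty if none found)."""
        issues = []
        if self.evidence not in EVIDENCE_CODES:
            issues.append(f"unknown evidence code: {self.evidence}")
        if self.with_from and self.evidence not in WITH_FROM_CODES:
            issues.append(f"WITH/FROM not expected for {self.evidence}")
        if self.qualifier not in QUALIFIERS:
            issues.append(f"unknown qualifier: {self.qualifier}")
        return issues


nek2 = Annotation("nek2", "GO:0005813", "IDA", "PMID:11956323")
print(nek2.problems())
```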
Approaches for annotation of a genome
1. Automated/Electronic approaches
2. Manual approaches
3. Combinatorial approach
Electronic Annotation
• Generate annotations relatively quickly & cheaply
• Annotation derived without human validation
– Sequence similarity, e.g. BLAST search ‘hits’, HMMs, etc.
– Mapping file, e.g. interpro2go, ec2go, etc.
• Useful for:
– genomes that don’t have extensive literature
– groups with limited curatorial resources
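A mapping-based electronic annotation amounts to a table lookup: each gene's existing label (an EC number here) is translated into GO IDs and tagged IEA. The sketch below uses a tiny in-memory dictionary as a stand-in for a mapping file such as ec2go. The gene names, the `electronic_annotations` function and the reference string are all hypothetical placeholders.

```python
# Hypothetical stand-in for a mapping file such as ec2go.
# GO:0008168 is "methyltransferase activity".
EC_TO_GO = {"2.1.1.-": ["GO:0008168"]}


def electronic_annotations(gene_to_ec, mapping=EC_TO_GO,
                           reference="UNPUBLISHED:mapping-method-abstract"):
    """Turn EC assignments into (gene, GO ID, evidence, reference, with) rows."""
    rows = []
    for gene, ec_numbers in gene_to_ec.items():
        for ec in ec_numbers:
            # EC numbers missing from the mapping produce no annotation.
            for go_id in mapping.get(ec, []):
                rows.append((gene, go_id, "IEA", reference, "EC:" + ec))
    return rows


rows = electronic_annotations({"gene1": ["2.1.1.-"], "gene2": ["9.9.9.9"]})
print(rows)
```

Every row carries the IEA code and a reference describing the method, since no human has reviewed the result.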
Electronic Annotation
• Often based on sequence similarity
• Document the method used in an abstract
– unpublished abstract in your own database
– unpublished abstract submitted to GO references collection
• Annotation is not reviewed by a human
• IEA evidence code
Combinatorial Approach, e.g. using sequence similarity
1. Alignments published in literature
2. Analysis using full length protein
3. Analysis using protein domains
Example IEA Annotations from dictyBase
Example unpublished reference
Manual annotation
• Created by scientific curators
• Time intensive
• Utilizes
– published literature
– sequence comparison data
• Aided by curation tools
– Manatee (open source from TIGR)
– Apollo (open source from GMOD)
– Artemis (open source)
Literature Sources
1. PubMed
- National Library of Medicine, National Institutes of Health
- http://ncbi.nlm.nih.gov
2. Agricola
- United States Department of Agriculture, National Agricultural Library
- http://agricola.nal.usda.gov
3. Embase
- Elsevier
- http://www.embase.com
4. Biosis
- Thomson
- http://www.biosis.org
5. Unpublished (e.g. for internal sequence analysis methods)
- abstract in your own database
- unpublished abstract submitted to GO references collection
Example Annotation
Gene Product: nek2
GO Term: centrosome (GO:0005813)
Reference: PMID:11956323
Evidence Code: IDA (Inferred from Direct Assay)
1. Species name
2. Gene/gene product names: daf-12, spo11, Sonic hedgehog
3. Process AND species: embryonic development AND elegans
4. Function AND species: transcription factor AND mays
5. Cellular component AND species (genus): plasma membrane AND Drosophila
What to Search For in Published Literature?
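The search patterns listed here are plain Boolean AND queries, so they can be assembled programmatically. The sketch below builds the query string and a search URL without sending any request. It assumes NCBI's E-utilities `esearch` endpoint, and the helper names `pubmed_query` and `esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; check current NCBI docs).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_query(*terms):
    """Join search terms with the Boolean AND operator."""
    return " AND ".join(terms)


def esearch_url(query, retmax=20):
    """Build (but do not send) a PubMed search URL for the query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


query = pubmed_query("embryonic development", "elegans")
print(query)
print(esearch_url(query))
```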
GO Annotation: GMOD Tools for Enhancing Information Retrieval
GMOD – Generic Software Components for Model Organism Databases
- http://www.gmod.org/home
- Literature search tools:
PubSearch – http://www.gmod.org/?q=node/44
PubFetch – http://www.gmod.org/?q=node/84
Textpresso – http://www.textpresso.org
- full text of articles
- semantic categories
GO Annotation: Strategies for Identifying Literature for Curation
1. Primary research literature with new experimental data
- Mutant phenotypes – process
- Activity assays – function
- Localization studies – component
2. Computational analyses
- Phylogenetic analysis – function (ISS)
- Domain analysis
3. Review articles
- Summarize and cite primary literature (TAS)
Which parts of the paper are most important?
• Experimental Results
• Results: Figures, Tables, Text
• Materials and Methods
• Introductory information
• Abstract
• Explanatory text (use with caution)
• (Introduction) – mostly TAS information
• (Discussion)
Reading papers as a curator, rather than as a bench scientist
• Don’t be swayed by the speculations or theories that may appear in the Discussion.
• Focus on the actual results vs. the possible, but not proven, implications of those results.
• Read for details and contact authors if key identifiers are missing.
How to find a GO term to use?
• Web based tools
– AmiGO browser (http://www.godatabase.org)
– QuickGO (http://www.ebi.ac.uk/ego/)
• Downloadable tool (https://sourceforge.net/projects/geneontology/)
– OBO-Edit (must also download the ontology file)
Extracting Information from a paper
Sample text from PMID: 12374299
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
Example Manual Annotations from SGD
Annotation from published literature
1. Focus on known genes
2. Identify literature relevant to that gene
a. using gene names, species name
3. Complete annotation set for a gene
a. annotate available experimental data
b. annotations to root nodes indicate nothing is known
[other organelles]
chromosome
Intracellular organelle
nucleus
nuclear chromosome
cell part
cellular_component
mitochondrial chromosome
[Other types of chromosomes]
is_a
part_of
Annotating genes to GO terms and the True Path Rule
RAD51 RAD52
The True Path Rule Applied to Annotations
Are all paths to the root true for my gene product?
• Yes, great, annotate
• No?
– Is there a term I can use where all paths will be true?
– Does the ontology structure need to be changed?
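The question "are all paths to the root true for my gene product?" can be checked mechanically once a curator has listed terms known not to apply. The sketch below enumerates every path from a term to the root and reports the paths that pass through such a term. The small parent map is an assumed simplification of the sample DAG, and `paths_to_root` and `false_paths` are hypothetical names.

```python
# Simplified parent map (relationship types omitted); illustrative only.
PARENTS = {
    "cellular_component": [],
    "intracellular organelle": ["cellular_component"],
    "nucleus": ["intracellular organelle"],
    "chromosome": ["intracellular organelle"],
    "nuclear chromosome": ["chromosome", "nucleus"],
}


def paths_to_root(term, parents=PARENTS):
    """List every path from term up to a root, each as a list of terms."""
    if not parents[term]:
        return [[term]]
    return [[term] + rest
            for parent in parents[term]
            for rest in paths_to_root(parent, parents)]


def false_paths(term, not_true_for_gene, parents=PARENTS):
    """Paths that include a term the curator knows is false for the gene."""
    return [path for path in paths_to_root(term, parents)
            if any(step in not_true_for_gene for step in path)]


# A gene product known not to be in the nucleus fails one path.
print(false_paths("nuclear chromosome", {"nucleus"}))
```

An empty result means the annotation is safe under the True Path Rule. A non-empty one means either a different term is needed or the ontology structure should be questioned.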
I don’t see terms in the ontology to describe the biology of my species
• SourceForge (SF) tracker for term related issues: https://sourceforge.net/projects/geneontology/
• Send an email to the GO mailing list
• Content meetings
– Organized by the consortium if the ontology related issues can’t be resolved over email/SF
– Look for announcements on the GO website, mailing lists
http://pamgo.vbi.vt.edu
1. Develop GO terms for functions, processes and structures used by microbes in their associations with plants and animals
• fungi, oomycetes, bacteria, nematodes
• 472 terms recently added to GO
2. Create reference genomes by manual annotation of selected microbe genomes
• in progress
3. Training workshops
• July 26, 2007. IS-MPMI Workshop, Sorrento, Italy
• August 8-10, 2007. Virginia Bioinformatics Institute
- travel funds available for students and postdocs
GO Terms needed: Secondary Metabolism
• The fungal community is going to need to add new terms:
– secondary metabolism pathways
– possibly other areas
• Fungal species so far annotated have not had secondary metabolism pathways, so no terms have been created to represent these areas
• The GO Consortium will be very happy to work with the fungal community to create the needed terms
Contributing GO Annotations
Karen Christie
Saccharomyces Genome Database
Stanford University
Do you (or your group) have gene calls for your sequence?
– No: make automated or manual gene calls
– Yes: go to the next question
Are the protein predictions submitted to GenBank/DDBJ/EMBL?
– No: submit gene/protein calls to GenBank/DDBJ/EMBL
– Yes: UniProtKB contains translations of all coding regions in GenBank/DDBJ/EMBL, and GOA will make GO annotations (IEA) using automated methods
Resources to make functional annotations?
– No: GOA will collect all GO annotations and submit them to GOC; GOA will maintain the annotation file
Do you (or your group) have gene calls for your sequence?
– No: make automated or manual gene calls
– Yes: go to the next question
Are the protein predictions submitted to GenBank/DDBJ/EMBL?
– No: submit gene/protein calls to GenBank/DDBJ/EMBL
– Yes: UniProtKB contains translations of all coding regions in GenBank/DDBJ/EMBL, and GOA will make GO annotations (IEA) using automated methods
Resources to make functional annotations?
– Yes:
• Contact GO Consortium for advice, training, help with coordination, etc.
• Set up pipeline for any automated annotations not being done by GOA
• Make manual GO annotations from literature, or from sequence similarity methods
• Decide who will collate all GO annotations into one file
• You (or your group) collect all GO annotations and submit them to GOC
• You (or your group) maintain the annotation file
I have my annotations, what next?
DB: source of the ID in column 2. Examples: SGD, MGI, UniProt
ID for the gene or gene_product. Examples: FBgn0015331, MGI:99240, SPAC9.03c
Symbol: a name like Brr2 or DDX21_HUMAN that means something to a biologist, not an ID
Object_Type: gene, transcript, protein, protein_structure, or complex; should match the ID
gene_association file - format info at http://www.geneontology.org/GO.annotation.shtml#file
Columns: DB source; DB Object ID; Object Symbol; Qualifier; GOID; DB:reference; Ev_code; With/From; Aspect; DB object Name; Synonym; Object_type; Taxon ID; Date; Assigned by
SGD S000004660 AAC1 GO:0005743 SGD_REF:S000050955|PMID:2167309 TAS C ADP/ATP translocator YMR056C gene taxon:4932 20010118 SGD
SGD S000004660 AAC1 GO:0005471 SGD_REF:S000050955|PMID:2167309 IDA F ADP/ATP translocator YMR056C gene taxon:4932 20010213 SGD
SGD S000004660 AAC1 GO:0006839 SGD_REF:S000050955|PMID:2167309 IGI SGD:S000000126 P ADP/ATP translocator YMR056C gene taxon:4932 20040226 SGD
SGD S000004660 AAC1 GO:0009060 SGD_REF:S000050955|PMID:2167309 IGI SGD:S000000126 P ADP/ATP translocator YMR056C gene taxon:4932 20040226 SGD
SGD S000000289 AAC3 GO:0005743 SGD_REF:S000045889|PMID:2165073 ISS SGD:S000000126|SGD:S000004660 C ADP/ATP translocator YBR085W|ANC3 gene taxon:4932 20040226 SGD
SGD S000000289 AAC3 GO:0005471 SGD_REF:S000045889|PMID:2165073 ISS SGD:S000000126|SGD:S000004660 F ADP/ATP translocator YBR085W|ANC3 gene taxon:4932 20040226 SGD
SGD S000000289 AAC3 GO:0009061 SGD_REF:S000045889|PMID:2165073 IGI SGD:S000000126 P ADP/ATP translocator YBR085W|ANC3 gene taxon:4932 20040226 SGD
SGD S000000289 AAC3 GO:0009061 SGD_REF:S000052497|PMID:1915842 IGI SGD:S000000126|SGD:S000004660 P ADP/ATP translocator YBR085W|ANC3 gene taxon:4932 20040226 SGD
SGD S000000289 AAC3 GO:0009061 SGD_REF:S000045889|PMID:2165073 IEP P ADP/ATP translocator YBR085W|ANC3 gene taxon:4932 20040226 SGD
SGD S000003916 AAD10 GO:0008372 SGD_REF:S000069584 ND C aryl-alcohol dehydrogenase (putative) YJR155W gene taxon:4932 20010119 SGD
SGD S000003916 AAD10 GO:0018456 SGD_REF:S000042151|PMID:10572264 ISS F aryl-alcohol dehydrogenase (putative) YJR155W gene taxon:4932 20020902 SGD
SGD S000003916 AAD10 GO:0006081 SGD_REF:S000042151|PMID:10572264 ISS P aryl-alcohol dehydrogenase (putative) YJR155W gene taxon:4932 20020902 SGD
SGD S000005275 AAD14 GO:0008372 SGD_REF:S000069584 ND C aryl-alcohol dehydrogenase (putative) YNL331C gene taxon:4932 20010119 SGD
SGD S000005275 AAD14 GO:0018456 SGD_REF:S000042151|PMID:10572264 ISS F aryl-alcohol dehydrogenase (putative) YNL331C gene taxon:4932 20020902 SGD
SGD S000005275 AAD14 GO:0006081 SGD_REF:S000042151|PMID:10572264 ISS P aryl-alcohol dehydrogenase (putative) YNL331C gene taxon:4932 20020902 SGD
SGD S000005525 AAD15 GO:0008372 SGD_REF:S000069584 ND C aryl-alcohol dehydrogenase (putative) YOL165C gene taxon:4932 20010119 SGD
SGD S000005525 AAD15 GO:0018456 SGD_REF:S000042151|PMID:10572264 ISS F aryl-alcohol dehydrogenase (putative) YOL165C gene taxon:4932 20020902 SGD
SGD S000005525 AAD15 GO:0006081 SGD_REF:S000042151|PMID:10572264 ISS P aryl-alcohol dehydrogenase (putative) YOL165C gene taxon:4932 20020902 SGD
SGD S000001837 AAD16 GO:0008372 SGD_REF:S000069584 ND C YFL057C gene taxon:4932 20020902 SGD
SGD S000001837 AAD16 GO:0018456 SGD_REF:S000042151|PMID:10572264 ISS F YFL057C gene taxon:4932 20020902 SGD
SGD S000001837 AAD16 GO:0006081 SGD_REF:S000042151|PMID:10572264 ISS P YFL057C gene taxon:4932 20020902 SGD
SGD S000000704 AAD3 GO:0008372 SGD_REF:S000069584 ND C aryl-alcohol dehydrogenase (putative) YCR107W gene taxon:4932 20010119 SGD
SGD S000000704 AAD3 GO:0018456 SGD_REF:S000042151|PMID:10572264 ISS F aryl-alcohol dehydrogenase (putative) YCR107W gene taxon:4932 20020902 SGD
SGD S000000704 AAD3 GO:0006081 SGD_REF:S000042151|PMID:10572264 ISS P aryl-alcohol dehydrogenase (putative) YCR107W gene taxon:4932 20020902 SGD
These columns may be empty
Sample gene-associations file
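Reading such a file is mostly a matter of splitting each line on tabs and naming the fifteen columns. The sketch below assumes a tab-delimited layout with the column order shown in the sample header and pipe-separated multi-value fields. The key names and `parse_gaf_line` are hypothetical, and real files may also contain comment lines that would need skipping.

```python
# Column names follow the order of the sample header; the keys are our own.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]
MULTI_VALUE = ("db_reference", "with_from", "db_object_synonym")


def parse_gaf_line(line):
    """Split one tab-delimited annotation line into a dict of named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    record = dict(zip(GAF_COLUMNS, fields))
    for key in MULTI_VALUE:
        # Pipe-separated fields become lists; empty fields become [].
        record[key] = [value for value in record[key].split("|") if value]
    return record


# First AAC1 row from the sample, with empty Qualifier and With/From columns.
row = "\t".join([
    "SGD", "S000004660", "AAC1", "", "GO:0005743",
    "SGD_REF:S000050955|PMID:2167309", "TAS", "", "C",
    "ADP/ATP translocator", "YMR056C", "gene", "taxon:4932",
    "20010118", "SGD",
])
record = parse_gaf_line(row)
print(record["go_id"], record["evidence_code"], record["db_reference"])
```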
What tools/infrastructure do you need to record annotations?
• Excel spreadsheet (simple, easy, small scale)
OR
• Database
– FileMaker Pro, Access (simple databases)
– ORACLE, Sybase, or MySQL (relational databases)
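For a sense of what the database option involves, the sketch below stores one annotation in a single table and reads it back. It uses SQLite through Python's standard library purely because it needs no setup. SQLite is not one of the systems named on the slide, and the table and column names are hypothetical.

```python
import sqlite3

# In-memory database for illustration; a real project would use a file
# or one of the database systems listed above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        gene_id   TEXT NOT NULL,
        go_id     TEXT NOT NULL,
        evidence  TEXT NOT NULL,
        reference TEXT NOT NULL,
        with_from TEXT NOT NULL DEFAULT '',
        UNIQUE (gene_id, go_id, evidence, reference)
    )
""")

# One row taken from the AAC1 example in the sample file.
conn.execute(
    "INSERT INTO annotation (gene_id, go_id, evidence, reference) "
    "VALUES (?, ?, ?, ?)",
    ("S000004660", "GO:0005743", "TAS", "PMID:2167309"),
)

rows = conn.execute(
    "SELECT go_id, evidence FROM annotation WHERE gene_id = ?",
    ("S000004660",),
).fetchall()
print(rows)
```

The UNIQUE constraint is one design choice a spreadsheet cannot enforce: the same gene, term, evidence code and reference cannot be entered twice.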
How do I share my gene_associations file?
• Provide them to the larger community by submitting your annotations to the GO project
• What information should I submit to GO?
– Gene_association file
– Short file with info about submitting group
• Where should I submit the data?
– Contact the GOC to establish a contact for your group – [email protected]
Databases contributing annotations include:
– dictyBase (Dictyostelium discoideum)
– FlyBase (Drosophila melanogaster)
– GeneDB (Schizosaccharomyces pombe, Plasmodium falciparum, Leishmania major and Trypanosoma brucei)
– UniProt Knowledgebase (Swiss-Prot/TrEMBL/PIR-PSD) and InterPro databases
– Gramene (grains, including rice, Oryza)
– Mouse Genome Database (MGD) and Gene Expression Database (GXD) (Mus musculus)
– Rat Genome Database (RGD) (Rattus norvegicus)
– Reactome
– Saccharomyces Genome Database (SGD) (Saccharomyces cerevisiae)
– The Arabidopsis Information Resource (TAIR) (Arabidopsis thaliana)
– The Institute for Genomic Research (TIGR): databases on several bacterial species
– WormBase (Caenorhabditis elegans)
– Zebrafish Information Network (ZFIN) (Danio rerio)
Annotation coverage
Annotation Coverage by Genome
GO Current Annotations
http://www.geneontology.org/GO.current.annotations.shtml
GO Current Annotations: Filtered Files
http://www.geneontology.org/GO.current.annotations.shtml
GO Current Annotations: Unfiltered Files
http://www.geneontology.org/GO.current.annotations.shtml
GOA Proteome Species Specific Files
http://www.ebi.ac.uk/GOA/proteomes.html
Resources offered by the GO project
• Website (http://www.geneontology.org)
– Lots of documentation
– Tools, tutorials and software
• Mailing list ([email protected])
• Help email address ([email protected])
• GO project on SourceForge (https://sourceforge.net/projects/geneontology)
– Submit suggestions, e.g. new ontology terms, etc.
– Download tools, e.g. OBO-Edit
• AmiGO browser (http://amigo.geneontology.org)
• GO database
AmiGO Tutorial
Rama Balakrishnan
Saccharomyces Genome Database Stanford University
What is AmiGO?
• Web application that allows you to:
– browse the ontologies
– view annotations from various species
– compare sequences using BLAST (GOst)
AmiGO
http://amigo.geneontology.org
Basic Search
AmiGO Search Results: GO Terms
Term Details Page
Gene Product Details and Annotations
Is_a relationship
Part_of relationship
Leaf node or no children
Node has been opened, can be clicked to close
Node has children, can be clicked to view children
pie chart summary of the numbers of gene products associated to
any immediate descendants of this term in the tree.
Annotations associated with a term
Annotation data are from the gene_associations file submitted by the annotating groups
AmiGO Advanced Search
Filters
BLAST
• Blast a protein sequence against all gene products that have a GO annotation
• Can be accessed from the AmiGO Home page (front page)
BLAST can also be accessed from the annotations section
AmiGO Help
Contact us
• We welcome your input
• Please send suggestions, bugs to us
• [email protected]
Acknowledgements
The people of the GO Consortium: