
Page 1: Chibucos annot go_final

GENE ANNOTATION AND ONTOLOGY

Marcus C. Chibucos, Ph.D.

Arabidopsis thaliana ATPase HMA4 zinc binding domain

GO:0006829 : zinc ion transport (BP)
GO:0005886 : plasma membrane (CC)
GO:0005515 : protein binding (MF)

Annotation

Ontology

Evidence

Page 2: Chibucos annot go_final

2

Outline of this talk

Background: the language of biology

Gene Ontology: overview, terms & structure

Annotating with GO and Evidence

Using annotation to facilitate your research

Page 3: Chibucos annot go_final

3

About screenshots in this talk

AmiGO web-based ontology browser http://amigo.geneontology.org

OBO-Edit stand-alone editor http://oboedit.org

Page 4: Chibucos annot go_final

4

What is annotation? Who is involved?
Term confusion (what’s in a name?)
Scale: the sea of data
Controlled vocabularies & ontologies
The Gene Ontology Consortium

Background: the language of biology

Page 5: Chibucos annot go_final

5

Annotation

annotate – to make or furnish critical or explanatory notes or comment.

(Merriam-Webster dictionary)

genome annotation – the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding the layers of analysis and interpretation necessary to extract its biological significance and place it into the context of our understanding of biological processes.

(Lincoln Stein, PMID 11433356)

Gene Ontology annotation – the process of assigning GO terms to gene products… according to two general principles: first, annotations should be attributed to a source; second, each annotation should indicate the evidence on which it is based.

(http://www.geneontology.org)

Page 6: Chibucos annot go_final

6

Diverse parties involved

End-users, including various researchers
◦ Small-scale laboratory projects
◦ Whole genome sequencing projects

Annotators
◦ From reading papers to computational analysis

Ontology developers
◦ Create terms that reflect scientific knowledge
◦ Make interoperable ontologies, database links

Developers of tools & resources
◦ Standards for storing & sharing data
◦ Web interfaces for data analysis & sharing

Many areas of expertise
◦ Laboratory sciences – biology, chemistry, medicine, and many other disciplines
◦ Computational science – bioinformatics, genomics, statistics
◦ Software development & web design
◦ Philosophy – ontology & logic

Page 7: Chibucos annot go_final

7

Term confusion: synonyms

Do biologists use precise & consistent language?
◦ Mutually understood concepts – DNA, RNA, or protein
◦ Synonym (one thing known by more than one name) – translation and protein synthesis

Enzyme Commission reactions
◦ Standardized id, official name & alternative names

http://www.expasy.ch/enzyme/2.7.1.40

Page 8: Chibucos annot go_final

8

Term confusion: homonyms

Homonyms common in biology – different things known by the same name
◦ Sporulation
◦ Vascular (plant vasculature, i.e. xylem & phloem, or vascular smooth muscle, i.e. blood vessels?)

“Sporulation”

Endospore formation, Bacillus anthracis
http://www.microbelibrary.org/ASMOnly/details.asp?id=1426&Lang= © L Stauffer 2003 (accessed 17-Sep-09)

Reproductive sporulation: asci & ascospores, Morchella elata (morel)
http://en.wikipedia.org/wiki/File:Morelasci.jpg © PG Warner 2008 (accessed 17-Sep-09)

Page 9: Chibucos annot go_final

9

Term confusion: homonyms and biological complexity

AmiGO query “vascular” returns 51 terms

In biology, many related phenomena are described with similar terminology

Page 10: Chibucos annot go_final

10

The problem of scale

Enormous data sets
◦ Microarray experiments
◦ Whole genome sequencing projects
◦ Comparative genomics of multiple diverse taxa

Computers don’t understand nuance
◦ Millions of proteins to annotate
◦ How to effectively search?
◦ How to draw meaningful comparisons?

http://en.wikipedia.org/wiki/File:Microarray2.gif (accessed 17-Sep-09)

Small data sets, small experiments & isolated scientific communities?

Page 11: Chibucos annot go_final

11

The Gene Ontology (GO)

Way to address the problems of synonyms, homonyms, biological complexity, increasing glut of data

GO provides a common biological language for protein functional annotation

www.geneontology.org

Page 12: Chibucos annot go_final

12

Controlled vocabulary (CV)

An official list of precisely defined terms that can be used to classify information and facilitate its retrieval
◦ Think of flat list like a thesaurus or catalog

Benefits of CVs
◦ Allow standardized descriptions of things
◦ Remedy synonym & homonym issues
◦ Can be cross-referenced externally
◦ Facilitate electronic searching

A CV can be “…used to index and retrieve a body of literature in a bibliographic, factual, or other database. An example is the MeSH controlled vocabulary used in MEDLINE and other MEDLARS databases of the NLM.”
http://www.nlm.nih.gov/nichsr/hta101/ta101014.html

Page 13: Chibucos annot go_final

13

Ontology is a type of CV with defined relationships

GO terms describe biological attributes of gene products…

Ontology – formalizes knowledge of a subject with precise textual definitions

Networked terms where child is more specific (“granular”) than parent

Figure labels: Less specific; More granular

Page 14: Chibucos annot go_final

14

How GO works

GO Consortium develops & maintains:
◦ Ontologies and cross-links between ontologies and different resources
◦ Tools to develop and use the ontologies
◦ SourceForge tracker for development

People studying organisms at databases annotate gene products with GO terms

Groups share files of annotation data about their respective organisms

Because a common language was used to describe gene products and this information was shared amongst databases…
◦ We can search uniformly across databases
◦ Do comparative genomics of diverse taxa

Page 15: Chibucos annot go_final

15

GO on SourceForge
sourceforge.net/projects/geneontology

Page 16: Chibucos annot go_final

16

The Gene Ontology Consortium

(Member logos on the slide include ZFIN, Reactome and IGS.)

Collaboration began 1998 among model organism databases mouse (MGI), fruit fly (FlyBase) and baker’s yeast (SGD)
◦ Michael Ashburner of FlyBase contributed the base vocabulary
◦ Today > 20 members & associates

First publication 2000 (PMID 10802651)
◦ Today, PubMed query “gene ontology” yields 3,347 papers (27-Jun-2011)

Organisms represented by GO annotations from every kingdom of life

Many groups use GO in many different ways for their research

Among eight OBO-Foundry ontologies

Page 17: Chibucos annot go_final

17

OBO Foundry ontologies
www.obofoundry.org

Collaboration among developers of science-based ontologies
◦ Establish principles for ontology development
◦ Goal of creating a suite of orthogonal interoperable reference ontologies in the biomedical domain

many others…

Page 18: Chibucos annot go_final

18

What the GO is not
GO comprises three ontologies
Anatomy & storage of GO terms
Ontology structure
Detail of a term in AmiGO
True path rule

Gene Ontology: overview, terms & structure

Page 19: Chibucos annot go_final

19

Caveats – what GO is not

Not gene naming system or gene catalog
◦ GO describes attributes of biological objects – “oxidoreductase activity” not “cytochrome c”

The three ontologies have limitations
◦ No sequence attributes or structural features
◦ No characteristics unique to mutants or disease
◦ No environment, evolution or expression
◦ No anatomy features above cellular component

Not dictated standard or federated solution
◦ Databases share annotations as they see fit
◦ Curators evaluate differently

GO is evolving as our knowledge evolves
◦ New terms added on daily basis
◦ Incorrect/poorly defined terms made obsolete
◦ Secondary ids – terms with same meaning merged

Page 20: Chibucos annot go_final

20

GO comprises three ontologies

Cellular component ontology (CC)
◦ “cytoplasm”

Molecular function ontology (MF)
◦ “protein binding”
◦ “peptidase activity”
◦ “cysteine-type endopeptidase activity”

Biological process ontology (BP)
◦ “proteolysis”
◦ “apoptosis”

Terms describe attributes of gene products (GPs)
◦ Any protein or RNA encoded by a gene
◦ Species-independent context, e.g. “ribosome”
◦ Could describe GPs found in limited taxa, e.g. “photosynthesis” or “lactation”

One GP can be associated with ≥ 1 CC, BP, MF
◦ Example: Caspase-6 from Bos taurus

Page 21: Chibucos annot go_final

21

Cellular component ontology

Describes location at level of subcellular structure & macromolecular complex

GP subcomponent of or located in particular cellular component, with some exceptions:
◦ No individual proteins or nucleic acids
◦ No multicellular anatomical terms
◦ For annotation purposes, a GP can be associated with or located in ≥ one cellular component

Anatomical structure
◦ rough endoplasmic reticulum
◦ nucleus
◦ nuclear inner membrane

Multi-subunit enzyme or protein complex
◦ ribosome
◦ proteasome
◦ ubiquitin ligase complex

Page 22: Chibucos annot go_final

22

Molecular function ontology

Describes gene product activity at molecular level
◦ Describes attributes of entities

Adenylate cyclase (E.C. 4.6.1.1)
◦ Catalyzes a specific reaction: ATP = 3',5'-cyclic AMP + diphosphate
◦ Described by the Gene Ontology term: “adenylate cyclase activity” (GO:0004016)
http://www.ebi.ac.uk/pdbsum/1ab8 [accessed 4-Feb-2010]

Usually single GP, sometimes a complex
◦ “ferritin receptor activity”
◦ Definition: “combining with ferritin, an iron-storing protein complex, to initiate a change in cell activity”

Broad functions
◦ “catalytic activity”
◦ “transporter activity”
◦ “binding”

Specific functions
◦ “adenylate cyclase activity”
◦ “protein-DNA complex transmembrane transporter activity”
◦ “Fc-gamma receptor I complex binding”

Page 23: Chibucos annot go_final

23

Biological process ontology

Describes recognized series of events or molecular functions with a defined beginning and end

“GO does not try to represent the dynamics or dependencies that would be required to fully describe a pathway” (from GO documentation)

Mutant phenotypes often reflect disruptions in BP

Specific process
◦ “pyrimidine metabolism”
◦ “α-glucosidase transport”

Broad process
◦ “cellular physiological process”
◦ “signal transduction”

http://www.geneontology.org/GO.process.guidelines.shtml

Sections shown:
◦ General considerations
◦ The Cell Cycle
◦ The Development Node
◦ Multi-Organism Process
◦ Metabolism
◦ Regulation
◦ Detection of and Response to Stimuli
◦ Sensory Perception
◦ Signaling Pathways
◦ Transport and Localization
◦ Transporter activity (molecular function)
◦ Other Misc. Standard Defs

Page 24: Chibucos annot go_final

24

Anatomy of a GO term

Term name

goid (unique numerical identifier)

Precise textual definition with reference stating source

Synonyms (broad or narrow) for searching, alternative names, misspellings…

GO slim

Ontology placement

Page 25: Chibucos annot go_final

25

Storage and cross referencing of GO terms

Storage in flat file (text)

Database cross reference for mappings to GO
◦ GO term identical to object in other database
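As a rough illustration of the flat-file storage and term anatomy described above, here is a minimal sketch that parses one OBO-style [Term] stanza. The stanza is written by hand from the “amino acid transmembrane transporter activity” term shown later in the talk; the tag layout follows the OBO flat-file convention as I understand it, and `parse_term` is a hypothetical helper, not part of any GO tool.

```python
import re

# One [Term] stanza, hand-written in OBO flat-file style. The name, id,
# definition and parents come from the "term with two parents" slide.
# Text after " ! " is a human-readable comment in the OBO convention.
STANZA = """\
[Term]
id: GO:0015171
name: amino acid transmembrane transporter activity
namespace: molecular_function
def: "Catalysis of the transfer of amino acids from one side of a membrane to the other. Amino acids are organic molecules that contain an amino group and a carboxyl group." [GOC:ai, GOC:mtg_transport, ISBN:0815340729]
is_a: GO:0005275 ! amine transmembrane transporter activity
is_a: GO:0046943 ! carboxylic acid transmembrane transporter activity
"""

# A goid is the prefix "GO:" followed by seven digits.
GOID = re.compile(r"^GO:\d{7}$")


def parse_term(stanza):
    """Parse a single [Term] stanza into a dict of tag -> list of values."""
    term = {}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        value = value.split(" ! ")[0].strip()  # drop the trailing comment
        term.setdefault(tag, []).append(value)
    return term


term = parse_term(STANZA)
print(term["id"][0], term["name"][0])
print("parents:", term["is_a"])
```

Repeated tags such as `is_a` are collected into lists, which is what lets one term carry more than one parent.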

Page 26: Chibucos annot go_final

26

Ontology structure: parent-child relationship

Parent term (broader): hexose metabolism; monosaccharide biosynthesis

Child term (specialized): hexose biosynthesis

Up in the tree is more general; down in the tree is more specific

Annotation of genes
◦ Start with terms denoting broad functional categories
◦ Use more specific term as knowledge warrants

Page 27: Chibucos annot go_final

27

Ontology structure: terms arranged in DAGs

GO terms structured as hierarchical-like directed acyclic graphs (DAGs)
◦ Tree-like, but each term can have more than one parent (pseudo-hierarchy)
◦ Each term may have one or more child terms (“siblings” share same parent)

Figure labels: parents; parent; child term; child terms; “siblings”
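The DAG idea can be sketched in a few lines. This is only an illustration: the three hexose/monosaccharide terms come from the parent-child slide, while the shared top term “monosaccharide metabolism” and the helper names `ancestors` and `siblings` are added here for the example.

```python
# child -> list of (relationship, parent). A term may list several parents,
# which is what makes the structure a DAG rather than a strict tree.
PARENTS = {
    "hexose biosynthesis": [
        ("is_a", "hexose metabolism"),
        ("is_a", "monosaccharide biosynthesis"),
    ],
    "hexose metabolism": [("is_a", "monosaccharide metabolism")],
    "monosaccharide biosynthesis": [("is_a", "monosaccharide metabolism")],
    "monosaccharide metabolism": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for _rel, parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def siblings(term, parents=PARENTS):
    """Return the other terms that share at least one parent with `term`."""
    mine = {p for _rel, p in parents.get(term, [])}
    return {
        other
        for other, links in parents.items()
        if other != term and mine & {p for _rel, p in links}
    }


print(sorted(ancestors("hexose biosynthesis")))
print(sorted(siblings("hexose metabolism")))
```

Because both parents of “hexose biosynthesis” lead to the same top term, the `seen` set keeps that term from being visited twice.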

Page 28: Chibucos annot go_final

28

GO has three term relationships

is_a - child is a subclass of parent (“A is_a B”)
◦ Class-subclass relationship

part_of - child part of parent (“C part_of D”)
◦ When C present, part of D; but C not always present
◦ Nucleus always part_of cell; not all cells have nuclei

regulates
◦ Child term regulates parent term

(Zoomed in view of biological process ontology depicted here.)

Page 29: Chibucos annot go_final

29

AmiGO for viewing terms

Open source HTML-based application developed by the GO Consortium

Interface for browsing, querying and visualizing OBO data
◦ Users can search GO terms or annotations

Available via website or download for local install
◦ http://amigo.geneontology.org

Example query with keyword “hemolysis” or goid GO:0019836

Page 30: Chibucos annot go_final

30

AmiGO search results

Click

Page 31: Chibucos annot go_final

31

Term information in AmiGO

Webpage continues…

Page 32: Chibucos annot go_final

32

AmiGO view continued

Our term is much further down…

Number of gene products in GO annotation collection annotated to that term or one of its child terms

Relationship between term and its parent

Several informative views

Click

Page 33: Chibucos annot go_final

33

Graph view

Alternative view of network of terms

Page 34: Chibucos annot go_final

34

A term with two parents

amine group carboxylic acid group

generic amino acid

• Name: amino acid transmembrane transporter activity

• ID number: GO:0015171

• Definition: Catalysis of the transfer of amino acids from one side of a membrane to the other. Amino acids are organic molecules that contain an amino group and a carboxyl group. [source: GOC:ai, GOC:mtg_transport, ISBN:0815340729]

• parent term: amine transmembrane transporter activity (GO:0005275)

• relationship to parent: “is_a”

• parent term: carboxylic acid transmembrane transporter activity (GO:0046943)

• relationship to parent: “is_a”

Page 35: Chibucos annot go_final

35

Multiple paths to root: graphical view in OBO-Edit

Page 36: Chibucos annot go_final

36

“True path rule”

The pathway from a term all the way up to its top-level parent(s) must always be true for any gene product that could be annotated to that term (“if true for the child, then true for the parent”)

Incorrect for Bacteria
◦ cell > organelle > mitochondrion > proton-transporting ATP synthase complex

Correct for Bacteria (and Eukaryotes)
◦ cell > intracellular > proton-transporting ATP synthase complex > plasma membrane proton-transporting ATP synthase complex; mitochondrial proton-transporting ATP synthase complex
◦ membrane > plasma membrane > plasma membrane proton-transporting ATP synthase complex
◦ organelle > mitochondrion > mitochondrial inner membrane > mitochondrial proton-transporting ATP synthase complex

(Abbreviated versions of the actual trees; “>” runs from parent to child)
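A small sketch of how the true path rule could be checked mechanically. The two graphs are transcribed from the abbreviated trees on this slide (the exact nesting is my reading of the flattened text), and `ABSENT_IN_BACTERIA` is a one-term assumption made up for the example, not a GO resource.

```python
# child -> parents, for the "incorrect" arrangement.
BAD = {
    "proton-transporting ATP synthase complex": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": ["cell"],
    "cell": [],
}

# child -> parents, for the "correct" arrangement.
GOOD = {
    "plasma membrane proton-transporting ATP synthase complex": [
        "proton-transporting ATP synthase complex",
        "plasma membrane",
    ],
    "mitochondrial proton-transporting ATP synthase complex": [
        "proton-transporting ATP synthase complex",
        "mitochondrial inner membrane",
    ],
    "proton-transporting ATP synthase complex": ["intracellular"],
    "intracellular": ["cell"],
    "plasma membrane": ["membrane"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["organelle"],
    "organelle": [],
    "membrane": [],
    "cell": [],
}

# Assumption for illustration: a component no bacterial gene product has.
ABSENT_IN_BACTERIA = {"mitochondrion"}


def implied_terms(term, parents):
    """An annotation to `term` also asserts every term on its paths to root."""
    implied = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in implied:
                implied.add(parent)
                stack.append(parent)
    return implied


def violates_true_path(term, parents, absent):
    """True if annotating to `term` would imply a term known to be false."""
    return bool(implied_terms(term, parents) & absent)


print(violates_true_path(
    "proton-transporting ATP synthase complex", BAD, ABSENT_IN_BACTERIA))
print(violates_true_path(
    "plasma membrane proton-transporting ATP synthase complex",
    GOOD, ABSENT_IN_BACTERIA))
```

In the first graph the generic complex drags “mitochondrion” along with it, so a bacterial annotation fails the check; in the second, the plasma membrane child reaches the root without passing through any mitochondrial term.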

Page 37: Chibucos annot go_final

What is GO annotation?
Literature curation at model organism databases
The annotation file
Evidence – critical for annotation
Sequence similarity-based annotation
Annotation specificity

Annotating with GO and Evidence

37

Page 38: Chibucos annot go_final

38

GO annotation overview

Associating a GO term with a gene product
◦ Goal is to select GO terms from all three ontologies to represent what, where, and how

Linking a GO term to a gene product asserts that it has that attribute

For example, 6-phosphofructokinase
◦ Molecular function: GO:0003872 6-phosphofructokinase activity
◦ Biological process: GO:0006096 glycolysis
◦ Cellular component: GO:0005737 cytoplasm

Annotation, whether based on literature or computational methods, always involves:
◦ Learning something about a gene product
◦ Selecting an appropriate GO term
◦ Providing an appropriate evidence code
◦ Citing a [preferably open access] reference
◦ Entering information into GO annotation file

Page 39: Chibucos annot go_final

39

Chaperone DnaK, one protein/multiple annotations

Molecular function
◦ ATP binding (GO:0005524)
◦ ATPase activity (GO:0016887)
◦ unfolded protein binding (GO:0051082)
◦ misfolded protein binding (GO:0051787)
◦ denatured protein binding (GO:0031249)

Biological process
◦ protein folding (GO:0006457)
◦ protein refolding (GO:0042026)
◦ protein stabilization (GO:0050821)
◦ response to stress (GO:0006950)

Cellular component
◦ cytoplasm (GO:0005737)

Page 40: Chibucos annot go_final

40

Literature curation performed at model organism databases

From the abstract:

Page 41: Chibucos annot go_final

41

Results section indicates a “direct assay” annotation

They document the findings of a direct assay performed on purified protein:

They further document the methods used, and evaluate the findings in the Discussion section…

Page 42: Chibucos annot go_final

42

Query AmiGO with “DNA ligase” & “DNA ligation”

All “ligation” in biological process ontology

Page 43: Chibucos annot go_final

43

Resulting annotations

GO id | term name | aspect | ev. code | reference | with
GO:0003909 | DNA ligase activity | molecular function | IDA | PMID:17705817 | N/A
GO:0006266 | DNA ligation | biological process | IDA | PMID:17705817 | N/A
GO:0005737 | cytoplasm | cellular component | IC | PMID:17705817 | GO:0003909

Name: DNA ligase (stated in paper)
Gene symbol: ligA (stated in paper)
EC: 6.5.1.2 (queried enzyme for “DNA ligase”)

Page 44: Chibucos annot go_final

44

Gene annotation file captures annotations

Evidence
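To make the idea of an annotation file concrete, here is a hedged sketch that writes the three DNA ligase annotations from the previous slide as tab-delimited rows. The column set is deliberately simplified (a real GO annotation file carries more columns), and the database name `EXAMPLE_DB` and object id `EX_0001` are placeholders invented for the example. The `check` method encodes the pattern visible in the table: an IC annotation points at the GO term it was inferred from.

```python
from dataclasses import dataclass

# One-letter aspect codes used in GO annotation files.
ASPECT_CODE = {
    "molecular function": "F",
    "biological process": "P",
    "cellular component": "C",
}


@dataclass
class Annotation:
    go_id: str
    aspect: str
    evidence: str
    reference: str
    with_from: str = ""  # empty when the table says N/A

    def check(self):
        """IC (inferred by curator) must name the GO term it was inferred from."""
        if self.evidence == "IC" and not self.with_from.startswith("GO:"):
            raise ValueError(f"{self.go_id}: IC needs a GO id in the with column")

    def to_row(self, db, object_id, symbol):
        """Serialize to one tab-delimited line (simplified column set)."""
        self.check()
        fields = [
            db, object_id, symbol, self.go_id, self.reference,
            self.evidence, self.with_from, ASPECT_CODE[self.aspect],
        ]
        return "\t".join(fields)


# Values taken from the "Resulting annotations" table for DNA ligase (ligA).
ANNOTATIONS = [
    Annotation("GO:0003909", "molecular function", "IDA", "PMID:17705817"),
    Annotation("GO:0006266", "biological process", "IDA", "PMID:17705817"),
    Annotation("GO:0005737", "cellular component", "IC", "PMID:17705817",
               "GO:0003909"),
]

# EXAMPLE_DB and EX_0001 are hypothetical placeholders.
rows = [a.to_row("EXAMPLE_DB", "EX_0001", "ligA") for a in ANNOTATIONS]
for row in rows:
    print(row)
```

Keeping evidence, reference and with in separate fields is what lets a reader trace each assertion back to its source.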

Page 45: Chibucos annot go_final

45

Evidence

Essential to base annotation on evidence
◦ Conclusions more robust and traceable
◦ With evidence, a GO annotation is standard operating procedure (SOP)-independent

Many types of evidence exist
◦ For example, experiment described in literature
◦ What method (e.g. direct assay, mutant phenotype, et cetera) was used?
◦ Did author cite references?
◦ Did author provide details of analyses?

Perhaps you used a sequence-based method
◦ What were the methods of manual curation?
◦ Give accession numbers of similar sequences
◦ Provide any references describing methods

Controlled vocabularies help here, too!

Page 46: Chibucos annot go_final

46

GO standard references

GO_REF:0000011 A Hidden Markov Model (HMM) is a statistical representation of patterns found in a data set. When using HMMs with proteins, the HMM is a statistical model of the patterns of the amino acids found in a multiple alignment of a set of proteins called the "seed". Seed proteins are chosen based on sequence similarity to each other. Seed members can be chosen with different levels of relationship to each other. They can be members of a superfamily (ex. ABC transporter, ATP-binding proteins), they can all share the same exact specific function (ex. biotin synthase) or they could share another type of relationship of intermediate specificity (ex. subfamily, domain). New proteins can be scored against the model generated from the seed according to how closely the patterns of amino acids in the new proteins match those in the seed. There are two scores assigned to the HMM which allow annotators to judge how well any new protein scores to the model. Proteins scoring above the "trusted cutoff" score can be assumed to be part of the group defined by the seed. Proteins scoring below the "noise cutoff" score can be assumed to NOT be a part of the group. Proteins scoring between the trusted and noise cutoffs may be part of the group but may not. One of the important features of HMMs is that they are built from a multiple alignment of protein sequences, not a pairwise alignment. This is significant, since shared similarity between many proteins is much more likely to indicate shared functional relationship than sequence similarity between just two proteins. The usefulness of an HMM is directly related to the amount of care that is taken in choosing the seed members, building a good multiple alignment of the seed members, assessing the level of specificity of the model, and choosing the cutoff scores correctly.
In order to properly assess what functional relevance an above-trusted scoring HMM match has to a query, one must carefully determine what the functional scope of the HMM is. If the HMM models proteins that all share the same function then it is likely possible to assign a specific function to high-scoring match proteins based on the HMM. If the HMM models proteins that have a wide variety of functions, then it will not be possible to assign a specific function to the query based on the HMM match, however, depending on the nature of the HMM in question, it may be possible to assign a more general (family or subfamily level) function. In order to determine the functional scope of an HMM, one must carefully read the documentation associated with the HMM. The annotator must also consider whether the function attributed to the proteins in the HMM makes sense for the query based on what is known about the organism in which the query protein resides and in light of any other information that might be available about the query protein. After carefully considering all of these issues the annotator makes an annotation.
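The trusted/noise cutoff logic in the reference above can be written as a three-way classification. This is a minimal sketch: the function name, the labels, the treatment of a score exactly equal to a cutoff, and the numbers in the usage lines are all assumptions made for illustration.

```python
def classify_hmm_hit(score, trusted_cutoff, noise_cutoff):
    """Classify a protein's HMM score against the model's two cutoffs.

    At or above the trusted cutoff: can be assumed part of the seed group.
    Below the noise cutoff: can be assumed NOT part of the group.
    In between: may or may not belong, so manual review is needed.
    (How to treat a score exactly equal to a cutoff is a choice made here.)
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff should not exceed trusted cutoff")
    if score >= trusted_cutoff:
        return "member"
    if score < noise_cutoff:
        return "non-member"
    return "uncertain"


# Hypothetical scores and cutoffs, for illustration only.
for score in (250.0, 150.0, 50.0):
    print(score, classify_hmm_hit(score, trusted_cutoff=200.0, noise_cutoff=100.0))
```

Even a "member" result only says the protein belongs to the group the HMM models; as the reference stresses, the annotator still has to judge how specific a function that group supports.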


Page 47: Chibucos annot go_final

47

GO evidence codes
www.geneontology.org/GO.evidence.shtml

EXP - inferred from experiment
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IPI - inferred from physical interaction
IMP - inferred from mutant phenotype

ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model

IGC - inferred from genomic context
ND - no biological data available
IC - inferred by curator
TAS - traceable author statement
NAS - non-traceable author statement
IEA - inferred from electronic annotation

GO codes are a subset of yet another ontology!
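Since evidence codes are themselves a controlled vocabulary, they are easy to handle in code. The sketch below stores the codes from this slide and filters annotations by group. The three group names simply mirror how the codes are laid out on the slide; they are labels chosen for this example, not official GO category names.

```python
# Code -> meaning, as listed on the slide.
EVIDENCE_CODES = {
    "EXP": "inferred from experiment",
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IGI": "inferred from genetic interaction",
    "IPI": "inferred from physical interaction",
    "IMP": "inferred from mutant phenotype",
    "ISS": "inferred from sequence or structural similarity",
    "ISA": "inferred from sequence alignment",
    "ISO": "inferred from sequence orthology",
    "ISM": "inferred from sequence model",
    "IGC": "inferred from genomic context",
    "ND": "no biological data available",
    "IC": "inferred by curator",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "IEA": "inferred from electronic annotation",
}

# Group labels are assumptions that follow the slide's three blocks.
GROUPS = {
    "experimental": ("EXP", "IDA", "IEP", "IGI", "IPI", "IMP"),
    "sequence similarity": ("ISS", "ISA", "ISO", "ISM"),
    "other": ("IGC", "ND", "IC", "TAS", "NAS", "IEA"),
}


def group_of(code):
    """Return the group label for an evidence code; reject unknown codes."""
    for group, codes in GROUPS.items():
        if code in codes:
            return group
    raise KeyError(f"unknown evidence code: {code}")


def with_group(annotations, group):
    """Keep only (go_id, code) pairs whose code belongs to `group`."""
    return [(go_id, code) for go_id, code in annotations if group_of(code) == group]


# (GO id, evidence code) pairs taken from the annotation tables in this talk.
example = [("GO:0003909", "IDA"), ("GO:0004807", "ISS"), ("GO:0005737", "IC")]
print(with_group(example, "experimental"))
```

Filtering like this is one way to restrict an analysis to annotations backed by direct experimental work.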

Page 48: Chibucos annot go_final

48

Types of sequence similarity-based annotations

Find similarity between gene product & one that is experimentally characterized
◦ BLAST-type alignments
◦ Shared synteny to establish orthology of genomic regions between species

Find similarity between gene product and defined protein family
◦ HMMs (Pfam, TIGRFAMS)
◦ Prosite
◦ InterPro

Find motifs in gene product with prediction tools
◦ TMHMM
◦ SignalP

Much (most?) of the information you find is based on transitive annotation, and much of it has never been looked at by a human being!

Page 49: Chibucos annot go_final

49

Evaluation of sequence similarity-based information

Visually inspect alignments & criteria
◦ Length & identity
◦ Conservation of catalytic sites
◦ Check HMM scores with respect to cutoff

Look at available metabolic analysis
◦ Pathways, complexes?

Information from neighboring genes
◦ Gene in an operon (common in prokaryotes) can supplement weak similarity evidence

Sequence characteristics
◦ Transmembrane regions?
◦ Signal peptide?
◦ Known motifs that give a clue to function?
◦ Paralogous family member

Page 50: Chibucos annot go_final

50

An example: HI0678, a protein from H. influenzae…

...high quality alignment to experimentally characterized triosephosphate isomerase from Vibrio marinus

Page 51: Chibucos annot go_final

51

further down the page

Information from Swiss-Prot database on experimentally characterized match protein

Page 52: Chibucos annot go_final

52

…. full-length match, high percent identity (67.8%), conserved active and binding sites (boxed in red).

High quality…..

Page 53: Chibucos annot go_final

53

Resulting annotations

GO id | term name | aspect | ev code | reference | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC | PMID:15347579 | TIGR_GenProp:GenProp0120
GO:0005737 | cytoplasm | cellular component | IC | GO_REF:0000012 | GO:0004807

name: triosephosphate isomerase
gene symbol: tpiA
EC: 5.3.1.1

(This, and the following annotations, came from the match protein.)

Page 54: Chibucos annot go_final

54

KEGG pathway for glycolysis core

Page 55: Chibucos annot go_final

55

KEGG pathway for glycolysis core

Page 56: Chibucos annot go_final

56

Resulting annotations

GO id | term name | aspect | ev code | reference | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC | GO_REF:0000012 | KEGG_PATHWAY:hin00010
GO:0005737 | cytoplasm | cellular component | IC | GO_REF:0000012 | GO:0004807

name: triosephosphate isomerase
gene symbol: tpiA
EC: 5.3.1.1

Page 57: Chibucos annot go_final

57

And another annotation

GO id | term name | aspect | ev code | reference | with
GO:0004807 | triose-phosphate isomerase activity | molecular function | ISS | GO_REF:0000012 | Swiss-Prot:P50921
GO:0006096 | glycolysis | biological process | IGC | GO_REF:0000012 | KEGG_PATHWAY:hin00010
GO:0005737 | cytoplasm | cellular component | IC | GO_REF:0000012 | GO:0004807

The biologist knows that glycolysis takes place in the cytoplasm in bacteria, and so infers a cytoplasmic location for that protein (“inferred by curator” evidence code).

Page 58: Chibucos annot go_final

58

Annotation specificity should reflect knowledge

Available evidence for three genes

#1
◦ good match to an HMM for “kinase”

#2
◦ good match to an HMM for “kinase”
◦ a high-quality BER match to an experimentally characterized “glucokinase” AND a “fructokinase”

#3
◦ good match to an HMM specific for “ribokinase”
◦ a high-quality BER match to an experimentally characterized ribokinase

GO trees (very abbreviated; “>” runs from parent to child)

Function
◦ catalytic activity > kinase activity > carbohydrate kinase activity > ribokinase activity; glucokinase activity; fructokinase activity

Process
◦ metabolism > carbohydrate metabolism > monosaccharide metabolism > hexose metabolism > glucose metabolism; fructose metabolism
◦ monosaccharide metabolism > pentose metabolism > ribose metabolism

Figure labels: #1, #2 and #3 mark each gene’s position on the function tree and on the process tree
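One way to read the three cases above: when several lines of evidence point at different specific terms, annotate to the most specific term they all share. The sketch below applies that idea to the abbreviated function tree. The exact positions of #1, #2 and #3 on the slide are not recoverable from this transcript, so the results are a plausible reading rather than the author's stated answer, and the single-parent `PARENT` table is a simplification of the real DAG.

```python
# child -> parent, from the abbreviated molecular function tree.
PARENT = {
    "ribokinase activity": "carbohydrate kinase activity",
    "glucokinase activity": "carbohydrate kinase activity",
    "fructokinase activity": "carbohydrate kinase activity",
    "carbohydrate kinase activity": "kinase activity",
    "kinase activity": "catalytic activity",
    "catalytic activity": None,
}


def path_to_root(term):
    """List the term followed by each ancestor, most specific first."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path


def most_specific_shared(terms):
    """Return the deepest term that lies on every term's path to the root."""
    paths = [path_to_root(t) for t in terms]
    shared = set(paths[0]).intersection(*paths[1:])
    # paths[0] is ordered specific -> general, so the first hit is the deepest.
    return next(t for t in paths[0] if t in shared)


# Gene #1: only the generic kinase HMM.
print(most_specific_shared(["kinase activity"]))
# Gene #2: characterized matches to both a glucokinase and a fructokinase.
print(most_specific_shared(["glucokinase activity", "fructokinase activity"]))
# Gene #3: ribokinase-specific HMM plus a characterized ribokinase match.
print(most_specific_shared(["ribokinase activity"]))
```

The same walk could be run on the process tree, where glucose and fructose metabolism meet at hexose metabolism.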

Page 59: Chibucos annot go_final

Using shared annotations
Search for GO terms at databases
Slims for broad classification
GO tools
Working with GO-limited data sets
Summary

Using annotation to facilitate your research

59

Page 60: Chibucos annot go_final

60

Sharing annotations

Annotation file sent to GO, put in repository
◦ All these data free to anyone
◦ Hundreds of thousands of GP annotations

Annotation files all in same format
◦ Facilitates easy use of data by everyone

Most of your favorite organism databases use these annotation files

Page 61: Chibucos annot go_final

61

Searching for GO terms at EuPathDB

Page 62: Chibucos annot go_final

62

Ontology slim
www.geneontology.org/GO.slims.shtml

Slim is a distilled (reduced) ontology
◦ Made by manually pruning low-level terms with an ontology editor
◦ Selected high-level terms remain

Slims reduce ontology complexity
◦ Reduce clutter & see general trends: microarray experiments, comparative whole genome analyses
◦ Remove irrelevant terms: looking at specific taxa, such as yeast or plant

GO offers a script to bin more granular annotations up to higher levels
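The binning step can be illustrated with a toy example. This is not the GO script itself: it is a sketch under stated assumptions, using the abbreviated process tree from the annotation-specificity slide, a two-term slim chosen for the example, and made-up gene names.

```python
from collections import defaultdict

# child -> parents, abbreviated biological process tree.
PARENTS = {
    "glucose metabolism": ["hexose metabolism"],
    "fructose metabolism": ["hexose metabolism"],
    "ribose metabolism": ["pentose metabolism"],
    "hexose metabolism": ["monosaccharide metabolism"],
    "pentose metabolism": ["monosaccharide metabolism"],
    "monosaccharide metabolism": ["carbohydrate metabolism"],
    "carbohydrate metabolism": ["metabolism"],
    "metabolism": [],
}

# Hypothetical slim: the high-level terms kept after pruning.
SLIM = {"carbohydrate metabolism", "metabolism"}


def slim_terms(term):
    """Return the nearest slim term(s) at or above `term`."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in SLIM:
            found.add(current)  # stop climbing along this path
        else:
            stack.extend(PARENTS.get(current, []))
    return found


def bin_annotations(annotations):
    """Group genes under the slim term that covers each granular annotation."""
    bins = defaultdict(set)
    for gene, term in annotations:
        for slim in slim_terms(term):
            bins[slim].add(gene)
    return bins


# Made-up gene names, for illustration only.
annotations = [
    ("gene1", "glucose metabolism"),
    ("gene2", "ribose metabolism"),
    ("gene3", "metabolism"),
]
bins = bin_annotations(annotations)
for slim, genes in sorted(bins.items()):
    print(slim, sorted(genes))
```

Stopping at the nearest slim term is a design choice made here; counting a gene under every slim ancestor is an equally reasonable alternative, and it changes the totals.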

Page 63: Chibucos annot go_final

63

Comparing genomes with a GO slim

MJ Gardner, et al. (2002) Nature 419:498-511

High-level biological process terms used to compare Plasmodium and Saccharomyces

Page 64: Chibucos annot go_final

64

GO slim: manual/orthology-based gene annotations

Nucleic Acids Res. 2010 January; 38(Database issue): D420–D427.

Page 65: Chibucos annot go_final

65

GO tools
www.geneontology.org/GO.tools.shtml

The real challenge is finding the right one for your needs

For example, statistical representation of GO terms:

http://go.princeton.edu/cgi-bin/GOTermFinder
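Tools that test for statistical over-representation of GO terms commonly build on a hypergeometric test. The sketch below shows that calculation only; it makes no claim about the exact statistics or multiple-testing corrections a particular tool applies, and the counts in the usage lines are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n.

    N: genes in the background set
    K: background genes annotated to the GO term
    n: genes in the study list
    k: study-list genes annotated to the GO term
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 10 background genes, 5 carry the term,
# and all 5 genes in the study list carry it.
print(hypergeom_pvalue(k=5, n=5, K=5, N=10))
```

In practice each GO term is tested separately, so some correction for multiple testing is needed before a small p-value is taken at face value.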

Page 66: Chibucos annot go_final

66

GO & analysis of RNA-seq data

We present GOseq, an application for performing Gene Ontology (GO) analysis on RNA-seq data. GO analysis is widely used to reduce complexity and highlight biological processes in genome-wide expression studies, but standard methods give biased results on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. Application of GOseq to a prostate cancer data set shows that GOseq dramatically changes the results, highlighting categories more consistent with the known biology.

Young et al. Genome Biology 2010, 11:R14 http://genomebiology.com/2010/11/2/R14

Page 67: Chibucos annot go_final

67

When GO is limited

Food for thought: what happens when we have limited GO (or other) annotation data?

New and interesting genomes often see this problem

Page 68: Chibucos annot go_final

68

Comparative analysis of orthologs in syntenic blocks

The more genomes we have at our disposal, the better

Structural rearrangements, absence of intron, gene duplication, intron structure, gene deletion/creation

Nucleic Acids Res. 2010 January; 38(Database issue): D420–D427.

Page 69: Chibucos annot go_final

69

Summary GO analyses

GO remedies problems of synonyms & homonyms in biological nomenclature
◦ Queries based on IDs linked to precise definitions, not less reliable text-matching

GO can help you to:
◦ Find all genes that share a particular function regardless of sequence
◦ Do comparisons across any species annotated with GO
◦ Summarize major classes of genes in a newly sequenced genome
◦ Characterize expressed genes in a study
◦ Drive hypotheses to test in the laboratory

GO is not a panacea but it should be a valuable tool in your genomics toolbox

Page 70: Chibucos annot go_final

The title slide revisited…

Arabidopsis thaliana ATPase HMA4 zinc binding domain

GO:0006829 : zinc ion transport (BP)
GO:0005886 : plasma membrane (CC)
GO:0005515 : protein binding (MF)

Annotation

Ontology

Evidence

THANK YOU.