Ontologies for life sciences: examples from the Gene Ontology Melanie Courtot GO/GOA project lead [email protected] @mcourtot


Ontologies for life sciences

Cross domain resources

Data resources at EMBL-EBI Genes, genomes & variation

RNA Central

ArrayExpress

Expression Atlas

Metabolights

PRIDE

InterPro Pfam UniProt

ChEMBL SureChEMBL ChEBI

Molecular structures

Protein Data Bank in Europe

Electron Microscopy Data Bank

European Nucleotide Archive

European Variation Archive

European Genome-phenome Archive

Gene, protein & metabolite expression

Protein sequences, families & motifs

Chemical biology

Reactions, interactions & pathways

IntAct Reactome MetaboLights

Systems

BioModels Enzyme Portal BioSamples

Ensembl

Ensembl Genomes

GWAS Catalog

Metagenomics portal

Europe PubMed Central

BioStudies

Gene Ontology

Experimental Factor

Ontology

Literature & ontologies

Different words same concept: example of Dyschromatopsia

Search PubMed for “color blindness”

Search PubMed for “Dyschromatopsia”

Search PubMed for "abnormality of the eye"

Thousands of sample attributes…

[Figure: many individual experiments, each labelled genomics, transcriptomics, proteomics or metabolomics]

Data integration in times of ‘omics’

Individual experiments (genomics, transcriptomics, proteomics, metabolomics) are conducted at different times, by different researchers, using different equipment and approaches, and report the same type of results differently

Data growth is fast

[Figure: bytes stored (log scale, 1E+08 to 1E+16) against date (2002 to 2016) for EGA, ENA, PRIDE, MetaboLights and ArrayExpress, annotated with doubling times of 3, 4, 12 and 18 months]

Slide credit: Paul Flicek


Vast amount of data generated means vast amount of data submitted to repositories

Curation - Dirty data and the long tail

[Figure: long-tail distribution of sample attribute frequencies; labelled examples include sex:female, gender:female and disease:breast cancer, with frequency=2285 and frequency=1288]
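The "sex:female" versus "gender:female" problem can be sketched in a few lines. This is only an illustration: the synonym table and the counts below are made up for the example, and a real curation pipeline would map to ontology terms rather than to a hand-written dictionary.

```python
from collections import Counter

# Hypothetical synonym table: maps free-text attribute keys to one preferred key.
KEY_SYNONYMS = {"gender": "sex"}


def normalise(attribute):
    """Return a lower-cased 'key:value' string that uses the preferred key."""
    key, _, value = attribute.partition(":")
    key = key.strip().lower()
    value = value.strip().lower()
    return f"{KEY_SYNONYMS.get(key, key)}:{value}"


def merge_counts(raw_counts):
    """Add up the counts of attributes that normalise to the same string."""
    merged = Counter()
    for attribute, count in raw_counts.items():
        merged[normalise(attribute)] += count
    return merged


# Illustrative counts only (not the frequencies shown on the slide).
raw = {"sex:female": 3, "gender:female": 2, "Sex: Female": 1}
merged = merge_counts(raw)
```

After merging, the three spellings collapse into a single "sex:female" entry, which is the effect curation aims for before data can be integrated.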

data integration [ˈdeɪtə ˌɪntəˈgreɪʃən]: (computational) means to access, retrieve and analyse data sets from different sources in order to exploit them, i.e., gain new knowledge, and share that new knowledge


Standards

What do they offer?

•  uniformity and consistency in reporting data

•  effective reuse, integration and mining of data

•  creation of SOPs, benchmarks, quality assessment

•  community cohesion

What constitutes a standard?

1.  Establish your community

2.  Define community needs

3.  Define minimal information which needs to be gathered and exchanged by that community

4.  Design* an interchange format

5.  Design* domain-specific controlled vocabularies

*Design = review, reuse and fill the gaps

https://xkcd.com/927/

http://www.biosharing.org

•  Many “Minimum information about a …” papers have now been published.

Standards – XML interchange formats

http://www.sbml.org

Adding semantics to the data formats

•  Same name for different concepts

•  Different names for the same concept

Inconsistency in naming of biological concepts


An example …

Tactition, tactile sense and taction are different names for one concept:

perception of touch ; GO:0050975
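A minimal sketch of how synonyms resolve to one identifier. The names and the GO ID come from the slide; the dictionary layout and the `resolve` helper are hypothetical and only illustrate the idea.

```python
# One term record, with the name and synonyms shown on the slide.
TERMS = {
    "GO:0050975": {
        "name": "perception of touch",
        "synonyms": ["tactition", "tactile sense", "taction"],
    },
}


def build_index(terms):
    """Map every lower-cased name and synonym to its term identifier."""
    index = {}
    for term_id, record in terms.items():
        for label in [record["name"], *record["synonyms"]]:
            index[label.lower()] = term_id
    return index


INDEX = build_index(TERMS)


def resolve(text):
    """Return the term identifier for a label, or None if it is unknown."""
    return INDEX.get(text.strip().lower())
```

Whichever word an author used, `resolve` returns the same identifier, so a search on the identifier finds all the records.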

Sample description with semantic markup

CL:CL_0000071 (blood vessel endothelial cell)

obo:CHEBI_39867 (valproic acid)

NCBITaxon:NCBITaxon_9606 (Homo sapiens)

Curation

Ontologies

•  Representation of important things in a specific domain

•  Describes types of entities (e.g. cells) and relations between them

•  An active, formal computational artifact

•  A mathematical model based on a subset of first order logic

•  Tools can automatically process ontologies

•  A communication tool

•  Provides a dictionary for collaborators, a shared understanding

•  Allows data sharing

Reasoning is critical

•  Prokaryotic cell and Eukaryotic cell are declared disjoint

•  Fungal cell is a Eukaryotic cell

•  Spore is a Fungal cell and a Prokaryotic cell

⇒ Unsatisfiability

⇒ Solution: clarify spore (sensu Mycetozoa) AND actinomycete-type spore

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
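The spore example can be reproduced with a toy check. This is a sketch under strong assumptions: it only follows subclass links and a list of disjoint pairs, whereas an OWL reasoner handles far more. The class names are taken from the slide.

```python
# Subclass axioms and the disjointness declaration from the slide.
SUBCLASS_OF = {
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def superclasses(cls, subclass_of):
    """Collect all direct and indirect superclasses of a class."""
    seen = set()
    stack = [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable(subclass_of, disjoint):
    """List classes that fall under both members of a disjoint pair."""
    classes = set(subclass_of) | {p for ps in subclass_of.values() for p in ps}
    bad = []
    for cls in sorted(classes):
        lineage = superclasses(cls, subclass_of) | {cls}
        if any(a in lineage and b in lineage for a, b in disjoint):
            bad.append(cls)
    return bad


# The fix proposed on the slide: two separate spore classes.
FIXED = {
    "fungal cell": {"eukaryotic cell"},
    "spore (sensu Mycetozoa)": {"eukaryotic cell"},
    "actinomycete-type spore": {"prokaryotic cell"},
}
```

With the original axioms the check flags "spore"; with the two clarified spore classes it flags nothing.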

Different words same concept: example of Dyschromatopsia

We searched earlier for:

-  Dyschromatopsia

-  Colorblindness

-  Abnormality of the eye

The ontology of color blindness

HP:0011518 (Dichromacy)

HP:0011518 (Eye)

HP:0000551 (Abnormality of color vision)

HP:0007641 (Dyschromatopsia)

Is-a

Is-a Disease-location


“Colorblindness”

“A form of colorblindness in which only two of the three fundamental colors can be distinguished due to a lack of one of the retinal cone pigments.”

synonym

definition

Building ontologies

•  Put things into categories

•  Helps organise the data

•  Allows us to generalise over data

•  Capture the relations between things

•  Anatomical parts

Biopolymer

•  Nucleic Acid: DNA, RNA (tRNA, mRNA, smRNA)

•  Polypeptide: Enzyme

Ontologies add value

Smarter searching

Data visualisation

Data analysis

Data integration

CMPO term: graped micronucleus CMPO_0000156


Integrate file formats Integrate metadata

Apply phenotype ontology

Predict disease gene/biomarkers

Human Disease

Cell Gene knockdown


Genotype Phenotype

Sequence Proteins

Gene products Transcript

Pathways

Cell type

BRENDA tissue / enzyme source

Development

Anatomy

Phenotype

Plasmodium life cycle

- Sequence types and features - Genetic Context

- Molecule role - Molecular Function - Biological process - Cellular component

- Protein covalent bond - Protein domain - UniProt taxonomy

- Pathway ontology - Event (INOH pathway ontology) - Systems Biology - Protein-protein interaction

- Arabidopsis development - Cereal plant development - Plant growth and developmental stage - C. elegans development - Drosophila development - Human developmental anatomy, abstract version - Human developmental anatomy, timed version

- Mosquito gross anatomy - Mouse adult gross anatomy - Mouse gross anatomy and development - C. elegans gross anatomy - Arabidopsis gross anatomy - Cereal plant gross anatomy - Drosophila gross anatomy - Dictyostelium discoideum anatomy - Fungal gross anatomy - Plant structure - Maize gross anatomy - Medaka fish anatomy and development - Zebrafish anatomy and development

- NCI Thesaurus - Mouse pathology - Human disease - Cereal plant trait - PATO (attribute and value) - Mammalian phenotype - Human phenotype - Habronattus courtship - Loggerhead nesting - Animal natural history and life history

eVOC (Expressed Sequence Annotation for Humans)

Ontologies for life sciences

Open Biological and Biomedical Ontologies (OBO)

A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development designed to ensure …

•  tight connection to the biomedical basic sciences

•  compatibility

•  interoperability, common relations

•  formal robustness

•  support for logic-based reasoning

http://www.obofoundry.org

OBO Foundry

Building metadata (& ontology) rich resources

•  We build tools for semantic enrichment and alignment

•  Interoperability toolkit

•  Microservices based architecture

•  Technology-agnostic

•  Pushing boundaries of ontology “embedding”

Raw Data to Explicit Knowledge

Data Exploration and Cleanup

Data structuring

Ontology Annotation

Data cleaning and mapping

Ontology building

Webulous

OxO mapping service

Searching for ontology terms: the EBI Ontology Lookup Service

•  for searching and visualizing >140 ontologies from the biomedical domain

•  includes (among others):

•  Gene Ontology

•  OBO Relations ontology

•  Evidence ontology

•  Pathogen Transmission Ontology

•  Symptom Ontology

•  Basic Formal Ontology

Ontology Lookup Service

•  Ontology search engine

•  Ontology visualisation

•  Powerful RESTful API

•  Open source project

•  Generic infrastructure (can load any ontology represented in OWL)

https://github.com/EBISPOT/OLS

Repository of over 150 biomedical ontologies (4.5 million terms, 11 million relations)

http://www.ebi.ac.uk/ols
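A hedged sketch of calling the RESTful API from a script. The `/api/search` path, the `q`, `rows` and `ontology` parameters, and the `response.docs` / `obo_id` / `label` fields are assumptions about the service and should be checked against the current OLS documentation before use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search path under the OLS base URL given on the slide.
OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"


def search_url(query, ontology=None, rows=5):
    """Build the search URL for a free-text query."""
    params = {"q": query, "rows": rows}
    if ontology:
        params["ontology"] = ontology
    return f"{OLS_SEARCH}?{urlencode(params)}"


def extract_hits(payload):
    """Pull (identifier, label) pairs out of a decoded JSON response."""
    docs = payload.get("response", {}).get("docs", [])
    return [(doc.get("obo_id"), doc.get("label")) for doc in docs]


def search(query, ontology=None):
    """Run the query over the network and return the hits."""
    with urlopen(search_url(query, ontology)) as response:
        return extract_hits(json.load(response))


# Hand-written payload in the assumed response shape, for offline use.
sample = {"response": {"docs": [{"obo_id": "GO:0005739", "label": "mitochondrion"}]}}
```

Keeping URL construction and response parsing separate from the network call makes the parsing easy to check offline, for instance with `extract_hits(sample)`.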

Choosing the right term

•  Sample attributes and variables are mapped to EFO ontology

Sample attribute

Mapping data to ontology terms

• Zooma automatically annotates sample attributes and variables with ontology classes


Information supplied as part of a search

The source of this mapping

ZOOMA contains a linked data repository of annotation knowledge and highly annotated data

Expression Atlas: source of mappings

•  Atlas automated pipeline runs against Zooma, then curators:

•  Check that the automatic mappings are all correct

•  Create a list of new mappings that should be added to Zooma

•  Webulous Google Add-On

•  Connect to the Webulous server from Google Spreadsheets

•  Load templates from the Webulous server

•  Submit populated templates back to the server for processing

Expression Atlas: curation

What happens when we need a term that is not in EFO?

Adding diseases to EFO using Webulous

•  Design pattern templates can be loaded into Google Sheets

•  A Webulous template specifies a series of fields (columns) for the input data

Some fields only allow values from a list of ontology terms

This data validation provides the user with convenient term autocomplete when entering data into a cell


BioSolr

“BioSolr aims to significantly advance the state of the art with regards to indexing and querying biomedical data with freely available open source software”

flaxsearch/BioSolr

Solr documents with ontology annotation

Enriched Solr with ontology content (synonyms, structure, relations)

Solr/Elastic plugin Query expansion and hierarchical faceting

Which other diseases are associated with PDE4D?

View diseases grouped in therapeutic areas or organised in a tree

View more information about PDE4D

Filter by therapeutic area

http://www.ebi.ac.uk/rdf

Publishing biological data as Linked Open Data

•  The EBI RDF platform

•  Released Nov 2013

•  Currently over 16 billion RDF triples

•  Datasets updated ~ quarterly

LOD diagram August 2014

Jupp et al (2013). The EBI RDF Platform: Linked Open Data for the Life Sciences. Bioinformatics.
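To make "querying Linked Open Data" concrete, here is a sketch that builds a SPARQL query and the GET request URL for it. The endpoint path is an assumption, not taken from the slides; the `query` parameter follows the standard SPARQL protocol, and `rdfs:label` is the usual label predicate.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint location under the platform URL; check before use.
ENDPOINT = "https://www.ebi.ac.uk/rdf/services/sparql"


def label_query(iri):
    """Return a SPARQL query asking for the rdfs:label of one resource."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE { <" + iri + "> rdfs:label ?label } LIMIT 10"
    )


def request_url(query):
    """Encode the query as the 'query' parameter of a GET request."""
    return ENDPOINT + "?" + urlencode({"query": query})


# IRI of the GO term mitochondrion (GO:0005739) in its OBO PURL form.
query = label_query("http://purl.obolibrary.org/obo/GO_0005739")
url = request_url(query)
```

The resulting URL can be fetched with any HTTP client; requesting a JSON results format is usually done through an `Accept` header.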

RDF Platform Integration points

[Figure: RDF Platform integration points. Datasets shown: Gene Expression Atlas, Ensembl, UniProt, Gene Ontology (GO BP, GO MF, GO CC), Organisms (organism/taxon), ChEMBL (assay, target), BioModels (SBMLModel, Reaction, Species, Compartment), ChEBI, Reactome (Pathway), SBO and PDB (Structure). Genes, RNA transcripts and proteins are identified via identifiers.org/ensembl. Link predicates shown: rdfs:seeAlso (not currently linking to identifiers.org but soon), bq:is, bq:isVersionOf, bq:occursIn, bq:hasPart, bq:isHomologTo, uniprot:classifiedWith, uniprot:organism, uniprot:transcribedFrom, uniprot:translatedTo, chembl:hasTarget and sio:'is attribute of' (sio:SIO_000011). The discretized differential gene expression ratio is sio:SIO_001078.]

Relationships within BioModels can be found at https://github.com/sarala/ricordo-rdfconverter/wiki/SBML-RDF-Schema

RDF Platform – lessons learned

Successes

•  Novel queries possible over EBI datasets

•  Production quality RDF releases

•  Community of users

•  Highly available public SPARQL endpoints

•  500+ users (10-50 million hits per month)

•  Lots of interest

•  Catalyst for new RDF efforts

Lessons

•  Public SPARQL endpoints problematic

•  Query federation not performant

•  Inference support limited

•  Not scalable for all EBI data e.g. Variation, ENA

•  Lack of expertise in service teams

•  Too much overhead to get started quickly in this space

An example: The Gene Ontology and Gene Ontology Annotation

Model Organism Databases

•  A way to capture biological knowledge for individual gene products in a written and computable form

The Gene Ontology

•  A set of concepts and their relationships to each other arranged as a hierarchy

www.ebi.ac.uk/QuickGO

Less specific concepts

More specific concepts

The Gene Ontology

http://geneontology.org/

•  Collaborative effort to address the need for consistent descriptions of genes/gene products across databases

•  Use of GO terms by collaborating databases facilitates uniform queries across all of them

Aims of the GO project

•  compile the ontologies

•  >40000 terms

•  constantly increasing and improving

•  annotate gene products using the terms

•  provide public resource of data and tools

•  regular releases of annotations

•  tools for browsing/querying annotations and editing the GO

The GO editorial office at EMBL-EBI

•  Part of the Sample, Phenotypes and Ontology team (SPOT)

•  Contributes to development of the Gene Ontology

•  Specific areas of interest: autophagy, synapse…

•  Answers user requests

•  New terms, modifications, updates

•  Help support

•  Curator requests

GO editorial office at the EBI:

Paola Roncaglia

David Osumi-Sutherland

Develop the ontology

•  An OWL ontology of >41,000 classes

•  biological process, cellular component, molecular function

•  > 14,000 imported classes (CL, Uberon, ChEBI, NCBI_tax)

•  >136,000 logical axioms, including:

•  ~72,000 subClassOf axioms between named GO classes

•  ~41,000 simple existential restrictions (subClassOf R some C)

•  EL expressivity => fast, scalable reasoning (with ELK)

https://www.cs.ox.ac.uk/isg/tools/ELK/

Ontology structure

•  Hierarchical

Terms can have more than one parent

•  Terms are linked by relationships

is_a, part_of, regulates (and positively/negatively regulates), occurs_in, has_part

www.ebi.ac.uk/QuickGO

These relationships allow for complex analysis of large datasets

Terms can have more than one child
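Because a term can have several parents, the ontology is a graph rather than a tree, and finding "all less specific concepts" means walking every path upwards. A minimal sketch: the four direct parents of nucleosome are the ones given in the QuickGO exercise in this deck, while the extra chromatin edge and the data layout are illustrative assumptions.

```python
# term -> list of (relationship, parent) edges; a toy fragment, not the real GO.
PARENTS = {
    "nucleosome": [
        ("part_of", "chromatin"),
        ("is_a", "chromosomal part"),
        ("is_a", "DNA packaging complex"),
        ("is_a", "protein-DNA complex"),
    ],
    "chromatin": [("part_of", "chromosome")],  # assumed edge for illustration
}


def ancestors(term, parents, relations=("is_a", "part_of")):
    """Return every term reachable upwards over the chosen relationships."""
    found = set()
    stack = [term]
    while stack:
        for relation, parent in parents.get(stack.pop(), []):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

Restricting `relations` to `("is_a",)` gives the strict classification hierarchy, while including `part_of` also pulls in the structures a term belongs to.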

Biological Process: what does a gene product do?

A commonly recognised series of events

•  cell division

•  transcription

Molecular Function: how does a gene product act?

•  insulin binding

•  insulin receptor activity

•  glucose-6-phosphate isomerase activity

Cellular Component: where is a gene product located?

•  plasma membrane

•  mitochondrion

•  mitochondrial membrane

•  mitochondrial matrix

•  mitochondrial lumen

•  ribosome

•  large ribosomal subunit

•  small ribosomal subunit

Example GO annotation – cytochrome c

Molecular function: electron carrier activity (GO:0009055)

Biological process: oxidation-reduction process (GO:0055114)

Cellular component: mitochondrion (GO:0005739)

https://www.ebi.ac.uk/QuickGO/GProtein?ac=P99999

Anatomy of a GO term

Unique identifier Term name

Definition

Synonyms

Cross-references

Hands-on: finding GO term information

https://www.ebi.ac.uk/QuickGO/

What is the GO ID for the term mitochondrial chromosome?

GO:0000262

What are the four direct parents of the term nucleosome?

•  chromatin

•  chromosomal part

•  DNA packaging complex

•  protein-DNA complex

What types of relationships are there between the term nucleosome and its direct parents?

part_of for chromatin, is_a for the others

Building the GO

•  The GO editorial team

•  Submission via GitHub, https://github.com/geneontology/

•  Submissions via TermGenie, http://go.termgenie.org

•  ~80% terms are now created this way

Annotate gene products

GOA

Database

external annotation groups (25)

manual annotation by curators (125)

electronic prediction methods (11)

Making annotations available

GOA

Database

GOA & GOC ftp sites

QuickGO

Manual annotations

•  Time-consuming process producing lower numbers of annotations (~2,800 taxa covered)

•  More specific GO terms

•  Manual annotation is essential for creating predictions

•  Part of the Protein Function content team

•  Largest open-source contributor of annotations to GO

•  Focuses on human, but provides annotations for more than 441,000 species

•  Has human curators, and collates manual and electronic annotations from across the community

UniProt-Gene Ontology Annotation (UniProt-GOA) project at the EMBL-EBI http://www.ebi.ac.uk/GOA

Aleksandra Shypitsyna

Elena Speretta

Penelope Garmiri

Tony Sawford

UniProt-GOA project at the EBI:

A GO annotation is …

…a statement that a gene product:

1. has a particular molecular function, or is involved in a particular biological process, or is located within a certain cellular component

2. as described in a particular reference

3. as determined by a particular method

Accession   Name   GO ID        GO term name                      Reference      Evidence code
P00505      GOT2   GO:0004069   aspartate transaminase activity   PMID:2731362   IDA
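The same annotation can be read from a tab-separated annotation file. The sketch below assumes the GO Annotation File (GAF) column order, in which column 2 is the accession, 3 the symbol, 5 the GO ID, 6 the reference, 7 the evidence code and 9 the aspect. The example line reuses the values from the slide; the taxon, date and "assigned by" fields are placeholders added only to complete the 17 columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    accession: str
    symbol: str
    go_id: str
    reference: str
    evidence: str
    aspect: str


def parse_gaf_line(line):
    """Parse one tab-separated GAF line into an Annotation."""
    cols = line.rstrip("\n").split("\t")
    return Annotation(
        accession=cols[1],
        symbol=cols[2],
        go_id=cols[4],
        reference=cols[5],
        evidence=cols[6],
        aspect=cols[8],
    )


# Values from the slide plus placeholder taxon/date/assigned-by fields.
line = "\t".join([
    "UniProtKB", "P00505", "GOT2", "", "GO:0004069", "PMID:2731362", "IDA",
    "", "F", "", "", "protein", "taxon:9606", "20160101", "UniProt", "", "",
])
annotation = parse_gaf_line(line)
```

The three parts of the definition map directly onto fields: `go_id` and `aspect` say what is asserted, `reference` says where it was described, and `evidence` says how it was determined.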

Experimental data

Computational analysis

Author statements/ curator inference

(+ Inferred from electronic annotations)

http://www.evidenceontology.org/

Tracking provenance

Evidence codes

http://geneontology.org/page/evidence-code-decision-tree

Hands-on: manual annotation example

PMID:18573874

FIG. 2. Human Nbp35 is a cytosolic protein. (A) EGFP fluorescence of a HeLa cell transiently transfected with a vector encoding a huNbp35-EGFP fusion protein (right) in comparison to the endogenous autofluorescence (AFL) of control cells (left).

(C) Sub-cellular localization of huNbp35 by cell fractionation. […]HuNbp35 exclusively colocalizes with tubulin in the cytosolic fraction, but not with mitochondrial aconitase (mtAconitase) present in the membrane fraction.

Human Nbp35 is a cytosolic protein.

•  Find the correct UniProt entry (http://www.uniprot.org): NUBP1

•  Find the right GO term (https://www.ebi.ac.uk/QuickGO/): GO:0005829

•  Evidence:

•  Fig 2A Immunofluorescence and/or

•  Fig 2C subcellular fractionation

GO evidence codes [small excerpt]

•  Experimental evidence, Methods & Results: IDA, Inferred from Direct Assay; IMP, Inferred from Mutant Phenotype; IPI, Inferred from Physical Interaction

•  Abstract & Introduction: TAS, Traceable author statement; NAS, Non-traceable author statement

Protein   GO term      Supporting evidence
NUBP1     GO:0005829   IDA

Electronic Annotations

•  Quick way of producing large numbers of annotations

•  Annotations use less-specific GO terms

•  Only source of annotation for ~438,000 non-model organism species

orthology taxon constraints

Broad taxonomic coverage

We provide annotation files for well-studied species…

…as well as less well-studied species that have:

•  Complete proteome

•  >25% GO annotation coverage

We have annotations for species that may not have a dedicated curation effort, e.g. for 1,400 Solanaceae species we have ~360,000 annotations for ~64,000 proteins

1. Mapping of external concepts to GO terms e.g. InterPro2GO, UniProt Keyword2GO, Enzyme Commission2GO

Electronic annotation methods

GO:0004715 ; non-membrane spanning protein tyrosine kinase activity

Annotations are high-quality and have an explanation of the method (GO_REF)
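Mapping files of this kind are plain text and simple to read. The sketch below assumes the external2go line layout `external_id [description] > GO:term name ; GO:id` with comment lines starting with `!`. The pairing of the EC number with GO:0004715 in the example line is illustrative and should be checked against the current EC2GO file.

```python
def parse_external2go(line):
    """Split one mapping line into (external id, GO id, GO term name)."""
    left, _, right = line.partition(" > ")
    external_id = left.split(" ", 1)[0]
    go_name, _, go_id = right.rpartition(" ; ")
    if go_name.startswith("GO:"):
        go_name = go_name[3:]
    return external_id, go_id.strip(), go_name.strip()


def load_mappings(lines):
    """Collect external id -> list of GO ids, skipping comments and blanks."""
    mappings = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue
        external_id, go_id, _ = parse_external2go(line)
        mappings.setdefault(external_id, []).append(go_id)
    return mappings


# Illustrative line in the assumed format.
example = "EC:2.7.10.2 > GO:non-membrane spanning protein tyrosine kinase activity ; GO:0004715"
parsed = parse_external2go(example)
mappings = load_mappings(["! comment line", example])
```

Any protein carrying the external identifier can then be given the mapped GO term as an electronic annotation.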

2. Automatic transfer of manual annotations to orthologs (Ensembl Compara)

e.g. Human to Macaque, Mouse, Dog, Cow, Guinea Pig, Chimpanzee, Rat, Chicken …and more

Arabidopsis to Rice, Brachypodium, Maize, Poplar, Grape …and more

Electronic annotation methods

http://www.geneontology.org/cgi-bin/references.cgi

An example

Annotations from the source…

ACCESSION   GO ID        GO ASPECT   GO TERM
P04637      GO:0047485   F           protein N-terminus binding
P04637      GO:0051087   F           chaperone binding
P04637      GO:0051721   F           protein phosphatase 2A binding
P04637      GO:0000733   P           DNA strand renaturation
P04637      GO:0006289   P           nucleotide-excision repair
P04637      GO:0006355   P           regulation of transcription, DNA-templated
P04637      GO:0006461   P           protein complex assembly

…are projected on to the target

ACCESSION   GO ID        GO ASPECT   GO TERM
Q549C9      GO:0047485   F           protein N-terminus binding
Q549C9      GO:0051087   F           chaperone binding
Q549C9      GO:0051721   F           protein phosphatase 2A binding
Q549C9      GO:0000733   P           DNA strand renaturation
Q549C9      GO:0006289   P           nucleotide-excision repair
Q549C9      GO:0006355   P           regulation of transcription, DNA-templated
Q549C9      GO:0006461   P           protein complex assembly
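The projection step can be sketched as a simple copy over an ortholog table. This is a simplification: the accessions and GO IDs come from the example above, but the ortholog table, the choice of the IEA evidence code for projected rows, and the absence of any filtering are assumptions rather than the actual pipeline rules.

```python
# A few of the source annotations from the example, as (GO id, aspect) pairs.
SOURCE_ANNOTATIONS = {
    "P04637": [("GO:0047485", "F"), ("GO:0051087", "F"), ("GO:0000733", "P")],
}

# Hypothetical ortholog table: source accession -> target accessions.
ORTHOLOGS = {"P04637": ["Q549C9"]}


def project(annotations, orthologs, evidence="IEA"):
    """Copy each source annotation to every ortholog, recording its origin."""
    projected = {}
    for source, terms in annotations.items():
        for target in orthologs.get(source, []):
            rows = projected.setdefault(target, [])
            for go_id, aspect in terms:
                rows.append((go_id, aspect, evidence, source))
    return projected


projected = project(SOURCE_ANNOTATIONS, ORTHOLOGS)
```

Keeping the source accession on each projected row preserves provenance, so a user can trace the electronic annotation back to the manual one it came from.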

InterPro  

Source of ~93 million GO mappings for ~30 million distinct UniProtKB sequences (Oct 30 2015 release)

3. Propagation of GO annotations to protein groups

GO mapping to domains:

Function of domain may not be function of protein

Family members can be experimentally characterised as lacking function:

P14210 - a serine protease homologue with no proteolytic activity

(proteins are reported to GOA to be blacklisted)

Broad families that are functionally diverse: the GHMP kinase superfamily includes

- Galactokinases (EC 2.7.1.6)

- Homoserine kinases (EC 2.7.1.39)

- Mevalonate kinases (EC 2.7.1.36)

- Diphosphomevalonate decarboxylases (EC 4.1.1.33)

Considerations for mapping GO terms

Number of annotations in UniProt-GOA database (June 2016)

2,811,622 Manual annotations*

280,313,749 Electronic annotations

* Includes manual annotations integrated from external model organism and specialist groups

Many ways to access GO annotation data

http://www.ebi.ac.uk/QuickGO

Map-up annotations with GO slims

Search GO terms or proteins

Find sets of GO annotations

Questions on how to use QuickGO? Contact [email protected]

One example: the QuickGO browser

http://www.ebi.ac.uk/QuickGO-Beta/

GO term enrichment analysis

•  What is it?

•  What can you use it for?

•  How does it actually work?

•  How can I actually do it?

•  When is it NOT a good idea to do it?

Enrichment analysis: basic principle

[Figure: two categories compared between a sample and a reference. Sample: 40% and 20%. Reference: 20% and 20%.]

=> The sample is over-enriched for the category found at 40% in the sample but 20% in the reference

GO term enrichment analysis

•  What is it?

•  Most popular type of GO analysis

•  Determines which GO terms are more often associated with a specified list of genes/proteins compared with a control list or rest of genome

GO term enrichment analysis

•  What can you use it for?

GO term enrichment analysis

“Our gene list contains targets for GATA1 (orange balls) and SP1 (blue balls) transcription factors (TFs). For each TF, we extract the proportion of targets in the gene list and in the genome to construct the contingency table. Fisher's exact test is used to determine if there is a nonrandom association between the gene list and the specific regulation of a TF.”

•  http://bioinfo.cipf.es/docs/renato/simple_enrichment_analysis
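The contingency-table test described in the quote can be sketched with the standard library alone. The one-sided Fisher's exact test on a 2x2 table is equivalent to the upper tail of the hypergeometric distribution, which is what the function below computes. The counts in the example are invented to mirror the 40% versus 20% picture; real tools also correct for multiple testing, which is not shown here.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes,
    when K of the N genes in the reference set carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Invented counts: 4 of 10 listed genes annotated (40%),
# against 20 of 100 genes in the reference (20%).
p_value = hypergeom_pvalue(4, 10, 20, 100)
```

A small p-value means that such an overlap would rarely arise by drawing genes at random from the reference, which is the sense in which a term is "enriched".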


GO term enrichment analysis

•  How does it actually work?

•  http://geneontology.org/page/go-enrichment-analysis

•  http://geneontology.org/faq/what-minimum-information-include-functional-analysis-paper

•  Also useful for GO analysis in general:

GO term enrichment analysis

•  How can I actually do it?

•  Many tools available to do this analysis

•  User must decide which is best for their analysis

•  We’ll focus on the tool provided by the GO Consortium

•  Be aware that there are numerous third-party tools and that they do not all use up-to-date GO data

GO term enrichment analysis

•  How do you get to the GO TE tool?

•  From front page of GO website

•  From AmiGO

http://geneontology.org


http://amigo.geneontology.org/amigo


Spinocerebellar ataxia type 28

Paola Roncaglia

Novel biomarkers of rectal radiotherapy

Biomarker for diagnosis and prognosis

Gene expression changes in diabetes

Improved network analysis

Hands on - Dataset

•  Download http://tinyurl.com/IDs-for-enrichment

•  Go to http://geneontology.org

•  Run the enrichment analysis

Caveats

•  When can you NOT do an enrichment analysis?

•  Too few target genes/proteins

•  Genes/proteins of interest are not present in your background set (e.g. array)

•  Genes/proteins of interest are not expressed/translated in your sample(s)


GO slims

Many gene products are associated with a large number of descriptive, leaf GO nodes:

…however annotations can be mapped up to a smaller set of parent GO terms:
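Mapping up to a slim amounts to climbing the graph from each annotated term until a slim term is reached. The sketch below uses a toy parent table built from cellular component names used earlier in this deck; the edges and the choice of slim terms are assumptions for illustration, and the function is not the algorithm of any particular tool.

```python
from collections import Counter

# Toy parent table (is_a / part_of edges merged), for illustration only.
PARENT_TERMS = {
    "large ribosomal subunit": ["ribosome"],
    "small ribosomal subunit": ["ribosome"],
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial membrane": ["mitochondrion"],
}
SLIM = {"ribosome", "mitochondrion"}


def map_to_slim(term, parents, slim):
    """Return the slim terms reached first on each upward path from a term."""
    hits, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            hits.add(current)  # stop climbing this path at the slim term
        else:
            stack.extend(parents.get(current, []))
    return hits


def slim_counts(annotated_terms, parents, slim):
    """Count annotations per slim term after mapping up."""
    counts = Counter()
    for term in annotated_terms:
        counts.update(map_to_slim(term, parents, slim))
    return counts


counts = slim_counts(
    ["large ribosomal subunit", "small ribosomal subunit", "mitochondrial matrix"],
    PARENT_TERMS,
    SLIM,
)
```

The per-slim counts are what summary charts of biological process, cellular component and molecular function are typically built from.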

Slim generation for industry

•  Collaboration funded by Roche

•  Need a custom GO slim for analysis of genesets of interest

•  Need to be descriptive enough

•  Without redundancy

•  Internal proprietary vocabulary – hard to maintain

•  Desire to automatically map to GO

http://www.swat4ls.org/wp-content/uploads/2015/10/SWAT4LS_2015_paper_44.pdf

ROCHE CV

GSEA with full GO GSEA with Roche CV

Courtesy Laura Badi

•  Mapping query: participant_OR_reg_participant some cannabinoid

•  Description: “A process in which a cannabinoid participates, or that regulates a process in which a cannabinoid participates.”

Results

•  We have successfully mapped 84% of terms from RCV (308/365) to OWL queries that can be used to replicate some proportion of the original manual mapping.

•  In addition, these queries find 1000s of terms that were missed in the original mapping.

David Osumi-Sutherland

GO SLIM (generic)

ROCHE CV – MANUAL ONLY

ROCHE CV MANUAL + AUTO

GO slims for metagenomics functional analysis

https://www.ebi.ac.uk/metagenomics/projects/SRP033553/samples/SRS512695/runs/SRR1045093/results/versions/3.0

Samples comparison

BP CC MF

Samples comparison (detail)

BP

CC

MF

http://www.ebi.ac.uk/about/news/service-news/metagenomics-go-slim-2016

Acknowledgements

•  GO editors and developers

•  GO annotators

•  The Gene Ontology (GO) Consortium

•  Samples, Phenotype and Ontology team (Helen Parkinson)

•  Protein Function Content team (Claire O’Donovan)

•  Funding: EMBL-EBI, National Human Genome Research Institute (NHGRI)

Thank you for your attention!

Contact Gene Ontology Annotation:

[email protected]

Contact Gene Ontology: http://geneontology.org/form/contact-go