Ontologies, data standards and controlled vocabularies

Page 1: Ontologies, data standards and controlled vocabularies

Ontologies, data standards and controlled vocabularies

Page 2: Ontologies, data standards and controlled vocabularies

Why use standards and CVs?

• Essential in high-throughput biology for sorting through vast amounts of data

• To use the same data labels universally

• To enable quick retrieval of data

• To enable easy comparison of data

• To remove ambiguities

Page 3: Ontologies, data standards and controlled vocabularies

What’s in a name?

• What is a cell?

Page 7: Ontologies, data standards and controlled vocabularies

Ambiguities in naming

• The same name can be used to describe different concepts, and different names can describe the same concept, e.g.:
– Glucose synthesis
– Glucose biosynthesis
– Glucose formation
– Glucose anabolism
– Gluconeogenesis

• All refer to the process of making glucose

• This makes it difficult to compare the information

• Solution: use ontologies and data standards

Page 8: Ontologies, data standards and controlled vocabularies

Ontologies

• An ontology is a formal specification of terms and the relationships between them; ontologies are widely used in biology and bioinformatics (e.g. taxonomy)

• The relationships are important and are represented as graphs

• Ontology terms should have definitions

• Ontologies are machine-readable

• They are needed for ordering and comparing large data sets

Page 9: Ontologies, data standards and controlled vocabularies

Gene Ontology (GO)

• http://www.geneontology.org

• Many annotation systems are organism-specific or use different levels of granularity

• GO introduced a standard vocabulary, first used for mouse, fly and yeast but now generic

• Three ontologies: molecular function, biological process and cellular component

Page 10: Ontologies, data standards and controlled vocabularies

GO Ontologies

• Molecular function: tasks performed by a gene product, e.g. G-protein coupled receptor

• Biological process: broad biological goals accomplished by one or more gene products, e.g. G-protein signaling pathway

• Cellular component: part(s) of a cell of which a gene product is a component, including the extracellular environment of cells, e.g. nucleus, membrane

Page 11: Ontologies, data standards and controlled vocabularies

GO hierarchy

Relationships: “is-a”, “part of”
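
The hierarchy can be pictured as a graph in which each term points to its parents through "is-a" or "part of" edges. The sketch below is a minimal illustration under stated assumptions: the term names and the `EDGES` table are hypothetical and are not copied from a GO release.

```python
from collections import deque

# Hypothetical mini-ontology: child term -> list of (relationship, parent term).
# The names are illustrative only and are not taken from a GO release.
EDGES = {
    "nuclear membrane": [("part_of", "nucleus")],
    "nucleus": [("is_a", "organelle")],
    "mitochondrion": [("is_a", "organelle")],
    "organelle": [("part_of", "cell")],
    "cell": [],
}


def ancestors(term, edges=EDGES):
    """Return every term reachable from `term` via is_a / part_of edges."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("nuclear membrane")))
```

Because a term may have more than one parent, the traversal keeps a `seen` set so that shared ancestors are visited only once.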

Page 12: Ontologies, data standards and controlled vocabularies

How do gene products get GO terms?

• Electronic annotation:
– Through mappings to other biological entities and then automatic inference to proteins

• Manual annotation:
– Model organism databases
– Gene Ontology Annotation (GOA) project

• Evidence codes: attached to all GO annotations to show the source

Page 13: Ontologies, data standards and controlled vocabularies

Evidence Codes

IEA Inferred from Electronic Annotation

IDA Inferred from Direct Assay

IMP Inferred from Mutant Phenotype

IPI Inferred from Protein Interaction

IEP Inferred from Expression Pattern

IGI Inferred from Genetic Interaction

ISS* Inferred from Sequence or Structural Similarity

IGC Inferred from Genomic Context

RCA Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non-traceable Author Statement

IC Inferred from Curator Judgement

ND No Data available
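
Since every annotation carries an evidence code, annotations can be filtered by how they were derived. The sketch below is illustrative only: the annotation tuples and protein names are hypothetical, and the grouping into `EXPERIMENTAL` and `ELECTRONIC` sets is an assumption based on the code names in the table above.

```python
# Evidence codes from the table above. The grouping is an assumption made
# for this sketch: codes naming an assay, phenotype, interaction or
# expression pattern are treated as experimental; IEA is electronic.
EXPERIMENTAL = {"IDA", "IMP", "IPI", "IEP", "IGI"}
ELECTRONIC = {"IEA"}

# Hypothetical annotations: (protein, GO ID, evidence code).
ANNOTATIONS = [
    ("PROT_A", "GO:0006281", "IDA"),
    ("PROT_A", "GO:0003989", "IEA"),
    ("PROT_B", "GO:0006633", "IMP"),
    ("PROT_C", "GO:0006633", "TAS"),
]


def filter_by_evidence(annotations, allowed_codes):
    """Keep only the annotations whose evidence code is in `allowed_codes`."""
    return [a for a in annotations if a[2] in allowed_codes]


experimental_only = filter_by_evidence(ANNOTATIONS, EXPERIMENTAL)
electronic_only = filter_by_evidence(ANNOTATIONS, ELECTRONIC)
print(experimental_only)
print(electronic_only)
```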

Page 15: Ontologies, data standards and controlled vocabularies

Electronic annotation: GO mappings

• Fatty acid biosynthesis (SwissProt keyword) → GO: fatty acid biosynthesis (GO:0006633)

• EC:6.4.1.2 (EC number) → GO: acetyl-CoA carboxylase activity (GO:0003989)

• IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO: acetyl-CoA carboxylase activity (GO:0003989)

• MF_00527: Putative 3-methyladenine DNA glycosylase (HAMAP) → GO: DNA repair (GO:0006281)

Camon et al. BMC Bioinformatics. 2005; 6 Suppl 1:S17
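
Electronic annotation through mappings amounts to a table lookup: a protein that carries an external identifier inherits the GO term mapped to it, tagged with the IEA evidence code. The sketch below uses the four mappings from this slide; the proteins `PROT_A` and `PROT_B`, their cross-references and the identifier `IPR999999` are hypothetical.

```python
# External identifier -> GO ID, using the four examples from the slide.
EXTERNAL_TO_GO = {
    "KW:Fatty acid biosynthesis": "GO:0006633",  # SwissProt keyword
    "EC:6.4.1.2": "GO:0003989",                  # EC number
    "IPR000438": "GO:0003989",                   # InterPro entry
    "MF_00527": "GO:0006281",                    # HAMAP family
}

# Hypothetical proteins and their cross-references (illustration only).
PROTEIN_XREFS = {
    "PROT_A": ["EC:6.4.1.2", "IPR000438"],
    "PROT_B": ["MF_00527", "IPR999999"],  # IPR999999 has no mapping here
}


def electronic_annotations(protein_xrefs, mapping=EXTERNAL_TO_GO):
    """Infer (protein, GO ID, 'IEA') tuples from mapped cross-references."""
    inferred = set()
    for protein, xrefs in protein_xrefs.items():
        for xref in xrefs:
            if xref in mapping:
                inferred.add((protein, mapping[xref], "IEA"))
    return sorted(inferred)


print(electronic_annotations(PROTEIN_XREFS))
```

Collecting the results in a set means that two cross-references pointing at the same GO term, as for `PROT_A`, yield a single annotation.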

Page 16: Ontologies, data standards and controlled vocabularies

UniProt entry

Page 17: Ontologies, data standards and controlled vocabularies

Automatic transfer of annotations to orthologs

http://www.ensembl.org/info/data/compara

Ensembl GO term projection via gene homology (COMPARA)

[Figure: projection of GO terms between species, including cow, dog, rat, mouse, chicken, Anopheles and Drosophila]

• Homologies between different species are calculated

• GO terms are projected from MANUAL annotation only (IDA, IEP, IGI, IMP, IPI)

• Only one-to-one and apparent one-to-one orthologies are used
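
The projection rules on this slide can be sketched as two filters: keep only annotations with one of the listed evidence codes, and follow only one-to-one or apparent one-to-one orthology links. This is a minimal illustration, not Ensembl's implementation; the gene names, the orthology labels and the tuple layouts are hypothetical.

```python
# Evidence codes the slide lists as eligible for projection.
MANUAL_CODES = {"IDA", "IEP", "IGI", "IMP", "IPI"}

# Hypothetical labels for the orthology types allowed by the slide.
ALLOWED_ORTHOLOGY = {"one_to_one", "apparent_one_to_one"}


def project_go_terms(annotations, orthologs):
    """Project GO terms from source genes onto their orthologs.

    annotations: list of (source_gene, go_id, evidence_code)
    orthologs:   list of (source_gene, target_gene, orthology_type)
    Returns a list of (target_gene, go_id, source_gene).
    """
    targets = {}
    for source, target, kind in orthologs:
        if kind in ALLOWED_ORTHOLOGY:
            targets.setdefault(source, []).append(target)

    projected = []
    for gene, go_id, evidence in annotations:
        if evidence not in MANUAL_CODES:
            continue  # electronic and other annotations are not projected
        for target in targets.get(gene, []):
            projected.append((target, go_id, gene))
    return projected


# Hypothetical input data.
ANNOTATIONS = [
    ("mouse_gene_1", "GO:0006281", "IDA"),
    ("mouse_gene_1", "GO:0003989", "IEA"),
    ("mouse_gene_2", "GO:0006633", "IMP"),
]
ORTHOLOGS = [
    ("mouse_gene_1", "rat_gene_1", "one_to_one"),
    ("mouse_gene_2", "rat_gene_2", "one_to_many"),
]

print(project_go_terms(ANNOTATIONS, ORTHOLOGS))
```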

Page 18: Ontologies, data standards and controlled vocabularies

Manual annotation: GOA Project

• Largest open-source contributor of annotations to GO

• Member of the GO Consortium since 2001

• Provides annotation for more than 130,000 species

• GOA’s priority is to annotate the human proteome

• GOA is responsible for human, chicken, bovine and many other annotations for the GO Consortium

• Annotation is done through reading of the literature

Page 19: Ontologies, data standards and controlled vocabularies

Reference Genomes

Arabidopsis thaliana
Caenorhabditis elegans
Danio rerio (zebrafish)
Dictyostelium discoideum
Drosophila melanogaster
Escherichia coli
Homo sapiens
Saccharomyces cerevisiae
Mus musculus
Schizosaccharomyces pombe
Gallus gallus
Rattus norvegicus

• Comprehensive annotation of a set of disease-related proteins in human

• Generate a reliable set of GO annotations for the 12 selected genomes

• Empowers comparative methods used in first pass annotation of other proteomes.

Page 20: Ontologies, data standards and controlled vocabularies

Accessing GO data (1)

http://amigo.geneontology.org/cgi-bin/amigo/go.cgi

Page 21: Ontologies, data standards and controlled vocabularies

Accessing GO data (2)

QuickGO browser

http://www.ebi.ac.uk/quickgo

Human Insulin Receptor (P06213)

Page 22: Ontologies, data standards and controlled vocabularies

Accessing GO data (3)

Gene Association Files

http://www.geneontology.org/GO.current.annotations.shtm

Page 23: Ontologies, data standards and controlled vocabularies

Gene Association File example
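
A gene association file is tab-separated text with one annotation per line, and lines starting with "!" are comments. The sketch below assumes the usual column order of the format (DB, object ID, symbol, qualifier, GO ID, reference, evidence code, with/from, aspect, and so on) and reads only a subset of the columns. The example record is illustrative: its reference and date are placeholders, not a real database entry.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    db: str
    object_id: str
    symbol: str
    qualifier: str
    go_id: str
    reference: str
    evidence: str
    aspect: str  # F = molecular function, P = biological process, C = cellular component


def parse_gaf_line(line: str) -> Optional[Annotation]:
    """Parse one gene association line; return None for '!' comment lines."""
    if line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 15:
        raise ValueError(f"expected at least 15 columns, got {len(fields)}")
    # Column 8 (index 7) is with/from, which this sketch skips.
    return Annotation(fields[0], fields[1], fields[2], fields[3],
                      fields[4], fields[5], fields[6], fields[8])


# Illustrative record; the reference and date are placeholders.
EXAMPLE = "\t".join([
    "UniProtKB", "P06213", "INSR", "", "GO:0005524", "PMID:0000000", "IEA",
    "", "F", "Insulin receptor", "", "protein", "taxon:9606", "20080101",
    "UniProt",
])

record = parse_gaf_line(EXAMPLE)
print(record.object_id, record.go_id, record.evidence, record.aspect)
```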


Page 24: Ontologies, data standards and controlled vocabularies

Downloading GOA data

ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/

http://www.ebi.ac.uk/GOA/downloads.html

Page 25: Ontologies, data standards and controlled vocabularies

Uses of GO 1

Functional annotation of proteins

Page 26: Ontologies, data standards and controlled vocabularies

Uses of GO 2

Find functional information on interaction proteins (IntAct)

Page 27: Ontologies, data standards and controlled vocabularies

Uses of GO 3

Analysis of high-throughput data

• Microarray data analysis

• Proteomics data analysis

• GO classification examples: Larkin JE et al, Physiol Genomics, 2004; Cunliffe HE et al, Cancer Res, 2003
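
GO classification of a high-throughput result can be as simple as counting how many genes in a hit list carry each GO term. The sketch below illustrates that idea only; the gene names and the `GENE_TO_GO` table are hypothetical, and no statistical test for enrichment is performed.

```python
from collections import Counter

# Hypothetical gene -> GO ID annotations (illustration only).
GENE_TO_GO = {
    "gene_a": ["GO:0006281"],
    "gene_b": ["GO:0006281", "GO:0006633"],
    "gene_c": ["GO:0006633"],
    "gene_d": [],
}


def classify_by_go(gene_list, gene_to_go):
    """Count, for each GO ID, how many genes in `gene_list` are annotated with it."""
    counts = Counter()
    for gene in gene_list:
        # set() guards against a term being listed twice for the same gene
        for go_id in set(gene_to_go.get(gene, [])):
            counts[go_id] += 1
    return counts


# Hypothetical hit list, e.g. differentially expressed genes.
hits = ["gene_a", "gene_b", "gene_d", "gene_x"]
summary = classify_by_go(hits, GENE_TO_GO)
print(summary.most_common())
```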

Page 28: Ontologies, data standards and controlled vocabularies

Other Ontologies: Open Biomedical Ontologies

http://obo.sourceforge.net

• Central location for accessing well-structured controlled vocabularies and ontologies for use in the biological and medical sciences.

• Provides simple format for ontologies that can encode terms, relationships between terms and definitions of terms including those taken from external ontologies.
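
The simple format mentioned above stores each term as a `[Term]` stanza of `tag: value` lines. The sketch below reads such stanzas into dictionaries. It covers only a small part of the format, and the `EX:` identifiers and the definition text are hypothetical.

```python
# Hypothetical two-term ontology in OBO-style stanzas (illustration only).
OBO_TEXT = """\
[Term]
id: EX:0000001
name: organelle
def: "Illustrative definition." []

[Term]
id: EX:0000002
name: mitochondrion
is_a: EX:0000001 ! organelle
"""


def parse_obo(text):
    """Return {term id: {tag: [values]}} for every [Term] stanza in `text`."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Start of a new stanza; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            if current is not None:
                stanzas.append(current)
            continue
        if current is None or ":" not in line:
            continue
        tag, value = line.split(":", 1)
        value = value.split(" ! ")[0].strip()  # drop a trailing "! comment"
        current.setdefault(tag, []).append(value)
    return {s["id"][0]: s for s in stanzas if "id" in s}


terms = parse_obo(OBO_TEXT)
print(terms["EX:0000002"]["name"], terms["EX:0000002"]["is_a"])
```

Values are kept as lists because tags such as `is_a` may appear more than once in a stanza.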

Page 29: Ontologies, data standards and controlled vocabularies

Scope of Open Biomedical Ontologies

• Anatomy
• Animal natural history and life history
• Chemical
• Development
• Ethology
• Evidence codes
• Experimental conditions
• Genomic and proteomic
• Metabolomics
• OBO relationship types
• Phenotype
• Taxonomic classification

Page 30: Ontologies, data standards and controlled vocabularies

Ontology Lookup Service (OLS)

• Single point of query for currently 47 ontologies.

• Ontologies are updated daily from CVS repositories, including the OBO CVS repository and the PRIDE CVS repository.

• A tool that offers interactive and programmatic interfaces for queries on term names, synonyms, relationships, annotations and database cross-references.

• Originally developed for using ontologies in PRIDE.

Page 31: Ontologies, data standards and controlled vocabularies

The issue faced

• These relationships have consequences when querying a database annotated using the ontology.

• What happens when I ask for PRIDE experiments describing the proteome of brain tissue?
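
One way to see the consequence: an experiment annotated with a more specific term is missed by an exact-match query for "brain" unless the query is expanded to the term's descendants. The sketch below illustrates this with a hypothetical tissue hierarchy and hypothetical experiment identifiers; it does not use PRIDE or BRENDA data.

```python
# Hypothetical tissue hierarchy: parent term -> direct child terms.
CHILDREN = {
    "brain": ["cerebellum", "cerebral cortex"],
    "cerebral cortex": ["frontal cortex"],
}

# Hypothetical experiments: experiment ID -> annotated tissue term.
EXPERIMENTS = {
    "exp1": "brain",
    "exp2": "frontal cortex",
    "exp3": "liver",
    "exp4": "cerebellum",
}


def descendants(term, children):
    """Return all terms below `term` in the hierarchy."""
    found = set()
    stack = [term]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def find_experiments(term, experiments, children):
    """Return IDs of experiments annotated with `term` or any descendant."""
    wanted = {term} | descendants(term, children)
    return sorted(e for e, tissue in experiments.items() if tissue in wanted)


exact = sorted(e for e, tissue in EXPERIMENTS.items() if tissue == "brain")
expanded = find_experiments("brain", EXPERIMENTS, CHILDREN)
print(exact)
print(expanded)
```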

Page 32: Ontologies, data standards and controlled vocabularies

Using Ontologies in PRIDE

For an experiment you want to define:
– Species: Newt / NCBI Taxonomy ID
– Tissue / organ / cell type: BRENDA Tissue ontology, Cell Type ontology
– Sub-cellular component: Gene Ontology (GO)
– Disease: Human Disease ontology (DOID)
– Genotype: GO
– Sample processing: PSI Ontology
– Mass spectrometry: PSI-MS Ontology
– Protein modifications: PSI-MOD Ontology

Page 33: Ontologies, data standards and controlled vocabularies

OLS usage examples

• http://www.ebi.ac.uk/ontology-lookup/

• What is the accession for “mitochondrion” in GO? In MeSH?
– search by term name in a specific ontology or across all

• I’m looking for a term to annotate my protocol step but I’m not sure what term to use.
– browse an ontology

• I’m looking for all the experiments done on liver tissue.
– get all child terms of liver and query on those as well

• My data set was annotated with GO version 123, but that was a long time ago.
– get updated term names for the identifiers you have and see if any have been made obsolete
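
The last use case, refreshing old annotations, can be sketched as a comparison between the stored term names and a current snapshot of the ontology. Everything in the example is hypothetical: the `EX:` identifiers, the names and the snapshot layout. In practice the current names and obsolete flags would come from a lookup service such as OLS.

```python
# Hypothetical current snapshot: term ID -> (current name, is_obsolete).
CURRENT_TERMS = {
    "EX:0000001": ("glucose biosynthetic process", False),
    "EX:0000002": ("legacy term", True),
    "EX:0000003": ("nucleus", False),
}

# Hypothetical stored annotations: term ID -> name at annotation time.
STORED = {
    "EX:0000001": "glucose biosynthesis",
    "EX:0000002": "legacy term",
    "EX:0000003": "nucleus",
    "EX:0000004": "vanished term",
}


def check_annotations(stored, current):
    """Classify each stored term as unchanged, renamed, obsolete or unknown."""
    report = {}
    for term_id, old_name in stored.items():
        if term_id not in current:
            report[term_id] = "unknown"
            continue
        name, obsolete = current[term_id]
        if obsolete:
            report[term_id] = "obsolete"
        elif name != old_name:
            report[term_id] = f"renamed to {name}"
        else:
            report[term_id] = "unchanged"
    return report


for term_id, status in check_annotations(STORED, CURRENT_TERMS).items():
    print(term_id, status)
```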

Page 34: Ontologies, data standards and controlled vocabularies

Standards for data exchange

• Systems Biology Markup Language (SBML): computer-readable format for representing models of networks

• Biological Pathways Exchange (BioPAX): format for representing pathways

• Proteomics Standards Initiative (PSI, MIAPE)

• Microarray standards: MIAME and MAGE

Page 35: Ontologies, data standards and controlled vocabularies

MIAPE/MIAME principles

• Enough information to:
– Remove ambiguity in the experiment
– Allow easy interpretation of results
– Allow the experiment to be repeated
– Enable comparison across similar experiments

• Use controlled vocabularies

Page 36: Ontologies, data standards and controlled vocabularies

Using ontologies and standards

• So much data in different places: need to organize and share it

• Used for data retrieval and comparison: easier to query

• Used for data integration and exchange: standard representation

• Used for evaluation: need a “gold standard”