Intro to data analysis: Gene Ontology and Pathways
Kjell Petersen
Introduction to Microarray technology
September 2009
Presentation adapted from Endre Anderssen and Vidar Beisvåg
NMC Trondheim
microarray.no
Overview
• How can ontologies and pathway information help us?
• What is an ontology?
• The Gene Ontology and how it's structured
• How to use
– Interactively
– Statistically
• Pathways
So here you are
• Figure of differential expression
Gene lists
• Long list of differentially expressed genes
• Possibly hundreds of papers describing the functions of the genes
• Misleading names
• Different names in different organisms
What’s in a name?
• The same name can be used to describe different concepts
• What is a cell?
Cell
Image from http://microscopy.fsu.edu
Ontologies
• Gene Ontology (GO)
• Sequence Ontology (SO) (sequence features)
• Phenotype and Trait Ontology (PATO)
• Taxon (NCBI)
• Anatomy (Penn)
• Disease (ICD9)
• Developmental stage (multiple sources)
Gene Ontology (GO)
• Why Gene Ontology?
– Produce a controlled vocabulary describing aspects of molecular biology that can be applied to all organisms.
– Facilitate communication between people and organizations.
– Improve interoperability between systems.
Goal of GO Consortium
• Produce a controlled vocabulary describing aspects of molecular biology that can be applied to all organisms.
• Describe gene products using vocabulary terms (annotation).
• Develop tools:
– to query and modify the vocabularies and annotations (http://www.geneontology.org/)
How does GO work?
What information might we want to capture about a gene product?
• What does the gene product do?
• Why does it perform these activities?
• Where does it act?
The Gene Ontology (GO)
– Molecular function:
• Gene product at biochemical level.
– Biological process:
• Cellular events to which the gene product contributes.
– Cellular component:
• Location or complex of gene/protein.
Molecular Function
• activities or “jobs” of a gene product
insulin binding
insulin transport activity
Molecular Function
• drug transporter activity
Biological Process
• a commonly recognized series of events
cell division
Cellular Component
• where a gene product acts
Content of GO
• Molecular Function: 7,309 terms
• Biological Process: 10,041 terms
• Cellular Component: 1,629 terms
• Total: 18,975 terms
• Obsolete terms: 992
• As of October 2005
Ontology Structure
• Directed acyclic graphs (DAGs)
• Relationships
– “is a”
• a is a type of b (e.g. truck is a car, or mitochondrion is an organelle)
– Regulates
• Positively regulates
• Negatively regulates
– “part of”
• subprocess of (process)
• physical part of (component) (e.g. engine is part of a car, or mitochondrion membrane is a part of a mitochondrion)
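The DAG structure and its relationship types can be sketched in a few lines of Python. This is only an illustration: the term names are plain labels rather than real GO identifiers, the "membrane" parent is an assumption added to show a term with more than one parent, and the `ancestors` helper is a hypothetical name, not part of any GO tool.

```python
# Toy ontology: each term maps to a list of (relationship, parent) pairs.
# Labels are illustrative only; they are not real GO term IDs.
TOY_ONTOLOGY = {
    "organelle": [],
    "membrane": [],  # assumed extra parent, added for illustration
    "mitochondrion": [("is_a", "organelle")],
    "mitochondrial membrane": [
        ("part_of", "mitochondrion"),
        ("is_a", "membrane"),
    ],
}


def ancestors(term, ontology):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in ontology[current]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("mitochondrial membrane", TOY_ONTOLOGY)))
```

Because a term may have several parents, the structure is a graph rather than a tree. A gene annotated to a term is implicitly associated with all of that term's ancestors as well.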
Term Definitions and Curation
• The definitions for each GO term are primarily derived from the Oxford Dictionary of Molecular Biology, or from relevant literature sources (SWISSPROT, PIR, NCBI CGAP, EC...).
• Curators around the world sifting through genomic and proteomic data then use the definitions and GO terms provided by GO to annotate or curate the genes and proteins in their favorite species.
• GO is stored as flatfiles, as XML files and as a relational database implemented in MySQL.
GO Annotation
• Association between gene product and applicable GO terms
• Provided by member databases. Collaborating databases annotate their gene products (or genes) with GO terms, providing references and indicating what kind of evidence is available to support the annotations.
• Made by manual or automated methods.
• GO Annotation
• Database object: gene or gene product
• GO term ID
• Evidence supporting annotation
• Reference
– publication or computational method
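The four parts of an annotation listed above map naturally onto a small record type. The sketch below is one possible representation, not the format used by the GO databases; the class name, field names, helper function and all example values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoAnnotation:
    db_object: str   # gene or gene product identifier
    go_term_id: str  # ID of the GO term
    evidence: str    # kind of evidence supporting the annotation
    reference: str   # publication or computational method


# Placeholder records; none of these identifiers are real.
EXAMPLE_ANNOTATIONS = [
    GoAnnotation("geneA", "TERM_1", "manual", "publication-1"),
    GoAnnotation("geneA", "TERM_2", "automated", "method-1"),
    GoAnnotation("geneB", "TERM_1", "automated", "method-1"),
]


def terms_for(db_object, annotations):
    """Collect the GO term IDs annotated to one gene or gene product."""
    return {a.go_term_id for a in annotations if a.db_object == db_object}


print(sorted(terms_for("geneA", EXAMPLE_ANNOTATIONS)))
```

Keeping the evidence and reference with each record makes it possible to filter annotations later, for instance to use only manually made ones.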
Gene Ontology and Microarrays
• Hypothesis: Functionally related, differentially expressed genes should accumulate in the corresponding GO group.
• Problem: Find a method that scores the accumulation of differentially expressed genes in a node of the Gene Ontology.
• GO tools can help answer questions such as:
– “Are genes involved in process P overrepresented among the differentially expressed genes in an experiment?” or
– “Does treatment A induce more genes involved in process P than treatment B?”
Browsing GO in JExpress
Overrepresentation of GO terms
• We have a subset of genes
– List of differentially expressed genes
– List of genes that cluster together
• Which biological processes do these genes take part in?
• Are genes belonging to a particular biological process overrepresented in the subset, compared to what could be expected?
Question
• Suppose we look at the dataset containing all of our genes and see that 10% of them belong to cell cycle. We then run a differential expression analysis and get a list of genes we believe are significantly changed.
• How many of the genes in the gene list do you expect to belong to cell cycle?
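One way to reason about the question: if differential expression were unrelated to cell cycle membership, the gene list should contain roughly the same fraction of cell cycle genes as the full dataset. The short sketch below works this out for a gene list of 200 genes; the list size and the function name are assumptions chosen for illustration, and only the 10% figure comes from the slide.

```python
def expected_in_category(list_size, reference_fraction):
    """Expected number of list genes in a category if membership is
    unrelated to being on the list."""
    return list_size * reference_fraction


# Hypothetical gene list of 200 genes; 10% of all genes are cell cycle genes.
expected = expected_in_category(200, 0.10)
print(f"Expected cell cycle genes in the list: about {expected:.0f}")
```

Counts well above this expectation suggest overrepresentation, although a statistical test is still needed to judge whether the excess could have arisen by chance.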
Setup
• We call our subset of interesting genes the test data
• We call the dataset containing all of our genes, from which we extracted the interesting genes and against which we want to compare the test data, the reference data
Test data
Reference data
Gene Ontology Analysis
Reference data
Test data
Statistical comparison between the two GO components
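The slides do not say which statistical test is used for this comparison. A common choice for overrepresentation analysis is the hypergeometric test (equivalent to a one-sided Fisher's exact test), sketched below under that assumption. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N
    reference genes, K of which belong to the GO category.

    k: test-data genes in the category
    n: size of the test data
    K: reference genes in the category
    N: size of the reference data
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 1000 reference genes, 100 in the category,
# 200 genes in the test data, 35 of them in the category.
p_value = hypergeom_upper_tail(35, 200, 100, 1000)
print(f"p-value for overrepresentation: {p_value:.3g}")
```

In practice one such test is run per GO term, so the resulting p-values should be corrected for multiple testing before any term is called overrepresented.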
Biological pathways
GO vs. Pathways
GO
• Overview
• Can handle a large number of genes
• Many genes annotated
• Every gene considered on its own
Pathways
• Detail view
• Focused sets of genes
• Scattered data sources
• Focuses on interactions between genes
Types of pathways
• Cartoons
– Textbooks
– Biocarta
• Circuit diagrams
– KEGG
– Reactome
– geneRifs
• Computational networks
– SBML models
– Transcription factor networks
Global networks
Local networks
KEGG
• Global network of regulation and metabolism
• Organised by separate pathways with hand drawn diagrams
• Pathways can be used to look for overrepresentation or enrichment
• Visually check for pathness or direction
Conclusion
• GO is the world map of molecular biology
• Pathways provide more detailed information
• Need for dynamic pathway creation coupled to data analysis