http://www.aitbiotech.com/images/microarray.jpg
http://www.pnas.org/content/104/51/20374/F4.large.jpg
We got differentially expressed genes, now what?
Find their functions, test for enrichment, reduce false positives
From gene-lists to functional annotations
• Molecular Function = elemental activity/task
  – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
• Biological Process = biological goal or objective
  – broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
  – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
The 3 Gene Ontologies
Modified from: http://anil.cchmc.org/Intro_FunGen_Feb2008_Jegga.ppt#287,33,Slide 33
Function (what)                  Process (why)
Drive a nail into wood           Carpentry
Drive a stake into soil          Gardening
Smash a bug                      Pest control
A performer’s juggling object    Entertainment
Example: Gene = hammer
http://anil.cchmc.org/Intro_FunGen_Feb2008_Jegga.ppt#284,34,Slide 34
http://www.geneontology.org/
Mining human interactome
Known Disease Genes: 7
Direct Interactions of Disease Genes: 66
Indirect Interactions of Disease Genes: 778
Which of these interactants are potential new candidates?
Prioritize candidate genes in the interacting partners of the disease-related genes
• Training sets: disease-related genes
• Test sets: interacting partners of the training genes
http://anil.cchmc.org/Intro_FunGen_Feb2008_Jegga.ppt#337,47,Slide 47
A small example of post-microarray analysis tools:
Database
• Panther (http://www.pantherdb.org)
• ToppGene
• STRING
• GOTM
• Onto-Tools
• TF networks (P.A.I.N.T)
PANTHER™ Protein Classification System
http://www.pantherdb.org
WHAT CAN I DO ON THE PANTHER SITE?
Protein ANalysis Through Evolutionary Relationships
Goal: The PANTHER site was designed to facilitate functional analysis of large numbers of genes, proteins or transcripts.
Tools:
• Explore protein families, molecular functions, biological processes and pathways.
• Generate lists of genes, proteins or transcripts that belong to a given protein family or subfamily, have a given molecular function, or participate in a given biological process or pathway, e.g. generate a candidate gene list for a disease.
• Analyze lists of genes, proteins or transcripts in batch mode according to categories based on family, molecular function, biological process or pathway, e.g. analyze mRNA microarray data.
http://nar.oxfordjournals.org/cgi/content/full/31/1/334
http://genome.cshlp.org/content/13/9/2129.full
http://nar.oxfordjournals.org/cgi/content/full/33/suppl_1/D284
http://nar.oxfordjournals.org/cgi/content/full/35/suppl_1/D247
http://www.pantherdb.org/sitemap.jsp
Single gene search
Batch gene search
Convert gene list IDs: Affy ID → Gene symbol
http://david.abcc.ncifcrf.gov/tools.jsp
Affy IDs: 1788_S_AT, 36651_AT, 41788_I_AT, 35595_AT, 36285_AT, 39586_AT, 35160_AT, 39424_AT
Gene symbols: USP1, DDR1, WNT10B, PRKAR1B, MLL, CD44, GNA13, MMP15, IER3
http://david.abcc.ncifcrf.gov/tools.jsp
• Paste the Affy ID list
• Select AFFY_ID as ID type
• Select List type: Gene List
• Submit list
• Select HOMO SAPIENS as species, press the select button
• Choose the Gene ID Conversion Tool
• Select: GENE_SYMBOL, submit and download the results
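Once the conversion results are downloaded, the gene symbols still have to be pulled out of the table before they can be pasted into another tool. The following is a minimal Python sketch, assuming the downloaded file is tab-delimited with a header row and columns named `From` (submitted Affy ID) and `To` (gene symbol). The column names, the function name `symbols_from_conversion` and the placeholder identifiers are assumptions for illustration and should be adjusted to the actual file.

```python
import csv
import io


def symbols_from_conversion(text, symbol_col="To"):
    """Return unique gene symbols, in first-seen order, from a tab-delimited
    ID-conversion table. The column name is an assumption."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    seen = set()
    symbols = []
    for row in reader:
        symbol = (row.get(symbol_col) or "").strip()
        # Skip probes without a symbol and symbols already collected
        if symbol and symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols


# Placeholder rows, NOT real probe-to-gene mappings
example = "From\tTo\nprobe_1\tGENE_A\nprobe_2\tGENE_B\nprobe_3\tGENE_A\nprobe_4\t\n"
print("\n".join(symbols_from_conversion(example)))
```

Several probes can map to the same gene, so the sketch removes duplicates; the printed one-symbol-per-line list is the form that can be pasted into a batch search box.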
Perform Panther Batch Search:
• Copy the gene symbol list and paste it into the Batch Search in Panther: http://www.pantherdb.org/ => Batch Search
• Select upload ID type: Gene Symbol
• Select File Type: ID list
• Result page: Genes
• Select 1 dataset: NCBI: H. sapiens
• Press the Search button
• Press in the and select: Biological process
Panther Export Options
Click on either pie slices or bars to get sub-functions.
Click on links to get gene lists for the chosen function.
http://www.pantherdb.org/genes/
Other Panther Options
http://www.pantherdb.org/panther/ontologies.jsp
Task: find genes in a specific ontology (or in a few ontologies)
Panther vs GO molecular function and biological process
Browse for genes in ontologies
Other Panther Options
Search PANTHER Pathway
http://www.pantherdb.org/pathway/
Add legend to pathway
Other Panther Options
Compare classifications of multiple clusters of lists to a reference list to statistically determine over- or under- representation of PANTHER classification categories. Each list is compared to the reference list using the binomial test (Cho & Campbell, TIGs 2000) for each molecular function, biological process, or pathway term in PANTHER.
Map the genes in a gene expression data file to a PANTHER ontology. For pathways, you can then view the gene expression values overlaid on top of a pathway diagram, where genes are colored according to the expression value.
http://www.pantherdb.org/tools/
Gene expression tools
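The binomial comparison described above can be sketched in a few lines of Python. This is an illustrative sketch, not PANTHER's own code: it assumes the probability of a category is estimated as its fraction in the reference list and that a one-sided tail is reported in the direction of the deviation. The function names and the counts in the example are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p, over=True):
    """P(X >= k) if over, else P(X <= k), for X ~ Binomial(n, p)."""
    ks = range(k, n + 1) if over else range(0, k + 1)
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in ks)


def category_test(list_hits, list_size, ref_hits, ref_size):
    """Compare one category in a gene list against a reference list.

    Returns (expected count, '+' for over- or '-' for under-representation,
    one-sided binomial p-value).
    """
    p = ref_hits / ref_size          # category frequency in the reference
    expected = list_size * p         # hits expected by chance in the list
    over = list_hits >= expected
    return expected, "+" if over else "-", binomial_tail(list_hits, list_size, p, over)


# Hypothetical counts: 400 of 20000 reference genes and 8 of 100 list genes
# belong to the category.
expected, direction, pvalue = category_test(8, 100, 400, 20000)
print(expected, direction, pvalue)
```

Because one such test is run per category, the resulting p-values still need a multiple-testing correction before they are interpreted.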
Other Panther Options
Other Panther Options - GRAPHIC RESULTS
Play with graphics
default
optional
http://toppgene.cchmc.org/
http://toppgene.cchmc.org/help/help.jsp
Portal for
(i) gene list functional enrichment
(ii) candidate gene prioritization using either functional annotations or network analysis
(iii) identification and prioritization of novel disease candidate genes in the interactome.
http://nar.oxfordjournals.org/cgi/reprint/gkp427v1
Hypergeometric distribution with Bonferroni correction
http://stattrek.com/Tables/Hypergeometric.aspx
What is a hypergeometric experiment?
A hypergeometric experiment has the following characteristics:
• The population has size N, of which M items are successes.
• The researcher randomly selects a subset of n items from the population, without replacement.
Question: what is the probability that exactly k of the selected items are successes?
What is a hypergeometric distribution?
A hypergeometric distribution is a probability distribution. It gives the probabilities associated with the number of successes in a hypergeometric experiment.
Example: We have a pack of 52 cards (26 black; drawing a black card is a success). We randomly select 12 cards out of the 52. What is the probability of getting exactly 7 successes (black cards)? (0.21)
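The card example can be checked directly from the definition, P(k) = C(M, k) · C(N − M, n − k) / C(N, n). The Python sketch below uses only the standard library; the function names are illustrative. The upper-tail function is included because enrichment tools typically report the probability of seeing k or more successes, which is an assumption about usage rather than something stated on the slide.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(exactly k successes) when drawing n items without replacement
    from a population of N items that contains M successes."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


def hypergeom_upper_tail(k, N, M, n):
    """P(k or more successes), the quantity usually used as an enrichment p-value."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(k, min(M, n) + 1))


# Card example: N = 52 cards, M = 26 black, n = 12 drawn, k = 7 black
p = hypergeom_pmf(7, 52, 26, 12)
print(round(p, 2))  # 0.21
```

In a gene-list setting, N would be the number of genes in the reference set, M the genes annotated with a term, n the size of the submitted list and k the list genes carrying the term.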
Hypergeometric calculator results
Hypergeometric calculator:
Just 2 clarification slides….
Statistical Corrections
http://cbi.labri.fr/outils/BlastSets/BlastSets_web_manual/principles.html
In many analyses of biological experiments, a large number of false positives are found among the results. When making multiple comparisons, we need to apply a statistical correction to our significance threshold in order to remove as many false positives as possible.
Commonly available statistical corrections:
Method: Bonferroni correction
  Complexity: simplest
  Time: fastest
  Stringency: most conservative
  Results: keeps only the most significant results, removing all possible noise and putative results
  Drawback: a lot of significant information is removed along with the noise
Method: False Discovery Rate (FDR)
  Stringency: less conservative
  Results: a good compromise between keeping only truly significant hits and having too many false positives
  Drawback: some false positives…
When detecting differentially expressed genes, we want to detect ONLY the differentially expressed genes, with no false positives!
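The difference between the two corrections is easiest to see on a small set of p-values. The Python sketch below computes adjusted p-values for both; it assumes the Benjamini-Hochberg procedure as the FDR method, which is a common choice but is not named in the table above. The p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (one common FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values from eight hypothetical tests
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
alpha = 0.05
n_bonf = sum(p <= alpha for p in bonferroni(pvals))
n_fdr = sum(p <= alpha for p in benjamini_hochberg(pvals))
print(n_bonf, n_fdr)  # 1 2
```

On these values Bonferroni keeps one result and the FDR procedure keeps two, which matches the trade-off in the table: fewer false positives at the cost of discarding more potentially real signals.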
Example:
Go to ToppGene web-page: http://toppgene.cchmc.org/
Choose ToppFun link
Copy the gene symbol list and paste it into the provided box, make sure that the entry type is HGNC symbol, and press the Submit Query button.
Go to the bottom of the page, choose the FDR correction method for all features, and submit.
Observe the details of the results, one at a time.
Example: a. Using ToppFun for gene list enrichment analysis:
Perform a gene list enrichment analysis on obesity-associated genes
b. Using ToppGene for disease gene prioritization based on functional similarity to training set genes
Query: To rank or prioritize a list of genes (test set) by functional annotation similarity to the training set.
Calculates score and p-value for the genes and functions.
c. Using ToppNet for disease gene prioritization based on topological features in the protein-protein interaction network (PPIN)
Query: To rank or prioritize a list of genes (test set) based on topological features in the PPIN.
d. Using ToppGenet to identify and prioritize the neighboring genes of the "seeds" or training set in the protein-protein interaction network (PPIN)
Query: To rank or prioritize a list of genes in the interactome of the training set genes using either functional similarity (ToppGene) or PPIN analysis (ToppNet).
Create a network by functional similarity (ToppGene) or network analysis (ToppNet).
Distance to seeds: 1, so the test set comprises all genes that are immediate interactants of the training set genes.
• Purple nodes are the training set or seed genes.
• Grey nodes are the interactants from the test set.
• Green nodes (a subset of the grey ones) are the top-ranked genes from the test set.
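How a test set follows from the seeds and the "distance to seeds" setting can be illustrated with a small breadth-first expansion. This Python sketch only shows the idea of collecting interactants within a given distance; it is not ToppGenet's implementation, and the network, gene names and function name are hypothetical.

```python
def interactants(network, seeds, distance=1):
    """Genes reachable from any seed within `distance` interaction steps,
    excluding the seeds themselves."""
    seeds = set(seeds)
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(distance):
        # Neighbors of the current frontier that have not been seen yet
        frontier = {nbr for gene in frontier for nbr in network.get(gene, ())} - visited
        visited |= frontier
    return visited - seeds


# Hypothetical undirected toy network (adjacency sets), not real interactions
toy_network = {
    "SEED1": {"A", "B"},
    "SEED2": {"B", "C"},
    "A": {"SEED1", "D"},
    "B": {"SEED1", "SEED2"},
    "C": {"SEED2", "E"},
    "D": {"A"},
    "E": {"C"},
}

test_set = interactants(toy_network, ["SEED1", "SEED2"], distance=1)
print(sorted(test_set))  # ['A', 'B', 'C']
```

With `distance=1` only the direct interactants form the test set; raising the distance to 2 would also pull in indirect interactants (D and E in the toy network), which quickly enlarges the candidate list.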
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) (functional connectivity within a proteome)
http://string-db.org/
STRING is a database and web resource dedicated to protein–protein interactions, including both physical and functional interactions. It weights and integrates information from numerous sources, including experimental repositories, computational prediction methods and public text collections, thus acting as a meta-database that maps all interaction evidence onto a common set of genomes and proteins.
Version 8.0 of STRING covers about 2.5 million proteins from 630 organisms
Databases: MINT, HPRD, BIND, DIP, BioGRID, KEGG and Reactome, IntAct, EcoCyc, NCI-Nature Pathway Interaction Database and Gene Ontology (GO) protein complexes. SGD, OMIM, The Interactive Fly, and all abstracts from PubMed.
A shift of focus to systems biology in the “post-genomic” era
http://bioinfo.vanderbilt.edu/gotm/
http://bioinfo.vanderbilt.edu/gotm/GOTM_Manual.pdf
Bar graph
Pathway details
Input details
Pathway gene details (all genes in pathway)
The apoptosis pathway as described by KEGG
Underexpressed genes
Overexpressed genes
http://www.dbi.tju.edu/dbi/tools/paint/
TF networks (P.A.I.N.T)
SUSPECTS is a server designed to automate the first steps of the candidate gene approach.
http://www.genetics.med.ed.ac.uk/suspects/search.shtml
BRCA1
The 3D boxes represent genes. Higher, brighter boxes represent better (higher scoring) candidates. The width of a box corresponds to the number of different types of evidence that contribute to its score. If a box is blue then a potentially relevant PubMed abstract has been found.
http://www.genetics.med.ed.ac.uk/prospectr/
BRCA1:
PROSPECTR uses sequence features to rank genes in order of their likelihood of involvement in disease.