Upload
baba
View
91
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Introduction to Systems Biology 2012. Functional Enrichment Analysis :. making sense of big data. Aaron Brooks & Fang Yin Lo 8/21/2012. Insight. Experimental Design. Data analysis. Data collection. Mischel et al, 2004. From knots to knowledge. What is functional enrichment? - PowerPoint PPT Presentation
Citation preview
Functional Enrichment Analysis:
Aaron Brooks & Fang Yin Lo8/21/2012
Introduction to Systems Biology 2012
making sense of big data
Mischel et al, 2004
Experimental Design
Data collection
Data analysis
Insight
From knots to knowledge• What is functional enrichment?- Tools and caveats (e.g. DAVID and pvals)
• How can you apply these tools to large, complex analysis problems (i.e. automation)?
Why functional enrichment analysis?
Gene FunctionGeneGene
P(X q) q
m( k q
N m)( )k
N( )N: population sizem: # of positives in the populationk: # of drawsq: # of positives
Urn problem:
Total: 15 marbles (N):5 red (m)
10 black (N-m)
Draw 6 marbles (k)
How likely am I to draw 3 or more marbles?
Gene
Gene
GeneGene
CommonFunctions
CommonPathwayse.g. Glycolysis
e.g. sugar metabolism
Interesting glioblastoma study
The workflow of typical enrichment tools
Nucleic Acids Research, 2009, Vol. 37, No. 1 1–13
e.g. Gene Ontology (GO), KEGG Pathway, etc
e.g. GO Terms that are enriched in the input gene list
(the database for annotation, visualization and integrated discovery)
• Diverse, web-based functional analysis tool• Integrates a suite of databases and statistical tools
(GO, KEGG, Interpro, Disease)• User-friendly, • Problematic for large analysis problems (many
independent sets)
staRt your engineshttp://baliga.systemsbiology.net/events/sysbio/content/bicluster-307http://baliga.systemsbiology.net/events/sysbio/content/bicluster-353
What if you had many sets to
analyze?
AUTOMATION!!!
topGO• An R package that facilitates semi-automated enrichment
analysis for Gene Ontology• Three main Steps:
1. Prepare data- create a topGO object with list of genes identiers, gene-to-GO
annotations)
2. Run enrichment tests3. Display the results
Structured controlled vocabularies (ontologies) that describe relationships between gene products and their associated biological roles
• cellular components : the parts of a cell or its extracellular environment
• molecular functions: activities, such as catalytic or binding activities, that occur at the molecular level (e.g. catalytic activity, Toll receptor binding)
• biological processes: series of events accomplished by one or more ordered assemblies of molecular functions (e.g. signal transduction, pyrimidine metabolic process )
GO structure• Directed Acyclic Graph(DAG)
• Child terms are more specialized• Child can have more than one parent
Data preparation# Install topGO and Affymetrix Human Genome U133 Plus 2.0 Array
annotation data> source("http://bioconductor.org/biocLite.R")> biocLite("topGO")> biocLite("hgu133plus2.db")> geneSets # Input a list a genes
#load the library> library(topGO)> library(hgu133plus2.db)
#Boot the gaggle> library(gaggle)> gaggleInit()
### Initializing the analysis ###
# hgu133plus2ACCNUM: an R object that contains mappings between the manufacturers identifiers and gene names of Affymetrix Human Genome U133 Plus 2.0 Array
# all.genes: all background genes ( gene universe )
> all.genes <- ls(hgu133plus2ACCNUM)
Data preparation
Other annotation packages at Bioconductor
Other annotation packages can be found at: http://www.bioconductor.org/packages/release/data/annotation/
Data preparation: Input gene lists### Make gene lists #### We will make a list that includes two sets of genes of interest# Initialize the list:>glioblastoma.genes = list()
http://baliga.systemsbiology.net/events/sysbio/content/bicluster-307# broadcast 'bicluster 307 genes' to R>glioblastoma.genes[["bc307"]] = sapply(getNameList(),tolower)
Do the same for the other gene list: http://baliga.systemsbiology.net/events/sysbio/content/bicluster-353>glioblastoma.genes[["bc353"]] = sapply(getNameList(),tolower)
Data preparation: make topGO object## Analyze genes in bc353 first
> relevant.genes <- factor(as.integer(all.genes %in% glioblastoma.genes[["bc353"]]))
> names(relevant.genes) <- all.genes
# Construct the topGOdata object for automated analysis
>GOdata.BP <- new("topGOdata", ontology='BP', allGenes = relevant.genes,
annotationFun = annFUN.db, affyLib = 'hgu133plus2.db')
# ontology:'BP','MF; or 'CC'# allGenes: named vector of type numeric or factor. The names attribute contains the genes identifiers. The genes listed in this object define the gene universe.# annotationFun: function that maps gene identifiers to GO terms. # annFUN.db extracts the gene-to-GO mappings from the affyLib object# affyLib: character string containing the name of the Bioconductor annotaion package for a specific microarray chip.
Run Enrichment Analysis
> results <- runTest(GOdata.BP, algorithm = 'classic', statistic = 'fisher’)
Analysis of results: summary
# generate a summary of the enrichment analysis> results.table <- GenTable(GOdata.BP, results, topNodes =
length(results@score))
# How many GO terms were tested?> dim(results.table)[1]
# reduce results to GO terms passing Benjamini-Hochberg multiple hypothesis corrected pval <= 0.05, FDR <= 5%
>results.table.bh <- results.table[which(p.adjust(results.table[,"result1"],method="BH")<=0.05),]
Analysis of results: get significant GO terms
# reduce results to GO terms passing Benjamini-Hochberg multiple hypothesis corrected pval <= 0.05, FDR <= 5%
>results.table.bh <- results.table[which(p.adjust(results.table[,"result1"],method="BH")<=0.05),]
# How many terms are enriched?>dim(results.table.bh)[1]
# What are the top ten terms?>results.table.bh[1:10,]
Analysis of results: get genes in top GO terms
# Get all the genes annotated to a specific GO term of interest:>GOid.of.interest = results.table.bh[1,"GO.ID"]>all.term.genes = genesInTerm(GOdata.BP,GOid.of.interest)[[1]]
# Which of these genes is in the bicluster?>genes.of.interest <- intersect(glioblastoma.genes[["bc353"]],all.term.genes)
# print table with probe ID and gene symbol >gene.symbol= toTable(hgu133plus2SYMBOL[genes.of.interest]) # print table with probe ID and gene names >gene.name= toTable(hgu133plus2GENENAME[genes.of.interest]) # Combine the information of the genes, output to csvfile: >cbind(gene.symbol,gene.name[,2]) >write.csv(cbind(gene.symbol,gene.name[,2]), file =
"glioblastoma.genes_bc353_in_immune response.csv“)
Automation
results <- list()for( bc in names(glioblastoma.genes) ) { cat(paste("Computing functional enrichment for...",bc,"\n"))
relevant.genes <- factor(as.integer(all.genes %in% glioblastoma.genes[[bc]]))
names(relevant.genes) <- all.genes GOdata.BP <- new("topGOdata", ontology='BP', allGenes =
relevant.genes, annotationFun = annFUN.db, affyLib = 'hgu133plus2.db')
results[[bc]] <- GenTable(GOdata.BP,runTest(GOdata.BP, algorithm = 'classic', statistic = 'fisher'),topNodes = length(results@score))
}
Questions?
Other algorithms supported by topGO• Standard implementations of GO testing compute the
significance of a node independent of the significance of the neighboring nodes (‘classic’)
• Other algorithms take into considerations of the GO structure and try to find more specific GO terms (e.g.’elim’, ‘weight’, ‘weight01’,etc) Alexa et al. (2006). Bioinformatics (2006) 22 (13): 1600-1607.
Other algorithms supported by topGO# Try running other algorithms and compare the results:>r1.BP.elim = runTest(GOdata.BP, algorithm = 'elim', statistic = 'fisher')>r1.BP.weight = runTest(GOdata.BP, algorithm = 'weight', statistic = 'fisher‘)# This will take a while…
# After the runs are done, visually compare resulting p values from different algorithms:>pValue.classic <- score(r1.BP)>pValue.elim <- score(r1.BP.elim)[names(pValue.classic)]>pValue.weight <- score(r1.BP.weight)[names(pValue.classic)]>gstat <- termStat(GOdata.BP, names(pValue.classic))>gSize <- gstat$Annotated / max(gstat$Annotated) * 4>colMap <- function(x) { .col <- rep(rev(heat.colors(length(unique(x)))), time = table(x)) return(.col[match(1:length(x), order(x))])}>gCol <- colMap(gstat$Significant)>plot(pValue.classic, pValue.elim, xlab = "p-value classic", ylab = "p-value elim",pch = 19,
cex = gSize, col = gCol)
Broadcasting gene list to DAVID1. In R: > broadcast(geneSetGenes)
2.
3. Select target: DAVID
4. Broadcast to DAVID
Broadcasting gene list to DAVID
Broadcasting gene list to DAVID
Broadcasting gene list to DAVID
Gene list and population background being analyzed
Original database/resources where the terms orient
Enriched terms associate with input gene list
Modified Fisher Exact p-values
Corrected p-values
A group of terms having similar biological meaning due to sharing similar gene members
Clustering options and contingency
Some remaining challenges
Limitation of multiple testing correction on enrichment P-values • common multiple testing correction techniques maybe overly conservative approaches if there are thousands or
even more annotation terms involved in the analysis
Genome Inform. 2005;16:106-115.; Nucleic Acids Research, 2009, Vol. 37, No. 1 1–13
Cross-comparing enrichment analysis results derived from multiple gene lists• the size of the gene list impacts the absolute enrichment P-values, thereforedifficult to directly compare the absolute enrichment P-values across gene lists
Realistically positioning the role of enrichment P-values in the current data-mining environment:
• high-throughput enrichment data-mining environment is extremely complicated
• Variations of the user gene list size
• deviation of the number of genes associated with each annotation
• gene overlap between annotation
• incompleteness of annotation content
• strong connectivity/dependency among genes
• unbalanced distributions of annotation content
some may treat the resulting enrichment P-values as a scoring system that plays a advisory role
more of an exploratory procedure, with the aid of enrichment P-value, rather than a pure statistical solution.
the specificity of enrichment analysis is more impacted by non-statistical layers than it is by statistical methods alone
Nat. Protoc. 2008. doi: 10.1038/nprot.2008.211
Nucleic Acids Research, 2009, Vol. 37, No. 1 1–13
Some remaining challenges
Working with topGO data object
# work with Godata.BP# obtaining all genes> a = genes(GOdata.BP)> str(a) chr [1:31777] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at"
...# number of genes> numGenes(GOdata.BP)[1] 31777# The list of significant genes can be accessed using the method sigGenes()> sg = sigGenes(GOdata.BP)> str(sg) chr [1:22] "201291_s_at" "202095_s_at" "202589_at" "202705_at"
"203213_at" ...
# accessing information related to GO and its structure# which GO terms are available for analysis:> ug = usedGO(GOdata.BP)> str(ug) chr [1:10921] "GO:0000002" "GO:0000003" "GO:0000012" "GO:0000018" "GO:0000019" ...
# select some random GO terms: (1). count the number of annotated genes and (2) obtain their annotation> sel.terms <- sample(usedGO(GOdata.BP), 10)> sel.terms [1] "GO:0032913" "GO:0043372" "GO:0044259" "GO:0032700" "GO:0043122" "GO:2001141" "GO:0060587" [8] "GO:0071352" "GO:0007256" "GO:0051343“
# Check what are the genes annotated to a specific GO term:> genesInTerm(GOdata.BP, "GO:0032913" )$`GO:0032913`[1] "208650_s_at" "208651_x_at" "209771_x_at" "209772_s_at" "216379_x_at" "266_s_at"
# Number of genes annotated to the selected GO terms:> num.ann.genes <- countGenesInTerm(GOdata.BP, sel.terms)> num.ann.genesGO:0032913 GO:0043372 GO:0044259 GO:0032700 GO:0043122 GO:2001141 GO:0060587 GO:0071352 GO:0007256 6 36 111 12 454 6809 2 5 11 GO:0051343 2
Working with topGO data object
> ann.genes <- genesInTerm(GOdata.BP, sel.terms)> str(ann.genes)List of 10 $ GO:0032913: chr [1:6] "208650_s_at" "208651_x_at" "209771_x_at" "209772_s_at" ... $ GO:0043372: chr [1:36] "1554519_at" "1555689_at" "1565358_at" "1569748_at" ... $ GO:0044259: chr [1:111] "1554383_a_at" "1555540_at" "1555896_a_at" "1556499_s_at" ... $ GO:0032700: chr [1:12] "1552798_a_at" "1556190_s_at" "201300_s_at" "207160_at" ... $ GO:0043122: chr [1:454] "1552360_a_at" "1552703_s_at" "1552798_a_at" "1552804_a_at" ... $ GO:2001141: chr [1:6809] "121_at" "1316_at" "1405_i_at" "1487_at" ... $ GO:0060587: chr [1:2] "201525_at" "207092_at" $ GO:0071352: chr [1:5] "201940_at" "201941_at" "201942_s_at" "201943_s_at" ... $ GO:0007256: chr [1:11] "1558984_at" "203652_at" "206362_x_at" "207347_at" ... $ GO:0051343: chr [1:2] "207514_s_at" "214286_at"
Working with topGO data object
Analysis of results: get genes in top GO terms
#We can also look at multiple GO terms at the same time:> GOids.of.interest = results.table.bh[c(1:10),"GO.ID"]> all.term.genes = genesInTerm(GOdata.BP,GOids.of.interest)
# Which of these genes is in the bicluster?> genes.of.interest <- sapply(names(all.term.genes),function(x)
{intersect(all.term.genes[[x]],glioblastoma.genes[["bc353"]])})
# print table with probe ID and gene symbol:> geneSynmol.of.interest <- lapply(names(genes.of.interest),function(x)
{toTable(hgu133plus2SYMBOL[genes.of.interest[[x]]])})> names(geneSynmol.of.interest)<- GOids.of.interest