Simulation studies of module preservation:
Simulation study of weak module preservation
Peter Langfelder and Steve Horvath
October 25, 2010
Contents
1 Overview
1.a Setting up the R session
2 Data simulation
3 Module identification
4 Calculation of module preservation
5 Analysis of results
1 Overview
This tutorial presents a simulation study of module preservation in which we simulate a reference set with 20 modules of sizes around 200 profiles (“genes”), and a test set in which 10 of the 20 reference modules are preserved, and genes in the other 10 modules are simulated with independent random profiles (in the language of WGCNA these genes are simulated “grey”). Unlike in our other simulation studies, here genes in the preserved modules are simulated to be only very weakly co-expressed. In fact, we set up the parameters such that the standard module identification method in WGCNA does not find any modules in the test set; hence, cross-tabulation methods would by definition conclude that none of the modules are preserved. To give cross-tabulation methods a chance, we also employ Partitioning Around Medoids (PAM) with a fixed number of clusters to partition the test set into 20 clusters. We find that PAM is moderately successful in identifying the preserved modules. Lastly, we apply the function clusterRepro to this simulated data and find that the observed IGP (in-group proportion) is not very good at distinguishing the preserved and non-preserved modules.

We encourage readers unfamiliar with any of the functions used in this tutorial to type, in the active R session,
help(functionName)
(replace functionName with the actual name of the function) to get a detailed description of what the function does, what the input arguments mean, and what the output is.
1.a Setting up the R session
After starting R we execute a few commands to set the working directory and load the requisite packages:
# Display the current working directory
getwd();
# If necessary, change the path below to the directory where the data files are stored.
# "." means current directory. On Windows use a forward slash / instead of the usual \.
workingDir = ".";
setwd(workingDir);
# Load the packages WGCNA and cluster
library(WGCNA);
library(cluster);
# The following setting is important, do not omit.
options(stringsAsFactors = FALSE);
2 Data simulation
We simulate two data sets, each with 100 samples. First we set up simulation parameters such as module sizes. We choose the parameters such that in the reference data set the genes in each module are tightly co-expressed, but in the test set genes in each preserved module are only weakly co-expressed.
nSamples = 100;
nGenes = 5000;
nModules = c(20,20);
prop = seq(from = 0.044, to = 0.037, length.out = nModules[1]+1);
modProps = list(prop, prop);
nSets = 2;
# Here we set how tightly co-expressed the modules should be.
minCor = c(0.3, 0.05);
maxCor = c(1, 0.35);
eigengenes = list();
expr = list();
simLabels = list();
cutHeight = c(0.999, 0.9998);
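As a quick check of the module sizes these proportions imply, the following minimal sketch multiplies the proportions by the number of genes. It assumes, as we understand the simulateDatExpr documentation, that the last entry of the proportion vector refers to the "grey" (unassigned) genes rather than to a module; the rounding used here is our own and may differ slightly from what the simulation function does internally.

```r
# Parameters copied from the tutorial code above.
nGenes = 5000
nModules = 20
prop = seq(from = 0.044, to = 0.037, length.out = nModules + 1)

# Approximate sizes of the 20 modules (assumption: entry 21 is the grey proportion).
moduleSizes = round(prop[1:nModules] * nGenes)
range(moduleSizes)   # 187 220

# Total fraction of genes covered by the proportion vector (about 0.85).
sum(prop)
```

The sizes range from roughly 190 to 220 genes, consistent with the "around 200 profiles" stated in the overview.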
Next we simulate the data using the WGCNA function simulateDatExpr, called once for each data set. We define the list leaveOut, which tells the simulation function which modules should be left out in each of the data sets. In this case, we leave out half of the modules (the even-numbered ones) in the second data set. The seed eigengenes are simulated as independent random vectors. The modules in the test (second) data set are simulated to be very “loose”.
set.seed(1);
leaveOut = list(rep(FALSE, nModules[1]), rep(FALSE, nModules[1]));
leaveOut[[2]][c(1:(nModules[1]/2))*2] = TRUE
simOrder = list();
for (set in 1:nSets)
{
  eigengenes[[set]] = matrix(rnorm(nSamples * nModules[set]), nSamples, nModules[set])
  x = simulateDatExpr(eigengenes[[set]], nGenes, modProps[[set]],
                      minCor = minCor[set], maxCor = maxCor[set], signed = TRUE, backgroundNoise = 1.0,
                      leaveOut = leaveOut[[set]]);
  simLabels[[set]] = x$allLabels
  simOrder[[set]] = x$labelOrder
  expr[[set]] = list(data = x$datExpr);
  colnames(expr[[set]]$data) = spaste("Gene.", c(1:nGenes));
}
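The indexing used to build leaveOut can be checked in isolation. The sketch below uses only base R and repeats the indexing expression from the code above for the second data set; the variable names leftOut and kept are ours.

```r
nModules = 20

# Start with all modules kept, then mark every second module as left out.
leaveOut = rep(FALSE, nModules)
leaveOut[c(1:(nModules/2))*2] = TRUE

leftOut = which(leaveOut)    # even-numbered modules: 2, 4, ..., 20
kept = which(!leaveOut)      # odd-numbered modules: 1, 3, ..., 19
leftOut
kept
```

So in the test set the odd-numbered modules are the 10 preserved ones, and the even-numbered modules are the 10 whose genes are simulated as independent noise.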
3 Module identification
We now identify modules in each of the simulated data sets using the WGCNA function blockwiseModules.
mods = list();
# Soft-thresholding powers for network definition.
power = c(6, 4);
collectGarbage();
labels = list();
nn = if (interactive()) nSets else 1;
for (set in 1:nn)
{
  mods[[set]] = blockwiseModules(expr[[set]]$data, networkType = "signed hybrid", deepSplit = 1,
                                 detectCutHeight = cutHeight[set], TOMType = "none", power = power[set],
                                 numericLabels = TRUE, verbose = 4);
  labels[[set]] = matchLabels(mods[[set]]$colors, simLabels[[set]]);
  collectGarbage();
}
We also run Partitioning Around Medoids (PAM) on the data.
PAMlabels = matrix(0, nGenes, nSets)
for (set in 1:nSets)
{
  cr = cor(expr[[set]]$data);
  cr[cr<0] = 0;
  adj = cr^power[set];
  dist = as.dist(1-adj);
  PAMlabels[, set] = pam(dist, nModules[set], cluster.only = TRUE);
  PAMlabels[, set] = matchLabels(PAMlabels[, set], simLabels[[set]]);
  collectGarbage();
}
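The PAM step above can be tried on a much smaller problem. The following sketch follows the same recipe (correlation, negative values set to zero, soft-thresholding, conversion to a dissimilarity, PAM with a fixed number of clusters) on toy data with 3 modules of 20 genes each. The toy data, the names toyExpr, dissim and pamLabels, and the parameter values are assumptions chosen for illustration; they are not part of the tutorial's simulation.

```r
library(cluster)

set.seed(5)
nSamples = 50; nModules = 3; moduleSize = 20

# Toy expression data: each gene is its module eigengene plus independent noise.
eigengenes = matrix(rnorm(nSamples * nModules), nSamples, nModules)
trueLabels = rep(1:nModules, each = moduleSize)
toyExpr = eigengenes[, trueLabels] +
          matrix(rnorm(nSamples * nModules * moduleSize), nSamples)

# Same transformation as in the tutorial: keep positive correlations only,
# raise to a soft-thresholding power, and turn adjacency into dissimilarity.
power = 4
cr = cor(toyExpr)
cr[cr < 0] = 0
dissim = as.dist(1 - cr^power)

# PAM with a fixed number of clusters; cluster.only = TRUE returns just the labels.
pamLabels = pam(dissim, nModules, cluster.only = TRUE)

# Cross-tabulate simulated labels against PAM labels.
table(simulated = trueLabels, PAM = pamLabels)
```

Because PAM always returns exactly the requested number of clusters, it produces a partition even when the data contain no visible cluster structure, which is why it is used here to give cross-tabulation methods something to work with.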
How did module identification do? We plot the gene dendrograms with the simulated and identified module colors.
sizeGrWindow(10,7);
#pdf(file = "Plots/preserved-moduleDetectionFailed-dendrograms.pdf", width = 10, height = 7)
layout(matrix(c(1:5), 5, 1), heights = c(rep(c(0.8, 0.2), 2), 0.3));
setNames = c("Reference data set", "Test data set");
for (set in 1:nSets)
{
  if (set==1)
  {
    colors = labels2colors(cbind(labels[[1]], simLabels[[set]]))
    names = c("Inferred", "Simulated");
  } else {
    colors = labels2colors(cbind(PAMlabels[, set], simLabels[[set]]))
    names = c("PAM", "Simulated");
  }
  plotDendroAndColors(mods[[set]]$dendrograms[[1]],
                      colors, names,
                      dendroLabels = FALSE, hang = 0.03,
                      main = spaste(LETTERS[set], ". ",
                                    setNames[set], ": gene clustering tree and module colors"),
                      setLayout = FALSE, abHeight = cutHeight[set], cex.colorLabels = 1.2, cex.main = 1.5,
                      cex.lab = 1.2, cex.axis = 1.2);
}
The result is shown in Figure 1. In the test data set, hierarchical clustering did not identify any modules. That is because we have simulated the modules with very weak correlations.
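Why weak correlations leave no visible branches can be illustrated with base R alone. The sketch below simulates one tightly and one loosely co-expressed toy module and compares the heights at which average-linkage hierarchical clustering merges their genes. The helper functions simulateModule and medianMergeHeight, the signal strengths, and the module size are hypothetical choices made for this illustration; the dissimilarity mimics the one used for PAM above rather than the exact network construction in blockwiseModules.

```r
set.seed(4)
nSamples = 100; moduleSize = 30

# Toy module: a shared eigengene scaled by 'strength' plus independent noise.
simulateModule = function(strength)
{
  eigengene = rnorm(nSamples)
  noise = matrix(rnorm(nSamples * moduleSize), nSamples, moduleSize)
  strength * eigengene + noise
}

# Median merge height of an average-linkage tree built on 1 - adjacency.
medianMergeHeight = function(data, power)
{
  adj = pmax(cor(data), 0)^power
  tree = hclust(as.dist(1 - adj), method = "average")
  median(tree$height)
}

tight = medianMergeHeight(simulateModule(2), power = 6)
loose = medianMergeHeight(simulateModule(0.3), power = 4)
c(tight = tight, loose = loose)
```

In the loose module the merges happen very close to height 1, so its genes cannot be told apart from unrelated background genes by cutting the tree, whereas the tight module forms a clearly lower branch.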
[Figure 1, three panels. A. Reference data set: gene clustering tree and module colors (average-linkage dendrogram, y-axis: Height; color rows: Inferred, Simulated). B. Test data set: gene clustering tree and module colors (average-linkage dendrogram, y-axis: Height; color rows: PAM, Simulated). C. PAM vs. simulated module colors (color rows: PAM, Simulated).]
Figure 1: Module identification in the simulated data sets. In the reference set the hierarchical clustering (panel A) easily identifies the 20 modules as distinct branches. Simulated and identified module colors, shown below the dendrogram, show excellent agreement. In the test set (panel B) the hierarchical clustering did not identify any recognizable branches. The simulated and PAM colors, shown below the clustering tree, also do not show any apparent relationship to the dendrogram. Panel C shows a comparison of simulated module colors and PAM cluster labels. It is very difficult to argue that any of the modules in the test set are preserved.
4 Calculation of module preservation
Here we run the main module preservation function modulePreservation. After the calculation we save the results; if a re-analysis of previously calculated results is performed, one can simply read the results from disk, thus saving a lot of time.
names(expr) = c("Set1", "Set2");
labelList = list(labels[[1]], PAMlabels[, 2]);
names(labelList) = names(expr);
mp = modulePreservation(expr, labelList, networkType = "signed", nPermutations = 200, verbose = 3,
maxGoldModuleSize = 1000);
# Save the module preservation results as well as the PAM cluster labels
save(mp, PAMlabels, file = "preserved-moduleDetectionFailed-20Modules.RData");
If the module preservation results have been calculated previously, load the results from the disk:
load(file= "preserved-moduleDetectionFailed-20Modules.RData");
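Z-type statistics such as Zsummary standardize an observed preservation statistic by the mean and standard deviation of the same statistic under permutation. The following minimal sketch computes such a Z score for a deliberately simplified density statistic (the mean pairwise correlation within a module) on toy data. It is not the computation performed by modulePreservation; the toy data, the helper meanCor, and the choice of statistic are assumptions made purely for illustration.

```r
set.seed(1)
nSamples = 100; nGenes = 200; moduleSize = 20

# Toy test data: the first moduleSize genes share a weak common signal.
signal = rnorm(nSamples)
testData = matrix(rnorm(nSamples * nGenes), nSamples, nGenes)
testData[, 1:moduleSize] = testData[, 1:moduleSize] + 0.5 * signal

# Simplified density statistic: mean pairwise correlation within a gene set.
meanCor = function(data, genes)
{
  cr = cor(data[, genes])
  mean(cr[upper.tri(cr)])
}

observed = meanCor(testData, 1:moduleSize)

# Null distribution: the same statistic for random gene sets of equal size.
nPermutations = 200
permuted = replicate(nPermutations, meanCor(testData, sample(nGenes, moduleSize)))

# Permutation Z score: how many null standard deviations above the null mean.
Z = (observed - mean(permuted)) / sd(permuted)
Z
```

Even though the pairwise correlations inside the toy module are weak, averaging over all gene pairs makes the observed statistic stand out clearly from the permutation distribution. This is the intuition behind why network-based statistics can detect weakly preserved modules.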
Calculation of IGP in clusterRepro
Here we apply clusterRepro to the test set. We calculate the eigengenes of the reference modules in the test set and use them as the centroids in the IGP calculation.
# Need centroids for the new data set. Calculate module eigengenes.
MEs = moduleEigengenes(expr[[2]]$data, labels[[1]])$eigengenes
# Get rid of the grey eigengene
MEs = MEs[, -1]
doClusterRepro = TRUE
if (doClusterRepro)
{
  library(clusterRepro)
  rownames(MEs) = spaste("Sample.", c(1:nSamples));
  rownames(expr[[2]]$data) = spaste("Sample.", c(1:nSamples));
  set.seed(40);
  print(system.time( {
    cr = clusterRepro(as.matrix(MEs), expr[[2]]$data, 1000);
  } ));
  save(cr, file = "preserved-moduleDetectionFailed-20Modules-cr.RData");
}
If the clusterRepro results have been calculated previously, load the results from the disk:
load(file = "preserved-moduleDetectionFailed-20Modules-cr.RData");
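For readers unfamiliar with the in-group proportion, the sketch below illustrates the idea in base R: each item is assigned to the centroid it correlates with most strongly, and the IGP of a group is the fraction of its items whose nearest neighbour (here, the most highly correlated other item) belongs to the same group. This is a simplified illustration under our own assumptions, with toy data and hypothetical variable names; it is not the clusterRepro implementation and may differ from it in details.

```r
set.seed(2)
nSamples = 50; nItems = 60

# Toy data: three centroids and 20 noisy items scattered around each.
centroids = matrix(rnorm(nSamples * 3), nSamples, 3)
trueGroup = rep(1:3, each = nItems / 3)
items = centroids[, trueGroup] + matrix(rnorm(nSamples * nItems), nSamples, nItems)

# Assign each item to the centroid it correlates with most strongly.
assigned = apply(cor(items, centroids), 1, which.max)

# Nearest neighbour of each item by correlation, excluding the item itself.
itemCor = cor(items)
diag(itemCor) = -Inf
nearest = apply(itemCor, 1, which.max)

# IGP per group: fraction of items whose nearest neighbour is in the same group.
igp = tapply(assigned[nearest] == assigned, assigned, mean)
igp
```

An IGP near 1 indicates a tight, well-separated group. When within-group correlations are as weak as in this tutorial's test set, nearest neighbours are largely determined by noise, which is one plausible reason the observed IGP separates preserved from non-preserved modules poorly.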
5 Analysis of results
Here we look at how well each method did at identifying the 10 preserved modules in the “hopelessly noisy” test data. Since the modules all have very similar sizes, we do not plot results as a function of module size; rather, in each plot we simply order the modules by their corresponding preservation statistic and look for a clean separation of preserved and non-preserved modules.
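The notion of a "clean separation" can be made concrete with a small base R sketch. The statistic values below are hypothetical toy numbers, not results from this tutorial, and the variable names are ours.

```r
# Hypothetical toy values, NOT results from this tutorial.
preserved = rep(c(TRUE, FALSE), 5)
statistic = c(8.1, 0.4, 6.3, -0.2, 7.7, 1.1, 5.9, 0.0, 9.2, 0.7)

# Order modules from highest to lowest statistic, as in the plots.
ord = order(-statistic)
data.frame(rank = seq_along(ord), statistic = statistic[ord], preserved = preserved[ord])

# Separation is clean if every preserved module scores above every non-preserved one.
cleanSeparation = min(statistic[preserved]) > max(statistic[!preserved])
cleanSeparation   # TRUE for these toy values
```

In the plots, a clean separation shows up as all red (preserved) points lying to the left of, and above, all black (non-preserved) points.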
# How well can one distinguish preserved from non-preserved modules?
sizeGrWindow(10,8)
#pdf(file = "Plots/preserved-moduleDetectionFailed-20Modules-preservationSuccess.pdf", w= 10, h = 8);
presColor = c("red", "black")[as.numeric(leaveOut[[2]])+1];
# Set graphical parameters
par(mfrow = c(3,2)); par(mar = c(3.8, 3.8, 2, 0.5)); par(mgp = c(2.3, 0.7, 0));
cex.lab = 1.3; cex.axis = 1.3; cex.main = 1.4
# Module preservation: Zsummary scores
Zs =
mp$preservation$Z[[1]][[2]]$Zsummary[order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];
order = order(-Zs);
plot(Zs[order], col = presColor[order], cex.main=cex.main,
xlab = "Index", ylab = "Preservation Zsummary",cex.lab = cex.lab, cex.axis = cex.axis,
main = "A. Network-based preservation indices: Zsummary")
legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),
cex = cex.lab)
# Module preservation: psummary statistics
Zs = -mp$preservation$log.p[[1]][[2]]$log.psummary[
order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];
order = order(-Zs);
plot(Zs[order], col = presColor[order],
xlab = "Index", ylab = "-log10(psummary)", cex.lab = cex.lab, cex.axis = cex.axis, cex.main=cex.main,
main = "B. Network-based preservation indices: psummary")
legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),
cex = cex.lab)
abline(h=-log10(0.05), col = "blue");
abline(h=-log10(0.05/nModules[1]), col = "green");
# Co-clustering
cc = mp$accuracy$observed[[1]][[2]][-1, 'coClustering'];
order = order(-cc)
plot(cc[order], col = presColor[order], cex.main=cex.main,
xlab = "Index", ylab = "coClustering", cex.lab = cex.lab, cex.axis = cex.axis,
main = "D. Cross-tabulation with results of PAM: Co-clustering")
legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),
cex = cex.lab)
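# Cross-tabulation: Fisher p-value
# Note: tab is not defined earlier in this tutorial. We assume here that it is the overlap
# table of the reference module labels and the PAM labels in the test set.
tab = overlapTable(labels[[1]], PAMlabels[, 2]);
bestP = apply(tab$pTable[-1, ], 1, min); order = order(bestP)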
plot(-log10(pmin(rep(1, nModules[1]), bestP[order])), col = presColor[order], cex.main=cex.main,
xlab = "Index", ylab = "-log10(Overlap p-value)", cex.lab = cex.lab, cex.axis = cex.axis,
main = "C. Cross-tabulation with results of PAM: overlap p-value")
legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),
cex = cex.lab)
abline(h=-log10(0.05), col = "blue"); abline(h=-log10(0.05/nModules[1]), col = "green");
# clusterRepro: observed IGP
p = cr$Actual.IGP;
order = order(-p);
plot(p[order], col = presColor[order],
xlab = "Index", ylab = "IGP", cex.lab = cex.lab, cex.axis = cex.axis, cex.main=cex.main,
main = "E. clusterRepro: IGP")
legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),
cex = cex.lab)
# clusterRepro: permutation p-value
p = cr$p;
order = order(p);
plot(-log10(p+1e-4)[order], col = presColor[order], cex.main=cex.main,
xlab = "Index", ylab = "-log10(clusterRepro p-value)", cex.lab = cex.lab, cex.axis = cex.axis,
main = "F. clusterRepro: permutation p-value")
abline(h=-log10(0.05), col = "blue");
abline(h=-log10(0.05/nModules[1]), col = "green");
legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),
cex = cex.lab)
# If plotting into a pdf file, close it
dev.off();
The resulting plots are shown in Figure 2. The figure shows that network preservation statistics are in this case successful in reliably separating preserved and non-preserved modules. On the other hand, cross-tabulation and clusterRepro have only limited success; based on Bonferroni-corrected p-values, most of the preserved modules are called non-preserved.
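The cross-tabulation approach can be sketched in base R as follows: build the table of overlaps between reference modules and test clusters, compute a one-sided Fisher exact test p-value for each pair, and take the best (smallest) p-value for each reference module. The toy labels and the helper overlapP are assumptions made for illustration; the WGCNA implementation may differ in details.

```r
set.seed(3)
nGenes = 300

# Toy labels: test labels agree with reference labels for about half of the genes.
refLabels = rep(1:3, each = 100)
testLabels = refLabels
shuffle = sample(nGenes, 150)
testLabels[shuffle] = sample(1:3, 150, replace = TRUE)

# Overlap counts between reference modules (rows) and test clusters (columns).
counts = table(reference = refLabels, test = testLabels)
counts

# One-sided Fisher exact test p-value for the overlap of module r and cluster t.
overlapP = function(r, t)
  fisher.test(table(refLabels == r, testLabels == t), alternative = "greater")$p.value

pTable = outer(1:3, 1:3, Vectorize(overlapP))

# Best overlap p-value for each reference module.
bestP = apply(pTable, 1, min)
-log10(bestP)
```

This approach depends entirely on the quality of the clustering in the test set: if the test clusters are noisy, even a truly preserved module may have no significantly overlapping cluster.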
[Figure 2, six scatter plots with x-axis "Index" and a legend distinguishing non-preserved and preserved modules. A. Network-based preservation indices: Zsummary (y-axis: Preservation Zsummary). B. Network-based preservation indices: psummary (y-axis: -log10(psummary)). D. Cross-tabulation with results of PAM: Co-clustering (y-axis: coClustering). C. Cross-tabulation with results of PAM: overlap p-value (y-axis: -log10(Overlap p-value)). E. clusterRepro: IGP (y-axis: IGP). F. clusterRepro: permutation p-value (y-axis: -log10(clusterRepro p-value)).]
Figure 2: Success of several module preservation measures at distinguishing weakly preserved from non-preserved modules. In each plot, modules are ordered by the preservation statistic shown in the plot. Red color denotes preserved and black non-preserved modules. In the p-value plots (the right column), the blue line denotes the threshold p = 0.05, and the green line denotes the Bonferroni-corrected threshold p = 0.05/20 = 0.0025. In the clusterRepro p-value plot, we added 10^-4 to all p-values so that zero p-values become 10^-4 and fit into the plot. This figure shows that network preservation statistics are in this case successful in reliably separating preserved and non-preserved modules. On the other hand, cross-tabulation and clusterRepro have only limited success; based on Bonferroni-corrected p-values, most of the preserved modules are called non-preserved.
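The positions of the two horizontal threshold lines in the p-value plots follow directly from the plotting code. A short sketch of the arithmetic, using the same expressions as the abline calls (the variable names alpha and bonferroni are ours):

```r
nModules = 20
alpha = 0.05

# Bonferroni correction: divide the significance level by the number of modules tested.
bonferroni = alpha / nModules    # 0.0025

# Heights of the blue and green lines on the -log10 scale.
-log10(alpha)        # about 1.30
-log10(bonferroni)   # about 2.60
```

A module is called preserved at the Bonferroni-corrected level only if its point lies above the green line.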