
Simulation studies of module preservation:

Simulation study of weak module preservation

Peter Langfelder and Steve Horvath

October 25, 2010

Contents

1 Overview

1.a Setting up the R session

2 Data simulation

3 Module identification

4 Calculation of module preservation

5 Analysis of results

1 Overview

This tutorial presents a simulation study of module preservation. We simulate a reference set with 20 modules of around 200 profiles (“genes”) each, and a test set in which 10 of the 20 reference modules are preserved while genes in the other 10 modules are simulated with independent random profiles (in the language of WGCNA, these genes are simulated “grey”). Unlike in our other simulation studies, here genes in the preserved modules are simulated to be only very weakly co-expressed. In fact, we set up the parameters such that the standard module identification method in WGCNA does not find any modules in the test set; hence, cross-tabulation methods would by definition conclude that none of the modules are preserved. To give cross-tabulation methods a chance, we also employ Partitioning Around Medoids (PAM) with a fixed number of clusters to partition the test set into 20 clusters. We find that PAM is moderately successful in identifying the preserved modules. Lastly, we apply the function clusterRepro to the simulated data and find that the observed IGP is not very good at distinguishing the preserved from the non-preserved modules.

We encourage readers unfamiliar with any of the functions used in this tutorial to type, in the active R session,

help(functionName)

(replace functionName with the actual name of the function) to get a detailed description of what the function does, what the input arguments mean, and what the output is.

1.a Setting up the R session

After starting R we execute a few commands to set the working directory and load the requisite packages:

# Display the current working directory

getwd();

# If necessary, change the path below to the directory where the data files are stored.

# "." means current directory. On Windows use a forward slash / instead of the usual \.

workingDir = ".";

setwd(workingDir);


# Load the packages WGCNA and cluster

library(WGCNA);

library(cluster);

# The following setting is important, do not omit.

options(stringsAsFactors = FALSE);

2 Data simulation

We simulate two data sets, each with 100 samples. First we set up simulation parameters such as the module sizes. We also set the parameters such that in the reference data set the genes in each module are tightly co-expressed, but in the test set genes in each preserved module are only weakly co-expressed.

nSamples = 100;

nGenes = 5000;

nModules = c(20,20);

prop = seq(from = 0.044, to = 0.037, length.out = nModules[1]+1);

modProps = list(prop, prop);

nSets = 2;

# Here we set how tightly co-expressed the modules should be.

minCor = c(0.3, 0.05);

maxCor = c(1, 0.35);

eigengenes = list();

expr = list();

simLabels = list();

cutHeight = c(0.999, 0.9998);
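As a quick sanity check on these parameters, the following sketch converts the module proportions into approximate gene counts. It assumes, following the simulateDatExpr documentation, that the last entry of prop is the fraction of genes assigned to the grey (non-module) group. The names approxSizes and unassignedFraction are introduced here for illustration only.

```r
# Module proportions as defined above
nGenes = 5000;
nModules = c(20, 20);
prop = seq(from = 0.044, to = 0.037, length.out = nModules[1]+1);

# Approximate number of genes per entry of prop. The first 20 entries are modules;
# the last entry is assumed to be the grey fraction.
approxSizes = round(prop * nGenes);
print(range(approxSizes[1:nModules[1]]));

# Fraction of genes not covered by any entry of prop
unassignedFraction = 1 - sum(prop);
print(unassignedFraction);
```

Under this reading, the module sizes fall between roughly 185 and 220 genes, which matches the "around 200" quoted in the overview.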

Next we simulate the data using the WGCNA function simulateDatExpr, called once for each data set. We define the list leaveOut, which tells the simulation function which modules should be left out in each of the data sets. In this case, we leave out half of the modules (the even-numbered ones) in the second data set. The seed eigengenes are simulated as independent random vectors. The modules in the test (second) data set are simulated to be very “loose”.

set.seed(1);

leaveOut = list(rep(FALSE, nModules[1]), rep(FALSE, nModules[1]));

leaveOut[[2]][c(1:(nModules[1]/2))*2] = TRUE

simOrder = list();

for (set in 1:nSets)

{

eigengenes[[set]] = matrix(rnorm(nSamples * nModules[set]), nSamples, nModules[set])

x = simulateDatExpr(eigengenes[[set]], nGenes, modProps[[set]],

minCor = minCor[set], maxCor = maxCor[set], signed = TRUE, backgroundNoise = 1.0,

leaveOut = leaveOut[[set]]);

simLabels[[set]] = x$allLabels

simOrder[[set]] = x$labelOrder

expr[[set]] = list(data = x$datExpr);

colnames(expr[[set]]$data) = spaste("Gene.", c(1:nGenes));

}
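To see which modules the indexing expression marks as left out, the following self-contained snippet repeats the construction of leaveOut and lists the affected module indices. It only restates the code above; leftOutModules and preservedModules are names introduced here for illustration.

```r
nModules = c(20, 20);

# Same construction as in the simulation code
leaveOut = list(rep(FALSE, nModules[1]), rep(FALSE, nModules[1]));
leaveOut[[2]][c(1:(nModules[1]/2))*2] = TRUE

# Modules left out of (not preserved in) the test set
leftOutModules = which(leaveOut[[2]]);
print(leftOutModules);

# Modules that are simulated in both sets
preservedModules = which(!leaveOut[[2]]);
print(preservedModules);
```

The even-numbered modules are left out of the test set and the odd-numbered modules are preserved; nothing is left out of the reference set.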

3 Module identification

We now identify modules in each of the simulated data sets using the WGCNA function blockwiseModules.

mods = list();

# Soft-thresholding powers for network definition.

power = c(6, 4);

collectGarbage();

labels = list();


nn = if (interactive()) nSets else 1;

for (set in 1:nn)

{

mods[[set]] = blockwiseModules(expr[[set]]$data, networkType = "signed hybrid", deepSplit = 1,

detectCutHeight = cutHeight[set], TOMType = "none", power = power[set],

numericLabels = TRUE, verbose = 4);

labels[[set]] = matchLabels(mods[[set]]$colors, simLabels[[set]]);

collectGarbage();

}

We also run Partitioning Around Medoids (PAM) on the data.

PAMlabels = matrix(0, nGenes, nSets)

for (set in 1:nSets)

{

cr = cor(expr[[set]]$data);

cr[cr<0] = 0;

adj = cr^power[set];

dist = as.dist(1-adj);

PAMlabels[, set] = pam(dist, nModules[set], cluster.only = TRUE);

PAMlabels[, set] = matchLabels(PAMlabels[, set], simLabels[[set]]);

collectGarbage();

}
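The dissimilarity passed to PAM above is one minus a soft-thresholded correlation in which negative correlations are set to zero. The toy example below, on made-up data with four variables, illustrates how this transformation treats positive, negative, and near-zero correlations. The data, the noise level, and the power of 6 are chosen only for illustration.

```r
set.seed(1);
n = 50;
x = matrix(rnorm(n * 4), n, 4);
x[, 2] =  x[, 1] + 0.1 * rnorm(n);   # strongly positively correlated with column 1
x[, 3] = -x[, 1] + 0.1 * rnorm(n);   # strongly negatively correlated with column 1
# column 4 stays independent of the others

# Same steps as in the PAM code: drop negative correlations, raise to a power
cr = cor(x);
cr[cr<0] = 0;
adj = cr^6;
dissim = as.matrix(as.dist(1-adj));
print(round(dissim, 3));
```

Only the positively correlated pair ends up close together. The negatively correlated pair gets the maximum dissimilarity of 1, and the independent variable is nearly as far away.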

How did module identification do? We plot the gene dendrograms together with the simulated and identified module colors.

sizeGrWindow(10,7);

#pdf(file = "Plots/preserved-moduleDetectionFailed-dendrograms.pdf", width = 10, height = 7)

layout(matrix(c(1:5), 5, 1), heights = c(rep(c(0.8, 0.2), 2), 0.3));

setNames = c("Reference data set", "Test data set");

for (set in 1:nSets)

{

if (set==1)

{

colors = labels2colors(cbind(labels[[1]], simLabels[[set]]))

names = c("Inferred", "Simulated");

} else {

colors = labels2colors(cbind(PAMlabels[, set], simLabels[[set]]))

names = c("PAM", "Simulated");

}

plotDendroAndColors(mods[[set]]$dendrograms[[1]],

colors, names,

dendroLabels = FALSE, hang = 0.03,

main = spaste(LETTERS[set], ". ",

setNames[set], ": gene clustering tree and module colors"),

setLayout = FALSE, abHeight = cutHeight[set], cex.colorLabels = 1.2, cex.main = 1.5,

cex.lab = 1.2, cex.axis = 1.2);

}

The result is shown in Figure 1. In the test data set, hierarchical clustering did not identify any modules. That is because we have simulated the modules with very weak correlations.
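A rough calculation suggests why the weak correlations leave hierarchical clustering so little to work with. The sketch below raises the simulation bounds minCor and maxCor to the soft-thresholding powers used in each set. Note the simplifying assumption: these bounds apply to correlations between genes and their module eigengene, and we treat them here as if they were pairwise gene-gene correlations, which is optimistic. The names adjLow, adjHigh, and minDissim are introduced for illustration.

```r
# Simulation bounds and soft-thresholding powers from the code above
minCor = c(0.3, 0.05);
maxCor = c(1, 0.35);
power = c(6, 4);

# Adjacencies implied by the bounds (reference set first, test set second)
adjLow = minCor^power;
adjHigh = maxCor^power;
print(rbind(adjLow, adjHigh));

# Smallest dissimilarity (1 - adjacency) attainable under this simplification
minDissim = 1 - adjHigh;
print(minDissim);
```

Under this simplification, even the most strongly related genes in the test set have a dissimilarity of about 0.985, so almost all merges in the clustering tree happen very close to height 1.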

[Figure 1 image. Panel A: "Reference data set: gene clustering tree and module colors" (average-linkage dendrogram, vertical axis: Height; color rows: Inferred, Simulated). Panel B: "Test data set: gene clustering tree and module colors" (average-linkage dendrogram, vertical axis: Height; color rows: PAM, Simulated). Panel C: "PAM vs. simulated module colors" (color rows: PAM, Simulated).]

Figure 1: Module identification in the simulated data sets. In the reference set the hierarchical clustering (panel A) easily identifies the 20 modules as distinct branches. Simulated and identified module colors, shown below the dendrogram, show excellent agreement. In the test set (panel B) the hierarchical clustering did not identify any recognizable branches. The simulated and PAM colors, shown below the clustering tree, also do not show any apparent relationship to the dendrogram. Panel C shows a comparison of simulated module colors and PAM cluster labels. It is very difficult to argue that any of the modules in the test set are preserved.


4 Calculation of module preservation

Here we run the main module preservation function modulePreservation. After the calculation we save the results; if a re-analysis of previously calculated results is performed, one can simply read the results from disk, thus saving a lot of time.

names(expr) = c("Set1", "Set2");

labelList = list(labels[[1]], PAMlabels[, 2]);

names(labelList) = names(expr);

mp = modulePreservation(expr, labelList, networkType = "signed", nPermutations = 200, verbose = 3,

maxGoldModuleSize = 1000);

# Save the module preservation results as well as the PAM cluster labels

save(mp, PAMlabels, file = "preserved-moduleDetectionFailed-20Modules.RData");

If the module preservation results have been calculated previously, load the results from the disk:

load(file= "preserved-moduleDetectionFailed-20Modules.RData");
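The save-then-load workflow can be folded into a single compute-or-load step. The following is a minimal sketch of that pattern with a trivial stand-in computation; the file name, the variable result, and the use of a temporary directory are hypothetical and serve only to keep the example self-contained.

```r
# Hypothetical results file in a temporary directory
resultFile = file.path(tempdir(), "example-results.RData");

if (file.exists(resultFile))
{
  # Results were calculated previously: restore them from disk
  load(file = resultFile);
} else {
  # Stand-in for an expensive calculation such as modulePreservation
  result = sum(1:10);
  save(result, file = resultFile);
}
print(result);
```

In the tutorial code, the same structure would wrap the call to modulePreservation and the file "preserved-moduleDetectionFailed-20Modules.RData".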

Calculation of IGP in clusterRepro

Here we apply clusterRepro to the test set. We calculate the eigengenes of the reference modules in the test set and use them as the centroids in the IGP calculation.

# Need centroids for the new data set. Calculate module eigengenes.

MEs = moduleEigengenes(expr[[2]]$data, labels[[1]])$eigengenes

# Get rid of the grey eigengene

MEs = MEs[, -1]

doClusterRepro = TRUE

if (doClusterRepro)

{

library(clusterRepro)

rownames(MEs) = spaste("Sample.", c(1:nSamples));

rownames(expr[[2]]$data) = spaste("Sample.", c(1:nSamples));

set.seed(40);

print(system.time( {

cr = clusterRepro(as.matrix(MEs), expr[[2]]$data, 1000);

} ));

save(cr, file = "preserved-moduleDetectionFailed-20Modules-cr.RData");

}

If the clusterRepro results have been calculated previously, load the results from the disk:

load(file = "preserved-moduleDetectionFailed-20Modules-cr.RData");
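For readers unfamiliar with the in-group proportion (IGP), the sketch below illustrates the basic idea on toy data: for each group, it reports the fraction of items whose nearest neighbor (here, by Pearson correlation) belongs to the same group. This is a simplified illustration written for this tutorial, not the clusterRepro implementation; the function simpleIGP and the toy data are hypothetical.

```r
# Items are the columns of x; groups gives the group label of each item.
simpleIGP = function(x, groups)
{
  cr = cor(x);
  diag(cr) = -Inf;                       # an item is not its own neighbor
  nearest = apply(cr, 1, which.max);     # index of the most correlated other item
  groupIDs = sort(unique(groups));
  igp = sapply(groupIDs, function(g) mean(groups[nearest[groups == g]] == g));
  names(igp) = groupIDs;
  igp;
}

# Toy data: two tight groups of three items each
set.seed(1);
base1 = rnorm(50); base2 = rnorm(50);
x = cbind(base1 + 0.1*rnorm(50), base1 + 0.1*rnorm(50), base1 + 0.1*rnorm(50),
          base2 + 0.1*rnorm(50), base2 + 0.1*rnorm(50), base2 + 0.1*rnorm(50));
groups = c(1, 1, 1, 2, 2, 2);
toyIGP = simpleIGP(x, groups);
print(toyIGP);
```

Tight, well-separated groups give IGP values close to 1. In weakly co-expressed data such as the test set here, nearest neighbors are largely determined by noise, so the IGP carries little information.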

5 Analysis of results

Here we look at how well each method did at identifying the 10 preserved modules in the “hopelessly noisy” test data. Since the modules all have very similar sizes, we do not plot results as a function of module size; rather, in each plot we simply order the modules by their corresponding preservation statistic and look for a clean separation of preserved and non-preserved modules.
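The notion of a clean separation can be made concrete with a small helper. In the sketch below, isCleanlySeparated is a hypothetical function (not part of WGCNA) that checks whether every preserved module scores higher than every non-preserved module. The statistic values are made up purely to exercise the function and are not results of this study.

```r
# TRUE if the lowest-scoring preserved module still beats the
# highest-scoring non-preserved module.
isCleanlySeparated = function(stat, preserved)
{
  min(stat[preserved]) > max(stat[!preserved]);
}

# Made-up example: four modules, two of them preserved
preserved = c(TRUE, FALSE, TRUE, FALSE);
toyStatA = c(8.1, 0.4, 6.3, 1.2);   # preserved modules clearly on top
toyStatB = c(8.1, 0.4, 1.0, 1.2);   # one preserved module falls among the others

print(isCleanlySeparated(toyStatA, preserved));
print(isCleanlySeparated(toyStatB, preserved));
```

For statistics where smaller values indicate stronger preservation, such as raw p-values, one would apply the check to the negated or -log10-transformed values, as the plots below do.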

# How well can one distinguish preserved from non-preserved modules?

sizeGrWindow(10,8)

#pdf(file = "Plots/preserved-moduleDetectionFailed-20Modules-preservationSuccess.pdf", w= 10, h = 8);

presColor = c("red", "black")[as.numeric(leaveOut[[2]])+1];

# Set graphical parameters

par(mfrow = c(3,2)); par(mar = c(3.8, 3.8, 2, 0.5)); par(mgp = c(2.3, 0.7, 0));


cex.lab = 1.3; cex.axis = 1.3; cex.main = 1.4

# Module preservation: Zsummary scores

Zs =

mp$preservation$Z[[1]][[2]]$Zsummary[order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];

order = order(-Zs);

plot(Zs[order], col = presColor[order], cex.main=cex.main,

xlab = "Index", ylab = "Preservation Zsummary",cex.lab = cex.lab, cex.axis = cex.axis,

main = "A. Network-based preservation indices: Zsummary")

legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),

cex = cex.lab)

# Module preservation: psummary statistics

Zs = -mp$preservation$log.p[[1]][[2]]$log.psummary[

order(as.numeric(rownames(mp$preservation$Z[[1]][[2]])))][-c(1:2)];

order = order(-Zs);

plot(Zs[order], col = presColor[order],

xlab = "Index", ylab = "-log10(psummary)", cex.lab = cex.lab, cex.axis = cex.axis, cex.main=cex.main,

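main = "B. Network-based preservation indices: psummary")

legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),

cex = cex.lab)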

abline(h=-log10(0.05), col = "blue");

abline(h=-log10(0.05/nModules[1]), col = "green");

# Co-clustering

cc = mp$accuracy$observed[[1]][[2]][-1, "coClustering"];

order = order(-cc)

plot(cc[order], col = presColor[order], cex.main=cex.main,

xlab = "Index", ylab = "coClustering", cex.lab = cex.lab, cex.axis = cex.axis,

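main = "D. Cross-tabulation with results of PAM: Co-clustering")

legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),

cex = cex.lab)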

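# Cross-tabulation: Fisher exact test p-value

# NOTE: tab is not defined in the text as extracted. We assume here that it is the overlap table of

# the reference labels and the test-set PAM labels, as computed by the WGCNA function overlapTable.

tab = overlapTable(labels[[1]], PAMlabels[, 2]);

bestP = apply(tab$pTable[-1, ], 1, min); order = order(bestP)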

plot(-log10(pmin(rep(1, nModules[1]), bestP[order])), col = presColor[order], cex.main=cex.main,

xlab = "Index", ylab = "-log10(Overlap p-value)", cex.lab = cex.lab, cex.axis = cex.axis,

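main = "C. Cross-tabulation with results of PAM: overlap p-value")

legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),

cex = cex.lab)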

abline(h=-log10(0.05), col = "blue"); abline(h=-log10(0.05/nModules[1]), col = "green");

# clusterRepro: observed IGP

p = cr$Actual.IGP;

order = order(-p);

plot(p[order], col = presColor[order],

xlab = "Index", ylab = "IGP", cex.lab = cex.lab, cex.axis = cex.axis, cex.main=cex.main,

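main = "E. clusterRepro: IGP")

legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),

cex = cex.lab)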

# clusterRepro: permutation p-value

p = cr$p;

order = order(p);

plot(-log10(p+1e-4)[order], col = presColor[order], cex.main=cex.main,

xlab = "Index", ylab = "-log10(clusterRepro p-value)", cex.lab = cex.lab, cex.axis = cex.axis,

main = "F. clusterRepro: permutation p-value")

abline(h=-log10(0.05), col = "blue");

abline(h=-log10(0.05/nModules[1]), col = "green");

legend("topright", c("Non-preserved module", "Preserved module"), pch = 1, col = c("black", "red"),

cex = cex.lab)

# If plotting into a pdf file, close it

dev.off();

The resulting plots are shown in Figure 2. The figure shows that network preservation statistics are in this case successful in reliably separating preserved and non-preserved modules. On the other hand, cross-tabulation and clusterRepro have only limited success; based on Bonferroni-corrected p-values, most of the preserved modules are called non-preserved.
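For reference, the two horizontal lines drawn in the p-value plots correspond to the nominal and Bonferroni-corrected thresholds on the -log10 scale. The short calculation below reproduces their heights from the values used in the plotting code; the names nominal and bonferroni are introduced here for illustration.

```r
nModules = c(20, 20);

# Nominal threshold and its Bonferroni correction for 20 modules
nominal = 0.05;
bonferroni = nominal / nModules[1];

# Heights of the blue and green lines in the p-value plots
lineHeights = c(nominal = -log10(nominal), bonferroni = -log10(bonferroni));
print(lineHeights);
```

A module is called preserved at the Bonferroni-corrected level only if its point lies above the higher (green) line, at about 2.6.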

[Figure 2 image. Six scatter plots, each with horizontal axis "Index" (modules 1 to 20, ordered by the statistic shown) and a legend distinguishing "Non-preserved module" and "Preserved module". Left column: A. Network-based preservation indices: Zsummary (vertical axis: Preservation Zsummary); D. Cross-tabulation with results of PAM: Co-clustering (vertical axis: coClustering); E. clusterRepro: IGP (vertical axis: IGP). Right column: B. Network-based preservation indices: psummary (vertical axis: -log10(psummary)); C. Cross-tabulation with results of PAM: overlap p-value (vertical axis: -log10(Overlap p-value)); F. clusterRepro: permutation p-value (vertical axis: -log10(clusterRepro p-value)).]

Figure 2: Success of several module preservation measures at distinguishing weakly preserved from non-preserved modules. In each plot, modules are ordered by the preservation statistic shown in the plot. Red denotes preserved and black non-preserved modules. In the p-value plots (the right column), the blue line denotes the threshold p = 0.05, and the green line denotes the Bonferroni-corrected threshold p = 0.05/20. In the clusterRepro p-value plot, we added 10^-4 to all p-values so that zero p-values become 10^-4 and fit into the plot. This figure shows that network preservation statistics are in this case successful in reliably separating preserved and non-preserved modules. On the other hand, cross-tabulation and clusterRepro have only limited success; based on Bonferroni-corrected p-values, most of the preserved modules are called non-preserved.
