Lab: Using Graphs for Comparing Transcriptome and Interactome Data
S. Falcon, W. Huber, R. Gentleman
June 23, 2006
1 Introduction
In this lab we will use the graph, Rgraphviz, and RBGL Bioconductor packages to assess the relationship between sets of genes that cluster together based on expression values from microarray experiments (the transcriptome) and sets of genes whose proteins are known to physically interact (the interactome). The lab follows the discussion in section 22.2 of Gentleman et al. (2005), which in turn is based on Balasubramanian et al. (2004).
The approach we will take is to create two graphs, one where an edge represents the fact that two genes are in the same cluster and the other where an edge represents the fact that the two genes physically interact. If there is an association between clustering (which is based on gene expression) and physical interaction, then we anticipate that when there is an edge between two genes in one graph there will also be an edge between them in the other graph.
We test this hypothesis by creating the two graphs and counting how many edges they have in common. Then, we generate a reference distribution by permuting the node labels on one of the two graphs (either one is fine) and, for each permutation, counting the edges in common. Finally, we compare the observed number of common edges with the permutation distribution and conclude that there is a relationship if the observed number is large relative to the reference distribution.
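The procedure can be sketched on a toy example in base R. This is only an illustration under stated assumptions: the two five-node graphs, the adjacency matrices A and B, and the helper names commonEdges and permuteLabels are made up here and are not part of the lab's packages or data.

```r
# Toy illustration (hypothetical data): two undirected graphs on the same
# five nodes, stored as symmetric 0/1 adjacency matrices.
nodesToy <- c("a", "b", "c", "d", "e")
A <- matrix(0, 5, 5, dimnames = list(nodesToy, nodesToy))
B <- A
A["a", "b"] <- A["b", "c"] <- A["c", "d"] <- 1
B["a", "b"] <- B["b", "c"] <- B["d", "e"] <- 1
A <- A + t(A)   # fill in the lower triangle so the matrices are symmetric
B <- B + t(B)

# Number of edges present in both graphs (each undirected edge counted once).
commonEdges <- function(x, y) sum(x[upper.tri(x)] * y[upper.tri(y)])

# Randomly reassign the node labels of x while keeping its edge structure.
permuteLabels <- function(x) {
    p <- sample(nrow(x))
    xp <- x[p, p]
    dimnames(xp) <- dimnames(x)   # rows/columns are again in the order a..e
    xp
}

observed <- commonEdges(A, B)

# Reference distribution: common-edge counts after relabelling A at random.
set.seed(1)
nullDist <- replicate(200, commonEdges(permuteLabels(A), B))
```

Comparing observed with the values in nullDist is the toy analogue of the comparison carried out later in the lab with the real graphs.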
2 Required Packages
Exercise 1
Use the library function to load the following packages: Biobase, graph, Rgraphviz, RBGL, RColorBrewer, RbcBook1, and yeastExpData.
> library("Biobase")
> library("graph")
> library("Rgraphviz")
> library("RBGL")
> library("RbcBook1")
> library("yeastExpData")
3 The Data
The curated data in yeastExpData contains both gene expression data from a yeast cell-cycle experiment, including cluster membership (ccyclered), and protein-protein interaction (PPI) data extracted from published papers (litG).
> data(ccyclered)
> data(litG)
Exercise 2
What type of object is litG? How do you find out more about this class? Use the nodes method to extract the first five nodes of litG.
It is an instance of the graphNEL class. You can use the manual page, class?graphNEL, to find out more.
> nodes(litG)[1:5]
[1] "YBL072C" "YBL083C" "YBR009C" "YBR010W" "YBR031W"
Exercise 3
Explore the ccyclered data to determine what type of R object it is and what kind of data it contains.
The ccyclered data is stored in a data.frame. You should also look at the manual page.
> class(ccyclered)
> str(ccyclered)
> dim(ccyclered)
> names(ccyclered)
4 Graph Basics
The RBGL package has a number of different graph algorithms implemented. In these next few exercises we will see how to use a few of them. We will use the litG graph for our examples.
A graph can consist of one or more connected components. You can find them using the connectedComp function.
> cc1 = connectedComp(litG)
> length(cc1)
[1] 2642
> cc1lens = sapply(cc1, length)
> table(cc1lens)
cc1lens
   1    2    3    4    5    6    7    8   12   13   36   88
2587   29   10    7    1    1    2    1    1    1    1    1
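To make the idea of a connected component concrete, here is a minimal base R sketch on a made-up six-node graph. It does not use RBGL: the edge list and the helper toyComponents are hypothetical and only mimic the kind of result used in this lab, namely a list with one character vector of node names per component.

```r
# Hypothetical toy graph: n1-n2-n3 form a chain, n4-n5 a pair, n6 is alone.
toyNodes <- c("n1", "n2", "n3", "n4", "n5", "n6")
toyEdges <- rbind(c("n1", "n2"), c("n2", "n3"), c("n4", "n5"))

# Simple label propagation: every node starts with its own label, and the
# smaller label is copied across each edge until nothing changes any more.
toyComponents <- function(nodes, edges) {
    comp <- setNames(seq_along(nodes), nodes)
    repeat {
        changed <- FALSE
        for (i in seq_len(nrow(edges))) {
            lab <- min(comp[edges[i, ]])
            if (any(comp[edges[i, ]] != lab)) {
                comp[edges[i, ]] <- lab
                changed <- TRUE
            }
        }
        if (!changed) break
    }
    # One list element per component, holding the node names.
    unname(split(names(comp), comp))
}

toyCC <- toyComponents(toyNodes, toyEdges)
toyLens <- sapply(toyCC, length)   # component sizes
table(toyLens)                     # how many components of each size
```

The last two lines follow the same pattern as the sapply and table calls applied to cc1 above.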
Exercise 4
How many connected components are there? What is the size of the largest connected component? How many singletons are there? What are the elements (or values) stored in cc1? Create a subgraph of litG which has only the connected components of size greater than 4.
We first do the computations.
> cc1lens = sapply(cc1, length)
> good = cc1lens > 4
> whComp = cc1[good]
> sg1 = subGraph(unlist(whComp), litG)
> sg1
A graphNEL graph with undirected edges
Number of Nodes = 182
Number of Edges = 241
There are 2642 connected components, of which 2587 are singletons. The largest component is of size 88.
Let's plot the component of size 12 using the Rgraphviz package. We first compute the subgraph and then lay it out. There are many options for node color, line color and type, node shape, etc. If you are interested in more complex pictures you should read the vignette from the Rgraphviz package.
> sg12 = subGraph(cc1[cc1lens == 12][[1]], litG)
> l12 <- agopen(sg12, layoutType = "dot", nodeAttrs = makeNodeAttrs(sg12,
+ fillcolor = "steelblue2"), name = "")
> plot(l12)
Figure 1: The connected component of size 12.
Exercise 5
Lay out the graph using the two other layout types from Rgraphviz, namely neato and twopi.
You should plot these.
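> l12n <- agopen(sg12, layoutType = "neato", nodeAttrs = makeNodeAttrs(sg12,
+     fillcolor = "steelblue2"), name = "")
> l12t <- agopen(sg12, layoutType = "twopi", nodeAttrs = makeNodeAttrs(sg12,
+     fillcolor = "steelblue2"), name = "")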
Next let’s extract the largest component and use it to compute some other quantities. Here we compute the shortest path between a rather arbitrarily selected pair of nodes.
> sg88 = subGraph(cc1[cc1lens == 88][[1]], litG)
> nN = nodes(sg88)
> sps = sp.between(sg88, nN[1], nN[81])
Exercise 6
What sort of object is sps? What does the manual page say about it? Can you plot the graph and verify that this is indeed the shortest path? (You could color these nodes differently from the rest.)
The diameter of a graph is defined as the longest shortest path between any two nodes. To compute it we use the function johnson.all.pairs.sp.
> allp = johnson.all.pairs.sp(sg88)
Exercise 7
What type of object is allp? What data does it contain? What is the diameter of sg88?
allp is a matrix of the shortest path distances between all pairs of nodes. You can find the diameter by finding the maximum value in allp.
> max(allp)
[1] 13
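The definition of the diameter can be checked by hand on a small example. The sketch below uses a made-up five-node graph and a plain Floyd-Warshall loop in base R rather than johnson.all.pairs.sp; the matrix D, the node names, and the edge list are hypothetical. Taking the maximum only makes sense here because the toy graph is connected, so no distance is infinite.

```r
# Hypothetical toy graph: a chain a-b-c-d with an extra branch b-e.
toyNodes <- c("a", "b", "c", "d", "e")
toyEdges <- rbind(c("a", "b"), c("b", "c"), c("c", "d"), c("b", "e"))

# Distance matrix: 0 on the diagonal, 1 for an edge, Inf otherwise.
D <- matrix(Inf, 5, 5, dimnames = list(toyNodes, toyNodes))
diag(D) <- 0
for (i in seq_len(nrow(toyEdges))) {
    D[toyEdges[i, 1], toyEdges[i, 2]] <- 1
    D[toyEdges[i, 2], toyEdges[i, 1]] <- 1
}

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
for (k in toyNodes) for (i in toyNodes) for (j in toyNodes) {
    D[i, j] <- min(D[i, j], D[i, k] + D[k, j])
}

toyDiameter <- max(D)   # longest of all shortest paths
```

In this toy graph the longest shortest paths run from a to d and from e to d, each of length 3.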
5 The Analysis
We now return to the analysis we discussed at the beginning of this lab. As suggested above, we want to create two graphs, one that reflects protein interactions and one that reflects clustered gene expression. The first of these graphs is litG, and we will need to use the ccyclered data to create the second one.
Cho et al. (1998) discuss the k-means clustering of 2885 Saccharomyces genes into 30 clusters with measurements taken over two synchronized cell cycles. These data are stored as ccyclered and we next explain how to extract what is needed.
5.1 Cluster graph
The first step is to create a cluster graph from the ccyclered data in which edges are between all genes that are in the same cluster. The clusters are given in the first column (named Cluster) of the data.frame. There is a specialized graph class, clusterGraph, that can be used to represent clusters.
We need to compute the set of genes in each cluster, and you will do that by building a list where each entry contains the names of the genes in that cluster.
Exercise 8
Use the split function and the Y.name and Cluster columns of the ccyclered data frame to create a list that maps each cluster to the names of the genes it contains. Store the list in a variable named clusts.
> clusts <- split(ccyclered[["Y.name"]], ccyclered[["Cluster"]])
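To see what split produces, here is a small sketch on a made-up data frame. The column names Y.name and Cluster match those used above, but toyDF, its gene names, and its cluster labels are hypothetical.

```r
# Hypothetical stand-in for ccyclered: five genes assigned to three clusters.
toyDF <- data.frame(Y.name = c("g1", "g2", "g3", "g4", "g5"),
                    Cluster = c(1, 2, 1, 3, 2),
                    stringsAsFactors = FALSE)

# split() groups the first argument by the values of the second, returning
# a list with one element per cluster, named by the cluster label.
toyClusts <- split(toyDF[["Y.name"]], toyDF[["Cluster"]])
toyClusts[["1"]]   # the genes in cluster 1
```

The result is a named list with one character vector of gene names per cluster, which is the form of the clusts list built above.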
Next we use the clusts list from the previous exercise to create a clusterGraph instance using new:
> cg1 <- new("clusterGraph", clusters = clusts)
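Because a cluster graph joins every pair of genes within a cluster, a cluster with n genes contributes n(n-1)/2 edges. The following base R sketch counts edges for a made-up list of clusters; toyClusts and its gene names are hypothetical, and the count is worked out by hand rather than with the graph package.

```r
# Hypothetical clusters of sizes 3, 2 and 1.
toyClusts <- list("1" = c("g1", "g3", "g6"),
                  "2" = c("g2", "g5"),
                  "3" = "g4")

sizes <- sapply(toyClusts, length)
edgesPerCluster <- choose(sizes, 2)   # n * (n - 1) / 2 for each cluster
totalEdges <- sum(edgesPerCluster)

# The edges inside the first cluster, one pair of genes per row.
pairs1 <- t(combn(toyClusts[["1"]], 2))
```

Since no edge ever connects genes from different clusters, each cluster forms its own connected component in such a graph.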
Exercise 9
How many connected components does the cluster graph cg1 have? Hint: apropos("connect").
> ccClust <- connectedComp(cg1)
> length(ccClust)
5.2 PPI graph
We next turn our attention to a brief exploration of the literature-based PPI data stored as a graphNEL object.
Exercise 10
Store the connected components of litG in a variable called ccLit.
Exercise 11
a) Use listLen to compute the size of each connected component. Store the result in cclens.
b) Use table to summarize the sizes of the connected components stored in cclens.
c) How many singleton components are there?
> ccLit <- connectedComp(litG)
> cclens <- listLen(ccLit)
> table(cclens)
cclens
   1    2    3    4    5    6    7    8   12   13   36   88
2587   29   10    7    1    1    2    1    1    1    1    1
> nrSingletons <- table(cclens)["1"]
Creating an index vector that orders the connected components by size will allow us to easily access the smallest and largest components. In the example below, we list the genes in the eighth largest connected component.
> ord <- order(cclens, decreasing = TRUE)
> ccLit[[ord[8]]]
[1] "YMR080C" "YLL026W" "YJR132W" "YDR172W" "YNL112W"
[6] "YBR143C"
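The indexing pattern above may be easier to follow on a tiny example. In this sketch toyLens is a made-up named vector of component sizes; it is not derived from litG.

```r
# Hypothetical component sizes, named by component.
toyLens <- c(A = 1, B = 5, C = 2, D = 9, E = 1)

# order() returns positions, not sizes: toyOrd[1] is the position of the
# largest component, toyOrd[2] the position of the second largest, etc.
toyOrd <- order(toyLens, decreasing = TRUE)

names(toyLens)[toyOrd[1]]   # name of the largest component
toyLens[toyOrd[2]]          # size of the second largest component
```

Indexing the list of components with ord[k], as in ccLit[[ord[8]]], therefore retrieves the k-th largest component itself.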
Exercise 12
Use the subGraph method to create two new graphs, sG1 and sG2, containing the first and second largest connected components of the litG graph.
> sG1 <- subGraph(ccLit[[ord[1]]], litG)
> sG2 <- subGraph(ccLit[[ord[2]]], litG)
Now we plot sG1 and sG2 using Rgraphviz:
> lsG1 <- agopen(sG1, layoutType = "neato", nodeAttrs = makeNodeAttrs(sG1),
+ name = "")
> lsG2 <- agopen(sG2, layoutType = "neato", nodeAttrs = makeNodeAttrs(sG2,
+ fillcolor = "#a6cee3"), name = "")
> plot(lsG1)
> plot(lsG2)
Figure 2: The two largest PPI connected components.
5.3 Testing associations
It is now easy to determine how many pairs of genes have both a protein-protein interaction and are found in the same expression cluster. To compute this, find the intersection of the cluster graph and the literature graph using intersection:
> commonG <- intersection(cg1, litG)
Exercise 13
How many edges are common to the two graphs (cg1 and litG)?
Now we will try to determine whether the number of common edges is statistically interesting or not. We will do this by generating a null distribution via permutation of node labels on the observed graph.
Here is a function that can be used to generate values from the desired null distribution. Unfortunately, running this function with the current implementation is very slow.
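> nodePerm <- function(g1, g2, B = 1000) {
+     n1 <- nodes(g1)
+     # For each of B permutations: shuffle the node labels of g1, then
+     # count the edges that the relabelled graph shares with g2.
+     sapply(1:B, function(i) {
+         nodes(g1) <- sample(n1)
+         numEdges(intersection(g1, g2))
+     })
+ }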
Exercise 14
Describe what the nodePerm function is doing to make sure you understand how it works.
Since the nodePerm function is slow, we’ve computed 500 iterations ahead of time. Load the precomputed result as follows:
> data(nPdist)
> summary(nPdist)
Exercise 15
Plot the nPdist data and decide if the number of edges in common between litG and cg1 is statistically interesting.
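One way to make the comparison in Exercise 15 concrete is an empirical p-value: the proportion of permutation counts that are at least as large as the observed count. The sketch below uses made-up numbers, not the real nPdist or the real observed count; toyNull, toyObserved, and toyP are hypothetical names, and adding 1 to the numerator and denominator is a common convention rather than something this lab prescribes.

```r
# Hypothetical stand-ins: 500 simulated "permutation" counts and an
# "observed" count chosen to lie far in the upper tail.
set.seed(42)
toyNull <- rpois(500, lambda = 10)
toyObserved <- 40

# Empirical p-value: share of null values at least as large as the observed
# value, with 1 added top and bottom so that the estimate is never zero.
toyP <- (sum(toyNull >= toyObserved) + 1) / (length(toyNull) + 1)
```

When no permuted value reaches the observed count, the p-value takes its smallest possible value, 1/(B + 1) for B permutations, which is strong evidence against the null hypothesis of no association.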
6 Some harder problems
In this section we present some problems that are more open ended. They are not formally part of this lab, but are here for those who finish early, or who are particularly interested in these sorts of applications.
To answer the last two questions you will need to obtain a newer Bioconductor package called ScISI.
• Which of the expression clusters intersect with which of the literature clusters?
• Are there expression clusters that have a number of literature cluster edges going between them (hence suggesting that the expression clustering was too fine or that the genes involved in the literature cluster are not cell-cycle regulated)?
• Are there known cell-cycle regulated protein complexes, and do the genes involved tend to cluster together in both graphs?
• Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?
The version number of R and packages loaded for generating this document are:
Version 2.3.1 Patched (2006-06-08 r38315)
powerpc-apple-darwin8.6.0
attached base packages:
[1] "tools"     "methods"   "stats"     "graphics"
[5] "grDevices" "utils"     "datasets"  "base"
other attached packages:
yeastExpData     RbcBook1         RBGL    Rgraphviz
     "0.6.0"      "1.0.2"      "1.8.1"     "1.10.0"
       graph        Ruuid      Biobase
    "1.10.6"     "1.10.0"     "1.10.0"
References
R. Balasubramanian, T. LaFramboise, D. Scholtens, et al. A graph theoretic approach to testing associations between disparate sources of functional genomics data. Bioinformatics, 20:3353–3362, 2004.
R.J. Cho, M.J. Campbell, E.A. Winzeler, et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, 2:65–73, 1998.
R. Gentleman, W. Huber, V. Carey, R. Irizarry, and S. Dudoit, editors. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, 2005.