Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
http://www.psi.toronto.edu
Factorgrams: A tool for visualizing multi-wayassociations in biological data
Vincent Cheung, Inmar Givoni, Delbert Dueck, Brendan J. Frey
May 15, 2006 PSI TR 2006–44
Abstract
Effective visualization of biological data is often critical for subsequent analy-sis. The popular clustergram/dendrogram visualization rearranges rows andcolumns of a data matrix so as to highlight clusters of similar responses, butassumes each row or column belongs to only one cluster and cannot associateeach row or column with multiple clusters. Such multi-way associations oc-cur frequently, e.g., when a gene plays multiple biological roles. We describethe ’factorgram’ visualization, which rearranges the data into an expandedview, associating each row (or column) with multiple clusters of rows (orcolumns) and elucidating potentially new biological relationships. Factor-grams for mouse gene expression and yeast synthetic-lethal gene-interactiondatasets detect a larger number of statistically-significant clusters than clus-tergrams, plus a larger number of clusters enriched for gene ontology annota-tions. Experimentally-verified associations previously identified by manualrearrangement of rows and columns not grouped together by clustergrams,are readily identified by the factorgram.
Factorgrams: A tool for visualizing multi-way associations in biological data
Vincent Cheung1,2, Inmar Givoni1,2, Delbert Dueck1,2, Brendan J. Frey2,3,4
1. These authors contributed equally 2. Probabilistic and Statistical Inference Group, Departments of Electrical and
Computer Engineering and Computer Science, University of Toronto, 10 King’s College Rd., Toronto, ON, Canada, M5S 3G4
3. Banting and Best Department of Medical Research, University of Toronto, 112 College St., Toronto, ON, Canada, M5G 1L6
4. To whom correspondence should be addressed: Brendan J. Frey, [email protected], 416-978-7001 (office), 416-978-4425 (fax).
Effective visualization of biological data is often critical for subsequent analysis.
The popular clustergram/dendrogram visualization rearranges rows and columns of a
data matrix so as to highlight clusters of similar responses [1], but assumes each row
or column belongs to only one cluster and cannot associate each row or column with
multiple clusters. Such multi-way associations occur frequently, e.g., when a gene
plays multiple biological roles. We describe the ‘factorgram’ visualization, which
rearranges the data into an expanded view, associating each row (or column) with
multiple clusters of rows (or columns) and elucidating potentially new biological
relationships. Factorgrams for mouse gene expression and yeast synthetic-lethal gene-
interaction datasets detect a larger number of statistically-significant clusters than
clustergrams, plus a larger number of clusters enriched for gene ontology
annotations. Experimentally-verified associations previously identified by manual
rearrangement of rows and columns not grouped together by clustergrams, are
readily identified by the factorgram.
Factorgrams – Cheung et al.
2
High-throughput biological assays produce huge quantities of data, which can often be
organized into a matrix where rows and columns correspond to separate control variables,
and each element in the matrix is a binary- or real-valued measured response. For example,
the rows may correspond to microarray probes for genes, exons or tiled DNA segments, the
columns to time points or tissue samples, and each element to a real-valued transcript
abundance estimated using a microarray. Or, the rows and columns may correspond to
yeast genes and each element to a binary value indicating whether or not there is a
synthetic lethal interaction in the yeast strain with both genes deleted. While customized
techniques based on machine learning and statistical inference can be used to analyze this
kind of data, even simple rearrangements of the data or straightforward transformations of
the data can reveal biologically-significant patterns that are directly visualized in the
rearranged dataset. Currently, the most common visualization tool for biological matrix
data is the clustergram, in which the rows and the columns of the data matrix are re-ordered
using a clustering method such as hierarchical agglomerative clustering (HAC) [11] or k-
means clustering [12]. Similarities between nearby rows or columns may be indicated by
dendrogram. The rearranged matrix of data is shown as an image where the color at each
coordinate corresponds to the appropriate binary- or real-valued response [1].
A major deficiency of clustergrams is that rows (or columns) are grouped together
based on similarity of response across their entire columns (or rows). In many biological
assays, it is frequently the case that different biological processes will impact overlapping,
but not disjoint, sets of variables. For example, consider a yeast synthetic gene interaction
matrix in which rows correspond to gene deletion mutants of different non-essential yeast
genes, columns also correspond to gene deletion mutants of non-essential yeast genes, and
each element in the matrix measures the viability of a yeast strain with both genes knocked
out. If genes A and B exhibit similar viability patterns across a subset of other genes, they
can be grouped together, indicating that they may be part of the same functional pathway
involving all genes in the subset. However, gene B may additionally play a different
biological role as part of another pathway that does not involve gene A, but involves gene
C. Genes B and C may exhibit similar viability patterns across a different subset of genes,
namely those that are involved in the second pathway. In this case, there are two clusters
Factorgrams – Cheung et al.
3
corresponding to the two pathways and they overlap in the sense that they both contain
gene B. This creates a problem for clustering methods, which assume that clusters do not
overlap. Standard clustering techniques and the clustergram visualization do not provide a
general way to link gene B to both genes A and C, what we refer to as making ‘multi-way
associations’.
The failure of the clustergram to indicate multi-way associations is evident from the
data of Tong et al. [13], shown in Fig. 1A by ordering the rows and columns according to
the dendrograms produced by HAC (see Methods). Each row or column can only be
associated with nearby rows or columns, so overlapping clusters of genes (i.e., multi-way
associations) are not properly revealed. In Fig. 1B we highlight two examples of data
clusters that are significantly enriched for gene function annotations, but that were not
properly identified by hierarchical clustering and were thus broken apart in the clustergram.
In some cases, a row or column can be visually associated with two clusters if by
coincidence it is on the boundary between the two clusters or if a computational method is
used to move it to a position that is closer to another cluster. However, clustergrams cannot
generally reveal all such double-associations, because only a small number of cases can be
placed near cluster boundaries.
The ‘factorgram’ is a visualization that expands the input data matrix so as to enable
explicit visualization of data in overlapping clusters. We refer to each such overlapping
cluster as a ‘factor’, because techniques that explain each row or column of data as a
composition of multiple components are called ‘factorization’ methods in the machine
learning and statistics communities, and each component is called a ‘factor’ (c.f. [2]). The
output from a variety of factorization techniques [2-10] can be further processed to produce
the factorgram. The purpose of this paper is not to advocate a particular factorization
technique, but to illustrate how the factorgram visualization can be useful to biologists for
further analysis and detection of a larger number of statistically and biologically significant
associations than cannot be detected using clustergrams. Software for producing
factorgrams is available at http://www.psi.toronto.edu/factorgram.
Factorgrams – Cheung et al.
4
To give the reader a sense for how factorgrams work, in Fig. 2 we show a cartoon
illustrating one way to construct a factorgram. The clustergram of a synthetic binary data
matrix is shown in the upper-left corner, and includes blocks of data that have been broken
apart because specific rows and columns (shown with arrows) belong to multiple groups.
To produce the factorgram, the rows and columns are rearranged so that the dominant
factor appears as a contiguous block of data. This factor is then removed from the matrix
and placed in the factorgram. The rows and columns of the resulting matrix are again
rearranged so that the next dominant factor appears as a contiguous block of data. This
factor is removed and placed in the factorgram. This procedure is repeated until no more
significant factors remain in the data matrix. The factorgram is a figure showing the raw
data associated with all extracted factors and may additionally include appropriate row and
column labels. The factors may be shown along the diagonal of a matrix to reflect
similarities to the clustergram, or the factors can be arranged more compactly. Since the
same row (or column) can appear in multiple factors, the factorgram provides a way to
visualize clusters with overlapping variables.
Techniques for computing factorgrams should properly address how to efficiently
search over possible solutions and assess the statistical significance of the detected factors.
In contrast to clustergrams, where each row or column is associated with one cluster, in
factorgrams, each row or column can be associated with multiple factors through different
subsets of responses. This capability comes at the cost of an exponentially larger space of
possible solutions, since if there are k factors and each row or column can belong to n
factors at once, there are kn possible ways of assigning the row or column to the factors.
Approaches for efficiently searching over these assignments include computing linear
approximations and using more powerful probabilistic inference techniques. A binary
assignment variable with value 0 or 1 can be used to indicate whether or not a row or
column belongs to a factor. If these assignment variables are relaxed to be real-valued, data
elements are described by linear combinations of continuous variables. Continuous
solutions to the factorization problem can be computed using an eigen-vector method such
as principal components analysis [3] or an iterative technique such as factor analysis [2],
independent component analysis [4] and non-negative matrix factorization [5]. Once the
Factorgrams – Cheung et al.
5
real-valued ‘assignment’ variables have been computed, they can be thresholded to identify
components in the factorgram. One problem with these approaches is that a good binary
solution often cannot be obtained by thresholding the best continuous solution. An
alternative approach is to retain the binary representation of the assignment variables but
constrain the solution space. Bi-clustering [6] imposes the constraint that any two clusters
containing the same column (or row) must be defined by exactly the same set of columns
(or rows); this constraint makes extraction of factors easier, but is only appropriate when
different factors are defined by the same subsets of rows or columns.
Recently, methods have been proposed that retain the binary representation of the
assignment variables and avoid simplifying assumptions about the factors by using
techniques developed in the probabilistic and statistical inference communities to search for
the most probable setting of the assignment variables. The plaid method [7] works by
extracting one factor at a time using the linear approximation described above, but then
thresholds the assignment variables before proceeding to extract the next factor.
Probabilistic sparse matrix factorization [8,9] and matrix tile analysis [10] take a direct
approach and retain the binary representation, but find solutions by accounting for
uncertainties in the assignments and iteratively revisiting and refining factors until
convergence. The labelled latent Dirichlet allocation process analysis method [15] takes as
input the data matrix along with gene function annotations and performs Bayesian
inference in a hierarchical probabilistic graphical model to extract the factors.
The statistical significance of the factors in a factorgram can be assessed in a variety of
ways, but here we describe a general procedure that can be applied to any technique that
computes a factorgram. Denote the data matrix by X and the particular method used to
compute the factorgram by μ. We assume the method takes as input the data matrix and a
parameter θ that indirectly has an impact on the false detection rate. For example, θ could
be a real-valued threshold on a cost function that penalizes small factors (which are less
likely to be significant) but rewards factors containing highly similar data elements. For
method μ applied to data X using parameter θ, denote the number of extracted factors by
N(X,μ,θ). The null hypothesis is that a computed factor arose from random data with the
Factorgrams – Cheung et al.
6
same distribution as the source that produced X. Assuming the data matrix is large enough,
this distribution can be approximated by generating permuted data matrices (i.e., randomly
rearranging elements). Denote a data matrix obtained in this way by XR – since the data
elements were randomly rearranged, XR is a random variable. The expected number of
falsely detected factors can be estimated by E[N(XR,μ,θ)], where E[ ] denotes an
expectation w.r.t. the set of random matrices, XR. In this procedure, N(XR,μ,θ) is random
because XR is random, but also possibly because the method μ may produce different
solutions each time it is applied due to varying initial conditions.
We demonstrate the factorgram visualization using two publicly available datasets, the
yeast synthetic genetic array (SGA) dataset from Tong et al. [13] and the mouse gene
expression dataset from Zhang et al. [16]. The former is binary valued while the latter is
real-valued, and each dataset was factorized using a different method. However, the final
results are both readily visualized by the factorgram.
The SGA data is a binary-valued matrix of 135 by 1023 elements, where both rows
(query genes) and columns (array genes) correspond to yeast genes. Each element
represents the synthetic lethality of a double mutant whose corresponding row and column
genes have been knocked out. We applied HAC to both the rows and the columns of the
data matrix (see Methods) and in Fig. 1A we show the clustergram. The factorgram
generated from this data matrix (see Methods) contained 17 factors, 5.4 of which are
expected to be false detections based on the method described above, using ten random
permutation tests. The factorgram for the SGA data is shown in Fig. 3A, where each box
corresponds to one of the factors. While in the data matrix, only 3.35% of the gene pairs
exhibit synthetic lethality, in the identified set of factors 88.87% of the gene pairs exhibit
synthetic lethality. The identified factors account for 49.57% of the total number of
synthetic lethal interactions; 22.75% of the observed synthetic lethal interactions
correspond to array genes have three or less synthetic lethal interactions with query genes
and thus provide only weak evidence. In Fig. 3B, we show the clustergram obtained using
HAC (see Methods). We tried several clustering techniques and chose the one that
produced the largest number of biologically significant clusters (see below). In this figure,
Factorgrams – Cheung et al.
7
each element assigned to a factor is coloured so as to indicate the factor in Fig. 3A to which
it belongs. The clustergram breaks apart almost every identified factor, thus failing to
associate all genes within each factor.
An example of a well known biological pathway that is not readily identified in the
clustergram in Fig. 3A is the chitin synthase III pathway (SKT5, CHS3, CHS5, CHS7),
which is grouped together in the factorgram (dark green factor in Fig. 3). In some cases, the
clustergram correctly groups query genes, but fails to group array genes. For example,
query genes in the prefoldin complex (GIM3, GIM4, GIM5, PAC10, YKE2) were correctly
grouped in the clustergram, but corresponding array genes were not properly identified (red
factor in Fig. 3). The clustergram produced by Tong et al. used a different metric and
successfully grouped genes in the synthase III pathway, but failed to group together array
associated with the prefoldin complex. Instead, this complex was identified by laborious
manual rearrangement of the partial clusters. The large red factor shown in Fig. 3
successfully brings together similarities of array gene profiles and query gene profiles, thus
capturing both the row-based similarities and the column based similarities of the gene
profiles. Furthermore, the dark purple factor shows how the query genes of the prefoldin
complex are also similar based on their symmetric interaction as array genes, a relationship
that again was neither captured in our clustergram nor reported by Tong et al.
We next asked how many of the extracted factors are of biological significance in terms
of functional enrichment. We analysed whether the genes in each cluster were significantly
enriched for gene ontology (GO) annotations of biological process, molecular function, and
cellular component. We found 9 of the clusters to be enriched (p < 0.01) for at least one
annotation, with a total of 18 significant enrichments across all three categories (see
Methods). In comparison, when we used HAC to find the same number of clusters, we
were able to find only 4 clusters to be enriched for at least one annotation, with a total of
only 7 significant enrichments across all three categories. We tried a variety of different
clustering techniques, but these resulted in lower numbers of enriched clusters (data not
shown).
Factorgrams – Cheung et al.
8
The second dataset we analyzed and applied the factorgram visualization to is the
mouse mRNA expression dataset from Zhang et al. [16]. This dataset contains profiles for
over 22,709 known and predicted genes across 55 mouse tissues, organs, and cell types.
The factorgram generated from this dataset (see Methods) contained 37 factors and based
on the random permutation test we found that 0.5 factors are expected to be false
detections. The clustergram and factorgram are shown in Fig. 4. The factorgram correctly
identified many associations that are identified in the clustergram. For example, the cluster
containing nervous system tissues (inside the dark green bounding box in Fig. 4A) is
contained within the factor shown in Fig. 4D. Fig. 4C shows an example where the
discovered factor has been severely broken apart in the clustergram. This factor includes
genes that are expressed in both mature and embryonic nervous system tissues, and
includes genes not present in the previously described factor. Both of these factors have
statistically significant (p < 0.05) enrichment in gene annotations (see Methods).
Many factors reveal profile similarities across tissues which are not obviously similar,
and may seem surprising at first. However, this is one of the advantages of factorization
methods and of the factorgram, as they have the potential to enable the researcher to
scrutinize more interesting relationships. To answer the question of whether the additional
relationships detected in the factorgram are of biological significance, we analyzed the
factors for enrichment in GO biological process (GO-BP) annotations (see Methods). We
studied how many biologically significant groups were revealed in the factorgram
compared to other techniques for a fixed number of detected groups. In this case, parameter
θ controls the number of factors discovered in the data. We varied θ from 20 to 60 and
compared the genes in each factor with GO-BP annotations. Table 1 reports the number of
statistically significant factors (p < 0.05) for each θ-value; we also report corresponding
results for HAC. In general, the factogram reveals a larger number of significantly enriched
clusters.
Factorgrams – Cheung et al.
9
Table 1: Comparison of number of clusters or factors enriched for GO-
BP annotations for different numbers of extracted clusters/factors.
Number of clusters/factors, θ
Number of enriched factors in factorgram
Number of enriched clusters in clustergram
20 16 13 30 21 15 40 26 17 50 28 19 60 32 20
Just as Eisen et al. [1] demonstrated that the clustergram provides a visual tool enabling
biologists to gain leads to interesting associations, our goal in writing this paper was to
demonstrate the advantages of the factorgram visualization in revealing multi-way
associations and enabling biologists to detect multiple associations. Unlike clustergrams,
where the number of salient associations is limited by the two-dimensional arrangement of
the data, factorgrams extract many two-dimensional arrangements so as to identify multiple
associations. We are not advocating a specific computational technique for factorizing the
data matrix, but are instead introducing and advocating the factorgram as a way to visualize
the outputs from a variety of computational techniques [2-10], each of which is appropriate
in specific situations. Factorgrams can be used to visualize the output of these techniques,
regardless of whether each element in the generated factors belongs to only one factor or to
multiple factors. In our experiments on synthetic lethal yeast gene interaction data and
mouse gene expression data, we found that factorgrams reveal a larger number of
statistically significant and biologically significant clusters compared to the number
revealed in clustergrams by HAC.
As a general tool for analyzing matrices of data, the factorgram can potentially have
broad application in detecting associations in biological data. The factorgram has been
recently applied to chemical-genetic interaction data in yeast to identify associations
between chemical compounds and genes [17]. It has also been used to identify relationships
between hundreds of genes involved in transcription [18]. New large-scale microarray
datasets for the study of mammalian alternative splicing have also recently been analyzed
Factorgrams – Cheung et al.
10
using factorgrams [19]. Software for automatically producing factorgrams is available at
http://www.psi.toronto.edu/factorgram. Due to its simplicity, ease of computation, and
potential for revealing novel biological insights, we believe the factorgram will prove to be
a useful tool for visualizing multi-way associations in high-throughput biological data.
Methods
1. Yeast gene deletion data analysis.
The clustergram was produced using the MATLAB implementation of hierarchical
agglomerative clustering (HAC). We used Hamming distance, Euclidean distance,
correlation and city-block pseudo-distance functions with average and single linkage and
report results for Hamming distance with average linkage, which obtained the largest
number of clusters enriched for GO annotations.
The factorgram was produced using ‘matrix tile analysis’ based on an iterated
conditional modes technique [10]. Matrix tile analysis takes as input a matrix, where each
element is the log-ratio of probabilities that the element belongs in a factor and does not
belong in a factor. If a synthetic lethal interaction was observed between two pairs of genes
in the dataset of Tong et al. [13], we set the log-ratio to be log((1-ε)/ε) and otherwise we set
it to be log(ε/(1-ε)), where ε is the noise probability set to 0.0335, based on the average
number of observed synthetic lethal interactions. The parameter θ, which indirectly
determines the false detection rate by weighting the benefit of explaining the data to the
cost of introducing additional factors, was set to 0.15.
2. Mouse gene expression data analysis.
The clustergram was produced using the MATLAB implementation of HAC with
Pearson correlation and average linkage, which was selected for visualizing results by
Zhang et al. [16].
For the factorgram, we tried a different factorization technique from above called
‘probabilistic sparse matrix factorization’ (PSMF) [8,9]. Each expression measurement of a
Factorgrams – Cheung et al.
11
gene in each tissue in the input matrix is represented by the arcsinh (approximately the
logarithm) of the ratio of normalized intensity of the gene in the given tissue to the gene’s
median normalized intensity across all 55 tissues. Observing that the majority of genes
expressed in any tissue were expressed in less than half of the tissues, Zhang et al. set ratios
less than one equal to one, reasoning that these ratios represent noise rather than down-
regulation thus the data is entirely non-negative.
PSMF discovers a large predetermined number of factors (set to 50 here) and then
prunes them until every factor has a standard deviation less than an input threshold, θ,
which was set to 9.5. To determine the expected number of false detections, we reran the
analysis 135 times on randomly permuted data and computed the expected number of false
detections, which was 0.5. Based on the permutation tests, we also estimate the distribution
of factor element intensities for non significant factors (as those expected to be found on
permuted data) and discard of any factor elements found in the unpermuted data which
have an intensity below that of the 97.5th percentile of permutated factor elements. For
visualization purposes, Fig. 4B shows only those genes with individual reconstruction error
of less than θ.
3. Gene ontology (GO) enrichment analysis
GO annotation labels for the yeast genes were retrieved from Saccharomyces Genome
Database at ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/.
The functional category labels for the genes with known biological function in the Zhang et
al. database were derived from Gene Ontology Biological Process (GO-BP) category labels
assigned to genes by the European Bioinformatics Institute and Mouse Genome
Informatics. Bonferroni-corrected p-values for each factor/cluster were computed using the
hyper-geometric distribution, testing the probability of observing by chance the overlap of
a subset of genes in each factor or cluster with a subset of genes sharing a specific GO
annotation. We assign p-values for each factor/cluster by comparing it with all GO
annotations and choosing the most significant p-values.
Factorgrams – Cheung et al.
12
Acknowledgements
We thank C Boone, M Escobar, J Greenblatt, N Krogan, QD Morris and S Roweis for
helpful conversations. This work was supported by NSERC and CIAR.
Correspondence and requests for materials should be addressed to Brendan Frey at
Figure Legends
Figure 1. (A) A clustergram of the 135 x 1023 yeast gene interaction data from Tong et al.
[13], where rows correspond to ‘query’ genes, columns correspond to ‘array’ genes and
each data element is white if a double knockout of the two genes is lethal. For visual
clarity, only columns with a non-negligible number of interactions are shown. (B)
Synthetic lethal interactions in two groups of data that we detected and are significantly
enriched for gene function annotations are coloured; the remaining interactions are shown
in grey. The clustergram cannot associate genes in multiple ways, so these groups are
broken apart. For example, the array genes (columns) corresponding to the data shown in
red all have synthetic lethal interactions with query genes (rows) GIM3, GIM4, GIM5,
PAC10 and YKE2 (the prefoldin complex), but this association is broken apart because
some of the array genes also have synthetic lethal interactions with query genes BIM1,
CTF4, KAR3 and CIN8, while others do not.
Figure 2. A cartoon illustration of one way a factorgram can be constructed. The
clustergram of a binary data matrix is shown in the upper-left corner. The small arrows
indicate rows and columns having multi-way associations not evident in the clustergram.
The factorgram is created by recursively reordering rows and columns to identify a block
of data and then extracting the block of data. Each block is called a ‘factor’ to differentiate
it from a ‘cluster’, because more than one factor can contain the same row or column, e.g.,
the red factor and the green factor both include rows 15-19. The factorgram is a
visualization of the raw data in the form of factors that may be placed in an expanded
Factorgrams – Cheung et al.
13
matrix (where multiple rows or columns can have the same label) or shown as a collection
of sub-matrices with labelled rows and columns.
Figure 3. (A) A factorgram of the yeast genetic interaction data from Fig. 1, where each
factor has been colour-coded and non-synthetic lethal interactions are shown in black. (B)
The continuation of Fig. 1B, where each observed synthetic lethal interaction has been
coloured according to the colour code from A. Almost all factors are broken into multiple
parts by the clustergram.
Figure 4. (A) A clustergram of the 22,709 gene x 55 tissue mouse microarray dataset from
Zhang et al. [16]. (B) Factorgram visualization with the factors placed in a diagonal
pattern. Two of the factors are enlarged in (C) and (D), and the general areas in the data set
from which they came are indicated. Each factor is composed of a subset of rows and
columns of the data, and only in certain cases do these subsets correspond to rows and
columns that are adjacent in the clustergram. When this is not the case, the clustergram has
broken apart the factor, as in (C).
References 1. MV Eisen, PT Spellman, PO Brown, D Botstein. Cluster analysis and display of
genome-wide expression patterns. Proc National Acad Sci 95, 14863-14868 (1999).
2. D Rubin, D Thayer. EM algorithms for ML factor analysis. Psychometrika 47:1, 69-76 (1982).
3. IT Jolliffe. Principal Component Analysis. Springer-Verlag, New York NY (1986).
4. AJ Bell, TJ Sejnowski. An information maximization approach to blind separation and blind deconvolution. Neur Comp 7, 1129-1159 (1995).
5. D Lee, S Seung. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788-791 (1999).
6. Y Cheng, GM Church. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 8, 93-103 (2000).
7. L Lazzeroni, AB Owen. Plaid models for gene expression data. Stat Sin 12, 61-86 (2002).
8. D Dueck, J Huang, QD Morris, BJ Frey. Iterative analysis of microarray data. Proc 42nd Allerton Conf Communication, Control and Computing, Champaign-Urbana, IL (2004).
Factorgrams – Cheung et al.
14
9. D Dueck, QD Morris, BJ Frey. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Proc Int Conf Intell Syst Mol Biol 13, Bioinformatics 21 (Suppl 1), i144-i151 (2005).
10. I Givoni et al. Matrix tile analysis, in press (2006).
11. RR Sokal, CD Michener. A statistical method for evaluating systematic relationships. Univ Kans Sci Bull 38, 1409-1438 (1958).
12. SP Lloyd. Least squares quantization in PCM. IEEE Trans Info Theory 47, 129 (2001).
13. AH Tong et al. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294, 2364-2368 (2001).
14. O Alter et al. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97, 10101-10106 (2000).
15. P Flaherty, G Giaever, J Kumm, MI Jordan et al. A latent variable model for chemogenomic profiling. Bioinf 21, 3286-3293 (2005).
16. W Zhang et al. The functional landscape of mouse gene expression. J Biol 3, 21.1-21.22 (2004).
17. AB Parsons et al. Exploring the mode-of-action of bioactive compounds by chemical-genetic profiling in yeast. Cell, in press (2006).
18. N Krogan et al. A global view of transcriptional pathways in yeast. Under review (2006).
19. BJ Blencowe et al. Coordinated alternative splicing in functionally-associated mammalian genes. Under review (2006).
Figure 1A, Cheung et al
RVS161RVS167ARC40ARP2CNE1
ECM15ERV29
KTR3LRE1
GS C2MNL1
MNS1YLR057W
CTS1HST1
RA S2YUR1
RAD24DIE2ALG6
YKL037WSHS1SK N1
SVP26MMS4
MUS81HST3MID2LIA1
ALG8KRE9SMY1
RTT107RA D9DDC1
RAD52ROT2
CWH41HK R1CDC2
CTI6DY N2
PAC11DY N3DY N1ARP1PA C1JNM1
PA C2CIN2BFA1CIN4RBL2KIP2TUB3
BB C1YB R094W
TUB2RRM3TOP1KIP3CLB4
AS E1KA R9CIN1
LAS21CHS6BNI4
YDR332WMCD1CHL1
MA D2HOG1SGS1MRC1TOF1
POL32HP R5ELG1
ERG11NUM1
NIP100NB P2DBF4
CDC8CDC42
BIK1KRE11KRE6SLT2
CS M3ARL1ARL3
GY P1SET2
RAD27RAD50CNB1SKT5
CHS7CHS3HOC1KRE1KNH1ES C2SWF1CDC7GA S1ARP6BNI1
MY O2PHO85DE P1
SAP30CIN8
KA R3CLA4
DCC1CTF18CTF8CTF4
CHS5CHS1SMI1FK S1
CDC45BIM1
CDC73GIM4GIM3
PAC10GIM5
YK E2RIC1YPT6
HS
L1TS
A1
PM
T2K
EX
2V
PS
1V
MA
21R
AD
61IN
P52
HIR
1C
HK
1M
EC
3H
IR3
SP
T8IR
A2
SC
S7
UM
E6
BO
I2C
PR
6Y
LR21
7WM
ID1
KR
E11
ER
D1
CH
O2
YB
R09
4WY
BR
099C
MU
S81
MM
S4
TOP
1N
CS
6E
LP2
ELP
3Y
GR
228W
ELP
4Y
PL1
02C
IKI3
SW
D3
BR
E2
DO
T1S
WI3
VP
S24
YTA
12B
TS1
PE
X6
SR
O9
GE
T1G
SG
1G
LO3
YN
L296
WC
OG
7C
OG
5TL
G2
GO
S1
CO
G6
CO
G8
VA
M10
VP
S5
GY
P1
VP
S17
MO
N2
ISC
1M
UM
2C
DC
26M
CK
1A
RD
1Y
DJ1
NC
L1A
HC
1A
LT1
RP
S22
AH
XK
2E
AF3
DS
E3
PN
T1P
AN
3P
RK
1P
IN4
SE
C72
MM
S2
DN
M1
MM
M1
YLL
007C
PM
P3
RIM
13M
ED
1C
LB3
SS
D1
HA
P5
WS
S1
YB
R25
5WS
DS
3B
BC
1M
YO
5Y
TA7
VIK
1B
MH
1R
TT10
3Y
DR
360W
RIM
21S
IF2
VP
S72
HO
S2
YN
L140
CS
TB5
YK
L118
WV
MA
6S
EC
66H
OC
1S
TE24
CP
R7
SE
M1
IES
2LE
O1
XR
S2
RTT
109
CY
K3
SH
S1
RP
N4
VP
S8
VP
S38
VP
S35
DR
S2
PE
A2
YD
R13
6CY
LR26
1CR
GP
1R
IC1
YP
T6R
PA
34A
RC
18FK
S1
TUS
1TP
M1
PA
T1P
RE
9IX
R1
ES
C2
RTT
107
YM
L095
C-A
RP
O41
HC
M1
RP
N10
AS
F1N
UP
60U
BP
6R
RM
3S
OD
1S
LM3
FPS
1P
TC1
CN
B1
NB
P2
VP
S51
ED
E1
RV
S16
7Y
LR33
8W ILM
1S
AC
6S
HE
4C
CW
12Y
LR11
1WP
MR
1H
UR
1D
EP
1C
AP
2C
AP
1M
RC
1R
AD
57S
LX8
RA
D5
RA
D55
RA
D51
RA
D54
HP
R5
RA
D50
RA
D9
RA
D17
RA
D24
DD
C1
CC
S1
SLK
19C
LB2
LTE
1S
AM
37B
RE
1O
PI3
GE
T2S
KT5
CH
S7
CH
S3
BN
I4C
HS
5Y
BL0
62W
CH
S6
BE
M1
EM
P24
YN
L171
CP
ER
1V
PS
29G
UP
1G
AS
1C
SF1
VP
S21
MR
E11
KR
E1
VID
22S
OH
1R
TF1
LGE
1S
EC
22C
IK1
MN
N11
SU
M1
RU
D3
SP
F1V
RP
1IM
L3C
HL4
CTF
19M
CM
22C
TF3
MC
M21
MC
M16
YP
L017
CB
UB
3C
HL1
CS
M1
YD
R14
9CP
AC
1LD
B18
NU
M1
PA
C11
DY
N1
AR
P1
DY
N2
KIP
2JN
M1
DY
N3
NIP
100
ELM
1U
BA
4E
LP6
VA
C14
AS
E1
BFA
1B
UB
2B
IK1
TUB
3R
BL2
MO
N1
KIP
3Y
GL2
17C
CLB
4K
AR
9V
AN
1B
RE
5S
WC
3V
PS
71S
WR
1A
RP
6S
WC
5H
TZ1
BE
M4
MN
N10
SW
I4Y
CL0
60C
CS
M3
TOF1
PO
L32
RA
D27
ELG
1TO
P3
YLR
235C
RM
I1S
GS
1S
LA1
FAB
1P
AC
2C
IN2
CIN
4C
IN1
MA
D1
MA
D2
MA
D3
BU
B1
KE
M1
RA
D52
BE
M2
BN
I1D
CC
1C
TF8
CTF
18C
TF4
SM
I1S
LT2
BC
K1
CLA
4G
IM4
PA
C10
GIM
5G
IM3
YK
E2
CIN
8B
IM1
KA
R3
Figure 1B, Cheung et al
RVS161RVS167ARC40ARP2CNE1
ECM15ERV29
KTR3LRE1
GS C2MNL1
MNS1YLR057W
CTS1HST1
RA S2YUR1
RAD24DIE2ALG6
YKL037WSHS1SK N1
SVP26MMS4
MUS81HST3MID2LIA1
ALG8KRE9SMY1
RTT107RA D9DDC1
RAD52ROT2
CWH41HK R1CDC2
CTI6DY N2
PAC11DY N3DY N1ARP1PA C1JNM1
PA C2CIN2BFA1CIN4RBL2KIP2TUB3
BB C1YB R094W
TUB2RRM3TOP1KIP3CLB4
AS E1KA R9CIN1
LAS21CHS6BNI4
YDR332WMCD1CHL1
MA D2HOG1SGS1MRC1TOF1
POL32HP R5ELG1
ERG11NUM1
NIP100NB P2DBF4
CDC8CDC42
BIK1KRE11KRE6SLT2
CS M3ARL1ARL3
GY P1SET2
RAD27RAD50CNB1SKT5
CHS7CHS3HOC1KRE1KNH1ES C2SWF1CDC7GA S1ARP6BNI1
MY O2PHO85DE P1
SAP30CIN8
KA R3CLA4
DCC1CTF18CTF8CTF4
CHS5CHS1SMI1FK S1
CDC45BIM1
CDC73GIM4GIM3
PAC10GIM5
YK E2RIC1YPT6
HS
L1TS
A1
PM
T2K
EX
2V
PS
1V
MA
21R
AD
61IN
P52
HIR
1C
HK
1M
EC
3H
IR3
SP
T8IR
A2
SC
S7
UM
E6
BO
I2C
PR
6Y
LR21
7WM
ID1
KR
E11
ER
D1
CH
O2
YB
R09
4WY
BR
099C
MU
S81
MM
S4
TOP
1N
CS
6E
LP2
ELP
3Y
GR
228W
ELP
4Y
PL1
02C
IKI3
SW
D3
BR
E2
DO
T1S
WI3
VP
S24
YTA
12B
TS1
PE
X6
SR
O9
GE
T1G
SG
1G
LO3
YN
L296
WC
OG
7C
OG
5TL
G2
GO
S1
CO
G6
CO
G8
VA
M10
VP
S5
GY
P1
VP
S17
MO
N2
ISC
1M
UM
2C
DC
26M
CK
1A
RD
1Y
DJ1
NC
L1A
HC
1A
LT1
RP
S22
AH
XK
2E
AF3
DS
E3
PN
T1P
AN
3P
RK
1P
IN4
SE
C72
MM
S2
DN
M1
MM
M1
YLL
007C
PM
P3
RIM
13M
ED
1C
LB3
SS
D1
HA
P5
WS
S1
YB
R25
5WS
DS
3B
BC
1M
YO
5Y
TA7
VIK
1B
MH
1R
TT10
3Y
DR
360W
RIM
21S
IF2
VP
S72
HO
S2
YN
L140
CS
TB5
YK
L118
WV
MA
6S
EC
66H
OC
1S
TE24
CP
R7
SE
M1
IES
2LE
O1
XR
S2
RTT
109
CY
K3
SH
S1
RP
N4
VP
S8
VP
S38
VP
S35
DR
S2
PE
A2
YD
R13
6CY
LR26
1CR
GP
1R
IC1
YP
T6R
PA
34A
RC
18FK
S1
TUS
1TP
M1
PA
T1P
RE
9IX
R1
ES
C2
RTT
107
YM
L095
C-A
RP
O41
HC
M1
RP
N10
AS
F1N
UP
60U
BP
6R
RM
3S
OD
1S
LM3
FPS
1P
TC1
CN
B1
NB
P2
VP
S51
ED
E1
RV
S16
7Y
LR33
8W ILM
1S
AC
6S
HE
4C
CW
12Y
LR11
1WP
MR
1H
UR
1D
EP
1C
AP
2C
AP
1M
RC
1R
AD
57S
LX8
RA
D5
RA
D55
RA
D51
RA
D54
HP
R5
RA
D50
RA
D9
RA
D17
RA
D24
DD
C1
CC
S1
SLK
19C
LB2
LTE
1S
AM
37B
RE
1O
PI3
GE
T2S
KT5
CH
S7
CH
S3
BN
I4C
HS
5Y
BL0
62W
CH
S6
BE
M1
EM
P24
YN
L171
CP
ER
1V
PS
29G
UP
1G
AS
1C
SF1
VP
S21
MR
E11
KR
E1
VID
22S
OH
1R
TF1
LGE
1S
EC
22C
IK1
MN
N11
SU
M1
RU
D3
SP
F1V
RP
1IM
L3C
HL4
CTF
19M
CM
22C
TF3
MC
M21
MC
M16
YP
L017
CB
UB
3C
HL1
CS
M1
YD
R14
9CP
AC
1LD
B18
NU
M1
PA
C11
DY
N1
AR
P1
DY
N2
KIP
2JN
M1
DY
N3
NIP
100
ELM
1U
BA
4E
LP6
VA
C14
AS
E1
BFA
1B
UB
2B
IK1
TUB
3R
BL2
MO
N1
KIP
3Y
GL2
17C
CLB
4K
AR
9V
AN
1B
RE
5S
WC
3V
PS
71S
WR
1A
RP
6S
WC
5H
TZ1
BE
M4
MN
N10
SW
I4Y
CL0
60C
CS
M3
TOF1
PO
L32
RA
D27
ELG
1TO
P3
YLR
235C
RM
I1S
GS
1S
LA1
FAB
1P
AC
2C
IN2
CIN
4C
IN1
MA
D1
MA
D2
MA
D3
BU
B1
KE
M1
RA
D52
BE
M2
BN
I1D
CC
1C
TF8
CTF
18C
TF4
SM
I1S
LT2
BC
K1
CLA
4G
IM4
PA
C10
GIM
5G
IM3
YK
E2
CIN
8B
IM1
KA
R3
0
0
0
0
0
0
0
F
0
0
0
0
0
0
0
0
0
0
0
G
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
H
0
0
0
0
0
0
0
0
0
0
0
0
I
0
0
0
0
0
0
0
0
0
0
0
0
J
070605
000004000003000002000001
000001400000130000012
0000110000100000900008
00000210000020
019018017016015
EDCBA
21
JIFE
111098
1516171819
B C F G H I J
7654321
191817161532
20191817141312
EDCBA
000000800000090000001000000011
000000001200000000130000000014
0
0
0
0
0
0
0
F
0
0
0
0
0
0
0
G
0
0
0
0
0
0
0
0
0
H
0
0
0
0
0
0
0
0
0
I
0
0
0
0
0
0
0
0
0
J
000007000006000005000004000003000002000001
00000210000020
00000190000018000001700000160000015
EDCBA
000000800000090000001000000011
000000001200000000130000000014
0
0
0
0
0
0
0
F
0
0
0
0
0
0
0
G
0
0
0
0
0
0
0
0
0
H
0
0
0
0
0
0
0
0
0
I
0
0
0
0
0
0
0
0
0
J
070605
000004000003000002000001
00000210000020
019018017016015
EDCBA
00000000004000000000050000000000600000000007
000000800000090000001000000011
000000001200000000130000000014
0
0
0
F
0
0
0
G
0
0
0
0
0
H
0
0
0
0
0
I
0
0
0
0
0
J
000003000002000001
00000210000020
00000190000018000001700000160000015
EDCBA
00000000004000000000050000000000600000000007
000000800000090000001000000011
000000000012000000000013000000000014
0
0
0
0
0
0
0
0
0
0
F
0
0
0
0
0
0
0
0
0
0
G
0
0
0
0
0
H
0
0
0
0
0
I
0
0
0
0
0
J
000003000002000001
00000210000020
00000190000018000001700000160000015
EDCBA
00000000001200000000001300000000001400000000004000000000050000000000600000000007
000000800000090000001000000011
0
0
0
0
0
0
0
0
0
0
F
0
0
0
0
0
0
0
0
0
0
G
0
0
0
0
0
H
0
0
0
0
0
I
0
0
0
0
0
J
000003000002000001
0000021000002000000190000018000001700000160000015
EDCBA
000000000015000000000016000000000017000000000018000000000019000000000020000000000021000000000012000000000013000000000014
00000000004
000000000050000000000600000000007
000000800000090000001000000011
0
0
0
F
0
0
0
G
0
0
0
H
0
0
0
I
0
0
0
J
000003000002000001
EDCBA
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
G
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
H
0000000015000000001600000000170000000018000000001900000000200000000021000000001200000000130000000014
000000004
000000005000000006000000007
0000800009000010000011
0
0
0
F
0
0
0
I
0
0
0
J
000003000002000001
EDCBA
765
1918171615
EDCB
121314
F G
32
1918171615
H I J
2120191817
4321
CBA
89
1011
F I JE
F G H I J
7654321
141312111098
21201918171615
EDCBAF G H I J
7654321
141312111098
21201918171615
EDCBA
Factorgram
(i) (ii) (iii) (iv) (v)
(vi)
(vii)(viii)(x) (ix)
Regroup toIdentify Block
Regroup toIdentify Block
Regroup toIdentify Block
Regroup toIdentify Block
ExtractFactor
ExtractFactor
ExtractFactor
ExtractFactor
Clustergram
Start
Figure 2, Cheung et al
Figure 3A, Cheung et al
CC
H1
MID
2 P
FA
4 Y
GL0
46W
R
LM1
DF
G16
R
OM
2 U
BC
4 PH
O85
M
MS
22
DO
C1
UB
P3
SP
A2
RIM
20
PR
E9
RP
N10
F
PS
1 P
TC
1 C
NB
1 VP
S51
R
VS16
7 IL
M1
SK
T5
CH
S7
CH
S3
BN
I4
CH
S5
CH
S6
YN
L171
C
MN
N11
E
LM1
BR
E5
SW
I4
SLA
1 B
EM
2 B
NI1
S
LT2
SMI1FKS1
NC
S2
BU
D6
ELP
2 E
LP3
YG
R22
8W
ELP
4 S
HS
1 S
KT5
CH
S7
CH
S3
BN
I4
CH
S5
YB
L062
W
BEM
1 Y
DR
149C
P
AC1
LDB1
8 N
UM
1 P
AC
11
DYN
1 A
RP
1 D
YN2
NIP
100
ELM
1 U
BA4
ELP
6 V
AC
14
BEM
4 S
WI4
F
AB1
BEM
2 S
MI1
S
LT2
BC
K1
CLA
4
BNI1MYO2CLA4
BIK
1 TU
B3
CIN
1 M
AD
1 G
IM4
PA
C10
G
IM5
GIM
3 Y
KE2
CIN
8 BI
M1
PAC2CIN2
BFA1CIN4
RBL2KIP2
TUB3ASE1KAR9
CIN1CHL1
MAD2BIK1
CSM3ARP6
MN
N2
TP
S1
MN
N9
EN
D3
EA
P1
YIR
003W
D
OA
1 S
IN3
RX
T2
SAP
30
PHO
23
YB
R25
5W
SD
S3
BB
C1
MY
O5
HO
C1
CY
K3
CC
W12
Y
LR11
1W
DE
P1
CA
P2
CA
P1
SK
T5
CH
S7
CH
S3
BN
I4
CH
S5
CH
S6
GU
P1
CS
F1
VPS
21
KR
E1
SEC
22
SU
M1
RU
D3
SP
F1
MN
N10
S
WI4
S
LA1
SLT
2 B
CK
1 C
LA4
GIM
4 PA
C10
G
IM5
GIM
3 Y
KE
2
RVS161RVS167
TH
I3
PR
M6
YGR
182C
T
IM13
C
DC
73
SP
T3
SA
S5
YIP
5 S
ET3
N
HP
10
YER
139C
S
PT8
E
LP3
SIF
2 V
PS
72
HO
S2
YNL1
40C
S
TB5
IE
S2
LEO
1 R
PN
10
AS
F1
BR
E1
SO
H1
RT
F1
LGE1
S
EC
22
NIP
100
VP
S71
S
WR
1 A
RP6
H
TZ1
S
WI4
C
LA4
BIM
1
DEP1SAP30
RA
D50
R
AD
9 R
AD
17
RA
D24
D
DC
1 R
AD
27
TO
P3
SG
S1
RA
D52
D
CC
1 C
TF8
C
TF
18
CT
F4
RRM3MRC1TOF1
POL32HPR5ELG1DBF4CSM3
RAD27CDC7
UB
R1
SGF
73
RA
D23
R
PS
4A
HS
L1
CD
C26
V
IK1
NU
P60
R
RM
3 M
RC
1 R
AD
57
SLX
8 R
AD
5 R
AD
55
RA
D51
R
AD
54
HP
R5
RA
D50
R
AD
9 D
DC
1 C
CS
1 S
LK19
C
LB2
MR
E11
S
OH
1 R
TF
1 C
HL1
K
IP3
AR
P6
HT
Z1
YC
L060
C
CS
M3
TO
F1
PO
L32
RA
D27
T
OP
3 Y
LR23
5C
SG
S1
MA
D1
MA
D2
MA
D3
BU
B1
KE
M1
RA
D52
C
TF
4 C
LA4
GIM
4 PA
C10
G
IM5
GIM
3 C
IN8
BIM
1 K
AR
3
DCC1CTF18
CTF8CTF4
RP
S8A
S
MY
1 Y
TA
12
RP
A34
AR
C18
F
KS
1 T
US
1 T
PM
1 P
RE
9 E
DE
1 R
VS
167
YLR
338W
IL
M1
SA
C6
SH
E4
CC
W12
Y
LR11
1W
OP
I3
GU
P1
CS
F1
VR
P1
VA
N1
BR
E5
MN
N10
S
WI4
S
LA1
FA
B1
BN
I1
SM
I1
SLT
2 B
CK
1 C
LA4
SKT5CHS7CHS3CHS5
RR
D2
RA
D61
M
CK
1 B
MH
1 H
UR
1 S
LK19
IM
L3
CH
L4
CT
F19
M
CM
22
CT
F3
MC
M21
M
CM
16
BU
B3
CH
L1
YD
R14
9C
PA
C1
LDB
18
NU
M1
PA
C11
D
YN
1 A
RP
1 D
YN
2 K
IP2
JNM
1 D
YN
3 N
IP10
0 E
LM1
AS
E1
BF
A1
BU
B2
BIK
1 T
UB
3 K
IP3
CLB
4 A
RP
6 C
SM
3 T
OF
1 C
IN1
MA
D1
MA
D2
MA
D3
BU
B1
KE
M1
RA
D52
B
EM
2 D
CC
1 C
TF
8 C
TF
18
CT
F4
SM
I1
CLA
4 G
IM4
PA
C10
G
IM5
GIM
3 Y
KE
2
CIN8KAR3BIM1
SA
P15
5 S
EC
66
HO
C1
ST
E24
CP
R7
PE
A2
ILM
1 S
AC
6 S
HE4
YL
R11
1W
GET
2 S
KT5
C
HS7
C
HS3
B
NI4
C
HS5
YB
L062
W
CH
S6
BE
M1
KR
E1
MN
N11
S
UM
1 R
UD
3 S
PF1
V
RP1
B
EM
2 S
LT2
BC
K1
CLA
4 G
IM4
GIM
3 Y
KE2
ARC40ARP2
PKR
1 P
DR
16
PHO
5 Y
BR
209W
TC
B2
YN
L179
C
YN
L200
C
SPS
19
YN
L203
C
SPS
18
YN
L228
W
CLN
2 Y
PR
053C
F
IG2
YM
R00
3W
BU
D20
PD
C1
YD
R31
4C
HX
T8
ELO
1 R
PE
1 Y
IL11
0W
YME
1 H
AP
2 Y
NL2
35C
TY
R1
PFK
2 Y
EL0
33W
M
DM
38
PDA
1 U
TH
1 S
KI2
IP
K1
PSY
2 PE
X22
Y
DR
248C
PR
M3
NU
C1
HIT
1 Y
PL2
61C
H
BT
1 C
UE
3 LD
B19
E
CM
21
LEA
1 G
RS
1 Y
DL2
06W
Y
FR
045W
Y
GL0
81W
W
HI2
PE
X6
VPS
5 V
PS17
PM
P3
VPS
35
PEA
2 R
PA
34
AR
C18
FP
S1
CN
B1
DE
P1
EM
P24
Y
NL1
71C
V
PS29
M
RE
11
LGE
1 SP
F1
BEM
4 BC
K1
CHS1
VP
S1
VM
A21
C
OG
7 C
OG
5 T
LG2
GO
S1
CO
G6
CO
G8
YKL
118W
C
PR
7 Y
DR
136C
Y
LR26
1C
RIC
1 Y
PT
6 G
ET
2 P
ER
1
ARL1ARL3GYP1SWF1
HA
T2
TR
F5
CS
G2
ME
T22
Y
HR
039C
-B
MS
I1
YB
R27
7C
HA
T1
YP
L183
W-A
BR
R1
DP
B3
YM
R16
6C
SUB
1 D
PB
4 AR
O2
RM
D11
S
PT
2 H
IR2
YD
L151
C
LIA
1 N
UT
1 Y
OR
024W
H
ST
3 AP
Q12
S
WI5
N
PR
2 IM
P2'
C
LB5
NH
P10
LS
M1
DF
G16
W
HI2
H
PC
2 S
IS2
HIR
1 C
HK
1 M
EC
3 H
IR3
SP
T8
SCS
7 M
MS
4 T
OP
1 M
ED
1 R
TT10
3 R
IM21
VP
S72
S
TB
5 IE
S2
LEO
1 X
RS
2 R
TT10
9 P
AT
1 A
SF
1 R
RM
3 SO
D1
RA
D57
R
AD
54
HP
R5
RA
D9
RA
D17
R
AD
24
DD
C1
LTE
1 SO
H1
RT
F1
LGE
1 SE
C22
C
SM
1 B
FA
1 BU
B2
SWC
5 H
TZ
1 S
WI4
Y
CL0
60C
C
SM
3 T
OF
1 PO
L32
TO
P3
YLR
235C
R
MI1
F
AB
1 M
AD
1 M
AD
2 KE
M1
RA
D52
D
CC
1 C
TF
8 C
TF
18
CT
F4
GIM
4 G
IM5
KAR
3
CDC45
MO
N1
KIP
3 Y
GL2
17C
C
LB4
KA
R9
CIN
1 B
NI1
G
IM4
PA
C10
G
IM5
GIM
3 Y
KE
2 C
IN8
BIM
1 K
AR
3
DYN2PAC11DYN3DYN1ARP1PAC1JNM1NUM1
NIP100NBP2
YE
R08
4W
RC
E1
YP
R08
4W
SE
C28
M
MR
1 G
CR
2 Y
EL0
33W
M
DM
38
PD
A1
DIA
2 E
CM
8 V
AC8
ELF
1 Y
LR16
8C
SW
I6
LIP
2 Y
LR26
9C
YML0
13C
-A
MR
PS8
BU
L1
SSN
8 Y
NL1
98C
T
HP
1 U
BP2
TIM
18
SSN
3 M
DM
35
VID
30
BU
D13
P
ET
309
CT
I6
YP
L182
C
IST
3 P
DB
1 R
RP
6 M
ET1
8 S
ET2
SN
U66
LS
T4
LSM
6 Q
RI5
C
TK1
TH
P2
DST
1 O
RM
2 S
PT2
HIR
2 A
PQ
12
NH
P10
YE
R13
9C
PD
E2
DO
A1
SIN
3 R
XT2
SA
P30
PH
O23
M
SN5
NKP
2 LS
M1
DF
G16
S
UR
1 H
PC2
SO
Y1
HIR
3 IK
I3
DO
T1
BTS
1 S
RO
9 G
ET1
GLO
3 G
OS
1 R
IM13
M
ED1
YTA
7 R
TT1
03
RIM
21
SIF
2 V
PS7
2 Y
NL1
40C
S
EM1
IES
2 LE
O1
XR
S2
RT
T109
V
PS8
RG
P1
RIC
1 Y
PT6
PAT
1 N
UP6
0 U
BP6
SLM
3 D
EP1
CC
S1
VP
S21
VID
22
RTF
1 S
EC
22
RU
D3
SPF
1 S
WC
3 V
PS7
1 S
WR
1 A
RP
6 S
WC
5 H
TZ1
MN
N10
S
WI4
K
EM1
RA
D52
C
IN8
CDC73
SR
O7
MR
L1
DS
S4
OX
R1
FUN
30
AP
L5
VPS
30
EG
D1
YO
R11
2W
INP
53
SF
T2
ES
C8
PP
G1
PS
D1
RE
R1
SK
Y1
MA
K31
D
CR
2 IM
H1
VPS
41
YD
R10
7C
EN
T5
EMP
70
EN
T4
GM
H1
VPS
74
TE
F4
YE
L043
W
VM
A8
GE
F1
SC
S2
RIM
15
BS
T1
RP
L2A
ER
V14
Y
IL03
9W
AP
L6
YP
R19
7C
YE
R08
4W
RC
E1
YP
R08
4W
SEC
28
UP
F3
EA
F7
NE
M1
RA
D4
YM
R01
0W
RPL
35A
Y
PT
7 E
ST
2 SW
A2
SN
F4
VA
M6
GE
T3
MV
P1
ST
V1
RA
V2
YD
R20
3W
RA
V1
AR
L1
AR
L3
SY
S1
CB
F1
VMA
22
TF
P3
RPL
16B
M
AK
10
YP
R05
0C
MA
K3
PK
R1
BUD
14
VMA
21
BR
E2
SR
O9
GE
T1
GS
G1
GLO
3 Y
NL2
96W
C
OG
7 C
OG
5 T
LG2
GO
S1
CO
G6
CO
G8
VAM
10
VP
S5
GY
P1
VPS
17
MO
N2
VP
S8
VPS
38
VPS
35
PT
C1
CN
B1
NB
P2
VPS
51
BR
E1
OP
I3
PE
R1
VPS
29
GU
P1
GA
S1
CS
F1
VPS
21
RT
F1
LGE
1 R
UD
3 B
RE
5 SW
C3
VPS
71
SWR
1 A
RP
6 H
TZ
1 B
EM
4 S
MI1
RIC1YPT6
MA
K10
YP
R05
0C
SE
T6
RG
A1
YBR
108W
B
EM
3 B
NI5
Y
GR
054W
P
LP1
MK
C7
CA
F40
U
RM
1 N
CL1
A
HC
1 A
LT1
RP
S22
A
HX
K2
EA
F3
DS
E3
PN
T1
PA
N3
PR
K1
PIN
4 S
EC
72
MM
S2
DN
M1
MM
M1
YLL0
07C
P
MP3
R
IM13
M
ED
1 C
LB3
SS
D1
HA
P5
WS
S1
YBR
255W
S
DS3
B
BC
1 M
YO
5 Y
TA7
V
IK1
BM
H1
RT
T10
3 Y
DR
360W
R
IM21
S
IF2
VP
S72
H
OS2
YN
L140
C
ST
B5
YKL1
18W
V
MA6
S
EC
66
HO
C1
ST
E24
C
PR
7 R
PN
4 V
PS8
C
AP2
P
ER
1 V
PS
29
MN
N11
S
UM
1 R
UD
3 S
PF1
V
RP1
IM
L3
CH
L4
CT
F19
M
CM
22
CT
F3
MC
M21
M
CM
16
YPL0
17C
B
UB3
C
HL1
C
SM
1 YD
R14
9C
PA
C1
LDB
18
NU
M1
PA
C11
D
YN
1 A
RP1
D
YN
2 K
IP2
JNM
1 D
YN
3 N
IP10
0 E
LM1
UB
A4
ELP
6 V
AC
14
AS
E1
BF
A1
BU
B2
BIK
1 T
UB3
R
BL2
K
AR
9 S
WC
3 V
PS
71
SW
R1
AR
P6
SW
C5
HT
Z1
SLA
1 F
AB1
P
AC
2 C
IN2
CIN
4 C
IN1
MA
D1
MA
D2
MA
D3
BU
B1
KE
M1
RA
D52
B
EM
2 D
CC
1 C
TF8
C
TF
18
CT
F4
SM
I1
SLT
2 B
CK1
C
LA4
CIN
8 B
IM1
KA
R3
GIM4GIM3
PAC10GIM5
YKE2
Figure 3B, Cheung et al
RVS161RVS167ARC40ARP2CNE1
ECM15ERV29
KTR3LRE1
GS C2MNL1
MNS1YLR057W
CTS1HST1
RA S2YUR1
RAD24DIE2ALG6
YKL037WSHS1SK N1
SVP26MMS4
MUS81HST3MID2LIA1
ALG8KRE9SMY1
RTT107RA D9DDC1
RAD52ROT2
CWH41HK R1CDC2
CTI6DY N2
PAC11DY N3DY N1ARP1PA C1JNM1
PA C2CIN2BFA1CIN4RBL2KIP2TUB3
BB C1YB R094W
TUB2RRM3TOP1KIP3CLB4
AS E1KA R9CIN1
LAS21CHS6BNI4
YDR332WMCD1CHL1
MA D2HOG1SGS1MRC1TOF1
POL32HP R5ELG1
ERG11NUM1
NIP100NB P2DBF4
CDC8CDC42
BIK1KRE11KRE6SLT2
CS M3ARL1ARL3
GY P1SET2
RAD27RAD50CNB1SKT5
CHS7CHS3HOC1KRE1KNH1ES C2SWF1CDC7GA S1ARP6BNI1
MY O2PHO85DE P1
SAP30CIN8
KA R3CLA4
DCC1CTF18CTF8CTF4
CHS5CHS1SMI1FK S1
CDC45BIM1
CDC73GIM4GIM3
PAC10GIM5
YK E2RIC1YPT6
HS
L1TS
A1
PM
T2K
EX
2V
PS
1V
MA
21R
AD
61IN
P52
HIR
1C
HK
1M
EC
3H
IR3
SP
T8IR
A2
SC
S7
UM
E6
BO
I2C
PR
6Y
LR21
7WM
ID1
KR
E11
ER
D1
CH
O2
YB
R09
4WY
BR
099C
MU
S81
MM
S4
TOP
1N
CS
6E
LP2
ELP
3Y
GR
228W
ELP
4Y
PL1
02C
IKI3
SW
D3
BR
E2
DO
T1S
WI3
VP
S24
YTA
12B
TS1
PE
X6
SR
O9
GE
T1G
SG
1G
LO3
YN
L296
WC
OG
7C
OG
5TL
G2
GO
S1
CO
G6
CO
G8
VA
M10
VP
S5
GY
P1
VP
S17
MO
N2
ISC
1M
UM
2C
DC
26M
CK
1A
RD
1Y
DJ1
NC
L1A
HC
1A
LT1
RP
S22
AH
XK
2E
AF3
DS
E3
PN
T1P
AN
3P
RK
1P
IN4
SE
C72
MM
S2
DN
M1
MM
M1
YLL
007C
PM
P3
RIM
13M
ED
1C
LB3
SS
D1
HA
P5
WS
S1
YB
R25
5WS
DS
3B
BC
1M
YO
5Y
TA7
VIK
1B
MH
1R
TT10
3Y
DR
360W
RIM
21S
IF2
VP
S72
HO
S2
YN
L140
CS
TB5
YK
L118
WV
MA
6S
EC
66H
OC
1S
TE24
CP
R7
SE
M1
IES
2LE
O1
XR
S2
RTT
109
CY
K3
SH
S1
RP
N4
VP
S8
VP
S38
VP
S35
DR
S2
PE
A2
YD
R13
6CY
LR26
1CR
GP
1R
IC1
YP
T6R
PA
34A
RC
18FK
S1
TUS
1TP
M1
PA
T1P
RE
9IX
R1
ES
C2
RTT
107
YM
L095
C-A
RP
O41
HC
M1
RP
N10
AS
F1N
UP
60U
BP
6R
RM
3S
OD
1S
LM3
FPS
1P
TC1
CN
B1
NB
P2
VP
S51
ED
E1
RV
S16
7Y
LR33
8W ILM
1S
AC
6S
HE
4C
CW
12Y
LR11
1WP
MR
1H
UR
1D
EP
1C
AP
2C
AP
1M
RC
1R
AD
57S
LX8
RA
D5
RA
D55
RA
D51
RA
D54
HP
R5
RA
D50
RA
D9
RA
D17
RA
D24
DD
C1
CC
S1
SLK
19C
LB2
LTE
1S
AM
37B
RE
1O
PI3
GE
T2S
KT5
CH
S7
CH
S3
BN
I4C
HS
5Y
BL0
62W
CH
S6
BE
M1
EM
P24
YN
L171
CP
ER
1V
PS
29G
UP
1G
AS
1C
SF1
VP
S21
MR
E11
KR
E1
VID
22S
OH
1R
TF1
LGE
1S
EC
22C
IK1
MN
N11
SU
M1
RU
D3
SP
F1V
RP
1IM
L3C
HL4
CTF
19M
CM
22C
TF3
MC
M21
MC
M16
YP
L017
CB
UB
3C
HL1
CS
M1
YD
R14
9CP
AC
1LD
B18
NU
M1
PA
C11
DY
N1
AR
P1
DY
N2
KIP
2JN
M1
DY
N3
NIP
100
ELM
1U
BA
4E
LP6
VA
C14
AS
E1
BFA
1B
UB
2B
IK1
TUB
3R
BL2
MO
N1
KIP
3Y
GL2
17C
CLB
4K
AR
9V
AN
1B
RE
5S
WC
3V
PS
71S
WR
1A
RP
6S
WC
5H
TZ1
BE
M4
MN
N10
SW
I4Y
CL0
60C
CS
M3
TOF1
PO
L32
RA
D27
ELG
1TO
P3
YLR
235C
RM
I1S
GS
1S
LA1
FAB
1P
AC
2C
IN2
CIN
4C
IN1
MA
D1
MA
D2
MA
D3
BU
B1
KE
M1
RA
D52
BE
M2
BN
I1D
CC
1C
TF8
CTF
18C
TF4
SM
I1S
LT2
BC
K1
CLA
4G
IM4
PA
C10
GIM
5G
IM3
YK
E2
CIN
8B
IM1
KA
R3
Adre
nal
Aorta
D
igit
Sn
out
Tong
ue s
urfa
ce
Thyr
oid
Tr
ache
a
Tong
ue
Pros
tate
Te
eth
Sk
in
Ova
ry
Ute
rus
Sa
livar
y
Col
on
Larg
e in
test
ine
Sm
all i
ntes
tine
M
amm
ary
glan
d
Stom
ach
Ki
dney
H
eart
Sk
elet
al M
uscl
e
Live
r Lu
ng
Blad
der
Thym
us
Bone
Mar
row
Fe
mur
Kn
ee
Brow
n fa
t Ly
mph
nod
e
Sple
en
Cal
varia
M
andi
ble
Pa
ncre
as
ES
Embr
yo
E10.
5 H
ead
Em
bryo
9.5
Em
bryo
12.
5
E14.
5 H
ead
Pl
acen
ta 1
2.5
Pl
acen
ta 9
.5
Epid
idym
us
Test
is
Brai
n
Cer
ebel
lum
C
orte
x
Mid
brai
n
Hin
dbra
in
Stria
tum
Tr
igem
inus
O
lfact
ory
bulb
Sp
inal
cor
d
Eye
A
B
Bra
in
Cer
ebel
lum
C
orte
x E
ye
Hin
dbra
in
Mid
brai
n O
lfact
ory
bulb
S
pina
l cor
d S
triat
um
Trig
emin
us
C
D
0
0.5
1.5
1
2.5
2
≥3
mRNA concentration
Figure 4, Cheung et al
Bra
in
Cer
ebel
lum
C
orte
x E
10.5
Hea
d E
14.5
Hea
d E
mbr
yo 1
2.5
Em
bryo
9.5
E
ye
Hin
dbra
in
Mid
brai
n S
triat
um
Trig
emin
us