Computational Biology (IST)
Technical University of Lisbon
Ana Teresa Freitas
2016/2017
Microarray data analysis
Microarrays
• Rows represent genes
• Columns represent samples
• Many problems may be solved using clustering
• Example of microarray dataset (figure)
Microarray data
• Gi: expression levels of gene i across all samples (row i)
• Sj: expression levels of all genes in one sample j (column j)
• Typical examples of samples: heat shock, phases of the cell cycle, cancer, normal, ...
Microarray data
• Rows: genes; columns: mRNA samples
• Entry (i, j): expression level of gene i in mRNA sample j, typically reported as log(treated expression value / control expression value)

gene   sample1  sample2  sample3  sample4  sample5  ...
1       0.46     0.30     0.80     1.51     0.90    ...
2      -0.10     0.49     0.24     0.06     0.46    ...
3       0.15     0.74     0.04     0.10     0.20    ...
4      -0.45    -1.03    -0.79    -0.56    -0.32    ...
5      -0.06     1.06     1.35     1.09    -1.09    ...
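Only the log-ratio table itself is on the slide; as an illustration, here is a minimal numpy sketch of how such log ratios could be computed from raw treated/control intensities. The intensity values, the variable names and the choice of base-2 logarithm are assumptions, not taken from the slides:

```python
import numpy as np

# Hypothetical raw intensities: rows = genes, columns = samples.
treated = np.array([[137.6, 123.1, 174.1],
                    [ 93.3, 140.4, 118.1]])
control = np.array([[100.0, 100.0, 100.0],
                    [100.0, 100.0, 100.0]])

# Log ratio of treated vs. control expression, as on the slide.
# Base 2 is a common convention for microarray data (an assumption here).
log_ratios = np.log2(treated / control)
print(log_ratios)  # e.g., gene 1, sample 1: log2(137.6/100) ~ 0.46
```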
What do we actually measure?
• We measure the signal of the cDNA target(s) that hybridize(s) to the probe (plus backgrounds, ratios, standard deviations, dust, etc.)
• What do we wish to know (an abstraction)? [mRNA]_{1,a}, [mRNA]_{1,b}, ..., [mRNA]_{N,a}, [mRNA]_{N,b}, where N = number of genes and a, b = the two dye colors
Factors with impact on the signal level
• Amount of mRNA
• Labeling efficiencies
• Quality of the RNA
• Laser/dye combination
• Detection efficiency of the photomultiplier
• ...
Typical Assumption
The mRNA concentration is assumed proportional to the measured signal:
$$[\mathrm{mRNA}]_{n,a} \propto \mathrm{signal}_{n,a}$$
so with a "normalization constant" k,
$$[\mathrm{mRNA}]_{n,a} = k \cdot \mathrm{signal}_{n,a}$$
where n = gene index and a = color.
Low-level analysis
• Image analysis: computation of the probes' intensities/signals
• Normalization: the attempt to compensate for systematic technical differences between chips, so that the systematic biological differences between samples can be seen more clearly. Statisticians use the term 'bias' to describe systematic errors, which affect a large number of genes.
Normalization
Sources of systematic errors:
• Different incorporation efficiency of dyes
• Different amounts of mRNA
• Experimenter/protocol issues (comparing chips processed by different labs)
• Different scanning parameters
• Batch bias
Normalization
Two problems:
• How to detect biases? Which genes should be used to estimate the biases among chips?
• How to remove the biases?
Which genes to use for bias detection?
• All genes on the chip
  – Assumption: most of the genes are equally expressed in the compared samples; the proportion of differentially expressed genes is low (<20%).
  – Limits:
    – Not appropriate when comparing highly heterogeneous samples (different tissues)
    – Not appropriate for the analysis of 'dedicated chips' (apoptosis chips, inflammation chips, etc.)
Which genes to use for bias detection?
• Housekeeping genes
  – Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples
    • Affymetrix novel chips: a 'normalization set' of 100 genes
    • NHGRI's cDNA microarrays: a set of 70 "housekeeping" genes
  – Limits:
    • The validity of the assumption is questionable
    • Housekeeping genes are usually expressed at high levels, so they are not informative in the low-intensity range
Normalization methods
• Global normalization (scaling): forces the chips to have equal mean (median) intensity
• Intensity-dependent normalization (lowess): forces equal means at all intensity levels
• Quantile normalization: forces the chips to have identical intensity distributions
Quantile Normalization
• Sort each column of the data matrix according to the genes' (probes') intensities in each chip
• Compute the mean intensity at each rank across the chips
• Replace each intensity by the mean intensity at its rank
• Re-order the columns to their original state, so that each row again corresponds to a gene
(Figure: Chip #1, Chip #2, Chip #3 and the resulting "average chip".)
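The steps above translate almost directly into array operations. A minimal numpy sketch, where the function name `quantile_normalize` and the toy matrix are assumptions:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes x chips matrix, following the steps above."""
    # Rank of each value within its chip (ties broken arbitrarily, a simplification).
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)
    sorted_X = np.sort(X, axis=0)        # step 1: sort each column
    rank_means = sorted_X.mean(axis=1)   # step 2: mean intensity at each rank
    return rank_means[ranks]             # steps 3-4: substitute and restore order

# Toy example: 4 genes x 3 chips.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.0, 6.0],
              [4.0, 2.0, 8.0]])
print(quantile_normalize(X))  # every column now has the same distribution
```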
Quantile Normalization
(Figure: intensity distributions of the chips before and after quantile normalization.)
What is Cluster Analysis?
• Cluster: a collection of data objects that are
  – similar to one another within the same cluster
  – dissimilar to the objects in other clusters
• Cluster analysis: grouping a set of data objects into clusters
• Clustering is unsupervised classification: there are no predefined classes
• Typical applications:
  – As a stand-alone tool to gain insight into the data distribution
  – As a preprocessing step for other algorithms
Things to study (1)
• Clustering (grouping) genes: finding groups of co-regulated genes.
(Figure: expression levels across time of two clusters of co-regulated genes.)
Things to study (2)
• Clustering (grouping) samples: finding groups of samples with similar genetic profiles (e.g., cancer types).
(Figure: sample profiles suggesting groups of similar behaviour.)
Things to study (3)
• Classifying genes: deciding whether a gene is co-regulated with some known gene(s), based on their expression profiles across samples.
(Figure: expression profiles across samples of annotated gene 1, annotated gene 2 and an unknown gene.)
Co-regulation? Similar biological function? Same transcription factor?
Things to study (4)
• Classifying samples: classifying new samples based on a set of already classified samples (example: cancer versus normal; different types of cancer; ...).
(Figure: classified samples of types A and B, plus samples to be classified.)
Things to study (5)
• Selecting genes:
  a) Deciding if a given gene, in isolation, behaves differently in a control versus an experimental situation (e.g., cancer vs. normal, two types of cancer, treatment vs. no treatment).
  b) Selecting which group of genes is significantly different in a control versus an experimental situation (same examples).
  c) Selecting which group of genes is relevant for a given classification problem.
Clustering methods
• Similarity-based (needs a similarity function)
  – Construct a partition
  – Agglomerative, bottom-up
  – Searching for an optimal partition
  – Typically "hard" clustering
• Model-based (latent models, probabilistic or algebraic)
  – First compute the model
  – Clusters are obtained easily once the model is available
  – Typically "soft" clustering
Similarity-based clustering
• Define a similarity function to measure the similarity between two objects
• Common criterion: find a partition that
  – maximizes intra-cluster similarity
  – minimizes inter-cluster similarity
• Two ways to construct the partition:
  – Hierarchical (e.g., agglomerative hierarchical clustering)
  – Search, starting from a random partition (e.g., K-means)
Agglomerative Hierarchical Clustering
• Given a similarity function to measure the similarity between two objects,
• gradually group similar objects together in a bottom-up fashion,
• and stop when some stopping criterion is met.
• Variations: different ways to compute group similarity based on individual object similarity.
Distance Metrics
• For clustering algorithms, computing a distance between gene vectors or experiment vectors is a necessary step.
• Distance metrics can be classified as
  – metric distances
  – semi-metric distances
• Metric distances satisfy:
  1. d_ab >= 0
  2. d_ab = d_ba
  3. d_aa = 0
  4. d_ab <= d_ac + d_cb (triangle inequality)
• Semi-metric distances obey 1) to 3) but fail 4).
Distance Metrics
Minkowski distance:
$$d(i,j) = \left(|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q\right)^{1/q}$$
If q = 1, d is the Manhattan distance (also a metric distance):
$$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \cdots + |x_{ip}-x_{jp}|$$
If q = 2, d is the Euclidean distance (metric distance):
$$d(i,j) = \sqrt{|x_{i1}-x_{j1}|^2 + |x_{i2}-x_{j2}|^2 + \cdots + |x_{ip}-x_{jp}|^2}$$
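A small numpy sketch of these distances, reusing the sample log-ratio rows from the earlier table; the function name is an assumption:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance of order q between two expression vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
y = np.array([-0.10, 0.49, 0.24, 0.06, 0.46])

print(minkowski(x, y, 1))  # Manhattan distance (q = 1)
print(minkowski(x, y, 2))  # Euclidean distance (q = 2)
```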
Distance Metrics
Pearson correlation coefficient (semi-metric distance)
$$d(i,j) = \frac{\sum_{k=1}^{n}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\sqrt{\sum_{k=1}^{n}(x_{ik}-\bar{x}_i)^2}\,\sqrt{\sum_{k=1}^{n}(x_{jk}-\bar{x}_j)^2}}, \qquad -1 \le d(i,j) \le +1$$
where $\bar{x}_i$ and $\bar{x}_j$ are the mean expression levels of the two vectors.
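A direct numpy transcription of the formula above; the function name is an assumption:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two expression vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

x = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
y = np.array([0.15, 0.74, 0.04, 0.10, 0.20])
print(pearson(x, y))  # value in [-1, +1]; equals np.corrcoef(x, y)[0, 1]
```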
Distance Metrics
Entropy-based distances: Mutual Information (semi-metric distance)
• Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.
• MI is a measure of the additional information known about one expression pattern when given another.
• MI is not based on linear models and can therefore also detect non-linear dependencies (see figure). A small estimator sketch follows.
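A common way to estimate MI for continuous expression values is to discretize them into bins first. This numpy sketch does exactly that; the bin count, function name and toy signals are assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram-based MI estimate between two expression vectors (in nats)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()            # joint distribution P(A, B)
    px = pxy.sum(axis=1, keepdims=True)  # marginal P(A)
    py = pxy.sum(axis=0, keepdims=True)  # marginal P(B)
    nz = pxy > 0                         # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
print(mutual_information(x, x ** 2))                 # non-linear dependence: MI clearly > 0
print(mutual_information(x, rng.normal(size=500)))   # independent signals: MI near 0
```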
Similarity-induced Structure
(Figure: the hierarchy of groupings induced by pairwise similarities.)
How to Compute Group Similarity?
Three popular methods. Given two groups g1 and g2:
• Single-link algorithm: s(g1, g2) = similarity of the closest pair
• Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
• Average-link algorithm: s(g1, g2) = average similarity over all pairs
(A scipy sketch of the three linkages follows.)
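With scipy, these three linkage strategies are one parameter away from each other. A minimal sketch; the toy gene profiles are an assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: 5 gene profiles with 3 samples each.
genes = np.array([[0.1, 0.2, 0.1],
                  [0.2, 0.1, 0.2],
                  [1.9, 2.0, 2.1],
                  [2.1, 1.9, 2.0],
                  [5.0, 5.1, 4.9]])

for method in ("single", "complete", "average"):
    Z = linkage(genes, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(method, labels)
```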
Comparison of the Three Methods
• Single-link
  – "Loose" clusters
  – Individual decision, sensitive to outliers
• Complete-link
  – "Tight" clusters
  – Individual decision, sensitive to outliers
• Average-link
  – "In between"
  – Group decision, insensitive to outliers
• Which one is the best? It depends on what you need!
Hierarchical (agglomerative) clustering
• Strictly speaking, agglomerative clustering does not produce clusters, but a dendrogram.
• Cutting the dendrogram at a certain level yields clusters.
• Dendrogram cutting is a problem analogous to the selection of K in K-means clustering.
(Figure: a dendrogram over objects 2, 3, 5, 1, 4, with dissimilarity on the vertical axis, cut at a level that yields clusters.)
Example of agglomerative gene clustering (Eisen et al., 1998)
Microarray data from a time course of serum stimulation of primary human fibroblasts.
Experiment: foreskin fibroblasts were grown in culture and were deprived of serum for 48 h. Serum was added back and samples were taken at time 0, 15 min, 30 min, 1 h, 2 h, 3 h, 4 h, 8 h, 12 h, 16 h, 20 h and 24 h.
Clustering: correlation coefficient + agglomerative clustering (average-link).
Clusters with a biological interpretation:
(A) cholesterol biosynthesis,
(B) the cell cycle,
(C) the immediate-early response,
(D) signalling and angiogenesis,
(E) wound healing and tissue remodelling.
Data Structures
• Data matrix (n objects, p variables):
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
• Dissimilarity matrix (n x n, lower triangular):
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
• Given k, find the partition into k clusters that optimizes the chosen partitioning criterion:
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: k-means and k-medoids algorithms
  – k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
  – k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
• Step 1: Partition the objects into k non-empty subsets.
• Step 2: Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
• Step 3: Assign each object to the cluster with the nearest seed point.
• Step 4: Go back to Step 2; stop when no assignment changes.
A minimal implementation of these steps is sketched below.
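A minimal numpy sketch of the iteration above (Lloyd's algorithm). The initialization by randomly chosen objects, the empty-cluster guard and the function name are assumptions:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Basic k-means: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # arbitrary initial centers
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute centroids (keep the old one if a cluster goes empty).
        new_centroids = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):  # stop when nothing moves
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 3.0], [4.2, 3.1]])
print(k_means(X, k=2))
```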
The K-Means Clustering Method
• Example (figure: scatter plots of points in [0, 10] x [0, 10] over successive iterations):
  – K = 2
  – Arbitrarily choose K objects as the initial cluster centers
  – Assign each object to the most similar center
  – Update the cluster means
  – Reassign and update again, until the assignments no longer change
Comments on the K-Means Method
• Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters and t the number of iterations. Normally k, t << n.
• For comparison: PAM is O(k(n-k)^2), CLARA is O(ks^2 + k(n-k)).
• Comment: it often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing.
• Weaknesses:
  – Applicable only when a mean is defined; what about categorical data?
  – The number of clusters k must be specified in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
Variations of the K-Means Method
• A few variants of k-means differ in:
  – Selection of the initial k means
  – Dissimilarity calculations
  – Strategies for calculating the cluster means
• Handling categorical data: k-modes (Huang, 1998)
  – Replaces the means of clusters with modes
  – Uses new dissimilarity measures to deal with categorical objects
  – Uses a frequency-based method to update the modes of clusters
• A mixture of categorical and numerical data: the k-prototype method
What is the problem of the k-Means Method?
• The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster. (See the sketch below.)
(Figure: the same data set clustered around means versus around medoids.)
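A minimal alternating-step k-medoids sketch in the spirit of PAM, though simpler than PAM's full swap search; the function name, initialization and toy data are assumptions:

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Simple k-medoids: medoids are actual data points, robust to outliers."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    for _ in range(max_iter):
        labels = D[:, medoid_idx].argmin(axis=1)               # assign to nearest medoid
        new_idx = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            within = D[np.ix_(members, members)].sum(axis=1)   # total distance to cluster mates
            new_idx.append(members[within.argmin()])           # most central object = medoid
        new_idx = np.array(new_idx)
        if np.array_equal(new_idx, medoid_idx):                # stop when medoids are stable
            break
        medoid_idx = new_idx
    return labels, X[medoid_idx]

# An extreme outlier barely moves the medoids, unlike k-means centroids.
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [4.0, 3.0], [4.1, 3.1], [100.0, 100.0]])
print(k_medoids(X, k=2))
```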
Problem 1
Consider the following expression matrix, where the expression levels of 2 genes (G1 and G2) were analyzed in 7 healthy/infected tissues (conditions C1 to C7). Consider also the problem of grouping the tissues, given the expression profiles of the genes, using clustering algorithms. (The expression matrix is given in a figure that is not reproduced here.)
• Determine the dendrogram found by a hierarchical clustering algorithm (HCA) using a bottom-up approach, the Euclidean distance to compute the distance between conditions, and the single-link distance to compute the distance between groups (inter-cluster distance).
• How would you use the dendrogram to group the tissues into 2 groups (clusters), and what would those clusters be?
• Determine the groups found by the K-means (K = 2) algorithm when the centroids are initialized with C5 = (4, 3) and C6 = (1, 1).
Biclustering: Motivation
• Gene expression matrices have been extensively analyzed using clustering in one of two dimensions:
  – the gene dimension
  – the condition dimension
• This corresponds to the:
  – analysis of the expression patterns of genes, by comparing the rows of the matrix
  – analysis of the expression patterns of samples, by comparing the columns of the matrix
Biclustering: Motivation
Common objectives pursued when analyzing gene expression data include:
1. Grouping genes according to their expression under multiple conditions.
2. Classifying a new gene, given its expression and the expression of other genes with known classification.
3. Grouping conditions based on the expression of a number of genes.
4. Classifying a new sample, given the expression of the genes under that experimental condition.
What is Biclustering?
• Biclustering = simultaneous clustering of both the rows and the columns of a data matrix.
• The concept can be traced back to the 1970s (Hartigan, 1972), although it was rarely used or studied for a long time.
• The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis.
• The technique is also used in other fields, such as collaborative filtering, information retrieval and data mining.
What is Biclustering?
• We consider an n by m data matrix A = (X, Y), where
  – X = {x1, ..., xn} is the set of n rows
  – Y = {y1, ..., ym} is the set of m columns
  – aij is a numeric value (discrete or real) representing the relation between row i and column j.
• In the case of gene expression matrices:
  – X = set of genes
  – Y = set of conditions
  – aij = expression level of gene i under condition j (a real value).
What is Biclustering?
Gene expression matrix A = (X, Y):

          Condition 1  ...  Condition j  ...  Condition m
Gene 1    a11          ...  a1j          ...  a1m
...       ...               ...               ...
Gene i    ai1          ...  aij          ...  aim
...       ...               ...               ...
Gene n    an1          ...  anj          ...  anm
What is Biclustering?
Given the matrix A = (X, Y), let I be a subset of rows and J a subset of columns:
• (I, Y) = a subset of rows that exhibit similar behavior across the set of all columns = a cluster of rows
• (X, J) = a subset of columns that exhibit similar behavior across the set of all rows = a cluster of columns
What is Biclustering?
• (I, J) = a subset of rows and a subset of columns, where the rows exhibit similar behavior across the columns and vice-versa
  = the sub-matrix of A that contains only the elements aij with row set I and column set J
  = a bicluster
• We want to identify a set of biclusters Bk = (Ik, Jk).
• Each bicluster Bk must satisfy some specific homogeneity criteria.
What is Biclustering?
Example: a 6 x 10 gene expression matrix with entries aij,
X = {G1, G2, G3, G4, G5, G6}
Y = {C1, C2, C3, C4, C5, C6, C7, C8, C9, C10}
• Cluster of rows (I, Y): I = {G2, G3, G4}, taken across all conditions in Y.
• Cluster of columns (X, J): J = {C4, C5, C6}, taken across all genes in X.
• Bicluster (I, J) = {{G2, G3, G4}, {C4, C5, C6}}: the sub-matrix at the intersection of the rows in I and the columns in J.
What is Biclustering?
• Biclustering goals:
  – Perform simultaneous clustering on the row and column dimensions of the gene expression matrix, instead of clustering the rows and columns separately.
  – Identify sub-matrices (subsets of rows and subsets of columns) with interesting properties.
• Gene expression data analysis:
  – Identify subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities for every condition.
Madeira, Sara C. and Oliveira, Arlindo L., "Biclustering Algorithms for Biological Data Analysis: A Survey", IEEE/ACM Trans. Comput. Biol. Bioinformatics, January 2004.
Bicluster Types
• An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.
• There are four major classes of biclusters:
  1. Biclusters with constant values.
  2. Biclusters with constant values on rows or columns.
  3. Biclusters with coherent values.
  4. Biclusters with coherent evolutions.
Constant Values
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0
Constant Values on Rows or Columns

Constant rows:
1.0 1.0 1.0 1.0
2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0

Constant columns:
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
1.0 2.0 3.0 4.0
Coherent Values

Additive model:
1.0 2.0 5.0 0.0
2.0 3.0 6.0 1.0
4.0 5.0 8.0 3.0
5.0 6.0 9.0 4.0

Multiplicative model:
1.0 2.0 0.5 1.5
2.0 4.0 1.0 3.0
4.0 8.0 2.0 6.0
3.0 6.0 1.5 4.5
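These two models can be written as a_ij = alpha_i + beta_j (additive) and a_ij = alpha_i * beta_j (multiplicative). A small numpy sketch regenerating matrices of this form; the particular row/column effects are inferred so as to reproduce the example matrices above, and are assumptions:

```python
import numpy as np

# Additive model: a_ij = alpha_i + beta_j.
add_row = np.array([1.0, 2.0, 4.0, 5.0])    # per-gene effect alpha_i
add_col = np.array([0.0, 1.0, 4.0, -1.0])   # per-condition effect beta_j
additive = add_row[:, None] + add_col[None, :]

# Multiplicative model: a_ij = alpha_i * beta_j.
mul_row = np.array([1.0, 2.0, 4.0, 3.0])
mul_col = np.array([1.0, 2.0, 0.5, 1.5])
multiplicative = mul_row[:, None] * mul_col[None, :]

print(additive)        # reproduces the additive example above
print(multiplicative)  # reproduces the multiplicative example above
```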
Coherent Evolutions

Overall coherent evolution:
S1 S1 S1 S1
S1 S1 S1 S1
S1 S1 S1 S1
S1 S1 S1 S1

Coherent evolution on the rows:
S1 S1 S1 S1
S2 S2 S2 S2
S3 S3 S3 S3
S4 S4 S4 S4
Coherent Evolutions

Coherent evolution on the columns:
S1 S2 S3 S4
S1 S2 S3 S4
S1 S2 S3 S4
S1 S2 S3 S4

Order-Preserving Sub-Matrix (OPSM):
12 20 15 90
15 27 20 40
35 49 40 49
10 19 13 70
(the values in every row follow the same linear order of the columns)
Algorithms
• In the simplest formulation, where the matrix entries are binary, a bicluster corresponds to a biclique in the corresponding bipartite graph.
• Finding a maximum-size bicluster is then equivalent to finding the maximum edge biclique in a bipartite graph, a problem known to be NP-complete (Peeters, 2003).
• More complex cases, where the actual numeric values in the matrix A are taken into account to compute the quality of a bicluster, have a complexity that is necessarily no lower than this simpler case.
Algorithms
• Given this, the large majority of the algorithms use heuristic approaches to identify biclusters.
• In many cases the algorithm is preceded by a normalization step applied to the data matrix, whose goal is to make the patterns of interest more evident.
• Some algorithms avoid heuristics, but exhibit an exponential worst-case runtime.
Algorithms
• Different objectives:
  – Identify one bicluster.
  – Identify a given number of biclusters.
• Different approaches:
  – Discover one bicluster at a time.
  – Discover one set of biclusters at a time.
  – Discover all biclusters at the same time (simultaneous bicluster identification).
Algorithms: Heuristic Approaches
• Iterative row and column clustering combination:
  – Apply clustering algorithms to the rows and columns of the data matrix, separately.
  – Combine the results using some sort of iterative procedure that merges the two cluster arrangements.
• Divide and conquer:
  – Break the problem into several subproblems that are similar to the original problem but smaller in size.
  – Solve the subproblems recursively.
Algorithms: Heuristic Approaches
• Divide and conquer (continued):
  – Combine the intermediate solutions to create a solution to the original problem.
  – Usually break the matrix into submatrices (biclusters) based on a certain criterion and then continue the biclustering process on the new submatrices.
• Greedy iterative search:
  – Always make the locally optimal choice, in the hope that this choice will lead to a globally good solution.
  – Usually performs greedy row/column addition/removal.
Algorithms
• Exhaustive bicluster enumeration:
  – A number of methods have been used to speed up the exhaustive search.
  – In some cases the algorithms assume restrictions on the size of the biclusters that should be listed.
Measure cluster homogeneity
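(The figures of this slide are not reproduced here.) The homogeneity measure used in the worked example at the end of this section is Cheng and Church's mean squared residue,
$$H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2,$$
where a_iJ, a_Ij and a_IJ are the row, column and overall means of the sub-matrix. A minimal numpy sketch; the function name is an assumption:

```python
import numpy as np

def mean_squared_residue(A):
    """Cheng & Church mean squared residue H(I, J) of a sub-matrix A."""
    row_means = A.mean(axis=1, keepdims=True)   # a_iJ
    col_means = A.mean(axis=0, keepdims=True)   # a_Ij
    overall = A.mean()                          # a_IJ
    residue = A - row_means - col_means + overall
    return np.mean(residue ** 2)

# A perfectly additive bicluster has H(I, J) = 0.
B = np.array([[1.0, 2.0, 5.0, 0.0],
              [2.0, 3.0, 6.0, 1.0],
              [4.0, 5.0, 8.0, 3.0],
              [5.0, 6.0, 9.0, 4.0]])
print(mean_squared_residue(B))  # 0.0
```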
Cheng and Church's approach in practice:
• Missing values: replaced by random numbers.
• Find one bicluster at a time.
• Hide each bicluster already found by replacing its entries with random numbers, so that the next iteration finds a different one.
Example
• Consider the following expression matrix A = (X, Y), with row subset I and column subset J. (The matrix itself is given in a figure that is not reproduced here.)
• Run the brute-force deletion and addition algorithm to find a biclustering.
Example
• Run Algorithm 2 with δ = 0 (the maximum acceptable mean squared residue score) and α = 1.5 (the threshold for multiple node deletion).
Overall mean: a_IJ = 29/(4x4)
Row means (a_iJ): a_1J = 5/4, a_2J = 8/4, a_3J = 6/4, a_4J = 10/4
Column means (a_Ij): a_I1 = 7/4, a_I2 = 4/4, a_I3 = 9/4, a_I4 = 9/4
H(I,J) = (1/(4x4)) * ((a_11 - a_1J - a_I1 + a_IJ)^2 + (a_12 - a_1J - a_I2 + a_IJ)^2 + (a_13 - a_1J - a_I3 + a_IJ)^2 + (a_14 - a_1J - a_I4 + a_IJ)^2 + (a_21 - a_2J - a_I1 + a_IJ)^2 + ...) = 1.28
http://www.kemaleren.com/cheng-and-church.html
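A sketch of the single node deletion loop in the spirit of Cheng and Church's method, kept deliberately simple: the full algorithm also performs multiple node deletion (with α) and node addition, which are omitted here, and the function name is an assumption:

```python
import numpy as np

def single_node_deletion(A, delta):
    """Greedily drop the worst row/column until H(I, J) <= delta."""
    rows = np.arange(A.shape[0])
    cols = np.arange(A.shape[1])
    while True:
        B = A[np.ix_(rows, cols)]
        resid = B - B.mean(axis=1, keepdims=True) - B.mean(axis=0, keepdims=True) + B.mean()
        if np.mean(resid ** 2) <= delta:       # H(I, J) small enough: done
            return rows, cols
        row_scores = np.mean(resid ** 2, axis=1)  # d(i): mean residue of each row
        col_scores = np.mean(resid ** 2, axis=0)  # d(j): mean residue of each column
        if row_scores.max() >= col_scores.max():  # remove whichever is worse
            rows = np.delete(rows, row_scores.argmax())
        else:
            cols = np.delete(cols, col_scores.argmax())

rng = np.random.default_rng(1)
A = rng.random((6, 6))
print(single_node_deletion(A, delta=0.02))  # indices of the surviving rows and columns
```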