Basic Gene Expression Data Analysis--Clustering
Gene Expression Data
Distance/Similarity Matrix
Gene Clusters
Regulatory Elements / Gene Functions
Pairwise Measures
Clustering
Motif Searching/Network Construction
Integrated Analysis (NMR/SNP/Clinic/…)
Microarray Experiment
Control
Treated
mRNA
RT and label with fluorescent dyes
cDNA
Spot (DNA probe): known cDNA or oligo
Mix and hybridize target to microarray
Collections of Experiments
• Time course after a treatment
• Different treatments
• Disease cell lines
• Data are represented in a matrix
Cluster Analysis
• Grouping of genes with “similar” expression profiles
• Grouping of disease cell lines/toxicants with “similar” effects on gene expression
• Clustering algorithms
– Hierarchical clustering
– Self-organizing maps
– K-means clustering
Normalized Expression Data
Gene Expression Clustering
Protein/protein complex
DNA regulatory elements
Semantics of clusters: from co-expressed to co-regulated
Key Terms in Cluster Analysis
• Distance & Similarity measures
• Hierarchical & non-hierarchical
• Single/complete/average linkage
• Dendrograms & ordering
Measuring Similarity of Gene Expression
• Euclidean (L2) distance
• Manhattan (L1) distance
• L_m distance: (|x1 - x2|^m + |y1 - y2|^m)^(1/m)
• L∞ distance: max(|x1 - x2|, |y1 - y2|)
• Inner product: x1·x2 + y1·y2
• Correlation coefficient
• Spearman rank correlation coefficient

(Figure: two points (x1, y1) and (x2, y2) in the plane.)
Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features:

x = (x_1, x_2, \ldots, x_p), \quad y = (y_1, y_2, \ldots, y_p)

The Minkowski metric is defined by

d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^r \right)^{1/r}
Commonly Used Minkowski Metrics
1. r = 2 (Euclidean distance): d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^2 \right)^{1/2}

2. r = 1 (Manhattan distance): d(x, y) = \sum_{i=1}^{p} |x_i - y_i|

3. r = \infty ("sup" distance): d(x, y) = \max_{1 \le i \le p} |x_i - y_i|
An Example
Two points differ by 4 in one coordinate and 3 in the other:

1. Euclidean distance: \sqrt{4^2 + 3^2} = 5
2. Manhattan distance: 4 + 3 = 7
3. "sup" distance: \max\{4, 3\} = 4
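The three values in this example can be checked with a short script. This is an illustrative sketch, not code from the slides; the two points are taken to be (0, 0) and (4, 3), which differ by 4 and 3 in their coordinates.

```python
# Minkowski distance d(x, y) = (sum_i |x_i - y_i|^r)^(1/r),
# with r = float("inf") giving the "sup" (Chebyshev) distance.
def minkowski(x, y, r):
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if r == float("inf"):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

p, q = (0, 0), (4, 3)  # hypothetical points matching the example's 4 and 3
print(minkowski(p, q, 2))             # Euclidean: 5.0
print(minkowski(p, q, 1))             # Manhattan: 7.0
print(minkowski(p, q, float("inf")))  # sup: 4
```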
Manhattan distance is called Hamming distance when all features are binary.
Gene Expression Levels Under 17 Conditions (1 = High, 0 = Low)

Condition: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Gene A:    1 1 0 1 1 1 1 1 1 0  0  0  0  1  1  1  0
Gene B:    1 0 0 1 1 1 0 0 1 0  0  1  0  0  1  1  0

Hamming distance: #(0 → 1) + #(1 → 0) = 1 + 4 = 5
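A quick sketch confirming the count (the two binary profiles are as reconstructed from the slide, so treat them as illustrative):

```python
# Hamming distance: Manhattan distance when all features are binary,
# i.e. the number of positions where the two profiles differ.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

gene_a = "11011111100001110"  # 17 conditions
gene_b = "10011100100100110"
print(hamming(gene_a, gene_b))  # 5
```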
From Clustering to Correlation
(Figure: three panels plotting expression level over time for Gene A and Gene B.)
Similarity Measures: Correlation Coefficient
s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \, \sum_{i=1}^{p} (y_i - \bar{y})^2}}

Averages: \bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i and \bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i.

|s(x, y)| \le 1
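A direct transcription of the formula as a sketch (not from the slides); the two example profiles are hypothetical:

```python
import math

# Pearson correlation coefficient s(x, y) for two expression profiles
# of equal length p, as defined above.
def correlation(x, y):
    p = len(x)
    mx, my = sum(x) / p, sum(y) / p
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return num / den

up = [1.0, 2.0, 3.0, 4.0]
print(correlation(up, [2.0, 4.0, 6.0, 8.0]))  # 1.0: same shape, different scale
print(correlation(up, [8.0, 6.0, 4.0, 2.0]))  # -1.0: mirror-image profile
```

Note that correlation is a similarity, not a distance: profiles with the same shape score 1 even when their absolute expression levels differ.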
Hierarchical Clustering: given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
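The four steps above can be sketched in pure Python. This is a naive illustration, not slide code: it uses single linkage for the cluster-to-cluster distance, and `dist` is any pairwise distance function.

```python
# Naive agglomerative (hierarchical) clustering, single linkage.
def hierarchical(points, dist):
    clusters = [[i] for i in range(len(points))]   # step 1: one item per cluster
    merges = []                                    # (distance, merged members) per step
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):             # step 2: find the closest pair
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]         # step 3: merge and record
        merges.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)                    # step 4: repeat until one cluster
    return merges

def euclid(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

# Two tight pairs merge first, then the pairs merge into one cluster.
print(hierarchical([(0, 0), (0, 1), (5, 5), (5, 6)], euclid))
```

The returned merge list is exactly the information a dendrogram draws: which items joined, and at what distance.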
Normalized Expression Data
Hierarchical Clustering
(Figure: dendrogram over items a, b, c, d.)
Hierarchical Clustering
3 clusters?
2 clusters?
Cluster Analysis • Eisen et al. (1998) (PNAS, 95:14863)
Correlation as measure of co-expression
Experiment over time
time: t0, t1, t2, ...
control; N genes

N×N correlation matrix:
        1     2     3   ...
1     1.00  0.23  0.02  ...
2           1.00  0.34  ...
3                 1.00  ...

After merging genes 2 & 3:
        1    (2&3)
1     1.00   0.28
(2&3)        1.00
Cluster Analysis
• Scan matrix for maximum
• Join genes to 1 node
• Update matrix
Cluster Analysis
• Result: dendrogram assembling N genes
• Points of discussion
– similarity based, useful for co-expression
– dependent on similarity measure?
– useful in preliminary scans
– biological relevance of clusters?
Distance Between Two Clusters
Min distance
Average distance
Max distance
Single-Link Method / Nearest Neighbor
Complete-Link / Furthest Neighbor
Centroid method: distance between their centroids.
Average linkage: average of all cross-cluster pairs.
• single-link clustering (also called the connectedness or minimum method) : we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.
• complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.
• average-link clustering : we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
Single-Link Method
Euclidean distance matrix:

      a   b   c
  b   2
  c   5   6
  d   3   5   4

(1) The smallest entry is d(a, b) = 2, so merge a and b.
(2) d({a,b}, c) = min(5, 6) = 5; d({a,b}, d) = min(3, 5) = 3; d(c, d) = 4. The smallest is 3, so merge {a,b} and d.
(3) d({a,b,d}, c) = min(5, 6, 4) = 4, so merge everything into {a,b,c,d}.
Complete-Link Method
Euclidean distance matrix:

      a   b   c
  b   2
  c   5   6
  d   3   5   4

(1) The smallest entry is d(a, b) = 2, so merge a and b.
(2) d({a,b}, c) = max(5, 6) = 6; d({a,b}, d) = max(3, 5) = 5; d(c, d) = 4. The smallest is 4, so merge c and d.
(3) d({a,b}, {c,d}) = max(5, 6, 3, 5) = 6, so merge everything into {a,b,c,d}.
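The two worked examples can be checked mechanically. This sketch is not from the slides; it assumes the 4-point distance matrix as reconstructed above, and switches linkage by passing `min` or `max` as the cluster-to-cluster rule.

```python
# Worked example: single vs. complete linkage on the 4-point matrix.
D = {("a", "b"): 2, ("a", "c"): 5, ("b", "c"): 6,
     ("a", "d"): 3, ("b", "d"): 5, ("c", "d"): 4}

def dist(p, q):
    return D[tuple(sorted((p, q)))]

def agglomerate(items, linkage):
    clusters = [frozenset([i]) for i in items]
    history = []  # (merge distance, members of the merged cluster)
    while len(clusters) > 1:
        d, i, j = min(
            (linkage(dist(p, q) for p in clusters[i] for q in clusters[j]), i, j)
            for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        merged = clusters[i] | clusters[j]
        history.append((d, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

print(agglomerate("abcd", min))  # single link: {a,b} at 2, +d at 3, +c at 4
print(agglomerate("abcd", max))  # complete link: {a,b} at 2, {c,d} at 4, all at 6
```

The same data gives different trees under the two linkages, which is the point of the side-by-side example.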
Identifying disease genes
(Figure: clustered heat map comparing non-tumor and tumor liver samples; labeled gene clusters include liver-specific genes, ribosomal proteins, proliferation genes, and endothelial cells.)
X. Chen, P.O. Brown, et al., Molecular Biology of the Cell, Vol. 13, 1929-1939, June 2002
• Human tumor patient and normal cells; various conditions
• Cluster or Classify genes according to tumors
• Cluster tumors according to genes
K-Means Clustering Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest center:
3) Compute the new centers of the clusters:
4) Repeat steps 2 and 3 until no object changes cluster
C(S) = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S = \{X_1, X_2, \ldots, X_n\}
K-Means Clustering
• Basic idea: use cluster centres (means) to represent clusters
• Assign each data element to the closest cluster (centre)
• Goal: minimise the square error (intra-class dissimilarity):
  \sum_{i} d(x_i, C(x_i))^2
• Variations of K-Means
– Initialisation (select the number of clusters, initial partitions)
– Updating of centres
– Hill-climbing (trying to move an object to another cluster)
This method initially takes the number of components of the population equal to the final required number of clusters. In this step the final required number of clusters is chosen such that the points are mutually farthest apart. Next, each component in the population is examined and assigned to one of the clusters depending on the minimum distance. The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.
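The four numbered steps can be sketched in plain Python. This is an illustrative toy, not slide code: 2-D points, squared Euclidean distance for assignment, and a random sample of the data as initial centres (rather than the farthest-apart initialisation just described).

```python
import random

# Minimal k-means sketch: assign to nearest centre, recompute means, repeat.
def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # step 1: initial partition
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # step 2: assign each object to the cluster with the closest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # step 3: new centres are the means of the current clusters
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]
        if new == centers:                          # step 4: stop when nothing moves
            break
        centers = new
    return centers, clusters

# Two obvious groups near (0,0) and (9,9); k-means recovers their means.
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, 2)
print(sorted(centers))
```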
The K-Means Clustering Method
• Example
(Figure: four scatter plots on 0-10 axes showing successive iterations of the k-means procedure.)
k-means Clustering : Procedure (1)
Initialization 1: specify the number of clusters k (for example, k = 4)

(Figure: points from a 5000-gene expression matrix; each plotted point is called a "gene".)
k-means Clustering : Procedure (2)
Initialization 2: genes are randomly assigned to one of the k clusters
k-means Clustering : Procedure (3)
Calculate the mean of each cluster
m_c = \frac{1}{N_C} \sum_{i \in C} g_i

(Figure: blue-cluster points shown include (1,2), (3,2), (3,4), (6,7).)

m_{BLUE} = \frac{1}{4} \left[ (6,7) + (3,4) + \ldots \right]
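For instance, assuming the blue cluster's members are exactly the four points shown on the slide (an illustrative sketch, not slide code):

```python
# Mean of one cluster: m_c = (1/N_C) * sum of member genes, coordinate-wise.
def cluster_mean(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

blue = [(1, 2), (3, 2), (3, 4), (6, 7)]  # assumed blue-cluster members
print(cluster_mean(blue))  # (3.25, 3.75)
```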
k-means Clustering : Procedure (4)
Each gene is reassigned to the nearest cluster
c_i = \arg\min_j \| m_j - g_i \|^2

(gene i is reassigned to cluster c_i)
k-means Clustering : Procedure (5)
Iterate until the means converge
k-means clustering : application
6220 yeast genes, 15 time points during the cell cycle

(Figure: clusters annotated with cell-cycle phases: M/G1 phase, G1 phase, M phase.)
Result: 13 of the 30 clusters had statistical significance for a biological function
S. Tavazoie, G.M. Church, et al., Nature Genetics, Vol. 22, July 1999
Computation Time and Memory Requirement (n genes, m experiments)

Computation time:
• Hierarchical clustering: O(m n^2 log n)
• K-means clustering (t iterations): O(k t m n)

Memory requirement:
• Hierarchical clustering: O(mn + n^2)
• K-means clustering: O(mn + kn)
Issues in Cluster Analysis
• A lot of clustering algorithms
• A lot of distance/similarity metrics
• Which clustering algorithm runs faster and uses less memory?
• How many clusters after all?
• Are the clusters stable?
• Are the clusters meaningful?
K-Means vs Hierarchical Clustering
Pattern Recognition
• Clarifying decision-making processes and automating them using computers
Supervised:
• known number of classes
• based on a training set
• used to classify future observations

Unsupervised:
• unknown number of classes
• no prior knowledge
• cluster analysis is one form