University of Crete CS483 1
The use of Minimum Spanning Trees in microarray expression data
Gkirtzou Ekaterini
Introduction
Classic clustering algorithms, like K-means, self-organizing maps, etc., have certain drawbacks:
No guarantee of globally optimal results
Dependence on the geometric shape of cluster boundaries (K-means)
Introduction
MST clustering algorithms:
Expression data clustering analysis (Xu et al., 2001)
Iterative clustering algorithm (Varma et al., 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Definitions
A minimum spanning tree (MST) of a weighted, undirected graph $G = (V, E)$ with edge weights $w(e)$ is an acyclic subgraph $T \subseteq G$ that contains all of the vertices and whose total weight $w(T) = \sum_{e \in T} w(e)$ is minimum.
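The MST at the heart of all three methods can be built with Prim's algorithm. A minimal sketch in Python, assuming the complete graph is given as a symmetric distance matrix (function and variable names are illustrative, not from the paper):

```python
def prim_mst(dist):
    """Return the MST edges (parent, vertex, weight) of a complete graph
    given by a symmetric distance matrix `dist`."""
    n = len(dist)
    # best[v] = (cheapest cost to connect v to the tree, tree vertex achieving it)
    best = {v: (dist[0][v], 0) for v in range(1, n)}
    edges = []
    for _ in range(n - 1):
        # pick the cheapest vertex not yet in the tree
        v = min(best, key=lambda u: best[u][0])
        cost, parent = best.pop(v)
        edges.append((parent, v, cost))
        # relax remaining vertices through the newly added vertex
        for u in best:
            if dist[v][u] < best[u][0]:
                best[u] = (dist[v][u], v)
    return edges

# Toy 1-D "expression" values and their pairwise distances
points = [0.0, 1.0, 1.1, 5.0]
dist = [[abs(a - b) for b in points] for a in points]
mst = prim_mst(dist)
```

For n genes this runs in O(n²), which matches the complete graph these methods build.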
Definitions
The DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. Its uses:
Compare the activity of genes in diseased and healthy cells
Categorize a disease into subgroups
Drug discovery and toxicology studies
Definitions
Clustering is a common technique for data analysis. Clustering partitions the data set into subsets (clusters), so that the data in each subset share some common trait.
MST clustering algorithms
Expression data clustering analysis (Xu et al., 2001)
Iterative clustering algorithm (Varma et al., 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Expression data clustering analysis
Let $D = \{d_i\}$ be a set of expression data, with each $d_i = (e_i^1, \dots, e_i^t)$ representing the expression levels of gene $i$ at times 1 through $t$. We define a weighted, undirected graph $G = (V, E)$ as follows: the vertex set $V = \{d_i \mid d_i \in D\}$ and the edge set $E = \{(d_i, d_j) \mid d_i, d_j \in D \text{ and } i \neq j\}$.
Expression data clustering analysis
G is a complete graph. The weight of an edge is the distance between its two vertices, e.g. Euclidean distance, correlation coefficient, etc.
Each cluster corresponds to one subtree of the MST.
No essential information for clustering is lost.
Clustering through removing long MST-edges
Based on the intuitive notion of a cluster
Works very well when inter-cluster edges are longer than intra-cluster ones
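This step can be sketched as follows: drop the K−1 longest MST edges and read off the connected components with a small union-find (names are illustrative):

```python
def mst_clusters(n, mst_edges, k):
    """Cluster n vertices into k groups by cutting the k-1 longest MST edges.
    `mst_edges` is a list of (i, j, weight) tuples."""
    # keep the n-k shortest edges, i.e. drop the k-1 longest
    keep = sorted(mst_edges, key=lambda e: e[2])[: n - k]
    parent = list(range(n))                      # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]        # path halving
            x = parent[x]
        return x

    for i, j, _ in keep:
        parent[find(i)] = find(j)                # union the endpoints

    labels = [find(v) for v in range(n)]
    # renumber cluster labels 0..k-1 in order of first appearance
    order = {root: c for c, root in enumerate(dict.fromkeys(labels))}
    return [order[root] for root in labels]

edges = [(0, 1, 1.0), (1, 2, 0.1), (2, 3, 3.9)]
labels = mst_clusters(4, edges, 2)   # cut the single longest edge
```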
An iterative clustering

Minimizes the total distance between the center of each cluster and its data.
Starts with K arbitrary clusters of the MST; for each pair of adjacent clusters, finds the edge to cut which optimizes

$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i))$  (1)
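The objective (1) can be evaluated for any candidate partition as below; taking the cluster center to be the coordinate-wise mean is an assumption for illustration:

```python
import math

def within_cluster_cost(points, labels):
    """Sum over clusters of the distances from each point to its
    cluster center (the coordinate-wise mean, an assumption here)."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        center = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        for p in members:
            cost += math.dist(p, center)
    return cost

cost = within_cluster_cost([(0.0,), (2.0,), (10.0,)], [0, 0, 1])
```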
A globally optimal clustering

Tries to partition the tree into K subtrees.
Selects K representatives $d_i^{*}$, one per subtree, to optimize

$\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, d_i^{*})$  (2)
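For a fixed partition into subtrees, the best representative of each subtree under (2) is its medoid, the member minimizing the summed distance to all other members. A minimal sketch, assuming Euclidean distance:

```python
import math

def best_representative(points):
    """Return the medoid: the point minimizing the summed distance
    to all points in the subtree."""
    return min(points,
               key=lambda r: sum(math.dist(p, r) for p in points))

rep = best_representative([(0.0,), (1.0,), (10.0,)])
```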
MST clustering algorithms
Expression data clustering analysis (Xu et al -2001)
Iterative clustering algorithm (Varma et al - 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al - 2004)
Iterative clustering algorithm
The clustering measure used here is the Fukuyama–Sugeno measure

$FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left( \lVert x_{kj} - \mu_k \rVert^2 - \lVert \mu_k - \mu \rVert^2 \right)$

where $S_1$, $S_2$ are the two partitions of the set $S$, each containing $N_k$ samples; $\mu_k$ denotes the mean of the samples in $S_k$, $\mu$ the global mean of all samples, and $x_{kj}$ the $j$-th sample in cluster $S_k$.
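The measure can be sketched in Python for 1-D samples (an assumption for brevity; the squared norms generalize directly to vectors):

```python
def fukuyama_sugeno(s1, s2):
    """Fukuyama-Sugeno measure for a crisp two-way partition of 1-D samples:
    FS = sum_k sum_j ((x_kj - mu_k)^2 - (mu_k - mu)^2)."""
    def mean(xs):
        return sum(xs) / len(xs)

    mu = mean(s1 + s2)          # global mean of all samples
    fs = 0.0
    for cluster in (s1, s2):
        mu_k = mean(cluster)    # cluster mean
        for x in cluster:
            fs += (x - mu_k) ** 2 - (mu_k - mu) ** 2
    return fs

fs = fukuyama_sugeno([0.0, 2.0], [10.0, 12.0])
```

Since the algorithm keeps the partition with minimum F-S measure, compact, well-separated clusters should give a strongly negative value, as in this toy example.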
Iterative clustering algorithm
Feature selection counts each gene's support for a partition.
The feature selection used here is the t-statistic with pooled variance; the t-statistic is a heuristic measure.
Genes whose absolute t-statistic exceeds a threshold are selected.
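The selection step can be sketched as follows (the threshold value and helper names are illustrative, not from the paper):

```python
import math

def pooled_t(x, y):
    """Two-sample t-statistic with pooled variance."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    v1 = sum((a - m1) ** 2 for a in x) / (n1 - 1)   # sample variances
    v2 = sum((b - m2) ** 2 for b in y) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def select_genes(expr1, expr2, threshold):
    """Keep genes whose |t| across the two partitions exceeds the threshold.
    expr1[g], expr2[g] are the samples of gene g in each partition."""
    return [g for g, (x, y) in enumerate(zip(expr1, expr2))
            if abs(pooled_t(x, y)) > threshold]

# Gene 0 separates the partitions sharply; gene 1 does not.
expr1 = [[1.0, 1.1, 0.9], [1.0, 5.0, 3.0]]
expr2 = [[5.0, 5.1, 4.9], [2.0, 4.0, 3.0]]
selected = select_genes(expr1, expr2, threshold=2.0)
```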
Iterative clustering algorithm
Create an MST from all genes.
Delete edges from the MST to obtain binary partitions; select the partition with minimum F-S clustering measure.
Feature selection is then used to select a subset of genes that discriminate between the clusters.
Iterative clustering algorithm
In the next iteration, the clustering is done on this selected set of genes, until the selected gene subset converges.
The converged genes are then removed from the pool and the procedure continues.
MST clustering algorithms
Expression data clustering analysis (Xu et al., 2001)
Iterative clustering algorithm (Varma et al., 2004)
Dynamically growing self-organizing tree (DGSOT) (Luo et al., 2004)
Dynamically growing self-organizing tree (DGSOT)
In the previous algorithms the MST is constructed on the original data set and used to capture the intra-cluster property, while here the MST is used as a criterion to test the inter-cluster property.
DGSOT algorithm
A tree-structured self-organizing neural network.
Grows both vertically and horizontally.
Starts with a single root leaf node.
In every vertical-growing phase, each leaf node whose heterogeneity exceeds a threshold $T_R$ gets two descendants, and the learning process takes place.
DGSOT algorithm: heterogeneity

Variability: the maximum distance between the leaf's input data and the node.
Average distortion $d$ of leaf $i$:

$d = \frac{1}{D} \sum_{j=1}^{D} \mathrm{dist}(x_j, n_i)$

where $D$ is the total number of input data of leaf $i$, $\mathrm{dist}(x_j, n_i)$ the distance between data $x_j$ and leaf $i$, and $n_i$ the reference vector of leaf $i$.
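The average distortion follows directly from the definition; a minimal sketch, assuming Euclidean distance:

```python
import math

def average_distortion(data, reference):
    """Mean distance between a leaf's reference vector and the
    input data assigned to that leaf."""
    return sum(math.dist(x, reference) for x in data) / len(data)

d = average_distortion([(0.0, 0.0), (0.0, 2.0)], (0.0, 1.0))
```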
DGSOT algorithm
In every horizontal-growing phase, a child is added to each lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place.
The learning process distributes the data among the leaves in the best way: the best-matching node is the one with minimum distance to the input data.
The validation criterion of DGSOT
Calculated without human intervention.
Based on geometric characteristics of the clusters.
Create the Voronoi diagram of the input data: it divides the data set $D$ into $n$ regions $V(p)$, where

$V(p) = \{ x \in D \mid \mathrm{dist}(x, p) \le \mathrm{dist}(x, q) \ \forall q \neq p \}$
The validation criterion of DGSOT
Let's define a weighted, undirected graph $G = (V, E)$: the vertex set is the set of the centroids of the Voronoi cells $V(p)$, and the edge set is defined as

$E = \{ (p_i, p_j) \mid p_i, p_j \text{ centroids of the Voronoi cells and } i \neq j \}$

Create the MST of the graph $G$.
Voronoi diagram of 2D dataset
In A, the dataset is partitioned into three Voronoi cells; the MST of the centroids is 'even'.
In B, the dataset is partitioned into four Voronoi cells; the MST of the centroids is not 'even'.
The validation criterion of DGSOT
Cluster separation

$CS = \frac{E_{\min}}{E_{\max}}$

where $E_{\min}$ is the minimum-length edge and $E_{\max}$ the maximum-length edge of the MST. A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
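The criterion reduces to a ratio over the MST edge lengths; a minimal sketch:

```python
def cluster_separation(mst_edge_lengths):
    """CS = E_min / E_max over the edge lengths of the centroid MST.
    CS near 1 means an 'even' MST (valid partition); CS near 0 means
    two centroids are too close (invalid partition)."""
    return min(mst_edge_lengths) / max(mst_edge_lengths)

even = cluster_separation([1.0, 1.0, 1.0])     # 'even' MST
uneven = cluster_separation([1.0, 4.0])        # one edge much longer
```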
Example of DGSOT
Conclusions
The tree-based algorithms presented in this report provide results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.
Questions