Microarray Data Analysis
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
March 19, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Gene Expression Data (Microarray)
p genes on n samples: rows are genes, columns are mRNA samples
Entry (i, j) = expression level of gene i in mRNA sample j,
measured as log(treated expression value / control expression value)

gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   …
2       -0.10     0.49     0.24     0.06     0.46   …
3        0.15     0.74     0.04     0.10     0.20   …
4       -0.45    -1.03    -0.79    -0.56    -0.32   …
5       -0.06     1.06     1.35     1.09    -1.09   …
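A minimal sketch of how such a matrix is built, assuming hypothetical treated/control intensity arrays; base-2 logs are conventional for expression ratios, though the slide does not fix a base:

```python
import numpy as np

# Hypothetical raw intensities (genes x samples); not from the slides.
treated = np.array([[120.0, 95.0, 210.0],
                    [ 40.0, 60.0,  33.0]])
control = np.array([[ 87.0, 75.0,  70.0],
                    [ 44.0, 57.0,  66.0]])

# Each entry is log2(treated / control): positive = up-regulated,
# negative = down-regulated relative to the control sample.
expression = np.log2(treated / control)
print(expression.round(2))
```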
Some possible applications
Sample from a specific organ to show which genes are expressed there
Compare samples from healthy and sick hosts to find gene-disease connections
Discover co-regulated genes
Discover promoters
Major Analysis Techniques
Single gene analysis: compare the expression levels of the same gene under different conditions
Main technique: significance testing (e.g., t-test); see the sketch after this list
Gene group analysis: find genes that are expressed similarly across many different conditions
Main technique: clustering (many possibilities)
Gene network analysis: analyze gene regulation relationships at a large scale
Main technique: Bayesian networks
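A sketch of the single-gene case using SciPy's two-sample t-test; the expression values are made up for illustration:

```python
from scipy import stats

# Expression of one gene across two conditions (toy numbers).
healthy = [0.46, 0.30, 0.80, 1.51, 0.90]
sick    = [-0.45, -1.03, -0.79, -0.56, -0.32]

# A small p-value suggests the gene is differentially expressed.
t_stat, p_value = stats.ttest_ind(healthy, sick)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```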
Clustering Methods
Similarity-based (need a similarity function)
Construct a partition, either agglomeratively (bottom-up) or by searching for an optimal partition
Typically "hard" clustering
Model-based (latent models, probabilistic or algebraic)
First compute the model; clusters are obtained easily once the model is known
Typically "soft" clustering
Similarity-based Clustering
Define a similarity function to measure the similarity between two objects
Common criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity
Two ways to construct the partition:
Hierarchical (e.g., agglomerative hierarchical clustering)
Search starting from a random partition (e.g., K-means)
Method 1 (Similarity-based):
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering
Given a similarity function to measure similarity between two objects
Gradually group similar objects together in a bottom-up fashion
Stop when some stopping criterion is met
Variations: different ways to compute group similarity based on individual object similarity
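A sketch with SciPy's hierarchical-clustering routines; the average-link method, correlation metric, and three-cluster cut are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 5)  # toy data: 10 genes x 5 samples

# Bottom-up merging; "average" selects average-link group similarity.
Z = linkage(X, method="average", metric="correlation")

# One possible stopping criterion: cut the tree into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```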
Similarity Measure: Pearson CC
The most popular similarity measure is the Pearson correlation coefficient (1892).
The correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn} is:
$$ r \;=\; \frac{1}{n}\sum_{k=1}^{n}\frac{(X_k-\bar{X})(Y_k-\bar{Y})}{\sigma_X\,\sigma_Y} $$

where $\bar{X}=\frac{1}{n}\sum_{k=1}^{n}X_k$, $\sigma_X^2=\frac{1}{n}\sum_{k=1}^{n}(X_k-\bar{X})^2$, and similarly for $Y$.
(Adapted from a Slide by Shin-Mu Tseng)
$s_{XY}$ is the similarity between X and Y
Better measures focus on a subset of values…
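The formula above translates directly into NumPy; a quick sketch with toy vectors, cross-checked against np.corrcoef:

```python
import numpy as np

X = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
Y = np.array([-0.10, 0.49, 0.24, 0.06, 0.46])

# r = (1/n) * sum((Xk - mean(X)) * (Yk - mean(Y))) / (sigma_X * sigma_Y)
r = np.mean((X - X.mean()) * (Y - Y.mean())) / (X.std() * Y.std())
assert np.isclose(r, np.corrcoef(X, Y)[0, 1])  # matches the library value
print(round(r, 3))
```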
Similarity-induced Structure
How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods (see the sketch below):
Single-link: s(g1, g2) = similarity of the closest pair
Complete-link: s(g1, g2) = similarity of the farthest pair
Average-link: s(g1, g2) = average similarity over all pairs
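A sketch of the three rules as one function over a pairwise similarity sim(x, y); the function name and interface are mine, not from the slides:

```python
import numpy as np

def group_similarity(g1, g2, sim, method="single"):
    """Similarity between groups g1 and g2 from pairwise similarities."""
    pair_sims = np.array([[sim(x, y) for y in g2] for x in g1])
    if method == "single":      # closest pair = highest similarity
        return pair_sims.max()
    if method == "complete":    # farthest pair = lowest similarity
        return pair_sims.min()
    return pair_sims.mean()     # average-link: mean over all pairs
```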
Three Methods Illustrated
[figure: the closest pair (single-link), farthest pair (complete-link), and all pairs (average-link) between two groups g1 and g2]
Comparison of the Three Methods
Single-link: "loose" clusters; individual decision, sensitive to outliers
Complete-link: "tight" clusters; individual decision, sensitive to outliers
Average-link: "in between"; group decision, insensitive to outliers
Which one is the best? It depends on what you need!
Method 2 (similarity-based):
K-Means
K-Means Clustering
Given a similarity function
1. Start with k randomly selected data points; assume they are the centroids of k clusters
2. Assign every data point to the cluster whose centroid is closest to it
3. Recompute the centroid of each cluster
4. Repeat steps 2-3 until the similarity-based objective function converges
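A minimal K-means sketch following these steps; it assumes Euclidean distance (the slide leaves the similarity function open) and ignores the empty-cluster corner case:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute centroids; stop once they no longer move
        new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```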
Method 3 (model-based):
Mixture Models
Mixture Model for Clustering
P(X|Cluster1)
P(X|Cluster2)
P(X|Cluster3)
P(X) = λ1·P(X|Cluster1) + λ2·P(X|Cluster2) + λ3·P(X|Cluster3)
X | Cluster_i ~ N(μ_i, σ_i²)
Mixture Model Estimation
Likelihood function:
$$ p(x) \;=\; \sum_{i=1}^{k}\lambda_i\,\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\!\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right) $$
Parameters: $\lambda_i$, $\mu_i$, $\sigma_i$
Estimated using the EM algorithm (a library-based sketch follows)
Similar to "soft" K-means
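In practice the EM fit can be delegated to a library; a sketch using scikit-learn's GaussianMixture (its multivariate Gaussians generalize the one-dimensional density above), on made-up data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 5)           # toy expression profiles
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # EM under the hood

soft = gmm.predict_proba(X)          # "soft" memberships: P(cluster_i | x)
hard = gmm.predict(X)                # argmax gives hard labels if needed
print(soft[0].round(2), hard[0])
```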
Method 4 (model-based) [if we have time]:
Singular Value Decomposition (SVD)
Also called “Latent Semantic Indexing” (LSI)
Example of “Semantic Concepts”
(Slide from C. Faloutsos’s talk)
Singular Value Decomposition (SVD)
A[n × m] = U[n × r] Σ[r × r] (V[m × r])^T
A: n × m matrix (n documents, m terms)
U: n × r matrix (n documents, r concepts)
Σ: r × r diagonal matrix (strength of each "concept"; r = rank of the matrix)
V: m × r matrix (m terms, r concepts)
(Slide from C. Faloutsos’s talk)
Example of SVD
A = U Σ V^T:

            data  inf  retrieval  brain  lung
CS docs:      1    1       1        0     0
              2    2       2        0     0
              1    1       1        0     0
              5    5       5        0     0
MD docs:      0    0       0        2     2
              0    0       0        3     3
              0    0       0        1     1

U (document rep. of concepts):    0.18  0
                                  0.36  0
                                  0.18  0
                                  0.90  0
                                  0     0.53
                                  0     0.80
                                  0     0.27

× Σ (strength of each concept):   9.64  0
                                  0     5.29

× V^T (term rep. of concepts):    0.58  0.58  0.58  0     0
                                  0     0     0     0.71  0.71

The first column of U (and row of V^T) is the CS-concept; the second is the MD-concept.

(Slide adapted from C. Faloutsos's talk)
Dim. Reduction
A ≈ U Σ V^T, keeping only the largest singular values (the strongest concepts)
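The example above can be checked with NumPy's SVD; since A has rank 2, keeping the top two singular values reconstructs it exactly:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s.round(2))                    # [9.64 5.29 0. 0. 0.] -- concept strengths

k = 2                                # keep the two real concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.allclose(A, A_k)           # rank-2 approximation is exact here
```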
More clustering methods and software
Partitioning: K-Means, K-Medoids, PAM, CLARA, …
Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK, …
Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …
Grid-based: STING, CLIQUE, WaveCluster, …
Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …
Two-way Clustering
Block clustering: cluster genes and samples simultaneously, looking for homogeneous blocks in the expression matrix
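One possible implementation of the two-way idea is scikit-learn's SpectralBiclustering, which clusters rows (genes) and columns (samples) at the same time; shown here on toy data:

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

X = np.random.rand(50, 10)           # toy data: 50 genes x 10 samples

# 3 gene clusters x 2 sample clusters -> a checkerboard of blocks
model = SpectralBiclustering(n_clusters=(3, 2), random_state=0).fit(X)
print(model.row_labels_[:10])        # gene cluster assignments
print(model.column_labels_)          # sample cluster assignments
```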