Microarray Data Analysis
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
March 19, 2004
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Gene Expression Data (Microarray)
p genes on n samples: rows are genes, columns are mRNA samples
Entry (i, j) = expression level of gene i in mRNA sample j,
measured as log(treated expression value / control expression value)

gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   …
2       -0.10     0.49     0.24     0.06     0.46   …
3        0.15     0.74     0.04     0.10     0.20   …
4       -0.45    -1.03    -0.79    -0.56    -0.32   …
5       -0.06     1.06     1.35     1.09    -1.09   …
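A minimal sketch of how such a matrix is built, assuming hypothetical treated/control intensity arrays; base-2 logs are conventional for expression ratios, though the slide does not fix a base:

```python
import numpy as np

# Hypothetical raw intensities (genes x samples); not from the slides.
treated = np.array([[120.0, 95.0, 210.0],
                    [ 40.0, 60.0,  33.0]])
control = np.array([[ 87.0, 75.0,  70.0],
                    [ 44.0, 57.0,  66.0]])

# Each entry is log2(treated / control): positive = up-regulated,
# negative = down-regulated relative to the control sample.
expression = np.log2(treated / control)
print(expression.round(2))
```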
Some possible applications
Sample from a specific organ to show which genes are expressed there
Compare samples from healthy and sick hosts to find gene-disease connections
Discover co-regulated genes
Discover promoters
Major Analysis Techniques
Single gene analysis: compare the expression levels of the same gene under different conditions
Main technique: significance testing (e.g., t-test); see the sketch after this list
Gene group analysis: find genes that are expressed similarly across many different conditions
Main technique: clustering (many possibilities)
Gene network analysis: analyze gene regulation relationships at a large scale
Main technique: Bayesian networks
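A sketch of the single-gene case using SciPy's two-sample t-test; the expression values are made up for illustration:

```python
from scipy import stats

# Expression of one gene across two conditions (toy numbers).
healthy = [0.46, 0.30, 0.80, 1.51, 0.90]
sick    = [-0.45, -1.03, -0.79, -0.56, -0.32]

# A small p-value suggests the gene is differentially expressed.
t_stat, p_value = stats.ttest_ind(healthy, sick)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```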
Clustering Methods
Similarity-based (need a similarity function)
Construct a partition, either agglomeratively (bottom-up) or by searching for an optimal partition
Typically "hard" clustering
Model-based (latent models, probabilistic or algebraic)
First compute the model; clusters are obtained easily once the model is known
Typically "soft" clustering
Similarity-based Clustering
Define a similarity function to measure the similarity between two objects
Common criterion: find a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity
Two ways to construct the partition:
Hierarchical (e.g., agglomerative hierarchical clustering)
Search starting from a random partition (e.g., K-means)
Method 1 (Similarity-based):
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering
Given a similarity function to measure similarity between two objects
Gradually group similar objects together in a bottom-up fashion
Stop when some stopping criterion is met
Variations: different ways to compute group similarity based on individual object similarity
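A sketch with SciPy's hierarchical-clustering routines; the average-link method, correlation metric, and three-cluster cut are illustrative choices, not prescribed by the slides:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(10, 5)  # toy data: 10 genes x 5 samples

# Bottom-up merging; "average" selects average-link group similarity.
Z = linkage(X, method="average", metric="correlation")

# One possible stopping criterion: cut the tree into 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```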
Similarity Measure: Pearson CC
The most popular similarity measure is the Pearson correlation coefficient (1892).
The correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn} is:
$$ r \;=\; \frac{1}{n}\sum_{k=1}^{n}\frac{(X_k-\bar{X})(Y_k-\bar{Y})}{\sigma_X\,\sigma_Y} $$

where $\bar{X}=\frac{1}{n}\sum_{k=1}^{n}X_k$, $\sigma_X^2=\frac{1}{n}\sum_{k=1}^{n}(X_k-\bar{X})^2$, and similarly for $Y$.
(Adapted from a Slide by Shin-Mu Tseng)
$s_{XY}$ is the similarity between X and Y
Better measures focus on a subset of values…
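The formula above translates directly into NumPy; a quick sketch with toy vectors, cross-checked against np.corrcoef:

```python
import numpy as np

X = np.array([0.46, 0.30, 0.80, 1.51, 0.90])
Y = np.array([-0.10, 0.49, 0.24, 0.06, 0.46])

# r = (1/n) * sum((Xk - mean(X)) * (Yk - mean(Y))) / (sigma_X * sigma_Y)
r = np.mean((X - X.mean()) * (Y - Y.mean())) / (X.std() * Y.std())
assert np.isclose(r, np.corrcoef(X, Y)[0, 1])  # matches the library value
print(round(r, 3))
```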
Similarity-induced Structure
How to Compute Group Similarity?
Given two groups g1 and g2, three popular methods (see the sketch below):
Single-link: s(g1, g2) = similarity of the closest pair
Complete-link: s(g1, g2) = similarity of the farthest pair
Average-link: s(g1, g2) = average similarity over all pairs
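A sketch of the three rules as one function over a pairwise similarity sim(x, y); the function name and interface are mine, not from the slides:

```python
import numpy as np

def group_similarity(g1, g2, sim, method="single"):
    """Similarity between groups g1 and g2 from pairwise similarities."""
    pair_sims = np.array([[sim(x, y) for y in g2] for x in g1])
    if method == "single":      # closest pair = highest similarity
        return pair_sims.max()
    if method == "complete":    # farthest pair = lowest similarity
        return pair_sims.min()
    return pair_sims.mean()     # average-link: mean over all pairs
```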
Three Methods Illustrated
[figure: the closest pair (single-link), farthest pair (complete-link), and all pairs (average-link) between two groups g1 and g2]
Comparison of the Three Methods
Single-link: "loose" clusters; individual decision, sensitive to outliers
Complete-link: "tight" clusters; individual decision, sensitive to outliers
Average-link: "in between"; group decision, insensitive to outliers
Which one is the best? It depends on what you need!
Method 2 (similarity-based):
K-Means
K-Means Clustering
Given a similarity function
1. Start with k randomly selected data points; assume they are the centroids of k clusters
2. Assign every data point to the cluster whose centroid is closest to it
3. Recompute the centroid of each cluster
4. Repeat steps 2-3 until the similarity-based objective function converges
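A minimal K-means sketch following these steps; it assumes Euclidean distance (the slide leaves the similarity function open) and ignores the empty-cluster corner case:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute centroids; stop once they no longer move
        new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```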
Method 3 (model-based):
Mixture Models
Mixture Model for Clustering
P(X|Cluster1)
P(X|Cluster2)
P(X|Cluster3)
P(X) = λ1·P(X|Cluster1) + λ2·P(X|Cluster2) + λ3·P(X|Cluster3)
X | Cluster_i ~ N(μ_i, σ_i²)
Mixture Model Estimation
Likelihood function:
$$ p(x) \;=\; \sum_{i=1}^{k}\lambda_i\,\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\!\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right) $$
Parameters: $\lambda_i$, $\mu_i$, $\sigma_i$
Estimated using the EM algorithm (a library-based sketch follows)
Similar to "soft" K-means
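In practice the EM fit can be delegated to a library; a sketch using scikit-learn's GaussianMixture (its multivariate Gaussians generalize the one-dimensional density above), on made-up data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(100, 5)           # toy expression profiles
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # EM under the hood

soft = gmm.predict_proba(X)          # "soft" memberships: P(cluster_i | x)
hard = gmm.predict(X)                # argmax gives hard labels if needed
print(soft[0].round(2), hard[0])
```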
Method 4 (model-based) [if we have time]:
Singular Value Decomposition (SVD)
Also called “Latent Semantic Indexing” (LSI)
Example of “Semantic Concepts”
(Slide from C. Faloutsos’s talk)
Singular Value Decomposition (SVD)
A[n × m] = U[n × r] Σ[r × r] (V[m × r])^T
A: n × m matrix (n documents, m terms)
U: n × r matrix (n documents, r concepts)
Σ: r × r diagonal matrix (strength of each "concept"; r = rank of the matrix)
V: m × r matrix (m terms, r concepts)
(Slide from C. Faloutsos’s talk)
Example of SVD
A = U Σ V^T:

            data  inf  retrieval  brain  lung
CS docs:      1    1       1        0     0
              2    2       2        0     0
              1    1       1        0     0
              5    5       5        0     0
MD docs:      0    0       0        2     2
              0    0       0        3     3
              0    0       0        1     1

U (document rep. of concepts):    0.18  0
                                  0.36  0
                                  0.18  0
                                  0.90  0
                                  0     0.53
                                  0     0.80
                                  0     0.27

× Σ (strength of each concept):   9.64  0
                                  0     5.29

× V^T (term rep. of concepts):    0.58  0.58  0.58  0     0
                                  0     0     0     0.71  0.71

The first column of U (and row of V^T) is the CS-concept; the second is the MD-concept.

(Slide adapted from C. Faloutsos's talk)
Dim. Reduction
A ≈ U Σ V^T, keeping only the largest singular values (the strongest concepts)
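The example above can be checked with NumPy's SVD; since A has rank 2, keeping the top two singular values reconstructs it exactly:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s.round(2))                    # [9.64 5.29 0. 0. 0.] -- concept strengths

k = 2                                # keep the two real concepts
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.allclose(A, A_k)           # rank-2 approximation is exact here
```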
More clustering methods and software
Partitioning: K-Means, K-Medoids, PAM, CLARA, …
Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK, …
Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …
Grid-based: STING, CLIQUE, WaveCluster, …
Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …
Two-way Clustering
Block clustering: cluster genes and samples simultaneously, looking for homogeneous blocks in the expression matrix
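One possible implementation of the two-way idea is scikit-learn's SpectralBiclustering, which clusters rows (genes) and columns (samples) at the same time; shown here on toy data:

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering

X = np.random.rand(50, 10)           # toy data: 50 genes x 10 samples

# 3 gene clusters x 2 sample clusters -> a checkerboard of blocks
model = SpectralBiclustering(n_clusters=(3, 2), random_state=0).fit(X)
print(model.row_labels_[:10])        # gene cluster assignments
print(model.column_labels_)          # sample cluster assignments
```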