Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign


Microarray Data Analysis

(Lecture for CS397-CXZ Algorithms in Bioinformatics)

March 19, 2004

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign


Gene Expression Data (Microarray)

p genes on n samples

Rows: genes; columns: mRNA samples

Entry (i, j): gene expression level of gene i in mRNA sample j = log(treated-exp-value / controlled-exp-value)

Gene  sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90  ...
2       -0.10     0.49     0.24     0.06     0.46  ...
3        0.15     0.74     0.04     0.10     0.20  ...
4       -0.45    -1.03    -0.79    -0.56    -0.32  ...
5       -0.06     1.06     1.35     1.09    -1.09  ...
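A minimal sketch of how one entry of such a matrix could be computed from raw intensities. The slide does not state the base of the logarithm; base 2 is assumed here, and the intensity values and the name `log_ratio` are hypothetical.

```python
import math


def log_ratio(treated, control, base=2.0):
    """Return log_base(treated / control) for one gene in one sample."""
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive")
    return math.log(treated / control, base)


# Hypothetical raw intensities (treated, control) of one gene in three samples.
raw = [(1380.0, 1000.0), (500.0, 1000.0), (1000.0, 1000.0)]
row = [round(log_ratio(t, c), 2) for t, c in raw]
print(row)
```

A positive entry means the gene is expressed more strongly in the treated sample, a negative entry less strongly, and zero means no change.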


Some possible applications

Sample from specific organ to show which genes are expressed

Compare samples from healthy and sick host to find gene-disease connection

Discover co-regulated genes

Discover promoters


Major Analysis Techniques

Single gene analysis:

Compare the expression levels of the same gene under different conditions

Main techniques: Significance test (e.g., t-test)

Gene group analysis:

Find genes that are expressed similarly across many different conditions

Main techniques: Clustering (many possibilities)

Gene network analysis:

Analyze gene regulation relationship at a large scale

Main techniques: Bayesian networks
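As an illustration of single gene analysis, the sketch below computes a t statistic for one gene measured under two conditions. The slide only says "t-test"; the unequal-variance (Welch) variant is an assumption made here, and the expression values are made up. Turning the statistic into a p-value needs the t distribution, which is left out.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical log-ratios of one gene in healthy and sick hosts.
healthy = [0.10, 0.05, -0.02, 0.08]
sick = [0.90, 1.10, 0.95, 1.20]
t, df = welch_t(healthy, sick)
print(f"t = {t:.2f}, df = {df:.1f}")
```

A large absolute value of t suggests that the gene is expressed differently under the two conditions.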


Clustering Methods

Similarity-based (need a similarity function):

Construct a partition

Agglomerative, bottom up

Searching for an optimal partition

Typically “hard” clustering

Model-based (latent models, probabilistic or algebraic):

First compute the model

Clusters are obtained easily after having a model

Typically “soft” clustering


Similarity-based Clustering

Define a similarity function to measure similarity between two objects

Common criteria: find a partition to

Maximize intra-cluster similarity

Minimize inter-cluster similarity

Two ways to construct the partition:

Hierarchical (e.g., Agglomerative Hierarchical Clustering)

Search by starting at a random partition (e.g., K-means)


Method 1 (Similarity-based):

Agglomerative Hierarchical Clustering


Agglomerative Hierarchical Clustering

Given a similarity function to measure similarity between two objects

Gradually group similar objects together in a bottom-up fashion

Stop when some stopping criterion is met

Variations: different ways to compute group similarity based on individual object similarity
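The bottom-up procedure can be sketched as follows. This is a naive version under stated assumptions: the stopping criterion is a similarity threshold, the group similarity is the similarity of the closest pair, and objects are plain numbers compared by negative distance. All names and data are hypothetical.

```python
def agglomerative(items, group_sim, threshold):
    """Merge the two most similar groups until no pair reaches `threshold`."""
    groups = [[x] for x in items]  # start with one group per object
    while len(groups) > 1:
        best, pair = None, None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = group_sim(groups[i], groups[j])
                if best is None or s > best:
                    best, pair = s, (i, j)
        if best < threshold:  # stopping criterion: nothing similar enough
            break
        i, j = pair
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


def closest_pair_sim(g1, g2):
    """Group similarity = similarity of the closest pair (negative distance)."""
    return max(-abs(a - b) for a in g1 for b in g2)


clusters = agglomerative([1.0, 1.2, 0.9, 5.0, 5.3], closest_pair_sim, threshold=-1.0)
print(clusters)
```

Swapping in a different `group_sim` function gives the variations mentioned on the slide without touching the merge loop.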


Similarity Measure: Pearson CC

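The most popular correlation coefficient is the Pearson correlation coefficient (1892)

Correlation between X={X1, X2, …, Xn} and Y={Y1, Y2, …, Yn}:

$$r = \frac{1}{n}\sum_{k=1}^{n}\left(\frac{X_k-\bar{X}}{\sigma_X}\right)\left(\frac{Y_k-\bar{Y}}{\sigma_Y}\right)$$

where

$$\sigma_G = \sqrt{\frac{\sum_{k=1}^{n}\left(G_k-\bar{G}\right)^2}{n}}$$

and $\bar{G}$ is the mean of G.

sXY = r is the similarity between X & Y

(Adapted from a Slide by Shin-Mu Tseng)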

Better measures focus on a subset of values…
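A short sketch of the Pearson correlation as a similarity function between two expression profiles, assuming the 1/n normalisation for both the sum and the standard deviations. The two profiles reuse the first five values of genes 1 and 4 from the expression table purely as an illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (1/n normalisation)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / n


gene1 = [0.46, 0.30, 0.80, 1.51, 0.90]
gene4 = [-0.45, -1.03, -0.79, -0.56, -0.32]
print(round(pearson(gene1, gene4), 3))
```

The value lies between -1 and 1 and ignores the scale and offset of a profile, so two genes with the same shape of expression pattern score close to 1 even if their magnitudes differ.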


Similarity-induced Structure


How to Compute Group Similarity?

Given two groups g1 and g2, three popular methods:

Single-link algorithm: s(g1,g2) = similarity of the closest pair

Complete-link algorithm: s(g1,g2) = similarity of the farthest pair

Average-link algorithm: s(g1,g2) = average similarity of all pairs
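The three definitions translate almost directly into code. In this sketch the object similarity is the negative absolute difference of two numbers, so the closest pair is the most similar pair; that choice and the example groups are assumptions for illustration only.

```python
def similarity(a, b):
    """Hypothetical object similarity: negative distance between two numbers."""
    return -abs(a - b)


def single_link(g1, g2, sim=similarity):
    """Similarity of the closest (most similar) pair."""
    return max(sim(a, b) for a in g1 for b in g2)


def complete_link(g1, g2, sim=similarity):
    """Similarity of the farthest (least similar) pair."""
    return min(sim(a, b) for a in g1 for b in g2)


def average_link(g1, g2, sim=similarity):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))


g1, g2 = [1.0, 2.0], [4.0, 7.0]
print(single_link(g1, g2), complete_link(g1, g2), average_link(g1, g2))
```

For any two groups the complete-link value is at most the average-link value, which is at most the single-link value.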


Three Methods Illustrated

[Figure: two groups g1 and g2, illustrating the single-link, complete-link, and average-link algorithms]


Comparison of the Three Methods

Single-link: “Loose” clusters

Individual decision, sensitive to outliers

Complete-link: “Tight” clusters

Individual decision, sensitive to outliers

Average-link: “In between”

Group decision, insensitive to outliers

Which one is the best? Depends on what you need!


Method 2 (similarity-based):

K-Means


K-Means Clustering

Given a similarity function

Start with k randomly selected data points

Assume they are the centroids of k clusters

Assign every data point to the cluster whose centroid is closest to it

Recompute the centroid for each cluster

Repeat this process until the similarity-based objective function converges
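The steps above can be sketched as follows, assuming squared Euclidean distance as the (dis)similarity and treating "assignments no longer change" as the convergence test. The function names, the fixed random seed, and the data points are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k randomly selected data points
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # nothing moved: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = k_means(points, k=2)
print(labels)
```

The result depends on the random starting points, so in practice the procedure is often run several times and the best partition kept.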


Method 3 (model-based):

Mixture Models


Mixture Model for Clustering

P(X|Cluster1)

P(X|Cluster2)

P(X|Cluster3)

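$$P(X) = \lambda_1 P(X \mid \text{Cluster}_1) + \lambda_2 P(X \mid \text{Cluster}_2) + \lambda_3 P(X \mid \text{Cluster}_3)$$

$$X \mid \text{Cluster}_i \sim N(\mu_i, \sigma_i^2)$$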
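Once the model is known, a "soft" cluster membership follows from Bayes' rule. The worked equation below writes the mixing weight of cluster i as $\lambda_i$; that symbol is an assumption, since the weights are not legible in the extracted slide text.

```latex
P(\text{Cluster}_i \mid X)
  = \frac{\lambda_i \, P(X \mid \text{Cluster}_i)}
         {\sum_{j=1}^{3} \lambda_j \, P(X \mid \text{Cluster}_j)},
\qquad
\sum_{i=1}^{3} P(\text{Cluster}_i \mid X) = 1
```

A "hard" clustering can be read off by assigning X to the cluster with the largest posterior probability.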


Mixture Model Estimation

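Likelihood function:

$$p(x) = \sum_{i=1}^{k} \lambda_i \, \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$$

Parameters: $\lambda_i$, $\mu_i$, $\sigma_i$

Using EM algorithm

Similar to “soft” K-means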
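A compact sketch of EM for a one-dimensional Gaussian mixture, to show why it resembles a "soft" K-means: the E-step assigns each point fractionally to every cluster, and the M-step re-estimates each cluster from its weighted members. The data, the starting values, the fixed iteration count, and the floor on the standard deviation are all assumptions of this sketch.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def em_gmm_1d(data, weights, mus, sigmas, n_iter=50):
    weights, mus, sigmas = list(weights), list(mus), list(sigmas)  # do not mutate inputs
    k = len(mus)
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each point (soft assignment).
        resp = []
        for x in data:
            joint = [weights[i] * normal_pdf(x, mus[i], sigmas[i]) for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: weighted re-estimation of the parameters of each cluster.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / len(data)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / n_i
            sigmas[i] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing cluster
    return weights, mus, sigmas


# Hypothetical expression values forming two groups.
data = [-1.1, -0.9, -1.0, -1.2, 1.0, 1.1, 0.9, 1.2]
weights, mus, sigmas = em_gmm_1d(data, [0.5, 0.5], [-0.5, 0.5], [1.0, 1.0])
print([round(m, 3) for m in mus])
```

Replacing the fractional responsibilities with a 0/1 choice of the nearest mean, and fixing equal weights and variances, turns this loop into ordinary K-means.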


Method 4 (model-based) [If we have time]

Singular Value Decomposition (SVD)

Also called “Latent Semantic Indexing” (LSI)


Example of “Semantic Concepts”

(Slide from C. Faloutsos’s talk)


Singular Value Decomposition (SVD)

A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T

A: n x m matrix (n documents, m terms)

U: n x r matrix (n documents, r concepts)

Λ: r x r diagonal matrix (strength of each ‘concept’) (r: rank of the matrix)

V: m x r matrix (m terms, r concepts)

(Slide from C. Faloutsos’s talk)
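To make the roles of the three factors concrete, the sketch below finds only the strongest concept of a small matrix by power iteration. This is a teaching device, not how numerical libraries compute a full SVD, and the matrix and function names are hypothetical.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def top_concept(A, n_iter=100):
    """Power iteration on A^T A: returns (u, strength, v) of the strongest concept."""
    At = transpose(A)
    m = len(At)  # number of columns (terms)
    v = [1.0 / math.sqrt(m)] * m  # start from a uniform unit vector
    for _ in range(n_iter):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = matvec(A, v)
    strength = math.sqrt(sum(x * x for x in Av))  # largest singular value
    u = [x / strength for x in Av]
    return u, strength, v


# Hypothetical 3 documents x 3 terms matrix with one dominant concept.
A = [[2.0, 2.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
u, strength, v = top_concept(A)
print(round(strength, 4), [round(x, 3) for x in v])
```

Here `v` is the term side of the concept (one column of V), `u` is the document side (one column of U), and `strength` is the matching diagonal entry.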


Example of SVD

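Document-term matrix A (rows: documents, CS documents first, then MD documents; columns: terms):

      data  inf  retrieval  brain  lung
CS      1    1       1        0      0
CS      2    2       2        0      0
CS      1    1       1        0      0
CS      5    5       5        0      0
MD      0    0       0        2      2
MD      0    0       0        3      3
MD      0    0       0        1      1

A = U x Λ x VT, where

U (columns: CS-concept, MD-concept):

0.18  0
0.36  0
0.18  0
0.90  0
0     0.53
0     0.80
0     0.27

Λ (9.64: strength of CS-concept):

9.64  0
0     5.29

VT (term rep of concept, one row per concept):

0.58  0.58  0.58  0     0
0     0     0     0.71  0.71

Dim. Reduction

(Slide adapted from C. Faloutsos’s talk)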
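A quick check of the example: multiplying the three factors back together should reproduce the document-term matrix up to the rounding of the printed values, and keeping only the first concept gives a reduced version that retains the CS block. The function name `reconstruct` is hypothetical; the numbers are the ones on this slide.

```python
A = [[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
     [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]]
U = [[0.18, 0.0], [0.36, 0.0], [0.18, 0.0], [0.90, 0.0],
     [0.0, 0.53], [0.0, 0.80], [0.0, 0.27]]
strengths = [9.64, 5.29]  # diagonal of the middle factor
Vt = [[0.58, 0.58, 0.58, 0.0, 0.0],
      [0.0, 0.0, 0.0, 0.71, 0.71]]


def reconstruct(U, strengths, Vt, rank):
    """Sum of the first `rank` concepts: strengths[c] * U[:, c] * Vt[c, :]."""
    return [[sum(U[i][c] * strengths[c] * Vt[c][j] for c in range(rank))
             for j in range(len(Vt[0]))]
            for i in range(len(U))]


full = reconstruct(U, strengths, Vt, rank=2)
max_err = max(abs(full[i][j] - A[i][j]) for i in range(len(A)) for j in range(len(A[0])))
print(f"largest reconstruction error: {max_err:.3f}")

# Dimensionality reduction: keep only the strongest (CS) concept.
rank1 = reconstruct(U, strengths, Vt, rank=1)
print([round(x, 2) for x in rank1[3]])
```

The small reconstruction error comes from the two-decimal rounding of the factors; with only one concept kept, the MD rows of `rank1` become all zeros.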


More clustering methods and software

Partitioning: K-Means, K-Medoids, PAM, CLARA, …

Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK

Density-based: CAST, DBSCAN, OPTICS, CLIQUE, …

Grid-based: STING, CLIQUE, WaveCluster, …

Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, …

Two-way Clustering

Block clustering