Basic Gene Expression Data Analysis--Clustering Pairwise Measures Clustering Motif Searching/Network Construction Integrated Analysis (NMR/SNP/Clinic/….)

Basic Gene Expression Data Analysis--Clustering

Data flow: Gene Expression Data → Distance/Similarity Matrix → Gene Clusters → Regulatory Elements / Gene Functions

Steps: Pairwise Measures → Clustering → Motif Searching/Network Construction → Integrated Analysis (NMR/SNP/Clinic/…)

Microarray Experiment

Control

Treated

mRNA

RT and label with fluor dyes

cDNA

Spot (DNA probe):
• known cDNA or
• Oligo

Mix and hybridize target to microarray

Collections of Experiments

• Time course after a treatment

• Different treatments

• Disease cell lines

• Data are represented in a matrix

Cluster Analysis

• Grouping of genes with “similar” expression profiles

• Grouping of disease cell lines/toxicants with “similar” effects on gene expression

• Clustering algorithms
– Hierarchical clustering
– Self-organizing maps
– K-means clustering

Normalized Expression Data

Gene Expression Clustering

Protein/protein complex

DNA regulatory elements

Semantics of clusters: From co-expressed to co-regulated

Key Terms in Cluster Analysis

• Distance & Similarity measures

• Hierarchical & non-hierarchical

• Single/complete/average linkage

• Dendrograms & ordering

Measuring Similarity of Gene Expression

For two points $(x_1, y_1)$ and $(x_2, y_2)$:

• Euclidean (L2) distance

• Manhattan (L1) distance

• Lm: $(|x_1-x_2|^m+|y_1-y_2|^m)^{1/m}$

• L∞: $\max(|x_1-x_2|,|y_1-y_2|)$

• Inner product: $x_1x_2+y_1y_2$

• Correlation coefficient

• Spearman rank correlation coefficient

Distance Measures: Minkowski Metric

Suppose two objects $x$ and $y$ both have $p$ features:

$$x = (x_1, x_2, \ldots, x_p), \qquad y = (y_1, y_2, \ldots, y_p)$$

The Minkowski metric is defined by

$$d(x, y) = \sqrt[r]{\sum_{i=1}^{p} |x_i - y_i|^r}$$

Commonly Used Minkowski Metrics

1, $r = 2$ (Euclidean distance):

$$d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2}$$

2, $r = 1$ (Manhattan distance):

$$d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$$

3, $r \to +\infty$ ("sup" distance):

$$d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$$

An Example

Points $x$ and $y$ differ by 4 along one feature and by 3 along the other.

1, Euclidean distance: $\sqrt{4^2 + 3^2} = 5$.

2, Manhattan distance: $4 + 3 = 7$.

3, "sup" distance: $\max\{4, 3\} = 4$.
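The three distances in this example can be checked with a small script. This is a minimal sketch, not code from the slides: the function name `minkowski` and the concrete coordinates (0, 0) and (4, 3) are assumptions, chosen only so that the two points differ by 4 and 3.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length feature vectors.

    r = 1 is the Manhattan distance, r = 2 the Euclidean distance, and
    r = math.inf the "sup" distance.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Hypothetical coordinates: the points differ by 4 and 3, as in the example.
x = (0.0, 0.0)
y = (4.0, 3.0)

euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
sup = minkowski(x, y, math.inf)
print(euclidean, manhattan, sup)
```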

Manhattan distance is called Hamming distance when all features are binary.

Gene Expression Levels Under 17 Conditions (1-High, 0-Low)

Condition   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
GeneA       0  1  1  0  0  1  0  0  1  0  0  1  1  1  0  0  1
GeneB       0  1  1  1  0  0  0  0  1  1  1  1  1  1  0  1  1

Hamming Distance: #(01) + #(10) = 4 + 1 = 5.
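The Hamming distance for the two genes can be recomputed as follows. This is a sketch under assumptions: the two bit vectors are one reading of the slide's table (the order of the conditions is assumed, though the distance does not depend on it), and the names `hamming`, `gene_a` and `gene_b` are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(1 for u, v in zip(a, b) if u != v)


# Expression levels under 17 conditions (1 = high, 0 = low), as read from the slide.
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

# Count the two kinds of mismatch separately: #(01) and #(10).
n01 = sum(1 for u, v in zip(gene_a, gene_b) if (u, v) == (0, 1))
n10 = sum(1 for u, v in zip(gene_a, gene_b) if (u, v) == (1, 0))

distance = hamming(gene_a, gene_b)
# For binary features the Manhattan distance gives the same value.
manhattan = sum(abs(u - v) for u, v in zip(gene_a, gene_b))
print(n01, n10, distance, manhattan)
```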

From Clustering to Correlation

[Figure: three plots of expression level against time, each comparing the profiles of Gene A and Gene B]

Similarity Measures: Correlation Coefficient

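$$s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

averages: $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$.

$$|s(x, y)| \le 1$$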
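A direct translation of the correlation coefficient into code may make the bound on $s(x, y)$ concrete. This is a minimal sketch: the function name `correlation` and the three expression profiles are made up for illustration and do not come from the slides.

```python
import math


def correlation(x, y):
    """Correlation coefficient s(x, y) between two profiles with p features each."""
    p = len(x)
    if p == 0 or p != len(y):
        raise ValueError("profiles must be non-empty and of equal length")
    x_bar = sum(x) / p
    y_bar = sum(y) / p
    numerator = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - x_bar) ** 2 for a in x) * sum((b - y_bar) ** 2 for b in y)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return numerator / denominator


# Hypothetical profiles over four time points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_a

s_ab = correlation(gene_a, gene_b)
s_ac = correlation(gene_a, gene_c)
print(s_ab, s_ac)
```

Unlike the Euclidean distance, the coefficient ignores the difference in magnitude between `gene_a` and `gene_b` and responds only to the shape of the profile.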

Hierarchical Clustering

Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
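The four steps above can be sketched in a few lines. This is an illustrative, unoptimised version under stated assumptions: the function name `agglomerate`, the gene names `g1` to `g4` and their one-dimensional positions are hypothetical, and the cluster-to-cluster distance is recomputed from the item distances at every step rather than stored.

```python
from itertools import combinations


def agglomerate(items, dist, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    dist maps frozenset({u, v}) to the distance between items u and v.
    linkage=min merges on the closest cross-cluster pair, linkage=max on the
    farthest one.
    """
    # Step 1: every item starts in its own cluster.
    clusters = [frozenset([item]) for item in items]
    history = []

    def cluster_distance(a, b):
        # Step 3: cluster distance derived from the item-level distances.
        return linkage(dist[frozenset((u, v))] for u in a for v in b)

    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters and merge them.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        history.append((sorted(a), sorted(b), cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history


# Hypothetical one-dimensional "expression values" for four genes.
positions = {"g1": 0.0, "g2": 1.0, "g3": 5.0, "g4": 7.0}
dist = {
    frozenset((u, v)): abs(positions[u] - positions[v])
    for u, v in combinations(positions, 2)
}

single = agglomerate(list(positions), dist, linkage=min)
complete = agglomerate(list(positions), dist, linkage=max)
for a, b, d in single:
    print(a, b, d)
```

The merge distances are the heights at which a dendrogram would join the corresponding branches.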

Normalized Expression Data

Hierarchical Clustering

a b c d

Hierarchical Clustering

3 clusters?

2 clusters?

Cluster Analysis

• Eisen et al. (1998) (PNAS, 95:14863)

Correlation as measure of co-expression

Experiment over time

time: t0 t1 t2 ...

control

N genes → N*N correlation matrix

       1     2     3     …
1    1.00  0.23  0.02    …
2          1.00  0.34    …
3                1.00    …
…

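Cluster Analysis

• Scan matrix for maximum

• Join genes to 1 node

• Update matrix

N*N correlation matrix

       1     2     3
1    1.00  0.23  0.02
2          1.00  0.34
3                1.00

Matrix after joining genes 2 and 3

         1    (2&3)
1      1.00   0.28
(2&3)         1.00

[Figure: tree in which genes 2 and 3 are joined to one node, with gene 1 separate]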

Cluster Analysis

• Result: Dendrogram assembling N genes

• Points of discussion
– similarity based, useful for co-expression
– dependent on similarity measure?
– useful in preliminary scans
– biological relevance of clusters?

Distance Between Two Clusters

• Single-Link Method / Nearest Neighbor: min distance

• Complete-Link / Furthest Neighbor: max distance

• Average: average distance, the average of all cross-cluster pairs

• Centroid: distance between their centroids

• Single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

• Complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

• Average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average of the distances between each member of one cluster and each member of the other cluster.
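The three definitions differ only in how the cross-cluster distances are summarised, which a short sketch can show. The function names and the two one-dimensional clusters below are hypothetical and chosen for easy arithmetic; they are not taken from the slides.

```python
from statistics import mean


def cross_distances(cluster_a, cluster_b, dist):
    """Distances between every member of cluster_a and every member of cluster_b."""
    return [dist(u, v) for u in cluster_a for v in cluster_b]


def single_link(cluster_a, cluster_b, dist):
    # Shortest distance between any two members (nearest neighbor).
    return min(cross_distances(cluster_a, cluster_b, dist))


def complete_link(cluster_a, cluster_b, dist):
    # Longest distance between any two members (furthest neighbor).
    return max(cross_distances(cluster_a, cluster_b, dist))


def average_link(cluster_a, cluster_b, dist):
    # Average over all cross-cluster pairs.
    return mean(cross_distances(cluster_a, cluster_b, dist))


def absolute_difference(u, v):
    return abs(u - v)


# Hypothetical clusters of one-dimensional points.
cluster_1 = [1.0, 2.0]
cluster_2 = [5.0, 9.0]

d_single = single_link(cluster_1, cluster_2, absolute_difference)
d_complete = complete_link(cluster_1, cluster_2, absolute_difference)
d_average = average_link(cluster_1, cluster_2, absolute_difference)
print(d_single, d_complete, d_average)
```

The cross-cluster distances here are 4, 8, 3 and 7, so the same pair of clusters is 3, 8 or 5.5 apart depending on the linkage.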

Single-Link Method

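Distance Matrix (Euclidean Distance)

      b   c   d
a     2   5   6
b         3   5
c             4

(1) Merge a and b (distance 2); clusters: a,b | c | d

        c   d
a,b     3   5
c           4

(2) Merge a,b and c (distance 3); clusters: a,b,c | d

          d
a,b,c     4

(3) Merge a,b,c and d (distance 4); clusters: a,b,c,d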

Complete-Link Method

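Distance Matrix (Euclidean Distance)

      b   c   d
a     2   5   6
b         3   5
c             4

(1) Merge a and b (distance 2); clusters: a,b | c | d

        c   d
a,b     5   6
c           4

(2) Merge c and d (distance 4); clusters: a,b | c,d

        c,d
a,b     6

(3) Merge a,b and c,d (distance 6); clusters: a,b,c,d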
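The single-link and complete-link examples can be re-derived by updating the distance matrix after each merge. This is a sketch under assumptions: the pairwise distances are as read from the slides' matrix for a, b, c and d, and the function name `cluster_by_matrix_update` is hypothetical. Taking the minimum of the two old rows gives single-link; taking the maximum gives complete-link.

```python
def cluster_by_matrix_update(labels, distances, combine):
    """Merge the closest pair repeatedly, updating the matrix after each merge.

    combine=min reproduces single-link, combine=max complete-link.
    Returns a list of (merged_label, merge_distance).
    """
    # Working matrix keyed by unordered pairs of cluster labels.
    matrix = {frozenset(pair): d for pair, d in distances.items()}
    clusters = list(labels)
    merges = []
    while len(clusters) > 1:
        pair = min(matrix, key=matrix.get)  # closest pair of clusters
        first, second = sorted(pair)
        height = matrix[pair]
        merged = first + "," + second
        others = [c for c in clusters if c not in pair]
        # Keep entries that do not involve either merged cluster.
        updated = {k: v for k, v in matrix.items() if not (k & pair)}
        # Distance from the new cluster to each remaining cluster.
        for other in others:
            updated[frozenset((merged, other))] = combine(
                matrix[frozenset((first, other))],
                matrix[frozenset((second, other))],
            )
        matrix = updated
        clusters = others + [merged]
        merges.append((merged, height))
    return merges


# Pairwise distances as read from the slides' distance matrix.
distances = {
    ("a", "b"): 2, ("a", "c"): 5, ("a", "d"): 6,
    ("b", "c"): 3, ("b", "d"): 5, ("c", "d"): 4,
}

single = cluster_by_matrix_update("abcd", distances, min)
complete = cluster_by_matrix_update("abcd", distances, max)
print(single)
print(complete)
```

Both methods start by joining a and b, but they disagree on the second merge: single-link attaches c to a,b, while complete-link pairs c with d.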

Identifying disease genes

Non-tumor Liver

Tumor Liver

Liver-specific

Ribosomal proteins

Proliferation

Endothelial cells 1

X. Chen & P.O. Brown et al., Molecular Biology of the Cell, Vol. 13, 1929-1939, June 2002

• Human tumor patient and normal cells; various conditions

• Cluster or Classify genes according to tumors

• Cluster tumors according to genes

K-Means Clustering Algorithm

1) Select an initial partition of k clusters

2) Assign each object to the cluster with the closest center

3) Compute the new centers of the clusters:

$$C(S) = \sum_{i=1}^{n} X_i / n, \qquad S = \{X_1, \ldots, X_n\}$$

4) Repeat steps 2 and 3 until no object changes cluster
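The loop described by these steps can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names `centroid`, `compute_centers` and `kmeans`, the six two-dimensional points and the initial partition are all assumptions, and a cluster that loses all of its members is simply dropped.

```python
import math


def centroid(points):
    """C(S): the coordinate-wise mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def compute_centers(points, assignment):
    """Center of every non-empty cluster under the given assignment."""
    return {
        c: centroid([p for p, a in zip(points, assignment) if a == c])
        for c in set(assignment)
    }


def kmeans(points, assignment, max_iter=100):
    """assignment is the initial partition: one cluster index per point."""
    assignment = list(assignment)
    for _ in range(max_iter):
        centers = compute_centers(points, assignment)
        # Assign each object to the cluster with the closest center.
        new_assignment = [
            min(centers, key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Stop when no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, compute_centers(points, assignment)


# Hypothetical data: two well-separated groups and a deliberately poor start.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
initial = [0, 1, 0, 1, 0, 1]

final_assignment, final_centers = kmeans(points, initial)
print(final_assignment)
print(final_centers)
```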

K-Means Clustering

• Basic idea: use the cluster centre (mean) to represent each cluster

• Assign each data element to the closest cluster (centre).

• Goal: Minimise square error (intra-class dissimilarity): $\sum_{i} d(x_i, C(x_i))$

• Variations of K-Means
– Initialisation (select the number of clusters, initial partitions)
– Updating of center
– Hill-climbing (trying to move an object to another cluster).

This method starts by taking as many components of the population as the final required number of clusters, choosing them so that the points are mutually farthest apart. Next, it examines each remaining component in the population and assigns it to the cluster at minimum distance. The centroid's position is recalculated every time a component is added to a cluster, and this continues until all the components are grouped into the final required number of clusters.
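The square-error goal can be evaluated for any candidate partition, which is one way to compare two clusterings of the same data. The sketch below assumes that $d$ is the squared Euclidean distance between a point and the centre of its cluster; the function names and the four example points are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of the points in one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def square_error(clusters):
    """Sum over all points of the squared distance to their cluster centre.

    Assumption: d(x, C(x)) is the squared Euclidean distance.
    """
    total = 0.0
    for members in clusters:
        center = centroid(members)
        total += sum(
            sum((a - c) ** 2 for a, c in zip(point, center)) for point in members
        )
    return total


# Two hypothetical partitions of the same four points.
compact = [[(1, 1), (1, 3)], [(8, 8), (8, 10)]]
mixed = [[(1, 1), (8, 8)], [(1, 3), (8, 10)]]

error_compact = square_error(compact)
error_mixed = square_error(mixed)
print(error_compact, error_mixed)
```

The partition that keeps nearby points together has the much smaller error, which is the direction the k-means updates push towards.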

The K-Means Clustering Method

• Example

[Figure: four scatter-plot panels, each with both axes running from 0 to 10]

k-means Clustering : Procedure (1)

Initialization 1: Specify the number of clusters k: for example, k = 4

[Figure: expression matrix (labels: 2, 5000) shown as a scatter of points]

Each point is called “gene”

k-means Clustering : Procedure (2)

Initialization 2: Genes are randomly assigned to one of k clusters

k-means Clustering : Procedure (3)

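Calculate the mean of each cluster

$$m_c = \frac{1}{N_c} \sum_{i=1}^{N_c} g_i$$

[Figure: cluster points labelled (1,2), (3,2), (3,4), (6,7)]

$$m_{\mathrm{BLUE}} = \frac{1}{4} \left[ (6,7) + (3,4) + \ldots \right]$$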

k-means Clustering : Procedure (4)

Each gene is reassigned to the nearest cluster

$$c_i = \arg\min_j |m_j - g_i|^2$$

Gene i to cluster $c_i$

k-means Clustering : Procedure (5)

Iterate until the means converge

k-means clustering : application

6220 yeast genes, 15 time points during cell cycle

[Figure: clusters labelled M/G1 phase, G1 phase and M phase]

Result: 13 of the 30 clusters were statistically significant for a biological function

S. Tavazoie & G.M. Church, Nature Genetics, Vol. 22, July 1999

Computation Time and Memory Requirement

n genes and m experiments

Computation time:
• Hierarchical clustering
– $O(m n^2 \log n)$
• K-means clustering
– t: number of iterations
– $O(k t m n)$

Memory requirement:
• Hierarchical clustering
– $O(mn + n^2)$
• K-means clustering
– t: number of iterations
– $O(mn + kn)$

Issues in Cluster Analysis

• A lot of clustering algorithms

• A lot of distance/similarity metrics

• Which clustering algorithm runs faster and uses less memory?

• How many clusters after all?

• Are the clusters stable?

• Are the clusters meaningful?

K-Means vs Hierarchical Clustering

Pattern Recognition

• Clarification of decision making processes and automating them using computers

Supervised:

• known number of classes

• based on a training set

• used to classify future observations

Unsupervised:

• unknown number of classes

• no prior knowledge

• cluster analysis = one form