Clustering Gene Expression Data

Feb 2001 (GG) 1


• Gene Expression Data

• Clustering of Genes and Conditions

• Methods

– Agglomerative Hierarchical: Average Linkage

– Centroids: K-Means

– Physically motivated: Super-Paramagnetic Clustering

• Coupled Two-Way Clustering

DNA Microarrays Workshop

Feb. 26 – Mar. 2, 2001, UNIL & EPFL, Lausanne

Gaddy Getz, Weizmann Institute, Israel

Feb 2001 (GG) 2

Gene Expression Technologies

• DNA Chips (Affymetrix) and MicroArrays can measure mRNA concentration of thousands of genes simultaneously

• General scheme: Extract RNA, synthesize labeled cDNA, Hybridize with DNA on chip.

Feb 2001 (GG) 3

Single Experiment

• After hybridization

– Scan the chip and obtain an image file

– Image analysis (find spots, measure signal and noise). Tools: ScanAlyze, Affymetrix, …

• Output file

– Affymetrix chips: for each gene, a reading proportional to the concentrations and a present/absent call (Average Difference, Absent Call)

– cDNA MicroArrays: competing hybridization of target and control. For each gene, the log ratio of target and control (CH1I-CH1B, CH2I-CH2B)

Feb 2001 (GG) 4

Preprocessing: From one experiment to many

• Chip and Channel Normalization

– Aim: bring the readings of all experiments onto the same scale

– Cause: different RNA amounts, labeling efficiency and image acquisition parameters

– Method: multiply the readings of each array/channel by a scaling factor, chosen in one of two ways:

• So that the sum of the scaled readings is the same for all arrays

• By a linear fit of the highly expressed genes

– Note: in multi-channel experiments, normalize each channel separately.
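
The first scaling rule (equal sum of readings on every array) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scale_to_common_sum`, the choice of the mean of the original sums as the common target, and the readings themselves are hypothetical and do not come from the slides.

```python
def scale_to_common_sum(arrays, target_sum=None):
    """Scale each array so that its readings sum to the same total.

    arrays: list of lists, one inner list of readings per array/channel.
    target_sum: common total; defaults to the mean of the original sums
                (an assumption, any common value would do).
    """
    sums = [sum(a) for a in arrays]
    if target_sum is None:
        target_sum = sum(sums) / len(sums)
    # One multiplicative scaling factor per array/channel.
    factors = [target_sum / s for s in sums]
    scaled = [[f * x for x in a] for f, a in zip(factors, arrays)]
    return scaled, factors


# Hypothetical readings: three arrays (experiments), four genes each.
readings = [
    [100.0, 200.0, 300.0, 400.0],
    [50.0, 100.0, 150.0, 200.0],
    [210.0, 390.0, 620.0, 780.0],
]
scaled, factors = scale_to_common_sum(readings)
for f, a in zip(factors, scaled):
    print(round(f, 3), [round(x, 1) for x in a])
```

In a multi-channel experiment the same function would be called once per channel, in line with the note above.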

Feb 2001 (GG) 5

Preprocessing: From one experiment to many

• Filtering of Genes

– Remove genes that are absent in most experiments

– Remove genes that are constant in all experiments

– Remove genes with low readings, which are not reliable.
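
A minimal sketch of the three filters, assuming the data are held as a genes-by-experiments list of rows together with a matching table of present/absent calls. The function name `filter_genes` and both thresholds (`min_present_frac`, `min_reading`) are hypothetical; suitable values depend on the platform and the data set.

```python
def filter_genes(matrix, present, min_present_frac=0.5, min_reading=20.0):
    """Return indices of genes (rows) that pass three simple filters.

    matrix:  list of rows, one row of readings per gene.
    present: same shape, True where the gene was called present.
    Both thresholds are hypothetical and must be tuned per data set.
    """
    kept = []
    for g, (row, calls) in enumerate(zip(matrix, present)):
        if sum(calls) / len(calls) < min_present_frac:
            continue  # absent in most experiments
        if max(row) == min(row):
            continue  # constant in all experiments
        if max(row) < min_reading:
            continue  # readings too low to be reliable
        kept.append(g)
    return kept


# Hypothetical data: 4 genes x 4 experiments.
matrix = [
    [120.0, 340.0, 90.0, 410.0],   # passes all filters
    [15.0, 900.0, 12.0, 8.0],      # absent in 3 of 4 experiments
    [200.0, 200.0, 200.0, 200.0],  # constant
    [5.0, 9.0, 7.0, 11.0],         # low readings
]
present = [
    [True, True, True, True],
    [False, True, False, False],
    [True, True, True, True],
    [True, True, True, True],
]
kept = filter_genes(matrix, present)
print(kept)  # [0]
```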

[Figure: Colon cancer data (Alon et al.); axes: Experiments (horizontal), Genes (vertical)]

Feb 2001 (GG) 6

Noise and Repeats

• >90% 2 to 3 fold

• Multiplicative noise

• Repeat experiments

• Log scale: dist(4,2) = dist(2,1)

[Figure: log–log plot]
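
The point of the log scale can be checked numerically: with multiplicative noise, a two-fold change should count the same whether it is from 1 to 2 or from 2 to 4. The helper name `log_distance` and the choice of base 2 are assumptions made for this illustration.

```python
import math


def log_distance(a, b):
    """Distance between two positive readings on a log2 scale."""
    return abs(math.log2(a) - math.log2(b))


# On the log scale both pairs are one two-fold step apart.
print(log_distance(4, 2), log_distance(2, 1))  # 1.0 1.0

# On the linear scale the same two-fold changes look different.
print(abs(4 - 2), abs(2 - 1))  # 2 1
```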

Feb 2001 (GG) 7

We can ask many questions

Supervised Methods (use predefined labels)

• Which genes are expressed differently in two known types of conditions?

• What is the minimal set of genes needed to distinguish one type of conditions from the others?

Unsupervised Methods (use only the data)

• Which genes behave similarly in the experiments?

• How many different types of conditions are there?

Feb 2001 (GG) 8

• Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.

• Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.

Unsupervised Analysis

Clustering Methods

Feb 2001 (GG) 9

What is clustering?

• Input: N data points, Xi, i=1,2,…,N, in a D-dimensional space.

• Goal: Find “natural” groups or clusters. Data points of the same cluster are “more similar”.

• Note: the number of clusters is also to be determined.

Feb 2001 (GG) 10

Clustering is ill-posed

• Problem specific definitions

• Similarity: which points should be considered close?

– Correlation coefficient

– Euclidean distance

• Resolution: specify a single resolution, or produce hierarchical results

• Shape of clusters: general, spherical.

Feb 2001 (GG) 13

Similarity Measure

• Similarity measures

– Centered Correlation

– Uncentered Correlation

– Absolute correlation

– Euclidean
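
The four measures can be written out for two expression profiles as follows. This is a sketch under assumptions: the function names are illustrative, "absolute correlation" is taken here to mean the absolute value of the centered correlation, and the three profiles are made up.

```python
import math


def uncentered_correlation(x, y):
    """Cosine of the angle between x and y (no mean subtraction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def centered_correlation(x, y):
    """Pearson correlation: subtract each profile's mean first."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_correlation([a - mx for a in x], [b - my for b in y])


def absolute_correlation(x, y):
    """Treats correlated and anti-correlated profiles as equally similar."""
    return abs(centered_correlation(x, y))


def euclidean(x, y):
    """Euclidean distance (a dissimilarity: smaller means more similar)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical profiles over four conditions.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same shape as g1, twice the amplitude
g3 = [4.0, 3.0, 2.0, 1.0]   # mirror image of g1

for name, f in [("centered", centered_correlation),
                ("uncentered", uncentered_correlation),
                ("absolute", absolute_correlation),
                ("euclidean", euclidean)]:
    print(name, round(f(g1, g2), 3), round(f(g1, g3), 3))
```

The example shows why the choice matters: g1 and g2 are perfectly correlated yet far apart in Euclidean distance, while g1 and g3 are anti-correlated and only the absolute correlation treats them as similar.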

Feb 2001 (GG) 14

Agglomerative Hierarchical Clustering

[Figure: five data points, labeled 1–5, and their dendrogram; vertical axis: distance between joined clusters]

Need to define the distance between the new cluster and the other clusters.

Single Linkage: distance between closest pair.

Complete Linkage: distance between farthest pair.

Average Linkage: average distance between all pairs, or distance between cluster centers.

The dendrogram induces a linear ordering of the data points.
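
A minimal sketch of the agglomerative procedure with average linkage (average distance over all pairs), written for clarity rather than speed. The function names, the use of Euclidean distance, and the five 2-D points are assumptions made for illustration; they are not the points shown on the slide.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(points):
    """Greedy agglomerative clustering with average linkage.

    Returns the merge history as (members_a, members_b, distance) tuples,
    where members are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def dist(a, b):
        # Average distance over all pairs, one point from each cluster.
        total = sum(euclidean(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of current clusters and join them.
        d, a, b = min(
            (dist(a, b), a, b)
            for k, a in enumerate(clusters)
            for b in clusters[k + 1:]
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Hypothetical 2-D points: two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = average_linkage(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The merge history is the dendrogram: each line is one junction, and its distance is the height at which the two branches join. Replacing the average in `dist` with `min` or `max` over the pairs would give single or complete linkage.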

Feb 2001 (GG) 15

Agglomerative Hierarchical Clustering

• Results depend on distance update method

– Single Linkage: elongated clusters

– Complete Linkage: sphere-like clusters

• Greedy iterative process

• NOT robust against noise

• No inherent measure to choose the clusters

Feb 2001 (GG) 16

Centroid Methods - K-means

Iteration = 0

• Start with random positions of K centroids.

• Iterate until centroids are stable:

– Assign each point to its nearest centroid

– Move centroids to the center of their assigned points
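
The loop above can be sketched as follows. The details are assumptions made for this illustration: the random start picks K distinct data points as initial centroids, distance is squared Euclidean, an empty cluster keeps its previous centroid, and the six points are made up.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random start: K distinct data points
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(x) / len(members) for x in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep as is
        if new_centroids == centroids:
            break  # centroids are stable
        centroids = new_centroids
    return centroids, labels


# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Changing `seed` changes the starting centroids. On data with less clear-cut groups than this example, different seeds can lead to different final clusters.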

Feb 2001 (GG) 17

[Figure: K-means example, Iteration = 1]

Feb 2001 (GG) 18

[Figure: K-means example, Iteration = 1]

Feb 2001 (GG) 19

[Figure: K-means example, Iteration = 3]

Feb 2001 (GG) 20

• Result depends on initial centroids’ position

• Fast algorithm: computes only the distances from data points to centroids

• No way to choose K.

• Example: 3 clusters / K=2, 3, 4

• Breaks long clusters


Feb 2001 (GG) 21

Super-Paramagnetic Clustering (SPC)

M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation

• The idea behind SPC is based on the physical properties of dilute magnets.

• Calculating correlation between magnet orientations at different temperatures (T).

[Figure: T=Low]

Feb 2001 (GG) 22

[Figure: T=High]

Feb 2001 (GG) 23

[Figure: T=Intermediate]

Feb 2001 (GG) 24

• The algorithm simulates the magnets' behavior over a range of temperatures and calculates their correlations

• The temperature (T) controls the resolution

• Example: N=4800 points in D=2

Super-Paramagnetic Clustering (SPC)

Feb 2001 (GG) 25

Output of SPC

[Figure: Size of largest clusters as function of T]

[Figure: Dendrogram]

Stable clusters “live” for a large range of T (large ΔT)

A function χ(T) that peaks when stable clusters break

Feb 2001 (GG) 26

Choosing a value for T

Feb 2001 (GG) 27

Advantages of SPC

• Scans all resolutions (T)

• Robust against noise and initialization: calculates collective correlations.

• Identifies “natural” (χ) and stable clusters (ΔT)

• No need to pre-specify number of clusters

• Clusters can be any shape

Feb 2001 (GG) 28

Many clustering methods applied to expression data

• Agglomerative Hierarchical

– Average Linkage (Eisen et al., PNAS 1998)

• Centroid (representative)

– K-Means (Golub et al., Science 1999)

– Self Organized Maps (Tamayo et al., PNAS 1999)

• Physically motivated

– Deterministic Annealing (Alon et al., PNAS 1999)

– Super-Paramagnetic Clustering (Getz et al., Physica A 2000)

Feb 2001 (GG) 29

Available Tools

• M. Eisen’s programs for clustering and display of results (Cluster, TreeView)

– Predefined set of normalizations and filtering

– Agglomerative, K-means, 1D SOM

• Matlab

– Agglomerative, public m-files.

• Dedicated software packages (SPC)

• Web sites: e.g. http://ep.ebi.ac.uk/EP/EPCLUST/

• Statistical programs (SPSS, SAS, S-plus)

Feb 2001 (GG) 30

Back to gene expression data

• 2 Goals: Cluster Genes and Conditions

• 2 independent clusterings:

– Genes are represented as vectors of expression in all conditions

– Conditions are represented as vectors of expression of all genes
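
In code, the two representations are the rows and the columns of the same matrix, so the second clustering simply runs on the transpose. The small matrix and the names below are hypothetical and serve only to show the shapes involved.

```python
def transpose(matrix):
    """Swap rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical expression matrix: 3 genes (rows) x 4 conditions (columns).
expression = [
    [0.1, 0.4, -0.2, 0.8],
    [0.0, 0.5, -0.1, 0.7],
    [0.6, -0.3, 0.2, -0.4],
]

gene_vectors = expression                  # 3 points in a 4-dimensional space
condition_vectors = transpose(expression)  # 4 points in a 3-dimensional space

print(len(gene_vectors), "genes, each a vector of length", len(gene_vectors[0]))
print(len(condition_vectors), "conditions, each a vector of length",
      len(condition_vectors[0]))
```

Either list of vectors can then be passed to any of the clustering methods discussed earlier.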

[Figure: Colon cancer data (normalized genes); axes: Experiments (horizontal), Genes (vertical)]

Feb 2001 (GG) 31

1. Identify tissue classes (tumor/normal)

First clustering - Experiments

Feb 2001 (GG) 32

2. Find Differentiating And Correlated Genes

Second Clustering - Genes

[Figure labels: Ribosomal proteins, Cytochrome C, HLA2, metabolism]

Feb 2001 (GG) 33

Two-way Clustering

Feb 2001 (GG) 34

Coupled Two-Way Clustering (CTWC)

G. Getz, E. Levine and E. Domany (2000) PNAS

• Why use all the genes to represent conditions and all conditions to represent genes? Different structures emerge when clustering sub-matrices.

• New Goal: Find significant structure in subsets of the data matrix.

• A non-trivial task – exponential number of subsets.

• Recently we proposed a heuristic to solve this problem.
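
The motivating observation, that a sub-matrix can show a structure that the full matrix hides, can be illustrated with a toy example. This is not the CTWC heuristic itself, which the slides do not spell out. The helper names and the 4 x 4 data are hypothetical.

```python
import math


def submatrix(matrix, gene_idx, cond_idx):
    """Rows = selected genes, columns = selected conditions."""
    return [[matrix[g][c] for c in cond_idx] for g in gene_idx]


def condition_distance(matrix, c1, c2):
    """Euclidean distance between two conditions (columns)."""
    return math.sqrt(sum((row[c1] - row[c2]) ** 2 for row in matrix))


# Hypothetical data: 4 genes (rows) x 4 conditions (columns).
# Genes 0-1 separate conditions {0, 1} from {2, 3} with a small amplitude;
# genes 2-3 separate conditions {0, 2} from {1, 3} with a large amplitude.
data = [
    [1.0, 1.0, -1.0, -1.0],
    [0.9, 1.1, -1.0, -0.9],
    [3.0, -3.0, 3.0, -3.0],
    [2.8, -3.1, 3.2, -2.9],
]

# Using all genes, condition 0 looks closer to condition 2 than to 1.
print(round(condition_distance(data, 0, 1), 3),
      round(condition_distance(data, 0, 2), 3))

# Using only genes 0-1, condition 0 is closer to condition 1 instead.
sub = submatrix(data, [0, 1], [0, 1, 2, 3])
print(round(condition_distance(sub, 0, 1), 3),
      round(condition_distance(sub, 0, 2), 3))
```

The large-amplitude genes dominate the distances in the full matrix, so the partition carried by genes 0-1 only becomes visible once the conditions are represented by that gene subset alone.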

Feb 2001 (GG) 35

[Figure: CTWC of colon cancer data; panels (A) and (B)]

Feb 2001 (GG) 37

Biological Work

• Literature search for the genes

• Genomics: search for common regulatory signal upstream of the genes

• Proteomics: infer functions.

• Design next experiment – get more data to validate result.

• Find what is in common with sets of experiments/conditions.

Feb 2001 (GG) 38

Summary

• Clustering methods are used to

– find genes from the same biological process

– group the experiments into similar conditions

• Different clustering methods can give different results. The physically motivated ones are more robust.

• Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions

www.weizmann.ac.il/physics/complex/compphys
