Tutorial 8


Clustering


Clustering
• General Methods
  – Unsupervised Clustering
    • Hierarchical clustering
    • K-means clustering
• Expression data
  – GEO
  – UCSC
  – ArrayExpress
• Tools
  – EPCLUST
  – MeV


Microarray - Reminder


Expression Data Matrix

• Each column represents all the gene expression levels from a single experiment.

• Each row represents the expression of a gene across all experiments.

Exp1 Exp2 Exp3 Exp4 Exp5 Exp6
Gene 1 -1.2 -2.1 -3 -1.5 1.8 2.9
Gene 2 2.7 0.2 -1.1 1.6 -2.2 -1.7
Gene 3 -2.5 1.5 -0.1 -1.1 -1 0.1
Gene 4 2.9 2.6 2.5 -2.3 -0.1 -2.3
Gene 5 0.1 2.6 2.2 2.7 -2.1
Gene 6 -2.9 -1.9 -2.4 -0.1 -1.9 2.9
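The row and column convention can be illustrated with a small sketch. The values are copied from Genes 1 to 4 in the table above; the names `matrix`, `gene_profile` and `experiment_column` are hypothetical helpers, not part of any tool used in this tutorial.

```python
# Rows are genes, columns are experiments.
experiments = ["Exp1", "Exp2", "Exp3", "Exp4", "Exp5", "Exp6"]
matrix = {
    "Gene 1": [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9],
    "Gene 2": [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],
    "Gene 3": [-2.5, 1.5, -0.1, -1.1, -1.0, 0.1],
    "Gene 4": [2.9, 2.6, 2.5, -2.3, -0.1, -2.3],
}


def gene_profile(gene):
    """Return one row: the expression of a gene across all experiments."""
    return matrix[gene]


def experiment_column(name):
    """Return one column: all gene expression levels from one experiment."""
    j = experiments.index(name)
    return {gene: values[j] for gene, values in matrix.items()}


print(gene_profile("Gene 2"))
print(experiment_column("Exp5"))
```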


Expression Data Matrix

Each element is a log ratio: log2(T/R).

T - the gene expression level in the test sample

R - the gene expression level in the reference sample

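As a minimal sketch of the log ratio, the function below computes log2(T/R) for a single matrix element. The intensity values are made up for illustration and do not come from the tutorial data.

```python
import math


def log_ratio(test, reference):
    """Return log2(T/R) for one gene in one experiment."""
    return math.log2(test / reference)


print(log_ratio(400.0, 100.0))  # T is four times R -> 2.0
print(log_ratio(100.0, 100.0))  # T equals R -> 0.0
print(log_ratio(25.0, 100.0))   # T is a quarter of R -> -2.0
```

A useful property of the log scale is symmetry: a fourfold increase and a fourfold decrease give +2 and -2.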


Microarray Data Matrix

• Black indicates a log ratio of zero, i.e. T ≈ R
• Green indicates a negative log ratio, i.e. T < R
• Red indicates a positive log ratio, i.e. T > R
• Grey indicates missing data
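The colour rules above can be sketched as a small function. Representing missing data as `None` and the `tolerance` parameter are assumptions made for this example; a real heat map would also grade the colour intensity by the size of the log ratio, which is omitted here.

```python
def cell_colour(log_ratio, tolerance=0.0):
    """Map one matrix element to the colour used in the data matrix image."""
    if log_ratio is None:
        return "grey"   # missing data
    if abs(log_ratio) <= tolerance:
        return "black"  # T is approximately equal to R
    return "red" if log_ratio > 0 else "green"


row = [-1.2, 0.0, None, 2.9]
print([cell_colour(v) for v in row])  # ['green', 'black', 'grey', 'red']
```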

Microarray Data: Different representations

[Figure: two panels plotting log ratio (-4 to 4) against Exp (1 to 6), with T<R and T>R marked]


A real example

~500 genes, 3 knockdown conditions

Too complicated to analyze without “help”


Microarray Data: Clusters


How to determine the similarity between two genes? (for clustering)

Patrik D'haeseleer, How does gene expression clustering work?, Nature Biotechnology 23, 1499-1501 (2005), http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
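Two widely used ways to compare expression profiles are Euclidean distance and Pearson correlation. The slides do not state which measure they use, so the sketch below is only an illustration of both, applied to Gene 1 and Gene 6 from the expression data matrix.

```python
import math


def euclidean_distance(a, b):
    """Distance between two profiles: small values mean similar levels."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Correlation between two profiles: values near 1 mean similar shape."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


gene1 = [-1.2, -2.1, -3.0, -1.5, 1.8, 2.9]
gene6 = [-2.9, -1.9, -2.4, -0.1, -1.9, 2.9]

print(euclidean_distance(gene1, gene6))
print(pearson_correlation(gene1, gene6))
```

Euclidean distance is sensitive to the absolute expression levels, while Pearson correlation compares only the shape of the profiles, so the two measures can group the same genes differently.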


Unsupervised Clustering

Hierarchical Clustering


Genes with similar expression patterns are grouped together and are connected by a series of branches (a dendrogram).

[Figure: dendrogram with leaves labelled 1, 6, 3, 5, 2, 4]

Leaves (shapes in our case) represent genes and the length of the paths between leaves represents the distances between genes.


If we want a certain number of clusters, we need to cut the tree at the level that yields that number (in this case, four).

Hierarchical clustering finds an entire hierarchy of clusters.
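A minimal sketch of bottom-up (agglomerative) hierarchical clustering follows. Euclidean distance and average linkage are assumptions chosen for the example, and the toy profiles are made up; the slides do not specify either. Stopping the merging when a given number of clusters remains gives the same groups as cutting the full tree at that level.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]  # start: one cluster per gene
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Hypothetical two-experiment profiles, chosen so the groups are obvious.
profiles = {
    "g1": [1.0, 2.0],
    "g2": [1.1, 2.1],
    "g3": [-3.0, -2.0],
    "g4": [-3.2, -2.1],
    "g5": [5.0, -5.0],
}
print(hierarchical_clusters(profiles, 3))
# [['g1', 'g2'], ['g3', 'g4'], ['g5']]
```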


Hierarchical clustering result

Five clusters


K-means Clustering

An algorithm that partitions the data into K groups.

K=4


How does it work?

The algorithm iteratively divides the genes into K groups and recalculates the center of each group. The result is a grouping in which every gene is assigned to its nearest center; it depends on the initial centers and is not guaranteed to be the best possible grouping for K clusters.

1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
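The four steps can be sketched in plain Python as follows. Euclidean distance, the fixed random `seed`, the `max_iter` limit and the toy points are assumptions made for this example, not details taken from the slides.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # step 1: pick k initial means at random
    assignment = None
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, means[c])) for p in points
        ]
        if new_assignment == assignment:  # step 4: nothing changed, converged
            break
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes the new mean.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = centroid(members)
    return assignment, means


# Hypothetical data: two well-separated groups of three points each.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, means = k_means(points, k=2)
print(labels)
print(means)
```

Because the initial means are chosen at random, different seeds can give different cluster numbering, and on less clearly separated data they can give different groupings.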


Different types of clustering – different results


How to search for expression profiles

• GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/

• Human genome browser: http://genome.ucsc.edu/

• ArrayExpress: http://www.ebi.ac.uk/arrayexpress/


Searching for expression profiles in GEO

• Datasets - suitable for analysis with GEO tools
• Expression profiles by gene
• Microarray experiments
• Probe sets
• Groups of related microarray experiments


Download dataset

Clustering

Statistical analysis


Clustering analysis



The expression distribution for different lines in the cluster


Searching for expression profiles in the Human Genome browser.


Keratin 10 is highly expressed in skin



ArrayExpress



What can we do with all the expression profiles?

Clusters!

How?

EPCLUST: http://www.bioinf.ebc.ee/EP/EP/EPCLUST/


Edit the input matrix: Transpose, Normalize, Randomize

Hierarchical clustering

K-means clustering

In the input matrix, each column should represent a gene and each row should represent an experiment (or individual).
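If a matrix is stored the other way round, it has to be transposed before upload. The sketch below shows what transposing and normalizing do to a small matrix. Scaling each row to mean 0 and standard deviation 1 is an assumed normalization scheme for illustration; the slides do not state which scheme EPCLUST applies.

```python
import math


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def normalize_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (assumed scheme)."""
    result = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        # A constant row has sd 0; map it to zeros instead of dividing by 0.
        result.append([(x - mean) / sd if sd else 0.0 for x in row])
    return result


# Two genes (rows) measured in three experiments (columns).
genes_by_experiments = [[-1.2, -2.1, -3.0],
                        [2.7, 0.2, -1.1]]

experiments_by_genes = transpose(genes_by_experiments)
print(experiments_by_genes)  # [[-1.2, 2.7], [-2.1, 0.2], [-3.0, -1.1]]
print(normalize_rows(genes_by_experiments))
```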


Clusters

Data


Graphical representation of the cluster

Samples found in cluster


10 clusters, as requested


Multi Experiment Viewer (MeV): http://www.tm4.org/mev/
