1
Tutorial 8
Clustering
2
Clustering
• General Methods
  – Unsupervised Clustering
    • Hierarchical clustering
    • K-means clustering
• Expression data
  – GEO
  – UCSC
  – ArrayExpress
• Tools
  – EPCLUST
  – Mev
3
Microarray - Reminder
4
Expression Data Matrix
• Each column represents all the gene expression levels from a single experiment.
• Each row represents the expression of a gene across all experiments.
Exp1 Exp2 Exp3 Exp4 Exp5 Exp6
Gene 1 -1.2 -2.1 -3 -1.5 1.8 2.9
Gene 2 2.7 0.2 -1.1 1.6 -2.2 -1.7
Gene 3 -2.5 1.5 -0.1 -1.1 -1 0.1
Gene 4 2.9 2.6 2.5 -2.3 -0.1 -2.3
Gene 5 0.1 2.6 2.2 2.7 -2.1
Gene 6 -2.9 -1.9 -2.4 -0.1 -1.9 2.9
5
Expression Data Matrix
Each element is a log ratio: log2(T/R).
T - the gene expression level in the test sample
R - the gene expression level in the reference sample
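The log ratio can be illustrated with a minimal sketch. The function name and the intensity values below are hypothetical and chosen only for illustration; they do not come from the tutorial's data.

```python
import math


def log_ratio(test_level, reference_level):
    """Return log2(T/R) for one gene in one experiment."""
    return math.log2(test_level / reference_level)


# Hypothetical expression levels (not taken from the slides).
print(log_ratio(400.0, 100.0))  # T is four times R
print(log_ratio(100.0, 100.0))  # T equals R
print(log_ratio(25.0, 100.0))   # T is a quarter of R
```

With a base-2 logarithm, a two-fold increase and a two-fold decrease have the same magnitude and opposite signs, which is why the matrix values are symmetric around zero.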
Exp1 Exp2 Exp3 Exp4 Exp5 Exp6
Gene 1 -1.2 -2.1 -3 -1.5 1.8 2.9
Gene 2 2.7 0.2 -1.1 1.6 -2.2 -1.7
Gene 3 -2.5 1.5 -0.1 -1.1 -1 0.1
Gene 4 2.9 2.6 2.5 -2.3 -0.1 -2.3
Gene 5 0.1 2.6 2.2 2.7 -2.1
Gene 6 -2.9 -1.9 -2.4 -0.1 -1.9 2.9
6
Microarray Data Matrix
Black indicates a log ratio of zero, i.e. T ≈ R
Green indicates a negative log ratio, i.e. T < R
Red indicates a positive log ratio, i.e. T > R
Grey indicates missing data
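The colour convention can be sketched as a small lookup. The function name and the use of None for a missing value are assumptions made for this example; a real heat map would also scale colour intensity with the size of the log ratio.

```python
def heatmap_color(log_ratio):
    """Map one matrix element to the colour family used in the heat map."""
    if log_ratio is None:   # assumption: None marks a missing measurement
        return "grey"
    if log_ratio > 0:       # T > R
        return "red"
    if log_ratio < 0:       # T < R
        return "green"
    return "black"          # log ratio of zero, T and R are about equal


# Gene 1 from the expression matrix, plus one missing value for illustration.
row = [-1.2, -2.1, -3, -1.5, 1.8, 2.9, None]
print([heatmap_color(value) for value in row])
```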
7
Microarray Data: Different representations
[Figure: log ratio (axis from -4 to 4) plotted against experiment (Exp 1 to 6); the regions are labelled T<R and T>R]
8
A real example
~500 genes, 3 knockdown conditions
Too complicated to analyze without “help”
9
Microarray Data: Clusters
10
How to determine the similarity between two genes? (for clustering)
Patrik D'haeseleer, How does gene expression clustering work?, Nature Biotechnology 23, 1499-1501 (2005), http://www.nature.com/nbt/journal/v23/n12/full/nbt1205-1499.html
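The slide does not name a specific measure. As a sketch, two commonly used choices are the Euclidean distance and a distance based on the Pearson correlation; picking these two, and the function names, are assumptions made for this example. The profiles are Gene 1 and Gene 6 from the expression matrix.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)


def correlation_distance(a, b):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1.0 - pearson_correlation(a, b)


gene1 = [-1.2, -2.1, -3, -1.5, 1.8, 2.9]
gene6 = [-2.9, -1.9, -2.4, -0.1, -1.9, 2.9]
print(euclidean_distance(gene1, gene6))
print(correlation_distance(gene1, gene6))
```

Euclidean distance is sensitive to the magnitude of the values, while the correlation-based distance compares only the shape of the two profiles, so the two measures can group genes differently.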
11
Unsupervised Clustering
Hierarchical Clustering
12
Hierarchical Clustering
Genes with similar expression patterns are grouped together and connected by a series of branches (a dendrogram).
[Figure: two dendrograms over the leaves 1, 6, 3, 5, 2, 4]
Leaves (shapes in our case) represent genes, and the length of the path between two leaves represents the distance between those genes.
13
If we want a certain number of clusters, we need to cut the tree at the level that yields that number (in this case, four).
Hierarchical clustering finds an entire hierarchy of clusters.
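A minimal sketch of bottom-up (agglomerative) hierarchical clustering follows. It assumes Euclidean distance and average linkage, neither of which is specified on the slides, and the profiles are hypothetical. Stopping the merging once k clusters remain gives the same groups as building the whole tree and cutting it at the level with k branches.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean distance over all pairs of members from the two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clusters(profiles, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]  # every gene starts alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # one merge = one tree branch
        del clusters[b]                          # b > a, so index a stays valid
    return clusters


# Hypothetical two-experiment profiles for six genes.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1],
            [5.2, 4.9], [10.0, 0.0], [10.1, 0.2]]
print(hierarchical_clusters(profiles, 3))
```

Calling the function with a different k reuses the same merge order, which reflects the point that one hierarchy contains the clusterings for every number of clusters.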
14
Hierarchical clustering result
Five clusters
15
K-means Clustering
An algorithm that partitions the data into K groups.
K=4
16
How does it work?
The algorithm iteratively divides the genes into K groups and recalculates the center of each group. The result is a set of K groups in which each gene is close to the center of its own group; the result depends on the initial means, so only a local optimum is guaranteed.
1. k initial "means" (in this case k=3) are randomly selected from the data set (shown in color).
2. k clusters are created by associating every observation with the nearest mean.
3. The centroid of each of the k clusters becomes the new mean.
4. Steps 2 and 3 are repeated until convergence has been reached.
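The four steps can be sketched as follows. This is a minimal illustration, assuming Euclidean distance, a fixed random seed, and hypothetical two-dimensional data; the function and parameter names are not from the tutorial.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(values) / len(points) for values in zip(*points)]


def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    means = rng.sample(points, k)  # step 1: pick k data points as initial means
    clusters = []
    for _ in range(max_iterations):
        # Step 2: assign every observation to its nearest mean.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: euclidean(point, means[i]))
            clusters[nearest].append(point)
        # Step 3: the centroid of each cluster becomes the new mean
        # (an empty cluster keeps its previous mean).
        new_means = [centroid(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        # Step 4: stop once the means no longer change.
        if new_means == means:
            break
        means = new_means
    return means, clusters


# Hypothetical data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
means, clusters = k_means(points, k=2)
print(means)
print(clusters)
```

Because the initial means are drawn at random, a different seed can lead to different clusters on less clearly separated data, so it is common to run the algorithm several times.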
17
Different types of clustering – different results
18
How to search for expression profiles
• GEO (Gene Expression Omnibus): http://www.ncbi.nlm.nih.gov/geo/
• Human genome browser: http://genome.ucsc.edu/
• ArrayExpress: http://www.ebi.ac.uk/arrayexpress/
19
20
Searching for expression profiles in GEO
Datasets - suitable for analysis with GEO tools
Expression profiles by gene
Microarray experiments
Probe sets
Groups of related microarray experiments
21
Download dataset
Clustering
Statistical analysis
22
Clustering analysis
23
Download dataset
Clustering
Statistical analysis
24
The expression distribution for different lines in the cluster
25
26
Searching for expression profiles in the Human Genome browser.
27
Keratin 10 is highly expressed in skin
29
30
What can we do with all the expression profiles?
Clusters!
How?
EPCLUST: http://www.bioinf.ebc.ee/EP/EP/EPCLUST/
31
32
33
34
35
36
37
Edit the input matrix: Transpose, Normalize, Randomize
Hierarchical clustering
K-means clustering
In the input matrix, each column should represent a gene and each row should represent an experiment (or individual).
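Transposing and normalizing a matrix can be sketched outside the tool as well. The slides do not describe how EPCLUST normalizes, so the row-wise scaling to mean 0 and standard deviation 1 below is an assumption, as are the function names. The example starts from Gene 1 and Gene 2 stored one gene per row and transposes them so that each row is an experiment.

```python
import math


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def normalize_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (assumed scheme)."""
    normalized = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / len(row))
        # A constant row has no spread; map it to zeros instead of dividing by 0.
        normalized.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return normalized


genes_by_row = [
    [-1.2, -2.1, -3, -1.5, 1.8, 2.9],   # Gene 1
    [2.7, 0.2, -1.1, 1.6, -2.2, -1.7],  # Gene 2
]
experiments_by_row = transpose(genes_by_row)  # 6 rows, one per experiment
print(experiments_by_row)
print(normalize_rows(genes_by_row))
```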
38
Clusters
Data
39
Edit the input matrix: Transpose, Normalize, Randomize
Hierarchical clustering
K-means clustering
In the input matrix, each column should represent a gene and each row should represent an experiment (or individual).
40
Graphical representation of the cluster
Samples found in cluster
41
10 clusters, as requested