Clustering
Jarno Tuimala
Clustering
• Aim
  – Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related to each other than to objects in other clusters
• Exploratory data analysis
  – View all data simultaneously
  – Identify clusters and patterns in data
• Uses:
  – Time series analysis
  – Visualization of known classes
Unsupervised vs. Supervised
Clustering methods
• Hierarchical clustering
  – Single, average (UPGMA) and complete linkage
• Non-hierarchical clustering
  – K-means
Hierarchical clustering
• Two phases
  – Pick a distance method
    • Euclidean
    • Pearson / Spearman correlation
  – Pick the dendrogram drawing method
    • Single linkage
    • Average linkage
    • Complete linkage
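The two phases above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular tool: the function names, the Euclidean distance choice, and the four one-dimensional toy profiles (`g1` to `g4`) are all assumptions made for the example. The linkage rule only changes how the distance between two clusters is derived from the pairwise distances of their members: the minimum (single), the maximum (complete), or the mean over all pairs (average, i.e. UPGMA).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, profiles, linkage):
    """Distance between two clusters, each given as a list of item names."""
    pair_distances = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(pair_distances)
    if linkage == "complete":    # farthest pair of members
        return max(pair_distances)
    if linkage == "average":     # mean over all member pairs (UPGMA)
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def hierarchical_clustering(profiles, linkage="average"):
    """Merge the two closest clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, distance);
    the distances are the heights at which a dendrogram would join them.
    """
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], profiles, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy data: four "genes" measured on a single chip.
profiles = {"g1": [0.0], "g2": [1.0], "g3": [3.0], "g4": [7.0]}

for method in ("single", "average", "complete"):
    heights = [d for _, _, d in hierarchical_clustering(profiles, method)]
    print(method, heights)
```

On this toy data all three rules merge the items in the same order, but the merge heights differ, which is what changes the shape of the drawn dendrogram.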
Distances
• Euclidean
  – Overall difference in expression values between gene or chip expression profiles
  – Profiles with similar values are clustered together
• Correlation
  – Difference in trends
  – Profiles with similar trends are clustered together
  – Typically: Pearson or Spearman correlation
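The difference between the two distance types is easiest to see on a small example. The sketch below is illustrative only: the three profiles are made up, and turning a Pearson correlation r into a distance as 1 - r is one common convention assumed here, not something the slides specify.

```python
import math


def euclidean_distance(a, b):
    """Distance based on the differences in the values themselves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_distance(a, b):
    """Assumed convention: 1 - r, so identical trends give 0."""
    return 1.0 - pearson_correlation(a, b)


# Hypothetical expression profiles over four time points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]  # same trend as gene_a, much higher level
gene_c = [2.5, 2.4, 2.6, 2.5]      # similar level to gene_a, nearly flat

print("Euclidean   a-b:", euclidean_distance(gene_a, gene_b))
print("Euclidean   a-c:", euclidean_distance(gene_a, gene_c))
print("Correlation a-b:", correlation_distance(gene_a, gene_b))
print("Correlation a-c:", correlation_distance(gene_a, gene_c))
```

With Euclidean distance `gene_a` lies closer to `gene_c` (similar values); with correlation distance it lies closer to `gene_b` (similar trend).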
Single, average, and complete linkage
Dendrogram drawing
UPGMA example
[Figure sequence: trees of increasing size. Panel titles: "2...", "3 gene...", "5 gene tree", "7 gene tree", "8 gene tree", "10 gene tree", "10 gene tree non-binary". Leaf labels in the final tree: X55123 Gata3, Kcnd2, Api6, Y18280 Dyrk1b, U16297 Cyb561, U39827 Gpcr25, Y13090 Casp12, Gria4, M33760 Fgfr1, L06443 Gdf3. Condition labels: Time 0, Time 4, Time 24 for Strain chocolate_addict and Strain normal.]
Hierarchical Clustering
[Figure: hierarchical clustering of Gata3, Kcnd2, Api6, Dyrk1b, Cyb561, Casp12, Gria4, Gpcr25, Fgfr1, Gdf3. Credit: Silicon Genetics, 2003]
Heatmap
K-means clustering
• Partitioning method
  – The dataset is divided into K clusters
  – The user needs to decide on K before the run
• K-means is a heuristic algorithm, so different runs can give dissimilar results
  – Make several runs, and select the one giving the minimum sum of within-cluster variances
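The advice to make several runs and keep the best one can be sketched as follows. This is a minimal, hedged illustration in plain Python: the function names, the choice of random data points as starting centers, the toy data, and the use of the within-cluster sum of squared distances as the quantity to minimize are assumptions for the example, not details taken from the slides.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run, started from k randomly chosen data points."""
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Move every center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = mean_point(members)
    wcss = sum(squared_distance(p, centers[a]) for p, a in zip(points, assignment))
    return wcss, assignment


def kmeans(points, k, n_runs=10, seed=0):
    """Repeat K-means n_runs times and keep the run with the lowest WCSS."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[0])


# Hypothetical toy data: two well-separated groups of three points.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

best_wcss, labels = kmeans(points, k=2)
print(best_wcss, labels)
```

The cluster numbers themselves are arbitrary and can differ between runs; only the grouping of the points is meaningful.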
K-means Clustering
[Figure sequence of four slides. Credit: Silicon Genetics, 2003]
Visualization
Gene selection
• Genes are usually filtered before clustering.
  – This decreases calculation time.
• Typically a few hundred genes with the highest variance (or standard deviation) are selected.
• If you have, e.g., two types of cancer, do not use a t-test for selecting genes. The genes would then be chosen precisely because they separate the two types, so the clusters will always appear to differentiate the cancer types.
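A variance filter of the kind described above might look like the sketch below. The function name, the gene names, and the expression values are hypothetical. Note that the filter never looks at the sample classes, which is why it does not bias the clustering the way a t-test between known classes would.

```python
import statistics


def top_variance_genes(expression, n_genes):
    """Return the names of the n_genes genes with the highest sample variance.

    `expression` maps a gene name to its list of expression values,
    one value per chip.
    """
    variances = {
        gene: statistics.variance(values)
        for gene, values in expression.items()
    }
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_genes]


# Hypothetical toy data: three genes measured on four chips.
expression = {
    "gene_flat": [5.0, 5.1, 4.9, 5.0],
    "gene_up": [1.0, 3.0, 6.0, 9.0],
    "gene_noisy": [2.0, 4.0, 1.0, 5.0],
}

selected = top_variance_genes(expression, 2)
print(selected)
```

In a real analysis `n_genes` would be on the order of a few hundred, as suggested above, and the selected genes would then be passed on to the clustering step.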