Clustering and Classification
- Preprocessing
- Distance measures
- Popular algorithms (not necessarily the best ones)
- More sophisticated ones
- Evaluation
- Data mining
Slide 5
- Clustering or classification?
- Is training data available?
- What domain-specific knowledge can be applied?
- What preprocessing of data is needed?
  - Log / data scale and numerical stability
  - Filtering / denoising
  - Nonlinear kernel
  - Feature selection (do I need to use all the data?)
- Is the dimensionality of the data too high?
Slide 6
- Accuracy vs. generality
- Overfitting
- Model selection
[Figure: prediction error vs. model complexity for the training sample and the testing sample; reproduced from Hastie et al.]
Slide 7
How do we process microarray data (clustering)?
- Feature selection: genes, transformations of expression levels.
- Genes discovered in the class comparison (t-test). Risk: missing genes.
- Iterative approach: select genes under different p-value cutoffs, then keep the gene set that performs well under cross-validation.
- Principal components (pro and con).
- Discriminant analysis (e.g., LDA).
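The t-test gene selection above can be sketched as follows. This is a minimal illustration, not the procedure used in the lecture: the Welch form of the t statistic, the function names, and the toy expression values are all assumptions made for the example.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for one gene measured in two classes."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_genes(expr, labels):
    """Return gene names sorted by decreasing |t| between class 0 and class 1."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: 3 genes, 6 samples (3 per class).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # clear class difference
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no class difference
    "geneC": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],  # weak difference, high variance
}
ranking = rank_genes(expr, labels)
print(ranking)
```

When the cutoff is tuned by cross-validation, the ranking should be recomputed inside each training fold; selecting genes on the full data first lets test samples leak into the selection.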
Slide 8
- Dimensionality reduction
- Principal component analysis (PCA)
- Singular value decomposition (SVD)
- Karhunen-Loeve transform (KLT)
[Figure labels: Basis for P; SVD]
Slide 9
- Principal Component Analysis (PCA)
- Other things to consider
- Numerical balance / data normalization
- Noisy directions
- Continuous vs. discrete data
- Principal components are orthogonal to each other; biological signals generally are not
- Principal components are linear combinations of the original data
- Prior knowledge is important
- PCA is not clustering!
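To make "linear combinations of the original data" concrete, here is a small sketch that finds the first principal component of a toy data set by power iteration on the sample covariance matrix. Power iteration is a stand-in chosen so the example needs only the standard library (an SVD routine would normally be used); the function name and the data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix, by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data lying close to the line y = 2x.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
pc1 = first_principal_component(rows)
print(pc1)  # unit-length weights on the two original variables
```

The returned vector holds the weights of the linear combination; here the second weight is roughly twice the first, reflecting the y = 2x trend in the toy data.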
Slide 10
Visualization of Microarray Data
Multidimensional scaling (MDS)
- High-dimensional coordinates are unknown
- Distances between the points are known
- The distances may not be Euclidean, but the embedding reproduces them as closely as possible in a Euclidean space
- Try different dimensions (from one to ???)
- At each dimension, perform the optimal embedding to minimize the embedding error
- Plot embedding error (residue) vs. dimension
- Pick the knee point
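The procedure above (embed at each dimension, then look at the error) can be sketched with classical MDS. The slides do not say which MDS variant is meant, so classical (Torgerson) scaling is an assumption here, as are the function names, the stress definition, and the power-iteration eigen-solver used to keep the example within the standard library.

```python
import math
import random

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def classical_mds(dist, k, iterations=500, seed=0):
    """Embed n points in k dimensions from an n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in sq]
    total = sum(row) / n
    # Double centering: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row[i] - row[j] + total) for j in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    coords = [[] for _ in range(n)]
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        lam = 0.0
        for _ in range(iterations):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                lam = 0.0
                break
            v = [x / norm for x in w]
            lam = sum(vi * wi for vi, wi in zip(v, matvec(b, v)))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i].append(scale * v[i])
        # Deflation: remove the component just extracted.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

def stress(dist, coords):
    """Relative embedding error between given and embedded distances."""
    n = len(dist)
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            num += (dist[i][j] - math.dist(coords[i], coords[j])) ** 2
            den += dist[i][j] ** 2
    return math.sqrt(num / den)

# Hypothetical example: distances between the corners of a 4 x 3 rectangle.
points = [(0, 0), (4, 0), (4, 3), (0, 3)]
dist = [[math.dist(p, q) for q in points] for p in points]
errors = {k: stress(dist, classical_mds(dist, k)) for k in (1, 2, 3)}
print(errors)
```

For this toy input the error drops sharply from one dimension to two and stays near zero afterwards, so the knee is at two dimensions.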
Slide 11
Visualization of Microarray Data Multidimensional scaling
(MDS)
Slide 12
Distance Measure (Metric?)
- What do you mean by similar?
- Euclidean
- Uncentered correlation
- Pearson correlation
Slide 13
Distance Metric
- Euclidean
Probe      Gene   Expression across 14 samples
102123_at  Lip1   1596.000 2040.900 1277.000 4090.500 1357.600 1039.200 1387.300 3189.000 1321.300 2164.400 868.600 185.300 266.400 2527.800
160552_at  Ap1s1  4144.400 3986.900 3083.100 6105.900 3245.800 4468.400 7295.000 5410.900 3162.100 4100.900 4603.200 6066.200 5505.800 5702.700
d_E(Lip1, Ap1s1) = 12883
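The Euclidean distance on this slide can be reproduced with a few lines. The values are transcribed from the two profiles above; the variable and function names are chosen for the example.

```python
import math

# Expression values transcribed from the slide (14 samples per gene).
lip1 = [1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
        3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8]
ap1s1 = [4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
         5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7]

def euclidean(x, y):
    """Square root of the summed squared differences across samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d_e = euclidean(lip1, ap1s1)
print(round(d_e))  # 12883, matching the value on the slide
```

Because Ap1s1 is expressed at a much higher level than Lip1 in every sample, the distance is dominated by that offset rather than by the shape of the two profiles.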
Slide 14
Distance Metric
- Pearson correlation
(Lip1 and Ap1s1 expression profiles as on Slide 13)
d_P(Lip1, Ap1s1) = 0.904
Slide 15
Distance Metric
- Pearson correlation
- Ranges from 1 to -1.
[Figure: example profiles with r = 1 and r = -1]
Slide 16
Distance Metric
- Uncentered correlation
(Lip1 and Ap1s1 expression profiles as on Slide 13)
d_u(Lip1, Ap1s1) = 0.835 (an angle of about 33.4°)
Slide 17
Distance Metric
- Difference between Pearson correlation and uncentered correlation
(Lip1 and Ap1s1 expression profiles as on Slide 13)
- Pearson correlation: baseline expression possible
- Uncentered correlation: all are considered signals
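A small sketch of the difference: Pearson correlation subtracts each profile's mean before comparing, so a constant baseline has no effect, while uncentered correlation (the cosine of the angle between the raw vectors) treats the baseline as signal. The two toy profiles and the function names are hypothetical.

```python
import math

def pearson(x, y):
    """Correlation of the mean-centered profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def uncentered(x, y):
    """Cosine of the angle between the raw (uncentered) profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

# Hypothetical profiles: y is x shifted up by a constant baseline of 10.
x = [1.0, 2.0, 3.0, 4.0]
y = [v + 10.0 for v in x]

r_p = pearson(x, y)      # the baseline is removed by centering
r_u = uncentered(x, y)   # the baseline changes the angle
angle = math.degrees(math.acos(r_u))
print(r_p, r_u, angle)
```

Here the Pearson correlation is 1 (up to rounding) because the shapes are identical, while the uncentered correlation is below 1 because the shift rotates the vector.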
Slide 18
Distance Metric
- Difference between Euclidean and correlation
Slide 19
Distance Metric
- PCC measures similarity; how can we transform it into a distance?
- 1 - PCC
- Negative correlation may also mean the genes are close in a signaling pathway (1 - |PCC|, 1 - PCC^2)
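The three transforms listed above behave differently for anticorrelated genes, which a short sketch makes visible. The function names and the example value r = -0.9 are chosen for illustration.

```python
def dist_signed(r):
    """1 - PCC: anticorrelated profiles are far apart (range 0 to 2)."""
    return 1.0 - r

def dist_abs(r):
    """1 - |PCC|: strong correlation of either sign counts as close."""
    return 1.0 - abs(r)

def dist_squared(r):
    """1 - PCC^2: like 1 - |PCC|, but flatter near |r| = 1."""
    return 1.0 - r * r

r = -0.9  # hypothetical strongly anticorrelated gene pair
d1, d2, d3 = dist_signed(r), dist_abs(r), dist_squared(r)
print(d1, d2, d3)  # roughly 1.9, 0.1, 0.19
```

With 1 - PCC the pair lands near the maximum distance; with the other two it is treated as close, which is the behavior wanted when negative regulation should group genes together.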
Slide 20
Supervised Learning Perceptron neural networks
Slide 21
Supervised Learning Perceptron neural networks
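As a reminder of how a perceptron learns, here is a minimal sketch of the classic perceptron update rule on a linearly separable toy problem (logical AND). The slides show only figures for this topic, so the training loop, the learning rate, and the data are assumptions made for the example.

```python
def predict(w, b, x):
    """Threshold unit: output 1 if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Adjust weights only on misclassified samples, until an epoch has no errors."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            delta = lr * (y - predict(w, b, x))
            if delta != 0:
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                b += delta
                errors += 1
        if errors == 0:
            break
    return w, b

# Hypothetical training set: the AND function of two binary inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
w, b = train_perceptron(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(w, b, predictions)
```

The rule is guaranteed to stop only when the classes are linearly separable, which is the case for this toy set.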
Slide 22
- Supervised Learning
- Support vector machines (SVM) and kernels
- A (binary) classifier only; no data model

- Dimensionality reduction: linear discriminant analysis (LDA)
[Figure: classes A and B with projection direction w; from S. Wu's website]
Slide 25
Linear Discriminant Analysis
[Figure: classes A and B with projection direction w; from S. Wu's website]
Slide 26
- Supervised Learning
- Support vector machines (SVM) and kernels
- Kernel: nonlinear mapping
Slide 27
How do we use microarrays?
- Profiling
- Clustering
  - Cluster to detect patient subgroups
  - Cluster to detect gene clusters and regulatory networks
Slide 28
Slide 29
How do we process microarray data (clustering)? - Unsupervised
Learning Hierarchical Clustering
Slide 30
How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
- Single linkage: the linking distance is the minimum distance between two clusters.
Slide 31
How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
- Complete linkage: the linking distance is the maximum distance between two clusters.
Slide 32
How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
- Average linkage / UPGMA: the linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
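The three linkage definitions differ only in how the pair-wise distances between two clusters are summarized, as this sketch shows. Euclidean distance, the function names, and the two toy clusters are assumptions for the example.

```python
import math

def pairwise(cluster_a, cluster_b):
    """All distances between a member of one cluster and a member of the other."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]

def single_linkage(a, b):
    return min(pairwise(a, b))

def complete_linkage(a, b):
    return max(pairwise(a, b))

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)

# Hypothetical clusters on a line: pair-wise distances are 3, 5, 2, 4.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
# 2.0 5.0 3.5
```

An agglomerative algorithm repeatedly merges the two clusters with the smallest linking distance, so the choice among these three functions determines the shape of the resulting tree.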
Slide 33
How do we process microarray data (clustering)?
- Unsupervised Learning: Hierarchical Clustering
- Single linkage: prone to chaining and sensitive to noise
- Complete linkage: tends to produce compact clusters
- Average linkage: sensitive to the distance metric
Slide 34
-Unsupervised Learning Hierarchical Clustering
Slide 35
Dendrograms
- Distance: the height of each horizontal line represents the distance between the two groups it merges.
- Order: open-source R uses the convention that the tighter clusters are on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.
Slide 36
- Unsupervised Learning
- K-means
- Vector quantization
- K-D trees
- Need to try different K; sensitive to initialization

- Unsupervised Learning
- K-means
- Number of classes K needs to be specified
- Does not always converge to the global optimum (may stop at a local one)
- Sensitive to initialization
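A minimal sketch of the standard (Lloyd) K-means iteration follows: assign each point to its nearest centroid, recompute the centroids, and repeat until nothing changes. The initial centroids are passed in explicitly to make the dependence on initialization visible; the function name and the toy points are hypothetical.

```python
import math

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

def kmeans(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical data: two well-separated groups, K = 2.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Rerunning with several different initial centroids and values of K, then comparing the within-cluster scatter, is the usual way to cope with the sensitivity noted above.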
Slide 39
-Unsupervised Learning - K-means
Slide 40
- Unsupervised Learning
- Self-organizing maps (SOM)
- Neural-network-based method
- Originally used to visualize (embed) high-dimensional data
- Also related to vector quantization
- The idea is to map close data points to the same discrete level
Slide 41
- Issues
- Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn't make sense)
- Data structure is missing
- Not robust to outliers and noise
D'haeseleer 2005 Nat. Biotechnol. 23(12):1499-501
Slide 42
-Model-based clustering methods (Han)
http://www.cs.umd.edu/~bhhan/research2.html Pan et al. Genome
Biology 2002 3:research0009.1
doi:10.1186/gb-2002-3-2-research0009
Slide 43
-Structure-based clustering methods
Slide 44
Data mining is searching for knowledge in data. Related terms:
- Knowledge mining from databases
- Knowledge extraction
- Data/pattern analysis
- Data dredging
- Knowledge Discovery in Databases (KDD)
Slide 45
The process of discovery
- Interactive + iterative
- Scalable approaches
Slide 46
Popular Data Mining Techniques
- Clustering: the dominant technique in gene expression analysis in particular and bioinformatics in general. Partitions data into groups by similarity.
- Classification: the supervised counterpart of clustering; a model of class membership is learned and can subsequently classify unseen data.
- Frequent pattern analysis: a method for identifying frequently recurring patterns (structural and transactional).
- Temporal/sequence analysis: models temporal data (wavelets, FFT, etc.).
- Statistical methods: regression, discriminant analysis.
Slide 47
Summary
- A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- Other metrics include density, information entropy, statistical variance, and radius/diameter.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
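One simple way to check the first point is to compare the average distance within clusters against the average distance between clusters. This is only an illustrative measure with hypothetical names and data, not a metric prescribed by the slides.

```python
import math
from itertools import combinations

def intra_inter(points, labels):
    """Mean pair-wise distance within clusters and between clusters."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(math.dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical clustering result: two tight, well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
intra, inter = intra_inter(points, labels)
print(intra, inter)
```

A small within-cluster average combined with a large between-cluster average corresponds to high intra-class similarity and low inter-class similarity.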
Slide 48
Recommended Literature
1. Bioinformatics: The Machine Learning Approach, by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining: Concepts and Techniques, by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
3. Pattern Classification, by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001
4. The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer-Verlag, 2001