- Slide 1
- Introduction to Machine Learning BMI/IBGP 730 Kun Huang
Department of Biomedical Informatics The Ohio State University
- Slide 2
- Machine Learning Statistical learning Artificial intelligence
Pattern recognition Data mining
- Slide 3
- Machine Learning Supervised Unsupervised Semi-supervised
Regression
- Slide 4
- Clustering and Classification Preprocessing Distance measures
Popular algorithms (not necessarily the best ones) More
sophisticated ones Evaluation Data mining
- Slide 5
- - Clustering or classification? - Is training data available? -
What domain specific knowledge can be applied? - What preprocessing
of data is needed? - Log / data scale and numerical stability -
Filtering / denoising - Nonlinear kernel - Feature selection (do I
need to use all the data?) - Is the dimensionality of the data too
high?
- Slide 6
- -Accuracy vs. generality -Overfitting -Model selection [Figure: prediction error vs. model complexity for training and test samples; reproduced from Hastie et al.]
- Slide 7
- How do we process microarray data (clustering)? - Feature selection: genes, transformations of expression levels. - Genes discovered in class comparison (t-test); risk: missing genes. - Iterative approach: select genes under different p-value cutoffs, then pick the set with good performance using cross-validation. - Principal components (pros and cons). - Discriminant analysis (e.g., LDA).
- Slide 8
- - Dimensionality Reduction - Principal component analysis (PCA) - Singular value decomposition (SVD) - Karhunen-Loève transform (KLT) [Figure: the SVD provides the basis for the data matrix P]
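Since the slide ties PCA to the SVD/eigen-decomposition of the data matrix, a bare-bones sketch may help. This toy Python version (all names are mine, not from the deck) finds the first principal component by power iteration on the covariance matrix instead of a full SVD:

```python
import math

def top_principal_component(data, iters=200):
    """Power iteration on the covariance matrix: returns the direction
    of maximum variance (the first principal component)."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    # sample covariance matrix
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# toy points scattered along the line y ≈ x, so PC1 should be ≈ (0.707, 0.707)
pts = [[t, t + 0.1 * ((-1) ** i)] for i, t in enumerate(range(10))]
pc1 = top_principal_component(pts)
```

Power iteration converges to the dominant eigenvector whenever the top eigenvalue dominates, which is the case for this elongated point cloud.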
- Slide 9
- - Principal Component Analysis (PCA) - Other things to consider - Numerical balance / data normalization - Noisy directions - Continuous vs. discrete data - Principal components are orthogonal to each other; biological data, however, are not - Principal components are linear combinations of the original variables - Prior knowledge is important - PCA is not clustering!
- Slide 10
- Visualization of Microarray Data: Multidimensional scaling (MDS). The high-dimensional coordinates are unknown; the distances between the points are known. The distances may not be Euclidean, but the embedding maintains them in a Euclidean space. Try different dimensions (from one to ???). At each dimension, perform an optimal embedding that minimizes the embedding error. Plot embedding error (residue) vs. dimension and pick the knee point.
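The last step, picking the knee point of the residue-vs-dimension curve, can be automated with a simple heuristic. A minimal sketch (function name and tie-breaking rule are mine; it assumes residuals are listed starting at dimension 1):

```python
def knee_point(residuals):
    """Pick the dimension after which adding dimensions stops helping much:
    the point of maximum 'bend', i.e. the sharpest falloff in improvement."""
    # improvement gained by each added dimension
    drops = [residuals[i] - residuals[i + 1] for i in range(len(residuals) - 1)]
    # knee = where the improvement falls off most sharply
    bends = [drops[i] - drops[i + 1] for i in range(len(drops) - 1)]
    return bends.index(max(bends)) + 2  # +2: dimensions are 1-based

best_dim = knee_point([10.0, 8.0, 2.0, 1.8, 1.7])  # -> 3
```

Here the residue drops steeply up to dimension 3 and is nearly flat afterwards, so dimension 3 is chosen.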
- Slide 11
- Visualization of Microarray Data Multidimensional scaling
(MDS)
- Slide 12
- Distance Measure (Metric?) -What do you mean by similar?
-Euclidean -Uncentered correlation -Pearson correlation
- Slide 13
- Distance Metric -Euclidean
102123_at (Lip1):  1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
160552_at (Ap1s1): 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
d_E(Lip1, Ap1s1) = 12883
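The slide's Euclidean distance can be checked directly on the two expression vectors. A minimal Python sketch (variable names are mine):

```python
import math

# expression vectors for the two probes shown on the slide
lip1  = [1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
         3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8]
ap1s1 = [4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
         5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7]

def euclidean(x, y):
    """d_E(x, y) = sqrt(sum_i (x_i - y_i)^2)"""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean(lip1, ap1s1)  # ≈ 12883, matching the slide
```

Note how the large absolute expression levels dominate: two genes with similar shapes but different magnitudes still end up far apart under this metric.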
- Slide 14
- Distance Metric -Pearson Correlation
102123_at (Lip1):  1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
160552_at (Ap1s1): 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
d_P(Lip1, Ap1s1) = 0.904
- Slide 15
- Distance Metric -Pearson Correlation: ranges from -1 to 1. [Figure: scatter plots illustrating r = 1 and r = -1]
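The Pearson coefficient subtracts each vector's mean before comparing shapes, which is what makes it insensitive to a constant baseline. A minimal sketch (function name is mine):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient:
    r = sum((x_i - mx)(y_i - my)) / sqrt(sum((x_i - mx)^2) * sum((y_i - my)^2))"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den
```

Because the means are removed, `pearson([1, 2, 3], [11, 12, 13])` is exactly 1: a constant offset does not change the correlation.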
- Slide 16
- Distance Metric -Uncentered Correlation
102123_at (Lip1):  1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
160552_at (Ap1s1): 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
d_u(Lip1, Ap1s1) = 0.835 (an angle of about 33.4°)
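The uncentered correlation is the cosine of the angle between the two vectors measured from the origin, with no mean subtraction. A minimal sketch (function name is mine):

```python
import math

def uncentered_corr(x, y):
    """Uncentered ('cosine') correlation: like Pearson but without
    subtracting the means, so the angle is measured from the origin."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# the angle corresponding to the correlation value, in degrees
angle_deg = math.degrees(math.acos(uncentered_corr([1.0, 0.0], [1.0, 1.0])))  # -> 45.0
```

A value of 1 means the vectors point in the same direction (e.g. one is a scaled copy of the other); 0 means they are orthogonal.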
- Slide 17
- Distance Metric -Difference between Pearson correlation and uncentered correlation
102123_at (Lip1):  1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
160552_at (Ap1s1): 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
Pearson correlation: a baseline expression level is possible (the means are subtracted). Uncentered correlation: all values are treated as signal (no mean subtraction).
- Slide 18
- Distance Metric -Difference between Euclidean and
correlation
- Slide 19
- Distance Metric -PCC measures similarity; how can we transform it into a distance? -1 - PCC -Negative correlation may also mean closeness in a signaling pathway, so 1 - |PCC| or 1 - PCC^2 can be used
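The three similarity-to-distance transforms can be collected in one helper. A minimal sketch (function and mode names are mine):

```python
def pcc_to_distance(r, mode="signed"):
    """Turn a Pearson correlation r in [-1, 1] into a distance.
    'signed' : 1 - r      (anti-correlated pairs are far apart)
    'abs'    : 1 - |r|    (anti-correlation counts as close, e.g. same pathway)
    'squared': 1 - r**2   (smooth variant of 'abs')"""
    if mode == "signed":
        return 1 - r
    if mode == "abs":
        return 1 - abs(r)
    return 1 - r * r
```

The choice matters for perfectly anti-correlated genes (r = -1): 'signed' puts them at the maximum distance 2, while 'abs' and 'squared' put them at distance 0.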
- Slide 20
- Supervised Learning Perceptron neural networks
- Slide 21
- Supervised Learning Perceptron neural networks
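The perceptron on these two slides can be sketched in a few lines; this toy Python version (names and data are mine) uses the classic mistake-driven update rule:

```python
def train_perceptron(samples, labels, lr=0.1, epochs=1000):
    """Classic perceptron learning rule: on each misclassified sample,
    w += lr * y * x and b += lr * y. Converges when the two classes
    are linearly separable."""
    d = len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):          # labels in {-1, +1}
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                # misclassified (or on boundary)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                          # converged
            break
    return w, b

# linearly separable toy data: class +1 lies above the line x1 + x2 = 1.5
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
y = [-1, -1, -1, 1, 1]
w, b = train_perceptron(X, y)
```

By the perceptron convergence theorem, the loop terminates with all training points classified correctly whenever such a separating line exists.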
- Slide 22
- -Supervised Learning -Support vector machines (SVM) and Kernels -A (binary) classifier only; no generative model of the data
- Slide 23
- -Supervised Learning - Naïve Bayes classifier -Bayes rule: posterior ∝ conditional prob. × prior prob., i.e. P(class | data) ∝ P(data | class) P(class) -Maximum a posteriori (MAP): pick the class with the largest posterior
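A count-based naïve Bayes with a MAP decision rule fits in a few lines. A minimal sketch (function names are mine; add-one smoothing assumes binary feature values):

```python
import math
from collections import Counter, defaultdict

def nb_train(X, y):
    """Count-based estimates of the prior P(c) and, per feature index j,
    the conditional P(x_j = v | c)."""
    prior = Counter(y)
    cond = defaultdict(Counter)            # (class, feature index) -> value counts
    for xs, c in zip(X, y):
        for j, v in enumerate(xs):
            cond[(c, j)][v] += 1
    return prior, cond, len(X)

def nb_predict(xs, prior, cond, n):
    """MAP decision: argmax_c log P(c) + sum_j log P(x_j | c),
    with add-one (Laplace) smoothing for binary features."""
    best, best_score = None, -math.inf
    for c, nc in prior.items():
        score = math.log(nc / n)
        for j, v in enumerate(xs):
            score += math.log((cond[(c, j)][v] + 1) / (nc + 2))
        if score > best_score:
            best, best_score = c, score
    return best

prior, cond, n = nb_train([(1, 1), (1, 0), (0, 1), (0, 0)], ['A', 'A', 'B', 'B'])
```

On this toy data the first feature alone separates the classes, so the MAP rule recovers 'A' for (1, 1) and 'B' for (0, 0).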
- Slide 24
- - Dimensionality reduction: linear discriminant analysis (LDA) [Figure: two classes A and B projected onto the direction w; from S. Wu's website]
- Slide 25
- Linear Discriminant Analysis [Figure: two classes A and B projected onto the direction w; from S. Wu's website]
- Slide 26
- -Supervised Learning - Support vector machines (SVM) and
Kernels -Kernel nonlinear mapping
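The "kernel = nonlinear mapping" idea can be made concrete with the Gaussian (RBF) kernel, which computes an inner product in an implicit high-dimensional feature space. A minimal sketch (function name is mine):

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: k(x, z) = exp(-gamma * ||x - z||^2).
    Equivalent to an inner product after a nonlinear mapping to an
    infinite-dimensional feature space -- the 'kernel trick'."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)
```

The kernel is 1 when the points coincide, is symmetric, and decays with distance; an SVM replaces every inner product with such a kernel to obtain a nonlinear decision boundary.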
- Slide 27
- How do we use microarrays? Profiling. Clustering: cluster to detect patient subgroups; cluster to detect gene clusters and regulatory networks.
- Slide 28
- Slide 29
- How do we process microarray data (clustering)? - Unsupervised
Learning Hierarchical Clustering
- Slide 30
- How do we process microarray data (clustering)? -Unsupervised
Learning Hierarchical Clustering Single linkage: The linking
distance is the minimum distance between two clusters.
- Slide 31
- How do we process microarray data (clustering)? -Unsupervised
Learning Hierarchical Clustering Complete linkage: The linking
distance is the maximum distance between two clusters.
- Slide 32
- How do we process microarray data (clustering)? -Unsupervised
Learning Hierarchical Clustering Average linkage/UPGMA: The linking
distance is the average of all pair-wise distances between members
of the two clusters. Since all genes and samples carry equal
weight, the linkage is an Unweighted Pair Group Method with
Arithmetic Means (UPGMA).
- Slide 33
- How do we process microarray data (clustering)? -Unsupervised
Learning Hierarchical Clustering Single linkage Prone to chaining
and sensitive to noise Complete linkage Tends to produce compact
clusters Average linkage Sensitive to distance metric
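The three linkage rules on slides 30-33 differ only in how the distance between two clusters is aggregated from the pairwise distances. A naive agglomerative sketch (names are mine; the O(n³) loop is for clarity, not speed):

```python
def hierarchical_cluster(points, dist, linkage="single", k=1):
    """Naive agglomerative clustering: start with singletons and repeatedly
    merge the two clusters with the smallest linking distance until k remain.
    linkage: 'single' (min), 'complete' (max), or 'average' (UPGMA)."""
    agg = {"single": min, "complete": max,
           "average": lambda ds: sum(ds) / len(ds)}[linkage]
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg([dist(a, b) for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)          # merge the closest pair
    return clusters

# two well-separated 1-D groups
groups = hierarchical_cluster([0.0, 0.1, 0.2, 10.0, 10.1],
                              lambda a, b: abs(a - b), linkage="single", k=2)
```

Swapping `linkage` changes the merge order on noisier data: 'single' tends to chain, 'complete' favors compact clusters, and 'average' (UPGMA) sits in between, as the slides note.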
- Slide 34
- -Unsupervised Learning Hierarchical Clustering
- Slide 35
- Dendrograms. Distance: the height of each horizontal line represents the distance between the two groups it merges. Order: open-source R uses the convention that the tighter cluster goes on the left; others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.
- Slide 36
- -Unsupervised Learning - K-means -Vector quantization -K-D
trees -Need to try different K, sensitive to initialization
- Slide 37
- -Unsupervised Learning - K-means
[cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20);
% 4 = K, the number of clusters; 'corr' = correlation distance metric; 'rep', 20 = 20 restarts
- Slide 38
- -Unsupervised Learning - K-means -Number of classes K must be specified -Converges only to a local optimum -Sensitive to initialization
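The caveats above follow directly from Lloyd's algorithm, which the deck's MATLAB call runs under the hood. A toy Python sketch (names and data are mine) makes the alternation explicit:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate assigning points to the nearest centroid
    and moving each centroid to the mean of its assignments. Sensitive to
    the initial centroids, hence restarts in practice (cf. 'rep', 20)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # empty clusters keep their old centroid
        new = [tuple(sum(vals) / len(g) for vals in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:      # assignments stable: local optimum reached
            break
        centroids = new
    return centroids, groups

# two obvious blobs: k-means should recover their means
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centroids, groups = kmeans(data, 2)
```

On well-separated blobs the local optimum is also the global one; on harder data, different seeds can land in different partitions, which is why the MATLAB call above requests 20 replicates.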
- Slide 39
- -Unsupervised Learning - K-means
- Slide 40
- -Unsupervised Learning -Self-organizing maps (SOM) -Neural-network-based method -Originally a method for visualizing (embedding) high-dimensional data -Also related to vector quantization -The idea is to map nearby data points to the same discrete level
- Slide 41
- -Issues -Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn't make sense) -Data structure is missing -Not robust to outliers and noise. D'Haeseleer 2005 Nat. Biotechnol. 23(12):1499-501
- Slide 42
- -Model-based clustering methods (Han)
http://www.cs.umd.edu/~bhhan/research2.html Pan et al. Genome
Biology 2002 3:research0009.1
doi:10.1186/gb-2002-3-2-research0009
- Slide 43
- -Structure-based clustering methods
- Slide 44
- Data Mining is searching for knowledge in data Knowledge mining
from databases Knowledge extraction Data/pattern analysis Data
dredging Knowledge Discovery in Databases (KDD)
- Slide 45
- The process of discovery Interactive + Iterative Scalable
approaches
- Slide 46
- Popular Data Mining Techniques. Clustering: the most dominant technique in use for gene expression analysis in particular and bioinformatics in general; partitions data into groups by similarity. Classification: the supervised counterpart of clustering; models class membership so it can subsequently classify unseen data. Frequent pattern analysis: identifies frequently recurring patterns (structural and transactional). Temporal/sequence analysis: models temporal data (wavelets, FFT, etc.). Statistical methods: regression, discriminant analysis.
- Slide 47
- Summary. A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. Other metrics include density, information entropy, statistical variance, and radius/diameter. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- Slide 48
- Recommended Literature 1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001. 2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001. 3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001. 4. The Elements of Statistical Learning by T. Hastie, R. Tibshirani and J. Friedman, Springer-Verlag, 2001.