Bio277 Lab 2: Clustering and Classification of Microarray Data
Jess Mar
Department of Biostatistics
Quackenbush Lab DFCI
Machine Learning
Machine learning algorithms predict new classes based on patterns discerned from existing data.
Classification algorithms are a form of supervised learning.
Clustering algorithms are a form of unsupervised learning.
Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).
The Golub Data
Golub et al. published gene expression microarray data in a 1999 Science paper entitled: Molecular Classification of Cancer – Class Discovery and Class Prediction by Gene Expression Monitoring.
The primary focus of their paper was to demonstrate a class discovery procedure that could assign tumors to either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML).
Bioconductor has this (pre-processed) data packaged up in golubEsets.
> library(golubEsets)
> library(help=golubEsets)
Some Clustering Algorithms for Array Data
[Figure: the expression data matrix, with genes as rows (1 to N_G) and experiments or microarray slides as columns (1 to N_E); entry x_{g,e} is the value for gene g in experiment e.]
Hierarchical Methods:
Single, Average, Complete Linkage plus other variations.
Partitioning Methods:
Self-Organising Maps (Kohonen)
K-Means Clustering
Gene shaving (Hastie, Tibshirani et al.)
Model-based clustering
…
Plaid models (Lazzeroni & Owen)
Cluster Analysis
Hierarchical Methods:
(Agglomerative, Divisive) + (Single, Average, Complete) Linkage…
Model-based Methods:
Mixed models. Plaid models. Mixture models…
A clustering problem is generally much harder than a classification problem because we don’t know the number of classes.
Clustering genes on the basis of experiments or across a time series.
Elucidate unknown gene function.
Clustering slides on the basis of genes.
Discover subclasses in tissue samples.
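The two orientations differ only in which dimension of the expression matrix is treated as the set of objects to cluster. A minimal sketch in base R, assuming a small simulated matrix `expr` (hypothetical, not the Golub data) with genes in rows and slides in columns:

```r
set.seed(1)
# Hypothetical expression matrix: 6 genes (rows) x 4 slides (columns)
expr <- matrix(rnorm(24), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("slide", 1:4)))

# Clustering genes: distances are computed between rows
gene.dist <- dist(expr)

# Clustering slides: distances are computed between columns,
# so the matrix is transposed first
slide.dist <- dist(t(expr))

gene.tree  <- hclust(gene.dist)
slide.tree <- hclust(slide.dist)
```

The same transposition appears later in the lab, where `t(eset.filt)` is used so that samples are the objects being clustered.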
Hierarchical Clustering
Agglomerative: n genes in n clusters → n genes in 1 cluster
Divisive: n genes in 1 cluster → n genes in n clusters
We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.
Euclidean distance
(Pearson) correlation
Source: J-Express Manual
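The two similarity measures can give quite different answers for the same pair of profiles. A small sketch, assuming two made-up expression profiles `a` and `b` (illustrative values only), and using 1 minus the correlation as one common way to turn a correlation into a distance:

```r
# Two hypothetical expression profiles (not taken from the Golub data)
a <- c(1, 2, 3, 4)
b <- c(2, 4, 6, 8)

# Euclidean distance: sensitive to the magnitude of the values
euclid <- sqrt(sum((a - b)^2))

# Pearson correlation distance (1 - r): sensitive to the shape of the profile only
pearson.dist <- 1 - cor(a, b, method = "pearson")
```

Here `b` is a scaled copy of `a`, so the profiles are well separated in Euclidean terms but have a correlation distance of essentially zero.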
Different Ways to Determine Distances Between Clusters
Single linkage
Complete linkage
Average linkage
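As a rough illustration of how the three linkage rules differ, the sketch below computes each between-cluster distance by hand for two tiny made-up clusters `A` and `B` (hypothetical points on a line):

```r
# Two hypothetical clusters of points on a line
A <- c(0, 1)
B <- c(4, 6)

# All pairwise distances between members of A and members of B
d <- abs(outer(A, B, "-"))

single.link   <- min(d)   # distance between the closest pair
complete.link <- max(d)   # distance between the farthest pair
average.link  <- mean(d)  # mean distance over all pairs
```

In base R the same choice is made through the `method` argument of `hclust` ("single", "complete" or "average").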
Implementing Hierarchical Clustering
Agglomerative hierarchical clustering with the function agnes:
> library(cluster)
> colnames(eset.filt) <- classLabels
> plot(agnes(dist(t(eset.filt), method="euclidean")))
Principal Component Analysis
Dimension reduction tool, closely related to classical multi-dimensional scaling. See GC's lectures for a more in-depth treatment.
In our Golub data set, PCA will take the filtered data (558 genes x 72 samples) and map each sample vector (ALL or AML) from 558 dimensions to 2 dimensions.
> pca.samples <- princomp(eset.filt)
> plot(pca.samples)
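To make the "map each sample to 2 dimensions" step concrete, here is a sketch on simulated data. It uses `prcomp` on the transposed matrix rather than the `princomp` call above, because `princomp` requires more observations than variables; the matrix `expr` and its dimensions are assumptions for illustration only.

```r
set.seed(2)
# Hypothetical data: 50 genes (rows) x 10 samples (columns)
expr <- matrix(rnorm(500), nrow = 50)
colnames(expr) <- paste0("s", 1:10)

# prcomp expects observations in rows, so transpose to samples x genes
pca <- prcomp(t(expr))

# Coordinates of each sample on the first two principal components
scores <- pca$x[, 1:2]
```

Calling `plot(scores)` would then draw one point per sample in the plane of the first two components.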
Principal Components
Classification Example: Support Vector Machine
For this example we will use data from Golub et al.
• 47 patients with ALL, 25 patients with AML
• 7129 genes from an Affymetrix Hu6800 array, but we'll take a subset for this example.
> library(MLInterfaces) ; library(golubEsets)
> library(e1071)
> data(golubMerge)
To fit the support vector machine:
> model <- svm(classLabels[1:40]~., data=t(eset.train))
Visualizing the SVM
What predictions were made for the test set?
> predLabels <- predict(model, t(eset.test))
> predLabels
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML AML AML AML
Levels: ALL AML
How do these stack up to the true classification?
> trueLabels <- classLabels[41:72]
> table(predLabels, trueLabels)
          trueLabels
predLabels ALL AML
       ALL  21   0
       AML   0  11
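One way to summarise this table is the overall accuracy. The sketch below re-enters the counts by hand (the object name `conf` is an assumption for illustration) and divides the diagonal by the total:

```r
# Confusion matrix copied from the table above
# (rows = predicted labels, columns = true labels)
conf <- matrix(c(21, 0, 0, 11), nrow = 2, byrow = TRUE,
               dimnames = list(predLabels = c("ALL", "AML"),
                               trueLabels = c("ALL", "AML")))

# Accuracy: correctly classified samples over all test samples
accuracy <- sum(diag(conf)) / sum(conf)
```

With no off-diagonal counts, all 32 test samples are classified correctly.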
More Materials, More Labs?
Lecture Topics Covered Since Last Lab
Hypothesis Testing of Differentially Expressed Genes
Gene Set Enrichment
Clustering
Classification
Support Vector Machines
Tutorial: Bioconductor Tour