Computer-Aided Discovery Course:
Molecular Classification of Cancer
Chris TK Man, Ph.D.
Texas Children’s Cancer Center
Baylor College of Medicine
Feb 11, 2009
Outline
• Introduction
• Differences between class comparison, class discovery, and class prediction/classification
• Methods used in class discovery
• Methods used in classification
• Examples from the literature
Part 1: Introduction
What is molecular classification in cancer research?
• Use of molecular profiles (e.g. DNA, RNA, or proteins) to classify, diagnose, or predict different types or subtypes of cancer
– Histology subtypes
– Prognostic subtypes:
  • Chemotherapy response
  • Metastasis
  • Survival
  • Recurrence
– Types of similar cancers
The Golub study
• Published in 1999, Science 286:531
• Classified acute leukemias arising from lymphoid precursors (ALL) versus myeloid precursors (AML)
• Cited 2806 times
Successful example in breast cancer I
• MammaPrint was developed by Agendia, the Netherlands Cancer Institute, and the Antoni van Leeuwenhoek Hospital in Amsterdam
• A gene expression profiling test based on a 70-gene signature that predicts the risk of metastasis in breast cancer patients
• Superior to current standards for determining the recurrence risk of breast cancer, such as the NIH criteria
• Validated in more than 1,000 patients and backed by peer-reviewed medical research
Successful example in breast cancer II
• Oncotype DX was developed by Genomic Health
• A clinically validated 21-gene laboratory assay (RT-PCR) that predicts the likelihood of breast cancer recurrence in women with newly diagnosed, early-stage invasive breast cancer, based on a Recurrence Score
• Assesses the benefit from chemotherapy
• Uses formalin-fixed, paraffin-embedded tumor tissue
Part 2: Study objectives
• Class comparison
• Class discovery
• Class prediction/Classification
Class comparison
• Determine whether gene expression profiles differ among samples selected from predefined classes
• Identify which genes are differentially expressed among the classes
• Understand the biology of the disease and the underlying processes or pathways
• Requires control of false discoveries or multiple testing, such as Bonferroni correction
• Examples: cancers with
– Different stages
– Different primary sites
– Different genetic mutations
– Different therapy responses
– Samples taken before and after an intervention
• Classes are predefined independently of the expression profiles
• Methods: t-test and Wilcoxon’s test
Class Discovery
• Cluster analysis, unsupervised learning, and unsupervised pattern recognition
• The classes are unknown a priori and need to be discovered from the data
• Involves estimating the number of classes (or clusters) and assigning objects to these classes
• Goal: Identify novel subtypes of specimens within a population
• Assumption: clinically and morphologically similar specimens may be distinguishable at the molecular level
• Example:
– Identify subclasses of tumors that are biologically homogeneous and whose expression profiles reflect either different cells of origin or disease pathogenesis, e.g. subtypes of B-cell lymphoma
• Uncover biological features of the disease that may be clinically or therapeutically useful
• Methods: hierarchical and K-means clustering
Class Prediction/Classification
• Also called supervised learning
• The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects (learning or training set)
• To build a classifier that will then be used to predict the class of future unlabeled observations
• Similar to class comparison except that the emphasis is on developing a statistical model that can predict class label of a specimen
• Important for diagnostic classification, prognostic prediction, and treatment selection
• Methods: linear discriminant analysis, weighted voting, nearest neighbors
Part 3: Class Discovery
Hierarchical Clustering
• An agglomerative method to join similar genes or cases into groups based on a distance metric
• The process is iterated until all groups are connected in a hierarchical tree
Hierarchical method
[Figure: step-by-step dendrogram construction for genes G1–G10]
• G2 is most similar to G8, so they are joined first
• G6 is most similar to {G2, G8}
• G1 is most similar to G5
• {G1, G5} is most similar to {G6, {G2, G8}}
• Repeat joining until all the samples are clustered
Commonly used distance metrics
• Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$

• 1 − correlation

$$d(x, y) = 1 - \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$
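As an illustration (not from the slides), the two metrics can be sketched in NumPy; `euclidean` and `corr_distance` are hypothetical helper names:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def corr_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return float(1.0 - r)
```

Note that the correlation distance ignores the scale of the two profiles, which is why it is popular for expression data: two genes with the same shape of expression pattern are "close" even if their absolute intensities differ.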
Agglomerative linkage method
• Rules or metrics to determine which elements should be linked:
– Single linkage
– Average linkage
– Complete linkage
Single Linkage
• Calculates the minimum distance between members of one cluster and members of another cluster

$$D_{AB} = \min_{u_i \in A,\; v_j \in B} d(u_i, v_j)$$
Average Linkage
• Calculates the average distance between all members of one cluster and all members of another cluster

$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$
Complete Linkage
• Calculates the maximum distance between members of one cluster and members of another cluster

$$D_{AB} = \max_{u_i \in A,\; v_j \in B} d(u_i, v_j)$$
Differences in linkage methods
• Single linkage tends to chain, creating extended clusters that pick up individual genes
• Average linkage produces clusters of similar variance
• Complete linkage creates clusters of similar size and variability
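The three linkage rules can be sketched as follows, assuming a user-supplied pairwise distance function (in practice, a library routine such as scipy.cluster.hierarchy.linkage implements these inside the full agglomerative algorithm):

```python
def pairwise(A, B, dist):
    """All distances d(u, v) between members of cluster A and cluster B."""
    return [dist(u, v) for u in A for v in B]

def single_linkage(A, B, dist):
    """Minimum pairwise distance between the two clusters."""
    return min(pairwise(A, B, dist))

def average_linkage(A, B, dist):
    """Mean of all N_A * N_B pairwise distances."""
    return sum(pairwise(A, B, dist)) / (len(A) * len(B))

def complete_linkage(A, B, dist):
    """Maximum pairwise distance between the two clusters."""
    return max(pairwise(A, B, dist))
```

For example, with 1-D points A = {0, 1}, B = {3, 5} and absolute difference as the distance, single linkage gives 2, complete linkage 5, and average linkage 3.5.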
Class Problem 1: How many clusters of samples?
[Dendrogram figure with candidate cluster boundaries I–IV]
K-Means Clustering
• Step 1: Specify the cluster number (e.g. 3 clusters C1, C2, C3) and randomly assign genes to the clusters
• Step 2: Calculate the mean expression profile of each cluster
• Step 3: Move genes among clusters to minimize the mean distance between genes and clusters
• Repeat steps 2 and 3 until no genes can be shuffled
K-means
• Pros:
– Fast algorithm; can cluster thousands of objects
– Little difficulty with missing data
• Cons:
– Different solutions for different starting values; use multiple runs
– Sensitive to outliers
– An appropriate number of clusters is often unknown
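The three steps above can be sketched as a minimal k-means in NumPy (an illustrative toy, not the lecture's implementation; `kmeans` is a hypothetical helper):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random initial assignment, then alternate between
    computing cluster means (step 2) and reassigning points (step 3)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Step 2: mean expression profile of each cluster
        # (an empty cluster is re-seeded with a random data point)
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else X[rng.integers(len(X))] for c in range(k)])
        # Step 3: move each point to its nearest cluster center
        new = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new, labels):   # stop when nothing moves
            break
        labels = new
    return labels
```

Because the result depends on the random start, production code reruns this from several seeds and keeps the best solution, echoing the "use multiple runs" caveat above.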
Part 4: Class Prediction/ Classification
Classification
• The object to be predicted assumes one of K predefined classes {1, …, K}
• Associated with each object are: a response or dependent variable (class label), and a set of measurements that form the feature vector (genes)
• The task is to classify an object into one of the K classes on the basis of an observed measurement X=x
Steps in Classifier Building
• Specimens are split into a training set and a test set
• Training set: feature selection and model fitting, assessed by cross-validation (CV)
• The resulting classifier is evaluated on the test set: accuracy, sensitivity, and specificity
Define the goal of a classifier
• The goal of the classifier should be biologically or clinically relevant and well motivated
• An example from cancer treatment: personalized medicine
– Most cancer treatments benefit only a minority of patients
– Predicting which patients are likely to benefit from the treatment would prevent unnecessary toxicity and inconvenience
– Overtreatment also results in major expense for individuals and society
– Non-responders could be offered an alternative therapy
Feature Selection
Feature selection
• Most of the features are uninformative
• Including a large number of irrelevant features can degrade classification performance
• A small set of features is more useful for downstream applications and analysis
• Feature selection can be performed:
– Explicitly, prior to building the classifier (filter method)
– Implicitly, as an inherent part of the classifier-building procedure (wrapper method), e.g. CART
Feature selection methods
• t- or F-statistics
• Signal-to-noise statistics
• Nonparametric Wilcoxon statistics
• Correlation
• Fold change
• Univariate classification rate
• And many others……
Welch t-statistic
• Does not assume equal variance
$$t_g = \frac{\bar{X}_{gA} - \bar{X}_{gB}}{\sqrt{S_{gA}^2 / N_A + S_{gB}^2 / N_B}}$$

where $\bar{X}_{gA}$ and $\bar{X}_{gB}$ denote the sample average intensities in groups A and B, and $S_{gA}^2$ and $S_{gB}^2$ denote the sample variances for each group.
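A sketch of the statistic in NumPy (equivalently, scipy.stats.ttest_ind with equal_var=False computes the same statistic plus its p-value):

```python
import numpy as np

def welch_t(xa, xb):
    """Welch t-statistic for one gene: does not assume equal group variances."""
    xa, xb = np.asarray(xa, float), np.asarray(xb, float)
    na, nb = len(xa), len(xb)
    var_a = xa.var(ddof=1)   # sample variance S^2 (n - 1 denominator)
    var_b = xb.var(ddof=1)
    return float((xa.mean() - xb.mean()) / np.sqrt(var_a / na + var_b / nb))
```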
Max T
• Determines the family-wise error rate-adjusted p-values using Welch t-statistics
• Algorithm:
– Permute the class labels and compute the Welch t-statistic for each gene
– Record the maximum Welch t-statistic for each of 10,000 permutations
– Compare the distribution of the maximum t-statistics with the observed values of the statistic
– Estimate the p-value for each gene as the proportion of permutation-based maximum t-statistics that are greater than the observed value
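The algorithm above can be sketched as follows (an illustrative toy with far fewer permutations than the 10,000 used in practice; absolute t-values give a two-sided test):

```python
import numpy as np

def maxT_pvalues(X, labels, n_perm=2000, seed=0):
    """Family-wise-error-adjusted p-values via the maxT permutation algorithm:
    permute class labels, record the maximum |t| over all genes each time, and
    compare each gene's observed |t| with that null distribution."""
    X = np.asarray(X, float)            # genes x samples
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    def tstats(lab):
        a, b = X[:, lab == 0], X[:, lab == 1]
        se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                     b.var(axis=1, ddof=1) / b.shape[1])
        return np.abs(a.mean(axis=1) - b.mean(axis=1)) / se

    observed = tstats(labels)
    max_null = np.array([tstats(rng.permutation(labels)).max()
                         for _ in range(n_perm)])
    # Adjusted p-value: proportion of permutation maxima exceeding the observed |t|
    return np.array([(max_null >= t).mean() for t in observed])
```

Taking the maximum over all genes in each permutation is what makes the resulting p-values family-wise-error adjusted rather than per-gene.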
Template matching
• Algorithm:
– Define a template or profile of gene expression
– Identify genes which match the template using correlation
• Simple and flexible
• Can be used with any number of groups and templates, such as finding specific biological expression profiles in multigroup microarray datasets
Area Under the ROC
• ROC analysis:
– Plots the proportion of true positives vs. false positives (sensitivity vs. 1 − specificity) for each possible decision threshold value
– Used in two-class problems
– Calculate the area under the ROC curve (AUC)
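A sketch of the AUC via its rank (Mann-Whitney) formulation, assuming two classes labeled 0 and 1: the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, counting ties as 1/2.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve by the Mann-Whitney formulation:
    fraction of (positive, negative) pairs where the positive scores higher,
    with ties counted as half a win."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return float(wins / (len(pos) * len(neg)))
```

An AUC of 1.0 means perfect separation, 0.5 means no better than chance, and values below 0.5 mean the score is anti-correlated with the class.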
Rank Product
• Assume that, in an experiment with n genes and k replicates, a gene has a probability of 1/n^k of being ranked first in every replicate if the lists are random
• Calculate the combined probability as a rank product:

$$RP_g^{up} = \prod_{i=1}^{k} \left( r_{g,i}^{up} / n_i \right)$$

where $r_{g,i}^{up}$ is the position of gene g in the list of genes in the i-th replicate.
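A sketch of the rank product computation, assuming a genes-by-replicates matrix where larger values mean more up-regulated (rank 1 = most up-regulated in a replicate):

```python
import numpy as np

def rank_products(X):
    """Rank product for up-regulation: rank genes within each replicate
    (1 = largest value) and multiply the rank fractions r_gi / n
    across the k replicates."""
    X = np.asarray(X, float)                 # genes x replicates
    n, k = X.shape
    order = np.argsort(-X, axis=0)           # column-wise order, largest first
    ranks = np.empty_like(order)
    for j in range(k):
        ranks[order[:, j], j] = np.arange(1, n + 1)
    return np.prod(ranks / n, axis=1)
```

Genes consistently near the top of every replicate's list get a rank product close to 0; significance is then typically assessed by permutation.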
SAM
• Addresses a problem with the t-statistic: genes with small fold changes may be statistically significant due to small variances
• Add a small “fudge factor”,s0,to the denominator of the test statistic
• The factor is chosen to minimize the coefficient of variation of the test statistic d(i)
Classification algorithms
Problems of microarray classification
• Number of candidate predictors, p >> number of cases, n
• Algorithms that work well at uncovering structure and providing accurate predictions when n >> p often work poorly when p >> n
• Overfitting occurs if the same dataset is used for both training and testing
Classification algorithms
• Discriminant analysis
• Nearest neighbors
• Decision trees
• Compound covariate predictor
• Neural networks
• Support vector machines
• And many more….
Comparison of classification methods
• No single method is likely to be optimal in all situations
• Performance depends on:
– The biological classification under investigation
– Genetic disparity among classes
– Within-class heterogeneity
– Size of the training set, etc.
A comparative study
• Dudoit et al (2002) compared standard and diagonal discriminant analysis, Golub's weighted voting method, classification trees, and nearest neighbors
• Applied to three datasets:
– Adult lymphoid malignancies separated into two or three classes (Alizadeh et al, 2000)
– Acute lymphocytic and myelogenous leukemia (Golub et al, 1999)
– 60 human tumor cell lines in 8 classes based on site of origin (Ross et al, 2000)
• Diagonal discriminant analysis and nearest neighbors performed the best, suggesting that methods ignoring gene correlations and interactions performed better than more complex models
Diagonal linear discriminant analysis
• Assumptions:
– Gene covariances are assumed to be zero
– Variances of the two classes are the same
• A new sample is assigned to class 2 if

$$\sum_{g=1}^{G} \frac{(x_g - \bar{x}_g^{(2)})^2}{s_g^2} < \sum_{g=1}^{G} \frac{(x_g - \bar{x}_g^{(1)})^2}{s_g^2}$$
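A minimal DLDA sketch (illustrative; the class name `DLDA` and the choice of pooling the per-gene variances across classes are assumptions, not the study's code):

```python
import numpy as np

class DLDA:
    """Diagonal linear discriminant analysis: per-gene class means, one pooled
    per-gene variance, covariances assumed zero. A sample is assigned to the
    class with the smaller variance-scaled squared distance."""
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # pooled within-class variance for each gene (the diagonal of the
        # covariance matrix; off-diagonal terms are ignored by assumption)
        self.var_ = np.mean([X[y == c].var(axis=0, ddof=1)
                             for c in self.classes_], axis=0)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        d = [((X - m) ** 2 / self.var_).sum(axis=1) for m in self.means_]
        return self.classes_[np.argmin(d, axis=0)]
```

Ignoring covariances is exactly what made DLDA competitive in the Dudoit comparison: with p >> n there is not enough data to estimate a full covariance matrix reliably.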
Golub’s weighted gene voting scheme
• Proposed in the first microarray-based classification paper in 1999
• A variant of DLDA:
– The correlation of gene j with the class label is defined by

$$P_j = \frac{\bar{x}_j^{(1)} - \bar{x}_j^{(2)}}{s_j^{(1)} + s_j^{(2)}}$$

– Each gene casts a "weighted vote" for class prediction:

$$V_j = P_j \left( x_j - \frac{\bar{x}_j^{(1)} + \bar{x}_j^{(2)}}{2} \right)$$

– The sum of all votes determines the class of the sample (V > 0: class 1)
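The voting scheme can be sketched as follows (`weighted_vote` is an illustrative helper, assuming two training matrices of samples x genes):

```python
import numpy as np

def weighted_vote(x, X1, X2):
    """Golub-style weighted gene voting: each gene j votes
    V_j = P_j * (x_j - (mean1_j + mean2_j) / 2); the sign of the vote total
    picks class 1 (positive) or class 2 (negative)."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    x = np.asarray(x, float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    P = (m1 - m2) / (s1 + s2)           # signal-to-noise weight per gene
    V = P * (x - (m1 + m2) / 2)         # each gene's weighted vote
    return 1 if V.sum() > 0 else 2
```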
Compound covariate predictor
• Compound covariate for specimen j:

$$C_j = \sum_{i=1}^{G} t_i x_{ij}$$

where $t_i$ is the t-statistic with respect to gene i and $x_{ij}$ is the log expression in specimen j for gene i.

• Classification threshold:

$$C_t = \frac{\bar{c}^{(1)} + \bar{c}^{(2)}}{2}$$

where $\bar{c}^{(1)}$ and $\bar{c}^{(2)}$ are the mean values of the compound covariate for specimens of class 1 and class 2.
Nearest Neighbors
[Figure: a test case v among training samples from 3 classes]
• The training set contains 3 classes: A, B, and C
• Measure v's 3 nearest neighbors, e.g. by Euclidean distance
• Class C is most frequently represented among v's nearest neighbors, so v is assigned to class C
Nearest Neighbors
• Simple, and captures nonlinearities in the true boundary between classes when the number of specimens is sufficient
• The number of neighbors k can have a large impact on the performance of the classifier
– Can be selected by LOOCV
• Votes can be weighted according to class prior probabilities
• Weights can also be assigned to the neighbors in inverse proportion to their distance from the test case
• Heavy computing time and storage requirements
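A minimal k-NN sketch (plain majority vote over Euclidean distances; unweighted, unlike the weighted variants mentioned above):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training sample
    d = np.sqrt(((X_train - np.asarray(x, float)) ** 2).sum(axis=1))
    nearest = y_train[np.argsort(d)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]
```

The "heavy computing time and storage" point is visible here: nothing is learned up front, so every prediction scans the entire training set.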
Classification Tree (example)
• Node 1 (root; Class 1: 10, Class 2: 10) splits on Gene A (≤ 2 vs. > 2)
• Node 2 (Class 1: 6, Class 2: 9) splits on Gene B (≤ 1.7 vs. > 1.7)
• Node 3 (Class 1: 4, Class 2: 1): prediction = class 1
• Node 4 (Class 1: 0, Class 2: 4): prediction = class 2
• Node 5 (Class 1: 6, Class 2: 5) splits on Gene C (≤ −0.3 vs. > −0.3)
• Node 6 (Class 1: 1, Class 2: 5): prediction = class 2
• Node 7 (Class 1: 5, Class 2: 0): prediction = class 1
• Accuracy = 18/20 = 90%
Use of clustering methods for classification
• Avoids the overfitting problem because class information plays no role in deriving the predictors
• But often results in poor performance of the predictor
• If only a subset of genes can distinguish the classes, their influence may be lost in a cluster analysis
Questions?
Class problem 2
• Joe the investigator developed a classifier to predict a clinical variable X using a set of expression microarray data. He followed the feature selection procedure and built the classifier using all the samples. After the classifier was completely specified, he predicted the samples in the same dataset and found that the classifier was 99% accurate. He came to you to ask for your advice and intends to publish these results shortly in Science.
Cross-validation Strategies
Resubstitution estimation
• Classifier is trained using the entire learning set L
• Estimate of the classification error is obtained by running the same learning set L through the classifier and recording the accuracy of the classification
• Problem: error rate can be severely biased downward
[Diagram: feature selection, model fitting, and error estimation are all performed on the same learning set L]
Leave-one-out cross validation (Internal)
[Diagram: feature selection is performed once on the full set L; model fitting on the L − 1 training samples and error estimation on the left-out sample are repeated inside the loop]
• Repeat for the rest of the samples
• Error = sum of the misclassified samples / total number of samples
Leave-one-out cross validation (External)
[Diagram: feature selection and model fitting are both performed on the L − 1 training samples inside the loop; the left-out sample is used only for error estimation]
• Repeat for the rest of the samples
• Error = sum of the misclassified samples / total number of samples
Radmacher et al (2002) J. Comput. Biol., 9:505
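The external procedure can be sketched with a toy nearest-centroid classifier; the key point is that feature selection happens inside each fold, never on the held-out sample (the classifier and the mean-difference selection rule here are illustrative assumptions, not the methods of any cited study):

```python
import numpy as np

def external_loocv_error(X, y, n_features=2):
    """External leave-one-out CV for a simple nearest-centroid classifier
    on two classes (0/1): feature selection is redone inside every fold,
    so the held-out sample never influences which genes are chosen."""
    X, y = np.asarray(X, float), np.asarray(y)
    errors = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        Xt, yt = X[train], y[train]
        # feature selection inside the fold: largest absolute mean difference
        diff = np.abs(Xt[yt == 0].mean(axis=0) - Xt[yt == 1].mean(axis=0))
        feats = np.argsort(-diff)[:n_features]
        # model fitting: class centroids on the selected genes
        c0 = Xt[yt == 0][:, feats].mean(axis=0)
        c1 = Xt[yt == 1][:, feats].mean(axis=0)
        # error estimation on the held-out sample
        xi = X[i, feats]
        pred = 0 if ((xi - c0) ** 2).sum() < ((xi - c1) ** 2).sum() else 1
        errors += int(pred != y[i])
    return errors / len(y)
```

Moving the feature-selection lines outside the loop would turn this into the internal (optimistically biased) variant criticized above.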
K-fold cross-validation
• In K-fold cross-validation, cases in the learning set L are randomly divided into K sets L_k of as nearly equal size as possible
• Classifiers are built on the training sets L − L_k
• Error rates are computed for the validation sets L_k and averaged over k
• Small K typically gives a larger bias but a smaller variance
K-fold cross validation
[Diagram: 5-fold CV, with feature selection and model fitting on L − L_k (4/5 of the data) and error estimation on L_k; repeat 5 times]
• Error = average of the misclassification rates
Monte Carlo cross-validation
• A learning set L is randomly divided into two sets, a training set L1 and a validation set L2
• The classifier is built using L1 and the error rate is computed for L2
• The procedure is repeated a number of times and error rates averaged
• Problem: reduces effective sample size for training purposes
• Large computation time
[Diagram: L is randomly divided into two sets; feature selection and model fitting use L1, error estimation uses L2; repeat many times]
• Error = average of the misclassification rates
Bootstrap estimation
• The error rate for a learning set case x_i is obtained from bootstrap samples that do not contain this observation
• Bootstrap estimators are typically less variable than LOOCV estimators when the sample size is small
• Large computation time
• Tends to be biased upward
Bootstrap estimation of leave-one-out cross-validation
[Diagram: bootstrap samples drawn from L; feature selection and model fitting on each bootstrap sample, error estimation on the omitted observations; repeat many times]
• Error = average of the misclassification rates
Performance
• Accuracy
– Probability of correctly predicting both class A and class B samples
– (TP + TN) / (TP + TN + FP + FN)
• Sensitivity
– Probability that a class A sample is correctly predicted as class A
– TP / (TP + FN)
• Specificity
– Probability that a non-class-A sample is correctly predicted as non-A
– TN / (FP + TN)
• Positive Predictive Value (PPV)
– Probability that a sample predicted as class A actually belongs to class A
– TP / (TP + FP)
• Negative Predictive Value (NPV)
– Probability that a sample predicted as non-class-A actually does not belong to class A
– TN / (FN + TN)

Confusion matrix (rows = actual, columns = predicted):

          Predicted A   Predicted B
Actual A      TP            FN
Actual B      FP            TN
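The five measures can be computed directly from the 2x2 table (a sketch; `performance` is a hypothetical helper):

```python
def performance(tp, fn, fp, tn):
    """Performance measures from a 2x2 confusion matrix
    (class A treated as positive)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (fp + tn),   # true negative rate
        "ppv":         tp / (tp + fp),
        "npv":         tn / (fn + tn),
    }
```

Note that sensitivity and specificity depend only on the classifier, while PPV and NPV also depend on how common class A is in the sample, which matters when moving a classifier from a balanced study set to a clinical population.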
Validation Study
• Classifier should be completely specified before the validation study
• Internal validation cannot replace external validation
• The validation study should reflect real-life situations and contain samples from multiple institutions
• Independent validation studies are essential before results are accepted into medical practice
• The size of a validation study should be large enough to give meaningful confidence intervals on predictive accuracy
• The study should address the medical utility of the new classifier relative to practice standards
Class problem 3
• Can classifiers be validated by comparing the genes present in different classifiers built for the same goal?
– Gene lists are unstable because of correlation among genes
– Gene lists also depend on the array platforms and the methods used in gene selection
Part 5: Examples from the literature
CNS tumor study
• Pomeroy et al (2002) Nature 415:436
• Study questions:
– The relationship of medulloblastoma to other CNS tumors
– The relationship of a subtype to a known biological pathway
– Prediction of treatment outcome
CNS tumors
• Q1: Distinguishing different embryonal CNS tumors
• Motivation: classification of these tumors based on histopathological appearance is not clear
• Samples:
– 10 MB
– 3 CNS/AT/RT
– 5 renal and extrarenal rhabdoid tumors
– 8 supratentorial PNETs
– 10 non-embryonal brain tumors
– 4 normal cerebella
CNS tumors
[Figure: hierarchical clustering of tumor samples (Classic MB, Desmoplastic MB, PNET, AT/RT, GB) using all genes vs. 50 signal-to-noise-selected genes]
Multi-tumor classifier confusion matrix (rows = actual, columns = predicted):

Actual   MD   MGlio  Rhab  Ncer  PNET  Total
MD        8     0      1     0     1     10
MGlio     0    10      0     0     0     10
Rhab      0     0      9     0     1     10
Ncer      0     0      0     4     0      4
PNET      3     0      1     0     4      8
Total    11    10     11     4     6     42

• Used the S/N metric to select 10 genes (all samples in the training set)
• Used a weighted 3-NN classifier (weights based on distance)
• LOOCV
• 35/42 correct classifications (83%)
Classic vs. desmoplastic
• Q2: Classify the classic and desmoplastic subtypes of MB
• Motivation:
– Histological classification is very subjective
– The desmoplastic subtype is associated with Gorlin's syndrome, a defect in the PTCH gene
• Samples: 34 MB samples out of 60
• Methods: k-NN classifier
– Not clear how many genes were used
– Not clear how the genes were selected
– Not clear what version of k-NN was used
• Correct classification of 33/34 (97%); only 1 classic sample was misclassified as desmoplastic
Classic vs. desmoplastic
Response classifier
• Q3: Classify responders vs. non-responders to treatment
• Motivation: the response of patients to therapy is highly variable, and clinical methods for prognostication are imperfect
• Samples: 60 MB patients who were similarly treated
• Methods:
– 1–200 genes were used according to their correlation with survival vs. failure; n = 8 genes gave the lowest error rate
– 5-NN was used
– Correct classification = 47/60 (78%)
Survivors vs. Failures
Confusion matrix (rows = actual, columns = predicted):

Actual      Survivors  Failures  Total
Survivors      37          2       39
Failures       11         10       21
Total          48         12       60
Example 2
• Van 't Veer et al (2002) Nature 415:530
• The study used to develop MammaPrint
• Goal: use gene expression profiles to predict the clinical outcome of breast cancer
• Platform: Agilent oligo arrays
• Samples: 78 sporadic, untreated, lymph-node-negative patients
– 44 free of disease for >= 5 years (good prognosis group)
– 34 developed distant metastases within 5 years (poor prognosis group)
Breast cancer study
• Methods:
– Selected a subset of genes based on deregulation in at least 3 out of 78 tumors
– Used the correlation coefficient to identify prognostic genes (< −0.3 or > 0.3; 231 genes)
– Sequentially added 5 genes at a time from the top of the rank-ordered list to optimize the classifier (70 genes performed the best)
– Prediction is based on the correlation coefficient with the "good prognosis" template and the "poor prognosis" template (the average gene expression of each group)
– LOOCV to estimate accuracy
– Independently validated using 19 samples (7 good-prognosis and 12 poor-prognosis patients)
• Results:
– Correctly predicted 65 out of 78 patients (83%) in the training set
– Correctly predicted 17/19 samples in the test set (89%)
Unsupervised clustering
Prognostic classification
Example III
• Man et al (2005) Cancer Research 65:8142
• Osteosarcoma is the most common malignant bone tumor in children
• Highly prone to chemotherapy resistance
• After the initial treatment, post-operative chemotherapy does not improve the outcome of poor responders
• Hypothesis: expression profiles can distinguish good and poor responders at the time of diagnosis
Personalized medicine
[Diagram: under a single treatment, good responders receive the right treatment while poor responders are over-treated]
Hypothesis driven design
[Diagram: post-treatment specimens, which are enriched in resistant cells, are used for training and LOOCV to build the classifier; the classifier is then applied to initial biopsies]
Experimental design
• 38 tissue samples
• Training set: 24 definitive-surgery specimens (7 good responders, 17 poor responders)
• Testing set: 14 initial biopsies (6 good responders, 8 poor responders)
• t-test to select features; SVM classifier; LOOCV
• Result: a 45-gene chemoresistance signature (Man et al (2005) Cancer Res 65:8142)
Prediction of Initial Biopsies using SVM classifier
Confusion matrix (rows = actual, columns = predicted):

Actual   Good  Poor  Total
Good       5     1      6
Poor       0     8      8

Overall accuracy = 93%; sensitivity = 100%; specificity = 83%
Man et al (2005) Cancer Res 65:8142.
Example IV: Proteomic classifiers
• RNA expression may not correlate with protein expression
• Post-transcriptional and post-translational regulation is not detected by genomic profiling
• The genomic approach in tumors is not suitable for disease monitoring, e.g. detecting relapse in patients in remission
• Allows early detection of cancers while the tumor is not yet clinically detectable
• Plasma is a rich source of potential biomarkers, containing both tumor and host proteins
SELDI TOF-MS detection
[Figure: a laser desorbs proteins from a ProteinChip array; the TOF-MS detector records spectra of intensity vs. molecular mass (Da), from which discriminatory features are extracted]
*The ProteinChip reader was routinely calibrated using insulin and IgG before data collection
Classifier Building
• Samples:
– 29 osteosarcoma (OS) plasma samples
– 20 osteochondroma (OC) plasma samples
• Pre-processed and normalized the MS data
• Used t-statistics to select features
• Used LOOCV to validate the classification accuracy
• Selected the classifier based on LOOCV
OS plasma proteomic signature
Li et al (2006) Proteomics 6:3426
[Figure: protein spectra of OS vs. OC plasma samples plotted against m/z]
Classification Performance
• The 3-nearest-neighbors classifier achieved:
– Overall accuracy = 90% (5/48 misclassified)
– Sensitivity = 97% (1/28 OS misclassified)
– Specificity = 80% (4/20 OC misclassified)
• Permutation analysis showed that the classification accuracy was significant (p < 0.00005)
Classification algorithms
[Figure: classification accuracy (70–95%) of the CCP, DLDA, 1-NN, 3-NN, NC, and SVM classifiers]
Li et al (2006) Proteomics 6:3426
Summary
• What is molecular classification?
• The differences among class comparison, class discovery, and class prediction/classification
• Methods used in class discovery
• Steps and methods used to construct a classifier
• How to properly validate a classifier and measure its accuracy
• Some examples in the literature
Questions?