Computer-Aided Discovery Course:
Molecular Classification of Cancer
Chris TK Man, Ph.D.
Texas Children’s Cancer Center
Baylor College of Medicine
Feb 11, 2009
Outline
• Introduction
• Differences between class comparison, class discovery, and class prediction/classification
• Methods used in class discovery
• Methods used in classification
• Examples from the literature
Part 1: Introduction
What is molecular classification in cancer research?
• Use of molecular profiles (e.g. DNA, RNA, or proteins) to classify, diagnose, or predict different types or subtypes of cancer
– Histology subtypes
– Prognostic subtypes:
  • Chemotherapy response
  • Metastasis
  • Survival
  • Recurrence
– Types of similar cancers
The Golub study
• Published in 1999, Science 286:531
• Classified acute leukemias arising from lymphoid precursors (ALL) versus myeloid precursors (AML)
• Cited 2806 times
Successful example in breast cancer I
• MammaPrint was developed by Agendia, the Netherlands Cancer Institute, and the Antoni van Leeuwenhoek Hospital in Amsterdam
• A gene expression profiling test based on a 70-gene signature that predicts the risk of metastasis in breast cancer patients
• Superior to current standards for determining the recurrence risk of breast cancer, such as the NIH criteria
• Validated in more than 1,000 patients and backed by peer-reviewed medical research
Successful example in breast cancer II
• Oncotype DX was developed by Genomic Health
• A clinically validated 21-gene laboratory assay (RT-PCR) that predicts the likelihood of breast cancer recurrence in women with newly diagnosed, early-stage invasive breast cancer, based on a Recurrence Score
• Assesses the benefit from chemotherapy
• Uses formalin-fixed, paraffin-embedded tumor tissue
Part 2: Study objectives
• Class comparison
• Class discovery
• Class prediction/Classification
Class comparison
• Determine whether gene expression profiles differ among samples selected from predefined classes
• Identify which genes are differentially expressed among the classes
• Understand the biology of the disease and the underlying processes or pathways
• Requires control of false discoveries or multiple testing, such as Bonferroni correction
• Examples: cancers with
– Different stages
– Different primary sites
– Different genetic mutations
– Different therapy responses
– Samples taken before and after an intervention
• Classes are predefined independently of the expression profiles
• Methods: t-test and Wilcoxon’s test
Class Discovery
• Cluster analysis, unsupervised learning, and unsupervised pattern recognition
• The classes are unknown a priori and need to be discovered from the data
• Involves estimating the number of classes (or clusters) and assigning objects to these classes
• Goal: Identify novel subtypes of specimens within a population
• Assumption: clinically and morphologically similar specimens may be distinguishable at the molecular level
• Example:
– Identify subclasses of tumors that are biologically homogeneous and whose expression profiles reflect either different cells of origin or disease pathogenesis, e.g. subtypes of B-cell lymphoma
• Uncover biological features of the disease that may be clinically or therapeutically useful
• Methods: hierarchical and K-means clustering
Class Prediction/Classification
• Also called supervised learning
• The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects (learning or training set)
• To build a classifier that will then be used to predict the class of future unlabeled observations
• Similar to class comparison except that the emphasis is on developing a statistical model that can predict class label of a specimen
• Important for diagnostic classification, prognostic prediction, and treatment selection
• Methods: linear discriminant analysis, weighted voting, nearest neighbors
Part 3: Class Discovery
Hierarchical Clustering
• An agglomerative method to join similar genes or cases into groups based on a distance metric
• The process is iterated until all groups are connected in a hierarchical tree
Hierarchical method
[Figure: step-by-step dendrogram construction for genes G1–G10]
• G2 is most similar to G8, so they are joined first
• G6 is most similar to {G2, G8}
• G1 is most similar to G5
• {G1, G5} is most similar to {G6, {G2, G8}}
• Repeat joining until all the samples are clustered
Commonly used distance metrics
• Euclidean distance

$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$

• 1 − correlation

$$d(x, y) = 1 - \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \sum_{i=1}^{p} (y_i - \bar{y})^2}}$$
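As an illustration (not from the slides), the two metrics can be sketched in NumPy; `euclidean` and `corr_distance` are hypothetical helper names:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def corr_distance(x, y):
    """1 - Pearson correlation: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return float(1.0 - r)
```

Note that the correlation distance ignores the scale of the two profiles, which is why it is popular for expression data: two genes with the same shape of expression pattern are "close" even if their absolute intensities differ.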
Agglomerative linkage method
• Rules or metrics to determine which elements should be linked:
– Single linkage
– Average linkage
– Complete linkage
Single Linkage
• Calculates the minimum distance between members of one cluster and members of another cluster

$$D_{AB} = \min_{u_i \in A,\; v_j \in B} d(u_i, v_j)$$
Average Linkage
• Calculates the average distance between all members of one cluster and all members of another cluster

$$D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)$$
Complete Linkage
• Calculates the maximum distance between members of one cluster and members of another cluster

$$D_{AB} = \max_{u_i \in A,\; v_j \in B} d(u_i, v_j)$$
Differences in linkage methods
• Single linkage tends to chain, creating extended clusters that pick up individual genes
• Average linkage produces clusters of similar variance
• Complete linkage creates clusters of similar size and variability
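The three linkage rules can be sketched as follows, assuming a user-supplied pairwise distance function (in practice, a library routine such as scipy.cluster.hierarchy.linkage implements these inside the full agglomerative algorithm):

```python
def pairwise(A, B, dist):
    """All distances d(u, v) between members of cluster A and cluster B."""
    return [dist(u, v) for u in A for v in B]

def single_linkage(A, B, dist):
    """Minimum pairwise distance between the two clusters."""
    return min(pairwise(A, B, dist))

def average_linkage(A, B, dist):
    """Mean of all N_A * N_B pairwise distances."""
    return sum(pairwise(A, B, dist)) / (len(A) * len(B))

def complete_linkage(A, B, dist):
    """Maximum pairwise distance between the two clusters."""
    return max(pairwise(A, B, dist))
```

For example, with 1-D points A = {0, 1}, B = {3, 5} and absolute difference as the distance, single linkage gives 2, complete linkage 5, and average linkage 3.5.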
Class Problem 1: How many clusters of samples?
[Dendrogram figure with candidate cluster boundaries I–IV]
K-Means Clustering
• Step 1: Specify the cluster number (e.g. 3 clusters C1, C2, C3) and randomly assign genes to the clusters
• Step 2: Calculate the mean expression profile of each cluster
• Step 3: Move genes among clusters to minimize the mean distance between genes and clusters
• Repeat steps 2 and 3 until no genes can be shuffled
K-means
• Pros:
– Fast algorithm; can cluster thousands of objects
– Little difficulty with missing data
• Cons:
– Different solutions for different starting values; use multiple runs
– Sensitive to outliers
– An appropriate number of clusters is often unknown
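The three steps above can be sketched as a minimal k-means in NumPy (an illustrative toy, not the lecture's implementation; `kmeans` is a hypothetical helper):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random initial assignment, then alternate between
    computing cluster means (step 2) and reassigning points (step 3)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iter):
        # Step 2: mean expression profile of each cluster
        # (an empty cluster is re-seeded with a random data point)
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else X[rng.integers(len(X))] for c in range(k)])
        # Step 3: move each point to its nearest cluster center
        new = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        if np.array_equal(new, labels):   # stop when nothing moves
            break
        labels = new
    return labels
```

Because the result depends on the random start, production code reruns this from several seeds and keeps the best solution, echoing the "use multiple runs" caveat above.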
Part 4: Class Prediction/ Classification
Classification
• The object to be predicted assumes one of K predefined classes {1, …, K}
• Associated with each object are: a response or dependent variable (class label), and a set of measurements that form the feature vector (genes)
• The task is to classify an object into one of the K classes on the basis of an observed measurement X=x
Steps in Classifier Building
• Specimens are split into a training set and a test set
• Training set: feature selection and model fitting, assessed by cross-validation (CV)
• The resulting classifier is evaluated on the test set: accuracy, sensitivity, and specificity
Define the goal of a classifier
• The goal of the classifier should be biologically or clinically relevant and well motivated
• An example from cancer treatment: personalized medicine
– Most cancer treatments benefit only a minority of patients
– Predicting which patients are likely to benefit from the treatment would prevent unnecessary toxicity and inconvenience
– Overtreatment also results in major expense for individuals and society
– Non-responders could be offered an alternative therapy
Feature Selection
Feature selection
• Most of the features are uninformative
• Including a large number of irrelevant features can degrade classification performance
• A small set of features is more useful for downstream applications and analysis
• Feature selection can be performed:
– Explicitly, prior to building the classifier (filter method)
– Implicitly, as an inherent part of the classifier-building procedure (wrapper method), e.g. CART
Feature selection methods
• t- or F-statistics
• Signal-to-noise statistics
• Nonparametric Wilcoxon statistics
• Correlation
• Fold change
• Univariate classification rate
• And many others……
Welch t-statistic
• Does not assume equal variance
$$t_g = \frac{\bar{X}_{gA} - \bar{X}_{gB}}{\sqrt{S_{gA}^2 / N_A + S_{gB}^2 / N_B}}$$

where $\bar{X}_{gA}$ and $\bar{X}_{gB}$ denote the sample average intensities in groups A and B, and $S_{gA}^2$ and $S_{gB}^2$ denote the sample variances for each group.
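A sketch of the statistic in NumPy (equivalently, scipy.stats.ttest_ind with equal_var=False computes the same statistic plus its p-value):

```python
import numpy as np

def welch_t(xa, xb):
    """Welch t-statistic for one gene: does not assume equal group variances."""
    xa, xb = np.asarray(xa, float), np.asarray(xb, float)
    na, nb = len(xa), len(xb)
    var_a = xa.var(ddof=1)   # sample variance S^2 (n - 1 denominator)
    var_b = xb.var(ddof=1)
    return float((xa.mean() - xb.mean()) / np.sqrt(var_a / na + var_b / nb))
```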
Max T
• Determines the family-wise error rate-adjusted p-values using Welch t-statistics
• Algorithm:
– Permute the class labels and compute the Welch t-statistic for each gene
– Record the maximum Welch t-statistic for each of 10,000 permutations
– Compare the distribution of the maximum t-statistics with the observed values of the statistic
– Estimate the p-value for each gene as the proportion of permutation-based maximum t-statistics that are greater than the observed value
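The algorithm above can be sketched as follows (an illustrative toy with far fewer permutations than the 10,000 used in practice; absolute t-values give a two-sided test):

```python
import numpy as np

def maxT_pvalues(X, labels, n_perm=2000, seed=0):
    """Family-wise-error-adjusted p-values via the maxT permutation algorithm:
    permute class labels, record the maximum |t| over all genes each time, and
    compare each gene's observed |t| with that null distribution."""
    X = np.asarray(X, float)            # genes x samples
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    def tstats(lab):
        a, b = X[:, lab == 0], X[:, lab == 1]
        se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1] +
                     b.var(axis=1, ddof=1) / b.shape[1])
        return np.abs(a.mean(axis=1) - b.mean(axis=1)) / se

    observed = tstats(labels)
    max_null = np.array([tstats(rng.permutation(labels)).max()
                         for _ in range(n_perm)])
    # Adjusted p-value: proportion of permutation maxima exceeding the observed |t|
    return np.array([(max_null >= t).mean() for t in observed])
```

Taking the maximum over all genes in each permutation is what makes the resulting p-values family-wise-error adjusted rather than per-gene.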
Template matching
• Algorithm:
– Define a template or profile of gene expression
– Identify genes which match the template using correlation
• Simple and flexible
• Can be used with any number of groups and templates, such as finding specific biological expression profiles in multigroup microarray datasets
Area Under the ROC
• ROC analysis:
– Plots the proportion of true positives vs. false positives (sensitivity vs. 1 − specificity) for each possible decision threshold value
– Used in two-class problems
– Calculate the area under the ROC curve (AUC)
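A sketch of the AUC via its rank (Mann-Whitney) formulation, assuming two classes labeled 0 and 1: the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, counting ties as 1/2.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve by the Mann-Whitney formulation:
    fraction of (positive, negative) pairs where the positive scores higher,
    with ties counted as half a win."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return float(wins / (len(pos) * len(neg)))
```

An AUC of 1.0 means perfect separation, 0.5 means no better than chance, and values below 0.5 mean the score is anti-correlated with the class.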
Rank Product
• Assume that, in an experiment with n genes and k replicates, a gene has a probability of 1/n^k of being ranked first in every replicate if the lists are random
• Calculate the combined probability as a rank product:

$$RP_g^{up} = \prod_{i=1}^{k} \left( r_{g,i}^{up} / n_i \right)$$

where $r_{g,i}^{up}$ is the position of gene g in the list of genes in the i-th replicate.
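A sketch of the rank product computation, assuming a genes-by-replicates matrix where larger values mean more up-regulated (rank 1 = most up-regulated in a replicate):

```python
import numpy as np

def rank_products(X):
    """Rank product for up-regulation: rank genes within each replicate
    (1 = largest value) and multiply the rank fractions r_gi / n
    across the k replicates."""
    X = np.asarray(X, float)                 # genes x replicates
    n, k = X.shape
    order = np.argsort(-X, axis=0)           # column-wise order, largest first
    ranks = np.empty_like(order)
    for j in range(k):
        ranks[order[:, j], j] = np.arange(1, n + 1)
    return np.prod(ranks / n, axis=1)
```

Genes consistently near the top of every replicate's list get a rank product close to 0; significance is then typically assessed by permutation.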
SAM
• Addresses a problem with the t-statistic: genes with small fold changes may be statistically significant due to small variances
• Add a small “fudge factor”,s0,to the denominator of the test statistic
• The factor is chosen to minimize the coefficient of variation of the test statistic d(i)
Classification algorithms
Problems of microarray classification
• Number of candidate predictors, p >> number of cases, n
• Algorithms that work well at uncovering structure and providing accurate predictions when n >> p often work poorly when p >> n
• Overfitting occurs if the same dataset is used for both training and testing
Classification algorithms
• Discriminant analysis
• Nearest neighbors
• Decision trees
• Compound covariate predictor
• Neural networks
• Support vector machines
• And many more….
Comparison of classification methods
• No single method is likely to be optimal in all situations
• Performance depends on:
– The biological classification under investigation
– Genetic disparity among classes
– Within-class heterogeneity
– Size of the training set, etc.
A comparative study
• Dudoit et al (2002) compared standard and diagonal discriminant analysis, Golub's weighted voting method, classification trees, and nearest neighbors
• Applied to three datasets:
– Adult lymphoid malignancies separated into two or three classes (Alizadeh et al, 2000)
– Acute lymphocytic and myelogenous leukemia (Golub et al, 1999)
– 60 human tumor cell lines in 8 classes based on site of origin (Ross et al, 2000)
• Diagonal discriminant analysis and nearest neighbors performed the best, suggesting that methods ignoring gene correlations and interactions performed better than more complex models
Diagonal linear discriminant analysis
• Assumptions:
– Gene covariances are assumed to be zero
– Variances of the two classes are the same
• A new sample is assigned to class 2 if

$$\sum_{g=1}^{G} \frac{(x_g - \bar{x}_g^{(2)})^2}{s_g^2} < \sum_{g=1}^{G} \frac{(x_g - \bar{x}_g^{(1)})^2}{s_g^2}$$
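A minimal DLDA sketch (illustrative; the class name `DLDA` and the choice of pooling the per-gene variances across classes are assumptions, not the study's code):

```python
import numpy as np

class DLDA:
    """Diagonal linear discriminant analysis: per-gene class means, one pooled
    per-gene variance, covariances assumed zero. A sample is assigned to the
    class with the smaller variance-scaled squared distance."""
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # pooled within-class variance for each gene (the diagonal of the
        # covariance matrix; off-diagonal terms are ignored by assumption)
        self.var_ = np.mean([X[y == c].var(axis=0, ddof=1)
                             for c in self.classes_], axis=0)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        d = [((X - m) ** 2 / self.var_).sum(axis=1) for m in self.means_]
        return self.classes_[np.argmin(d, axis=0)]
```

Ignoring covariances is exactly what made DLDA competitive in the Dudoit comparison: with p >> n there is not enough data to estimate a full covariance matrix reliably.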
Golub’s weighted gene voting scheme
• Proposed in the first microarray-based classification paper in 1999
• A variant of DLDA:
– The correlation of gene j with the class label is defined by

$$P_j = \frac{\bar{x}_j^{(1)} - \bar{x}_j^{(2)}}{s_j^{(1)} + s_j^{(2)}}$$

– Each gene casts a "weighted vote" for class prediction:

$$V_j = P_j \left( x_j - \frac{\bar{x}_j^{(1)} + \bar{x}_j^{(2)}}{2} \right)$$

– The sum of all votes determines the class of the sample (V > 0: class 1)
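The voting scheme can be sketched as follows (`weighted_vote` is an illustrative helper, assuming two training matrices of samples x genes):

```python
import numpy as np

def weighted_vote(x, X1, X2):
    """Golub-style weighted gene voting: each gene j votes
    V_j = P_j * (x_j - (mean1_j + mean2_j) / 2); the sign of the vote total
    picks class 1 (positive) or class 2 (negative)."""
    X1, X2 = np.asarray(X1, float), np.asarray(X2, float)
    x = np.asarray(x, float)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    s1, s2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    P = (m1 - m2) / (s1 + s2)           # signal-to-noise weight per gene
    V = P * (x - (m1 + m2) / 2)         # each gene's weighted vote
    return 1 if V.sum() > 0 else 2
```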
Compound covariate predictor
• Compound covariate for specimen j:

$$C_j = \sum_{i=1}^{G} t_i x_{ij}$$

where $t_i$ is the t-statistic with respect to gene i and $x_{ij}$ is the log expression in specimen j for gene i.

• Classification threshold:

$$C_t = \frac{\bar{c}^{(1)} + \bar{c}^{(2)}}{2}$$

where $\bar{c}^{(1)}$ and $\bar{c}^{(2)}$ are the mean values of the compound covariate for specimens of class 1 and class 2.
Nearest Neighbors
[Figure: a test case v among training samples from 3 classes]
• The training set contains 3 classes: A, B, and C
• Measure v's 3 nearest neighbors, e.g. by Euclidean distance
• Class C is most frequently represented among v's nearest neighbors, so v is assigned to class C
Nearest Neighbors
• Simple, and captures nonlinearities in the true boundary between classes when the number of specimens is sufficient
• The number of neighbors k can have a large impact on the performance of the classifier
– Can be selected by LOOCV
• Votes can be weighted according to class prior probabilities
• Weights can also be assigned to the neighbors in inverse proportion to their distance from the test case
• Heavy computing time and storage requirements
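A minimal k-NN sketch (plain majority vote over Euclidean distances; unweighted, unlike the weighted variants mentioned above):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    X_train = np.asarray(X_train, float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training sample
    d = np.sqrt(((X_train - np.asarray(x, float)) ** 2).sum(axis=1))
    nearest = y_train[np.argsort(d)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]
```

The "heavy computing time and storage" point is visible here: nothing is learned up front, so every prediction scans the entire training set.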
Classification Tree (example)
• Node 1 (root; Class 1: 10, Class 2: 10) splits on Gene A (≤ 2 vs. > 2)
• Node 2 (Class 1: 6, Class 2: 9) splits on Gene B (≤ 1.7 vs. > 1.7)
• Node 3 (Class 1: 4, Class 2: 1): prediction = class 1
• Node 4 (Class 1: 0, Class 2: 4): prediction = class 2
• Node 5 (Class 1: 6, Class 2: 5) splits on Gene C (≤ −0.3 vs. > −0.3)
• Node 6 (Class 1: 1, Class 2: 5): prediction = class 2
• Node 7 (Class 1: 5, Class 2: 0): prediction = class 1
• Accuracy = 18/20 = 90%
Use of clustering methods for classification
• Avoids the overfitting problem because class information plays no role in deriving the predictors
• But often results in poor performance of the predictor
• If only a subset of genes can distinguish the classes, their influence may be lost in a cluster analysis
Questions?
Class problem 2
• Joe the investigator developed a classifier to predict a clinical variable X using a set of expression microarray data. He followed the feature selection procedure and built the classifier using all the samples. After the classifier was completely specified, he predicted the samples in the same dataset and found that the classifier was 99% accurate. He came to you to ask for your advice and intends to publish these results shortly in Science.
Cross-validation Strategies
Resubstitution estimation
• Classifier is trained using the entire learning set L
• Estimate of the classification error is obtained by running the same learning set L through the classifier and recording the accuracy of the classification
• Problem: error rate can be severely biased downward
[Diagram: feature selection, model fitting, and error estimation are all performed on the same learning set L]
Leave-one-out cross validation (Internal)
[Diagram: feature selection is performed once on the full set L; model fitting on the L − 1 training samples and error estimation on the left-out sample are repeated inside the loop]
• Repeat for the rest of the samples
• Error = sum of the misclassified samples / total number of samples
Leave-one-out cross validation (External)
[Diagram: feature selection and model fitting are both performed on the L − 1 training samples inside the loop; the left-out sample is used only for error estimation]
• Repeat for the rest of the samples
• Error = sum of the misclassified samples / total number of samples
Radmacher et al (2002) J. Comput. Biol., 9:505
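The external procedure can be sketched with a toy nearest-centroid classifier; the key point is that feature selection happens inside each fold, never on the held-out sample (the classifier and the mean-difference selection rule here are illustrative assumptions, not the methods of any cited study):

```python
import numpy as np

def external_loocv_error(X, y, n_features=2):
    """External leave-one-out CV for a simple nearest-centroid classifier
    on two classes (0/1): feature selection is redone inside every fold,
    so the held-out sample never influences which genes are chosen."""
    X, y = np.asarray(X, float), np.asarray(y)
    errors = 0
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        Xt, yt = X[train], y[train]
        # feature selection inside the fold: largest absolute mean difference
        diff = np.abs(Xt[yt == 0].mean(axis=0) - Xt[yt == 1].mean(axis=0))
        feats = np.argsort(-diff)[:n_features]
        # model fitting: class centroids on the selected genes
        c0 = Xt[yt == 0][:, feats].mean(axis=0)
        c1 = Xt[yt == 1][:, feats].mean(axis=0)
        # error estimation on the held-out sample
        xi = X[i, feats]
        pred = 0 if ((xi - c0) ** 2).sum() < ((xi - c1) ** 2).sum() else 1
        errors += int(pred != y[i])
    return errors / len(y)
```

Moving the feature-selection lines outside the loop would turn this into the internal (optimistically biased) variant criticized above.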
K-fold cross-validation
• In K-fold cross-validation, cases in the learning set L are randomly divided into K sets L_k of as nearly equal size as possible
• Classifiers are built on the training sets L − L_k
• Error rates are computed for the validation sets L_k and averaged over k
• Small K typically gives a larger bias but a smaller variance
K-fold cross validation
[Diagram: 5-fold CV, with feature selection and model fitting on L − L_k (4/5 of the data) and error estimation on L_k; repeat 5 times]
• Error = average of the misclassification rates
Monte Carlo cross-validation
• A learning set L is randomly divided into two sets, a training set L1 and a validation set L2
• The classifier is built using L1 and the error rate is computed for L2
• The procedure is repeated a number of times and error rates averaged
• Problem: reduces effective sample size for training purposes
• Large computation time
[Diagram: L is randomly divided into two sets; feature selection and model fitting use L1, error estimation uses L2; repeat many times]
• Error = average of the misclassification rates
Bootstrap estimation
• The error rate for a learning set case x_i is obtained from bootstrap samples that do not contain this observation
• Bootstrap estimators are typically less variable than LOOCV estimators when the sample size is small
• Large computation time
• Tends to be biased upward
Bootstrap estimation of leave-one-out cross-validation
[Diagram: bootstrap samples drawn from L; feature selection and model fitting on each bootstrap sample, error estimation on the omitted observations; repeat many times]
• Error = average of the misclassification rates
Performance
• Accuracy
– Probability of correctly predicting both class A and class B samples
– (TP + TN) / (TP + TN + FP + FN)
• Sensitivity
– Probability that a class A sample is correctly predicted as class A
– TP / (TP + FN)
• Specificity
– Probability that a non-class-A sample is correctly predicted as non-A
– TN / (FP + TN)
• Positive Predictive Value (PPV)
– Probability that a sample predicted as class A actually belongs to class A
– TP / (TP + FP)
• Negative Predictive Value (NPV)
– Probability that a sample predicted as non-class-A actually does not belong to class A
– TN / (FN + TN)

Confusion matrix (rows = actual, columns = predicted):

          Predicted A   Predicted B
Actual A      TP            FN
Actual B      FP            TN
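The five measures can be computed directly from the 2x2 table (a sketch; `performance` is a hypothetical helper):

```python
def performance(tp, fn, fp, tn):
    """Performance measures from a 2x2 confusion matrix
    (class A treated as positive)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (fp + tn),   # true negative rate
        "ppv":         tp / (tp + fp),
        "npv":         tn / (fn + tn),
    }
```

Note that sensitivity and specificity depend only on the classifier, while PPV and NPV also depend on how common class A is in the sample, which matters when moving a classifier from a balanced study set to a clinical population.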
Validation Study
• Classifier should be completely specified before the validation study
• Internal validation cannot replace external validation
• The validation study should reflect real-life situations and contain samples from multiple institutions
• Independent validation studies are essential before results are accepted into medical practice
• The size of a validation study should be large enough to give meaningful confidence intervals on predictive accuracy
• The study should address the medical utility of the new classifier relative to practice standards
Class problem 3
• Can classifiers be validated by comparing the genes present in different classifiers built for the same goal?
– Gene lists are unstable because of correlation among genes
– Gene lists also depend on the array platforms and the methods used in gene selection
Part 5: Examples from the literature
CNS tumor study
• Pomeroy et al (2002) Nature 415:436
• Study questions:
– The relationship of medulloblastoma to other CNS tumors
– The relationship of a subtype to a known biological pathway
– Prediction of treatment outcome
CNS tumors
• Q1: Distinguishing different embryonal CNS tumors
• Motivation: classification of these tumors based on histopathological appearance is not clear
• Samples:
– 10 MB
– 3 CNS/AT/RT
– 5 renal and extrarenal rhabdoid tumors
– 8 supratentorial PNETs
– 10 non-embryonal brain tumors
– 4 normal cerebella
CNS tumors
[Figure: hierarchical clustering of tumor samples (Classic MB, Desmoplastic MB, PNET, AT/RT, GB) using all genes vs. 50 signal-to-noise-selected genes]
Multi-tumor classifier confusion matrix (rows = actual, columns = predicted):

Actual   MD   MGlio  Rhab  Ncer  PNET  Total
MD        8     0      1     0     1     10
MGlio     0    10      0     0     0     10
Rhab      0     0      9     0     1     10
Ncer      0     0      0     4     0      4
PNET      3     0      1     0     4      8
Total    11    10     11     4     6     42

• Used the S/N metric to select 10 genes (all samples in the training set)
• Used a weighted 3-NN classifier (weights based on distance)
• LOOCV
• 35/42 correct classifications (83%)
Classic vs. desmoplastic
• Q2: Classify the classic and desmoplastic subtypes of MB
• Motivation:
– Histological classification is very subjective
– The desmoplastic subtype is associated with Gorlin's syndrome, a defect in the PTCH gene
• Samples: 34 MB samples out of 60
• Methods: k-NN classifier
– Not clear how many genes were used
– Not clear how the genes were selected
– Not clear what version of k-NN was used
• Correct classification of 33/34 (97%); only 1 classic sample was misclassified as desmoplastic
Classic vs. desmoplastic
Response classifier
• Q3: Classify responders vs. non-responders to treatment
• Motivation: the response of patients to therapy is highly variable, and clinical methods for prognostication are imperfect
• Samples: 60 MB patients who were similarly treated
• Methods:
– 1–200 genes were used according to their correlation with survival vs. failure; n = 8 genes gave the lowest error rate
– 5-NN was used
– Correct classification = 47/60 (78%)
Survivors vs. Failures
Confusion matrix (rows = actual, columns = predicted):

Actual      Survivors  Failures  Total
Survivors      37          2       39
Failures       11         10       21
Total          48         12       60
Example 2
• Van 't Veer et al (2002) Nature 415:530
• The study used to develop MammaPrint
• Goal: use gene expression profiles to predict the clinical outcome of breast cancer
• Platform: Agilent oligo arrays
• Samples: 78 sporadic, untreated, lymph-node-negative patients
– 44 free of disease for >= 5 years (good prognosis group)
– 34 developed distant metastases within 5 years (poor prognosis group)
Breast cancer study
• Methods:
– Selected a subset of genes based on deregulation in at least 3 out of 78 tumors
– Used the correlation coefficient to identify prognostic genes (< −0.3 or > 0.3; 231 genes)
– Sequentially added 5 genes at a time from the top of the rank-ordered list to optimize the classifier (70 genes performed the best)
– Prediction is based on the correlation coefficient with the "good prognosis" template and the "poor prognosis" template (the average gene expression of each group)
– LOOCV to estimate accuracy
– Independently validated using 19 samples (7 good-prognosis and 12 poor-prognosis patients)
• Results:
– Correctly predicted 65 out of 78 patients (83%) in the training set
– Correctly predicted 17/19 samples in the test set (89%)
Unsupervised clustering
Prognostic classification
Example III
• Man et al (2005) Cancer Research 65:8142
• Osteosarcoma is the most common malignant bone tumor in children
• Highly prone to chemotherapy resistance
• After the initial treatment, post-operative chemotherapy does not improve the outcome of poor responders
• Hypothesis: expression profiles can distinguish good and poor responders at the time of diagnosis
Personalized medicine
[Diagram: under a single treatment, good responders receive the right treatment while poor responders are over-treated]
Hypothesis driven design
[Diagram: post-treatment specimens, which are enriched in resistant cells, are used for training and LOOCV to build the classifier; the classifier is then applied to initial biopsies]
Experimental design
• 38 tissue samples
• Training set: 24 definitive-surgery specimens (7 good responders, 17 poor responders)
• Testing set: 14 initial biopsies (6 good responders, 8 poor responders)
• t-test to select features; SVM classifier; LOOCV
• Result: a 45-gene chemoresistance signature (Man et al (2005) Cancer Res 65:8142)
Prediction of Initial Biopsies using SVM classifier
Confusion matrix (rows = actual, columns = predicted):

Actual   Good  Poor  Total
Good       5     1      6
Poor       0     8      8

Overall accuracy = 93%; sensitivity = 100%; specificity = 83%
Man et al (2005) Cancer Res 65:8142.
Example IV: Proteomic classifiers
• RNA expression may not correlate with protein expression
• Post-transcriptional and post-translational regulation is not detected by genomic profiling
• The genomic approach in tumors is not suitable for disease monitoring, e.g. detecting relapse in patients in remission
• Allows early detection of cancers while the tumor is not yet clinically detectable
• Plasma is a rich source of potential biomarkers, containing both tumor and host proteins
SELDI TOF-MS detection
[Figure: a laser desorbs proteins from a ProteinChip array; the TOF-MS detector records spectra of intensity vs. molecular mass (Da), from which discriminatory features are extracted]
*The ProteinChip reader was routinely calibrated using insulin and IgG before data collection
Classifier Building
• Samples:
– 29 osteosarcoma (OS) plasma samples
– 20 osteochondroma (OC) plasma samples
• Pre-processed and normalized the MS data
• Used t-statistics to select features
• Used LOOCV to validate the classification accuracy
• Selected the classifier based on LOOCV
OS plasma proteomic signature
Li et al (2006) Proteomics 6:3426
[Figure: protein spectra of OS vs. OC plasma samples plotted against m/z]
Classification Performance
• The 3-nearest-neighbors classifier achieved:
– Overall accuracy = 90% (5/48 misclassified)
– Sensitivity = 97% (1/28 OS misclassified)
– Specificity = 80% (4/20 OC misclassified)
• Permutation analysis showed that the classification accuracy was significant (p < 0.00005)
Classification algorithms
[Figure: classification accuracy (70–95%) of the CCP, DLDA, 1-NN, 3-NN, NC, and SVM classifiers]
Li et al (2006) Proteomics 6:3426
Summary
• What is molecular classification?
• The differences among class comparison, class discovery, and class prediction/classification
• Methods used in class discovery
• Steps and methods used to construct a classifier
• How to properly validate a classifier and measure its accuracy
• Some examples in the literature
Questions?