Computer Aid Discovery Course: Molecular Classification of Cancer Chris TK Man, Ph.D. Texas Children’s Cancer Center Baylor College of Medicine Feb 11, 2009


Page 1: Computer Aid Discovery Course:  Molecular Classification of Cancer

Computer Aid Discovery Course:

Molecular Classification of Cancer

Chris TK Man, Ph.D.

Texas Children’s Cancer Center

Baylor College of Medicine

Feb 11, 2009

Page 2: Computer Aid Discovery Course:  Molecular Classification of Cancer

Outline

• Introduction

• Differences between class comparison, class discovery, and class prediction/classification

• Methods used in class discovery

• Methods used in classification

• Examples from the literature

Page 3: Computer Aid Discovery Course:  Molecular Classification of Cancer

Part 1: Introduction

Page 4: Computer Aid Discovery Course:  Molecular Classification of Cancer

What is molecular classification in cancer research?

• Use of molecular profiles (e.g. DNA, RNA, or proteins) to classify, diagnose, or predict different types or subtypes of cancer

– histology subtypes
– prognostic subtypes
  • Chemotherapy response
  • Metastasis
  • Survival
  • Recurrence

– types of similar cancers

Page 5: Computer Aid Discovery Course:  Molecular Classification of Cancer

The Golub study

• Published in 1999, Science 286:531

• Classified acute leukemias arising from lymphoid precursors (ALL) or myeloid precursors (AML)

• Cited 2806 times

Page 6: Computer Aid Discovery Course:  Molecular Classification of Cancer
Page 7: Computer Aid Discovery Course:  Molecular Classification of Cancer

Successful example in breast cancer I

• MammaPrint, developed by Agendia, the Netherlands Cancer Institute, and the Antoni van Leeuwenhoek Hospital in Amsterdam

• A gene expression profiling test based on a 70-gene signature that predicts the risk of metastasis in breast cancer patients

• Superior to current standards for determination of the recurrence risk for breast cancer, like the NIH criteria

• Validated in more than 1,000 patients and is backed by peer-reviewed medical research

Page 8: Computer Aid Discovery Course:  Molecular Classification of Cancer

Successful example in breast cancer II

• Oncotype DX, developed by Genomic Health

• A clinically validated 21-gene laboratory assay (RT-PCR) that predicts the likelihood of breast cancer recurrence in women with newly diagnosed, early-stage invasive breast cancer based on a Recurrence Score

• Assesses the benefit from chemotherapy

• Uses formalin-fixed, paraffin-embedded tumor tissue

Page 9: Computer Aid Discovery Course:  Molecular Classification of Cancer

Part 2: Study objectives

• Class comparison

• Class discovery

• Class prediction/Classification

Page 10: Computer Aid Discovery Course:  Molecular Classification of Cancer

Class comparison

• Determine whether gene expression profiles differ among samples selected from predefined classes

• Identify which genes are differentially expressed among the classes

• Understand the biology of the disease and the underlying processes or pathways

• Requires control of false discoveries or multiple testing, such as the Bonferroni correction

• Examples: cancers with
  – Different stages
  – Primary site
  – Genetic mutations
  – Therapy response
  – Before and after an intervention

• Classes are predefined independently of the expression profiles

• Methods: t-test and Wilcoxon’s test

Page 11: Computer Aid Discovery Course:  Molecular Classification of Cancer

Class Discovery

• Also called cluster analysis, unsupervised learning, or unsupervised pattern recognition

• The classes are unknown a priori and need to be discovered from the data

• Involves estimating the number of classes (or clusters) and assigning objects to these classes

• Goal: Identify novel subtypes of specimens within a population

• Assumption: clinically and morphologically similar specimens may be distinguishable at the molecular level

• Example:
  – Identify subclasses of tumors that are biologically homogeneous and whose expression profiles reflect either different cells of origin or disease pathogenesis, e.g. subtypes of B-cell lymphoma

• Uncover biological features of the disease that may be clinically or therapeutically useful

• Methods: hierarchical and K-means clustering

Page 12: Computer Aid Discovery Course:  Molecular Classification of Cancer

Class Prediction/Classification

• Also called supervised learning

• The classes are predefined and the task is to understand the basis for the classification from a set of labeled objects (learning or training set)

• To build a classifier that will then be used to predict the class of future unlabeled observations

• Similar to class comparison except that the emphasis is on developing a statistical model that can predict class label of a specimen

• Important for diagnostic classification, prognostic prediction, and treatment selection

• Methods: linear discriminant analysis, weighted voting, nearest neighbors

Page 13: Computer Aid Discovery Course:  Molecular Classification of Cancer

Part 3: Class Discovery

Page 14: Computer Aid Discovery Course:  Molecular Classification of Cancer

Hierarchical Clustering

• An agglomerative method to join similar genes or cases into groups based on a distance metric

• The process is iterated until all groups are connected in a hierarchical tree

Page 15: Computer Aid Discovery Course:  Molecular Classification of Cancer

Hierarchical method

A worked example with ten genes, G1–G10 (figure):

• G2 is most similar to G8, so G2 and G8 are joined first
• G6 is most similar to {G2, G8}, so G6 joins that group
• G1 is most similar to G5, so G1 and G5 are joined
• {G1, G5} is most similar to {G6, {G2, G8}}, so the two groups are joined
• Repeat the joining until all the samples are clustered
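The joining procedure above can be sketched in a few lines of Python. This is a minimal single-linkage illustration, not the slide's exact implementation; the function name and the scalar "profiles" in the demo are hypothetical:

```python
def agglomerate(items, dist):
    """Single-linkage agglomerative clustering; returns the merge history.
    items: list of profiles; dist: distance between two profiles."""
    clusters = [[i] for i in range(len(items))]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] \
                   + [clusters[a] + clusters[b]]
    return merges
```

With three scalar "genes" [0.0, 0.2, 5.0] and absolute difference as the distance, the two closest items merge first, and the outlier joins last.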

Page 16: Computer Aid Discovery Course:  Molecular Classification of Cancer

Commonly used distance metrics

• Euclidean distance

  d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}

• 1 − correlation

  d(x, y) = 1 - \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \; \sum_{i=1}^{p} (y_i - \bar{y})^2}}
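Both metrics are straightforward to compute directly; a minimal sketch (function names are mine, not from the slides):

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def one_minus_correlation(x, y):
    # d(x, y) = 1 - Pearson correlation; 0 for perfectly correlated profiles,
    # 2 for perfectly anti-correlated profiles
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x) *
                    sum((yi - my) ** 2 for yi in y))
    return 1 - num / den
```

Note the different behavior: correlation distance ignores scale (a gene and its doubled profile are at distance 0), while Euclidean distance does not.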

Page 17: Computer Aid Discovery Course:  Molecular Classification of Cancer

Agglomerative linkage method

• Rules or metrics to determine which elements should be linked
  – Single linkage
  – Average linkage
  – Complete linkage

Page 18: Computer Aid Discovery Course:  Molecular Classification of Cancer

Single Linkage

• Calculates the minimum distance between members of one cluster and members of another cluster

  D_{AB} = \min_{u_i \in A,\; v_j \in B} d(u_i, v_j)

Page 19: Computer Aid Discovery Course:  Molecular Classification of Cancer

Average Linkage

• Calculates the average distance between all members of one cluster and all members of another cluster

  D_{AB} = \frac{1}{N_A N_B} \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} d(u_i, v_j)

Page 20: Computer Aid Discovery Course:  Molecular Classification of Cancer

Complete Linkage

• Calculates the maximum distance between members of one cluster and members of another cluster

  D_{AB} = \max_{u_i \in A,\; v_j \in B} d(u_i, v_j)

Page 21: Computer Aid Discovery Course:  Molecular Classification of Cancer

Differences in linkage methods

• Single linkage tends to create extended, chain-like clusters by joining individual genes one at a time

• Average linkage produces clusters of similar variance

• Complete linkage creates clusters of similar size and variability
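The three linkage rules differ only in how they reduce the set of pairwise distances between two clusters; a compact sketch (names and the scalar demo distance are mine):

```python
def _pairwise(A, B, dist):
    # all pairwise distances d(u_i, v_j) with u_i in cluster A, v_j in cluster B
    return [dist(u, v) for u in A for v in B]

def single_linkage(A, B, dist):
    return min(_pairwise(A, B, dist))    # D_AB = min d(u_i, v_j)

def average_linkage(A, B, dist):
    return sum(_pairwise(A, B, dist)) / (len(A) * len(B))

def complete_linkage(A, B, dist):
    return max(_pairwise(A, B, dist))    # D_AB = max d(u_i, v_j)

# demo distance for 1-D "profiles"
scalar_dist = lambda u, v: abs(u - v)
```

For clusters {0, 1} and {4, 6} the pairwise distances are {4, 6, 3, 5}, so single, average, and complete linkage give 3, 4.5, and 6 respectively.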

Page 22: Computer Aid Discovery Course:  Molecular Classification of Cancer

Class Problem 1: How many clusters of samples?

(Figure: dendrograms cut at different heights, giving candidate partitions I–IV)

Page 23: Computer Aid Discovery Course:  Molecular Classification of Cancer

K-Means Clustering

(Figure: 12 genes, G1–G12, assigned to three clusters C1–C3)

Specify the cluster number.

Step 1: Randomly assign genes to clusters

Step 2: Calculate the mean expression profile of each cluster

Page 24: Computer Aid Discovery Course:  Molecular Classification of Cancer

K-means clustering

(Figure continued)

Step 3: Move genes among clusters to minimize the mean distance between genes and clusters

Repeat steps 2 and 3 until no genes can be shuffled

Page 25: Computer Aid Discovery Course:  Molecular Classification of Cancer

K-means

• Pros:
  – Fast algorithm; can cluster thousands of objects
  – Little difficulty with missing data

• Cons:
  – Different solutions for different starting values; use multiple runs
  – Sensitive to outliers
  – An appropriate number of clusters is often unknown
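The two-step loop (recompute means, reassign genes) can be sketched as follows. This is a minimal 1-D illustration under my own simplifications (initial centroids sampled from the data, fixed seed), not a production implementation:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal K-means on scalar 'expression' values."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random starting values
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[j].append(p)
        # recompute each cluster's mean profile
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                   # converged: nothing moved
            break
        centroids = new
    return centroids, clusters
```

Because the result depends on the starting values, real analyses run this multiple times with different seeds and keep the best solution, as the "Cons" above suggest.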

Page 26: Computer Aid Discovery Course:  Molecular Classification of Cancer

Part 4: Class Prediction/ Classification

Page 27: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classification

• The object to be predicted assumes one of K predefined classes {1, …, K}

• Associated with each object are: a response or dependent variable (class label), and a set of measurements that form the feature vector (genes)

• The task is to classify an object into one of the K classes on the basis of an observed measurement X=x

Page 28: Computer Aid Discovery Course:  Molecular Classification of Cancer

Steps in Classifier Building

(Figure: specimens are split into a training set and a test set; feature selection, model fitting, and CV on the training set yield the classifier, which is then evaluated on the test set for accuracy, specificity, and sensitivity)

Page 29: Computer Aid Discovery Course:  Molecular Classification of Cancer

Define the goal of a classifier

• The goal of the classifier should be biologically or clinically relevant and motivated

• An example from cancer treatment: personalized medicine
  – Most cancer treatments benefit only a minority of patients
  – Predicting which patients are likely to benefit from the treatment would prevent unnecessary toxicity and inconvenience
  – Overtreatment also results in major expense for individuals and society
  – Provide an alternative therapy to the non-responders

Page 30: Computer Aid Discovery Course:  Molecular Classification of Cancer

Feature Selection

Page 31: Computer Aid Discovery Course:  Molecular Classification of Cancer

Feature selection

• Most of the features are uninformative

• Including a large number of irrelevant features could degrade classification performance

• A small set of features is more useful for downstream applications and analysis

• Feature selection can be performed:
  – Explicitly, prior to building the classifier (filter method)
  – Implicitly, as an inherent part of the classifier-building procedure (wrapper method), e.g. CART

Page 32: Computer Aid Discovery Course:  Molecular Classification of Cancer

Feature selection methods

• t- or F-statistics

• Signal-to-noise statistics

• Nonparametric Wilcoxon statistics

• Correlation

• Fold change

• Univariate classification rate

• And many others……

Page 33: Computer Aid Discovery Course:  Molecular Classification of Cancer

Welch t-statistic

• Does not assume equal variance

  t_g = \frac{\bar{X}_{gA} - \bar{X}_{gB}}{\sqrt{S_{gA}^2 / N_A + S_{gB}^2 / N_B}}

  where \bar{X}_{gA} and \bar{X}_{gB} denote the sample average intensities in groups A and B, and S_{gA}^2 and S_{gB}^2 denote the sample variances for each group
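The statistic translates directly into code; a minimal sketch for one gene (function name is mine):

```python
def welch_t(a, b):
    """Welch t-statistic for one gene: intensities a in group A, b in group B.
    Uses the unbiased sample variance; no equal-variance assumption."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # S_gA^2
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)   # S_gB^2
    return (ma - mb) / ((va / na + vb / nb) ** 0.5)
```

In practice this is applied to every gene row of the expression matrix, and the resulting statistics are ranked for feature selection.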

Page 34: Computer Aid Discovery Course:  Molecular Classification of Cancer

Max T

• Determines the family-wise error rate-adjusted p-values using Welch t-statistics

• Algorithm:
  – Permute class labels and compute the Welch t-statistic for each gene
  – Record the max Welch t-statistic for each of 10,000 permutations
  – Compare the distribution of max t-statistics with the observed values of the statistic
  – Estimate the p-value for each gene as the proportion of the max permutation-based t-statistics that are greater than the observed value
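The algorithm above can be sketched as follows; this is my own minimal illustration (far fewer permutations than the 10,000 the slide uses, absolute t-statistics, random shuffling rather than exhaustive enumeration):

```python
import random

def max_t_pvalues(X, labels, n_perm=1000, seed=0):
    """Family-wise error-adjusted p-values by the maxT permutation method.
    X: list of gene rows; labels: 0/1 class label per sample (column)."""
    rng = random.Random(seed)

    def welch_t(a, b):
        na, nb = len(a), len(b)
        ma, mb = sum(a) / na, sum(b) / nb
        va = sum((x - ma) ** 2 for x in a) / (na - 1)
        vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
        return (ma - mb) / ((va / na + vb / nb) ** 0.5)

    def stats(lab):
        return [abs(welch_t([v for v, l in zip(row, lab) if l == 0],
                            [v for v, l in zip(row, lab) if l == 1]))
                for row in X]

    observed = stats(labels)
    # record the maximum |t| over all genes for each permuted labelling
    max_t = []
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)
        max_t.append(max(stats(perm)))
    # adjusted p-value: fraction of permutation maxima exceeding the observed |t|
    return [sum(m >= t for m in max_t) / n_perm for t in observed]
```

Taking the maximum over all genes in each permutation is what controls the family-wise error rate across the whole gene list.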

Page 35: Computer Aid Discovery Course:  Molecular Classification of Cancer

Template matching

• Algorithm:
  – Define a template or profile of gene expression
  – Identify genes which match the template using correlation

• Simple and flexible

• Can be used with any number of groups and templates, such as finding specific biological expression profiles in multigroup microarray datasets
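A minimal sketch of the template-matching idea (function names and the correlation cutoff are my choices, not from the slides):

```python
def pearson(x, y):
    # Pearson correlation between a gene profile and a template
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def match_template(X, template, min_r=0.9):
    # keep gene rows whose expression profile correlates with the template
    return [i for i, row in enumerate(X) if pearson(row, template) >= min_r]
```

For example, the template [0, 0, 1, 1] ("low in group 1, high in group 2") selects only genes that rise between the groups, rejecting genes that fall or oscillate.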

Page 36: Computer Aid Discovery Course:  Molecular Classification of Cancer

Area Under the ROC

• ROC analysis:
  – Plots the true positive rate against the false positive rate (sensitivity vs. 1 − specificity) over each possible decision threshold value
  – Used in two-class problems
  – Calculate the area under the ROC curve (AUC)
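The AUC has an equivalent rank interpretation that avoids building the curve explicitly: it is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A minimal sketch (function name is mine):

```python
def auc(scores, labels):
    """AUC as P(score of a random positive > score of a random negative);
    ties count half. labels: 1 = positive class, 0 = negative class."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A gene whose expression perfectly separates the two classes has AUC 1.0; an uninformative gene sits near 0.5.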

Page 37: Computer Aid Discovery Course:  Molecular Classification of Cancer

Rank Product

• Assume that a gene in an experiment with n genes in k replicates has a probability of 1/n^k of being ranked first in every replicate if the lists are random

• Calculate the combined probability as a rank product

  RP_g^{up} = \prod_{i=1}^{k} \left( r_{g,i}^{up} / n_i \right)

  where r_{g,i}^{up} is the position of gene g in the list of genes in the i-th replicate and n_i is the number of genes in that list
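The rank product itself is a one-line computation once the per-replicate ranks are known; a minimal sketch (function name is mine):

```python
from math import prod  # Python 3.8+

def rank_product_up(ranks, ns):
    """Rank product for one gene.
    ranks[i]: rank of the gene in replicate i (1 = most up-regulated);
    ns[i]: number of genes in replicate i's list."""
    return prod(r / n for r, n in zip(ranks, ns))
```

A gene ranked first in all three replicates of a 100-gene experiment gets RP = (1/100)^3 = 1e-6, i.e. a very small value signals consistent up-regulation.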

Page 38: Computer Aid Discovery Course:  Molecular Classification of Cancer

SAM

• Addresses a problem with the t-statistic: genes with small fold changes may be statistically significant due to small variances

• Adds a small “fudge factor”, s0, to the denominator of the test statistic

• The factor is chosen to minimize the coefficient of variation of the test statistic d(i)
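The effect of the fudge factor can be sketched as below. Note the simplifications: this uses a pooled equal-variance standard error and a fixed s0, whereas SAM proper chooses s0 to minimize the coefficient of variation of d(i) across genes:

```python
def sam_d(a, b, s0=0.1):
    """SAM-style statistic for one gene: t-like score with a fudge factor s0
    added to the denominator, so tiny-variance genes are not inflated."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # pooled standard error s(i), as in an equal-variance t-test
    sp2 = (sum((x - ma) ** 2 for x in a) +
           sum((x - mb) ** 2 for x in b)) / (na + nb - 2)
    s = (sp2 * (1 / na + 1 / nb)) ** 0.5
    return (ma - mb) / (s + s0)
```

With s0 = 0 this reduces to the ordinary t-statistic; any positive s0 shrinks the score of genes whose significance rests only on a tiny variance.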

Page 39: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classification algorithms

Page 40: Computer Aid Discovery Course:  Molecular Classification of Cancer

Problems of microarray classification

• Number of candidate predictors, p >> number of cases, n

• Algorithms that work well to uncover structure and provide accurate predictions when n >> p often work poorly when p >> n

• Overfitting problem if the same dataset is used for training and testing

Page 41: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classification algorithms

• Discriminant analysis

• Nearest neighbors

• Decision trees

• Compound covariate predictor

• Neural networks

• Support vector machines

• And many more….

Page 42: Computer Aid Discovery Course:  Molecular Classification of Cancer

Comparison of classification methods

• No single method is likely to be optimal in all situations

• Performance depends on:
  – Biological classification under investigation
  – Genetic disparity among classes
  – Within-class heterogeneity
  – Size of training set, etc.

Page 43: Computer Aid Discovery Course:  Molecular Classification of Cancer

A comparative study

• Dudoit et al (2002) compared standard and diagonal discriminant analysis, Golub’s weighted vote method, classification trees, and nearest neighbors

• Applied to three datasets:
  – Adult lymphoid malignancies separated into two or three classes (Alizadeh et al, 2000)
  – Acute lymphocytic and myelogenous leukemia (Golub et al, 1999)
  – 60 human tumor cell lines into 8 classes based on site of origin (Ross et al, 2000)

• Diagonal discriminant analysis and nearest neighbors performed the best, suggesting that methods that ignore gene correlations and interactions can perform better than more complex models

Page 44: Computer Aid Discovery Course:  Molecular Classification of Cancer

Diagonal linear discriminant analysis

• Assumptions:
  – Gene covariances are assumed to be zero
  – Variances of the two classes are the same

• A new sample x is assigned to class 2 if

  \sum_{g=1}^{G} \frac{(x_g - \bar{x}_{g2})^2}{s_g^2} < \sum_{g=1}^{G} \frac{(x_g - \bar{x}_{g1})^2}{s_g^2}
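The rule above compares two variance-scaled distances; a minimal sketch (function name is mine; the class mean profiles and shared per-gene variances are assumed already estimated from training data):

```python
def dlda_predict(x, mean1, mean2, var):
    """Diagonal LDA: assign x to the class whose mean profile is closer,
    with each gene's squared deviation scaled by its (shared) variance."""
    d1 = sum((xg - m) ** 2 / v for xg, m, v in zip(x, mean1, var))
    d2 = sum((xg - m) ** 2 / v for xg, m, v in zip(x, mean2, var))
    return 2 if d2 < d1 else 1
```

The diagonal covariance assumption is what makes this a per-gene sum with no cross terms, which is exactly why it remains stable when p >> n.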

Page 45: Computer Aid Discovery Course:  Molecular Classification of Cancer

Golub’s weighted gene voting scheme

• Proposed in the first microarray-based classification paper in 1999

• A variant of DLDA:
  – The correlation of gene j with the class label is defined by

    P_j = \frac{\bar{x}_j^{(1)} - \bar{x}_j^{(2)}}{s_j^{(1)} + s_j^{(2)}}

  – Each gene casts a “weighted vote” for class prediction

    V_j = P_j \left( x_j - \frac{\bar{x}_j^{(1)} + \bar{x}_j^{(2)}}{2} \right)

  – The sum of all votes determines the class of the sample (V > 0: class 1)
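The voting scheme can be sketched directly from the two formulas (function name is mine; the per-class means and standard deviations are assumed precomputed from training data):

```python
def golub_vote(x, mean1, mean2, s1, s2):
    """Golub-style weighted voting over the selected genes.
    x: new sample; mean1/mean2, s1/s2: per-gene class means and SDs."""
    v = 0.0
    for xj, m1, m2, sd1, sd2 in zip(x, mean1, mean2, s1, s2):
        p = (m1 - m2) / (sd1 + sd2)           # P_j: gene's class correlation
        v += p * (xj - (m1 + m2) / 2.0)       # V_j: the gene's weighted vote
    return 1 if v > 0 else 2                  # sum of votes; V > 0 -> class 1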
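The voting scheme can be sketched directly from the two formulas (function name is mine; the per-class means and standard deviations are assumed precomputed from training data):

```python
def golub_vote(x, mean1, mean2, s1, s2):
    """Golub-style weighted voting over the selected genes.
    x: new sample; mean1/mean2, s1/s2: per-gene class means and SDs."""
    v = 0.0
    for xj, m1, m2, sd1, sd2 in zip(x, mean1, mean2, s1, s2):
        p = (m1 - m2) / (sd1 + sd2)           # P_j: gene's class correlation
        v += p * (xj - (m1 + m2) / 2.0)       # V_j: the gene's weighted vote
    return 1 if v > 0 else 2                  # sum of votes; V > 0 -> class 1
```

Each gene votes for the class whose mean it sits closer to, and informative genes (large |P_j|) get proportionally louder votes.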

Page 46: Computer Aid Discovery Course:  Molecular Classification of Cancer

Compound covariate predictor

• Compound covariate for specimen j:

  C_j = \sum_{i=1}^{G} t_i x_{ij}

  where t_i is the t-statistic with respect to gene i and x_{ij} is the log expression in specimen j for gene i

• Classification threshold:

  C_t = \frac{\bar{c}^{(1)} + \bar{c}^{(2)}}{2}

  where \bar{c}^{(1)} and \bar{c}^{(2)} are the mean values of the compound covariate for specimens of class 1 and class 2
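The predictor collapses all selected genes into one scalar per specimen and thresholds it; a minimal sketch (function names are mine; the per-gene t-statistics and the class-mean covariates are assumed computed from training data):

```python
def compound_covariate(t, x):
    # C_j = sum_i t_i * x_ij for specimen j
    return sum(ti * xi for ti, xi in zip(t, x))

def ccp_classify(t, x, c1_mean, c2_mean):
    """Classify specimen x by which side of the midpoint threshold C_t its
    compound covariate C_j falls on."""
    cj = compound_covariate(t, x)
    ct = (c1_mean + c2_mean) / 2.0            # C_t
    # assign to the class whose mean compound covariate lies on cj's side
    if c1_mean >= c2_mean:
        return 1 if cj > ct else 2
    return 2 if cj > ct else 1
```

Because C_j is a single weighted sum, the decision boundary is linear in the gene expressions, much like DLDA and the weighted voting scheme above.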

Page 47: Computer Aid Discovery Course:  Molecular Classification of Cancer

Nearest Neighbors

(Figure: a test case v among training cases of classes A, B, and C)

• The training set contains 3 classes: A, B, and C

• v is a test case

• Find v’s 3 nearest neighbors under a distance metric, such as Euclidean distance

• Class C is most frequently represented among v’s nearest neighbors, so v is assigned to class C

Page 48: Computer Aid Discovery Course:  Molecular Classification of Cancer

Nearest Neighbors

• Simple, and captures nonlinearities in the true boundary between classes when the number of specimens is sufficient

• The number of neighbors k can have a large impact on the performance of the classifier
  – Can be selected by LOOCV

• Votes can be weighted according to class prior probabilities

• Weights can be assigned to the neighbors in inverse proportion to their distance from the test case

• Heavy computing time and storage requirements
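The basic (unweighted) version of the classifier fits in a few lines; a minimal sketch with Euclidean distance and simple majority voting (function name is mine):

```python
def knn_predict(train, labels, x, k=3):
    """Classify test case x by majority vote of its k nearest training cases."""
    # rank training cases by (squared) Euclidean distance to x
    order = sorted(range(len(train)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train[i], x)))
    nearest = [labels[i] for i in order[:k]]
    # majority vote among the k nearest neighbours
    return max(set(nearest), key=nearest.count)
```

The "heavy computing time and storage" point is visible here: every prediction scans the entire training set, which must be kept in memory.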

Page 49: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classification Tree

(Figure: a tree classifying 20 samples using Genes A, B, and C)

Node 1 (Class 1: 10, Class 2: 10) splits on Gene A at 2:
  – Node 2 (Class 1: 6, Class 2: 9), split further
  – Node 3 (Class 1: 4, Class 2: 1) → Prediction: 1

Node 2 splits on Gene B at 1.7:
  – Node 4 (Class 1: 0, Class 2: 4) → Prediction: 2
  – Node 5 (Class 1: 6, Class 2: 5), split further

Node 5 splits on Gene C at −0.3:
  – Node 6 (Class 1: 1, Class 2: 5) → Prediction: 2
  – Node 7 (Class 1: 5, Class 2: 0) → Prediction: 1

Accuracy = 18/20 = 90%

Page 50: Computer Aid Discovery Course:  Molecular Classification of Cancer

Use of clustering methods for classification

• Avoids the overfitting problem, because class information plays no role in deriving the predictors

• However, this typically results in poor performance of the predictor

• Only a subset of genes can distinguish the classes, and their influence may be lost in a cluster analysis

Page 51: Computer Aid Discovery Course:  Molecular Classification of Cancer

Questions?

Page 52: Computer Aid Discovery Course:  Molecular Classification of Cancer

Class problem 2

• Joe the investigator developed a classifier to predict a clinical variable X using a set of expression microarray data. He followed the feature selection procedure and built the classifier using all the samples. After the classifier was completely specified, he predicted the samples in the same dataset and found that the classifier was 99% accurate. He came to you for advice, intending to publish these results shortly in Science.

Page 53: Computer Aid Discovery Course:  Molecular Classification of Cancer

Cross-validation Strategies

Page 54: Computer Aid Discovery Course:  Molecular Classification of Cancer

Resubstitution estimation

• Classifier is trained using the entire learning set L

• Estimate of the classification error is obtained by running the same learning set L through the classifier and recording the accuracy of the classification

• Problem: error rate can be severely biased downward

(Figure: feature selection, model fitting, and error estimation all performed on the same learning set L)

Page 55: Computer Aid Discovery Course:  Molecular Classification of Cancer

Leave-one-out cross validation (Internal)

(Figure: feature selection on the full learning set L; model fitting on the remaining L − 1 samples; error estimated on the left-out sample; repeat for the rest of the samples)

Error = sum of the misclassified samples / total number of samples

Page 56: Computer Aid Discovery Course:  Molecular Classification of Cancer

Leave-one-out cross validation (External)

(Figure: feature selection and model fitting both repeated on each L − 1 training set; error estimated on the left-out sample; repeat for the rest of the samples)

Error = sum of the misclassified samples / total number of samples

Page 57: Computer Aid Discovery Course:  Molecular Classification of Cancer

Radmacher et al (2002) J. Comput. Biol., 9:505

Page 58: Computer Aid Discovery Course:  Molecular Classification of Cancer

K-fold cross-validation

• In K-fold cross-validation, cases in the learning set L are randomly divided into K sets L_k of as nearly equal size as possible

• Classifiers are built on training set L- Lk

• Error rates are computed for the validation sets Lk, and averaged over k

• A small K typically gives a larger bias but a smaller variance

Page 59: Computer Aid Discovery Course:  Molecular Classification of Cancer

K-fold cross validation

(Figure, 5-fold CV: feature selection and model fitting on each training set L − L_k; error estimated on the held-out fold L_k; repeat for the 5 folds)

Error = average of the misclassification rates

Page 60: Computer Aid Discovery Course:  Molecular Classification of Cancer

Monte Carlo cross-validation

• A learning set L is randomly divided into two sets: a training set L1 and a validation set L2

• The classifier is built using L1 and the error rate is computed for L2

• The procedure is repeated a number of times and error rates averaged

• Problem: reduces effective sample size for training purposes

• Large computation time

Page 61: Computer Aid Discovery Course:  Molecular Classification of Cancer

Monte Carlo cross-validation

(Figure: L randomly divided into two sets L1 and L2; feature selection and model fitting on L1; error estimated on L2; repeat many times)

Error = average of the misclassification rates

Page 62: Computer Aid Discovery Course:  Molecular Classification of Cancer

Bootstrap estimation

• Error rate for learning set case xi is obtained from bootstrap samples that do not contain this observation

• Bootstrap estimators are typically less variable than LOOCV estimators when sample size is small

• Large computation time

• Upwardly biased

Page 63: Computer Aid Discovery Course:  Molecular Classification of Cancer

Bootstrap estimation of leave-one-out cross-validation

(Figure: bootstrap samples drawn from L; feature selection and model fitting on each bootstrap sample; error estimated on the left-out cases; repeat many times)

Error = average of the misclassification rates

Page 64: Computer Aid Discovery Course:  Molecular Classification of Cancer

Performance

• Accuracy
  – Probability of correctly predicting both class A and class B samples
  – (TP + TN) / (TP + TN + FP + FN)

• Sensitivity
  – Probability of a class A sample being correctly predicted as class A
  – TP / (TP + FN)

• Specificity
  – Probability of a non-class-A sample being correctly predicted as non-A
  – TN / (FP + TN)

• Positive Predictive Value (PPV)
  – Probability that a sample predicted as class A actually belongs to class A
  – TP / (TP + FP)

• Negative Predictive Value (NPV)
  – Probability that a sample predicted as non-class-A actually does not belong to class A
  – TN / (FN + TN)

Confusion matrix (rows: actual class; columns: predicted class):

            Predicted A   Predicted B
Actual A        TP            FN
Actual B        FP            TN
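The five measures are direct functions of the four confusion-matrix counts; a minimal sketch (function name is mine), checked against the survivors-vs-failures table from the Pomeroy example later in the deck (TP = 37, FN = 2, FP = 11, TN = 10, accuracy 47/60 = 78%):

```python
def performance(tp, fn, fp, tn):
    """Performance measures from the four confusion-matrix cells
    (class A taken as the 'positive' class)."""
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv":         tp / (tp + fp),
        "npv":         tn / (fn + tn),
    }
```

Note that PPV and NPV, unlike sensitivity and specificity, depend on the class balance in the sample, which matters when translating a classifier to a clinical population.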

Page 65: Computer Aid Discovery Course:  Molecular Classification of Cancer

Validation Study

Page 66: Computer Aid Discovery Course:  Molecular Classification of Cancer

Validation Study

• The classifier should be completely specified before the validation study

• Internal validation cannot replace external validation

• The validation study should reflect real-life situations and contain samples from multiple institutions

• Independent validation studies are essential before results are accepted into medical practice

• The size of a validation study should be large enough to give meaningful confidence intervals on predictive accuracy

• The study should address the medical utility of the new classifier relative to practice standards

Page 67: Computer Aid Discovery Course:  Molecular Classification of Cancer

Class problem 3

• Validating classifiers by comparing the genes present in different classifiers built for the same goal is problematic:
  – Gene lists are unstable due to correlation among genes
  – Array platforms and the methods used in gene selection differ

Page 68: Computer Aid Discovery Course:  Molecular Classification of Cancer

Part 5: Examples from the literature

Page 69: Computer Aid Discovery Course:  Molecular Classification of Cancer

CNS tumor study

• Pomeroy et al (2002) Nature 415:436

• Study questions:
  – Relationship of medulloblastoma to other CNS tumors
  – Relationship of a subtype to a known biological pathway
  – Prediction of treatment outcome

Page 70: Computer Aid Discovery Course:  Molecular Classification of Cancer

CNS tumors

• Q1: Distinguishing different embryonal CNS tumors

• Motivation: classification of these tumors based on histopathological appearance is not clear

• Samples:
  – 10 MB
  – 3 CNS/AT/RT
  – 5 renal and extrarenal rhabdoid tumors
  – 8 supratentorial PNETs
  – 10 non-embryonal brain tumors
  – 4 normal cerebella

Page 71: Computer Aid Discovery Course:  Molecular Classification of Cancer

CNS tumors

(Figure: clustering of Classic, Desmoplastic, PNET, AT/RT, and GB samples, using all genes and using 50 S/N-selected genes)

Page 72: Computer Aid Discovery Course:  Molecular Classification of Cancer

Multi-tumor classifier

Confusion Matrix (rows: actual; columns: predicted):

Actual    MD   MGlio  Rhab  Ncer  PNET  Total
MD         8     0     1     0     1     10
MGlio      0    10     0     0     0     10
Rhab       0     0     9     0     1     10
Ncer       0     0     0     4     0      4
PNET       3     0     1     0     4      8
Total     11    10    11     4     6     42

• Used the S/N metric to select 10 genes (all samples in the training set)

• Used a weighted 3-NN classifier (based on distance)

• LOOCV

• 35/42 correct classifications (83%)

Page 73: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classic vs. desmoplastic

• Q2: Classify classic and desmoplastic subtypes in MB

• Motivation:
  – Histology classification is very subjective
  – Desmoplastic histology is associated with Gorlin’s syndrome, which involves a defect in the PTCH gene

• Samples: 34 MB samples out of 60

• Methods: k-NN classifier
  – Not clear how many genes were used
  – Nor how the genes were selected
  – Nor what version of k-NN was used

• Correct classification of 33/34 (97%); misclassified only 1 classic as desmoplastic

Page 74: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classic vs. desmoplastic

Page 75: Computer Aid Discovery Course:  Molecular Classification of Cancer

Response classifier

• Q3: Classify responders vs. non-responders to treatment

• Motivation: highly variable response of patients to therapy, and clinical methods for prognostication are imperfect

• Samples: 60 MB patients who were similarly treated

• Methods:
  – 1–200 genes were used according to the correlation with survival vs. failure, and n = 8 gave the lowest error rate
  – 5-NN was used
  – Correct classification = 47/60 (78%)

Page 76: Computer Aid Discovery Course:  Molecular Classification of Cancer

Survivors vs. Failures

Confusion Matrix (rows: actual; columns: predicted):

Actual      Survivors  Failures  Total
Survivors       37         2       39
Failures        11        10       21
Total           48        12       60

Page 77: Computer Aid Discovery Course:  Molecular Classification of Cancer

Example 2

• Van ’t Veer et al (2002) Nature 415:530

• The study used to develop MammaPrint

• Goal: use gene expression profiles to predict the clinical outcome of breast cancer

• Platform: Agilent oligo arrays

• Samples: 78 sporadic, untreated, lymph-node-negative patients
  – 44 free of disease for >= 5 years (good prognosis group)
  – 34 developed distant metastases within 5 years (poor prognosis group)

Page 78: Computer Aid Discovery Course:  Molecular Classification of Cancer

Breast cancer study

• Methods:
  – Selected a subset of genes based on deregulation in 3 out of 78 tumors
  – Used the correlation coefficient to identify prognostic genes (< −0.3 or > 0.3; 231 genes)
  – Sequentially added 5 genes from the top of the rank-ordered list to optimize the classifier (70 genes was the best)
  – Prediction is based on the correlation coefficient to the “good prognosis” template and the “poor prognosis” template (average gene expression of a group)
  – LOOCV to estimate accuracy
  – Independently validated using 19 samples
    • 7 patients with good prognosis
    • 12 patients with poor prognosis

• Results:
  – Correctly predicted 65 out of 78 patients (83%) in the training set
  – Correctly predicted 17/19 samples in the test set (89%)

Page 79: Computer Aid Discovery Course:  Molecular Classification of Cancer

Unsupervised clustering

Page 80: Computer Aid Discovery Course:  Molecular Classification of Cancer

Prognostic classification

Page 81: Computer Aid Discovery Course:  Molecular Classification of Cancer

Example III

• Man et al (2005) Cancer Research 65:8142

• Osteosarcoma is the most common malignant bone tumor in children

• Highly prone to chemotherapy resistance

• After the initial treatment, post-operative chemotherapy does not improve the outcome of poor responders

• Hypothesis: expression profiles can distinguish good and poor responders at the time of diagnosis

Page 82: Computer Aid Discovery Course:  Molecular Classification of Cancer

Personalized medicine

(Figure: with a uniform treatment, good responders receive the right treatment while poor responders are over-treated)

Page 83: Computer Aid Discovery Course:  Molecular Classification of Cancer

Hypothesis driven design

(Figure: post-treatment specimens, which are enriched in resistant cells, are used for training and LOOCV to build the classifier; the classifier is then applied to initial biopsies)

Page 84: Computer Aid Discovery Course:  Molecular Classification of Cancer

Experimental design

38 tissue samples

• Training set: 24 definitive-surgery specimens
  – 7 good responders
  – 17 poor responders

• Testing set: 14 initial biopsies
  – 6 good responders
  – 8 poor responders

• Methods: t-test to select features; SVM classifier; LOOCV

Page 85: Computer Aid Discovery Course:  Molecular Classification of Cancer

45-gene chemoresistant signature

Poor Good

Man et al (2005) Cancer Res 65:8142.

Page 86: Computer Aid Discovery Course:  Molecular Classification of Cancer

Prediction of Initial Biopsies using SVM classifier

           Predicted
Actual   Good  Poor  Total
Good       5     1     6
Poor       0     8     8

Overall accuracy = 93%
Sensitivity = 100%
Specificity = 83%

Man et al (2005) Cancer Res 65:8142.

Page 87: Computer Aid Discovery Course:  Molecular Classification of Cancer

Example IV: Proteomic classifiers

• RNA expression may not correlate with protein expression

• Post-transcriptional and post-translational regulation is not detected by genomic profiling

• A genomic approach in tumors is not suitable for disease monitoring, e.g. detecting relapse in patients in remission

• Early detection of cancers before they are clinically detectable

• Plasma is a rich source of potential biomarkers, containing both tumor and host proteins

Page 88: Computer Aid Discovery Course:  Molecular Classification of Cancer

SELDI TOF-MS detection

(Figure: a sample on a ProteinChip array is ionized by a laser in the ProteinChip reader; the TOF-MS detector records a spectrum of intensity vs. molecular mass in Da)

*The ProteinChip reader was routinely calibrated using insulin and IgG before data collection

Page 89: Computer Aid Discovery Course:  Molecular Classification of Cancer

Discriminatory features

Page 90: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classifier Building

• Samples:
  – 29 osteosarcoma plasma
  – 20 osteochondroma plasma

• Pre-processed and normalized the MS data

• Used the t-statistic to select features

• Used LOOCV to validate the classification accuracy

• Selected the classifier based on LOOCV

Page 91: Computer Aid Discovery Course:  Molecular Classification of Cancer

OS plasma proteomic signature

(Figure: heat map of discriminatory m/z features in OS vs. OC plasma)

Li et al (2006) Proteomics 6:3426

Page 92: Computer Aid Discovery Course:  Molecular Classification of Cancer

Classification Performance

• The 3-nearest neighbors classifier:
  – Overall accuracy = 90% (5/48 misclassified)
  – Sensitivity = 97% (1/28 OS misclassified)
  – Specificity = 80% (4/20 OC misclassified)

• Permutation analysis showed that the classification accuracy was significant (p < 0.00005)

(Figure: classification accuracy (%) of the CCP, DLA, 1-NN, 3-NN, NC, and SVM algorithms, ranging from roughly 70% to 95%)

Li et al (2006) Proteomics 6:3426

Page 93: Computer Aid Discovery Course:  Molecular Classification of Cancer

Summary• What is molecular classification?

• The differences among class comparison, class discovery, and class prediction/ classification

• Methods used in class discovery

• Steps and methods used to construct a classifier

• How to properly validate a classifier and measure its accuracy

• Some examples in the literature

Page 94: Computer Aid Discovery Course:  Molecular Classification of Cancer

Questions?