Multivariate Algorithms and Classifiers in Cancer Micro-RNA profiles help predict distant disease-free survival in breast cancer Bits and pieces of bioinformatics workflow Mehis Pold, MD October 18, 2013

Development of multivariate classifiers in cancer


DESCRIPTION

Short presentation about development of multivariate classifiers to predict chemotherapy treatment responses in breast cancer. The steps of workflow are briefly described and the results indicate that expression data on micro-RNA in breast cancer alone are not sufficient to predict treatment responses.


Page 1: Development of multivariate classifiers in cancer

Multivariate Algorithms and Classifiers in Cancer

Micro-RNA profiles help predict distant disease-free survival in breast cancer

Bits and pieces of bioinformatics workflow

Mehis Pold, MD

October 18, 2013

Page 2: Development of multivariate classifiers in cancer

Feature Selection & algorithm development

Internal Algorithm Validation

Clinical Validation

Iterative process

Training and validation datasets in each step don’t overlap

Rule of thumb: validation always produces weaker statistics than training

Training samples

Validation samples

Page 3: Development of multivariate classifiers in cancer

• Analysis of early primary breast cancer to identify prognostic markers and associated pathways: mRNA and miRNA profiling

• GEO (Gene Expression Omnibus) accession ID: GSE22220

• Technology platform: ILLUMINA

• 733 micro-RNA

• 210 breast cancer samples

• 79 complete pathological response (pCR) to chemotherapy; 131 recurrent disease samples (RD)

• Data collected up to 10 years after start of chemotherapy

Buffa et al. microRNA-Associated Progression Pathways and Potential Therapeutic Targets Identified by Integrated mRNA and microRNA Expression Profiling in Breast Cancer. Cancer Res. 2011, 71:5635

Page 4: Development of multivariate classifiers in cancer

TRAINING SET: 36 RD, 74 pCR

VALIDATION SET: 43 RD, 57 pCR
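The slides do not say how samples were assigned to the two sets. As a minimal sketch, assuming a simple random split, the following shows one way to cut 210 samples into a training set of 110 and a validation set of 100 that cannot overlap. The sample identifiers and the function name `split_without_overlap` are hypothetical.

```python
import random


def split_without_overlap(sample_ids, n_train, seed=0):
    """Shuffle the sample IDs and cut them into two disjoint lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers standing in for the 210 samples.
samples = [f"S{i:03d}" for i in range(210)]
train, valid = split_without_overlap(samples, n_train=110)

print(len(train), len(valid))
print(set(train).isdisjoint(valid))
```

A stratified split, which keeps the RD/pCR proportions similar in both sets, is a common alternative; it is not shown here.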

BIOINFORMATICS WORKFLOW

Multiple statistical approaches to maximize outcome

Comparison of two algorithms and classification by kNN

Kaplan-Meier & ROC

Sensitivity (Se), Specificity (Sp), Positive Predictive Value (PPV), Negative Predictive Value (NPV)

Custom scripting (R, VBA)

Standard software: MS Excel

Medical statistics: MedCalc
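The four performance measures named in the workflow all follow from the counts of a 2x2 confusion matrix. A minimal sketch, with made-up counts that are not taken from the slides:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute Se, Sp, PPV and NPV from confusion-matrix counts."""
    return {
        "Se": tp / (tp + fn),   # sensitivity: true positives / all actual positives
        "Sp": tn / (tn + fp),   # specificity: true negatives / all actual negatives
        "PPV": tp / (tp + fp),  # positive predictive value: true positives / all predicted positives
        "NPV": tn / (tn + fn),  # negative predictive value: true negatives / all predicted negatives
    }


# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Se and Sp describe the classifier itself, whereas PPV and NPV also depend on how common each class is in the sample set.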

Page 5: Development of multivariate classifiers in cancer

FEATURE SELECTION

Reduction of dimensionality from n = 733 to n = 1

Approach 1: iterative clustering

Approach 2: T-test combined with enriching for weak inter-profile correlation

Significance of feature selection evaluated by Kaplan-Meier survival analysis and ROC (receiver operating characteristic) curve analysis
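The slides give no implementation details for Approach 2. One possible reading, sketched below under stated assumptions, is to rank features by the magnitude of a two-sample (Welch) t statistic and then greedily keep only features whose expression profiles correlate weakly with those already kept. The function names, the correlation threshold `max_abs_r`, and the toy data are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def pearson(x, y):
    """Pearson correlation; assumes neither profile is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def select_features(profiles, labels, n_keep, max_abs_r=0.5):
    """Rank features by |t|, then keep those weakly correlated with earlier picks."""
    def score(name):
        values = profiles[name]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        return abs(welch_t(group1, group0))

    kept = []
    for name in sorted(profiles, key=score, reverse=True):
        if all(abs(pearson(profiles[name], profiles[k])) <= max_abs_r for k in kept):
            kept.append(name)
        if len(kept) == n_keep:
            break
    return kept


# Toy data: six samples, three hypothetical micro-RNA profiles.
labels = [1, 1, 1, 0, 0, 0]
profiles = {
    "miR-A": [5.0, 5.2, 4.9, 1.0, 1.1, 0.9],
    "miR-B": [4.0, 4.4, 3.9, 1.0, 1.3, 0.8],  # nearly collinear with miR-A
    "miR-C": [3.0, 1.0, 2.0, 2.0, 0.5, 2.0],  # weakly correlated with miR-A
}
selected = select_features(profiles, labels, n_keep=2)
print(selected)
```

In this toy example miR-B separates the groups well but is skipped because it duplicates miR-A, which is the point of enriching for weak inter-profile correlation.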

Figure labels: pCR, RD; Up, Down

Page 6: Development of multivariate classifiers in cancer

KAPLAN-MEIER SURVIVAL CURVE

The Kaplan–Meier estimator, also known as the product-limit estimator, estimates the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. In economics, it can be used to measure the length of time people remain unemployed after a job loss. In engineering, it can be used to measure the time until failure of machine parts. In ecology, it can be used to estimate how long fleshy fruits remain on plants before they are removed by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier.
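The product-limit idea can be sketched in a few lines: at each time with at least one observed event, the running survival estimate is multiplied by (1 - events / number at risk), and censored subjects simply leave the risk set. The function name and the toy follow-up data below are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  -- follow-up time per subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events observed at time t
        removed = 0   # subjects leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Toy data: six subjects, two of them censored (event = 0).
times = [2, 3, 3, 5, 8, 10]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Comparing two such curves, for example predicted pCR against predicted RD, additionally requires a test such as the log-rank test, which is not shown here.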

Receiver operating characteristic (ROC)

In signal detection theory, a receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the total actual positives (TPR = true positive rate) vs. the fraction of false positives out of the total actual negatives (FPR = false positive rate), at various threshold settings. TPR is also known as sensitivity (also called recall in some fields), and FPR is one minus the specificity or true negative rate.
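As a minimal sketch of that construction, the code below sweeps the threshold over every observed score, records one (FPR, TPR) point per threshold, and integrates the resulting curve with the trapezoidal rule to obtain the area under the curve (AUC). The scores, labels, and function names are hypothetical, and a higher score is assumed to indicate the positive class.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from the strictest threshold to the loosest."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy data: classifier scores and true classes (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(round(auc(points), 3))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two classes.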

Page 7: Development of multivariate classifiers in cancer

ITERATIVE CLUSTERING TO BINARY OUTCOME

Page 8: Development of multivariate classifiers in cancer

T-TEST ENRICHED TOWARD WEAK CORRELATIONS

Page 9: Development of multivariate classifiers in cancer

Nearest Neighbor Classification - kNN

• Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).

• The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows:
  – find the k closest observations in the training data,
  – predict the class by majority vote, i.e. choose the class that is most common among those k neighbors.
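The rule can be sketched directly from those two steps. The example below uses Euclidean distance in 2D; the function name, the coordinates, and the use of pCR/RD as class labels are illustrative assumptions, not data from the study.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2D training data (hypothetical coordinates and labels).
train_points = [(3.2, 3.0), (3.0, 3.3), (9.0, 9.0),
                (3.5, 3.5), (2.2, 3.0), (3.0, 2.1)]
train_labels = ["pCR", "pCR", "pCR", "RD", "RD", "RD"]
query = (3.0, 3.0)

print(knn_predict(train_points, train_labels, query, k=3))
print(knn_predict(train_points, train_labels, query, k=5))
```

With these points the two closest neighbors are pCR, so k = 3 and k = 5 give different answers; the choice of k can change the predicted class.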

Figure: classification of data in 2D space (panels K = 3 and K = 5)

Page 10: Development of multivariate classifiers in cancer

SUMMARY

                  p-value   AUC     Sensitivity   Specificity   PPV    NPV
TRAINING
  Kaplan-Meier    0.0001
  ROC             <.0001    0.773   72            68
VALIDATION
  Kaplan-Meier    0.0002
  ROC             0.0024    0.67    65            65
CLASSIFICATION
  kNN                               0.51          0.61          0.50   0.63

                  p-value   AUC     Sensitivity   Specificity   PPV    NPV
TRAINING
  Kaplan-Meier    <.0001
  ROC             <.0001    0.898   83            82
VALIDATION
  Kaplan-Meier    0.012
  ROC             0.0334    0.624   58            56
CLASSIFICATION
  kNN                               0.35          0.86          0.65   0.64

T-TEST ENRICHED FOR WEAK CORRELATIONS

ITERATIVE CLUSTERING TO BINARY OUTCOME

Page 11: Development of multivariate classifiers in cancer

CONCLUDING REMARKS

• There is no single ‘right’ approach to algorithm development.

• Validation always produces weaker statistics than training.

• The significance of training statistics correlates poorly with the significance of validation statistics.

• Algorithms are only as stable and significant as upstream R&D data. The better standardized and controlled the wet-bench, the more stable and significant the algorithms and eventual clinical validation.