University of Washington
Institute of Technology
Tacoma, WA, USA
Ecole des Hautes Etudes en Santé Publique
Département Infobiostat
Rennes, France
Isabelle Bichindaritz
Purpose of this Talk
Once upon a time …
There was biology (~1800), and
There were computers (~1920)
Of their common interests was born bioinformatics (~1979) …
Question: How can CBR contribute to bioinformatics research?
An example: microarray data analysis
ICCBR '10
Bioinformatics Challenges
Frequent tasks in bioinformatics
Similarity search in genetic sequences
Microarray data analysis
Macromolecule shape prediction
Evolutionary tree construction
Gene regulatory network mining
Microarray data analysis
Microarrays are made from a collection of purified DNAs. A drop of each type of DNA in solution is placed onto a specially prepared glass microscope slide by an arraying machine.
Please note that …
… the human genome contains about 30,000 genes.
… a microarray can contain thousands or tens of thousands of relatively short, known nucleotide sequences.
The end product of a comparative hybridization experiment is a scanned array image.
Microarray applications
Determine the relative DNA levels associated with a huge number of known and predicted genes in a single experiment.
The most attractive application of microarrays is the study of differential gene expression in disease.
The up- or down-regulation of gene activity can be either the cause of the pathophysiology or the result of the disease.
Every single gene is measured accurately.
Sensitivity is very high: a microarray can detect the presence of one transcript in one-tenth of a cell.
Data mining challenges
Volume of data (gigabytes, number of features)
Characteristics of data (specific constraints)
Domain specific knowledge (expert interpretation)
BMA-CBR System
The BMA-CBR system performs feature selection through BMA before using CBR for microarray data classification and prediction (survival analysis).
Introduction and motivation of variable selection
What is Bayesian Model Averaging (BMA)?
One approach: the iterative BMA algorithm
Application 1: Chronic Myeloid Leukemia (CML)
Application 2: Survival analysis
Presentation of CBR
Bayesian Model Averaging
Feature selection
Used to select a subset of relevant features for building robust learning models in machine learning.
Often used in supervised learning.
Select relevant features from the training set (for which class labels are known).
Apply the selected features to the test set.
Feature selection
A minimal set of relevant genes for future prediction or assay development
Typical variable selection methods – one variable at a time
Examples:
T-test
Between-group to within-group sum of squares (BSS/WSS) [Dudoit et al. 2001]
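As a rough illustration of one-variable-at-a-time ranking, the sketch below computes a BSS/WSS ratio per gene and sorts genes by it. It assumes the usual definition (between-class sum of squares divided by within-class sum of squares); the function names and the toy expression values are made up for the example and are not taken from the talk.

```python
from collections import defaultdict


def bss_wss_ratio(values, labels):
    """BSS/WSS ratio for one gene, given one expression value and one class label per sample."""
    overall_mean = sum(values) / len(values)
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        by_class[label].append(value)
    bss = 0.0  # between-class sum of squares
    wss = 0.0  # within-class sum of squares
    for group in by_class.values():
        class_mean = sum(group) / len(group)
        bss += len(group) * (class_mean - overall_mean) ** 2
        wss += sum((value - class_mean) ** 2 for value in group)
    return bss / wss if wss > 0 else float("inf")


def rank_genes(expression, labels):
    """Return gene names sorted by decreasing BSS/WSS ratio."""
    return sorted(
        expression,
        key=lambda gene: bss_wss_ratio(expression[gene], labels),
        reverse=True,
    )


# Toy training data (hypothetical values): gene_1 separates the classes, gene_2 does not.
labels = ["A", "A", "B", "B"]
expression = {
    "gene_1": [1.0, 1.2, 3.0, 3.2],
    "gene_2": [2.0, 2.1, 1.9, 2.0],
}
ranking = rank_genes(expression, labels)
```

Each gene is scored independently of the others, which is exactly the limitation that multivariate selection tries to address.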
Multivariate gene selection
Our goal: consider multiple genes simultaneously
to exploit the interdependence between genes
to reduce the number of relevant genes
Bayesian Model Averaging (BMA) [Raftery 1995], [Hoeting et al. 1999]
A multivariate variable selection technique.
Typical model selection approaches select a model and then proceed as if the selected model had generated the data --> overconfident inferences
Advantages of BMA:
Fewer selected genes
Can be generalized to any number of classes
Posterior probabilities for selected genes and selected models
BMA
Average over predictions from several models
What do we need?
Prediction with a given model k --> logistic regression
How to choose a set of “good” models? --> variable selection
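A minimal sketch of the averaging step, assuming each model is a logistic regression with known coefficients and a posterior model probability. The model dictionaries, gene names, and coefficient values are hypothetical; only the idea of weighting each model's prediction by its posterior probability comes from the slide.

```python
import math


def logistic_prediction(intercept, coefs, sample):
    """P(y = 1 | sample) under one logistic regression model.

    coefs maps gene name -> coefficient; sample maps gene name -> expression value.
    """
    z = intercept + sum(coef * sample[gene] for gene, coef in coefs.items())
    return 1.0 / (1.0 + math.exp(-z))


def bma_prediction(models, sample):
    """Average the per-model predictions, weighted by posterior model probability."""
    total = sum(model["posterior"] for model in models)
    return sum(
        (model["posterior"] / total)
        * logistic_prediction(model["intercept"], model["coefs"], sample)
        for model in models
    )


# Two hypothetical models over different gene subsets.
models = [
    {"posterior": 0.6, "intercept": 0.0, "coefs": {"gene_1": 1.0}},
    {"posterior": 0.4, "intercept": 0.0, "coefs": {"gene_1": 0.5, "gene_2": -0.5}},
]
sample = {"gene_1": 0.0, "gene_2": 0.0}
p = bma_prediction(models, sample)
```

Because the weights are renormalized inside `bma_prediction`, the same function still works after the model set has been pruned.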
What models to average over?
All possible models --> way too many!! E.g. 2^30 ~ 1 billion, 2^50 ~ 10^15, etc.
The BMA solution:
1. “Leaps and bounds” [Furnival and Wilson 1974]: when #variables (genes) <= 30, we can efficiently produce a reduced set of good models (branch and bound).
2. Cut down the # models: discard models that are much less likely than the best model.
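The second pruning step is often called Occam's window in the BMA literature. The sketch below illustrates it under stated assumptions: models are (name, posterior probability) pairs, and the cutoff ratio of 20 is an assumed value, not one given in the talk.

```python
def occams_window(models, cutoff=20.0):
    """Keep models whose posterior probability is within a factor `cutoff` of the best model.

    models is a list of (name, posterior_probability) pairs.
    The surviving probabilities are renormalized so they sum to 1.
    """
    best = max(prob for _, prob in models)
    kept = [(name, prob) for name, prob in models if prob * cutoff >= best]
    total = sum(prob for _, prob in kept)
    return [(name, prob / total) for name, prob in kept]


# Size of the full model space for 30 candidate genes (every subset is a model).
n_models = 2 ** 30

# Hypothetical posterior probabilities for four candidate models.
candidates = [("m1", 0.50), ("m2", 0.30), ("m3", 0.19), ("m4", 0.01)]
kept = occams_window(candidates)
```

Here `m4` is 50 times less likely than `m1`, so it falls outside the window and is discarded.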
Iterative BMA algorithm [Yeung, Bumgarner, Raftery 2005]
Pre-processing step: Rank genes using BSS/WSS ratio.
Initial step:
Repeat until all genes are processed:
Output: selected genes and models with their posterior probabilities
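The details of the initial and repeat steps did not survive on this slide, so the following is only a sketch of one plausible reading: run BMA on a window of top-ranked genes, drop genes with low posterior probability, refill the window from the ranked list, and repeat. The window size of 30 (matching the leaps-and-bounds limit), the 1% threshold, the rule for forcing progress, and the `run_bma` callback are all assumptions made for illustration.

```python
def iterative_bma(ranked_genes, run_bma, window_size=30, threshold=0.01):
    """Sketch of an iterative BMA loop.

    ranked_genes: gene names sorted by decreasing univariate score (e.g. BSS/WSS).
    run_bma: hypothetical callback; takes a list of genes and returns a dict
             mapping each gene to its posterior inclusion probability.
    """
    window = list(ranked_genes[:window_size])
    next_index = len(window)
    probs = run_bma(window)
    while next_index < len(ranked_genes):
        survivors = [gene for gene in window if probs[gene] >= threshold]
        if len(survivors) == len(window):
            # Assumption: if nothing falls below the threshold, drop the
            # weakest gene so that the loop keeps advancing.
            survivors.remove(min(survivors, key=lambda gene: probs[gene]))
        n_new = window_size - len(survivors)
        window = survivors + list(ranked_genes[next_index:next_index + n_new])
        next_index += n_new
        probs = run_bma(window)
    return {gene: probs[gene] for gene in window if probs[gene] >= threshold}


# Stand-in for a real BMA fit: pretends three genes are informative.
informative = {"g0", "g3", "g7"}


def fake_bma(genes):
    return {gene: (0.9 if gene in informative else 0.0) for gene in genes}


ranked = ["g%d" % i for i in range(10)]
result = iterative_bma(ranked, fake_bma, window_size=4)
```

With the stand-in callback, the loop walks through all ten ranked genes and retains only the three marked as informative.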
Application 1: Classification of progression of chronic myeloid leukemia (CML)
Motivation: New Candidates for Prognostic studies in CML
Progression of CML
CML usually presents in chronic phase (CP), but in the absence of effective therapy, CP CML invariably transforms to accelerated phase (AP) disease, and then to an acute leukemia, blast crisis (BC).
BC is highly resistant to treatment, and all treatments are more successful when administered during CP.
Imatinib is most effective in early CP patients with excellent survival (86% at 7 years).
Currently there are limited clinical markers and no molecular tests that can predict the “clock” of CML progression for individual patients at the time of CP diagnosis, making it difficult to adapt therapy to the risk level of each patient.
Why predictors for CML progression?
Identification of CML progression biomarkers
Genes associated with CML progression
BMA selected genes using microarray data
Selected 6 genes over 21 models
Repeat CV 100 times
Average Brier Score = 0.21
Average prediction accuracy = 99.17%
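For readers unfamiliar with the two metrics, here is a small sketch of how a Brier score and a prediction accuracy can be computed from predicted probabilities. The slide does not say whether its Brier score is summed or averaged over samples; this sketch uses the mean form, and the toy probabilities and the 0.5 decision cutoff are assumptions.

```python
def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(predicted_probs, outcomes)) / len(outcomes)


def accuracy(predicted_probs, outcomes, cutoff=0.5):
    """Fraction of samples whose thresholded prediction matches the true 0/1 outcome."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(predicted_probs, outcomes))
    return hits / len(outcomes)


# Hypothetical predictions for four validation samples.
probs = [0.9, 0.2, 0.7, 0.4]
truth = [1, 0, 1, 1]
score = brier_score(probs, truth)
acc = accuracy(probs, truth)
```

Unlike accuracy, the Brier score also penalizes predictions that are correct but not confident.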
Summary: CML data
BMA applied to microarray data consisting of patient samples in different phases of CML identified 6 signature genes (ART4, DDX47, IGSF2, LTB4R, SCARB1, SLC25A3).
Quantitative PCR validated the gene signature: the 6-gene signature is highly predictive of CP-early vs CP-late.
What is next?
To identify biologically meaningful biomarkers for CML progression and response to therapy.
Biomarkers that are functionally related (connected in an underlying network) to known reference genes.
Application 2: Survival analysis
Results: Breast cancer data - Annest, Bumgarner, Raftery, Yeung. BMC Bioinformatics 2009
CBR
Classification task
Similarity measure
Weights provided by BMA for selected features
Classification task
Choose the class for which the average similarity score is highest
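The similarity formula itself is not preserved in this transcript, so the sketch below substitutes a simple assumed one: a weighted average of per-gene closeness, with feature values scaled to [0, 1] and weights standing in for the BMA posterior probabilities. The class decision rule (highest average similarity) follows the slide; all names and values are hypothetical.

```python
from collections import defaultdict


def similarity(case_a, case_b, weights):
    """Weighted similarity in [0, 1]; assumes feature values are scaled to [0, 1].

    weights maps each selected gene to its weight (assumed here to be a
    BMA posterior probability).
    """
    total = sum(weights.values())
    return sum(
        weight * (1.0 - abs(case_a[gene] - case_b[gene]))
        for gene, weight in weights.items()
    ) / total


def classify(query, case_base, weights):
    """Return the class whose stored cases have the highest average similarity to the query."""
    scores = defaultdict(list)
    for features, label in case_base:
        scores[label].append(similarity(query, features, weights))
    return max(scores, key=lambda label: sum(scores[label]) / len(scores[label]))


# Hypothetical weights and case base with two classes.
weights = {"gene_1": 0.9, "gene_2": 0.4}
case_base = [
    ({"gene_1": 0.1, "gene_2": 0.5}, "A"),
    ({"gene_1": 0.2, "gene_2": 0.4}, "A"),
    ({"gene_1": 0.9, "gene_2": 0.6}, "B"),
    ({"gene_1": 0.8, "gene_2": 0.5}, "B"),
]
predicted = classify({"gene_1": 0.85, "gene_2": 0.5}, case_base, weights)
```

Averaging over all cases of a class, rather than taking the single nearest case, makes the decision less sensitive to one atypical stored case.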
Survival analysis task
Similarity measure
Weights provided by BMA for selected features
Survival analysis task
Choose the class for which the average similarity score is highest
Evaluation / Classification
Dataset      Total Number of Samples   # Training Samples   # Validation Samples   Number of Genes
Leukemia 2   72                        38                   34                     3051
Leukemia 3   72                        38                   34                     3051

Dataset      # classes   BMA-CBR                       iterativeBMA                  Other published results
Leukemia 2   2           #genes = 20, #errors = 1/34   #genes = 20, #errors = 2/34   #genes = 5, #errors = 1/34
Leukemia 3   3           #genes = 15, #errors = 1/34   #genes = 15, #errors = 1/34   #genes ~ 40, #errors = 1/34
Evaluation / Prediction
Dataset         Total Number of Samples   # Training Samples   # Validation Samples   Number of Genes
DLBCL           240                       160                  80                     7,399
Breast Cancer   295                       61                   234                    4,919

Dataset         BMA-CBR                             iterativeBMA                        Best Other Published Results
DLBCL           #genes = 25, p-value = 0.00121      #genes = 25, p-value = 0.00139      #genes = 17, p-value = 0.00124
Breast Cancer   #genes = 15, p-value = 2.14e-10     #genes = 15, p-value = 3.38e-10     #genes = 5, p-value = 3.12e-05
Conclusion
The combination of BMA and CBR provides excellent classification and prediction results.
It provides promising results for the application of CBR to bioinformatics tasks and data.
Future developments
Refine risk classes into more than two risk groups.
Refine CBR algorithm.
Test on additional datasets.
Provide automatic interpretation of the classification / prediction both for gene selection and for case-based reasoning.