A gene expression analysis system for medical diagnosis
D. Maroulis, D. Iakovidis, S. Karkanis, I. Flaounas
University of Athens, Dept. of Informatics and Telecommunications
Objectives
A system to support medical diagnosis using molecular-level information
Efficient classification of pathological conditions into multiple classes
A user-friendly interface for physicians and biologists
DNA Microarrays
[Figure labels: microscope glasses, thousands of spots, spot, cDNA part]
DNA Microarrays
Gene expression vector (feature vector)
Gene expression analysis tools
Image processing & analysis for microarray spot detection
Visualization & clustering for discovery of unknown classes of pathological conditions
Gene ranking for identification of differentially expressed marker genes
Supervised classification of gene expression vectors into known classes
Gene expression analysis tools
– GeneClust (Do et al, 2000)
– dChip (Li & Wong, 2001)
– Clusfavor (Peterson, 2002)
– Genesis (Sturn et al, 2002)
– Snomad (Colantuoni et al, 2002)
– Base (Saal et al, 2002)
– TM4 Suite (Saeed et al, 2003)
– RankGene (Yang et al, 2003)
– Excavator (Xu et al, 2003)
– KnowledgeEditor (Toyoda & Konagaya, 2003)
– ArrayNorm (Pieler et al, 2004)
Today’s challenge
None of the existing tools takes into account the usability profile of a physician or a biologist
Such tools could hardly be used in everyday medical practice
Supervised approaches
Most known supervised approaches have been applied to classification of gene expression vectors
– Linear discriminant analysis
– k-nearest neighbors
– Parzen windows
– Decision trees
– Neural networks, etc.
Support Vector Machines
(Brown et al, 2000; Furey et al, 2000; Ryu & Cho, 2000; Dudoit et al, 2002; Lu & Han, 2003; Aliferis et al, 2003)
Support Vector Machines
Robust binary classifiers
Not easily affected by the dimensionality of the feature vectors
SVM methods for classification into multiple classes
– One vs one
– One vs all
– Directed Acyclic Graph (DAG)
– Weston & Watkins
– Crammer & Singer
(Weston & Watkins, 1999; Platt, 2000; Yeang et al, 2001; Crammer & Singer, 2001; Hsu & Lin, 2002)
About multiclass SVM classifiers
They all lead to comparable results
They utilize a common, constant set of genes as input in each SVM node
They assume that the various pathological conditions correspond to separable clusters in the same gene space
(Hsu et al, 2002; Lee et al, 2003; Statnikov et al, 2004)
The proposed approach
We consider the fact that
– Only a small subset of genes is differentially expressed for each type or subtype of a pathological condition
We propose
– The combination of SVMs in a cascading architecture that embodies gene selection in its structure
Cascading architecture
Classifies input vector x into ω_1, ω_2, …, ω_N
[Diagram blocks: Pre-processing Unit, Diagnostic Unit]
Cascading architecture
Poor quality cDNA targets generate missing values
(Troyanskaya et al, 2001)
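The slides do not say how the pre-processing unit fills missing values. As an illustration only, the sketch below assumes a simple k-nearest-neighbour imputation over genes; the function name `knn_impute`, the parameter `k`, and the toy matrix are hypothetical and not taken from the system.

```python
import math

def knn_impute(matrix, k=2):
    """Fill None entries of a genes x samples matrix (assumed layout).

    Each missing value is replaced by the mean of that sample's values
    in the k genes most similar to the incomplete gene.
    """
    filled = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        for s, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_idx, other in enumerate(matrix):
                # A neighbour must be another gene with a value for sample s.
                if other_idx == g or other[s] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                # Root mean squared difference over the shared samples.
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[s]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                filled[g][s] = sum(nearest) / len(nearest)
    return filled

# Toy data: gene 0 lacks a value for the third sample.
expr = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.2],
    [9.0, 8.0, 7.0],
]
imputed = knn_impute(expr, k=2)
```

Here the two genes closest to gene 0 supply the estimate, while the dissimilar fourth gene is ignored.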
Cascading architecture
Normalization facilitates comparability of samples
(Zhang & Shmulevich, 2002)
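The normalization method used by the system is not named on the slide. A minimal sketch, assuming per-sample standardization to zero mean and unit variance, shows how two samples measured on different scales become comparable; `standardize_sample` is a hypothetical helper.

```python
import statistics

def standardize_sample(values):
    """Rescale one sample's expression values to zero mean, unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant sample carries no contrast; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

# Two toy samples with the same profile but a tenfold intensity difference.
sample_a = [2.0, 4.0, 6.0]
sample_b = [20.0, 40.0, 60.0]
norm_a = standardize_sample(sample_a)
norm_b = standardize_sample(sample_b)
```

After standardization both samples yield the same vector, so the scale difference between arrays no longer dominates the comparison.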
Cascading architecture
A subset of genes is selected by ranking for each block
Three ranking criteria are available
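The three ranking criteria are not named on the slide. The sketch below assumes a signal-to-noise score, |mean difference| / (sum of standard deviations), purely as an example of ranking genes for one block; it may or may not be among the criteria the system offers, and `rank_genes` is a hypothetical name.

```python
import statistics

def signal_to_noise(pos, neg):
    """Assumed criterion: |mean(pos) - mean(neg)| / (sd(pos) + sd(neg))."""
    diff = abs(statistics.fmean(pos) - statistics.fmean(neg))
    spread = statistics.pstdev(pos) + statistics.pstdev(neg)
    if spread == 0:
        return float("inf") if diff > 0 else 0.0
    return diff / spread

def rank_genes(samples, labels, target, top_k):
    """Return indices of the top_k genes separating `target` from the rest."""
    scores = []
    for g in range(len(samples[0])):
        pos = [x[g] for x, y in zip(samples, labels) if y == target]
        neg = [x[g] for x, y in zip(samples, labels) if y != target]
        scores.append((signal_to_noise(pos, neg), g))
    # Highest score first; ties broken by gene index for determinism.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scores[:top_k]]

# Toy data: gene 0 differs strongly between the two groups.
samples = [
    [5.0, 1.0, 3.0],
    [5.2, 3.0, 3.1],
    [1.0, 2.0, 3.0],
    [1.2, 2.2, 2.9],
]
labels = ["tumor", "tumor", "normal", "normal"]
top = rank_genes(samples, labels, target="tumor", top_k=1)
```

Running the ranking separately for each block lets every block work with its own gene subset, which is the point of embedding gene selection in the architecture.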
Cascading architecture
$X_j = \bigcup_{p=j}^{N} \{x_{h,p}\}$
The classification module C_j is autonomously trained using a subset X_j of the available training samples
Cascading architecture
A standard binary SVM classifier implements each classification module
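The control flow of such a cascade can be sketched as follows. To keep the example free of dependencies, each module uses a nearest-centroid rule as a stand-in for the binary SVM, so this is not the system's classifier. The sketch also assumes that module C_j is trained on the samples of classes ω_j to ω_N and separates ω_j from the rest. The gene subsets are fixed by hand, and all names are hypothetical.

```python
def centroid(vectors):
    return [sum(col) / len(col) for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

class CentroidModule:
    """Stand-in for one binary module C_j (a real system would use an SVM)."""

    def __init__(self, genes):
        self.genes = genes  # indices of the genes this module looks at

    def project(self, x):
        return [x[g] for g in self.genes]

    def fit(self, samples, is_target):
        pos = [self.project(x) for x, t in zip(samples, is_target) if t]
        neg = [self.project(x) for x, t in zip(samples, is_target) if not t]
        self.pos_c, self.neg_c = centroid(pos), centroid(neg)
        return self

    def accepts(self, x):
        z = self.project(x)
        return sq_dist(z, self.pos_c) <= sq_dist(z, self.neg_c)

def train_cascade(samples, labels, classes, genes_per_module):
    """Train N-1 modules; module j sees only classes[j:] (assumed scheme)."""
    modules = []
    for j in range(len(classes) - 1):
        remaining = set(classes[j:])
        subset = [(x, y) for x, y in zip(samples, labels) if y in remaining]
        xs = [x for x, _ in subset]
        targets = [y == classes[j] for _, y in subset]
        modules.append(CentroidModule(genes_per_module[j]).fit(xs, targets))
    return modules

def classify(x, modules, classes):
    """Pass x down the cascade until a module accepts it."""
    for j, module in enumerate(modules):
        if module.accepts(x):
            return classes[j]
    return classes[-1]  # rejected by every module: last class

classes = ["normal", "primary", "metastasis"]
samples = [
    [0.0, 5.0, 1.0], [0.2, 0.0, 1.0],   # normal
    [5.0, 0.0, 1.0], [5.2, 0.2, 1.0],   # primary
    [5.1, 5.0, 1.0], [4.9, 5.2, 1.0],   # metastasis
]
labels = ["normal", "normal", "primary", "primary", "metastasis", "metastasis"]
modules = train_cascade(samples, labels, classes, genes_per_module=[[0], [1]])
prediction = classify([5.0, 0.3, 1.0], modules, classes)
```

In this toy data the first module only needs gene 0 and the second only gene 1, which illustrates why a single constant gene set per node is not required.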
Model selection
The best architecture is determined by leave-one-out cross-validation
Selection bias is minimized
– Gene selection and parameter tuning take place on the training samples during each iteration of the leave-one-out procedure
(Ambroise & McLachlan, 2002; Varma & Simon, 2006)
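The key point, that gene selection is repeated inside every leave-one-out iteration and never sees the held-out sample, can be sketched as below. The ranking score (difference of class means) and the nearest-centroid predictor are simplified stand-ins chosen for brevity, parameter tuning is omitted, and all names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def select_genes(samples, labels, top_k):
    """Rank genes by absolute difference of class means (labels 0/1)."""
    scores = []
    for g in range(len(samples[0])):
        a = [x[g] for x, y in zip(samples, labels) if y == 1]
        b = [x[g] for x, y in zip(samples, labels) if y == 0]
        scores.append((abs(mean(a) - mean(b)), g))
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scores[:top_k]]

def predict(train_x, train_y, genes, x):
    """Nearest-centroid prediction restricted to the selected genes."""
    best_label, best_dist = None, None
    for label in sorted(set(train_y)):
        members = [s for s, y in zip(train_x, train_y) if y == label]
        dist = sum((x[g] - mean([m[g] for m in members])) ** 2 for g in genes)
        if best_dist is None or dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_error(samples, labels, top_k):
    """Leave-one-out error with gene selection redone in every fold."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Selection uses the training fold only; sample i stays unseen.
        genes = select_genes(train_x, train_y, top_k)
        if predict(train_x, train_y, genes, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)

samples = [[5.0, 1.0], [5.2, 0.0], [4.8, 1.0],
           [1.0, 0.9], [1.2, 0.1], [0.8, 1.1]]
labels = [1, 1, 1, 0, 0, 0]
error_rate = loo_error(samples, labels, top_k=1)
```

Selecting genes once on the full dataset and only then cross-validating would let the held-out sample influence the gene list, which is the selection bias the slide refers to.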
Results
Prostate cancer data
112 samples (patients)
Classes
– 62 primary prostate tumors
– 41 normal prostate specimens
– 9 pelvic lymph node metastases
44016 gene expressions per sample
(Lapointe et al, 2004)
Results
Colon cancer dataset (Alon et al, 1999)
– Minimum classification error 9.7%
Lung cancer dataset (Bhattacharjee et al, 2001)
– Minimum classification error 1.5%
Conclusions
We presented a user-friendly system that implements a cascading SVM architecture
It aims at classifying gene expression data into known classes
The cascading architecture automatically tunes its parameters and determines its optimal configuration
In most cases it leads to a diagnostic accuracy that exceeds 90%
Conclusions
Its performance is usually better than that of the one-vs-one SVM combination method
It utilizes N-1 binary SVM classifiers, whereas one-vs-one utilizes N(N-1)/2
It could be used in everyday clinical practice
Future work includes the adoption of incremental learning approaches
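The classifier counts quoted in the conclusions can be checked with a few lines of arithmetic. The function names are hypothetical; the formulas N-1 and N(N-1)/2 are the ones stated on the slide.

```python
def cascade_classifier_count(n_classes):
    """Binary classifiers in the cascade: N - 1."""
    return n_classes - 1

def one_vs_one_classifier_count(n_classes):
    """Binary classifiers for one-vs-one: N(N - 1) / 2."""
    return n_classes * (n_classes - 1) // 2

for n in (3, 5, 10):
    print(n, cascade_classifier_count(n), one_vs_one_classifier_count(n))
```

For the three-class prostate data this gives 2 classifiers for the cascade against 3 for one-vs-one, and the gap widens quadratically as the number of classes grows.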