A gene expression analysis system for medical diagnosis

D. Maroulis, D. Iakovidis, S. Karkanis, I. Flaounas

University of Athens
Dept. of Informatics and Telecommunications

Objectives

A system to support medical diagnosis using molecular level information

Efficient classification of pathological conditions into multiple classes

A user friendly interface for physicians and biologists

DNA Microarrays

Microscope glass slides
Thousands of spots
Spot: cDNA part

Gene expression level (feature)

Gene expression vector (feature vector)

Gene expression matrix (data set)
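As a minimal sketch of how the three terms above nest, assuming made-up expression values and hypothetical gene names (neither comes from the slides):

```python
# Hypothetical toy data: 3 samples (patients), 4 genes (spots).
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d"]  # illustrative names only

# Gene expression matrix (data set): one row per sample, one column per gene.
expression_matrix = [
    [0.8, -1.2, 0.1, 2.3],
    [0.5, -0.9, 0.0, 1.9],
    [-1.1, 0.7, 0.2, -0.4],
]

# Gene expression vector (feature vector): all gene levels of one sample.
expression_vector = expression_matrix[0]

# Gene expression level (feature): one gene measured in one sample.
expression_level = expression_vector[gene_names.index("gene_d")]

n_samples = len(expression_matrix)
n_genes = len(expression_matrix[0])
print(n_samples, n_genes, expression_level)
```

In a real microarray data set the number of genes is far larger than the number of samples, which is why gene selection matters later in the talk.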

Gene expression analysis tools

Image processing & analysis for microarray spot detection

Visualization & clustering for discovery of unknown classes of pathological conditions

Gene ranking for identification of differentially expressed marker genes

Supervised classification of gene expression vectors into known classes

Gene expression analysis tools
– GeneClust (Do et al, 2000)
– dChip (Li & Wong, 2001)
– Clusfavor (Peterson, 2002)
– Genesis (Sturn et al, 2002)
– Snomad (Colantuoni et al, 2002)
– Base (Saal et al, 2002)
– TM4 Suite (Saeed et al, 2003)
– RankGene (Yang et al, 2003)
– Excavator (Xu et al, 2003)
– KnowledgeEditor (Toyoda & Konagaya, 2003)
– ArrayNorm (Pieler et al, 2004)

Today’s challenge

None of the existing tools takes into account the usability profile of a physician or a biologist

Such tools could hardly be used in everyday medical practice

Supervised approaches

Most known supervised approaches have been applied to classification of gene expression vectors
– Linear discriminant analysis
– k-nearest neighbors
– Parzen windows
– Decision trees
– Neural networks, etc.

Support Vector Machines

(Brown et al, 2000; Furey et al, 2000; Ryu & Cho, 2000; Dudoit et al, 2002; Lu & Han, 2003; Aliferis et al, 2003)

Support Vector Machines

Robust binary classifiers

Not easily affected by the dimensionality of the feature vectors

SVM methods for classification into multiple classes
– One vs one
– One vs all
– Directed Acyclic Graph (DAG)
– Weston & Watkins
– Crammer & Singer

(Weston & Watkins, 1999; Platt, 2000; Yeang et al, 2001; Crammer & Singer, 2001; Hsu & Lin, 2002)
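The first two decomposition schemes in the list can be illustrated by enumerating the binary problems each one creates. This is only a sketch: the class labels are placeholders, and no SVM is trained here.

```python
from itertools import combinations


def one_vs_one_problems(classes):
    """Every unordered pair of classes gets its own binary classifier."""
    return list(combinations(classes, 2))


def one_vs_all_problems(classes):
    """Each class is separated from the union of all the other classes."""
    return [(c, [o for o in classes if o != c]) for c in classes]


classes = ["w1", "w2", "w3", "w4"]  # placeholder labels for four classes
ovo = one_vs_one_problems(classes)
ova = one_vs_all_problems(classes)
print(len(ovo), len(ova))
```

With four classes, one vs one creates six pairwise problems and one vs all creates four.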

About multiclass SVM classifiers

They all lead to comparable results

They utilize a common, constant set of genes as input in each SVM node

They assume that the various pathological conditions correspond to separable clusters in the same gene space

(Hsu et al, 2002; Lee et al, 2003; Statnikov et al, 2004)

The proposed approach

We consider the fact that
– Only a small subset of genes is differentially expressed for each type or subtype of a pathological condition

We propose
– The combination of SVMs in a cascading architecture that embodies gene selection in its structure

Cascading architecture

Classifies input vector x into ω_1, ω_2, …, ω_N

Pre-processing Unit Diagnostic Unit

Cascading architecture

Poor quality cDNA targets generate missing values

(Troyanskaya et al, 2001)
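The slides do not say which imputation method the pre-processing unit uses. As a hedged illustration only, the sketch below fills each missing value with the mean of the observed values of the same gene; the function name and the use of `None` for missing spots are assumptions.

```python
def impute_gene_means(matrix):
    """Replace None entries with the mean of the observed values of that gene.

    `matrix` has one row per sample and one column per gene.
    """
    n_genes = len(matrix[0])
    filled = [row[:] for row in matrix]  # copy so the input is left untouched
    for g in range(n_genes):
        observed = [row[g] for row in matrix if row[g] is not None]
        # Fall back to 0.0 when a gene has no observed value at all.
        gene_mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[g] is None:
                row[g] = gene_mean
    return filled


raw = [
    [1.0, None, 3.0],
    [2.0, 4.0, None],
    [3.0, 6.0, 5.0],
]
clean = impute_gene_means(raw)
print(clean)
```

Neighbor-based imputation is a common alternative when a per-gene mean is too coarse.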

Cascading architecture

Normalization facilitates comparability of samples

(Zhang & Shmulevich, 2002)
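The normalization scheme is not specified on the slide. One simple possibility, shown here purely as an assumption, is to standardize each sample's expression vector to zero mean and unit variance so that samples measured on different arrays share a common scale.

```python
from statistics import mean, pstdev


def standardize_sample(vector):
    """Shift and scale one expression vector to zero mean and unit variance."""
    mu = mean(vector)
    sigma = pstdev(vector)
    if sigma == 0:
        # A constant vector carries no contrast; map it to all zeros.
        return [0.0 for _ in vector]
    return [(v - mu) / sigma for v in vector]


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalized = standardize_sample(sample)
print(normalized)
```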

Cascading architecture

A subset of genes is selected by ranking for each block

Three ranking criteria are available

Gene ranking criteria
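The three ranking criteria are not named on the slide. The sketch below uses a signal-to-noise style score, chosen only as a familiar example of a ranking criterion and not necessarily one of the three; the data and labels are invented.

```python
from statistics import mean, pstdev


def signal_to_noise(values_a, values_b):
    """Absolute mean difference divided by the summed standard deviations."""
    spread = pstdev(values_a) + pstdev(values_b)
    if spread == 0:
        return 0.0
    return abs(mean(values_a) - mean(values_b)) / spread


def rank_genes(matrix, labels, positive):
    """Return gene indices sorted from most to least discriminative."""
    n_genes = len(matrix[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, y in zip(matrix, labels) if y == positive]
        b = [row[g] for row, y in zip(matrix, labels) if y != positive]
        scores.append(signal_to_noise(a, b))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


matrix = [
    [5.0, 1.0, 0.2],
    [6.0, 1.2, 0.9],
    [1.0, 1.1, 0.3],
    [2.0, 0.9, 0.8],
]
labels = ["tumor", "tumor", "normal", "normal"]
ranking = rank_genes(matrix, labels, positive="tumor")
top_genes = ranking[:1]  # keep only the best-ranked gene for this block
print(ranking, top_genes)
```

Each block of the cascade would run such a ranking on its own training subset, so different blocks may end up with different genes.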

Cascading architecture

The classification module C_j is autonomously trained using a subset X_j of the available training samples

[Equation defining X_j in terms of the training samples x, with indices j, p, h and upper limit N; garbled in extraction]

Cascading architecture

A standard binary SVM classifier implements each classification module
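A rough sketch of how such a cascade might route a sample is given below. The decision flow (module C_j either accepts the sample for its class or passes it on, and the last class is the fallback) is an assumption based on the N-1 modules mentioned in the conclusions. `ThresholdModule` is a hypothetical stand-in for the binary SVM, used only to keep the example dependency-free.

```python
class ThresholdModule:
    """Stand-in for one binary module C_j (the slides use a binary SVM here).

    It looks only at its own selected genes and accepts the sample for its
    class when their mean expression exceeds a threshold.
    """

    def __init__(self, label, gene_indices, threshold):
        self.label = label
        self.gene_indices = gene_indices
        self.threshold = threshold

    def accepts(self, x):
        selected = [x[g] for g in self.gene_indices]
        return sum(selected) / len(selected) > self.threshold


def cascade_classify(x, modules, last_label):
    """Pass x through the modules in order; the first one that accepts wins.

    With N classes there are N-1 modules; a sample rejected by all of them
    is assigned to the remaining class.
    """
    for module in modules:
        if module.accepts(x):
            return module.label
    return last_label


# Three classes, hence two modules, each with its own gene subset.
modules = [
    ThresholdModule("w1", gene_indices=[0], threshold=1.0),
    ThresholdModule("w2", gene_indices=[1, 2], threshold=0.5),
]
first = cascade_classify([2.0, 0.0, 0.0], modules, "w3")
second = cascade_classify([0.0, 1.0, 1.0], modules, "w3")
third = cascade_classify([0.0, 0.0, 0.0], modules, "w3")
print(first, second, third)
```

The point of the structure is that each module reads only the genes selected for it, unlike multiclass schemes that feed one common gene set to every node.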

Model selection

The best architecture is determined by leave-one-out cross-validation

Selection bias is minimized
– Gene selection and parameter tuning take place on the training samples during each iteration of the leave-one-out procedure

(Ambroise & McLachlan, 2002; Varma & Simon, 2006)
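The bias-avoidance point can be sketched as a leave-one-out loop in which gene selection is repeated inside every iteration, using the training fold only. The gene-selection rule and the nearest-centroid classifier below are simplified stand-ins (assumptions) for the system's ranking criteria and SVMs.

```python
from statistics import mean


def select_top_gene(matrix, labels):
    """Pick the gene whose class means differ the most (toy criterion)."""
    classes = sorted(set(labels))

    def gap(g):
        means = [mean([r[g] for r, y in zip(matrix, labels) if y == c])
                 for c in classes]
        return max(means) - min(means)

    return max(range(len(matrix[0])), key=gap)


def fit_centroids(matrix, labels, gene):
    """Nearest-centroid stand-in for the classifier, on one selected gene."""
    return {c: mean([r[gene] for r, y in zip(matrix, labels) if y == c])
            for c in set(labels)}


def predict(centroids, x, gene):
    return min(centroids, key=lambda c: abs(x[gene] - centroids[c]))


def leave_one_out_error(matrix, labels):
    errors = 0
    for i in range(len(matrix)):
        train_x = matrix[:i] + matrix[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Gene selection sees only the training fold, never sample i.
        gene = select_top_gene(train_x, train_y)
        centroids = fit_centroids(train_x, train_y, gene)
        if predict(centroids, matrix[i], gene) != labels[i]:
            errors += 1
    return errors / len(matrix)


matrix = [
    [5.0, 1.0], [6.0, 1.3], [5.5, 0.8],
    [1.0, 1.1], [2.0, 0.9], [1.5, 1.2],
]
labels = ["a", "a", "a", "b", "b", "b"]
error = leave_one_out_error(matrix, labels)
print(error)
```

Selecting genes once on the full data set and cross-validating afterwards would let the held-out sample influence the selection, which is the bias the cited papers warn about.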

Graphical User Interface

Results

Prostate cancer data

112 samples (patients)

Classes
– 62 primary prostate tumors
– 41 normal prostate specimens
– 9 pelvic lymph node metastases

44016 gene expressions per sample

(Lapointe et al, 2004)

Results

Minimum error 6.3% using 1 input gene
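As a rough check, and assuming the error is a leave-one-out estimate over all 112 prostate samples (the slide does not state how it was computed), 6.3% would correspond to 7 misclassified samples:

```latex
\[
\frac{7}{112} = 0.0625 \approx 6.3\%
\]
```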

Results

Colon cancer dataset (Alon et al, 1999)
– Minimum classification error 9.7%

Lung cancer dataset (Bhattacharjee et al, 2001)
– Minimum classification error 1.5%

Conclusions

We presented a user friendly system that implements a cascading SVM architecture

It aims at the classification of gene expression data into known classes

The cascading architecture automatically tunes its parameters and determines its optimal configuration

In most cases it leads to a diagnostic accuracy that exceeds 90%

Conclusions

Its performance is usually better than the one-vs-one SVM combination method

It utilizes N-1 binary SVM classifiers, whereas one-vs-one utilizes N(N-1)/2

It could be used in everyday clinical practice

Our future plans include the adoption of incremental learning approaches
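To make the classifier counts concrete, here is the arithmetic for the three-class prostate data and, as a purely illustrative larger case, for N = 10:

```latex
\[
N = 3:\quad N - 1 = 2, \qquad \frac{N(N-1)}{2} = \frac{3 \cdot 2}{2} = 3
\]
\[
N = 10:\quad N - 1 = 9, \qquad \frac{N(N-1)}{2} = \frac{10 \cdot 9}{2} = 45
\]
```

The cascade's count grows linearly with the number of classes, while the one-vs-one count grows quadratically.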

Thank you