Evaluating Classifiers for Disease Gene Discovery Lon Turnbull and Kino Coursey [email protected], [email protected] University of North Texas


Page 1: Evaluating Classifiers for Disease Gene Discovery

Evaluating Classifiers for Disease Gene Discovery

Lon Turnbull and Kino Coursey

[email protected], [email protected]

University of North Texas

Page 2: Evaluating Classifiers for Disease Gene Discovery

Biocomputing Fall 2005

CSCD 4930.004/CSCE 5933.007

Biol 4930.773/Biol 5905.773

Instructors: Armin Mikler and Kaja Abbas

Page 3: Evaluating Classifiers for Disease Gene Discovery

Outline

• An interesting hypothesis

• What is a disease gene?

• Can disease genes be classified using machine learning tools?

• If so, can we do better?

• Classifiers + Data

• Analysis + Conclusions

Page 4: Evaluating Classifiers for Disease Gene Discovery

Hypothesis

• It has been suggested that the genes which have some relationship to hereditary disease might have common variations in their DNA sequence structure.

Page 5: Evaluating Classifiers for Disease Gene Discovery

What is a disease gene?

• Any gene that has mutated in such a way that the proteins created from it are dysfunctional.

Page 6: Evaluating Classifiers for Disease Gene Discovery

What is a disease gene?

• Any gene that has mutated in such a way that the proteins created from it are dysfunctional.

• However, mutation can happen to any gene, so can one actually search for physical characteristics of a “disease” gene?

Page 7: Evaluating Classifiers for Disease Gene Discovery

Reviewed Paper

• A research group has used the alternating decision tree algorithm from Weka to test the hypothesis.

• On average, their automatic classifier, which they called PROSPECTR, correctly identified 70% of the genes marked as having a disease phenotype.

• They found that about 40% of their chosen features had statistically significant differences.

Page 8: Evaluating Classifiers for Disease Gene Discovery

PROSPECTR results

Feature                         Ratio
Gene encodes signal peptide     2.06
Gene length                     1.42
5' CpG islands                  1.33
Protein length                  1.29
Exon number                     1.25
cDNA length                     1.15
Distance to neighboring gene    1.13
3' UTR length                   1.09

Page 9: Evaluating Classifiers for Disease Gene Discovery

Question

Can we do better with other methods of classification?

Page 10: Evaluating Classifiers for Disease Gene Discovery

Classification Methods

1. ADTree: alternating decision tree, optimized for two-class problems.

2. J48: a variant of the C4.5 decision tree classifier.

3. Logistic: linear logistic regression.

4. SMO: Sequential Minimal Optimization algorithm for training a support vector classifier.

5. Naïve Bayes: standard probabilistic Naïve Bayes.

6. Ibk-K: K-nearest neighbor classifier (k=5).

7. PART: obtains rules from partial decision trees built using C4.5 heuristics.

Page 11: Evaluating Classifiers for Disease Gene Discovery

Test Data

• A training set that consisted of 1,084 genes known to be associated with a disease and 1,084 genes not known to be associated with disease.

• A set with 675 disease genes listed in the Human Gene Mutation Database (HGMD) and 675 genes not known to be involved in disease.

• A set based on oligogenic disorders. It contained 54 genes known to be associated with an oligogenic disorder and 54 genes not known to be associated with disease.
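Because each set pairs its disease genes with an equal number of non-disease genes, a classifier that always predicts one class would score 50% correct. The short sketch below works that arithmetic out for the three sets listed above; the function name `majority_baseline` is hypothetical and is used only for illustration.

```python
# (set name, disease genes, non-disease genes), taken from the slide above
datasets = [
    ("Training", 1084, 1084),
    ("HGMD", 675, 675),
    ("Oligogenic", 54, 54),
]


def majority_baseline(n_disease, n_other):
    """Percent correct obtained by always predicting the larger class."""
    return 100.0 * max(n_disease, n_other) / (n_disease + n_other)


# Baseline percent correct for each set
baselines = {name: majority_baseline(d, n) for name, d, n in datasets}

for name, pct in baselines.items():
    print(f"{name}: {pct:.1f}% correct by always predicting one class")
```

Any percent-correct figure for a classifier on these sets is therefore best read against a 50% floor.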

Page 12: Evaluating Classifiers for Disease Gene Discovery

Classifier interpretation

There are four possible results from a classification analysis. A selected gene either:

1. Is a disease gene and is classified as one (true positive).

2. Is a non-disease gene and is classified as one (true negative).

3. Is classified as a disease gene but is not one (false positive).

4. Is classified as a non-disease gene but is not one (false negative).
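The four outcomes above can be tallied with a few lines of code. This is a minimal sketch, not the analysis used in the presentation: the labels, the example data, and the function names `confusion_counts` and `percent_correct` are all hypothetical.

```python
from collections import Counter


def confusion_counts(actual, predicted, positive="disease"):
    """Tally the four possible outcomes of a two-class classification."""
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["TP"] += 1  # disease classified as disease
        elif a != positive and p != positive:
            counts["TN"] += 1  # not disease classified as not disease
        elif a != positive and p == positive:
            counts["FP"] += 1  # not disease classified as disease
        else:
            counts["FN"] += 1  # disease classified as not disease
    return counts


def percent_correct(counts):
    """Share of all genes that fall into the two 'correct' outcomes."""
    total = sum(counts.values())
    return 100.0 * (counts["TP"] + counts["TN"]) / total


# Hypothetical labels for six genes (illustration only)
actual = ["disease", "disease", "disease", "not", "not", "not"]
predicted = ["disease", "disease", "not", "not", "disease", "not"]

counts = confusion_counts(actual, predicted)
print(dict(counts), round(percent_correct(counts), 1))
```

The four counters correspond to the four bar series shown in the charts on the following slides.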

Page 13: Evaluating Classifiers for Disease Gene Discovery

[Figure: Training Set. Bar chart of the number of genes (0 to 1200) for classifiers 1 to 7, in four series: disease classified as disease, disease classified as not disease, not disease classified as disease, not disease classified as not disease.]

Page 14: Evaluating Classifiers for Disease Gene Discovery

Validity

• The analysis of an independent data set ought to produce results similar to those of the training set. If not, the analysis is suspect.

Page 15: Evaluating Classifiers for Disease Gene Discovery

[Figure: two bar charts, Training Set (0 to 1200 genes) and HGMD Set (0 to 600 genes). Each shows the number of genes for classifiers 1 to 7, in four series: disease classified as disease, disease classified as not disease, not disease classified as disease, not disease classified as not disease.]

Page 16: Evaluating Classifiers for Disease Gene Discovery

Validity

• If the analysis is valid, we would expect classification using only the successful subset of features found by the PROSPECTR application to give improved results.

• The removal of non-relevant features ought to decrease the number of mismatches.

Page 17: Evaluating Classifiers for Disease Gene Discovery

[Figure: two bar charts, Reduced Training Set and Training Set (0 to 1200 genes each). Each shows the number of genes for classifiers 1 to 7, in four series: disease classified as disease, disease classified as not disease, not disease classified as disease, not disease classified as not disease.]

Page 18: Evaluating Classifiers for Disease Gene Discovery

All data analyzed

[Figure: six bar charts, Training Set, Reduced Training Set, HGMD Set, Reduced HGMD Set, Oligogenic Set, and Reduced Oligogenic Set. Each shows the number of genes for classifiers 1 to 7, in four series: disease classified as disease, disease classified as not disease, not disease classified as disease, not disease classified as not disease.]

Page 19: Evaluating Classifiers for Disease Gene Discovery

The Best Classifier Results

Classifier     Percent total correct    Difference with best features
J48            88.7                     -15.1
PART           80.7                     -10.6
ADTree         75.5                     -3.1
Ibk-K          75.4                     -0.32
Naïve Bayes    73.0                     -12.0
SMO            72.3                     -6.0
Logistic       70.0                     -4.9
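The table can be read arithmetically under one assumption that the slide does not state: that "Difference with best features" is the percent correct with the reduced feature set minus the percent correct with all features. If that reading is right, the implied reduced-feature scores can be derived as in the sketch below. The variable names are hypothetical, and the derived values are only as good as the assumption.

```python
# (percent total correct, difference with best features), copied from the table
results = {
    "J48": (88.7, -15.1),
    "PART": (80.7, -10.6),
    "ADTree": (75.5, -3.1),
    "Ibk-K": (75.4, -0.32),
    "Naïve Bayes": (73.0, -12.0),
    "SMO": (72.3, -6.0),
    "Logistic": (70.0, -4.9),
}

# ASSUMPTION: difference = reduced-feature percent - all-feature percent,
# so the implied reduced-feature percent is the sum of the two columns.
implied_reduced = {
    name: round(full + diff, 2) for name, (full, diff) in results.items()
}

# Classifiers ordered by implied reduced-feature percent, best first
ranking = sorted(implied_reduced, key=implied_reduced.get, reverse=True)

for name in ranking:
    print(f"{name}: {implied_reduced[name]}")
```

Every difference in the table is negative, so under this reading no classifier improved when restricted to the best features.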

Page 20: Evaluating Classifiers for Disease Gene Discovery

Conclusions

• We have shown that classifier 2 (J48) performs better than classifier 1 (ADTree), the one chosen by the PROSPECTR method.

• The features that showed the largest differences in the PROSPECTR study were most likely a statistical anomaly.

• It seems that using these machine learning methods to classify disease genes is not very productive. At best, it needs to be combined with some other independent method.

Page 21: Evaluating Classifiers for Disease Gene Discovery

References

• Euan Adie et al., Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics 2005, 6:55.

• Hammond MP, Birney E, Genome information resources - developments at Ensembl. Trends in Genetics 2004, 20:268-272.

• http://www.biomedcentral.com/1471-2105/6/55

• http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=gnd

• http://www.genetics.med.ed.ac.uk/prospectr/

Page 22: Evaluating Classifiers for Disease Gene Discovery

Questions

Page 23: Evaluating Classifiers for Disease Gene Discovery

What causes disease?

• Causes of disease are a continuum of genetic activity interacting with nongenetic factors.

The Metabolic and Molecular Bases of Inherited Disease, 8th ed., Vol. 1, Chapter 1.

RC 627.8.M47.2001

Page 24: Evaluating Classifiers for Disease Gene Discovery

Weka

• A collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a data set or called from your own Java code.

• Contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.

• Well-suited for developing new machine learning schemes.

• Is open source software issued under the GNU General Public License.