9th Annual "Humies" Awards 2012, Philadelphia, Pennsylvania

Genetic Programming Based Feature Generation for Automated DNA Sequence Analysis

Uday Kamath, Amarda Shehu, Kenneth A. De Jong
Department of Computer Science, George Mason University, Fairfax, VA 22030
{ukamath, amarda, kdejong}@gmu.edu




Bioinformatics and Molecular Biology

Larrañaga P et al. Brief Bioinform 2006;7:86-112


Promoter Site Identification

Copyright 2012 the British Journal of Anaesthesia

Background

• Promoters signal the beginning of a coding region
• They are important signals for initiation of DNA->RNA transcription

Challenges

• Complex
• Gene-specific
• Many decoys


DNA Splice Site Identification

Asa Ben-Hur, Cheng Soon Ong, Sören Sonnenburg, Bernhard Schölkopf, and Gunnar Rätsch. Tutorial: Support Vector Machines and Kernels for Computational Biology [2008]

Background

• Splice sites mark boundaries between exons and introns in a gene

Challenges

• No known sequence pattern
  i. Diverse sequence length
  ii. Diverse exon lengths
  iii. Diverse number and lengths of introns
• 0.1 to 1% true splice sites, rest decoys
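To see why so few true sites among so many decoys makes precision the hard part, here is a small arithmetic sketch. Only the 0.1% to 1% prevalence comes from the slide; the sensitivity and false-positive rate are hypothetical values chosen for illustration.

```python
def precision(prevalence, sensitivity, false_positive_rate):
    """Fraction of predicted sites that are true sites (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Hypothetical classifier: finds 95% of true sites and mislabels 1% of decoys.
for prevalence in (0.001, 0.01):
    p = precision(prevalence, sensitivity=0.95, false_positive_rate=0.01)
    print(f"prevalence {prevalence:.1%}: precision {p:.3f}")
```

Under these assumed rates, most predicted sites are decoys at the lower prevalence, even though the classifier misses few true sites.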


Evolutionary (GP) Approach


Finding Functional Features

• GP Functional Features

Terminals
  A, C, T, G
  Integers for position/region

Basic Non-Terminals
  Motif (combination of A, C, T, G)
  Position-based motifs
  Correlation-based motifs
  Region-based motifs
  Composition-based motifs

Complex Non-Terminals
  Conjunctions
  Disjunctions
  Negations

Features evolved combining accuracy/precision
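As a rough illustration of how the terminals and non-terminals listed here could compose into a feature, the sketch below treats each feature as a Boolean function of a DNA sequence. All function names, the toy sequences, and the use of precision alone as the fitness are assumptions made for this example, not the authors' implementation; correlation-based and composition-based motifs are left out for brevity.

```python
from typing import Callable, List, Tuple

# A feature maps a DNA sequence to True/False.
Feature = Callable[[str], bool]


def motif(m: str) -> Feature:
    """Basic motif: m occurs anywhere in the sequence."""
    return lambda seq: m in seq


def motif_at(m: str, pos: int) -> Feature:
    """Position-based motif: m starts exactly at index pos."""
    return lambda seq: seq.startswith(m, pos)


def motif_in_region(m: str, start: int, end: int) -> Feature:
    """Region-based motif: m occurs within seq[start:end]."""
    return lambda seq: m in seq[start:end]


def and_(a: Feature, b: Feature) -> Feature:
    """Conjunction of two features."""
    return lambda seq: a(seq) and b(seq)


def or_(a: Feature, b: Feature) -> Feature:
    """Disjunction of two features."""
    return lambda seq: a(seq) or b(seq)


def not_(a: Feature) -> Feature:
    """Negation of a feature."""
    return lambda seq: not a(seq)


def precision_fitness(feature: Feature, data: List[Tuple[str, bool]]) -> float:
    """Precision of the feature when used as a one-rule classifier."""
    tp = sum(1 for seq, label in data if feature(seq) and label)
    fp = sum(1 for seq, label in data if feature(seq) and not label)
    return tp / (tp + fp) if tp + fp else 0.0


# Toy labelled sequences (hypothetical, not from the authors' datasets).
data = [
    ("AAGGTAAGT", True),
    ("CAGGTGAGT", True),
    ("AAGCTAAGT", False),
    ("TTGGTCCCA", False),
]

simple = motif_at("GT", 3)
combined = and_(motif_at("GT", 3), not_(motif_in_region("CCC", 5, 9)))

print(precision_fitness(simple, data))
print(precision_fitness(combined, data))
```

On this toy data the combined feature rejects the decoy that the simple positional motif accepts. A GP search would favor such combinations when precision contributes to fitness.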


Why Human Competitive?

B) The result is equal to or better than a result that was accepted as a new scientific result

E) The result is equal to or better than the most recent human-created solution to a long-standing problem

F) The result is equal to or better than a result that was considered an achievement when it was first discovered

G) The result solves a problem of indisputable difficulty in its field


Why Human Competitive?

B) The result is equal to or better than a result that was accepted as a new scientific result

Splice Site Prediction
• Compared against state-of-the-art enumeration, iterative, probabilistic, and kernel methods, among others
• Best precision, with statistically significant improvements on most datasets

Promoter Prediction
• Compared against 7 state-of-the-art algorithms, including enumeration, iterative, neural network, and kernel-based methods
• Best precision, with statistically significant improvements on different datasets

F) The result is equal to or better than a result that was considered an achievement when it was first discovered


Why Human Competitive?

F) The result is equal to or better than a result that was considered an achievement when it was first discovered

On the Promoter Identification Problem

What was considered an achievement

Where we stand

Uday Kamath, Kenneth A. De Jong, and Amarda Shehu. "An Evolutionary-based Approach for Feature Generation: Eukaryotic Promoter Recognition." IEEE Congress on Evolutionary Computation (IEEE CEC), New Orleans, LA, pp. 277-284, 2011


Why Human Competitive?

F) The result is equal to or better than a result that was considered an achievement when it was first discovered

On the Splice Site Identification Problem

What was considered an achievement

Where we stand

Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu. "An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction." Trans Comp Biol and Bioinf, 2012


Why Human Competitive?

E) The result is equal to or better than the most recent human-created solution to a long-standing problem

Long-Standing Problem(s)
Genome sequence prediction and annotation of splice sites and promoters

Computational Results Equal or Better
Around 7 datasets and 10 algorithms compared

Advancing Understanding in Genomics
• Our top features contain signals painstakingly determined by biologists through decades of wet-lab research
• More importantly, new features are found that may help biologists further advance their understanding of DNA architecture
• All our features are available online for experts to analyze and to spur further wet-lab research


Why Human Competitive?

G) The result solves a problem of indisputable difficulty in its field

• Estimated 10-25K human protein-coding genes (only 1.5% of entire genome)
• Wet-lab models of discovery costly and prone to errors
  • Cannot keep pace with growing genomic sequences
• Computational models good complements, but
  • Black-box models: no or little help to biologists
  • White-box models: lower precision/accuracy and reliant on manual steps
• Decades of research into DNA function and architecture
  • "Gene finding" on PubMed returns > 80,000 research articles
• Progress crucial to speed up our understanding of disease and development of targeted treatments


Why is this the Best Entry

• Addresses problems central to molecular biology and health research
• Finding functional signals in genome sequences is complex and NP-hard
• Improvements over state of the art are statistically significant
• Extensive statistical analysis validates usefulness of GP features
  - F-score and information gain techniques
• Advances understanding to motivate further research
  - Features found by GP reproduce results of decades of research by biologists
  - Novel interesting features also reported
  - Features, data sets, and software publicly available for community
• Far-reaching implications, spurring research beyond genomics
  - Example: finding what features determine anti-microbial activity for the purpose of generating novel peptides to combat drug resistance
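The information-gain analysis named among the validation techniques above can be sketched for a single Boolean feature as follows. This is a minimal illustration under assumed inputs (parallel lists of feature values and class labels), not the authors' analysis code. The F-score is omitted because the slides do not say which definition was used.

```python
import math
from typing import Sequence


def entropy(labels: Sequence[bool]) -> float:
    """Shannon entropy (in bits) of a list of binary class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


def information_gain(feature_values: Sequence[bool],
                     labels: Sequence[bool]) -> float:
    """Reduction in label entropy after splitting on a Boolean feature."""
    n = len(labels)
    if n == 0:
        return 0.0
    on = [y for x, y in zip(feature_values, labels) if x]
    off = [y for x, y in zip(feature_values, labels) if not x]
    conditional = len(on) / n * entropy(on) + len(off) / n * entropy(off)
    return entropy(labels) - conditional


# Toy example: one feature that separates the classes, one that does not.
labels = [True, True, False, False]
print(information_gain([True, True, False, False], labels))
print(information_gain([True, False, True, False], labels))
```

A feature that separates the two classes perfectly recovers the full label entropy, while one that is independent of the labels yields a gain of zero.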