Background & Motivation Problem & Feature Construction Experiments Design & Results Conclusions and Future Work
Exploring Alternative Splicing Features using Support Vector Machines
Jing Xia1, Doina Caragea1, Susan J. Brown2
1 Computing and Information Sciences Kansas State University, USA
2 Bioinformatics Center Kansas State University, USA
Jan 16 2008
Outline
1 Background & Motivation
2 Problem & Feature Construction
  Problem Definition
  Data Set
  Feature Construction
3 Experiments Design & Results
  Experimental Design
  Experimental Results
4 Conclusions and Future Work
  Conclusion
Alternative Splicing
Splicing: important step during gene expression
Variable splicing process (alternative splicing): one gene -> many proteins
[Figure: gene expression, from genes to proteins. DNA (exons and introns, 5’UTR, 3’UTR, TSS, ATG, introns bounded by GT and AG) is transcribed into pre-mRNA (cap, introns bounded by GU and AG), which is spliced into mRNA (AUG) and translated into protein]
[Figure: one gene to many proteins. A gene is transcribed into pre-mRNA, and alternative splicing produces several transcript isoforms that lead to different proteins]
Patterns of Alternative Splicing
Constitutively Spliced Exon (CSE)
Alternatively Spliced Exon (ASE)
  Exon skipping (most frequent)
  Alternative 5’ splice sites
  Alternative 3’ splice sites
  Intron retention
  Mutually exclusive
[Figure: exon1 to exon4, each labeled as CSE or ASE]
Here, focus on predicting alternatively spliced exons (ASE) and constitutively spliced exons (CSE) based on SVM
Identifying Alternative Splicing in genome
[Figure: transcripts aligned to genomic DNA, showing alternative splicing events such as an alternative 3’ exon and exon skipping]
Wet lab experiments to find AS are time consuming
Traditionally, ESTs are aligned to the genome (limited by the number of ESTs available for the genome)
Use machine learning algorithms to predict AS at the genome level
Problem Definition
Problem Definition: given an exon, can we predict whether it is an alternatively spliced exon (ASE) or a constitutively spliced exon (CSE)?
[Figure: four exons labeled exon1 (CSE), exon2 (CSE), exon3 (ASE), exon4 (CSE)]
Problem Addressed and Our Approach
Problem Definition: predict alternatively spliced exons (ASE) vs constitutively spliced exons (CSE)
Use Support Vector Machine (SVM)
  Task: Two-class (ASE and CSE) classification problem
  Need: Training data set containing labeled examples (ASE & CSE)
  Learning: Train classifier with training data
  Application: Predict unknown ASE
Need features to represent ASEs & CSEs
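The two-class setup above can be sketched in a few lines. This is only an illustration, assuming a linear SVM trained with Pegasos-style sub-gradient steps on the hinge loss; the slides do not say which SVM implementation was used. The function names, the toy feature vectors, and the choice to fold the bias into the weight vector (so the bias is regularized too) are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.1, epochs=50):
    """Train a linear SVM with Pegasos-style sub-gradient steps on the hinge loss.

    X: list of feature vectors, y: labels in {+1, -1} (+1 = ASE, -1 = CSE).
    Returns the weight vector; its last entry plays the role of the bias.
    """
    w = [0.0] * (len(X[0]) + 1)
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            t += 1
            eta = 1.0 / (lam * t)                       # decreasing step size
            xb = list(xi) + [1.0]                       # constant feature for the bias
            margin = yi * sum(wj * xj for wj, xj in zip(w, xb))
            w = [(1.0 - eta * lam) * wj for wj in w]    # shrink step (regularization)
            if margin < 1.0:                            # example violates the margin
                w = [wj + eta * yi * xj for wj, xj in zip(w, xb)]
    return w


def predict(w, x):
    """Return +1 (ASE) or -1 (CSE) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


# Toy, made-up two-dimensional feature vectors (not taken from the data set).
X_train = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3)]
y_train = [1, 1, 1, -1, -1, -1]
weights = train_linear_svm(X_train, y_train)
predictions = [predict(weights, x) for x in X_train]
```

In practice each exon would be represented by a much longer feature vector, and a library SVM with kernel support would replace this hand-written trainer.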
Data Set
Published data set from the model organism, C. elegans (worm)
Includes alternatively spliced exons (ASE) and constitutively spliced exons (CSE)
Contains 487 ASEs and 2531 CSEs
100-base local sequences around splice sites
Example of data set
  ASE GTACTATAGCGTGCTG....ACCGTTCGTACTCGCT
  ASE ATACTATAGCGTCTTG....ACCGATCGTACACGCT
  CSE GTACTATAGCGTCTTG....ACCGATCGTACTCGCT
[Figure: an exon bounded by AG upstream and GT downstream, with a window from −100 to +100 around each splice site]
Previous work:
Motifs captured and identified by kernel (G. Rätsch et al.)
Length of exons and flanking introns (Sorek et al.)
Our work:
Exploit more biologically significant features
Use several additional approaches to derive features
Feature List
Several features known to be biologically important
Strength of splice sites (SSS)
Motif features
  Intronic splicing regulator (ISR)
  Motifs derived from local sequences (MAST)
  Exonic splicing enhancer (ESE)
  Reduced set of motif features based on locations of motifs on secondary structure (MAST-R)
Optimal folding energy (OPE)
Basic sequence features (BSF)
SSS: Strength of Splice Site
We consider all splice sites
$$\text{score} = \sum_{i} \log \frac{F(X_i)}{F(X)},$$
where $X \in \{A, U, G, C\}$, $i \in \{-3, +7\}$ for 3’ splice sites (3’ss) and $i \in \{-26, +2\}$ for 5’ splice sites (5’ss).
[Figure: aligned example splice sites (CGAG exon AGGTAAGT, CGAG exon AGGTAAGT, GGAG exon AGGTAGGT, CGAG exon AGGTTAGT, CCAG exon AGGTAAGT) with the 3’ ss and 5’ ss position windows marked]
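A minimal sketch of the splice-site score, assuming that F(X_i) is the frequency of nucleotide X at position i among aligned splice sites and F(X) is a background frequency (the slides do not define either term). The add-one pseudocount, the uniform background, and the function names are assumptions made for illustration; the five eight-base windows are the donor-side sequences shown in the slide's example figure, written in DNA letters.

```python
import math

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Estimate F(X_i): per-position base frequencies from aligned sites."""
    freqs = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        # The pseudocount avoids log(0) for bases never seen at a position.
        freqs.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return freqs


def splice_site_score(site, freqs, background):
    """Sum over positions of log(F(X_i) / F(X))."""
    return sum(math.log(freqs[i][base] / background[base])
               for i, base in enumerate(site))


# Windows from the slide's figure; a real model would use all splice sites.
example_sites = ["AGGTAAGT", "AGGTAAGT", "AGGTAGGT", "AGGTTAGT", "AGGTAAGT"]
freqs = position_frequencies(example_sites)
background = {b: 0.25 for b in BASES}   # assumed uniform background

consensus_score = splice_site_score("AGGTAAGT", freqs, background)
variant_score = splice_site_score("AGGTTAGT", freqs, background)
unrelated_score = splice_site_score("CCCCCCCC", freqs, background)
```

A site that matches the most frequent base at every position gets the highest score, so the score can serve directly as a numeric "strength" feature.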
Motif: sequence pattern that occurs repeatedly in a group of sequences
Intronic Splicing Regulator (ISR): identified in Kabat et al.
MAST: derived by MEME using the [-100,+100] sequence
Exonic Splicing Enhancers (ESE): based on two assumptions
[Figure: illustration of ISRs dispersed among the sequences around an exon]
Example: a 20-base motif derived from sequences around splice sites
ESE assumptions:
  more frequent in exons than in introns
  more frequent in exons with weak splice sites than in exons with strong splice sites
[Figure: ISR, MAST and ESE motifs dispersed among exons and introns]
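One simple way to turn motifs into SVM input is to count how often each motif occurs in a sequence. The slides do not state how motif features were encoded, so the occurrence-count encoding, the function names, and the two-letter example motifs below are assumptions for illustration only.

```python
def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)   # step by one to allow overlaps
    return count


def motif_features(sequence, motifs):
    """Return one occurrence count per motif, in the order given."""
    return [count_motif(sequence, m) for m in motifs]


# Prefix of the first example sequence from the data set slide;
# the motifs "GT" and "TA" are placeholders, not motifs from the study.
features = motif_features("GTACTATAGCGTGCTG", ["GT", "TA"])
```

The same counting, applied separately to exonic and intronic sequence, would give the kind of exon-versus-intron frequency comparison that the ESE assumptions refer to.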
Pre-mRNA secondary structures motif influence exon recognition
[Figure: example pre-mRNA sequence AUCCAUGGGCCGGAUGUGACGGUAGUAGGGUAUACGUCACAUAGGCUUCCUCUCAUGA folded into loop and stem regions]
Motifs are located at different structures (loop, stem)
Secondary structure: derived from Mfold
Filter motifs using secondary structure
Optimal Folding Energy: stability of RNA secondary structure
GC content (G & C ratio), $\frac{G+C}{A+U+G+C}$: characteristic of the sequence
Sequence length
  Length of exons and length of exons’ flanking introns
Frames of stop codons
Summary of features
  Motif features
  Secondary structure
  Strength of splice sites
  Sequence features
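The basic sequence features (GC content, lengths, stop-codon frames) are straightforward to compute. The sketch below assumes that "frames of stop codons" means recording, for each of the three reading frames, whether a stop codon occurs; that reading, the function names, and the dictionary keys are assumptions rather than the authors' definitions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C bases, (G + C) / (A + U + G + C)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def stop_codon_frames(seq):
    """For frames 0, 1, 2: True if that reading frame contains a stop codon."""
    seq = seq.upper().replace("U", "T")   # accept RNA or DNA letters
    frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        frames.append(any(codon in STOP_CODONS for codon in codons))
    return frames


def basic_sequence_features(exon, upstream_intron, downstream_intron):
    """Collect the basic features for one exon and its flanking introns."""
    return {
        "gc_content": gc_content(exon),
        "exon_length": len(exon),
        "upstream_intron_length": len(upstream_intron),
        "downstream_intron_length": len(downstream_intron),
        "stop_codon_frames": stop_codon_frames(exon),
    }


# Made-up sequences, only to show the shape of the output.
example = basic_sequence_features("ATGTAAGG", "GTAAGTTTTCAG", "GTATGTAG")
```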
Experimental Design
List of previously defined features as SVM input
Combination of different features to represent ASEs & CSEs
[Figure: data divided into five splits (split1 to split5), 20% vs 80%, used for 5-fold cross validation]
Tune SVM parameters to train (kernel: linear, RBF, ...; cost C)
Choose parameters with best cross-validation (CV) accuracy
Test trained SVM on testing ASEs & CSEs
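The five-way split can be sketched as follows. The slides do not say whether the splits were stratified by class; stratification, the fixed random seed, and the function names are assumptions here. Each round holds out one split (about 20% of the exons) and keeps the remaining four (about 80%).

```python
import random


def stratified_splits(labels, n_splits=5, seed=0):
    """Partition example indices into n_splits parts with similar class ratios."""
    rng = random.Random(seed)
    splits = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        for k, index in enumerate(indices):
            splits[k % n_splits].append(index)   # deal indices out round-robin
    return splits


def train_test_rounds(splits):
    """Yield (train_indices, test_indices), holding out one split per round."""
    for k, test in enumerate(splits):
        train = [i for j, part in enumerate(splits) if j != k for i in part]
        yield train, test


# Class sizes from the data set slide: 487 ASEs and 2531 CSEs.
labels = ["ASE"] * 487 + ["CSE"] * 2531
splits = stratified_splits(labels)
rounds = list(train_test_rounds(splits))
ase_per_split = [sum(1 for i in part if labels[i] == "ASE") for part in splits]
```

Parameter tuning would then loop over candidate values of C (and kernels) inside the training portion of each round and keep the setting with the best cross-validation score.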
Experimental results
Results of alternatively spliced exon classification. All features, including ISR motifs, are used.
Split    C      CV fp 1%   CV AUC %   Test fp 1%   Test AUC %
Split1   0.05   32.45      86.55      56.48        90.05
Split2   0.05   39.33      88.32      52.04        89.04
Split3   0.1    37.56      87.76      38.71        87.97
Split4   0.01   40.86      89.02      37.63        84.42
Split5   0.1    36.48      87.50      35.79        85.69
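For readers unfamiliar with the two metrics in the table, the sketch below shows one way to compute them from classifier scores. It assumes that "fp 1%" means the true positive rate achieved when the false positive rate is held at 1% or less, which the slides do not spell out; the function names and the toy scores are made up for illustration.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """True positive rate at the strictest threshold keeping FPR <= max_fpr."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = sorted((s for s, l in zip(scores, labels) if l != 1), reverse=True)
    allowed = int(max_fpr * len(neg))          # false positives we may accept
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")
    # Predict positive only for scores strictly above the threshold.
    return sum(1 for p in pos if p > threshold) / len(pos)


# Toy scores: label 1 = ASE, label 0 = CSE.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
toy_tpr = tpr_at_fpr(toy_scores, toy_labels)
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for a data set of a few thousand exons.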
[Figure: ROC curves plotted against False Positive Rate for Mixed-Feas (85.55%) and Base-Feas (78.78%)]
Comparison of ROC curves obtained using basic features only and basic features plus other mixed features (except conserved ISR motifs). Models trained using 5-fold CV with C = 1.
[Figure: AUC score comparison between data sets with secondary structural features and data sets without secondary structural features]
Motif Evaluation
Intersection between motifs derived from sequences & intronic splicing regulators
Conserved ESE in metazoans (animals), Human and Mouse
Comparison with A. thaliana
Conclusions
Alternative splicing (AS) events can be found using transcripts
Machine learning effectively used for prediction of AS events
Identified features informative in predicting AS
Explored comparatively comprehensive feature sets from a biological point of view
Future Work
Apply this approach to specific organisms
Identify motifs more accurately
Refine relationships between features (secondary structure and motifs)
Learn other types of AS events (not only skipped exons)
adapted from "Detection of Alternative Splicing Events Using Machine Learning"
Thank you for your attention!
Questions?
Related work
RASE http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE
Acknowledgement
data set from Dr. Rätsch’s FML group http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE/altsplicedexonsplits.tar.gz
Dr. Caragea’s MLB group http://people.cis.ksu.edu/~dcaragea/mlb
Dr. Brown’s Bioinformatics Center at KSU http://bioinformatics.ksu.edu