Exploring Alternative Splicing Features using Support Vector Machines

Jing Xia (1), Doina Caragea (1), Susan J. Brown (2)

(1) Computing and Information Sciences, Kansas State University, USA
(2) Bioinformatics Center, Kansas State University, USA

Jan 16 2008

Page 2: Exploring Alternative Splicing Features using Support Vector Machines

Outline

1 Background & Motivation

2 Problem & Feature Construction
    Problem Definition
    Data Set
    Feature Construction

3 Experiments Design & Results
    Experimental Design
    Experimental Results

4 Conclusions and Future Work
    Conclusion

Page 3: Exploring Alternative Splicing Features using Support Vector Machines

Alternative Splicing

Splicing: an important step during gene expression
Variable splicing process (alternative splicing): one gene -> many proteins

[Figure: gene expression, from genes to proteins. DNA (exons and introns, with 5'UTR, TSS, ATG, GT/AG intron boundaries and 3'UTR) is transcribed into capped pre-mRNA, which is spliced into mRNA (AUG) and translated into protein.]

Page 4: Exploring Alternative Splicing Features using Support Vector Machines

Alternative Splicing

Splicing: an important step during gene expression
Variable splicing process (alternative splicing): one gene -> many proteins

[Figure: one gene to many proteins. Alternative splicing of a gene's pre-mRNA yields several transcript isoforms, and hence several proteins.]

Page 5: Exploring Alternative Splicing Features using Support Vector Machines

Patterns of Alternative Splicing

Exon skipping (most frequent)
Alternative 5' splice sites
Alternative 3' splice sites
Intron retention
Mutually exclusive exons

[Figure: exon1 to exon4, labeled as constitutively spliced exons (CSE) or alternatively spliced exons (ASE)]

Here we focus on predicting alternatively spliced exons (ASE) and constitutively spliced exons (CSE) with an SVM.

Page 7: Exploring Alternative Splicing Features using Support Vector Machines

Identifying Alternative Splicing in genome

Wet-lab experiments to find AS are time-consuming
Traditionally, AS is found by aligning ESTs to the genome (limited by the number of ESTs available for that genome)

[Figure: transcripts aligned to genomic DNA, with an alternative 3' exon and an exon-skipping event labeled]

Page 8: Exploring Alternative Splicing Features using Support Vector Machines

Identifying Alternative Splicing in genome

Wet-lab experiments to find AS are time-consuming
Traditionally, AS is found by aligning ESTs to the genome (limited by the number of ESTs available for that genome)
Use machine learning algorithms to predict AS at the genome level

Page 9: Exploring Alternative Splicing Features using Support Vector Machines

Problem Definition

Problem definition: given an exon, can we predict whether it is an alternatively spliced exon (ASE) or a constitutively spliced exon (CSE)?

[Figure: exon1 (CSE), exon2 (CSE), exon3 (ASE), exon4 (CSE)]

Page 10: Exploring Alternative Splicing Features using Support Vector Machines

Problem Addressed and Our Approach

Problem definition: predict alternatively spliced exons (ASE) vs. constitutively spliced exons (CSE)

Use a Support Vector Machine (SVM)
    Task: two-class (ASE and CSE) classification problem
    Need: training data set containing labeled examples (ASE & CSE)
    Learning: train the classifier on the training data
    Application: predict unknown ASEs

Need features to represent ASEs & CSEs
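The two-class setup on this slide can be illustrated with a small, self-contained sketch. The slides do not say which SVM software was used, so the code below is not the authors' implementation: it is a toy linear SVM without a bias term, trained with Pegasos-style sub-gradient steps on the hinge loss. The two-dimensional feature vectors, the parameter `lam` (which plays the inverse role of the cost C) and the function names are all hypothetical.

```python
def train_linear_svm(examples, labels, lam=0.1, epochs=50):
    """Train a linear SVM (no bias term) with Pegasos-style sub-gradient steps.

    examples: list of feature vectors (lists of floats)
    labels:   list of +1 (ASE) / -1 (CSE)
    lam:      regularization strength (larger lam ~ smaller cost C)
    """
    w = [0.0] * len(examples[0])
    t = 0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            # Shrink w (regularization term of the objective) ...
            w = [(1.0 - eta * lam) * wi for wi in w]
            # ... then move toward examples that violate the margin.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 (ASE) or -1 (CSE) for feature vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Hypothetical two-feature toy data: +1 = ASE, -1 = CSE.
X = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y)
predictions = [predict(w, x) for x in X]
```

In practice one would use an established SVM library and real feature vectors; the sketch only shows the train-then-predict workflow named on the slide.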

Page 15: Exploring Alternative Splicing Features using Support Vector Machines

Data Set

Published data set from the model organism C. elegans (worm)
Includes alternatively spliced exons (ASE) and constitutively spliced exons (CSE)
Contains 487 ASEs and 2531 CSEs
100-base local sequences around splice sites

Example of data set

ASE GTACTATAGCGTGCTG....ACCGTTCGTACTCGCT
ASE ATACTATAGCGTCTTG....ACCGATCGTACACGCT
CSE GTACTATAGCGTCTTG....ACCGATCGTACTCGCT

[Figure: an exon flanked by its AG and GT splice sites, with a window from −100 to +100 marked around each splice site]

Page 18: Exploring Alternative Splicing Features using Support Vector Machines

Data Set

Published data set from the model organism C. elegans (worm)
Includes alternatively spliced exons (ASE) and constitutively spliced exons (CSE)
Contains 487 ASEs and 2531 CSEs
100-base local sequences around splice sites

Previous work:
    Motifs captured and identified by kernel (G. Rätsch et al.)
    Length of exons and flanking introns (Sorek et al.)

Our work:
    Exploit more biologically significant features
    Use several additional approaches to derive features

Page 20: Exploring Alternative Splicing Features using Support Vector Machines

Feature List

Several features known to be biologically important:

Strength of splice sites (SSS)
Motif features
    Intronic splicing regulator (ISR)
    Motifs derived from local sequences (MAST)
    Exonic splicing enhancer (ESE)
Reduced set of motif features based on locations of motifs on secondary structure (MAST-R)
Optimal folding energy (OPE)
Basic sequence features (BSF)

Page 21: Exploring Alternative Splicing Features using Support Vector Machines

SSS: Strength of Splice Site

We consider all splice sites.

$\text{score} = \sum_{i} \log F(X_i)$

where $X \in \{A, U, G, C\}$, with $i \in \{-3, +7\}$ for 3' splice sites (3'ss) and $i \in \{-26, +2\}$ for 5' splice sites (5'ss).

[Figure: aligned example splice sites around an exon (CGAG exon AGGTAAGT, CGAG exon AGGTAAGT, GGAG exon AGGTAGGT, CGAG exon AGGTTAGT, CCAG exon AGGTAAGT), with the 3'ss and 5'ss position windows marked]
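The score above can be sketched in a few lines. The slide does not define F explicitly; the code assumes F(X_i) is the frequency of base X at position i among aligned splice sites, estimated here with a pseudocount so that unseen bases do not produce log(0). The pseudocount, the function names and the use of the five 5' splice-site strings from the figure as a "training set" are assumptions made for illustration. The sequences are written with T (DNA letters), as in the data set examples.

```python
import math

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies F from equal-length aligned splice sites."""
    freqs = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        total = len(column) + pseudocount * len(BASES)
        freqs.append({b: (column.count(b) + pseudocount) / total for b in BASES})
    return freqs


def splice_site_score(site, freqs):
    """score = sum over positions i of log F(X_i)."""
    return sum(math.log(freqs[i][base]) for i, base in enumerate(site))


# 5' splice-site windows as they appear in the slide's figure.
five_prime_sites = ["AGGTAAGT", "AGGTAAGT", "AGGTAGGT", "AGGTTAGT", "AGGTAAGT"]
F = position_frequencies(five_prime_sites)
consensus_score = splice_site_score("AGGTAAGT", F)
variant_score = splice_site_score("AGGTAGGT", F)
```

A site that matches the most frequent base at every position gets the highest (least negative) score, so `consensus_score` is larger than `variant_score`.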

Page 22: Exploring Alternative Splicing Features using Support Vector Machines

Motif: a sequence pattern that occurs repeatedly in a group of sequences

Intronic Splicing Regulator (ISR): identified in Kabat et al.
MAST: derived by MEME using the [-100,+100] sequence
Exonic Splicing Enhancers (ESE): based on two assumptions

[Figure: illustration of ISR motifs dispersed among sequences around an exon]

Page 23: Exploring Alternative Splicing Features using Support Vector Machines

[Figure: example of a 20-base motif derived from sequences around splice sites]

Page 24: Exploring Alternative Splicing Features using Support Vector Machines

Exonic Splicing Enhancers (ESE): based on two assumptions
    more frequent in exons than in introns
    more frequent in exons with weak splice sites than in exons with strong splice sites

[Figure: ISR, MAST and ESE motifs dispersed among exons and introns]
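One simple way to turn motifs into SVM input is to count how often each motif occurs in a sequence. The sketch below uses exact string matching, which is a simplification: MEME motifs are position-specific profiles and MAST scores matches against them, so this is not how the slides' MAST features were computed. The motifs, the sequence and the function names are hypothetical.

```python
def count_occurrences(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    k = len(motif)
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == motif)


def motif_features(sequence, motifs):
    """One count per motif, in the order given, as a feature vector."""
    return [count_occurrences(sequence, m) for m in motifs]


# Hypothetical motifs and sequence, for illustration only.
motifs = ["AGAG", "GTAC", "TTT"]
features = motif_features("AGAGAGGTACTT", motifs)
```

Here `features` holds one entry per motif; each exon's vector can then be concatenated with its other features.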

Page 25: Exploring Alternative Splicing Features using Support Vector Machines

Pre-mRNA secondary structures influence exon recognition

Motifs are located at different structures (loop, stem)
Secondary structure: derived from Mfold
Filter motifs using secondary structure
Optimal Folding Energy: stability of RNA secondary structure

[Figure: predicted secondary structure of the example sequence AUCCAUGGGCCGGAUGUGACGGUAGUAGGGUAUACGUCACAUAGGCUUCCUCUCAUGA, with loop and stem regions labeled]
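Filtering motifs by where they fall on the secondary structure can be sketched as follows. The slides do not state the filtering rule or the structure format; the code assumes the predicted structure has been converted to a dot-bracket string ('.' for unpaired bases, '(' and ')' for paired bases) and, as an arbitrary choice for illustration, keeps only motif sites that lie entirely in loops. All names are hypothetical.

```python
def region_type(structure, start, length):
    """Classify a motif site as 'loop', 'stem', or 'mixed'.

    structure: dot-bracket string ('.' = unpaired, '(' or ')' = paired)
    start:     0-based start of the motif in the sequence
    length:    motif length
    """
    window = structure[start:start + length]
    if all(c == "." for c in window):
        return "loop"
    if all(c in "()" for c in window):
        return "stem"
    return "mixed"


def keep_loop_motifs(structure, sites):
    """Keep only (start, length) motif sites that lie entirely in loops."""
    return [s for s in sites if region_type(structure, s[0], s[1]) == "loop"]


# Hypothetical hairpin: 4-base stem, 4-base loop, 4-base stem.
structure = "((((....))))"
kept = keep_loop_motifs(structure, [(0, 4), (4, 4), (2, 4)])
```

Swapping the condition in `keep_loop_motifs` to "stem" would give the opposite filter; which one is appropriate depends on the biological hypothesis being tested.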

Page 27: Exploring Alternative Splicing Features using Support Vector Machines

GC content (G & C ratio), $\frac{G + C}{A + U + G + C}$, a characteristic of the sequence
Sequence length
    Length of exons and length of the exons' flanking introns
Frames of stop codons

Summary of features

Motif features
Secondary structure
Strength of splice sites
Sequence features
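The basic sequence features listed above are simple to compute. The sketch below shows GC content and one possible reading of "frames of stop codons": for each of the three forward reading frames, whether the frame contains a stop codon. That reading, the function names and the example sequences are assumptions, not taken from the slides. `gc_content` assumes the sequence contains only A, C, G and T/U.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def gc_content(seq):
    """Fraction of G and C among all bases: (G + C) / (A + U + G + C)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def stop_codon_frames(seq):
    """For each forward reading frame (0, 1, 2), report whether it has a stop codon."""
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA letters
    frames = []
    for offset in range(3):
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(any(c in STOP_CODONS for c in codons))
    return frames


gc = gc_content("GGCCAATT")
frames = stop_codon_frames("ATGTAAGG")
```

Sequence length features are just `len(exon)` and the lengths of the flanking introns, so they need no helper.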

Page 28: Exploring Alternative Splicing Features using Support Vector Machines

Experimental Design

List of previously defined features as SVM input
Combination of different features to represent ASEs & CSEs
Tune SVM parameters during training (kernel: linear, RBF, ...; cost C)
5-fold cross-validation
Choose the parameters with the best cross-validation (CV) accuracy
Test the trained SVM on the testing ASEs & CSEs

[Figure: data divided into split1 to split5, with 20% and 80% portions marked]
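The cross-validation and parameter selection steps above can be sketched without any SVM library. The code builds five train/test index splits and picks the cost value with the best CV score. The CV accuracies are made-up placeholders standing in for real SVM runs, and the function names are hypothetical; the slides' own splits come with the published data set rather than from code like this.

```python
def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def select_cost(candidates, cv_score):
    """Return the candidate C with the highest cross-validation score."""
    return max(candidates, key=cv_score)


splits = list(k_fold_indices(10, k=5))

# Hypothetical CV accuracies per cost value, standing in for real SVM runs.
fake_cv_accuracy = {0.01: 0.84, 0.05: 0.88, 0.1: 0.86, 1.0: 0.80}
best_c = select_cost(fake_cv_accuracy, fake_cv_accuracy.get)
```

With k = 5, each fold holds out 20% of the examples and trains on the remaining 80%. The selected cost is then used to train a final model that is evaluated once on held-out test data.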

Page 33: Exploring Alternative Splicing Features using Support Vector Machines

Experimental results

Results of alternatively spliced exon classification. All features, including ISR motifs, are used.

                  Cross-validation score     Test score
          C       fp 1%      AUC %           fp 1%      AUC %
Split1    0.05    32.45      86.55           56.48      90.05
Split2    0.05    39.33      88.32           52.04      89.04
Split3    0.1     37.56      87.76           38.71      87.97
Split4    0.01    40.86      89.02           37.63      84.42
Split5    0.1     36.48      87.50           35.79      85.69
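The two columns reported in the table can be computed from classifier scores as sketched below. AUC is computed by pairwise comparison of positive and negative scores. The "fp 1%" column is not defined on the slide; the code assumes it means the true positive rate achieved at a false positive rate of at most 1%. The scores, labels and function names are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """Highest true positive rate reachable with false positive rate <= max_fpr."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l != 1]
    best = 0.0
    for thr in set(scores):  # predict ASE when score >= thr
        fpr = sum(n >= thr for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= thr for p in pos) / len(pos))
    return best


# Hypothetical SVM outputs: label 1 = ASE, 0 = CSE.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
toy_auc = auc(scores, labels)
toy_tpr = tpr_at_fpr(scores, labels, max_fpr=0.01)
```

Both functions are quadratic in the number of examples, which is fine for a data set of a few thousand exons but would be replaced by a sort-based version for larger ones.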

Page 34: Exploring Alternative Splicing Features using Support Vector Machines

Experimental results

[Figure: ROC curves plotted against False Positive Rate. Legend: Mixed-Feas (85.55%), Base-Feas (78.78%)]

Comparison of ROC curves obtained using basic features only and basic features plus other mixed features (except conserved ISR motifs). Models trained using 5-fold CV with C = 1.

Page 35: Exploring Alternative Splicing Features using Support Vector Machines

Experimental results

[Figure: AUC score comparison between data sets with secondary structural features and data sets without secondary structural features]

Page 36: Exploring Alternative Splicing Features using Support Vector Machines

Motif Evaluation

Intersection between motifs derived from sequences & intronic splicing regulators

Page 37: Exploring Alternative Splicing Features using Support Vector Machines

Motif Evaluation

Conserved ESE in metazoans (animals), Human and Mouse

Page 38: Exploring Alternative Splicing Features using Support Vector Machines

Motif Evaluation

Comparison with A. thaliana

Page 39: Exploring Alternative Splicing Features using Support Vector Machines

Conclusions

Alternative splicing (AS) events can be found using transcripts
Machine learning can be used effectively to predict AS events
Identified features that are informative in predicting AS
Explored comparatively comprehensive feature sets from a biological point of view

Page 43: Exploring Alternative Splicing Features using Support Vector Machines

Future Work

Apply this approach to specific organisms
Identify motifs more accurately
Refine relationships between features (secondary structure and motifs)
Learn other types of AS events (not only skipped exons)

adapted from "Detection of Alternative Splicing Events Using Machine Learning"

Page 44: Exploring Alternative Splicing Features using Support Vector Machines

Thank you for your attention!

Questions?

Related work
    RASE http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE

Acknowledgement
    Data set from Dr. Rätsch's FML group http://www.fml.tuebingen.mpg.de/raetsch/projects/RASE/altsplicedexonsplits.tar.gz
    Dr. Caragea's MLB group http://people.cis.ksu.edu/~dcaragea/mlb
    Dr. Brown's Bioinformatics Center at KSU http://bioinformatics.ksu.edu