SVM: Non-coding Neutral Sequences Vs Regulatory Modules
Ying Zhang, BMB, Penn State
Ritendra Datta, CSE, Penn State
Bioinformatics – I, Fall 2005
Outline
Background: Machine Learning & Bioinformatics
Data Collection and Encoding
Distinguish sequences using SVM
Results
Discussion
Regulation: A Recurring Challenge
Expression of genes is under regulation: right protein, right time, right amount, right location…
Regulation: cis-elements vs trans-elements. Cis-element: non-coding functional sequence. Trans-element: protein that interacts with a cis-element.
Predicting cis-regulatory elements remains a challenge: significant effort in the past; current trends include TFBS clusters and pattern analysis.
Alignments and Sequences: The Data
Sequence information: genetic information encoded in the DNA sequence. Typical information: codons, binding sites, …
Codons: ATG (Met), CGT (Arg), … Binding sites: A/TGATAA/G (Gata1), …
Evolutionary information: aligned sequences capture similarity between species. Conservation ~ Function.
Human: TCCTTATCAGCCATTACC Mouse: TCCTTATCAGCCACCACC
Problem
Given genome sequence information, is it possible to automatically distinguish regulatory regions from other non-coding neutral genomic sequences using machine learning?
Machine Learning: The Tool
Sub-field of A.I.: computer programs “learn” from experience, i.e. by analyzing data and the corresponding behavior
Confluence of Statistics, Mathematical Logic, Numerical Optimization
Applied in Information Retrieval, Financial Analysis, Computer Vision, Speech Recognition, Robotics, Bioinformatics, etc.
[Diagram: M.L. at the confluence of Statistics, Optimization, and Logic; example applications: Analyzing Stocks, Personalized WWW Search]
Machine Learning: Types of Learning
Supervised Learning: learning statistical models from past sample-label data pairs, e.g. classification
Unsupervised Learning: building models to capture the inherent organization in data, e.g. clustering
Reinforcement Learning: building models from interactive feedback on how well the current model is doing, e.g. robotic learning
Machine Learning and Bioinformatics: The Confluence
Learning problems in Bioinformatics [ICML ’03]:
Protein folding and protein structure prediction
Inference of genetic and molecular networks
Gene-protein interactions
Data mining from microarrays
Functional and comparative genomics, etc.
Machine Learning and Bioinformatics: Sample Publications
Identification of DNaseI hypersensitive sites in the human genome (may disclose the location of cis-regulatory sequences): W.S. Noble et al., “Predicting the in vivo signature of human gene regulatory sequences,” Bioinformatics, 2005.
Functionally classifying genes based on gene expression data from DNA microarray hybridization experiments using SVMs: M.P.S. Brown et al., “Knowledge-based analysis of microarray gene expression data by using support vector machines,” PNAS, 2000.
Using log-odds ratios from Markov models for identifying regulatory regions in DNA sequences: L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003.
Selection of informative genes using an SVM-based feature selection algorithm: I. Guyon et al., “Gene selection for cancer classification using support vector machines,” Machine Learning, 2002.
Machine Learning and Bioinformatics: Books
Support Vector Machines: A Powerful Statistical Learning Technique
Which of the linear separators is optimal?
Choose the one that maximizes the margin between the classes
[Figure: maximum-margin hyperplane, with slack variables ξi for points violating the margin]
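The margin-maximization criterion corresponds to the standard soft-margin SVM formulation (a textbook statement, not taken from the slides; the ξi are the slack variables from the figure, and C is the penalty parameter tuned later):

```latex
\min_{w,\,b,\,\xi}\ \ \frac{1}{2}\lVert w\rVert^2 \;+\; C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b)\ \ge\ 1-\xi_i,\qquad \xi_i \ge 0 .
```

Maximizing the margin is equivalent to minimizing ‖w‖; the slack variables ξi let individual points violate the margin at a cost controlled by C.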
The classes in these datasets are easily separated linearly. What about datasets that are not linearly separable? Solution: the kernel trick!
[Figure: a 1-D dataset that is not linearly separable in x becomes separable after mapping to (x, x²)]
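The kernel trick can be illustrated with a small sketch (my own toy example, not from the slides): a 1-D dataset with one class in the middle cannot be split by any threshold on x, but becomes linearly separable after the explicit map x → (x, x²) suggested by the figure.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-D dataset: class 1 sits in the middle, class 0 on both sides,
# so no single threshold on x separates them.
x = np.array([-3.0, -2.5, -2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 2.5, 3.0])
y = (np.abs(x) < 1.5).astype(int)

# A linear SVM on the raw 1-D feature cannot separate the classes.
linear = SVC(kernel="linear").fit(x.reshape(-1, 1), y)
print(linear.score(x.reshape(-1, 1), y))  # prints a value below 1.0

# After the explicit map x -> (x, x^2), a line (a threshold on x^2)
# separates the classes perfectly.  An RBF or polynomial kernel does
# this implicitly, without ever computing the mapped coordinates.
mapped = np.column_stack([x, x**2])
quad = SVC(kernel="linear").fit(mapped, y)
print(quad.score(mapped, y))  # 1.0
```

Kernels evaluate inner products in the mapped space directly, which is what makes very high-dimensional (even infinite-dimensional) feature maps tractable.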
Experiments: Overview
Classification in Question: Regulatory regions (REG) vs Ancestral Repeats (AR)
Two types of experiments: nucleotide sequences (ATCG) and alignments in a reduced 5-symbol alphabet (SWVIG)
(S: match involving G & C; W: match involving A & T; G: gap; V: transversion; I: transition)
Two datasets: the Elnitski et al. dataset and a dataset from the Penn State CCGB
Experiments: Feature Selection
Mapping sequences/alignments → real numbers: frequencies of short K-mers (K = 1, 2, 3)
Normalizing factor: sequence length (ambiguous for K > 1)
Stability of variance: equal-length sequences (whenever possible)
Total number of features: sequences 4 + 4² + 4³ = 84; alignments 5 + 5² + 5³ = 155
Relatively high dimensionality. Curse of dimensionality: convergence of estimators is very slow. Over-fitting: poor generalization performance.
Solutions: dimension reduction (e.g. PCA) and feature selection (e.g. forward selection, backward elimination)
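The k-mer frequency encoding described above can be sketched as follows (a sketch, not the authors' code; the alphabet ordering and the window-count normalization are my assumptions). The feature counts come out to 4 + 4² + 4³ = 84 for DNA and 5 + 5² + 5³ = 155 for the 5-symbol alignment alphabet, matching the slides:

```python
from itertools import product

def kmer_features(seq, alphabet="ATGC", max_k=3):
    """Map a sequence to normalized k-mer frequencies for k = 1..max_k.

    Features are ordered as all 1-mers, then all 2-mers, then all
    3-mers, giving 84 features for DNA and 155 for the 5-symbol
    alignment alphabet.
    """
    feats = []
    for k in range(1, max_k + 1):
        n_windows = max(len(seq) - k + 1, 1)  # normalizing factor
        for kmer in product(alphabet, repeat=k):
            kmer = "".join(kmer)
            count = sum(1 for i in range(len(seq) - k + 1)
                        if seq[i:i + k] == kmer)
            feats.append(count / n_windows)
    return feats

print(len(kmer_features("ATGCATGC")))                      # 84
print(len(kmer_features("SWVIG" * 20, alphabet="SWVIG")))  # 155
```

Normalizing by the number of windows (rather than raw sequence length) is one way to resolve the K > 1 ambiguity the slide mentions.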
Experiments: Training and Validation
Training set: the Elnitski et al. dataset. Sequences: 300 samples of 100 bp from each class (REG and AR). Alignments: 300 samples of length 100 from each class.
SVM setup: RBF kernel k(x1, x2) = exp( −δ ‖x1 − x2‖² ). Implementation: LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/)
Validation: N-fold cross-validation, used in feature selection, parameter tuning, and testing
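The setup above can be sketched with scikit-learn, whose SVC class wraps LibSVM, the implementation used on the slides. The random feature matrix here is a stand-in for the real 84-dimensional k-mer data (an assumption for illustration only); the δ = 1.6 and C = 1.5 values are the parameters chosen later on the slides:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for the real data: 300 REG and 300 AR samples with 84
# k-mer frequency features each (random here, purely illustrative).
X = rng.random((600, 84))
y = np.repeat([0, 1], 300)  # 0 = AR, 1 = REG

# RBF kernel k(x1, x2) = exp(-gamma * ||x1 - x2||^2); gamma plays the
# role of delta on the slides.
clf = SVC(kernel="rbf", gamma=1.6, C=1.5)

# N-fold (here 5-fold) cross-validation accuracy.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())  # chance-level on this random stand-in data
```

On real k-mer features the same call reports the cross-validated accuracy used for feature selection, parameter tuning, and testing.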
Results: The Elnitski et al. dataset
Parameter selection: SVM parameters δ and C
Feature selection: assessing feature importance; G-C normalization; sequences: 10 out of 84; symbols: 10 out of 155
Accuracy scores: overall, Ancestral Repeats (AR), Regulatory Regions (Reg)
Results: SVM Parameter Selection
Iterative selection procedure: coarse selection over an initial neighborhood, then fine-grained brute-force selection
Validation set drawn from the data; within-loop cross-validation
Chosen parameters: δ = 1.6, C = 1.5
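The coarse-then-fine search could look like the following sketch (the grids, data, and fold count are my assumptions, not the authors' values): a wide logarithmic grid first, then a brute-force grid in the neighborhood of the coarse optimum.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.random((120, 10))          # stand-in features
y = rng.integers(0, 2, size=120)   # stand-in labels

# Coarse pass: a wide logarithmic grid over gamma (delta on the
# slides) and C, scored by cross-validation.
coarse = GridSearchCV(SVC(kernel="rbf"),
                      {"gamma": [0.01, 0.1, 1.0, 10.0],
                       "C": [0.1, 1.0, 10.0, 100.0]},
                      cv=5).fit(X, y)
g0, c0 = coarse.best_params_["gamma"], coarse.best_params_["C"]

# Fine pass: brute force in the neighborhood of the coarse optimum.
fine = GridSearchCV(SVC(kernel="rbf"),
                    {"gamma": np.linspace(0.5 * g0, 2.0 * g0, 7),
                     "C": np.linspace(0.5 * c0, 2.0 * c0, 7)},
                    cv=5).fit(X, y)
print(fine.best_params_)
```

Running the fine grid only around the coarse winner keeps the total number of SVM trainings manageable.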
Results: Feature Selection - Sequence
Distribution of nucleotide frequencies of the top 9 most significant k-mers, chosen by one-dimensional SVMs
Results: Feature Selection - Symbol
Distribution of 5-symbol frequencies of the top 9 most significant k-mers, chosen by one-dimensional SVMs
Results: Feature Selection
Procedure: greedy forward selection + backward elimination
Chosen sequence features: [5 68 3 20 63 4 16 10 1 22] = [AT, CAA, C, AAA, GGC, AA, CA, TG, T, AAG]
(0 = A, 1 = T, 2 = G, 3 = C, 4 = AA, 5 = AT, etc.)
Chosen symbol features: [3 5 4 18 24 124 17 143 19 95 103] = [S, GG, I, WS, SI, SIG, WW, IWI, WI, WSV, WII]
(0 = G, 1 = V, 2 = W, 3 = S, 4 = I, 5 = GG, 6 = GV, etc.)
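The greedy forward-selection step can be sketched as follows (a sketch under my own assumptions, not the authors' code; backward elimination would analogously drop the feature whose removal hurts accuracy least). Each round adds the single remaining feature that most improves cross-validated SVM accuracy:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features, cv=5):
    """Greedy forward selection: repeatedly add the one feature that
    most improves the cross-validated accuracy of an RBF SVM."""
    chosen = []
    for _ in range(n_features):
        best_feat, best_score = None, -1.0
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            cols = chosen + [j]
            score = cross_val_score(SVC(kernel="rbf"),
                                    X[:, cols], y, cv=cv).mean()
            if score > best_score:
                best_feat, best_score = j, score
        chosen.append(best_feat)
    return chosen

rng = np.random.default_rng(2)
X = rng.random((60, 8))         # stand-in for the 84/155-dim k-mer features
y = rng.integers(0, 2, size=60)
print(forward_select(X, y, 3))  # indices of the 3 selected features
```

This is a wrapper method: the classifier itself scores each candidate feature set, at the cost of one cross-validation run per candidate.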
Results: Accuracy Scores

Experiment       | Type      | Overall Accuracy | Reg Precision | AR Precision
Elnitski et al.  | 5-symbol  | ≈ 74.7%          | 78.49%        | 73%
Elnitski et al.  | Hexamers  | ≈ 75%            | 81.4%         | 72.5%
Sequences only   | 1-mers    | 78.33%           | 76.54%        | 80.54%
Sequences only   | 2-mers    | 77.67%           | 72.84%        | 82.97%
Sequences only   | 3-mers    | 80.17%           | 83.67%        | 77.21%
Sequences only   | Selection | 80.33%           | 80.87%        | 79.63%
Symbols only     | 1-mers    | 84.33%           | 79.39%        | 90.03%
Symbols only     | 2-mers    | 84.33%           | 77.53%        | 90.96%
Symbols only     | 3-mers    | 85.17%           | 78.83%        | 92.42%
Symbols only     | Selection | 86.00%           | 80.58%        | 91.54%
Results: Laboratory Data
Training: SVM models built using the Elnitski et al. data, with the same parameters and the same selected features
Data: 9 candidate cis-regulatory regions predicted by RP score; 1 negative control (based on the definition); 5 of the 9 candidates passed current biological testing (positive)
Accuracy:
Classification result for sequences (1-, 2-, 3-mers): the 1 negative control, plus 4 out of 5 positive elements + 3 out of 4 “negative” elements
Classification result for alignments (1-, 2-, 3-mers): the 1 negative control and all 9 original candidates
Discussion
High validation rate for ancestral repeats: the structure of the selected training set is not very diverse, and ancestral repeats (LINEs, SINEs, etc.) tend to be AT-rich
SVM performs slightly better than RP scores on the training set; it is statistically more powerful
RP: a Markov model for pattern recognition. SVM: a hyperplane in a high-dimensional feature space
Feature selection using a wrapper method is possible
Discussion (cont’d)
Performance degradation in lab-data classification: no improvement in SVM classification compared to the RP score. Features identified from the Elnitski et al. data may carry some bias – other features may be more informative on the lab data.
Sequence vs alignment classification (see the accuracy table): SVM yields higher overall cross-validation accuracy for aligned symbol sequences than for nucleotide sequences, and the gained accuracy is ancestral-repeat driven.
On the lab data, however, there is no improvement for aligned symbol sequences; sequence classification does better than aligned symbol classification. No information gained from evolutionary history!
The alphabet reduction is not optimal – the assumption was wrong!
Summary
Generally, SVM is a powerful tool for classification: its performance is better than RP in distinguishing the AR training set from the Reg training set
SVM answers a “yes or no” question; RP is a probabilistic method that can generate quantitative, genome-wide measurements. SVM results can be extended using probabilistic forms of SVM.
SVM can reveal potentially interesting biological features, e.g. the transcription regulation scheme
Future Directions: Possible Extensions
Explore more complex features
Refine models for neutral non-coding genomic segments
Utilize multi-species alignments for the classification
Combine sequence and alignment information to build more robust multi-classifiers – a “Committee of Experts”
Pattern recognition for more accurate prediction
Questions and recommendations?
Use the original alignment features (20 columns).
Test SVM performance on other lab data, avoiding the possible bias of RP preselection.
References
L. Elnitski et al., “Distinguishing Regulatory DNA From Neutral Sites,” Genome Research, 2003.
Machine Learning Group, University of Texas at Austin, “Support Vector Machines,” http://www.cs.utexas.edu/~ml/.
N. Cristianini, “Support Vector and Kernel Methods for Pattern Recognition,” http://www.support-vector.net/tutorial.html.
Acknowledgements
Dr. Webb Miller, Dr. Francesca Chiaromonte, David King