Mismatch String Kernels for SVM
Protein Classification
by C. Leslie, E. Eskin, J. Weston, W.S. Noble
Athina Spiliopoulou
Morfoula Fragopoulou
Ioannis Konstas
Outline
• Definitions & Background
  • Proteins
  • Remote Homology Detection
  • SVMs
• Inside the algorithm
  • Feature mapping
  • Mismatch tree data structure
  • Mismatch tree traversal
• Computational Efficiency
• Experiments
• Discussion
Proteins
Primary structure: amino acid sequence
Secondary Structure
3D Structure
• The same amino-acid sequence almost always folds into the same 3D structure
Homologues, Remote Homologues
• Amino acid sequences are subject to mutation
• Structures serving important biological functions are highly conserved
• Homologues: share the same ancestor + sequence similarity > 30%
• Remote Homologues: share the same ancestor + sequence similarity < 30%
Protein Classification
[Diagram: nested sequence classes — a superfamily contains families; sequences in the same family are homologues, sequences in the same superfamily but different families are remote homologues, and sequences outside the superfamily are non-homologues]
• Homology Detection: Classify sequences into families
• Remote Homology Detection: Classify sequences into superfamilies
Remote Homology Detection
• Data available: amino-acid sequences
• Remote Homology Detection: great challenge due
to low sequence similarity
• Previous methods (generative models):
  • pairwise sequence alignment
  • profiles for protein families
  • consensus patterns using motifs
  • profile Hidden Markov Models
• SVM-Fisher: breakthrough for remote homology
detection
SVMs in Remote Homology Detection
• Discriminative classifiers that learn linear decision
boundaries
• Explicitly model difference between positive and
negative examples
• Behave and generalise well with sparse data
• Input data can be mapped to a feature space
• Kernel Trick
– Explicit calculation of feature vectors can be
avoided
Feature Mapping
• Amino acid alphabet A: l = 20 symbols
• k-mer: a k-length subsequence in a protein sequence
• Feature Space: the l^k-dimensional vector space indexed by the set of all possible k-mers from A
Feature Mapping (cont.)
Alphabet A = (A, V, L), k = 3
Sequence: … A A L A A V …
k-mers: AAL  ALA  LAA  AAV
Feature vector, indexed by AAA AAL AAV ALA AVA LAA VAA …:
                            0   1   1   1   0   1   0  …
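The exact-match (m = 0) case of this mapping can be sketched in a few lines of Python; the function name and layout are my own, for illustration only:

```python
from itertools import product

def kmer_feature_vector(seq, alphabet, k):
    """Exact-match (m = 0) feature map: count every k-mer of `seq`,
    indexed lexicographically over all l^k possible k-mers."""
    index = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    counts = {kmer: 0 for kmer in index}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return index, [counts[kmer] for kmer in index]

# the slide's example: the 3-mers AAL, ALA, LAA, AAV each occur once
index, vec = kmer_feature_vector("AALAAV", "AVL", k=3)
```

With A = (A, V, L) and k = 3 the vector has 3^3 = 27 coordinates, matching the l^k-dimensional feature space of the previous slide.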
Mismatch String Kernel
• Allows for mutations
• Mismatch neighbourhood N_(3,1)(α) of the 3-mer α = “AAL” (k = 3, m = 1):
  AAL  AAV  AAA  LAL  VAL  AVL  ALL
• The feature mapping of a k-mer α is given by:
  Φ_(k,m)(α) = (φ_β(α))_{β ∈ A^k},
  where φ_β(α) = 1 if β belongs to the neighbourhood N_(k,m)(α), and 0 otherwise
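The mismatch neighbourhood can be enumerated directly; a minimal sketch (my own helper, not the authors' code):

```python
def mismatch_neighbourhood(kmer, alphabet, m):
    """All k-mers over `alphabet` differing from `kmer` in at most
    m positions, i.e. the (k, m)-mismatch neighbourhood N_(k,m)(kmer)."""
    neighbours = {kmer}
    for _ in range(m):
        # substitute one more position with every alphabet symbol
        neighbours |= {nb[:i] + a + nb[i + 1:]
                       for nb in neighbours
                       for i in range(len(nb))
                       for a in alphabet}
    return neighbours

n31 = mismatch_neighbourhood("AAL", "AVL", m=1)
```

For the slide's example this yields exactly the seven 3-mers listed: AAL itself plus the six single-substitution variants.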
Mismatch String Kernel (cont.)
• The feature mapping of sequence x is given by:
  Φ_(k,m)(x) = Σ_{k-mers α in x} Φ_(k,m)(α)
• The (k,m)-mismatch kernel is given by:
  K_(k,m)(x, y) = ⟨Φ_(k,m)(x), Φ_(k,m)(y)⟩
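These two definitions can be turned into a brute-force reference implementation: build sparse feature vectors via the neighbourhoods and take their inner product. This is only a naive check, not the paper's efficient tree algorithm; all names are mine:

```python
from collections import Counter

def neighbourhood(kmer, alphabet, m):
    # all k-mers differing from `kmer` in at most m positions
    nbs = {kmer}
    for _ in range(m):
        nbs |= {nb[:i] + a + nb[i + 1:]
                for nb in nbs for i in range(len(nb)) for a in alphabet}
    return nbs

def feature_map(seq, alphabet, k, m):
    # sparse Phi_(k,m)(seq): every k-mer of seq contributes 1 to each
    # coordinate in its mismatch neighbourhood
    phi = Counter()
    for i in range(len(seq) - k + 1):
        phi.update(neighbourhood(seq[i:i + k], alphabet, m))
    return phi

def mismatch_kernel(x, y, alphabet, k, m):
    # K_(k,m)(x, y) = <Phi_(k,m)(x), Phi_(k,m)(y)>
    px = feature_map(x, alphabet, k, m)
    py = feature_map(y, alphabet, k, m)
    return sum(px[g] * py[g] for g in px)
```

For x = "AAL" and m = 1 over A = (A, V, L), K(x, x) counts the 7 neighbourhood coordinates of the previous slide.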
Mismatch Tree – An Efficient Data Structure
• Representation of feature space as a tree
• Depth of tree: k
• Number of branches of each internal node: |A| = l
• Label of each branch: a symbol from A
Mismatch Tree – An Efficient Data Structure (cont.)
Alphabet A = (A, V, L), k = 3
[Diagram: a depth-3 tree whose branches are labelled A, V, L at every level; internal nodes correspond to prefixes of k-mers (A, V, L; AA, AV, AL; …), leaf nodes to fixed k-mers (AAA, AAV, AAL, …)]
Mismatch Tree – Traversal (DFS)
Sequence: AALA, k = 3, m = 1
[Diagram: depth-first traversal of the mismatch tree, tracking for each k-mer instance of the sequence the number of mismatches accumulated so far; instances exceeding m mismatches are dropped]
At each leaf node, the kernel matrix is updated:
K(x, y) ← K(x, y) + count(x) · count(y)
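The traversal on this slide can be written as a recursive DFS over an implicit tree: each active instance is a (sequence, offset, mismatch-count) triple, branches where no instance has ≤ m mismatches are pruned, and the kernel matrix is updated at each surviving leaf. This is my own illustrative reconstruction, not the authors' implementation:

```python
from collections import Counter

def mismatch_kernel_matrix(seqs, alphabet, k, m):
    """Compute all K(x, y) at once via DFS over the implicit mismatch tree."""
    K = [[0.0] * len(seqs) for _ in seqs]
    # at the root: every k-mer instance of every sequence, 0 mismatches so far
    root = [(s, o, 0) for s, seq in enumerate(seqs)
            for o in range(len(seq) - k + 1)]

    def dfs(depth, instances):
        if not instances:
            return                      # dead branch: prune
        if depth == k:                  # leaf: a fixed k-mer
            counts = Counter(s for s, _, _ in instances)
            for x in counts:            # K(x, y) += count(x) * count(y)
                for y in counts:
                    K[x][y] += counts[x] * counts[y]
            return
        for a in alphabet:              # descend along the branch labelled a
            nxt = [(s, o, mm + (seqs[s][o + depth] != a))
                   for s, o, mm in instances
                   if mm + (seqs[s][o + depth] != a) <= m]
            dfs(depth + 1, nxt)

    dfs(0, root)
    return K
```

On the toy pair ("AAL", "AAV") with m = 1 this reproduces the brute-force kernel values: 7 on the diagonal, 3 off it.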
Efficiency – Space Complexity
• No need to store the entire tree
  • For k = 7: 1.28 billion nodes!
• No need to store all feature vectors
  • Kernel trick!
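The 1.28-billion figure is just the number of leaves, l^k, for the 20-letter amino-acid alphabet at k = 7:

```python
l, k = 20, 7        # amino-acid alphabet size, k-mer length
leaves = l ** k     # leaf nodes (fixed k-mers) in the full tree
# 20^7 = 1,280,000,000 — far too many nodes to materialise explicitly
```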
Efficiency – Time Complexity
• A fixed k-mer α has O(k^m l^m) k-mers in its neighbourhood
• N = Mn, where
  • M: number of sequences
  • n: the length of each sequence
  • N: total length of the dataset
• Whole dataset: O(N k^m l^m) k-mer instances
• Worst case: O(M^2) updates to the kernel matrix at each leaf
• Overall running complexity: O(M^2 n k^m l^m)
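The O(k^m l^m) neighbourhood bound can be checked against the exact count, Σ_{i=0}^{m} C(k, i)·(l − 1)^i: choose i positions to mutate, each to one of the l − 1 other symbols. A quick sketch (my own, for illustration):

```python
from math import comb

def neighbourhood_size(k, l, m):
    # exact number of k-mers within m mismatches of a fixed k-mer
    return sum(comb(k, i) * (l - 1) ** i for i in range(m + 1))

# k = 3, l = 3, m = 1 gives the 7 neighbours of the earlier "AAL" example;
# k = 7, l = 20, m = 1 gives 1 + 7*19 = 134, within the O(k^m l^m) bound
```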
System Pipeline
• Training Phase
– Compute the kernel matrix for all the training
sequences
– Normalize (divide by the length of the vectors)
– Train the SVM classifier
– Compute and store the k-mer scores of the
Support Vectors
• Testing Phase
– Compute the feature vector for each test datum
and predict its class in linear time
f(x) = Σ_{i=1}^{r} a_i y_i ⟨Φ_(k,m)(x_i), Φ_(k,m)(x)⟩ + b
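The linear-time testing phase follows from storing, per k-mer γ, the weight w_γ = Σ_i a_i y_i Φ_γ(x_i) computed from the support vectors; a test sequence is then scored by summing the stored weights over the neighbourhoods of its own k-mers. A hedged sketch of that pipeline step (helper names are my own):

```python
from collections import Counter

def neighbourhood(kmer, alphabet, m):
    # all k-mers differing from `kmer` in at most m positions
    nbs = {kmer}
    for _ in range(m):
        nbs |= {nb[:i] + a + nb[i + 1:]
                for nb in nbs for i in range(len(nb)) for a in alphabet}
    return nbs

def kmer_scores(svs, alphas, ys, alphabet, k, m):
    """Training phase: w[g] = sum_i a_i y_i Phi_g(x_i), computed once."""
    w = Counter()
    for sv, a, y in zip(svs, alphas, ys):
        for i in range(len(sv) - k + 1):
            for g in neighbourhood(sv[i:i + k], alphabet, m):
                w[g] += a * y
    return w

def predict(x, w, b, alphabet, k, m):
    # f(x) = sum_g w[g] * Phi_g(x) + b — touches only the k-mers of x,
    # so the cost is linear in the length of the test sequence
    score = b
    for i in range(len(x) - k + 1):
        for g in neighbourhood(x[i:i + k], alphabet, m):
            score += w[g]
    return score
```

With a single support vector "AAL" (a = y = 1, b = 0) the score of a test sequence equals its mismatch kernel value against "AAL", as expected from the decision function above.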
Experiments
• Benchmark dataset designed by Jaakkola et al. from the SCOP database
• 33 Families
[Diagram: for each family, the SCOP superfamily/family hierarchy is split into positive train, positive test, and negative train sets]
Experiments (cont.)
• Comparison to other methods:
  • PSI-BLAST (mainly used for homology detection)
  • SAM-T98
  • Fisher-SVM (the state-of-the-art)
Discussion
• Mismatch-SVM performs as well as the Fisher-SVM method
• Mismatch-SVM is much more efficient
• Efficiency is an important issue for:
  • Large real-world datasets
  • Multi-class prediction
• Accuracy can be increased by incorporating biological knowledge