Scalable Kernel Methods and Algorithms for General
Sequence Analysis
Pavel P. Kuksa
Department of Computer Science, Rutgers University
PhD Pre-Defense, 2010
Our Topic
• Classification of sequences:
  • inferring functions of proteins, species labels from DNA, music genre, or text topic
• Approach: similarity-based inference using kernel functions
• Pattern detection/extraction: e.g., motif discovery
[Diagram: application domains — audio/music (genre, artist, annotation), text, bioinformatics]
Problem-Solving Framework
• Kernel methods for sequences
• Infer class labels via sequence similarity measures (kernels): K(X, Y) = <F(X), F(Y)> (dot products of feature vectors)
  • measure similarity between two objects (e.g., documents, bio-sequences)
  • have a corresponding (dot-product) feature space F (what should F be for sequences?)
  • the problem is mapped from the original sequence space to a vector feature space
String Kernel Concept
GGAATTTGAGCAGGACTAATTGGAACTTCTTTAAGATTACTTATTCGAACTGAATTAGGAACCCCAGGATCTTTAATTGGAGATGATCAAATTTATAATACAAT
TGTTACAGCTCATGCATTTATTATAATTTTTTTTATAGTTATACCTATTATAATCGGAGGATTTGGAAATTGACTAGTTCCATTAATAATAGGTGCCCCAGATATAG
CTTTCCCCCGTATAAATAACATAAGATTTTGATTATTACCCCCATCTTTAACTTTATTAATTTCAAGAAGAATTGTTGAAAATGGGGCTGGTACAGGATGAACA
GTTTATCCCCCTCTTTCATCAAATATCGCCCATCAAGGAGCATCTGTTGATTTAGCAATTTTTTCCCTTCATCTTGCTGGTATTTCATCAATTCTTGGAGCTA
TTAATTTTATTACAACAATTATTAATATACGAATTAATAATTTATCTTTTGATCAAATACCATTATTTGTTTGAGCTGTAGGAATTACAGCATTATTATTATTACTTTC
ATTACCTGTTTTAGCAGGTGCTATTACTATATTATTAACAGATCGAAATTTAAATACTTCTTTTTTTGATCCTGCAGGAGGAGGAGATCCAATCTTATACCAACA
CTTATTT
Spectrum(5): GGAAT
Mismatch(5,1): GxAAT, xGAAT, GGxAT, GGAxT, GGAAx
Fig. 1. String kernel. A sliding window of size k=5 is used to derive feature representation φ(X) of an input sequence X.
structural families. A novel unsupervised "abstraction" strategy is applied to the classic string kernel to leverage supervised learning with the power of many unannotated protein sequences. The "abstraction" includes two stages:
• (1) An unsupervised auxiliary task is employed to learn accurate representations (embeddings) for amino acids based on contextual similarity in biological sequences.
• (2) Amino acids are grouped to generate more abstract entities according to their learned representations.
On three benchmark data sets for structural classification of proteins, the proposed semi-supervised kernel achieves state-of-the-art performance, improves over classic string kernels, and is more general than previous approaches.
2. Protein Sequence Analysis Using String Kernels
In this paper we focus on string kernel-based methods [6] for classification of protein sequences in the discriminative learning setting. The discriminative setting tries to capture the differences among classes (e.g., folds and superfamilies), while its counterpart, generative models [7], focuses on capturing shared characteristics within classes. Previous studies [8, 9] show that discriminative models have better distinction power than generative models [7].
Our target problems of remote protein homology and fold recognition essentially try to infer structural relationships between proteins from their primary sequences. These problems can be treated as tasks of learning over sequences of amino acids that typically describe the patterns of protein structural properties of interest. Given an input protein sequence, the target structural relationship is recognized by matching the patterns present in the test sequence against the patterns in the annotated training examples. This amounts to computing a string kernel (similarity score) between protein sequences based on certain feature representations. The key idea of string kernels is to apply a mapping φ(·) that maps sequences of arbitrary length into a vectorial feature space of fixed dimension. In this space a standard classifier such as a support vector machine (SVM) [6] can then be applied. As SVMs require only inner products between examples in the feature space, rather than the feature vectors themselves, the string kernel [10, 9, 4] is thus implicitly computed as an inner product in this feature space:
K(X, Y) = ⟨φ(X), φ(Y)⟩,   (1)
where X = (x₁, ..., x_|X|), |X| denotes the length of the sequence, and X, Y ∈ S, the set of input sequences.
[Figure: input sequence X mapped to feature vector F(X); similar spectra ⇒ high similarity; window length k=5, up to m=1 mismatch]
K(X, Y) = <F(X), F(Y)>
Three important questions:
1. Representation: what are the features?
2. Computational complexity: how efficient is the similarity evaluation? (quadratic? linear?)
3. Pattern extraction: what features are important/representative?
K(X, Y) = <F(X), F(Y)>
Main Results
• Sparse spatial sample embedding for sequences
  • multi-resolution representation of sequences
  • more efficient and accurate
• Unified computational framework for string kernels with inexact matching
  • generalizes computation and significantly improves the time-efficiency of many string kernel functions
• Algorithms for pattern (motif) extraction from sequence data
  • significantly improve search efficiency over existing algorithms
String Kernels
• Original string kernels
  • Pairwise-alignment algorithms (Needleman-Wunsch): not Mercer kernels [Vert et al. '04]
  • Pair HMMs [Watkins '99], convolution kernels [Haussler '99], gappy n-gram kernels [Lodhi et al. '02], rational kernels [Cortes et al. '02]
  • O(n²) complexity per pair in sequence length n
• Spectrum-like kernels
  • Spectrum kernels [Leslie '02], mismatch kernels [Leslie '04], substring kernels [Vishwanathan & Smola '02]
  • O(n) complexity per pair in sequence length n
  • Best-performing kernels have large multiplicative complexity factors (e.g., exponential dependency on alphabet size and kernel parameters)
• Accuracy and algorithmic complexity still insufficient for large-scale sequence comparison/annotation
Spectrum Kernels
• Measure similarity between sequences based on co-occurrence of fixed-length substrings (k-mers)
• Feature map for the exact spectrum kernel (no mismatches)
• Feature map for the mismatch kernel (m mismatches)
  • a denser representation (up to k^m|Σ|^m more dense)
Motivation | Our Results/Contributions | Summary
Basic Problems That We Studied | Previous Work
Spectrum-like kernels: measure similarity based on common contiguous fixed-length substrings (k-mers).
Feature map for the exact spectrum kernel (no mismatches):
Φ(X) = ( Σ_{α∈X, |α|=k} I(α, γ) )_{γ∈Σ^k}
sparse representation: e.g. HKY → (0 0 ... 0 1_(HKY) 0 ... 0)
Feature map for the mismatch kernel (m mismatches):
Φ(X) = ( Σ_{α∈X, |α|=k} I_m(α, γ) )_{γ∈Σ^k}
less sparse representation (up to k^m|Σ|^m more dense): e.g. HKY → (0 1_(AKY) ... 1_(HAY) ... 1_(HKA) 0 ... 1_(HKY) 0 ... 0)
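The two feature maps above can be rendered by brute force for tiny examples (a sketch under my own naming; it enumerates Σ^k explicitly, so it is exponential in k — real spectrum/mismatch kernels never materialize this space):

```python
from itertools import product

SIGMA = "ACGT"  # toy alphabet; proteins would use the 20 amino acids

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def feature_map(seq, k, m):
    """phi(X)[gamma] = sum over k-mers alpha of X of I_m(alpha, gamma);
    m=0 gives the sparse spectrum map, m>0 the denser mismatch map."""
    phi = {"".join(g): 0 for g in product(SIGMA, repeat=k)}
    for i in range(len(seq) - k + 1):
        alpha = seq[i:i + k]
        for gamma in phi:
            if hamming(alpha, gamma) <= m:
                phi[gamma] += 1
    return phi

spec = feature_map("ACG", k=3, m=0)  # only the coordinate ACG is nonzero
mism = feature_map("ACG", k=3, m=1)  # ACG plus its 9 one-substitution neighbors
```

The nonzero-coordinate counts (1 vs. 10 here) illustrate the "up to k^m|Σ|^m more dense" remark.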
Pavel Kuksa Kernel Methods and Algorithms for General Sequence Analysis
Profile Kernel
• Probabilistic string representation
• Replaces the n×1 string representation with an n×|Σ| profile
• Each character in a sequence is replaced with a probability distribution over alphabet characters
• Measures sequence similarity based on co-occurrence (in a probabilistic sense!) of k-mers
• State-of-the-art in protein remote homology
Profile can be estimated using PSI-BLAST, etc.
Profile feature map:
Φ(X) = ( Σ_{M_σ∈X} I(γ ∈ M_σ) )_{γ∈Σ^k},  M_σ = {a : P(a) > σ}  (mutation neighborhood)
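In the simplified per-position form stated on the slide (M_σ = {a : P(a) > σ}), the mutation neighborhood of a profile window can be sketched as below — a toy illustration with a hypothetical hand-made profile, not the PSI-BLAST-based pipeline of the actual profile kernel:

```python
from itertools import product

def kmer_neighborhood(profile_window, sigma):
    """All k-mers gamma whose character at each position j has P(gamma_j) > sigma.
    profile_window: list of k dicts {character: probability}."""
    allowed = [[a for a, p in pos.items() if p > sigma] for pos in profile_window]
    return {"".join(chars) for chars in product(*allowed)}

# toy 2-position profile (in the real setting, estimated by PSI-BLAST)
profile = [{"A": 0.7, "C": 0.2, "G": 0.1}, {"A": 0.1, "T": 0.9}]
print(sorted(kmer_neighborhood(profile, sigma=0.15)))  # → ['AT', 'CT']
```

Each sequence position then contributes counts to every k-mer in its neighborhood, rather than to a single observed k-mer.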
String Kernels: Summary
• All kernels share the same feature space, indexed by all possible substrings of length k (k-mers)
• Position or spatial arrangement of features is not retained (representation issue)
• Best-performing methods are computationally expensive (e.g., the mismatch kernel)
Spatial Representations
• New spatial representations and kernels for sequences
  • model sequences at multiple scales, preserving positional information
  • have time complexity O(cn), linear in the sequence length n with a small, alphabet-independent constant c
  • display state-of-the-art performance on protein, text, and music classification tasks
Sparse Spatial Sample Embedding
• Models multi-scale spatial/temporal dependencies
• Matching, comparison, classification more efficient: O(cn)
• Example: k=1, t=2, d=1-5
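For the t=2 ("double") case shown here, the features are pairs of k-mers together with their separation d; the t=3 case is analogous with triples. A minimal sketch (my own naming and indexing convention, where d=1 means adjacent samples):

```python
from collections import Counter

def double_features(seq, k=1, d_max=5):
    """Sparse spatial sample features for t=2: triples (a, d, b) where
    k-mers a and b occur separated by d positions, 1 <= d <= d_max."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        a = seq[i:i + k]
        for d in range(1, d_max + 1):
            j = i + k + d - 1            # d=1: b starts right after a
            if j + k <= len(seq):
                feats[(a, d, seq[j:j + k])] += 1
    return feats

def spatial_kernel(x, y, **kw):
    fx, fy = double_features(x, **kw), double_features(y, **kw)
    return sum(c * fy[f] for f, c in fx.items())

f = double_features("HKYNQLIM")   # contains e.g. ('H', 1, 'K'), ('H', 5, 'L'), ...
```

Because a single substitution destroys only the pairs that touch the mutated position, many spatial features survive mutation — the overlap property discussed in the sequence-modeling example below.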
• Differences in sequence modeling
  • mismatch: very few features retained
  • spatial: still significant overlap in features
Sequence Modeling
Example: S = HKYNQLIM, mutated to S' = HKINQIIM.
mismatch(5,1) features of S (x = wildcard position): xKYNQ, HxYNQ, HKxNQ, HKYxQ, HKYNx; xYNQL, KxNQL, KYxQL, KYNxL, KYNQx; xNQLI, YxQLI, YNxLI, YNQxI, YNQLx; xQLIM, NxLIM, NQxIM, NQLxM, NQLIx — and likewise for S'.
double-(1,5) features of S (character pairs at distances 1-5):
d=1: HK, KY, YN, NQ, QL, LI, IM
d=2: H_Y, K_N, Y_Q, N_L, Q_I, L_M
d=3: H__N, K__Q, Y__L, N__I, Q__M
d=4: H___Q, K___L, Y___I, N___M
d=5: H____L, K____I, Y____M
(and analogously for S': HK, KI, IN, NQ, QI, II, IM; H_I, K_N, I_Q, N_I, Q_I, I_M; ...)
Differences in handling substitutions by spectrum-like and spatial kernels:
• Mismatch: very few features retained
• Spatial kernels: still significant overlap in sequence features
• For the spectrum-like kernel, the number of allowed mismatches needs to be increased (at a high computational cost)
SSK Evaluation: Representational Ability; Algorithm & Running Time
• Semi-supervised setting: 1M unlabeled sequences (NR dataset)
• Structural classification of proteins (SCOP), 54 superfamilies
• Large-scale protein annotation
Text: Test F1 Scores
• Improves over many state-of-the-art methods; 9K × 9K Gram matrix in < 60 sec
• SS = subsequence kernel, NG = n-gram kernel, KSG = key-substring-group method, TF-IDF = tf-idf kernel
• Reuters-21578 dataset: 8 topics, ~9K documents, 30K words
• Task: classify text documents by topic
Music Genre Prediction
• 10 genres, ~30 s of audio per song
• VQ with |Σ|=1024 codewords (MFCC features), sequence length ~7000

Method                                            Error
Mismatch kernel (k=5, m=1)                        35.6
Spatial kernel (t=3, k=1, d=5)                    29.4
Baseline 1: DWCH (Daubechies wavelet), Li et al.  41.6
Music Artist Recognition
• Album-wise cross-validation over 20 artists (120 albums)
• VQ with |Σ|=1024 codewords

Method                          Error
Mismatch kernel (k=5, m=1)      44.66
Spatial kernel (t=3, k=1, d=5)  32.50
Baseline 1: GMM                 44.0
Spatial Kernels: SSSK Evaluation
• Efficient computation using sorting + counting (alphabet-size independent)
• Alphabet-free matching
• Order-of-magnitude faster than existing methods for inexact sequence matching (e.g., mismatch, subsequence methods)
• ~10^4 speedup for large alphabets; ~10^2-10^3 speedup for long sequences
SSSK: Our Contributions
• New spatial representation for sequence data (improves representational ability)
• Fast O(cn) evaluation (alphabet-independent)
  • important for bio-sequences, text, music, etc.
• State-of-the-art performance for remote homology, fold prediction, text & music classification
Pavel Kuksa, Vladimir Pavlovic, "Spatial representations for efficient sequence classification", ICPR 2010
Algorithms for String Kernels with Inexact Matching
What are the best algorithms for string kernels?
• Suffix trees (Vishwanathan & Smola, 2002): linear-time all-substring kernel
• (Sparse) dynamic programming (Rousu, 2005): gapped kernel
• Mismatch trie (Leslie, 2004): applies to mismatch, spectrum, gapped kernels, etc. (inexact matching!)
• Complexity limits applicability to small k, m, |Σ|
O(k^{m+1}|Σ|^m(|X| + |Y|)) for k-mers (contiguous substrings of length k) with up to m mismatches and alphabet size |Σ|. This limits the applicability of such algorithms to simpler transformation models (shorter k and m) and smaller alphabets, reducing their practical utility on complex real data. As an alternative, more complex transformation models such as [2] lead to state-of-the-art predictive performance at the expense of increased computational effort.

In this work we propose novel algorithms for modeling sequences under complex transformations (such as multiple insertions, deletions, and mutations) that exhibit state-of-the-art performance on a variety of distinct classification tasks. In particular, we present new algorithms for inexact (e.g. with mismatches) string comparison that improve currently known time bounds for such tasks and show orders-of-magnitude running time improvements. The algorithms rely on an efficient implicit computation of mismatch neighborhoods and k-mer statistics on sets of sequences. This leads to a mismatch kernel algorithm with complexity O(c_{k,m}(|X| + |Y|)), where c_{k,m} is independent of the alphabet Σ. The algorithm can be easily generalized to other families of string kernels, such as the spectrum and gapped kernels [6], as well as to semi-supervised settings such as the neighborhood kernel of [7]. We demonstrate the benefits of our algorithms on many challenging classification problems, such as detecting homology (evolutionary similarity) of remotely related proteins, recognizing protein folds, and classifying music samples. The algorithms display state-of-the-art classification performance and run substantially faster than existing methods. The low computational complexity of our algorithms opens the possibility of analyzing very large datasets under both fully-supervised and semi-supervised settings with modest computational resources.
2 Related Work
Over the past decade, various methods have been proposed to solve the string classification problem, including generative approaches, such as HMMs, and discriminative approaches. Among the discriminative approaches, in many sequence analysis tasks, kernel-based [8] machine learning methods provide the most accurate results [2, 3, 4, 6].

Sequence matching is frequently based on co-occurrence of exact sub-patterns (k-mers, features), as in spectrum kernels [9] or substring kernels [10]. Inexact comparison in this framework is typically achieved using different families of mismatch [3] or profile [2] kernels. Both the spectrum-k and mismatch(k,m) kernels directly extract string features from the observed sequence X. On the other hand, the profile kernel, proposed by Kuang et al. in [2], builds a profile [11] P_X and uses a similar |Σ|^k-dimensional representation derived from P_X. Constructing the profile for each sequence may not be practical in some application domains, since the size of the profile depends on the size of the alphabet. While for bio-sequences |Σ| = 4 or 20, for music or text classification |Σ| can potentially be very large, on the order of tens of thousands of symbols. In this case, a very simple semi-supervised learning method, the sequence neighborhood kernel, can be employed [7] as an alternative to lone k-mers with many mismatches.

The most efficient available trie-based algorithms [3, 5] for mismatch kernels have a strong dependency on the size of the alphabet and the number of allowed mismatches, both of which need to be restricted in practice to control the complexity of the algorithm. Under the trie-based framework, the list of k-mers extracted from the given strings is traversed in a depth-first search with branches corresponding to all possible σ ∈ Σ. Each leaf node at depth k corresponds to a particular k-mer feature (an exact or inexact instance of the observed exact string features) and contains a list of matching features from each string. The kernel matrix is updated at leaf nodes with the corresponding counts. The complexity of the trie-based algorithm for mismatch kernel computation for two strings X and Y is O(k^{m+1}|Σ|^m(|X| + |Y|)) [3]. The complexity depends on the size of Σ since, during trie traversal, possible substitutions are drawn from Σ explicitly; consequently, to control the complexity of the algorithm we need to restrict the number of allowed mismatches (m) as well as the alphabet size (|Σ|). Such limitations hinder wide application of this powerful computational tool: in biological sequence analysis, mutations, insertions and deletions frequently co-occur, establishing the need to relax the parameter m; on the other hand, restricting the alphabet size strongly limits applications of the mismatch model. While other efficient string algorithms exist, such as [6, 12] and the suffix-tree based algorithms in [10], they do not readily extend to the mismatch framework. In this study, we aim to extend the work presented in [6, 10] and close the existing gap in theoretical complexity between the mismatch and other fast string kernels.
Mismatch Trie Algorithm
• Depth-first search over a complete tree with leaves/paths for all possible k-length substrings
• Problematic for large-Σ inputs and relaxed matching (larger m)
[Figure: trie of depth k with |Σ|^k leaves; branches labeled A, C, G, ...]
O(k^m|Σ|^m) — the number of k-mers at Hamming distance at most m from a given k-mer
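The exact count behind this bound follows from choosing i ≤ m positions to mutate, each to one of |Σ|−1 other symbols: Σ_{i=0}^{m} C(k, i)(|Σ|−1)^i. A quick sketch (my helper, for illustration):

```python
from math import comb

def neighborhood_size(k, m, sigma_size):
    """Exact number of k-mers within Hamming distance m of a fixed k-mer:
    pick i positions to mutate, each to one of (|Sigma| - 1) other symbols."""
    return sum(comb(k, i) * (sigma_size - 1) ** i for i in range(m + 1))

print(neighborhood_size(5, 1, 4))    # DNA, mismatch(5,1) → 16
print(neighborhood_size(5, 2, 20))   # protein, mismatch(5,2) → 3706
```

The |Σ|^m growth is exactly why trie-based traversal becomes impractical for large alphabets such as the 1024-codeword music representations used later.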
Our Results
• New sufficient-statistics-based algorithms for kernel computation
• Improved complexity bounds over existing algorithms; removes the dependency on the alphabet size
• Several orders-of-magnitude speed-up ⇒ can now be applied to large-alphabet inputs (music, text, time series, etc.)
• Enhanced performance: more complex kernels improve predictive accuracy
• 10^4 speed-up for large (|Σ|=1000) alphabets
Figure 1: Relative running time vs. alphabet size
Scalable Algorithms for String Kernels with Inexact Matching
Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Theoretical results
• New algorithmic approach for string kernels with inexact matching, including mismatch, gapped, and other count-based kernels
• Improves complexity bounds over existing algorithms
• Extends to large-scale semi-supervised learning (e.g. the neighborhood kernel approach)
Empirical results
• Several orders of magnitude speed-up
• Explores more complex kernels yielding improved predictions
Points of Interest
• Removing the dependency on the alphabet size under the mismatch measure: O(k^m|Σ|^m n) → O(c_{k,m} n)
• Generalization of algorithms for kernels on strings, improving time-efficiency
• Enhancing performance with new algorithms
  • computational biology: recognizing protein fold from sequence, detecting remote evolutionary similarity
  • general sequence data: music genre recognition
• State-of-the-art results for protein classification
Problem: what are the best algorithms for strings under inexact matching?
Speed-up:  Protein 100×  |  DNA 75×  |  Text >10^6×  |  Music 10^4×
Table 2: Remote homology detection
                 Original  Relaxed
Supervised          41.92    52.00
Semi-supervised     81.05    86.32
http://seqam.rutgers.edu/projects/new-inexact/

Table 1: Protein fold prediction (multi-class)
                 Original  Relaxed
Supervised          46.78    52.87
Semi-supervised     70.87    77.51
[Figure: sequence data → feature index (unordered vs. ordered) → similarity matrix]

Table 3: Music genre classification (multi-class)
Original  57.4
Relaxed   64.4
• Reduces error by 15% compared to the best available method
• Reduces error by 30% compared to the state-of-the-art profile kernel
Our Approach: Sufficient Statistics
Two steps:
(1) Exact spectrum computation with counting sort
(2) Sufficient statistics for string kernels, computed using (1) as a sub-algorithm
http://seqam.rutgers.edu/projects/new-inexact
Problem
• Explicit feature vectors are not readily available for many data domains (e.g. biosequences, text, audio data)
  • explicit feature extraction can be complex and expensive
• Alternative: use kernel functions, i.e. inner products between data points in a higher-dimensional feature space
  • e.g. the feature space formed by all subsequences of length k (subsequence, gapped, or mismatch string kernels)
  • as the dimensionality grows exponentially (|Σ|^k), direct computation of the inner products is prohibitive
  • e.g. music with a 1024-symbol alphabet: ~10^9-10^15 features (k=3-5)
• Need efficient algorithms for the inner product (kernel)
How do we compute string kernel inner products more efficiently for large-|Σ| inputs and inexact matching with relaxed constraints (e.g. many mismatches)?
Our approach
• Sufficient-statistics-based method for efficient RKHS inner product evaluation for string kernels
  • generalizes computation and improves the time-efficiency of many existing string kernel functions
• Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05), mismatch trie (Leslie '04)
• This work: new algorithmic approach (sufficient statistics & sorting); improves complexity bounds over prior work; orders of magnitude speed-up
• Open questions addressed in this work:
  • dependency on |Σ| for the mismatch model
  • a unified computational framework for string kernels with inexact matching
  • efficient computation for large-|Σ| inputs
Main Contributions
New efficient algorithms based on sufficient statistics for string kernel computation
Theoretical results
• New algorithmic approach for string kernel computation with inexact matching
• Improved complexity bounds over existing algorithms
Empirical results
• Orders of magnitude speed-up; large datasets (millions of examples, semi-supervised)
• State-of-the-art results for protein classification
Algorithm Overview
I. Mismatch kernel function: rewriting (1) we get (2).
This gives a quadratic (Hamming distance-based) algorithm that does not depend on |Σ|.
"#$%&'%()%*'++',-
-%./)+)01$234$.%$1%+
2.#+/01$2%3+#+13+1/34
We use sorting-based counting to compute the cumulative statistics C_i in linear time, then derive the exact statistics M_i from C_i to evaluate the inner product (1), i.e. we obtain a linear-time algorithm independent of |Σ|.
10,000x faster for large (|∑|=1000) alphabets
Relative running time vs. alphabet size
K(X, Y) = Σ_{γ∈Σ^k} c(γ|X) c(γ|Y)   (1)
c(γ|X) = Σ_{α∈X} I_m(α, γ)
I_m(γ, α) = 1 if d(γ, α) ≤ m, 0 otherwise
d(γ, α) — Hamming distance between γ and α
K(X, Y) = Σ_{α∈X} Σ_{β∈Y} Σ_{γ∈Σ^k} I_m(γ, α) I_m(γ, β)
(the number of neighbors γ shared by α and β is constant for a given d(α, β)!)
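The parenthetical claim — the shared-neighbor count depends only on d(α, β) — is what collapses the triple sum into per-distance statistics. It can be verified by brute force on a tiny alphabet (a sketch; the parameters are mine):

```python
from itertools import product

SIGMA, k, m = "AC", 4, 1   # tiny binary alphabet keeps the brute force small

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def shared_neighbors(a, b):
    """|{gamma in Sigma^k : d(gamma, a) <= m and d(gamma, b) <= m}|."""
    return sum(1 for g in product(SIGMA, repeat=k)
               if hamming(a, g) <= m and hamming(b, g) <= m)

# collect the intersection sizes observed at each pair distance d(a, b)
by_dist = {}
for a in product(SIGMA, repeat=k):
    for b in product(SIGMA, repeat=k):
        by_dist.setdefault(hamming(a, b), set()).add(shared_neighbors(a, b))

assert all(len(sizes) == 1 for sizes in by_dist.values())  # one size per distance
```

Since the intersection size is a function of d(α, β) alone, the kernel only needs the counts M_i of k-mer pairs at each distance i, never the pairs themselves.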
Remote homology detection (ROC50)
                 Original (5,1)  Relaxed (7,3)
Supervised               41.92          52.00
Semi-supervised          81.05          89.52

Multi-class protein fold prediction, %
Method / Accuracy  Original (5,1)  Relaxed (7,3)
Supervised                  46.78          52.87
Semi-supervised             70.87          77.51

Music genre classification accuracy (multi-class), %
Original  57.4
Relaxed   64.4
K(X, Y) = Σ_{α∈X} Σ_{β∈Y} N(α, β)   (2)
K(X, Y) = Σ_{i=0}^{2m} M_i N_i   (3)
where M_i is the number of k-mer pairs (α, β) at Hamming distance i, and N_i is the number of neighbors shared by a pair at distance i. Some kernels can be evaluated directly from the cumulative statistics, K(X, Y) = Σ_{i=0}^{m} C_i, and the exact spectrum kernel is simply K(X, Y) = M_0.
Results
[Figure: spectrum feature vector Φ(2)('ACGAC') with m=0 — counts (0, 1, 2) over the k-mer features AA, AC, AG, ..., CG, ..., GA, ..., TT, indexed 1 ... |Σ|^k; feature/index axes shown unordered vs. ordered]
k-mer feature space: Φ(X) = (c(γ|X))_{γ∈Σ^k},  K(X, Y) = ⟨Φ(X), Φ(Y)⟩
References
• Vishwanathan & Smola, Fast kernels for string and tree matching, NIPS 2002.
• Leslie & Kuang, Fast string kernels using inexact matching, JMLR 2004.
• Rousu & Shawe-Taylor, Efficient computation of gapped substring kernels on large alphabets, JMLR 2005.
• Lodhi et al., Text classification using string kernels, JMLR 2002.
[Table: kernel matrix computation time (s), trie-based vs. sorting-based (sufficient statistics) algorithms.]
String kernels
• Kernel functions defined on feature vectors computed from strings.
• Features: substrings (contiguous), subsequences (non-contiguous).
[Figure: example sequences (AACACTCTG, AACTG, GCAAC, GCACA, AAAAC) mapped to binary feature vectors over the feature space AAA-TTT.]
• Iteratively remove i positions, sort, and count the number of matches for (k-i)-mers.
• This reduces computation to a series of spectrum kernel computations:
C_i = M_i + Σ_{j=0}^{i-1} C(k-j, i-j) M_j
NIPS 2008,December 8-11
II. Using sufficient statistics to compute string kernel functions
a. Spectrum kernels (Leslie, 2002);
b. Wildcard kernels (Leslie, 2004);
c. Gapped kernels (Leslie, 2004);
d. Spatial kernels (Kuksa, 2008);
e. Neighborhood kernels (Weston, 2005): semi-supervised, neighborhood-based,
   K(N(X), N(Y)) = K(X', Y')
Input sequences: X1: GGAA, X2: TTTGAA, X3: GGAAT
[Figure: 3-mers extracted from the input sequences with their sequence indices, before and after counting sort.]
Kernel computation with counting sort:
Kernel updates: K(J,J) = K(J,J) + cc^T
Algorithm (Mismatch-SS). Mismatch kernel algorithm based on sufficient statistics.
Input: strings X, Y; parameters k, m.
1. Compute the min(2m,k) cumulative matching statistics C_i using counting sort.
2. Compute the exact matching statistics M_i.
3. Evaluate the kernel using K(X,Y) = Σ_i M_i I_i.
Output: mismatch kernel value K(X,Y|k,m).
C_i = M_i + Σ_{j=0}^{i-1} C(k-j, i-j) M_j
M_i = C_i − Σ_{j=0}^{i-1} C(k-j, i-j) M_j
[Figure: counting sort of the extracted k-mers, grouped by feature with per-sequence counts (e.g., Seq. index 1, 2, 3 with counts 1, 1, 1 for a k-mer shared by all three sequences).]
• Jointly sort the observed k-mers in N(X) and N(Y).
• Apply the desired kernel evaluation method (e.g., spectrum, mismatch, gapped, etc.).
Input sequences → substring kernel (k=3, m=0).
Kernel updates (per substring): K(J,J) = K(J,J) + cc^T, where c is the vector of feature counts and J the sequence index.
Counting sort gives O(kn) evaluation for the exact spectrum kernel.
How does this extend to inexact matching (m>0)?
Mismatch Kernel
builds a profile [10] PX and uses a similar |Σ|k-dimensionalrepresentation, derived from PX . Constructing the profilefor each sequence may not be practical in some applicationdomains, since the size of the profile is dependent on thesize of the alphabet set. While for bio-sequences |Σ| = 4or 20, for music or text classification |Σ| can potentially bevery large, on the order of tens of thousands of symbols. Inthis case, a very simple semi-supervised learning method,the sequence neighborhood kernel, can be employed [11] asan alternative to lone k-mers with many mismatches.
Some of the most efficient available trie-based algo-rithms [3], [5] for mismatch kernels have a strong de-pendency on the size of alphabet set and the number ofallowed mismatches, both of which need to be restrictedin practice to control the complexity of the algorithm.Under the trie-based framework, the list of k-mers extractedfrom given strings is traversed in a depth-first search withbranches corresponding to all possible σ ∈ Σ. Each leafnode at depth k corresponds to a particular k-mer feature(either exact or inexact instance of the observed exact stringfeatures) and contains a list of matching features fromeach string. The kernel matrix is updated at leaf nodeswith corresponding counts. The complexity of the trie-basedalgorithm for mismatch kernel computation for two stringsX and Y is O(km+1|Σ|m(|X| + |Y |)) [3]. The algorithmcomplexity depends on the size of Σ since during a trietraversal, possible substitutions are drawn from Σ explicitly;consequently, to control the complexity of the algorithm weneed to restrict the number of allowed mismatches (m), aswell as the alphabet size (|Σ|).
We note that most existing k-mer string kernels (e.g., mismatch/spectrum kernels, gapped and wildcard kernels) essentially use symbolic Hamming-distance based matching, which may not necessarily reflect the underlying similarity/dissimilarity between sequence fragments (k-mers). For a large class of k-mer string kernels, which includes mismatch/spectrum, gapped, and wildcard kernels, the matching function I(α,β) is independent of the actual k-mers being matched and depends only on the Hamming distance [12]. As a result, related k-mers may not be matched because their symbolic dissimilarity exceeds the maximum allowed Hamming distance. This presents a limitation, as in many cases similarity relationships are not entirely based on symbolic similarity, e.g., as in matching word n-grams or amino-acid sequences, where, for instance, words may be semantically related or amino acids could share structural or physical properties not reflected on a symbolic level. For Hamming-distance based matching, recent work in [12] has introduced linear-time algorithms with alphabet-independent complexity applicable to the computation of a large class of existing string kernels. However, these approaches do not readily extend to the case of general (non-Hamming) similarity metrics (e.g., BLOSUM-based scoring functions in biological sequence analysis, or measures of semantic
relatedness between words, etc.) without introducing a highcomputational cost (e.g., requiring quadratic or exponen-tial running time). In this work, we aim to extend theworks presented in [12], [7] to the case of general (non-Hamming) similarity metrics and introduce efficient linear-time generalized string kernel algorithms. We also showempirically that using these generalized similarity kernelsprovides effective improvements in practice for a number ofchallenging classification problems.
III. DISTANCE-PRESERVING STRING KERNELS
A. Spectrum and Mismatch Kernels Definition
Given a sequence X with symbols from alphabet Σ, the spectrum-k kernel [8] and the mismatch(k,m) kernel [3] induce the following |Σ|k-dimensional representation for the sequence:
Φ_{k,m}(γ|X) = ( Σ_{α∈X} I_m(α, γ) )_{γ∈Σ^k},   (1)
where Im(α, γ) = 1 if α ∈ Nk,m(γ), and Nk,m(γ) is themutational neighborhood, the set of all k-mers that differfrom γ by at most m mismatches. Note that, by definition,for spectrum kernels, m = 0. Effectively, these are the bag-of-substrings representations, with either exact (spectrum)or approximate/smoothed (mismatch) counts of substringspresent in the sequence X .
The spectrum/mismatch kernel is then defined as
K(X, Y |k, m) = Σ_{γ∈Σ^k} Φ_{k,m}(γ|X) Φ_{k,m}(γ|Y)   (2)
             = Σ_{α∈X} Σ_{β∈Y} Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ).   (3)
One interpretation of this kernel is that of a cumulative pairwise comparison of all substrings α and β contained in sequences X and Y, respectively. In the case of mismatch kernels, the level of similarity of each pair of substrings (α, β) is based on the number of identical substrings their mutational neighborhoods give rise to, Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ). For the spectrum kernel, this similarity is simply the exact matching of α and β.
One can generalize this and allow an arbitrary metric similarity function S to replace the Hamming similarity, Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ) → S(α, β):

K(X, Y |k, S) = Σ_{α∈X} Σ_{β∈Y} S(α, β).   (4)

Such a similarity function can go significantly beyond the relatively simple substring mismatch (substitution) models mentioned above. However, a drawback of this general representation over the simple mismatch model that employs the Hamming similarity is, of course, that the complexity of comparing two sequences in general becomes quadratic in their lengths.
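A minimal sketch of this generalized quadratic kernel, Eq. (4); the function name and the toy scoring function `S` are ours, with `S` standing in for a real similarity matrix such as a BLOSUM-derived k-mer score:

```python
def general_kernel(x, y, k, S):
    # Quadratic-time Eq. (4): K(X,Y|k,S) = sum_{a in X} sum_{b in Y} S(a,b).
    xs = [x[i:i + k] for i in range(len(x) - k + 1)]
    ys = [y[i:i + k] for i in range(len(y) - k + 1)]
    return sum(S(a, b) for a in xs for b in ys)

# Toy similarity: number of agreeing positions (a stand-in for a
# BLOSUM-style k-mer score).
S = lambda a, b: sum(ca == cb for ca, cb in zip(a, b))
print(general_kernel("ACGT", "ACGA", 2, S))  # -> 5
```

With S(a, b) = 1 if a = b and 0 otherwise, this reduces to the exact spectrum kernel, but at quadratic rather than linear cost.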
Observation I: the number of substrings within distance m of both a and b is independent of a and b.

I_m(α, γ) = 1 if d(α, γ) ≤ m, and 0 otherwise.
Main issue: a denser substring spectrum, with many more features due to approximate counts (m>0); every k-mer a introduces a mismatch neighborhood N(a,m) of size O(k^m |Σ|^m).
I(a,b): the intersection size of two mismatch neighborhoods.
Sufficient Statistics for String Kernels
Algorithm 1 (Hamming-Mismatch). Mismatch kernel algorithm based on Hamming distance.
Input: strings X, Y with |X| = n_x, |Y| = n_y; parameters k, m; lookup table I of intersection sizes.
Evaluate the kernel using Equation 5:
K(X, Y |k, m) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I( d(x_{ix:ix+k−1}, y_{iy:iy+k−1}) | k, m ),   (5)
where I(d) is the intersection size for distance d.
Output: mismatch kernel value K(X, Y |k, m).
The overall complexity of the algorithm is O(knxny) since the Hamming distances between all
k-mer pairs observed in X and Y need to be known. In the following section, we discuss how to
efficiently compute the size of the intersection.
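Algorithm 1 might be sketched as follows (an illustrative implementation with our own naming, not the authors' code). The dictionary `I` maps each Hamming distance d to the pre-computed intersection size I(d), and is zero for d > 2m:

```python
def mismatch_kernel_hamming(x, y, k, I):
    # O(k * nx * ny): compare every k-mer pair via Hamming distance,
    # then add the pre-computed intersection size I[d] (0 for d > 2m).
    xs = [x[i:i + k] for i in range(len(x) - k + 1)]
    ys = [y[i:i + k] for i in range(len(y) - k + 1)]
    total = 0
    for a in xs:
        for b in ys:
            d = sum(ca != cb for ca, cb in zip(a, b))  # Hamming distance
            total += I.get(d, 0)
    return total

# With m = 0 the lookup degenerates to I = {0: 1}, and the kernel
# reduces to the exact spectrum kernel.
print(mismatch_kernel_hamming("GGAA", "GGAAT", 3, {0: 1}))  # -> 2
```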
3.3 Intersection Size: Closed Form Solution
The number of neighboring k-mers shared by two observed k-mers a and b can be directly com-
puted, in a closed-form, from the Hamming distance d(a, b) for fixed k and m, requiring no explicit
traversal of the k-mer space as in the case of trie-based computations. We first consider the case
a = b (i.e., d(a, b) = 0). The intersection size corresponds to the size of the (k, m)-mismatch neighborhood, i.e., I(a, b) = |N_{k,m}| = Σ_{i=0}^{m} C(k, i)(|Σ| − 1)^i. For higher values of the Hamming distance d, the key observation is that for fixed Σ, k, and m, given any distance d(a, b) = d, I(a, b) is also a constant, regardless of the mismatch positions. As a result, intersection values can always be pre-computed once, stored, and looked up when necessary. To illustrate this, we show two examples for m = 1, 2:
I(a, b) (m = 1):
  |N_{k,m}|,   d(a, b) = 0
  |Σ|,         d(a, b) = 1
  2,           d(a, b) = 2

I(a, b) (m = 2):
  |N_{k,m}|,                              d(a, b) = 0
  1 + k(|Σ| − 1) + (k − 1)(|Σ| − 1)^2,    d(a, b) = 1
  1 + 2(k − 1)(|Σ| − 1) + (|Σ| − 1)^2,    d(a, b) = 2
  6(|Σ| − 1),                             d(a, b) = 3
  C(4, 2),                                d(a, b) = 4
In general, the intersection size can be found in a weighted form Σ_i w_i(|Σ| − 1)^i and can be pre-computed in constant time.
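The m = 1 closed-form values can be sanity-checked by brute-force neighborhood enumeration over a small alphabet; an illustrative check with our own helper names, not part of the algorithm itself:

```python
from itertools import product

def neighborhood(a, sigma, m):
    # N_{k,m}(a): all k-mers over sigma within Hamming distance m of a.
    return {c for c in map(''.join, product(sigma, repeat=len(a)))
            if sum(x != y for x, y in zip(a, c)) <= m}

def intersection_size(a, b, sigma, m):
    return len(neighborhood(a, sigma, m) & neighborhood(b, sigma, m))

sigma = "ACGT"
print(intersection_size("AAAAA", "AAAAA", sigma, 1))  # |N_{5,1}| = 1 + 5*3 = 16
print(intersection_size("AAAAA", "AAAAC", sigma, 1))  # d = 1 -> |Sigma| = 4
print(intersection_size("AAAAA", "AAACC", sigma, 1))  # d = 2 -> 2
```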
4 Mismatch Algorithm based on Sufficient Statistics
In this section, we further develop ideas from the previous section and present an improved mismatch
algorithm that does not require pairwise comparison of the k-mers between two strings and depends
linearly on sequence length. The crucial observation is that in Equation 5, I(a, b) is non-zero
only when d(a, b) ≤ 2m. As a result, the kernel computed in Equation 5 is incremented only by
min(2m, k) + 1 distinct values, corresponding to min(2m, k) + 1 possible intersection sizes. We
then can re-write the equation in the following form:
K(X, Y |m, k) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I(x_{ix:ix+k−1}, y_{iy:iy+k−1}) = Σ_{i=0}^{min(2m,k)} M_i I_i,   (6)
where Ii is the size of the intersection of k-mer mutational neighborhood for Hamming distance
i, and Mi, the number of observed k-mer pairs in X and Y having Hamming distance i. The
problem of computing the kernel has been further reduced to a single summation. We have shown
in Section 3.3 that given any i, we can compute Ii in advance. The crucial task now becomes
computing the sufficient statistics Mi efficiently. In the following, we will show how to compute the
mismatch statistics {Mi} in O(ck,m(nx + ny)) time, where ck,m is a constant that does not depend
on the alphabet size. We formulate the task of inferring matching statistics {Mi} as the following
auxiliary counting problem:
Mismatch Statistic Counting: Given a set of n k-mers from two strings X and Y ,
for each Hamming distance i = 0, 1, ...,min(2m, k), output the number of k-mer
pairs (a, b), a ∈ X, b ∈ Y with d(a, b) = i.
Matching (sufficient) statistics: Mi = number of pairs of substrings (a in X, b in Y) at Hamming distance d(a,b)=i
Observation II: kernel is incremented only by min(2m,k)+1 distinct values corresponding to Hamming distances 0-2m
How to compute matching statistics Mi efficiently?
• Problem: direct computation of Mi is still quadratic!
Computing matching statistics
• Instead of Mi, first compute the approximate (with over-counting) number of pairs Ci at distance at most i.
• Algorithm: iteratively remove i positions, then sort and compute the spectrum kernel for the (k-i)-length substrings.
• This yields the inexact (cumulative) matching statistics Ci.
http://seqam.rutgers.edu/projects/new-inexact
Problem
• Explicit feature vectors are not readily available for many data domains (e.g., biosequences, text, audio data); explicit feature extraction can be complex and expensive.
• Alternative: use kernel functions, i.e., inner products between data points in a higher-dimensional feature space, e.g., the feature space formed by all subsequences of length k (as in subsequence, gapped, or mismatch string kernels).
• As the dimensionality grows exponentially as |Σ|^k, direct computation of the inner products is prohibitive; e.g., music with a 1024-symbol alphabet yields ~10^9-10^15 features (k=3, 5).
• We therefore need efficient algorithms for the inner product (kernel): how do we compute string kernel inner products efficiently for large-|Σ| inputs and inexact matching with relaxed constraints (e.g., many mismatches)?
Our approach
• A sufficient-statistics based method for efficient RKHS inner product evaluation for string kernels; it generalizes the computation and improves the time-efficiency of many existing string kernel functions.
• Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05), mismatch trie (Leslie '04).
• This work: a new algorithmic approach (sufficient statistics and sorting) that improves complexity bounds over prior work, with orders-of-magnitude speed-ups.
• Open questions addressed in this work: the dependency on |Σ| for the mismatch model; a unified computational framework for string kernels with inexact matching; efficient computation for large-|Σ| inputs.
Main Contributions
New efficient algorithms based on sufficient statistics for string kernel computations.
In this problem it is not necessary to know the distance between each pair of k-mers; one only
needs to know the number of pairs (Mi) at each distance i. We show next that the above problem
of computing matching statistics can be solved in linear time (in the number n of k-mers) using
multiple rounds of counting sort as a sub-algorithm.
We first consider the problem of computing number of k-mers at distance 0, i.e. the number of
exact matches. In this case, we can apply counting sort to order all k-mers lexicographically and
find the number of exact matches by scanning the sorted list. The counting then requires linear
O(kn) time. Efficient direct computation of Mi for any i > 0 is difficult (it requires quadratic time); we take another approach and first compute inexact cumulative mismatch statistics, C_i = M_i + Σ_{j=0}^{i−1} C(k−j, i−j) M_j, which overcount the number of k-mer pairs at a given distance i, as follows. Consider
two k-mers a and b. Pick i positions and remove from the k-mers the symbols at the corresponding
positions to obtain (k−i)-mers a′ and b′. The key observation is that d(a′, b′) = 0 ⇒ d(a, b) ≤ i. As a result, given n k-mers, we can compute the cumulative mismatch statistics C_i in linear time using C(k, i) rounds of counting sort on (k−i)-mers. The exact mismatch statistics M_i can then be
obtained from Ci by subtracting the exact counts to compensate for overcounting as follows:
M_i = C_i − Σ_{j=0}^{i−1} C(k−j, i−j) M_j,   i = 0, ..., min(min(2m, k), k − 1)   (7)
The last mismatch statistic Mk can be computed by subtracting the preceding statistics M0, ...Mk−1
from the total number of possible matches:
M_k = T − Σ_{j=0}^{k−1} M_j,   where T = (n_x − k + 1)(n_y − k + 1).   (8)
Our algorithm for mismatch kernel computations based on sufficient statistics is summarized in
Algorithm 2. The overall complexity of the algorithm is O(n c_{k,m}) with the constant c_{k,m} = Σ_{l=0}^{min(2m,k)} C(k, l)(k − l), independent of the size of the alphabet set, where C(k, l) is the number of rounds of counting sort for evaluating the cumulative mismatch statistics C_l.
Algorithm 2 (Mismatch-SS). Mismatch kernel algorithm based on sufficient statistics.
Input: strings X, Y with |X| = n_x, |Y| = n_y; parameters k, m; pre-computed intersection values I.
1. Compute the min(2m, k) cumulative matching statistics C_i using counting sort.
2. Compute the exact matching statistics M_i using Equation 7.
3. Evaluate the kernel using Equation 6: K(X, Y |m, k) = Σ_{i=0}^{min(2m,k)} M_i I_i.
Output: mismatch kernel value K(X, Y |k, m).
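Given the cumulative statistics, step 2 of Algorithm 2 (Equation 7) is a short recurrence; a sketch, with `exact_stats` as our own helper name:

```python
from math import comb

def exact_stats(C, k):
    # Eq. 7: M_i = C_i - sum_{j<i} C(k-j, i-j) * M_j.
    M = []
    for i, Ci in enumerate(C):
        M.append(Ci - sum(comb(k - j, i - j) * M[j] for j in range(i)))
    return M

# One pair of 3-mers at Hamming distance 1 gives C = [0, 1, 2],
# which Eq. 7 resolves to exactly one pair at distance 1.
print(exact_stats([0, 1, 2], 3))  # -> [0, 1, 0]
```

The kernel value is then the dot product of the recovered M_i with the pre-computed intersection sizes I_i (Equation 6).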
5 Extensions
Our algorithmic approach can also be applied to a variety of existing string kernels, leading to very
efficient and simple algorithms that could benefit many applications.
Spectrum Kernels. The spectrum kernel [9] in our notation is the first sufficient statistic M0, i.e.
K(X, Y |k) = M0, which can be computed in k rounds of counting sort (i.e. in O(kn) time).
Gapped Kernels. The gapped kernels [6] measure similarity between strings X and Y based on the co-occurrence of gapped instances g, |g| = k + m > k, of k-long substrings:

K(X, Y |k, g) = Σ_{γ∈Σ^k} ( Σ_{g∈X, |g|=k+m} I(γ, g) ) ( Σ_{g∈Y, |g|=k+m} I(γ, g) ),   (9)

where I(γ, g) = 1 when γ is a subsequence of g. Similar to the algorithmic approach for extracting cumulative mismatch statistics in Algorithm 2, to compute the gapped(g,k) kernel we perform a single round of counting sort over the k-mers contained in the g-mers. This gives a very simple and efficient O(C(g, k) k n)-time algorithm for gapped kernel computations.
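A direct (unoptimized) sketch of the gapped kernel with our own function names: rather than one joint counting sort, it enumerates the C(g,k) position choices per g-mer and counts each distinct k-long subsequence once, matching the 0/1 indicator I(γ, g):

```python
from collections import Counter
from itertools import combinations

def gapped_counts(x, k, g):
    counts = Counter()
    for i in range(len(x) - g + 1):
        gmer = x[i:i + g]
        # I(t, gmer) = 1 iff t occurs as a k-long subsequence of the g-mer,
        # so each distinct subsequence is counted once per g-mer.
        subs = {''.join(gmer[p] for p in pos)
                for pos in combinations(range(g), k)}
        counts.update(subs)
    return counts

def gapped_kernel(x, y, k, g):
    cx, cy = gapped_counts(x, k, g), gapped_counts(y, k, g)
    return sum(c * cy[t] for t, c in cx.items())

print(gapped_kernel("ACG", "ACG", 2, 3))  # subsequences AC, AG, CG -> 3
```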
Wildcard Kernels. The wildcard(k,m) kernel [6] in our notation is the sum of the cumulative statistics, K(X, Y |k, m) = Σ_{i=0}^{m} C_i, i.e., it can be computed in Σ_{i=0}^{m} C(k, i) rounds of counting sort, giving a simple and efficient O(Σ_{i=0}^{m} C(k, i)(k − i) n) algorithm.
II: Sufficient Statistic (SS) algorithm
http://seqam.rutgers.edu/projects/new-inexact
Problem! Explicit feature vectors are not readily available for many data domains (eg. biosequences, text,
audio data, etc.)" Explicit feature extraction can be complex and expensive
! Alternative: use kernel functions or inner products between data points in a higher dimensional feature space" Eg. feature space formed by all subsequences of length k (eg. subsequence, gapped, or
mismatch string kernels):
" as the dimensionality grows exponentially |!|k, direct computation of the inner products is prohibitive
!Eg. music with 1024-alphaber ~109-10
15 features (k=3,5)
# Need efficient algorithms for the inner product (kernel)
How do we compute string kernel inner products more efficiently for large-|!| inputs and inexact matching with relaxed (eg. with many mismatches)
constraints?
Our approach! Sufficient statistics – based method for efficient RKHS inner product evaluation for string kernels
" generalizes computation and improves time-efficiency of many existing string kernel functions! Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05),
mismatch trie (Leslie '04)! This work: new algorithmic approach (sufficient statistics & sorting), improves complexity bounds
over prior works, orders of magnitude speed-up
! Open questions addressed in this work:" dependency on |"| for mismatch model" unified computational framework for string kernels with inexact matching" efficient computation for large-|"| inputs
Main ContributionsNew efficient algorithms based on sufficient statistics for string kernel computations
Theoretical results! New algorithmic approach for string kernel computation with inexact matching! Improve complexity bounds over existing algorithms
Empirical results! Orders of magnitude speed-up, large datasets (millions of examples, semi-supervised)! State-of-the-art results for protein classification
Scalable Algorithms for String Kernels with Inexact MatchingPavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Department of Computer Science, Rutgers University, NJ
Algorithm OverviewI. Mismatch kernel function:
Rewriting (1) we get
This gives a quadratic algorithm (Hamming distance-based ), does not depend on |!#$%&'()*)+(,
"#$%&'%()%*'++',-
-%./)+)01$234$.%$1%+
2.#+/01$2%3+#+13+1/34
We use sorting-based counting to compute in linear time cumulative statistics Ci first,
Then derive Mi from C
i to evaluate the inner product (1)
i.e. we obtain a linear time algorithm independent of |!#
10,000x faster for large (|∑|=1000) alphabets
Relative running time vs. alphabet size
5/6+%)0 789 :%&+ ;<=)*100 75 >106 104
Speed-up [x]:
K $X ,Y %&'()'K c $( |X %c $( |Y % $1%
c $( |X %&'*)X I m $* ,(%
I m $( ,*%&+1, d $( ,*%,m0, otherwise
d $( ,*%-Hamming distance between ( and *
K $X ,Y %&'*)X'.)Y '()'k I m $( ,*% I m $( ,.%#neighbors / shared by a and bconstant for given d(a,b)!
4'.)+'%0).)5)26%('+'/+1)$%748"9:;Original (5,1) Relaxed (7,3)
Supervised 41.92 52.00Semi-supervised 81.05 89.52
<=5+1>/5#33%?,)+'1$%@)5(%?,'(1/+1)$A%BMethod / Accuracy Original (5,1) Relaxed (7,3)Supervised 46.78 52.87Semi-supervised 70.87 77.51
Original 57.4Relaxed 64.4
Music genre classificationaccuracy (multi-class), %
K $X ,Y %&'*)X '()Y /$( ,*% $2%
K $X ,Y %&'i&0
i&2m Mi/i $3%
K $X ,Y %&'i&0
i&mCi
K $X ,Y %&'i&0
i&mCi'
K $X ,Y %&M0
Results
AAAA ACAC AGAG ...... CGCG ...... GAGA ...... TTTT00
11
22 >2(‘ACGAC’)
k=2m=0
1 2 3 ... |∑|k
FeaturesFeatures (k-mers)
Cou
nts
index
index
features
features
ordered
unordered
AA AC AG ... TT 1 2 3 ... |∑|k
k-mer feature space:
!"-!"#$%&'$! "#% "!)$ "#)% " '#(')$&(#*+,' &
' $()0-.' .)*% 1 ' $+(")*% " +(")&'&&/01 $(& %$(-&%
ReferencesVishwanathan & Smola, Fast kernels for string and tree matching, NIPS 2002Leslie & Kuang, Fast string kernels using inexact matching, JMLR 2004Rousu & Shawe-Taylor, Efficient computation of gapped substring kernels on large alphabets, JMLR 2005Lodhi et al. Text classification using string kernels, JMLR 2002
C',$'5%.#+,1D%/).?=+#+1)$%%73;!"#$% &'()' &*
+,-./$0 1'2 34*
5,-./$067#/!$8#"+9 &'(' ':;<*
String kernels! Kernel functions defined on feature vectors computed from strings! Features:
" substrings (contiguous), " subsequences (non-contiguous)
AACACTCTG
Feature space (AAA-TTT)AACTG GCAAC
GCACAAAAC
AAA AACACTCAACTGGCA
010101
011010
" Iteratively remove i positions, sort, and count number of matches for (k-i)-mers
" Reduces computation to a series of spectrum kernel computations
Ci&Mi0' j&0
j&i-1$k- ji- j %M j
NIPS 2008,December 8-11
EEF%G31$2%3=@@1/1'$+%3+#+13+1/3%+)%/).?=+'%3+,1$2%H',$'5%@=$/+1)$3
?@$A'%*+/<B$C%/0%(=$2D%=()%E$3FF34G
H@$I)(J*?/J$C%/0%(=$2D%=()%E$3FFK4G$
$
*@$L?''%J$C%/0%(=$2D%=()%E$3FFK4G$
J@$A'?+)?($C%/0%(=$2M<C=?E$3FFN4
%@$8%)1OH6/O66J$C%/0%(=$2I%=+60E$3FFP4G$$$$=%B)Q=<'%/R)=%J$0%)1OH6/QH?=%J
$$$$$$$$$$M282S4E82T44UM2SVETV4
X1: GGAAX2: TTTGAAX3: GGAAT
GGAGAATTTTTGTGAGAAGGAGAAAAT
112222333
AATGAAGAAGAAGGAGGATTATTGTTT
312313222
Kernel computation with counting sort:
Kernel updates: M2WEW4$U$M2WEW4$X$**:
Algorithm. (Mismatch-SS) Mismatch kernel algorithm based on Sufficient StatisticsInput: strings X, Y, parameters k,m,1. Compute min(2m,k) cumulative matching statistics, C
i, using counting sort
2. Compute exact matching statistics, Mi
3. Evaluate kernel using :
Output: Mismatch kernel value K(X,Y|k,m)
/!
" $# $% !&$' %&'!&"&!($#$ $ ' %
)!/!
)!&*!-' +&"!-% $'- +!- + %) + $ !&"&' '' $&!($&!($#$ $ ' % $ '-%%
)'&,-' +&"'-%) +
Features: Seq. Index:
Counting sortSeq. index: 1 2 3Counts: 1 1 1Seq. index: 1 3Counts: 1 1
Seq. index: 2 Counts: 1
Seq. index: 3 Counts: 1
...
" Jointly sort the observed k-mers in N(X) and N(Y)" Apply desired kernel evaluation method (eg. spectrum, mismatch, gapped, etc)
" $# $% %&'!&$-%&-#&' '' $ -.-%%
*!
http://seqam.rutgers.edu/projects/new-inexact
Problem! Explicit feature vectors are not readily available for many data domains (eg. biosequences, text,
audio data, etc.)" Explicit feature extraction can be complex and expensive
! Alternative: use kernel functions or inner products between data points in a higher dimensional feature space" Eg. feature space formed by all subsequences of length k (eg. subsequence, gapped, or
mismatch string kernels):
" as the dimensionality grows exponentially |!|k, direct computation of the inner products is prohibitive
!Eg. music with 1024-alphaber ~109-10
15 features (k=3,5)
# Need efficient algorithms for the inner product (kernel)
How do we compute string kernel inner products more efficiently for large-|!| inputs and inexact matching with relaxed (eg. with many mismatches)
constraints?
Our approach! Sufficient statistics – based method for efficient RKHS inner product evaluation for string kernels
" generalizes computation and improves time-efficiency of many existing string kernel functions! Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05),
mismatch trie (Leslie '04)! This work: new algorithmic approach (sufficient statistics & sorting), improves complexity bounds
over prior works, orders of magnitude speed-up
! Open questions addressed in this work:" dependency on |"| for mismatch model" unified computational framework for string kernels with inexact matching" efficient computation for large-|"| inputs
Main ContributionsNew efficient algorithms based on sufficient statistics for string kernel computations
Theoretical results! New algorithmic approach for string kernel computation with inexact matching! Improve complexity bounds over existing algorithms
Empirical results! Orders of magnitude speed-up, large datasets (millions of examples, semi-supervised)! State-of-the-art results for protein classification
Scalable Algorithms for String Kernels with Inexact MatchingPavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Department of Computer Science, Rutgers University, NJ
Algorithm OverviewI. Mismatch kernel function:
Rewriting (1) we get
This gives a quadratic algorithm (Hamming distance-based ), does not depend on |!#$%&'()*)+(,
"#$%&'%()%*'++',-
-%./)+)01$234$.%$1%+
2.#+/01$2%3+#+13+1/34
We use sorting-based counting to compute in linear time cumulative statistics Ci first,
Then derive Mi from C
i to evaluate the inner product (1)
i.e. we obtain a linear time algorithm independent of |!#
10,000x faster for large (|∑|=1000) alphabets
Relative running time vs. alphabet size
5/6+%)0 789 :%&+ ;<=)*100 75 >106 104
Speed-up [x]:
K $X ,Y %&'()'K c $( |X %c $( |Y % $1%
c $( |X %&'*)X I m $* ,(%
I m $( ,*%&+1, d $( ,*%,m0, otherwise
d $( ,*%-Hamming distance between ( and *
K $X ,Y %&'*)X'.)Y '()'k I m $( ,*% I m $( ,.%#neighbors / shared by a and bconstant for given d(a,b)!
4'.)+'%0).)5)26%('+'/+1)$%748"9:;Original (5,1) Relaxed (7,3)
Supervised 41.92 52.00Semi-supervised 81.05 89.52
<=5+1>/5#33%?,)+'1$%@)5(%?,'(1/+1)$A%BMethod / Accuracy Original (5,1) Relaxed (7,3)Supervised 46.78 52.87Semi-supervised 70.87 77.51
Original 57.4Relaxed 64.4
Music genre classificationaccuracy (multi-class), %
K $X ,Y %&'*)X '()Y /$( ,*% $2%
K $X ,Y %&'i&0
i&2m Mi/i $3%
K $X ,Y %&'i&0
i&mCi
K $X ,Y %&'i&0
i&mCi'
K $X ,Y %&M0
Results
AAAA ACAC AGAG ...... CGCG ...... GAGA ...... TTTT00
11
22 >2(‘ACGAC’)
k=2m=0
1 2 3 ... |∑|k
FeaturesFeatures (k-mers)
Cou
nts
index
index
features
features
ordered
unordered
AA AC AG ... TT 1 2 3 ... |∑|k
k-mer feature space:
!"-!"#$%&'$! "#% "!)$ "#)% " '#(')$&(#*+,' &
' $()0-.' .)*% 1 ' $+(")*% " +(")&'&&/01 $(& %$(-&%
ReferencesVishwanathan & Smola, Fast kernels for string and tree matching, NIPS 2002Leslie & Kuang, Fast string kernels using inexact matching, JMLR 2004Rousu & Shawe-Taylor, Efficient computation of gapped substring kernels on large alphabets, JMLR 2005Lodhi et al. Text classification using string kernels, JMLR 2002
C',$'5%.#+,1D%/).?=+#+1)$%%73;!"#$% &'()' &*
+,-./$0 1'2 34*
5,-./$067#/!$8#"+9 &'(' ':;<*
String kernels! Kernel functions defined on feature vectors computed from strings! Features:
" substrings (contiguous), " subsequences (non-contiguous)
AACACTCTG
Feature space (AAA-TTT)AACTG GCAAC
GCACAAAAC
AAA AACACTCAACTGGCA
010101
011010
" Iteratively remove i positions, sort, and count number of matches for (k-i)-mers
" Reduces computation to a series of spectrum kernel computations
Ci&Mi0' j&0
j&i-1$k- ji- j %M j
NIPS 2008,December 8-11
EEF%G31$2%3=@@1/1'$+%3+#+13+1/3%+)%/).?=+'%3+,1$2%H',$'5%@=$/+1)$3
a. Spectrum kernels (Leslie, 2002)
b. Wildcard kernels (Leslie, 2004)
c. Gapped kernels (Leslie, 2004)
d. Spatial kernels (Kuksa, 2008)
e. Neighborhood kernels (Weston, 2005): semi-supervised, neighbor-based
   K(N(X), N(Y)) = Σ_{X'∈N(X)} Σ_{Y'∈N(Y)} K(X', Y')
X1: GGAA, X2: TTTGAA, X3: GGAAT
3-mers (with sequence index): GGA(1) GAA(1) TTT(2) TTG(2) TGA(2) GAA(2) GGA(3) GAA(3) AAT(3)
sorted: AAT(3) GAA(1) GAA(2) GAA(3) GGA(1) GGA(3) TGA(2) TTG(2) TTT(2)
Kernel computation with counting sort:
Kernel updates: K(x,y) ← K(x,y) + c·c′ (product of the matched-group counts)
Algorithm (Mismatch-SS). Mismatch kernel algorithm based on sufficient statistics
Input: strings X, Y; parameters k, m
1. Compute min(2m,k) cumulative matching statistics C_i using counting sort
2. Compute the exact matching statistics M_i
3. Evaluate the kernel using Eq. (3)
Output: mismatch kernel value K(X,Y | k,m)
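Step 1 (computing C_i by sorting position-deleted k-mers) can be sketched as follows; `Counter` stands in for the counting sort, and the function name is illustrative:

```python
from itertools import combinations
from collections import Counter

def cumulative_statistic(xs, ys, k, i):
    """C_i: over every way of deleting i of the k positions, count pairs of
    k-mers from xs and ys whose remaining (k-i)-mers match exactly.
    Counter plays the role of the counting sort (grouping equal keys)."""
    total = 0
    for drop in combinations(range(k), i):
        keep = [p for p in range(k) if p not in drop]
        cx = Counter(tuple(a[p] for p in keep) for a in xs)
        cy = Counter(tuple(b[p] for p in keep) for b in ys)
        total += sum(n * cy[key] for key, n in cx.items())
    return total
```

For xs = [ACG, ACC] and ys = [ACG] with k=3 this gives C_0 = 1 and C_1 = 4, matching the recurrence C_1 = M_1 + C(3,1)·M_0 with M_0 = M_1 = 1.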
K(X,Y) = Σ_{i=0}^{min(2m,k)} c_i · M_i
M_i = C_i − Σ_{j=0}^{i−1} C(k−j, i−j) · M_j
Counting sort output — features with sequence indices and counts:
GAA: seq. index 1 2 3, counts 1 1 1;  GGA: seq. index 1 3, counts 1 1;  TGA: seq. index 2, count 1;  AAT: seq. index 3, count 1;  …
• Jointly sort the observed k-mers in N(X) and N(Y)
• Apply the desired kernel evaluation method (e.g., spectrum, mismatch, gapped)
Complexity: original (trie-based) vs. sufficient-statistics based — e.g., for the mismatch kernel, O(k^{m+1}|Σ|^m n) → O(c_{k,m} n)
• Spectrum kernels (Leslie, 2002)
K(X,Y) = M0
• Gapped kernels (Leslie, 2004; Rousu, 2005)
• Spatial kernels (Kuksa, 2008 & 2010)
• In all cases reduce computation to multiple rounds of exact spectrum kernel computation (counting sort)
Using Sufficient Statistics to compute String Kernel Functions
http://seqam.rutgers.edu/projects/new-inexact
Problem
• Explicit feature vectors are not readily available for many data domains (e.g., biosequences, text, audio data)
  - Explicit feature extraction can be complex and expensive
• Alternative: use kernel functions, i.e., inner products between data points in a higher-dimensional feature space
  - E.g., the feature space formed by all subsequences of length k (subsequence, gapped, or mismatch string kernels)
  - As the dimensionality grows exponentially (|Σ|^k), direct computation of the inner products is prohibitive
  - E.g., music over a 1024-symbol alphabet yields ~10^9–10^15 features (k=3,5)
⇒ We need efficient algorithms for the inner product (kernel).
How do we compute string kernel inner products efficiently for large-|Σ| inputs and inexact matching with relaxed constraints (e.g., many mismatches)?
Our approach
• Sufficient-statistics-based method for efficient RKHS inner product evaluation for string kernels
  - generalizes the computation and improves the time-efficiency of many existing string kernel functions
• Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05), mismatch trie (Leslie '04)
• This work: a new algorithmic approach (sufficient statistics & sorting) that improves complexity bounds over prior work, with orders-of-magnitude speed-ups
• Open questions addressed in this work:
  - dependency on |Σ| for the mismatch model
  - a unified computational framework for string kernels with inexact matching
  - efficient computation for large-|Σ| inputs
Main Contributions
New efficient algorithms based on sufficient statistics for string kernel computations
Theoretical results
• New algorithmic approach for string kernel computation with inexact matching
• Improved complexity bounds over existing algorithms
Empirical results
• Orders-of-magnitude speed-ups on large datasets (millions of examples, semi-supervised)
• State-of-the-art results for protein classification
Scalable Algorithms for String Kernels with Inexact MatchingPavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Department of Computer Science, Rutgers University, NJ
Algorithm OverviewI. Mismatch kernel function:
Rewriting (1) we get
This gives a quadratic, Hamming-distance-based algorithm that does not depend on |Σ|.
Can we do better?
Yes — with sorting-based matching statistics.
We use sorting-based counting to compute the cumulative statistics C_i in linear time first, then derive the M_i from the C_i to evaluate the inner product (1), i.e., we obtain a linear-time algorithm independent of |Σ| — 10,000x faster for large (|Σ|=1000) alphabets.
Relative running time vs. alphabet size
Speed-up [x]: Protein 100, DNA 75, Text >10^6, Music 10^4
Evaluation
Running time (I)
10^4 speed-up for large (|Σ|=1000) alphabets
Figure 1: Relative running time vs. alphabet size
Scalable Algorithms for String Kernels with Inexact Matching
Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Theoretical results
New algorithmic approach for string kernels with inexact matching, including mismatch, gapped, and other count-based kernels
Improves complexity bounds over existing algorithms
Extends to large-scale semi-supervised learning (e.g., the neighborhood kernel approach)
Empirical results
Several orders of magnitude speed-up; more complex kernels can be explored, yielding improved predictions
Points of Interest
• Removing the dependency on the alphabet size under the mismatch measure: O(k^{m+1}|Σ|^m n) → O(c_{k,m} n)
• Generalization of algorithms for kernels on strings, improving time-efficiency
• Enhancing predictive performance with the new algorithms
• computational biology: recognizing protein folds from sequence; detecting remote evolutionary similarity
• general sequence data: music genre recognition
• state-of-the-art results for protein classification
Problem: what are the best algorithms for strings under inexact matching?
Speed-up [x]: Protein 100, DNA 75, Text >10^6, Music 10^4
Table 2: Remote homology detection
Method           Original   Relaxed
Supervised          41.92     52.00
Semi-supervised     81.05     86.32
http://seqam.rutgers.edu/projects/new-inexact/
Table 1: Protein fold prediction (multi-class)
Method           Original   Relaxed
Supervised          46.78     52.87
Semi-supervised     70.87     77.51
[Figure: sequence data → sorted k-mer feature index (ordered/unordered) → similarity (kernel) matrix]
Table 3: Music genre classification (multi-class)
Original   57.4
Relaxed    64.4
• Reduces error by 15% compared to the best available method
• Reduces error by 30% compared to the state-of-the-art profile kernel
Speed-up: Protein 100, Music ~10^4, Text ~10^6
Settings: sequence length = 10K, k=5, m=1
Running time (II)
Figure 1: Relative running time T_trie/T_ss (in logarithmic scale) of the mismatch-trie and mismatch-ss algorithms as a function of the alphabet size (mismatch(5,1) kernel, n = 10^5)
Table 1: Running time (in seconds) for kernel computation between two strings on real data

            long protein   protein      dna      text     music
n                  36672       116      570       242      6892
|Σ|                   20        20        4     29224      1024
(5,1)-trie        1.6268    0.0212   0.0260     20398     526.8
(5,1)-ss          0.1987    0.0052   0.0054    0.0178    0.0331
time ratio             8         4        5      10^6    16,000
(5,2)-trie       31.5519    0.2918   0.4800         -         -
(5,2)-ss          0.2957    0.0067   0.0064    0.0649    0.0941
time ratio           100        44       75         -         -
the semi-supervised setting for neighborhood mismatch kernels; for example, computing a smaller neighborhood mismatch(5,2) kernel matrix for the labeled sequences only (a 2862-by-2862 matrix) using the Swiss-Prot unlabeled dataset takes 1,480 seconds with our algorithm, whereas performing the same task with the trie-based algorithm takes about 5 days.
6.2 Empirical performance analysis
In this section we show predictive performance results for several sequence analysis tasks using our new algorithms. We consider the tasks of multi-class music genre classification [16], with results in Table 2, and protein remote homology (superfamily) prediction [9, 2, 18] in Table 3. We also include preliminary results for multi-class fold prediction [14, 15] in Table 4.
On the music classification task, we observe significant improvements in accuracy for larger numbers of mismatches. The obtained error rate (35.6%) on this dataset compares well with the state-of-the-art results based on the same signal representation in [16]. Remote protein homology detection, as evident from Table 3, clearly benefits from a larger number of allowed mismatches, because remotely related proteins are likely to be separated by multiple mutations or insertions/deletions. For example, we observe improvement in the average ROC-50 score from 41.92 to 52.00 under a fully-supervised setting, and similar significant improvements in the semi-supervised settings. In particular, the result on the Swiss-Prot dataset for the (7,3)-mismatch kernel is very promising and compares well with the best results of the state-of-the-art, but computationally more demanding, profile kernels [2]. The neighborhood kernels proposed by Weston et al. have already shown very promising results in [7], though slightly worse than the profile kernel. However, using our new algorithm that significantly improves the speed of the neighborhood kernels, we show that with a larger number of allowed mismatches the neighborhood can perform even better than the state-of-the-art profile kernel: the (7,3)-mismatch neighborhood achieves an average ROC-50 score of 86.32, compared to 84.00 for the profile kernel on the Swiss-Prot dataset. This is an important result that addresses a main drawback of the neighborhood kernels, the running time [7, 2].
Table 2: Classification performance on music genre classification (multi-class)
Method           Error
Mismatch (5,1)   42.6±6.34
Mismatch (5,2)   35.6±4.99

Table 3: Classification performance on protein remote homology prediction
                      mismatch (5,1)    mismatch (5,2)    mismatch (7,3)
dataset               ROC    ROC50      ROC    ROC50      ROC    ROC50
SCOP (supervised)     87.75  41.92      90.67  49.09      91.31  52.00
SCOP (unlabeled)      90.93  67.20      91.42  69.35      92.27  73.29
SCOP (PDB)            97.06  80.39      97.24  81.35      97.93  84.56
SCOP (Swiss-Prot)     96.73  81.05      97.05  82.25      97.78  86.32
For multi-class protein fold recognition (Table 4), we similarly observe improvements in performance for larger numbers of allowed mismatches. The balanced error of 25% for the (7,3)-mismatch neighborhood kernel using Swiss-Prot compares well with the best error rate of 26.5% for the state-of-the-art profile kernel with adaptive codes in [15] that used a much larger non-redundant (NR) dataset. Using NR, the balanced error further reduces to 22.5% for the (7,3)-mismatch.

Large-scale protein sequence fold recognition
• Unlabeled data: >1M protein sequences; labeled data: ~4K sequences
• 27 fold classes (very low sequence similarity)
• 20% reduction in error with relaxed matching (larger m); 40–50% reduction with a large unlabeled corpus
Table 4: Classification performance on fold prediction (multi-class)

Method            Error   Top-5   Balanced  Top-5     Recall  Top-5   Precision  Top-5   F1     Top-5
                          Error   Error     Bal.Err.          Recall             Prec.          F1
Mismatch (5,1)    51.17   22.72   53.22     28.86     46.78   71.14   90.52      95.25   61.68  81.45
Mismatch (5,2)    42.30   19.32   44.89     22.66     55.11   77.34   67.36      84.77   60.62  80.89
Mismatch (5,2)†   27.42   14.36   24.98     13.36     75.02   86.64   79.01      91.02   76.96  88.78
Mismatch (7,3)    43.60   19.06   47.13     22.76     52.87   77.24   84.65      91.95   65.09  83.96
Mismatch (7,3)†   26.11   12.53   25.01     12.57     74.99   87.43   85.00      92.78   79.68  90.02
Mismatch (7,3)‡   23.76   11.75   22.49     12.14     77.59   87.86   84.90      91.99   81.04  89.88
† used the Swiss-Prot sequence database; ‡ used the NR (non-redundant) database
7 Conclusions
We presented new algorithms for inexact matching of discrete-valued string representations that reduce the computational complexity of current algorithms, demonstrate state-of-the-art performance, and significantly improve running times. This improvement makes string kernels with approximate but looser matching a viable alternative for practical tasks of sequence analysis. Our algorithms work with large databases in supervised and semi-supervised settings and scale well in the alphabet size and the number of allowed mismatches. As a consequence, the proposed algorithms can be readily applied to other challenging problems in sequence analysis and mining.
References
[1] Jianlin Cheng and Pierre Baldi. A machine learning information retrieval approach to protein fold recognition. Bioinformatics, 22(12):1456–1463, June 2006.
[2] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina S. Leslie. Profile-based string kernels for remote homology detection and motif extraction. In CSB, pages 152–160, 2004.
[3] Christina S. Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch string kernels for SVM protein classification. In NIPS, pages 1417–1424, 2002.
[4] Soren Sonnenburg, Gunnar Ratsch, and Bernhard Scholkopf. Large scale genomic sequence SVM classifiers. In ICML '05, pages 848–855, New York, NY, USA, 2005.
[5] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[6] Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435–1455, 2004.
[7] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William Stafford Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.
[8] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.
[9] Christina S. Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 566–575, 2002.
[10] S. V. N. Vishwanathan and Alex Smola. Fast kernels for string and tree matching. Advances in Neural Information Processing Systems, 15, 2002.
[11] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences, 84:4355–4358, 1987.
[12] Juho Rousu and John Shawe-Taylor. Efficient computation of gapped substring kernels on large alphabets. J. Mach. Learn. Res., 6:1323–1344, 2005.
[13] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast protein homology and fold detection with sparse spatial sample kernels. In ICPR 2008, 2008.
[14] Chris H.Q. Ding and Inna Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.
[15] Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein classification using adaptive codes. J. Mach. Learn. Res., 8:1557–1581, 2007.
[16] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classification. In SIGIR '03, pages 282–289, New York, NY, USA, 2003. ACM.
[17] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. On the role of local matching for efficient semi-supervised protein sequence classification. In BIBM, 2008.
[18] Tommi Jaakkola, Mark Diekhans, and David Haussler. A discriminative framework for detecting remote protein homologies. In Journal of Computational Biology, volume 7, pages 95–114, 2000.
Algorithms for string kernels with inexact matching: Our contributions
• New family of algorithms based on sufficient statistics and counting sort for string kernels
• unified approach
• improves time complexity bounds over existing algorithms
• reduces kernel computation to exact spectrum kernel computation
• enhances performance with relaxed matching
Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic, NIPS 2008“Scalable Algorithms for String Kernels with Inexact Matching”
Pattern extraction: Motif discovery
Motif Discovery
GAACTCATGGTG
AAAAGCACGGTC
TCAAAGCAAGGC
CCTAATCAGGGC
AAGTATGGACTC
ACTAAGCAGGGT
TCTCACGGCCCA
CCTCGTGGTGGG
TACCGTATGGTT
ACCACTCGTCGA
Sequence database with motif instances (length k=8)
Motif (consensus sequence): A[AC]TCATGG
• Motifs: sequence patterns of biological significance
  - E.g., TF binding sites in DNA, structural motifs in proteins
Motif finder
Motif finding algorithms
• State-of-the-art algorithms (e.g., MITRA, RISOTTO) depend on the size of the (k,m) mutational neighborhood, O(k^m |Σ|^m)
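The O(k^m |Σ|^m) blow-up referenced above is easy to quantify: the number of k-mers within Hamming distance m of a fixed k-mer is Σ_{i≤m} C(k,i)·(|Σ|−1)^i. A small sketch (the function name is illustrative):

```python
from math import comb

def neighborhood_size(k, m, sigma):
    """|N(a, m)|: number of k-mers within Hamming distance m of a fixed k-mer,
    sum_{i<=m} C(k, i) * (sigma - 1)^i -- the O(k^m |Sigma|^m) blow-up."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(m + 1))
```

For a DNA motif of length 8 with 2 mismatches this already gives 1 + 8·3 + 28·9 = 277 neighbors per k-mer, and the count grows steeply with |Σ| for protein alphabets.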
• Can use similar approach and remove alphabet dependency (Kuksa & Pavlovic, BMC Bioinformatics, 2010)
Motif search tree: depth k, |Σ|^k leaves (all possible k-length motif sequences), branching over the symbols A, C, G, …
Algorithm Overview

Figure 2: Feasible instance selection.
At first, it seems that in order to obtain a set of k-mers R at distance of at most 2m from
every string one needs to compute all O(n2N2) pairwise distances between input k-mers and
select k-mers at distance of at most 2m — which would require quadratic running time. We
will show next how to obtain a reduced set of k-mers in time linear in the input length n.
The reduced set R of potential motif instances, i.e. k-mers at distance of at most 2m
from each of the input strings, can be obtained using our linear time selection algorithm
(Algorithm 1). To select the valid k-mers (i.e. a set of potential motif instances), we use
multiple rounds of count sort by removing iteratively 2m out of k positions and sorting
the resulting sets of (k − 2m)-mers. A k-mer is deemed a potential motif instance if it
matched at least one k-mer from each of the other strings in at least one of the sorting
rounds. The purpose of sorting is to group the same k-mers together; note that (k − 2m)-
mers corresponding to k-mers at distance of at most 2m will match exactly. Using a simple
linear scan over the sorted list of all input k-mers, we can find the set of potential motif
instances and construct R. This algorithm is outlined below (Algorithm 1).
Algorithm 1 Selection algorithm
Input: set S of k-mers with associated sequence index L, distance parameter d
Output: set R of k-mers at distance d from each input string
1. Pick d positions and remove from the k-mers symbols at the corresponding positions
to obtain a set of (k − d)-mers.
2. Use counting sort to order (lexicographically) the resulting set of (k − d)-mers.
3. Scan the sorted list to create the list of all sequences in which k-mers appear using
sequence index L.
4. Output the k-mers that appear in every input string.
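Algorithm 1 can be sketched as below (a simplified reading: a k-mer is kept if, in at least one round of deleting the same d positions, its (k−d)-mer group contains k-mers from every input sequence; dictionary grouping stands in for the counting sort, and names are illustrative):

```python
from itertools import combinations

def select_feasible(kmers, k, d):
    """Sketch of Algorithm 1: a k-mer is kept if, in at least one round of
    deleting the same d positions, its (k-d)-mer group contains k-mers from
    every input sequence. `kmers` is a list of (kmer, seq_index) pairs."""
    nseq = len({s for _, s in kmers})
    feasible = set()
    for drop in combinations(range(k), d):
        keep = [p for p in range(k) if p not in drop]
        groups = {}  # dictionary grouping stands in for the counting sort
        for a, s in kmers:
            groups.setdefault(tuple(a[p] for p in keep), set()).add(s)
        for a, _ in kmers:
            if len(groups[tuple(a[p] for p in keep)]) == nseq:
                feasible.add(a)
    return feasible
```

Two k-mers whose (k−d)-mers agree after deleting the same d positions are at Hamming distance at most d, which is exactly the certificate the selection needs.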
As we will see in the experiments (Section 4), the selection using Algorithm 1 significantly
reduces the number of k-mer instances considered by the motif algorithm and improves its
search efficiency. The number of selected k-mers, i.e. the size of R, is small, especially
Feasible instances: at Hamming distance at most 2m
[Figure: candidate k-mers (ACTCATGG, AAGCACGG, AAGCAAGG, AATCAGGG, CCTCGTGG) and the (k−d)-mers obtained after deleting d positions; matching (k−d)-mers certify the distance constraint]
Example (k=3, m=2): count sort groups (TGT, TGC); intersection patterns TGx, Txx, xGx, TxC, xGC (x = wildcard)
Selection Efficiency
[Figure: motif finding speedup (input reduction factor) vs. typical sequence length; motif length k=11, max. #mismatches m=3]

Fast Motif Selection for Biological Sequences
Pavel Kuksa, Vladimir Pavlovic
Department of Computer Science, Rutgers University

Problem: Motif Finding
• motif (unknown)
• motif instances (inexact copies, unknown)
• sequences over an alphabet Σ (DNA, protein)
Input: a set of strings S, |S| = N, constructed from an alphabet Σ
Output: the set of motifs such that each motif a ∈ Σ^k is within a Hamming distance of at most d from strings in S

[Figure: motif finding speedup (running time) vs. alphabet size; motif length k=11, max. #mismatches m=3; the speedup factor grows with the alphabet size]

Motif Finding Experiments

References
[1] N. Jojic, V. Jojic, B. Frey, C. Meek, D. Heckerman. Using "epitomes" to model genetic diversity. NIPS.
[2] E. Eskin, P. Pevzner. Finding composite regulatory patterns in DNA sequences. Bioinformatics, 2002.
[3] E.P. Xing, M.I. Jordan, R.M. Karp, S. Russell. A hierarchical Bayesian Markovian model for motifs. NIPS 2003.
Motif finding algorithms (time/space complexity comparison):
• SPELLER (Sagot, 1998)
• MITRA (Pevzner, 2002) — space O(nNk)
• CENSUS (Evans, 2003) — space O(nNk)
• Voting (Chin, 2005) — space O(nV(m,k))
• RISOTTO (Pisanti, 2006)
• PMS (Davila, 2007)
All depend on the alphabet size |Σ| through the (k,m) mutational neighborhood.

Number of patterns shared between two substrings a and b: a closed-form count in k, m, |Σ|, and d(a,b).

Main approaches to (k,m)-motif finding (deterministic, complete):
• Enumeration: 1) enumerate all motifs within distance m for each string; 2) output the intersection of all such sets
• Trees & tries: a tree whose leaves/paths represent all possible motifs; 1) iterate over all paths from root to leaves; 2) output the leaves a such that a occurs in every string
• Explicit mapping (hashing): H[a] = number of votes for a (i.e., the number of strings containing a); 1) set H[a] = 0 for all a; 2) iterate over every input string and update H[·]; 3) output a if H[a] = N, i.e., a occurs in every string
Main drawback: these methods do not scale to large-|Σ| inputs, since their complexity depends on the alphabet size.

BIBM 2009, Washington D.C., Nov 1-4, 2009
D;=/'5(.1%*=++:%')%>';;'.%&'$()%!(.:+/#D;=/'5(.1%*=++:%')%>';;'.%&'$()%!(.:+/#
8,05-!-5&35&6!9,(.:*;50<!5(.+,=*1!0,
& "" $$( #'('()()! "$ #"## ()(*'(
*+,+-$('.%O,1'/($H;#%)'/%&'$()%!(.:(.1*+,+-$('.%O,1'/($H;#%)'/%&'$()%!(.:(.1
O,1'/($H;P%&'$()%-".:(:"$+%#+,+-$('.%C)+"#(B,+%(.#$".-+%#+$%+F$/"-$('.GD.=3$K% .157-046-0UAk8$B%<l0,/0@i=6:40J.-?0C44,;.C-62046I761;60.126D0%m02.4-C1;605C:C=6-6:02082A$=<
#E0fL-C.10C046-0,/08@i2<i=6:40LH0&'()$'&*(%+046>6;-.130205,4.-.,140C120:6=,K.130;,::645,12.1304H=L,>40000$E0U,:-0-?60:647>-.13046-0,/08@i2<i=6:4074.130;,71-.1304,:-%E0U;C10-?604,:-620>.4-0-,0;:6C-60>.4-0,/0C>>046I761;640.10J?.;?0@i=6:40C556C:)E0f7-57-0-?60@i=6:40-?C-0C556C:0.106K6:H046I761;6
Q3$=3$K%:627;62046-0SAk8LB><l0,/0@i=6:40C-02.4-C1;60,/0C-0=,4-020/:,=06C;?0.157-04-:.13
;
C ;
aC--6:10-:66
;C ;;
EEE
N6.3?L,:046-
/,:04-:.130#
N6.3?L,:046-0/,:04-:.130$
N6.3?L,:46-0/,:04-:.130%
+,-./046-0+
N,-6026561261;H0,1C>5?CL6-04.T6n
>'.-,3#('.#%".:%!3$3/+%N'/7>'.-,3#('.#%".:%!3$3/+%N'/7#E0O55>.640C40C0#,(("-.,0=6;?C1.4=0-,0C0KC:.6-H0,/00617=6:C-.,1iLC462086DC;-<00=,-./0/.126:4
$E0],=L.1640J.-?05:,LCL.>.4-.;0=,-./0/.126:40C400=,-./0/$0"&"$'(-#(%(/'1)-86E3EB0.157-0466240/,:0+b+bB0b+<
%E0U.31./.;C1->H0456627540.105:C;-.;60=,-./0/.12.13+0;C1046C:;?0/,:0>,136:B0>6440;,146:K620=,-./4
0000000000000-?C105:6K.,74>H0LH0617=6:C-.K60=6-?,24
U.31./.;C1-0456627540.105:C;-.;6n
!$,"!# !$ "" %##-$(.!/& % '()( *+,)-*). /0 1/.'0 *).
+9GSO0.=5:,K62 Z,-.130.=5:,K62
%%%%%%%%%%E.3;+/"$('.JB"#+:%;'$()%)(.:+/#
9E
9E 99E
?+:3-+:%5#%D.=3$?+:3-+:%5#%D.=3$%%%%%%%%%%%%%%%%%%%%%%%%#+$#+$
&'$() &DL?O *J&DL?O 9'$(.1 *J9'$(.1
8*B$< )E(% RPST ("E& RPSURPSU
8##B%< %"( VPU ') VPUVPU
8#%B)< *&$) UPS *'E( WPXWPX
8#&B&< ∞ VYR ##* VTPWVTPW
8#(B!< ∞ SWSZ #&$E* SXPTSXPT
Algorithm comparison on difficult (k,m)-motif discovery cases (protein sequences)

Dataset                                                Motif    Selection   No selection
Lipocalins (Lawrence, 1993)                            (16,6)   ?           ?
Zinc metallopeptidase (Zaslavsky, 2006)                (11,5)   ?           110650
Supersecondary structure (SSS) motifs (Kister, 2006)   (20,4)   ?           561.5
Input size vs. reduced input (feasible set size)
[Table: reduced (feasible) set size for input sizes of 11800, 19800, 39800, 59800, and 99800 k-mers.]
Protein sequences
[Pipeline: protein/DNA sequences → input set S of motif instances → feasible instance set selection → feasible instance set R (|R| << |S|) → starting motif models → motif refinement via enumerative algorithms (MITRA, Voting, etc.) or probabilistic algorithms (EM, MEME, Gibbs, etc.) → final motif model]
Selection Efficiency
[Bar chart: motif finding speedup via input reduction (motif length k=11, max. #mismatches m=3); input reduction factor vs. typical sequence length, growing from 286 to 2500.]
Fast Motif Selection for Biological Sequences
Pavel Kuksa, Vladimir Pavlovic
Department of Computer Science, Rutgers University
Problem
Motif Finding
[Diagram: sequences over alphabet Σ (DNA, protein) containing motif instances (inexact copies, unknown) of an unknown motif]
The (k, m)-motif finding problem
Input: set of strings S, |S| = N, constructed from an alphabet Σ
Output: set M of motifs such that each motif a ∈ Σ^k is at a minimum Hamming distance of at most m from each string in S (i.e., every string contains an instance of a with at most m mismatches)
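The output condition can be checked directly by brute force (an illustrative sketch; `hamming` and `is_motif` are my names, and this is the naive check, not one of the fast algorithms discussed here):

```python
def hamming(a, b):
    """Hamming distance between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def is_motif(a, strings, m):
    """a in Sigma^k is a (k, m)-motif for `strings` if every string
    contains some k-mer within Hamming distance m of a."""
    k = len(a)
    return all(
        any(hamming(a, s[i:i + k]) <= m for i in range(len(s) - k + 1))
        for s in strings
    )
```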
[Bar chart: motif finding speedup (running time) vs. alphabet size (motif length k=11, max. #mismatches m=3); speedup factors of 19, 35, 180, and 6986 as |Σ| grows from 10 to 50.]
[Bar chart: motif finding speedup (running time) vs. alphabet size (motif length k=11, max. #mismatches m=3); speedup factors of 13, 20, 50, 112, and 205 for |Σ| = 10, 15, 20, 50, 1000.]
Motif Finding Experiments

References
[1] N. Jojic, V. Jojic, B. Frey, C. Meek, D. Heckerman. Using "epitomes" to model genetic diversity. NIPS 2007
[2] E. Eskin, P.A. Pevzner. Finding composite regulatory patterns in DNA sequences. Bioinformatics, 2002
[3] E.P. Xing, M.I. Jordan, R.M. Karp, S. Russell. A hierarchical Bayesian Markovian model for motifs. NIPS 2003
O>3,:.-?= G.=60;,=5>6D.-H U5C;6
UabeebS08UC3,-B0#**'<
+9GSO08a6KT16:B0$""$< f81N@<
]bNU^U08bKC14B0$""%< f81N@<
Z,-.1308]?.1B0$""&< f81Z8=B@<<
S9UfGGf08a.4C1-.B0$""!<
a+U08MCK.>CB0$""(<
f81N$9C;87G< f81N$gJ<
f8@1N9C;87G<
f8@1N9C;87G<
f81N9C;87G<
f81N$9C;87G< f81N$<
f81$N9C;87G< f81$N<
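Here V(k, m) denotes the number of k-mers within Hamming distance m of a fixed k-mer (the mismatch neighborhood). Assuming the standard counting identity V(k, m) = Σ_{i=0}^{m} C(k, i)(|Σ| - 1)^i, it can be computed as:

```python
from math import comb

def hamming_ball_size(k, m, sigma):
    """V(k, m): number of k-mers within Hamming distance m of a fixed
    k-mer, over an alphabet of size sigma."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(m + 1))
```

For DNA, V(11, 3) = 4984, but for proteins (|Σ| = 20) the same neighborhood already exceeds one million k-mers, which is why the |Σ|-dependent complexities above blow up on protein inputs.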
[Figure: number of patterns shared between two substrings a and b.]
Main approaches to (k, m, Σ, N)-motif finding (deterministic, complete)

Enumeration: 1. enumerate all motifs within distance m of each string s; 2. output the intersection of all such sets

Trees & tries: tree with leaves/paths for all possible motifs a; 1. iterate over all paths from root to leaves; 2. output leaves a such that a occurs in every string

Explicit mapping (hashing): H[a] = # votes for a (i.e., number of strings containing a); 1. set H[a] = 0 for all a; 2. iterate over every input string and update H[·]; 3. output a if H[a] = N, i.e., a occurs in every string
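The explicit-mapping (voting) approach can be sketched in Python (an illustration with my own names; note the neighborhood enumeration is exponential in m and the alphabet size):

```python
from itertools import combinations, product

def neighborhood(kmer, alphabet, m):
    """All k-mers within Hamming distance m of kmer."""
    out = {kmer}
    for d in range(1, m + 1):
        for pos in combinations(range(len(kmer)), d):
            for subs in product(alphabet, repeat=d):
                cand = list(kmer)
                for p, c in zip(pos, subs):
                    cand[p] = c
                out.add(''.join(cand))
    return out

def voting_motifs(strings, alphabet, k, m):
    """H[a] counts how many strings contain a k-mer within distance m
    of a; the motifs are the candidates voted for by all N strings."""
    H = {}
    for s in strings:
        votes = set()  # each string votes at most once per candidate
        for i in range(len(s) - k + 1):
            votes |= neighborhood(s[i:i + k], alphabet, m)
        for a in votes:
            H[a] = H.get(a, 0) + 1
    return {a for a, n in H.items() if n == len(strings)}
```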
Main drawback: these approaches do not handle large-|Σ| (k, m)-motif instances, since their complexity depends strongly on the alphabet size (e.g., as |Σ|^k)
BIBM 2009, Washington D.C., Nov 1-4, 2009
Time efficiency• Finding k-length motifs planted in i.i.d. sequences with up to m substitutions

Table 2: Running time comparison on the challenging instances of the planted motif problem (DNA, |Σ| = 4, N = 20 sequences of length n = 600). Problem instances are denoted by (k, m, |Σ|), where k is the length of the motif implanted with m mismatches.

                Motif problem instances (k, m, |Σ|)
Algorithm       (9,2,4)  (11,3,4)  (13,4,4)  (15,5,4)  (17,6,4)  (19,7,4)
Stemming        0.95     8.8       31        187       1462      8397
MITRA [5]       0.89     17.9      203       1835      4012      n/a
PMSPrune [10]   0.99     10.4      103       858       7743      81010
RISOTTO [6]     1.64     24.6      291       2974      29792     n/a

Table 3: Running time, in seconds, on large-|Σ| inputs. (k, m) instances denote implanted motifs of length k with up to m substitutions.

      (9,2)               (11,3)               (13,4)                 (15,5)
|Σ|   MITRA    Stemming   MITRA      Stemming  MITRA      Stemming    MITRA   Stemming
20    8.39     0.637      1032.17    1.07      28905      5.247       n/a     12.31
50    89.82    0.633      12295.73   0.963     685015     2.244       n/a     11.92
100   265.94   0.645      n/a        0.967     > 1 month  2.227       n/a     11.86
(a) CRP binding sites
(b) Lipocalin motifs
Figure 3: (a) Recognition of CRP binding sites (k = 18, m = 7, |Σ| = 4). (b) Lipocalin motifs (k = 15, m = 7, |Σ| = 20).
superfamily we find long motifs of length 20 (using m = 4 mismatches) corresponding to the secondary structure units strand 1 - loop - strand 2 (VIPPISCPENE[KR]GPFPKNLV) and strand 3 - loop - strand 4 (YSITGQGAD[KNQT]PPVGVFII) (3D SSS motif [23]). Our algorithm finds 36 potential motif instances (out of 330 samples) after the selection (step 1) and takes about 47 seconds (compared to about 600 seconds using the trie traversal). In the immunoglobulin superfamily (C1-set domains), we find a sequence motif of length 19, SSVTLGCLVKGYFPEPVTV, which corresponds to strand 2 - loop - strand 3 secondary structure units (2E SSS motif).
5. CONCLUSIONS
We presented a new deterministic and exhaustive algorithm for finding motifs, the common patterns in sequences. Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially on large-alphabet sequences (e.g., proteins), as we showed on several motif discovery problems in both DNA and protein sequences. The proposed algorithms could be applied to other cases and challenging problems in sequence analysis and mining.
6. ACKNOWLEDGMENT
This research was partially supported by DIMACS, Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University.
7. REFERENCES
[1] Eric P. Xing, Michael I. Jordan, Richard M. Karp, and Stuart Russell. A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. In Proc. of Advances in Neural Information Processing Systems, pages 200–3. MIT Press, 2003.
[2] Pavel A. Pevzner and Sing-Hoi Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 269–278. AAAI Press, 2000.
[3] Jean-Marc Fellous, Paul H. E. Tiesinga, Peter J. Thomas, and Terrence J. Sejnowski. Discovering Spike
DNA sequences (|Σ| = 4, k = 9-19, m = 2-7), runtime [sec]

Large-alphabet setting (|Σ| = 20-100), runtime [sec]
• DNA: 3-10x speedup
• Proteins: 10-10^4x speedups for larger k, m
Motif Discovery: Our Contributions• Algorithm to improve search efficiency for
• large-alphabet inputs• large-length inputs• unlike prior art, does not depend on alphabet size
• Significant speed-ups in practice
• Can search for longer, less conserved motifs
• Outlook:• Applies as a speed-up mechanism to a variety of exact motif
finders (R could be used as an input instead of S)• Combines with probabilistic motif finders as a candidate
selector (e.g., MEME, EM)
Pavel Kuksa, Vladimir Pavlovic. Efficient motif finding algorithms for large-alphabet inputs, BMC Bioinformatics, 2010
Summary: Key Contributions• New framework (efficient algorithms and methods) for
sequence comparison, classification, pattern extraction• Spatial representations and kernels• Unified computational framework for string kernels with
inexact matching• Efficient (alphabet-independent) algorithms for motif
discovery• Benefit I (complexity): fast, alphabet-free sequence
matching (sufficient statistics, spatial representations)• Benefit II (accuracy): enhances performance on many
practical tasks (bio-informatics, text, music)• Extends to richer semi-supervised settings
• neighborhood kernels• abstraction-based kernels
Future Work• non-Hamming distances
• semi-supervised string kernels
• graph, image kernels
• information extraction, e.g., identifying relationships, extracting interesting/relevant sentences from documents
• sequence labeling
Thank you!
Journal Publications• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics, 10(Suppl 4):S2, 2009. Impact factor: 3.78• Pavel Kuksa and Vladimir Pavlovic. Efficient alignment-free DNA barcode analytics. BMC Bioinformatics, 10(Suppl 14):S9, 2009. Impact factor: 3.78• Pavel Kuksa and Vladimir Pavlovic. Efficient motif finding algorithms for large-alphabet inputs. BMC Bioinformatics, 11(Suppl 8):S1, 2010.
Conference papers• Pavel P. Kuksa and Vladimir Pavlovic. Efficient Motif Finding Algorithms for Large-Alphabet Inputs. In BIOKDD, 2010. Acceptance rate: 7/29 regular (24%)• Pavel Kuksa and Vladimir Pavlovic. Fast motif selection for biological sequences. In IEEE International Conference on Bioinformatics and Biomedicine BIBM'09, 2009. Acceptance rate: (44+37)/233 (35%)• Pavel P. Kuksa and Vladimir Pavlovic. Spatial Representation for Efficient Sequence Classification. In ICPR, 2010. Acceptance rate: 385/2140 oral (18%)• Pavel Kuksa and Vladimir Pavlovic. Efficient Alignment-free Barcode Analytics. In Third International Barcode of Life Conference, 2009.• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable Algorithms for String Kernels with Inexact Matching. In NIPS, 2008. Spotlight Presentation. Acceptance rate: 123/1022 (12%)
Conference papers• Pavel P. Kuksa, Yanjun Qi, Bing Bai, Ronan Collobert, Jason Weston, Vladimir Pavlovic, and Xia Ning. Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction. In ECML, 2010. Acceptance rate: 106/658 (16%)• Yanjun Qi, Ronan Collobert, Pavel Kuksa, Koray Kavukcuoglu, and Jason Weston. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management CIKM 2009, pp. 1737–1740, 2009. Acceptance rate: (123+171)/847 (20% short paper)• Yanjun Qi, Pavel P. Kuksa, Ronan Collobert, Kunihiko Sadamasa, Koray Kavukcuoglu, and Jason Weston. Semi-Supervised Sequence Labeling with Self-Learned Features. In Proc. International Conference on Data Mining (ICDM'09), IEEE, 2009. Acceptance rate: 8.9% regular (70/786)• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. On the role of local matching for efficient semi-supervised protein sequence classification. In BIBM, 2008. Acceptance rate: 38/156 (24%)• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. A fast, semi-supervised learning method for protein sequence classification. In 8th International Workshop on Data Mining in Bioinformatics (BIOKDD 2008), pp. 29–37, 2008. Acceptance rate: 8/25 (32%)• Pavel Kuksa and Yanjun Qi. Semi-Supervised Bio-Named Entity Recognition with Word-Codebook Learning. In SDM, 2010. Acceptance rate: 82/351 (23%)
Conference papers• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels. In Computational Systems Bioinformatics: Proceedings of the CSB2008 Conference, pp. 133–143, 2008. Acceptance rate: 30/135 (22%)• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast Protein Homology and Fold Detection with Sparse Spatial Sample Kernels. In 19th International Conference on Pattern Recognition ICPR 2008, 2008. Acceptance rate: 18% (oral). Best paper nominee.• Pavel Kuksa and Vladimir Pavlovic. Fast Barcode-Based Species Identification Using String Kernels. In Second International Barcode of Life Conference, 2007. Acceptance rate: 30% (oral)• Pavel Kuksa and Vladimir Pavlovic. Fast Kernel Methods for SVM Sequence Classifiers. In WABI, pp. 228–239, 2007. Acceptance rate: 37/131 (28%)
Workshop papers• Pavel Kuksa and Vladimir Pavlovic. Efficient Sequence Classification with Spatial Representations. In Snowbird Learning Workshop, April 2010. Oral presentation.• Jason Weston, Ronan Collobert, Frederic Ratle, Hossein Mobahi, Pavel Kuksa, and Koray Kavukcuoglu. Deep Learning via Semi-Supervised Embedding. In ICML 2009 Workshop on Learning Feature Hierarchies, 2009.• Pavel Kuksa and Vladimir Pavlovic. Efficient Discovery of Common Patterns in Sequences. Snowbird Learning Workshop, Clearwater, Florida, April 13-16, 2009.• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. High Performance Sequence Classification with Novel Spatial Sample Embedding. 3rd Annual Machine Learning Symposium, NY, Oct 10, 2008.• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Spatially-constrained sample kernel for sequence classification. Snowbird Learning Workshop, Utah, April 1-4, 2008.• Pavel Kuksa and Vladimir Pavlovic. Kernel methods for DNA barcoding. Snowbird Learning Workshop, San Juan, Puerto Rico, March 2007.
References• Bio-Informatics
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics, 10(Suppl 4):S2, 2009
• Pavel Kuksa and Vladimir Pavlovic. Efficient alignment-free DNA barcode analytics. BMC Bioinformatics, 10(Suppl 14):S9, 2009
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels. In Computational Systems Bioinformatics: Proceedings of the CSB2008 Conference, pp. 133–143, 2008
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast Protein Homology and Fold Detection with Sparse Spatial Sample Kernels. In 19th International Conference on Pattern Recognition ICPR 2008, 2008.
• String kernels
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable Algorithms for String Kernels with Inexact Matching. In NIPS, 2008. Spotlight Presentation
• Pavel Kuksa and Vladimir Pavlovic. Fast Kernel Methods for SVM Sequence Classifiers. In WABI, pp. 228–239, 2007.
• Pavel P. Kuksa and Vladimir Pavlovic. Spatial Representation for Efficient Sequence Classification. In ICPR, 2010.
• NLP
• “Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction.”, Pavel Kuksa, Yanjun Qi, Bing Bai, Ronan Collobert, Jason Weston, ECML, 2010
• Pavel Kuksa and Yanjun Qi. Semi-Supervised Bio-Named Entity Recognition with Word-Codebook Learning. In SDM, 2010.
• Yanjun Qi, Pavel P. Kuksa, Ronan Collobert, Kunihiko Sadamasa, Koray Kavukcuoglu, and Jason Weston. Semi-Supervised Sequence Labeling with Self-Learned Features. In Proc. International Conference on Data Mining (ICDM'09), IEEE, 2009.