Scalable Kernel Methods and Algorithms for General
Sequence Analysis
Pavel P. Kuksa
Department of Computer Science, Rutgers University
PhD Pre-Defense, 2010
Our Topic
• Classification of sequences:
  • inferring functions of proteins, species labels from DNA, music genre, or text topic
• Approach: similarity-based inference using kernel functions
• Pattern detection/extraction: e.g., motif discovery
[Diagram: application domains — audio/music (genre, artist, annotation), text, bioinformatics]
Problem-Solving Framework
• Kernel methods for sequences
• Infer class labels via sequence similarity measures (kernels): K(X, Y) = <F(X), F(Y)> (dot products of feature vectors)
  • measure similarity between two objects (e.g., documents, bio-sequences)
  • have a corresponding (dot-product) feature space F (what should F be for sequences?)
  • the problem is mapped from the original sequence space to a vector feature space
String Kernel Concept
GGAATTTGAGCAGGACTAATTGGAACTTCTTTAAGATTACTTATTCGAACTGAATTAGGAACCCCAGGATCTTTAATTGGAGATGATCAAATTTATAATACAAT
TGTTACAGCTCATGCATTTATTATAATTTTTTTTATAGTTATACCTATTATAATCGGAGGATTTGGAAATTGACTAGTTCCATTAATAATAGGTGCCCCAGATATAG
CTTTCCCCCGTATAAATAACATAAGATTTTGATTATTACCCCCATCTTTAACTTTATTAATTTCAAGAAGAATTGTTGAAAATGGGGCTGGTACAGGATGAACA
GTTTATCCCCCTCTTTCATCAAATATCGCCCATCAAGGAGCATCTGTTGATTTAGCAATTTTTTCCCTTCATCTTGCTGGTATTTCATCAATTCTTGGAGCTA
TTAATTTTATTACAACAATTATTAATATACGAATTAATAATTTATCTTTTGATCAAATACCATTATTTGTTTGAGCTGTAGGAATTACAGCATTATTATTATTACTTTC
ATTACCTGTTTTAGCAGGTGCTATTACTATATTATTAACAGATCGAAATTTAAATACTTCTTTTTTTGATCCTGCAGGAGGAGGAGATCCAATCTTATACCAACA
CTTATTT
Spectrum(5): GGAAT
Mismatch(5,1): GxAAT, xGAAT, GGxAT, GGAxT, GGAAx
Fig. 1. String kernel. A sliding window of size k=5 is used to derive feature representation φ(X) of an input sequence X.
structural families. A novel unsupervised "abstraction" strategy is applied to the classic string kernel to leverage supervised learning with the power of many unannotated protein sequences. The "abstraction" includes two stages:
• (1) An unsupervised auxiliary task is employed to learn accurate representations (embeddings) for amino acids based on contextual similarity in biological sequences.
• (2) Amino acids are grouped to generate more abstract entities according to their learned representations.
On three benchmark data sets for structural classification of proteins, the proposed semi-supervised kernel achieves state-of-the-art performance, improves over classic string kernels, and is more general than previous approaches.
2. Protein Sequence Analysis Using String Kernels
In this paper we focus on string kernel-based methods [6] for classification of protein sequences in the discriminative learning setting. The discriminative setting tries to capture the differences among classes (e.g., folds and superfamilies), while its counterpart, generative models [7], focuses on capturing shared characteristics within classes. Previous studies [8, 9] show that discriminative models have better distinction power than generative models [7].
Our target problems of remote protein homology and fold recognition essentially try to infer structural relationships between proteins from their primary sequences. These problems can be treated as tasks of learning over sequences of amino acids that typically describe the patterns of protein structural properties of interest. Given an input protein sequence, the target structural relationship is recognized by matching the patterns present in the test sequence against the patterns in the annotated training examples. This amounts to computing a string kernel (similarity score) between protein sequences based on certain feature representations. The key idea of string kernels is to apply a mapping φ(·) that maps sequences of arbitrary length into a vectorial feature space of fixed dimension. In this space a standard classifier such as a support vector machine (SVM) [6] can then be applied. As SVMs require only inner products between examples in the feature space, rather than the feature vectors themselves, the string kernel [10, 9, 4] is thus implicitly computed as an inner product in this feature space:
K(X, Y) = ⟨φ(X), φ(Y)⟩,   (1)
where X = (x₁, ..., x_|X|), |X| denotes the length of the sequence, and X, Y ∈ S, the set of input sequences.
[Figure: input sequence X mapped to feature vector F(X); similar spectra ⇒ high similarity; window length k=5, up to m=1 mismatch]
K(X, Y) = <F(X), F(Y)>
Three important questions:
1. Representation: what are the features?
2. Computational complexity: how efficient is the similarity evaluation? (quadratic? linear?)
3. Pattern extraction: what features are important/representative?
K(X, Y) = <F(X), F(Y)>
Main Results
• Sparse spatial sample embedding for sequences
  • multi-resolution representation of sequences
  • more efficient and accurate
• Unified computational framework for string kernels with inexact matching
  • generalizes computation and significantly improves the time-efficiency of many string kernel functions
• Algorithms for pattern (motif) extraction from sequence data
  • significantly improve search efficiency over existing algorithms
String Kernels
• Original string kernels
  • Pairwise-alignment algorithms (Needleman-Wunsch): not Mercer kernels [Vert et al. '04]
  • Pair HMMs [Watkins '99], convolution kernels [Haussler '99], gappy n-gram kernels [Lodhi et al. '02], rational kernels [Cortes et al. '02]
  • O(n²) complexity per pair in sequence length n
• Spectrum-like kernels
  • Spectrum kernels [Leslie '02], mismatch kernels [Leslie '04], substring kernels [Vishwanathan & Smola '02]
  • O(n) complexity per pair in sequence length n
  • Best-performing kernels have large multiplicative complexity factors (e.g., exponential dependency on alphabet size and kernel parameters)
• Accuracy and algorithmic complexity still insufficient for large-scale sequence comparison/annotation
Spectrum Kernels
• Measure similarity between sequences based on co-occurrence of fixed-length substrings (k-mers)
• Feature map for the exact spectrum kernel (no mismatches)
• Feature map for the mismatch kernel (m mismatches)
  • a denser representation (up to k^m|Σ|^m more dense)
Motivation | Our Results/Contributions | Summary
Basic Problems That We Studied | Previous Work
Spectrum-like kernels: measure similarity based on common contiguous fixed-length substrings (k-mers).
Feature map for the exact spectrum kernel (no mismatches):
Φ(X) = ( Σ_{α∈X, |α|=k} I(α, γ) )_{γ∈Σ^k}
sparse representation: e.g. HKY → (0 0 ... 0 1_(HKY) 0 ... 0)
Feature map for the mismatch kernel (m mismatches):
Φ(X) = ( Σ_{α∈X, |α|=k} I_m(α, γ) )_{γ∈Σ^k}
less sparse representation (up to k^m|Σ|^m more dense): e.g. HKY → (0 1_(AKY) ... 1_(HAY) ... 1_(HKA) 0 ... 1_(HKY) 0 ... 0)
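The two feature maps above can be rendered by brute force for tiny examples (a sketch under my own naming; it enumerates Σ^k explicitly, so it is exponential in k — real spectrum/mismatch kernels never materialize this space):

```python
from itertools import product

SIGMA = "ACGT"  # toy alphabet; proteins would use the 20 amino acids

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def feature_map(seq, k, m):
    """phi(X)[gamma] = sum over k-mers alpha of X of I_m(alpha, gamma);
    m=0 gives the sparse spectrum map, m>0 the denser mismatch map."""
    phi = {"".join(g): 0 for g in product(SIGMA, repeat=k)}
    for i in range(len(seq) - k + 1):
        alpha = seq[i:i + k]
        for gamma in phi:
            if hamming(alpha, gamma) <= m:
                phi[gamma] += 1
    return phi

spec = feature_map("ACG", k=3, m=0)  # only the coordinate ACG is nonzero
mism = feature_map("ACG", k=3, m=1)  # ACG plus its 9 one-substitution neighbors
```

The nonzero-coordinate counts (1 vs. 10 here) illustrate the "up to k^m|Σ|^m more dense" remark.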
Pavel Kuksa Kernel Methods and Algorithms for General Sequence Analysis
Profile Kernel
• Probabilistic string representation
• Replaces the n×1 string representation with an n×|Σ| profile
• Each character in a sequence is replaced with a probability distribution over alphabet characters
• Measures sequence similarity based on co-occurrence (in a probabilistic sense!) of k-mers
• State-of-the-art in protein remote homology
Profile can be estimated using PSI-BLAST, etc.
Profile feature map:
Φ(X) = ( Σ_{M_σ∈X} I(γ ∈ M_σ) )_{γ∈Σ^k},  M_σ = {a : P(a) > σ}  (mutation neighborhood)
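In the simplified per-position form stated on the slide (M_σ = {a : P(a) > σ}), the mutation neighborhood of a profile window can be sketched as below — a toy illustration with a hypothetical hand-made profile, not the PSI-BLAST-based pipeline of the actual profile kernel:

```python
from itertools import product

def kmer_neighborhood(profile_window, sigma):
    """All k-mers gamma whose character at each position j has P(gamma_j) > sigma.
    profile_window: list of k dicts {character: probability}."""
    allowed = [[a for a, p in pos.items() if p > sigma] for pos in profile_window]
    return {"".join(chars) for chars in product(*allowed)}

# toy 2-position profile (in the real setting, estimated by PSI-BLAST)
profile = [{"A": 0.7, "C": 0.2, "G": 0.1}, {"A": 0.1, "T": 0.9}]
print(sorted(kmer_neighborhood(profile, sigma=0.15)))  # → ['AT', 'CT']
```

Each sequence position then contributes counts to every k-mer in its neighborhood, rather than to a single observed k-mer.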
String Kernels: Summary
• All kernels share the same feature space, indexed by all possible substrings of length k (k-mers)
• Position or spatial arrangement of features is not retained (representation issue)
• Best-performing methods are computationally expensive (e.g., the mismatch kernel)
Spatial Representations
• New spatial representations and kernels for sequences
  • model sequences at multiple scales, preserving positional information
  • have time complexity O(cn), linear in the sequence length n with a small, alphabet-independent constant c
  • display state-of-the-art performance on protein, text, and music classification tasks
Sparse Spatial Sample Embedding
• Models multi-scale spatial/temporal dependencies
• Matching, comparison, classification more efficient: O(cn)
• Example: k=1, t=2, d=1-5
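For the t=2 ("double") case shown here, the features are pairs of k-mers together with their separation d; the t=3 case is analogous with triples. A minimal sketch (my own naming and indexing convention, where d=1 means adjacent samples):

```python
from collections import Counter

def double_features(seq, k=1, d_max=5):
    """Sparse spatial sample features for t=2: triples (a, d, b) where
    k-mers a and b occur separated by d positions, 1 <= d <= d_max."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        a = seq[i:i + k]
        for d in range(1, d_max + 1):
            j = i + k + d - 1            # d=1: b starts right after a
            if j + k <= len(seq):
                feats[(a, d, seq[j:j + k])] += 1
    return feats

def spatial_kernel(x, y, **kw):
    fx, fy = double_features(x, **kw), double_features(y, **kw)
    return sum(c * fy[f] for f, c in fx.items())

f = double_features("HKYNQLIM")   # contains e.g. ('H', 1, 'K'), ('H', 5, 'L'), ...
```

Because a single substitution destroys only the pairs that touch the mutated position, many spatial features survive mutation — the overlap property discussed in the sequence-modeling example below.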
• Differences in sequence modeling
  • mismatch: very few features retained
  • spatial: still significant overlap in features
Sequence Modeling
Example: S = HKYNQLIM, mutated to S' = HKINQIIM.
mismatch(5,1) features of S (x = wildcard position): xKYNQ, HxYNQ, HKxNQ, HKYxQ, HKYNx; xYNQL, KxNQL, KYxQL, KYNxL, KYNQx; xNQLI, YxQLI, YNxLI, YNQxI, YNQLx; xQLIM, NxLIM, NQxIM, NQLxM, NQLIx — and likewise for S'.
double-(1,5) features of S (character pairs at distances 1-5):
d=1: HK, KY, YN, NQ, QL, LI, IM
d=2: H_Y, K_N, Y_Q, N_L, Q_I, L_M
d=3: H__N, K__Q, Y__L, N__I, Q__M
d=4: H___Q, K___L, Y___I, N___M
d=5: H____L, K____I, Y____M
(and analogously for S': HK, KI, IN, NQ, QI, II, IM; H_I, K_N, I_Q, N_I, Q_I, I_M; ...)
Differences in handling substitutions by spectrum-like and spatial kernels:
• Mismatch: very few features retained
• Spatial kernels: still significant overlap in sequence features
• For the spectrum-like kernel, the number of allowed mismatches needs to be increased (at a high computational cost)
SSK Evaluation: Representational Ability; Algorithm & Running Time
• Semi-supervised setting: 1M unlabeled sequences (NR dataset)
• Structural classification of proteins (SCOP), 54 superfamilies
• Large-scale protein annotation
Text: Test F1 Scores
• Improves over many state-of-the-art methods; 9K × 9K Gram matrix in < 60 sec
• SS = subsequence kernel, NG = n-gram kernel, KSG = key-substring-group method, TF-IDF = tf-idf kernel
• Reuters-21578 dataset: 8 topics, ~9K documents, 30K words
• Task: classify text documents by topic
Music Genre Prediction
• 10 genres, ~30 s of audio per song
• VQ with |Σ|=1024 codewords (MFCC features), sequence length ~7000

Method                                            Error
Mismatch kernel (k=5, m=1)                        35.6
Spatial kernel (t=3, k=1, d=5)                    29.4
Baseline 1: DWCH (Daubechies wavelet), Li et al.  41.6
Music Artist Recognition
• Album-wise cross-validation over 20 artists (120 albums)
• VQ with |Σ|=1024 codewords

Method                          Error
Mismatch kernel (k=5, m=1)      44.66
Spatial kernel (t=3, k=1, d=5)  32.50
Baseline 1: GMM                 44.0
Spatial Kernels: SSSK Evaluation
• Efficient computation using sorting + counting (alphabet-size independent)
• Alphabet-free matching
• Order-of-magnitude faster than existing methods for inexact sequence matching (e.g., mismatch, subsequence methods)
• ~10^4 speedup for large alphabets; ~10^2-10^3 speedup for long sequences
SSSK: Our Contributions
• New spatial representation for sequence data (improves representational ability)
• Fast O(cn) evaluation (alphabet-independent)
  • important for bio-sequences, text, music, etc.
• State-of-the-art performance for remote homology, fold prediction, text & music classification
Pavel Kuksa, Vladimir Pavlovic, "Spatial representations for efficient sequence classification", ICPR 2010
Algorithms for String Kernels with Inexact Matching
What are the best algorithms for string kernels?
• Suffix trees (Vishwanathan & Smola, 2002): linear-time all-substring kernel
• (Sparse) dynamic programming (Rousu, 2005): gapped kernel
• Mismatch trie (Leslie, 2004): applies to mismatch, spectrum, gapped kernels, etc. (inexact matching!)
• Complexity limits applicability to small k, m, |Σ|
O(k^{m+1}|Σ|^m(|X| + |Y|)) for k-mers (contiguous substrings of length k) with up to m mismatches and alphabet size |Σ|. This limits the applicability of such algorithms to simpler transformation models (shorter k and m) and smaller alphabets, reducing their practical utility on complex real data. As an alternative, more complex transformation models such as [2] lead to state-of-the-art predictive performance at the expense of increased computational effort.

In this work we propose novel algorithms for modeling sequences under complex transformations (such as multiple insertions, deletions, and mutations) that exhibit state-of-the-art performance on a variety of distinct classification tasks. In particular, we present new algorithms for inexact (e.g. with mismatches) string comparison that improve currently known time bounds for such tasks and show orders-of-magnitude running time improvements. The algorithms rely on an efficient implicit computation of mismatch neighborhoods and k-mer statistics on sets of sequences. This leads to a mismatch kernel algorithm with complexity O(c_{k,m}(|X| + |Y|)), where c_{k,m} is independent of the alphabet Σ. The algorithm can be easily generalized to other families of string kernels, such as the spectrum and gapped kernels [6], as well as to semi-supervised settings such as the neighborhood kernel of [7]. We demonstrate the benefits of our algorithms on many challenging classification problems, such as detecting homology (evolutionary similarity) of remotely related proteins, recognizing protein folds, and classifying music samples. The algorithms display state-of-the-art classification performance and run substantially faster than existing methods. The low computational complexity of our algorithms opens the possibility of analyzing very large datasets under both fully-supervised and semi-supervised settings with modest computational resources.
2 Related Work
Over the past decade, various methods have been proposed to solve the string classification problem, including generative approaches, such as HMMs, and discriminative approaches. Among the discriminative approaches, in many sequence analysis tasks, kernel-based [8] machine learning methods provide the most accurate results [2, 3, 4, 6].

Sequence matching is frequently based on co-occurrence of exact sub-patterns (k-mers, features), as in spectrum kernels [9] or substring kernels [10]. Inexact comparison in this framework is typically achieved using different families of mismatch [3] or profile [2] kernels. Both the spectrum-k and mismatch(k,m) kernels directly extract string features from the observed sequence X. On the other hand, the profile kernel, proposed by Kuang et al. in [2], builds a profile [11] P_X and uses a similar |Σ|^k-dimensional representation derived from P_X. Constructing the profile for each sequence may not be practical in some application domains, since the size of the profile depends on the size of the alphabet. While for bio-sequences |Σ| = 4 or 20, for music or text classification |Σ| can potentially be very large, on the order of tens of thousands of symbols. In this case, a very simple semi-supervised learning method, the sequence neighborhood kernel, can be employed [7] as an alternative to lone k-mers with many mismatches.

The most efficient available trie-based algorithms [3, 5] for mismatch kernels have a strong dependency on the size of the alphabet and the number of allowed mismatches, both of which need to be restricted in practice to control the complexity of the algorithm. Under the trie-based framework, the list of k-mers extracted from the given strings is traversed in a depth-first search with branches corresponding to all possible σ ∈ Σ. Each leaf node at depth k corresponds to a particular k-mer feature (an exact or inexact instance of the observed exact string features) and contains a list of matching features from each string. The kernel matrix is updated at leaf nodes with the corresponding counts. The complexity of the trie-based algorithm for mismatch kernel computation for two strings X and Y is O(k^{m+1}|Σ|^m(|X| + |Y|)) [3]. The complexity depends on the size of Σ since, during trie traversal, possible substitutions are drawn from Σ explicitly; consequently, to control the complexity of the algorithm we need to restrict the number of allowed mismatches (m) as well as the alphabet size (|Σ|). Such limitations hinder wide application of this powerful computational tool: in biological sequence analysis, mutations, insertions and deletions frequently co-occur, establishing the need to relax the parameter m; on the other hand, restricting the alphabet size strongly limits applications of the mismatch model. While other efficient string algorithms exist, such as [6, 12] and the suffix-tree based algorithms in [10], they do not readily extend to the mismatch framework. In this study, we aim to extend the work presented in [6, 10] and close the existing gap in theoretical complexity between the mismatch and other fast string kernels.
Mismatch Trie Algorithm
• Depth-first search over a complete tree with leaves/paths for all possible k-length substrings
• Problematic for large-Σ inputs and relaxed matching (larger m)
[Figure: trie of depth k with |Σ|^k leaves; branches labeled A, C, G, ...]
O(k^m|Σ|^m) — the number of k-mers at Hamming distance at most m from a given k-mer
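The exact count behind this bound follows from choosing i ≤ m positions to mutate, each to one of |Σ|−1 other symbols: Σ_{i=0}^{m} C(k, i)(|Σ|−1)^i. A quick sketch (my helper, for illustration):

```python
from math import comb

def neighborhood_size(k, m, sigma_size):
    """Exact number of k-mers within Hamming distance m of a fixed k-mer:
    pick i positions to mutate, each to one of (|Sigma| - 1) other symbols."""
    return sum(comb(k, i) * (sigma_size - 1) ** i for i in range(m + 1))

print(neighborhood_size(5, 1, 4))    # DNA, mismatch(5,1) → 16
print(neighborhood_size(5, 2, 20))   # protein, mismatch(5,2) → 3706
```

The |Σ|^m growth is exactly why trie-based traversal becomes impractical for large alphabets such as the 1024-codeword music representations used later.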
Our Results
• New sufficient-statistics-based algorithms for kernel computation
• Improved complexity bounds over existing algorithms; removes the dependency on the alphabet size
• Several orders-of-magnitude speed-up ⇒ can now be applied to large-alphabet inputs (music, text, time series, etc.)
• Enhanced performance: more complex kernels improve predictive accuracy
• 10^4 speed-up for large (|Σ|=1000) alphabets
Figure 1: Relative running time vs. alphabet size
Scalable Algorithms for String Kernels with Inexact Matching
Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Theoretical results
• New algorithmic approach for string kernels with inexact matching, including mismatch, gapped, and other count-based kernels
• Improves complexity bounds over existing algorithms
• Extends to large-scale semi-supervised learning (e.g. the neighborhood kernel approach)
Empirical results
• Several orders of magnitude speed-up
• Explores more complex kernels yielding improved predictions
Points of Interest
• Removing the dependency on the alphabet size under the mismatch measure: O(k^m|Σ|^m n) → O(c_{k,m} n)
• Generalization of algorithms for kernels on strings, improving time-efficiency
• Enhancing performance with new algorithms
  • computational biology: recognizing protein fold from sequence, detecting remote evolutionary similarity
  • general sequence data: music genre recognition
• State-of-the-art results for protein classification
Problem: what are the best algorithms for strings under inexact matching?
Speed-up:  Protein 100×  |  DNA 75×  |  Text >10^6×  |  Music 10^4×
Table 2: Remote homology detection
                 Original  Relaxed
Supervised          41.92    52.00
Semi-supervised     81.05    86.32
http://seqam.rutgers.edu/projects/new-inexact/

Table 1: Protein fold prediction (multi-class)
                 Original  Relaxed
Supervised          46.78    52.87
Semi-supervised     70.87    77.51
[Figure: sequence data → feature index (unordered vs. ordered) → similarity matrix]

Table 3: Music genre classification (multi-class)
Original  57.4
Relaxed   64.4
• Reduces error by 15% compared to the best available method
• Reduces error by 30% compared to the state-of-the-art profile kernel
Our Approach: Sufficient Statistics
Two steps:
(1) Exact spectrum computation with counting sort
(2) Sufficient statistics for string kernels, computed using (1) as a sub-algorithm
http://seqam.rutgers.edu/projects/new-inexact
Problem
• Explicit feature vectors are not readily available for many data domains (e.g. biosequences, text, audio data)
  • explicit feature extraction can be complex and expensive
• Alternative: use kernel functions, i.e. inner products between data points in a higher-dimensional feature space
  • e.g. the feature space formed by all subsequences of length k (subsequence, gapped, or mismatch string kernels)
  • as the dimensionality grows exponentially (|Σ|^k), direct computation of the inner products is prohibitive
  • e.g. music with a 1024-symbol alphabet: ~10^9-10^15 features (k=3-5)
• Need efficient algorithms for the inner product (kernel)
How do we compute string kernel inner products more efficiently for large-|Σ| inputs and inexact matching with relaxed constraints (e.g. many mismatches)?
Our approach
• Sufficient-statistics-based method for efficient RKHS inner product evaluation for string kernels
  • generalizes computation and improves the time-efficiency of many existing string kernel functions
• Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05), mismatch trie (Leslie '04)
• This work: new algorithmic approach (sufficient statistics & sorting); improves complexity bounds over prior work; orders of magnitude speed-up
• Open questions addressed in this work:
  • dependency on |Σ| for the mismatch model
  • a unified computational framework for string kernels with inexact matching
  • efficient computation for large-|Σ| inputs
Main Contributions
New efficient algorithms based on sufficient statistics for string kernel computation
Theoretical results
• New algorithmic approach for string kernel computation with inexact matching
• Improved complexity bounds over existing algorithms
Empirical results
• Orders of magnitude speed-up; large datasets (millions of examples, semi-supervised)
• State-of-the-art results for protein classification
Algorithm Overview
I. Mismatch kernel function: rewriting (1) we get (2).
This gives a quadratic (Hamming distance-based) algorithm that does not depend on |Σ|.
"#$%&'%()%*'++',-
-%./)+)01$234$.%$1%+
2.#+/01$2%3+#+13+1/34
We use sorting-based counting to compute the cumulative statistics C_i in linear time, then derive the exact statistics M_i from C_i to evaluate the inner product (1), i.e. we obtain a linear-time algorithm independent of |Σ|.
10,000x faster for large (|∑|=1000) alphabets
Relative running time vs. alphabet size
K(X, Y) = Σ_{γ∈Σ^k} c(γ|X) c(γ|Y)   (1)
c(γ|X) = Σ_{α∈X} I_m(α, γ)
I_m(γ, α) = 1 if d(γ, α) ≤ m, 0 otherwise
d(γ, α) — Hamming distance between γ and α
K(X, Y) = Σ_{α∈X} Σ_{β∈Y} Σ_{γ∈Σ^k} I_m(γ, α) I_m(γ, β)
(the number of neighbors γ shared by α and β is constant for a given d(α, β)!)
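The parenthetical claim — the shared-neighbor count depends only on d(α, β) — is what collapses the triple sum into per-distance statistics. It can be verified by brute force on a tiny alphabet (a sketch; the parameters are mine):

```python
from itertools import product

SIGMA, k, m = "AC", 4, 1   # tiny binary alphabet keeps the brute force small

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def shared_neighbors(a, b):
    """|{gamma in Sigma^k : d(gamma, a) <= m and d(gamma, b) <= m}|."""
    return sum(1 for g in product(SIGMA, repeat=k)
               if hamming(a, g) <= m and hamming(b, g) <= m)

# collect the intersection sizes observed at each pair distance d(a, b)
by_dist = {}
for a in product(SIGMA, repeat=k):
    for b in product(SIGMA, repeat=k):
        by_dist.setdefault(hamming(a, b), set()).add(shared_neighbors(a, b))

assert all(len(sizes) == 1 for sizes in by_dist.values())  # one size per distance
```

Since the intersection size is a function of d(α, β) alone, the kernel only needs the counts M_i of k-mer pairs at each distance i, never the pairs themselves.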
Remote homology detection (ROC50)
                 Original (5,1)  Relaxed (7,3)
Supervised               41.92          52.00
Semi-supervised          81.05          89.52

Multi-class protein fold prediction, %
Method / Accuracy  Original (5,1)  Relaxed (7,3)
Supervised                  46.78          52.87
Semi-supervised             70.87          77.51

Music genre classification accuracy (multi-class), %
Original  57.4
Relaxed   64.4
K(X, Y) = Σ_{α∈X} Σ_{β∈Y} N(α, β)   (2)
K(X, Y) = Σ_{i=0}^{2m} M_i N_i   (3)
where M_i is the number of k-mer pairs (α, β) at Hamming distance i, and N_i is the number of neighbors shared by a pair at distance i. Some kernels can be evaluated directly from the cumulative statistics, K(X, Y) = Σ_{i=0}^{m} C_i, and the exact spectrum kernel is simply K(X, Y) = M_0.
Results
[Figure: spectrum feature vector Φ(2)('ACGAC') with m=0 — counts (0, 1, 2) over the k-mer features AA, AC, AG, ..., CG, ..., GA, ..., TT, indexed 1 ... |Σ|^k; feature/index axes shown unordered vs. ordered]
k-mer feature space: Φ(X) = (c(γ|X))_{γ∈Σ^k},  K(X, Y) = ⟨Φ(X), Φ(Y)⟩
References
• Vishwanathan & Smola, Fast kernels for string and tree matching, NIPS 2002.
• Leslie & Kuang, Fast string kernels using inexact matching, JMLR 2004.
• Rousu & Shawe-Taylor, Efficient computation of gapped substring kernels on large alphabets, JMLR 2005.
• Lodhi et al., Text classification using string kernels, JMLR 2002.
[Table: kernel matrix computation time (s), trie-based vs. sorting-based (sufficient statistics) algorithms.]
String kernels
• Kernel functions defined on feature vectors computed from strings.
• Features: substrings (contiguous), subsequences (non-contiguous).
[Figure: example sequences (AACACTCTG, AACTG, GCAAC, GCACA, AAAAC) mapped to binary feature vectors over the feature space AAA-TTT.]
• Iteratively remove i positions, sort, and count the number of matches for (k-i)-mers.
• This reduces computation to a series of spectrum kernel computations:
C_i = M_i + Σ_{j=0}^{i-1} C(k-j, i-j) M_j
NIPS 2008,December 8-11
II. Using sufficient statistics to compute string kernel functions
a. Spectrum kernels (Leslie, 2002);
b. Wildcard kernels (Leslie, 2004);
c. Gapped kernels (Leslie, 2004);
d. Spatial kernels (Kuksa, 2008);
e. Neighborhood kernels (Weston, 2005): semi-supervised, neighborhood-based,
   K(N(X), N(Y)) = K(X', Y')
Input sequences: X1: GGAA, X2: TTTGAA, X3: GGAAT
[Figure: 3-mers extracted from the input sequences with their sequence indices, before and after counting sort.]
Kernel computation with counting sort:
Kernel updates: K(J,J) = K(J,J) + cc^T
Algorithm (Mismatch-SS). Mismatch kernel algorithm based on sufficient statistics.
Input: strings X, Y; parameters k, m.
1. Compute the min(2m,k) cumulative matching statistics C_i using counting sort.
2. Compute the exact matching statistics M_i.
3. Evaluate the kernel using K(X,Y) = Σ_i M_i I_i.
Output: mismatch kernel value K(X,Y|k,m).
C_i = M_i + Σ_{j=0}^{i-1} C(k-j, i-j) M_j
M_i = C_i − Σ_{j=0}^{i-1} C(k-j, i-j) M_j
[Figure: counting sort of the extracted k-mers, grouped by feature with per-sequence counts (e.g., Seq. index 1, 2, 3 with counts 1, 1, 1 for a k-mer shared by all three sequences).]
• Jointly sort the observed k-mers in N(X) and N(Y).
• Apply the desired kernel evaluation method (e.g., spectrum, mismatch, gapped, etc.).
Input sequences → substring kernel (k=3, m=0).
Kernel updates (per substring): K(J,J) = K(J,J) + cc^T, where c is the vector of feature counts and J the sequence index.
Counting sort gives O(kn) evaluation for the exact spectrum kernel.
How does this extend to inexact matching (m>0)?
Mismatch Kernel
builds a profile [10] PX and uses a similar |Σ|k-dimensionalrepresentation, derived from PX . Constructing the profilefor each sequence may not be practical in some applicationdomains, since the size of the profile is dependent on thesize of the alphabet set. While for bio-sequences |Σ| = 4or 20, for music or text classification |Σ| can potentially bevery large, on the order of tens of thousands of symbols. Inthis case, a very simple semi-supervised learning method,the sequence neighborhood kernel, can be employed [11] asan alternative to lone k-mers with many mismatches.
Some of the most efficient available trie-based algo-rithms [3], [5] for mismatch kernels have a strong de-pendency on the size of alphabet set and the number ofallowed mismatches, both of which need to be restrictedin practice to control the complexity of the algorithm.Under the trie-based framework, the list of k-mers extractedfrom given strings is traversed in a depth-first search withbranches corresponding to all possible σ ∈ Σ. Each leafnode at depth k corresponds to a particular k-mer feature(either exact or inexact instance of the observed exact stringfeatures) and contains a list of matching features fromeach string. The kernel matrix is updated at leaf nodeswith corresponding counts. The complexity of the trie-basedalgorithm for mismatch kernel computation for two stringsX and Y is O(km+1|Σ|m(|X| + |Y |)) [3]. The algorithmcomplexity depends on the size of Σ since during a trietraversal, possible substitutions are drawn from Σ explicitly;consequently, to control the complexity of the algorithm weneed to restrict the number of allowed mismatches (m), aswell as the alphabet size (|Σ|).
We note that most existing k-mer string kernels (e.g., mismatch/spectrum kernels, gapped and wildcard kernels) essentially use symbolic Hamming-distance based matching, which may not necessarily reflect the underlying similarity/dissimilarity between sequence fragments (k-mers). For a large class of k-mer string kernels, which includes mismatch/spectrum, gapped, and wildcard kernels, the matching function I(α,β) is independent of the actual k-mers being matched and depends only on the Hamming distance [12]. As a result, related k-mers may not be matched because their symbolic dissimilarity exceeds the maximum allowed Hamming distance. This presents a limitation, as in many cases similarity relationships are not entirely based on symbolic similarity, e.g., as in matching word n-grams or amino-acid sequences, where, for instance, words may be semantically related or amino acids could share structural or physical properties not reflected on a symbolic level. For Hamming-distance based matching, recent work in [12] has introduced linear-time algorithms with alphabet-independent complexity applicable to the computation of a large class of existing string kernels. However, these approaches do not readily extend to the case of general (non-Hamming) similarity metrics (e.g., BLOSUM-based scoring functions in biological sequence analysis, or measures of semantic
relatedness between words, etc.) without introducing a highcomputational cost (e.g., requiring quadratic or exponen-tial running time). In this work, we aim to extend theworks presented in [12], [7] to the case of general (non-Hamming) similarity metrics and introduce efficient linear-time generalized string kernel algorithms. We also showempirically that using these generalized similarity kernelsprovides effective improvements in practice for a number ofchallenging classification problems.
III. DISTANCE-PRESERVING STRING KERNELS
A. Spectrum and Mismatch Kernels Definition
Given a sequence X with symbols from alphabet Σ, the spectrum-k kernel [8] and the mismatch(k,m) kernel [3] induce the following |Σ|k-dimensional representation for the sequence:
Φ_{k,m}(γ|X) = ( Σ_{α∈X} I_m(α, γ) )_{γ∈Σ^k},   (1)
where Im(α, γ) = 1 if α ∈ Nk,m(γ), and Nk,m(γ) is themutational neighborhood, the set of all k-mers that differfrom γ by at most m mismatches. Note that, by definition,for spectrum kernels, m = 0. Effectively, these are the bag-of-substrings representations, with either exact (spectrum)or approximate/smoothed (mismatch) counts of substringspresent in the sequence X .
The spectrum/mismatch kernel is then defined as
K(X, Y |k, m) = Σ_{γ∈Σ^k} Φ_{k,m}(γ|X) Φ_{k,m}(γ|Y)   (2)
             = Σ_{α∈X} Σ_{β∈Y} Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ).   (3)
One interpretation of this kernel is that of a cumulative pairwise comparison of all substrings α and β contained in sequences X and Y, respectively. In the case of mismatch kernels, the level of similarity of each pair of substrings (α, β) is based on the number of identical substrings their mutational neighborhoods give rise to, Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ). For the spectrum kernel, this similarity is simply the exact matching of α and β.
One can generalize this and allow an arbitrary metric similarity function S to replace the Hamming similarity, Σ_{γ∈Σ^k} I_m(α, γ) I_m(β, γ) → S(α, β):

K(X, Y |k, S) = Σ_{α∈X} Σ_{β∈Y} S(α, β).   (4)

Such a similarity function can go significantly beyond the relatively simple substring mismatch (substitution) models mentioned above. However, a drawback of this general representation over the simple mismatch model that employs the Hamming similarity is, of course, that the complexity of comparing two sequences in general becomes quadratic in their lengths.
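A minimal sketch of this generalized quadratic kernel, Eq. (4); the function name and the toy scoring function `S` are ours, with `S` standing in for a real similarity matrix such as a BLOSUM-derived k-mer score:

```python
def general_kernel(x, y, k, S):
    # Quadratic-time Eq. (4): K(X,Y|k,S) = sum_{a in X} sum_{b in Y} S(a,b).
    xs = [x[i:i + k] for i in range(len(x) - k + 1)]
    ys = [y[i:i + k] for i in range(len(y) - k + 1)]
    return sum(S(a, b) for a in xs for b in ys)

# Toy similarity: number of agreeing positions (a stand-in for a
# BLOSUM-style k-mer score).
S = lambda a, b: sum(ca == cb for ca, cb in zip(a, b))
print(general_kernel("ACGT", "ACGA", 2, S))  # -> 5
```

With S(a, b) = 1 if a = b and 0 otherwise, this reduces to the exact spectrum kernel, but at quadratic rather than linear cost.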
Observation I: the number of substrings within distance m of both a and b is independent of a and b.

I_m(α, γ) = 1 if d(α, γ) ≤ m, and 0 otherwise.
Main issue: a denser substring spectrum, with many more features due to approximate counts (m>0); every k-mer a introduces a mismatch neighborhood N(a,m) of size O(k^m |Σ|^m).
I(a,b): the intersection size of two mismatch neighborhoods.
Sufficient Statistics for String Kernels
Algorithm 1 (Hamming-Mismatch). Mismatch kernel algorithm based on Hamming distance.
Input: strings X, Y with |X| = n_x, |Y| = n_y; parameters k, m; lookup table I of intersection sizes.
Evaluate the kernel using Equation 5:
K(X, Y |k, m) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I( d(x_{ix:ix+k−1}, y_{iy:iy+k−1}) | k, m ),   (5)
where I(d) is the intersection size for distance d.
Output: mismatch kernel value K(X, Y |k, m).
The overall complexity of the algorithm is O(knxny) since the Hamming distances between all
k-mer pairs observed in X and Y need to be known. In the following section, we discuss how to
efficiently compute the size of the intersection.
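Algorithm 1 might be sketched as follows (an illustrative implementation with our own naming, not the authors' code). The dictionary `I` maps each Hamming distance d to the pre-computed intersection size I(d), and is zero for d > 2m:

```python
def mismatch_kernel_hamming(x, y, k, I):
    # O(k * nx * ny): compare every k-mer pair via Hamming distance,
    # then add the pre-computed intersection size I[d] (0 for d > 2m).
    xs = [x[i:i + k] for i in range(len(x) - k + 1)]
    ys = [y[i:i + k] for i in range(len(y) - k + 1)]
    total = 0
    for a in xs:
        for b in ys:
            d = sum(ca != cb for ca, cb in zip(a, b))  # Hamming distance
            total += I.get(d, 0)
    return total

# With m = 0 the lookup degenerates to I = {0: 1}, and the kernel
# reduces to the exact spectrum kernel.
print(mismatch_kernel_hamming("GGAA", "GGAAT", 3, {0: 1}))  # -> 2
```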
3.3 Intersection Size: Closed Form Solution
The number of neighboring k-mers shared by two observed k-mers a and b can be directly com-
puted, in a closed-form, from the Hamming distance d(a, b) for fixed k and m, requiring no explicit
traversal of the k-mer space as in the case of trie-based computations. We first consider the case
a = b (i.e., d(a, b) = 0). The intersection size corresponds to the size of the (k, m)-mismatch neighborhood, i.e., I(a, b) = |N_{k,m}| = Σ_{i=0}^{m} C(k, i)(|Σ| − 1)^i. For higher values of the Hamming distance d, the key observation is that for fixed Σ, k, and m, given any distance d(a, b) = d, I(a, b) is also a constant, regardless of the mismatch positions. As a result, intersection values can always be pre-computed once, stored, and looked up when necessary. To illustrate this, we show two examples for m = 1, 2:
I(a, b) (m = 1):
  |N_{k,m}|,   d(a, b) = 0
  |Σ|,         d(a, b) = 1
  2,           d(a, b) = 2

I(a, b) (m = 2):
  |N_{k,m}|,                              d(a, b) = 0
  1 + k(|Σ| − 1) + (k − 1)(|Σ| − 1)^2,    d(a, b) = 1
  1 + 2(k − 1)(|Σ| − 1) + (|Σ| − 1)^2,    d(a, b) = 2
  6(|Σ| − 1),                             d(a, b) = 3
  C(4, 2),                                d(a, b) = 4
In general, the intersection size can be found in a weighted form Σ_i w_i(|Σ| − 1)^i and can be pre-computed in constant time.
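The m = 1 closed-form values can be sanity-checked by brute-force neighborhood enumeration over a small alphabet; an illustrative check with our own helper names, not part of the algorithm itself:

```python
from itertools import product

def neighborhood(a, sigma, m):
    # N_{k,m}(a): all k-mers over sigma within Hamming distance m of a.
    return {c for c in map(''.join, product(sigma, repeat=len(a)))
            if sum(x != y for x, y in zip(a, c)) <= m}

def intersection_size(a, b, sigma, m):
    return len(neighborhood(a, sigma, m) & neighborhood(b, sigma, m))

sigma = "ACGT"
print(intersection_size("AAAAA", "AAAAA", sigma, 1))  # |N_{5,1}| = 1 + 5*3 = 16
print(intersection_size("AAAAA", "AAAAC", sigma, 1))  # d = 1 -> |Sigma| = 4
print(intersection_size("AAAAA", "AAACC", sigma, 1))  # d = 2 -> 2
```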
4 Mismatch Algorithm based on Sufficient Statistics
In this section, we further develop ideas from the previous section and present an improved mismatch
algorithm that does not require pairwise comparison of the k-mers between two strings and depends
linearly on sequence length. The crucial observation is that in Equation 5, I(a, b) is non-zero
only when d(a, b) ≤ 2m. As a result, the kernel computed in Equation 5 is incremented only by
min(2m, k) + 1 distinct values, corresponding to min(2m, k) + 1 possible intersection sizes. We
then can re-write the equation in the following form:
K(X, Y |m, k) = Σ_{ix=1}^{nx−k+1} Σ_{iy=1}^{ny−k+1} I(x_{ix:ix+k−1}, y_{iy:iy+k−1}) = Σ_{i=0}^{min(2m,k)} M_i I_i,   (6)
where Ii is the size of the intersection of k-mer mutational neighborhood for Hamming distance
i, and Mi, the number of observed k-mer pairs in X and Y having Hamming distance i. The
problem of computing the kernel has been further reduced to a single summation. We have shown
in Section 3.3 that given any i, we can compute Ii in advance. The crucial task now becomes
computing the sufficient statistics Mi efficiently. In the following, we will show how to compute the
mismatch statistics {Mi} in O(ck,m(nx + ny)) time, where ck,m is a constant that does not depend
on the alphabet size. We formulate the task of inferring matching statistics {Mi} as the following
auxiliary counting problem:
Mismatch Statistic Counting: Given a set of n k-mers from two strings X and Y ,
for each Hamming distance i = 0, 1, ...,min(2m, k), output the number of k-mer
pairs (a, b), a ∈ X, b ∈ Y with d(a, b) = i.
Matching (sufficient) statistics: Mi = number of pairs of substrings (a in X, b in Y) at Hamming distance d(a,b)=i
Observation II: kernel is incremented only by min(2m,k)+1 distinct values corresponding to Hamming distances 0-2m
How to compute matching statistics Mi efficiently?
• Problem: direct computation of Mi is still quadratic!
Computing matching statistics
• Instead of Mi, first compute the approximate (with over-counting) number of pairs Ci at distance at most i.
• Algorithm: iteratively remove i positions, then sort and compute the spectrum kernel for the (k-i)-length substrings.
• This yields the inexact (cumulative) matching statistics Ci.
http://seqam.rutgers.edu/projects/new-inexact
Problem
• Explicit feature vectors are not readily available for many data domains (e.g., biosequences, text, audio data); explicit feature extraction can be complex and expensive.
• Alternative: use kernel functions, i.e., inner products between data points in a higher-dimensional feature space, e.g., the feature space formed by all subsequences of length k (as in subsequence, gapped, or mismatch string kernels).
• As the dimensionality grows exponentially as |Σ|^k, direct computation of the inner products is prohibitive; e.g., music with a 1024-symbol alphabet yields ~10^9-10^15 features (k=3, 5).
• We therefore need efficient algorithms for the inner product (kernel): how do we compute string kernel inner products efficiently for large-|Σ| inputs and inexact matching with relaxed constraints (e.g., many mismatches)?
Our approach
• A sufficient-statistics based method for efficient RKHS inner product evaluation for string kernels; it generalizes the computation and improves the time-efficiency of many existing string kernel functions.
• Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05), mismatch trie (Leslie '04).
• This work: a new algorithmic approach (sufficient statistics and sorting) that improves complexity bounds over prior work, with orders-of-magnitude speed-ups.
• Open questions addressed in this work: the dependency on |Σ| for the mismatch model; a unified computational framework for string kernels with inexact matching; efficient computation for large-|Σ| inputs.
Main Contributions
New efficient algorithms based on sufficient statistics for string kernel computations.
In this problem it is not necessary to know the distance between each pair of k-mers; one only
needs to know the number of pairs (Mi) at each distance i. We show next that the above problem
of computing matching statistics can be solved in linear time (in the number n of k-mers) using
multiple rounds of counting sort as a sub-algorithm.
We first consider the problem of computing number of k-mers at distance 0, i.e. the number of
exact matches. In this case, we can apply counting sort to order all k-mers lexicographically and
find the number of exact matches by scanning the sorted list. The counting then requires linear
O(kn) time. Efficient direct computation of Mi for any i > 0 is difficult (it requires quadratic time); we take another approach and first compute inexact cumulative mismatch statistics, C_i = M_i + Σ_{j=0}^{i−1} C(k−j, i−j) M_j, which overcount the number of k-mer pairs at a given distance i, as follows. Consider
two k-mers a and b. Pick i positions and remove from the k-mers the symbols at the corresponding
positions to obtain (k−i)-mers a′ and b′. The key observation is that d(a′, b′) = 0 ⇒ d(a, b) ≤ i. As a result, given n k-mers, we can compute the cumulative mismatch statistics C_i in linear time using C(k, i) rounds of counting sort on (k−i)-mers. The exact mismatch statistics M_i can then be
obtained from Ci by subtracting the exact counts to compensate for overcounting as follows:
M_i = C_i − Σ_{j=0}^{i−1} C(k−j, i−j) M_j,   i = 0, ..., min(min(2m, k), k − 1)   (7)
The last mismatch statistic Mk can be computed by subtracting the preceding statistics M0, ...Mk−1
from the total number of possible matches:
M_k = T − Σ_{j=0}^{k−1} M_j,   where T = (n_x − k + 1)(n_y − k + 1).   (8)
Our algorithm for mismatch kernel computations based on sufficient statistics is summarized in
Algorithm 2. The overall complexity of the algorithm is O(n c_{k,m}) with the constant c_{k,m} = Σ_{l=0}^{min(2m,k)} C(k, l)(k − l), independent of the size of the alphabet set, where C(k, l) is the number of rounds of counting sort for evaluating the cumulative mismatch statistics C_l.
Algorithm 2 (Mismatch-SS). Mismatch kernel algorithm based on sufficient statistics.
Input: strings X, Y with |X| = n_x, |Y| = n_y; parameters k, m; pre-computed intersection values I.
1. Compute the min(2m, k) cumulative matching statistics C_i using counting sort.
2. Compute the exact matching statistics M_i using Equation 7.
3. Evaluate the kernel using Equation 6: K(X, Y |m, k) = Σ_{i=0}^{min(2m,k)} M_i I_i.
Output: mismatch kernel value K(X, Y |k, m).
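Given the cumulative statistics, step 2 of Algorithm 2 (Equation 7) is a short recurrence; a sketch, with `exact_stats` as our own helper name:

```python
from math import comb

def exact_stats(C, k):
    # Eq. 7: M_i = C_i - sum_{j<i} C(k-j, i-j) * M_j.
    M = []
    for i, Ci in enumerate(C):
        M.append(Ci - sum(comb(k - j, i - j) * M[j] for j in range(i)))
    return M

# One pair of 3-mers at Hamming distance 1 gives C = [0, 1, 2],
# which Eq. 7 resolves to exactly one pair at distance 1.
print(exact_stats([0, 1, 2], 3))  # -> [0, 1, 0]
```

The kernel value is then the dot product of the recovered M_i with the pre-computed intersection sizes I_i (Equation 6).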
5 Extensions
Our algorithmic approach can also be applied to a variety of existing string kernels, leading to very
efficient and simple algorithms that could benefit many applications.
Spectrum Kernels. The spectrum kernel [9] in our notation is the first sufficient statistic M0, i.e.
K(X, Y |k) = M0, which can be computed in k rounds of counting sort (i.e. in O(kn) time).
Gapped Kernels. The gapped kernels [6] measure similarity between strings X and Y based on the co-occurrence of gapped instances g, |g| = k + m > k, of k-long substrings:

K(X, Y |k, g) = Σ_{γ∈Σ^k} ( Σ_{g∈X, |g|=k+m} I(γ, g) ) ( Σ_{g∈Y, |g|=k+m} I(γ, g) ),   (9)

where I(γ, g) = 1 when γ is a subsequence of g. Similar to the algorithmic approach for extracting cumulative mismatch statistics in Algorithm 2, to compute the gapped(g,k) kernel we perform a single round of counting sort over the k-mers contained in the g-mers. This gives a very simple and efficient O(C(g, k) k n)-time algorithm for gapped kernel computations.
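A direct (unoptimized) sketch of the gapped kernel with our own function names: rather than one joint counting sort, it enumerates the C(g,k) position choices per g-mer and counts each distinct k-long subsequence once, matching the 0/1 indicator I(γ, g):

```python
from collections import Counter
from itertools import combinations

def gapped_counts(x, k, g):
    counts = Counter()
    for i in range(len(x) - g + 1):
        gmer = x[i:i + g]
        # I(t, gmer) = 1 iff t occurs as a k-long subsequence of the g-mer,
        # so each distinct subsequence is counted once per g-mer.
        subs = {''.join(gmer[p] for p in pos)
                for pos in combinations(range(g), k)}
        counts.update(subs)
    return counts

def gapped_kernel(x, y, k, g):
    cx, cy = gapped_counts(x, k, g), gapped_counts(y, k, g)
    return sum(c * cy[t] for t, c in cx.items())

print(gapped_kernel("ACG", "ACG", 2, 3))  # subsequences AC, AG, CG -> 3
```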
Wildcard Kernels. The wildcard(k,m) kernel [6] in our notation is the sum of the cumulative statistics, K(X, Y |k, m) = Σ_{i=0}^{m} C_i, i.e., it can be computed in Σ_{i=0}^{m} C(k, i) rounds of counting sort, giving a simple and efficient O(Σ_{i=0}^{m} C(k, i)(k − i) n) algorithm.
II: Sufficient Statistic (SS) algorithm
http://seqam.rutgers.edu/projects/new-inexact
Problem! Explicit feature vectors are not readily available for many data domains (eg. biosequences, text,
audio data, etc.)" Explicit feature extraction can be complex and expensive
! Alternative: use kernel functions or inner products between data points in a higher dimensional feature space" Eg. feature space formed by all subsequences of length k (eg. subsequence, gapped, or
mismatch string kernels):
" as the dimensionality grows exponentially |!|k, direct computation of the inner products is prohibitive
!Eg. music with 1024-alphaber ~109-10
15 features (k=3,5)
# Need efficient algorithms for the inner product (kernel)
How do we compute string kernel inner products more efficiently for large-|!| inputs and inexact matching with relaxed (eg. with many mismatches)
constraints?
Our approach! Sufficient statistics – based method for efficient RKHS inner product evaluation for string kernels
" generalizes computation and improves time-efficiency of many existing string kernel functions! Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05),
mismatch trie (Leslie '04)! This work: new algorithmic approach (sufficient statistics & sorting), improves complexity bounds
over prior works, orders of magnitude speed-up
! Open questions addressed in this work:" dependency on |"| for mismatch model" unified computational framework for string kernels with inexact matching" efficient computation for large-|"| inputs
Main ContributionsNew efficient algorithms based on sufficient statistics for string kernel computations
Theoretical results! New algorithmic approach for string kernel computation with inexact matching! Improve complexity bounds over existing algorithms
Empirical results! Orders of magnitude speed-up, large datasets (millions of examples, semi-supervised)! State-of-the-art results for protein classification
Scalable Algorithms for String Kernels with Inexact MatchingPavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Department of Computer Science, Rutgers University, NJ
Algorithm OverviewI. Mismatch kernel function:
Rewriting (1) we get
This gives a quadratic algorithm (Hamming distance-based ), does not depend on |!#$%&'()*)+(,
"#$%&'%()%*'++',-
-%./)+)01$234$.%$1%+
2.#+/01$2%3+#+13+1/34
We use sorting-based counting to compute in linear time cumulative statistics Ci first,
Then derive Mi from C
i to evaluate the inner product (1)
i.e. we obtain a linear time algorithm independent of |!#
10,000x faster for large (|∑|=1000) alphabets
Relative running time vs. alphabet size
5/6+%)0 789 :%&+ ;<=)*100 75 >106 104
Speed-up [x]:
K $X ,Y %&'()'K c $( |X %c $( |Y % $1%
c $( |X %&'*)X I m $* ,(%
I m $( ,*%&+1, d $( ,*%,m0, otherwise
d $( ,*%-Hamming distance between ( and *
K $X ,Y %&'*)X'.)Y '()'k I m $( ,*% I m $( ,.%#neighbors / shared by a and bconstant for given d(a,b)!
4'.)+'%0).)5)26%('+'/+1)$%748"9:;Original (5,1) Relaxed (7,3)
Supervised 41.92 52.00Semi-supervised 81.05 89.52
<=5+1>/5#33%?,)+'1$%@)5(%?,'(1/+1)$A%BMethod / Accuracy Original (5,1) Relaxed (7,3)Supervised 46.78 52.87Semi-supervised 70.87 77.51
Original 57.4Relaxed 64.4
Music genre classificationaccuracy (multi-class), %
K $X ,Y %&'*)X '()Y /$( ,*% $2%
K $X ,Y %&'i&0
i&2m Mi/i $3%
K $X ,Y %&'i&0
i&mCi
K $X ,Y %&'i&0
i&mCi'
K $X ,Y %&M0
Results
AAAA ACAC AGAG ...... CGCG ...... GAGA ...... TTTT00
11
22 >2(‘ACGAC’)
k=2m=0
1 2 3 ... |∑|k
FeaturesFeatures (k-mers)
Cou
nts
index
index
features
features
ordered
unordered
AA AC AG ... TT 1 2 3 ... |∑|k
k-mer feature space:
!"-!"#$%&'$! "#% "!)$ "#)% " '#(')$&(#*+,' &
' $()0-.' .)*% 1 ' $+(")*% " +(")&'&&/01 $(& %$(-&%
ReferencesVishwanathan & Smola, Fast kernels for string and tree matching, NIPS 2002Leslie & Kuang, Fast string kernels using inexact matching, JMLR 2004Rousu & Shawe-Taylor, Efficient computation of gapped substring kernels on large alphabets, JMLR 2005Lodhi et al. Text classification using string kernels, JMLR 2002
C',$'5%.#+,1D%/).?=+#+1)$%%73;!"#$% &'()' &*
+,-./$0 1'2 34*
5,-./$067#/!$8#"+9 &'(' ':;<*
String kernels! Kernel functions defined on feature vectors computed from strings! Features:
" substrings (contiguous), " subsequences (non-contiguous)
AACACTCTG
Feature space (AAA-TTT)AACTG GCAAC
GCACAAAAC
AAA AACACTCAACTGGCA
010101
011010
" Iteratively remove i positions, sort, and count number of matches for (k-i)-mers
" Reduces computation to a series of spectrum kernel computations
Ci&Mi0' j&0
j&i-1$k- ji- j %M j
NIPS 2008,December 8-11
EEF%G31$2%3=@@1/1'$+%3+#+13+1/3%+)%/).?=+'%3+,1$2%H',$'5%@=$/+1)$3
?@$A'%*+/<B$C%/0%(=$2D%=()%E$3FF34G
H@$I)(J*?/J$C%/0%(=$2D%=()%E$3FFK4G$
$
*@$L?''%J$C%/0%(=$2D%=()%E$3FFK4G$
J@$A'?+)?($C%/0%(=$2M<C=?E$3FFN4
%@$8%)1OH6/O66J$C%/0%(=$2I%=+60E$3FFP4G$$$$=%B)Q=<'%/R)=%J$0%)1OH6/QH?=%J
$$$$$$$$$$M282S4E82T44UM2SVETV4
X1: GGAAX2: TTTGAAX3: GGAAT
GGAGAATTTTTGTGAGAAGGAGAAAAT
112222333
AATGAAGAAGAAGGAGGATTATTGTTT
312313222
Kernel computation with counting sort:
Kernel updates: M2WEW4$U$M2WEW4$X$**:
Algorithm. (Mismatch-SS) Mismatch kernel algorithm based on Sufficient StatisticsInput: strings X, Y, parameters k,m,1. Compute min(2m,k) cumulative matching statistics, C
i, using counting sort
2. Compute exact matching statistics, Mi
3. Evaluate kernel using :
Output: Mismatch kernel value K(X,Y|k,m)
/!
" $# $% !&$' %&'!&"&!($#$ $ ' %
)!/!
)!&*!-' +&"!-% $'- +!- + %) + $ !&"&' '' $&!($&!($#$ $ ' % $ '-%%
)'&,-' +&"'-%) +
Features: Seq. Index:
Counting sortSeq. index: 1 2 3Counts: 1 1 1Seq. index: 1 3Counts: 1 1
Seq. index: 2 Counts: 1
Seq. index: 3 Counts: 1
...
" Jointly sort the observed k-mers in N(X) and N(Y)" Apply desired kernel evaluation method (eg. spectrum, mismatch, gapped, etc)
" $# $% %&'!&$-%&-#&' '' $ -.-%%
*!
http://seqam.rutgers.edu/projects/new-inexact
Problem! Explicit feature vectors are not readily available for many data domains (eg. biosequences, text,
audio data, etc.)" Explicit feature extraction can be complex and expensive
! Alternative: use kernel functions or inner products between data points in a higher dimensional feature space" Eg. feature space formed by all subsequences of length k (eg. subsequence, gapped, or
mismatch string kernels):
" as the dimensionality grows exponentially |!|k, direct computation of the inner products is prohibitive
!Eg. music with 1024-alphaber ~109-10
15 features (k=3,5)
# Need efficient algorithms for the inner product (kernel)
How do we compute string kernel inner products more efficiently for large-|!| inputs and inexact matching with relaxed (eg. with many mismatches)
constraints?
Our approach! Sufficient statistics – based method for efficient RKHS inner product evaluation for string kernels
" generalizes computation and improves time-efficiency of many existing string kernel functions! Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05),
mismatch trie (Leslie '04)! This work: new algorithmic approach (sufficient statistics & sorting), improves complexity bounds
over prior works, orders of magnitude speed-up
! Open questions addressed in this work:" dependency on |"| for mismatch model" unified computational framework for string kernels with inexact matching" efficient computation for large-|"| inputs
Main ContributionsNew efficient algorithms based on sufficient statistics for string kernel computations
Theoretical results! New algorithmic approach for string kernel computation with inexact matching! Improve complexity bounds over existing algorithms
Empirical results! Orders of magnitude speed-up, large datasets (millions of examples, semi-supervised)! State-of-the-art results for protein classification
Scalable Algorithms for String Kernels with Inexact MatchingPavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Department of Computer Science, Rutgers University, NJ
Algorithm OverviewI. Mismatch kernel function:
Rewriting (1) we get
This gives a quadratic algorithm (Hamming distance-based ), does not depend on |!#$%&'()*)+(,
"#$%&'%()%*'++',-
-%./)+)01$234$.%$1%+
2.#+/01$2%3+#+13+1/34
We use sorting-based counting to compute in linear time cumulative statistics Ci first,
Then derive Mi from C
i to evaluate the inner product (1)
i.e. we obtain a linear time algorithm independent of |!#
10,000x faster for large (|∑|=1000) alphabets
Relative running time vs. alphabet size
5/6+%)0 789 :%&+ ;<=)*100 75 >106 104
Speed-up [x]:
K $X ,Y %&'()'K c $( |X %c $( |Y % $1%
c $( |X %&'*)X I m $* ,(%
I m $( ,*%&+1, d $( ,*%,m0, otherwise
d $( ,*%-Hamming distance between ( and *
K $X ,Y %&'*)X'.)Y '()'k I m $( ,*% I m $( ,.%#neighbors / shared by a and bconstant for given d(a,b)!
4'.)+'%0).)5)26%('+'/+1)$%748"9:;Original (5,1) Relaxed (7,3)
Supervised 41.92 52.00Semi-supervised 81.05 89.52
<=5+1>/5#33%?,)+'1$%@)5(%?,'(1/+1)$A%BMethod / Accuracy Original (5,1) Relaxed (7,3)Supervised 46.78 52.87Semi-supervised 70.87 77.51
Original 57.4Relaxed 64.4
Music genre classificationaccuracy (multi-class), %
K $X ,Y %&'*)X '()Y /$( ,*% $2%
K $X ,Y %&'i&0
i&2m Mi/i $3%
K $X ,Y %&'i&0
i&mCi
K $X ,Y %&'i&0
i&mCi'
K $X ,Y %&M0
Results
AAAA ACAC AGAG ...... CGCG ...... GAGA ...... TTTT00
11
22 >2(‘ACGAC’)
k=2m=0
1 2 3 ... |∑|k
FeaturesFeatures (k-mers)
Cou
nts
index
index
features
features
ordered
unordered
AA AC AG ... TT 1 2 3 ... |∑|k
k-mer feature space:
!"-!"#$%&'$! "#% "!)$ "#)% " '#(')$&(#*+,' &
' $()0-.' .)*% 1 ' $+(")*% " +(")&'&&/01 $(& %$(-&%
ReferencesVishwanathan & Smola, Fast kernels for string and tree matching, NIPS 2002Leslie & Kuang, Fast string kernels using inexact matching, JMLR 2004Rousu & Shawe-Taylor, Efficient computation of gapped substring kernels on large alphabets, JMLR 2005Lodhi et al. Text classification using string kernels, JMLR 2002
C',$'5%.#+,1D%/).?=+#+1)$%%73;!"#$% &'()' &*
+,-./$0 1'2 34*
5,-./$067#/!$8#"+9 &'(' ':;<*
String kernels! Kernel functions defined on feature vectors computed from strings! Features:
" substrings (contiguous), " subsequences (non-contiguous)
AACACTCTG
Feature space (AAA-TTT)AACTG GCAAC
GCACAAAAC
AAA AACACTCAACTGGCA
010101
011010
" Iteratively remove i positions, sort, and count number of matches for (k-i)-mers
" Reduces computation to a series of spectrum kernel computations
Ci&Mi0' j&0
j&i-1$k- ji- j %M j
NIPS 2008,December 8-11
EEF%G31$2%3=@@1/1'$+%3+#+13+1/3%+)%/).?=+'%3+,1$2%H',$'5%@=$/+1)$3
a. Spectrum kernels (Leslie, 2002)
b. Wildcard kernels (Leslie, 2004)
c. Gapped kernels (Leslie, 2004)
d. Spatial kernels (Kuksa, 2008)
e. Neighborhood kernels (Weston, 2005): semi-supervised, neighbor-based
   K(N(X), N(Y)) = Σ_{X'∈N(X)} Σ_{Y'∈N(Y)} K(X', Y')
X1: GGAA, X2: TTTGAA, X3: GGAAT
3-mers (with sequence index): GGA(1) GAA(1) TTT(2) TTG(2) TGA(2) GAA(2) GGA(3) GAA(3) AAT(3)
sorted: AAT(3) GAA(1) GAA(2) GAA(3) GGA(1) GGA(3) TGA(2) TTG(2) TTT(2)
Kernel computation with counting sort:
Kernel updates: K(x,y) ← K(x,y) + c·c′ (product of the matched-group counts)
Algorithm (Mismatch-SS). Mismatch kernel algorithm based on sufficient statistics
Input: strings X, Y; parameters k, m
1. Compute min(2m,k) cumulative matching statistics C_i using counting sort
2. Compute the exact matching statistics M_i
3. Evaluate the kernel using Eq. (3)
Output: mismatch kernel value K(X,Y | k,m)
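Step 1 (computing C_i by sorting position-deleted k-mers) can be sketched as follows; `Counter` stands in for the counting sort, and the function name is illustrative:

```python
from itertools import combinations
from collections import Counter

def cumulative_statistic(xs, ys, k, i):
    """C_i: over every way of deleting i of the k positions, count pairs of
    k-mers from xs and ys whose remaining (k-i)-mers match exactly.
    Counter plays the role of the counting sort (grouping equal keys)."""
    total = 0
    for drop in combinations(range(k), i):
        keep = [p for p in range(k) if p not in drop]
        cx = Counter(tuple(a[p] for p in keep) for a in xs)
        cy = Counter(tuple(b[p] for p in keep) for b in ys)
        total += sum(n * cy[key] for key, n in cx.items())
    return total
```

For xs = [ACG, ACC] and ys = [ACG] with k=3 this gives C_0 = 1 and C_1 = 4, matching the recurrence C_1 = M_1 + C(3,1)·M_0 with M_0 = M_1 = 1.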
K(X,Y) = Σ_{i=0}^{min(2m,k)} c_i · M_i
M_i = C_i − Σ_{j=0}^{i−1} C(k−j, i−j) · M_j
Counting sort output — features with sequence indices and counts:
GAA: seq. index 1 2 3, counts 1 1 1;  GGA: seq. index 1 3, counts 1 1;  TGA: seq. index 2, count 1;  AAT: seq. index 3, count 1;  …
• Jointly sort the observed k-mers in N(X) and N(Y)
• Apply the desired kernel evaluation method (e.g., spectrum, mismatch, gapped)
Complexity: original (trie-based) vs. sufficient-statistics based — e.g., for the mismatch kernel, O(k^{m+1}|Σ|^m n) → O(c_{k,m} n)
• Spectrum kernels (Leslie, 2002)
K(X,Y) = M0
• Gapped kernels (Leslie, 2004; Rousu, 2005)
• Spatial kernels (Kuksa, 2008 & 2010)
• In all cases reduce computation to multiple rounds of exact spectrum kernel computation (counting sort)
Using Sufficient Statistics to compute String Kernel Functions
http://seqam.rutgers.edu/projects/new-inexact
Problem
• Explicit feature vectors are not readily available for many data domains (e.g., biosequences, text, audio data)
  - Explicit feature extraction can be complex and expensive
• Alternative: use kernel functions, i.e., inner products between data points in a higher-dimensional feature space
  - E.g., the feature space formed by all subsequences of length k (subsequence, gapped, or mismatch string kernels)
  - As the dimensionality grows exponentially (|Σ|^k), direct computation of the inner products is prohibitive
  - E.g., music over a 1024-symbol alphabet yields ~10^9–10^15 features (k=3,5)
⇒ We need efficient algorithms for the inner product (kernel).
How do we compute string kernel inner products efficiently for large-|Σ| inputs and inexact matching with relaxed constraints (e.g., many mismatches)?
Our approach
• Sufficient-statistics-based method for efficient RKHS inner product evaluation for string kernels
  - generalizes the computation and improves the time-efficiency of many existing string kernel functions
• Previous work: suffix trees (Vishwanathan '02), (sparse) dynamic programming (Rousu '05), mismatch trie (Leslie '04)
• This work: a new algorithmic approach (sufficient statistics & sorting) that improves complexity bounds over prior work, with orders-of-magnitude speed-ups
• Open questions addressed in this work:
  - dependency on |Σ| for the mismatch model
  - a unified computational framework for string kernels with inexact matching
  - efficient computation for large-|Σ| inputs
Main Contributions
New efficient algorithms based on sufficient statistics for string kernel computations
Theoretical results
• New algorithmic approach for string kernel computation with inexact matching
• Improved complexity bounds over existing algorithms
Empirical results
• Orders-of-magnitude speed-ups on large datasets (millions of examples, semi-supervised)
• State-of-the-art results for protein classification
Scalable Algorithms for String Kernels with Inexact MatchingPavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Department of Computer Science, Rutgers University, NJ
Algorithm OverviewI. Mismatch kernel function:
Rewriting (1) we get
This gives a quadratic, Hamming-distance-based algorithm that does not depend on |Σ|.
Can we do better?
Yes — with sorting-based matching statistics.
We use sorting-based counting to compute the cumulative statistics C_i in linear time first, then derive the M_i from the C_i to evaluate the inner product (1), i.e., we obtain a linear-time algorithm independent of |Σ| — 10,000x faster for large (|Σ|=1000) alphabets.
Relative running time vs. alphabet size
Speed-up [x]: Protein 100, DNA 75, Text >10^6, Music 10^4
Evaluation
Running time (I)
10^4 speed-up for large (|Σ|=1000) alphabets
Figure 1: Relative running time vs. alphabet size
Scalable Algorithms for String Kernels with Inexact Matching
Pavel P. Kuksa, Pai-Hsi Huang, Vladimir Pavlovic
Theoretical results
New algorithmic approach for string kernels with inexact matching, including mismatch, gapped, and other count-based kernels
Improves complexity bounds over existing algorithms
Extends to large-scale semi-supervised learning (e.g., the neighborhood kernel approach)
Empirical results
Several orders of magnitude speed-up; more complex kernels can be explored, yielding improved predictions
Points of Interest
• Removing the dependency on the alphabet size under the mismatch measure: O(k^{m+1}|Σ|^m n) → O(c_{k,m} n)
• Generalization of algorithms for kernels on strings, improving time-efficiency
• Enhancing predictive performance with the new algorithms
• computational biology: recognizing protein folds from sequence; detecting remote evolutionary similarity
• general sequence data: music genre recognition
• state-of-the-art results for protein classification
Problem: what are the best algorithms for strings under inexact matching?
Speed-up [x]: Protein 100, DNA 75, Text >10^6, Music 10^4
Table 2: Remote homology detection
Method           Original   Relaxed
Supervised          41.92     52.00
Semi-supervised     81.05     86.32
http://seqam.rutgers.edu/projects/new-inexact/
Table 1: Protein fold prediction (multi-class)
Method           Original   Relaxed
Supervised          46.78     52.87
Semi-supervised     70.87     77.51
[Figure: sequence data → sorted k-mer feature index (ordered/unordered) → similarity (kernel) matrix]
Table 3: Music genre classification (multi-class)
Original   57.4
Relaxed    64.4
• Reduces error by 15% compared to the best available method
• Reduces error by 30% compared to the state-of-the-art profile kernel
Speed-up: Protein 100, Music ~10^4, Text ~10^6
Settings: sequence length = 10K, k=5, m=1
Running time (II)
Figure 1: Relative running time T_trie/T_ss (in logarithmic scale) of the mismatch-trie and mismatch-ss algorithms as a function of the alphabet size (mismatch(5,1) kernel, n = 10^5)
Table 1: Running time (in seconds) for kernel computation between two strings on real data

            long protein   protein      dna      text     music
n                  36672       116      570       242      6892
|Σ|                   20        20        4     29224      1024
(5,1)-trie        1.6268    0.0212   0.0260     20398     526.8
(5,1)-ss          0.1987    0.0052   0.0054    0.0178    0.0331
time ratio             8         4        5      10^6    16,000
(5,2)-trie       31.5519    0.2918   0.4800         -         -
(5,2)-ss          0.2957    0.0067   0.0064    0.0649    0.0941
time ratio           100        44       75         -         -
the semi-supervised setting for neighborhood mismatch kernels; for example, computing a smaller neighborhood mismatch(5,2) kernel matrix for the labeled sequences only (a 2862-by-2862 matrix) using the Swiss-Prot unlabeled dataset takes 1,480 seconds with our algorithm, whereas performing the same task with the trie-based algorithm takes about 5 days.
6.2 Empirical performance analysis
In this section we show predictive performance results for several sequence analysis tasks using our new algorithms. We consider the tasks of multi-class music genre classification [16], with results in Table 2, and protein remote homology (superfamily) prediction [9, 2, 18] in Table 3. We also include preliminary results for multi-class fold prediction [14, 15] in Table 4.
On the music classification task, we observe significant improvements in accuracy for larger numbers of mismatches. The obtained error rate (35.6%) on this dataset compares well with the state-of-the-art results based on the same signal representation in [16]. Remote protein homology detection, as evident from Table 3, clearly benefits from a larger number of allowed mismatches, because remotely related proteins are likely to be separated by multiple mutations or insertions/deletions. For example, we observe improvement in the average ROC-50 score from 41.92 to 52.00 under a fully-supervised setting, and similar significant improvements in the semi-supervised settings. In particular, the result on the Swiss-Prot dataset for the (7,3)-mismatch kernel is very promising and compares well with the best results of the state-of-the-art, but computationally more demanding, profile kernels [2]. The neighborhood kernels proposed by Weston et al. have already shown very promising results in [7], though slightly worse than the profile kernel. However, using our new algorithm that significantly improves the speed of the neighborhood kernels, we show that with a larger number of allowed mismatches the neighborhood can perform even better than the state-of-the-art profile kernel: the (7,3)-mismatch neighborhood achieves an average ROC-50 score of 86.32, compared to 84.00 for the profile kernel on the Swiss-Prot dataset. This is an important result that addresses a main drawback of the neighborhood kernels, the running time [7, 2].
Table 2: Classification performance on music genre classification (multi-class)
Method           Error
Mismatch (5,1)   42.6±6.34
Mismatch (5,2)   35.6±4.99

Table 3: Classification performance on protein remote homology prediction
                      mismatch (5,1)    mismatch (5,2)    mismatch (7,3)
dataset               ROC    ROC50      ROC    ROC50      ROC    ROC50
SCOP (supervised)     87.75  41.92      90.67  49.09      91.31  52.00
SCOP (unlabeled)      90.93  67.20      91.42  69.35      92.27  73.29
SCOP (PDB)            97.06  80.39      97.24  81.35      97.93  84.56
SCOP (Swiss-Prot)     96.73  81.05      97.05  82.25      97.78  86.32
For multi-class protein fold recognition (Table 4), we similarly observe improvements in performance for larger numbers of allowed mismatches. The balanced error of 25% for the (7,3)-mismatch neighborhood kernel using Swiss-Prot compares well with the best error rate of 26.5% for the state-of-the-art profile kernel with adaptive codes in [15] that used a much larger non-redundant (NR) dataset. Using NR, the balanced error further reduces to 22.5% for the (7,3)-mismatch.

Large-scale protein sequence fold recognition
• Unlabeled data: >1M protein sequences; labeled data: ~4K sequences
• 27 fold classes (very low sequence similarity)
• 20% reduction in error with relaxed matching (larger m); 40–50% reduction with a large unlabeled corpus
Table 4: Classification performance on fold prediction (multi-class)

Method            Error   Top-5   Balanced  Top-5     Recall  Top-5   Precision  Top-5   F1     Top-5
                          Error   Error     Bal.Err.          Recall             Prec.          F1
Mismatch (5,1)    51.17   22.72   53.22     28.86     46.78   71.14   90.52      95.25   61.68  81.45
Mismatch (5,2)    42.30   19.32   44.89     22.66     55.11   77.34   67.36      84.77   60.62  80.89
Mismatch (5,2)†   27.42   14.36   24.98     13.36     75.02   86.64   79.01      91.02   76.96  88.78
Mismatch (7,3)    43.60   19.06   47.13     22.76     52.87   77.24   84.65      91.95   65.09  83.96
Mismatch (7,3)†   26.11   12.53   25.01     12.57     74.99   87.43   85.00      92.78   79.68  90.02
Mismatch (7,3)‡   23.76   11.75   22.49     12.14     77.59   87.86   84.90      91.99   81.04  89.88
† used the Swiss-Prot sequence database; ‡ used the NR (non-redundant) database
7 Conclusions
We presented new algorithms for inexact matching of discrete-valued string representations that reduce the computational complexity of current algorithms, demonstrate state-of-the-art performance, and significantly improve running times. This improvement makes string kernels with approximate but looser matching a viable alternative for practical tasks of sequence analysis. Our algorithms work with large databases in supervised and semi-supervised settings and scale well in the alphabet size and the number of allowed mismatches. As a consequence, the proposed algorithms can be readily applied to other challenging problems in sequence analysis and mining.
References
[1] Jianlin Cheng and Pierre Baldi. A machine learning information retrieval approach to protein fold recognition. Bioinformatics, 22(12):1456–1463, June 2006.
[2] Rui Kuang, Eugene Ie, Ke Wang, Kai Wang, Mahira Siddiqi, Yoav Freund, and Christina S. Leslie. Profile-based string kernels for remote homology detection and motif extraction. In CSB, pages 152–160, 2004.
[3] Christina S. Leslie, Eleazar Eskin, Jason Weston, and William Stafford Noble. Mismatch string kernels for SVM protein classification. In NIPS, pages 1417–1424, 2002.
[4] Soren Sonnenburg, Gunnar Ratsch, and Bernhard Scholkopf. Large scale genomic sequence SVM classifiers. In ICML '05, pages 848–855, New York, NY, USA, 2005.
[5] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.
[6] Christina Leslie and Rui Kuang. Fast string kernels using inexact matching for protein sequences. J. Mach. Learn. Res., 5:1435–1455, 2004.
[7] Jason Weston, Christina Leslie, Eugene Ie, Dengyong Zhou, Andre Elisseeff, and William Stafford Noble. Semi-supervised protein classification using cluster kernels. Bioinformatics, 21(15):3241–3247, 2005.
[8] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.
[9] Christina S. Leslie, Eleazar Eskin, and William Stafford Noble. The spectrum kernel: A string kernel for SVM protein classification. In Pacific Symposium on Biocomputing, pages 566–575, 2002.
[10] S. V. N. Vishwanathan and Alex Smola. Fast kernels for string and tree matching. Advances in Neural Information Processing Systems, 15, 2002.
[11] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences, 84:4355–4358, 1987.
[12] Juho Rousu and John Shawe-Taylor. Efficient computation of gapped substring kernels on large alphabets. J. Mach. Learn. Res., 6:1323–1344, 2005.
[13] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast protein homology and fold detection with sparse spatial sample kernels. In ICPR 2008, 2008.
[14] Chris H.Q. Ding and Inna Dubchak. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17(4):349–358, 2001.
[15] Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, and Christina Leslie. Multi-class protein classification using adaptive codes. J. Mach. Learn. Res., 8:1557–1581, 2007.
[16] Tao Li, Mitsunori Ogihara, and Qi Li. A comparative study on content-based music genre classification. In SIGIR '03, pages 282–289, New York, NY, USA, 2003. ACM.
[17] Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. On the role of local matching for efficient semi-supervised protein sequence classification. In BIBM, 2008.
[18] Tommi Jaakkola, Mark Diekhans, and David Haussler. A discriminative framework for detecting remote protein homologies. In Journal of Computational Biology, volume 7, pages 95–114, 2000.
Algorithms for string kernels with inexact matching: Our contributions
• New family of algorithms based on sufficient statistics and counting sort for string kernels
• unified approach
• improves time complexity bounds over existing algorithms
• reduces kernel computation to exact spectrum kernel computation
• enhances performance with relaxed matching
Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic, NIPS 2008“Scalable Algorithms for String Kernels with Inexact Matching”
Pattern extraction: Motif discovery
Motif Discovery
GAACTCATGGTG
AAAAGCACGGTC
TCAAAGCAAGGC
CCTAATCAGGGC
AAGTATGGACTC
ACTAAGCAGGGT
TCTCACGGCCCA
CCTCGTGGTGGG
TACCGTATGGTT
ACCACTCGTCGA
Sequence database with motif instances (length k=8)
Motif (consensus sequence): A[AC]TCATGG
• Motifs: sequence patterns of biological significance
  - E.g., TF binding sites in DNA, structural motifs in proteins
Motif finder
Motif finding algorithms
• State-of-the-art algorithms (e.g., MITRA, RISOTTO) depend on the size of the (k,m) mutational neighborhood, O(k^m |Σ|^m)
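The O(k^m |Σ|^m) blow-up referenced above is easy to quantify: the number of k-mers within Hamming distance m of a fixed k-mer is Σ_{i≤m} C(k,i)·(|Σ|−1)^i. A small sketch (the function name is illustrative):

```python
from math import comb

def neighborhood_size(k, m, sigma):
    """|N(a, m)|: number of k-mers within Hamming distance m of a fixed k-mer,
    sum_{i<=m} C(k, i) * (sigma - 1)^i -- the O(k^m |Sigma|^m) blow-up."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(m + 1))
```

For a DNA motif of length 8 with 2 mismatches this already gives 1 + 8·3 + 28·9 = 277 neighbors per k-mer, and the count grows steeply with |Σ| for protein alphabets.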
• Can use similar approach and remove alphabet dependency (Kuksa & Pavlovic, BMC Bioinformatics, 2010)
Motif search tree: depth k, |Σ|^k leaves (all possible k-length motif sequences), branching over the symbols A, C, G, …
Algorithm Overview

Figure 2: Feasible instance selection.
At first, it seems that in order to obtain a set of k-mers R at distance of at most 2m from
every string one needs to compute all O(n2N2) pairwise distances between input k-mers and
select k-mers at distance of at most 2m — which would require quadratic running time. We
will show next how to obtain a reduced set of k-mers in time linear in the input length n.
The reduced set R of potential motif instances, i.e. k-mers at distance of at most 2m
from each of the input strings, can be obtained using our linear time selection algorithm
(Algorithm 1). To select the valid k-mers (i.e. a set of potential motif instances), we use
multiple rounds of count sort by removing iteratively 2m out of k positions and sorting
the resulting sets of (k − 2m)-mers. A k-mer is deemed a potential motif instance if it
matched at least one k-mer from each of the other strings in at least one of the sorting
rounds. The purpose of sorting is to group the same k-mers together; note that (k − 2m)-
mers corresponding to k-mers at distance of at most 2m will match exactly. Using a simple
linear scan over the sorted list of all input k-mers, we can find the set of potential motif
instances and construct R. This algorithm is outlined below (Algorithm 1).
Algorithm 1 Selection algorithm
Input: set S of k-mers with associated sequence index L, distance parameter d
Output: set R of k-mers at distance d from each input string
1. Pick d positions and remove from the k-mers symbols at the corresponding positions
to obtain a set of (k − d)-mers.
2. Use counting sort to order (lexicographically) the resulting set of (k − d)-mers.
3. Scan the sorted list to create the list of all sequences in which k-mers appear using
sequence index L.
4. Output the k-mers that appear in every input string.
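Algorithm 1 can be sketched as below (a simplified reading: a k-mer is kept if, in at least one round of deleting the same d positions, its (k−d)-mer group contains k-mers from every input sequence; dictionary grouping stands in for the counting sort, and names are illustrative):

```python
from itertools import combinations

def select_feasible(kmers, k, d):
    """Sketch of Algorithm 1: a k-mer is kept if, in at least one round of
    deleting the same d positions, its (k-d)-mer group contains k-mers from
    every input sequence. `kmers` is a list of (kmer, seq_index) pairs."""
    nseq = len({s for _, s in kmers})
    feasible = set()
    for drop in combinations(range(k), d):
        keep = [p for p in range(k) if p not in drop]
        groups = {}  # dictionary grouping stands in for the counting sort
        for a, s in kmers:
            groups.setdefault(tuple(a[p] for p in keep), set()).add(s)
        for a, _ in kmers:
            if len(groups[tuple(a[p] for p in keep)]) == nseq:
                feasible.add(a)
    return feasible
```

Two k-mers whose (k−d)-mers agree after deleting the same d positions are at Hamming distance at most d, which is exactly the certificate the selection needs.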
As we will see in the experiments (Section 4), the selection using Algorithm 1 significantly
reduces the number of k-mer instances considered by the motif algorithm and improves its
search efficiency. The number of selected k-mers, i.e. the size of R, is small, especially
Feasible instances: at Hamming distance at most 2m
[Figure: candidate k-mers (ACTCATGG, AAGCACGG, AAGCAAGG, AATCAGGG, CCTCGTGG) and the (k−d)-mers obtained after deleting d positions; matching (k−d)-mers certify the distance constraint]
Example (k=3, m=2): count sort groups (TGT, TGC); intersection patterns TGx, Txx, xGx, TxC, xGC (x = wildcard)
Selection Efficiency
[Figure: motif finding speedup (input reduction factor) vs. typical sequence length; motif length k=11, max. #mismatches m=3]

Fast Motif Selection for Biological Sequences
Pavel Kuksa, Vladimir Pavlovic
Department of Computer Science, Rutgers University

Problem: Motif Finding
• motif (unknown)
• motif instances (inexact copies, unknown)
• sequences over an alphabet Σ (DNA, protein)
Input: a set of strings S, |S| = N, constructed from an alphabet Σ
Output: the set of motifs such that each motif a ∈ Σ^k is within a Hamming distance of at most d from strings in S

[Figure: motif finding speedup (running time) vs. alphabet size; motif length k=11, max. #mismatches m=3; the speedup factor grows with the alphabet size]

Motif Finding Experiments

References
[1] N. Jojic, V. Jojic, B. Frey, C. Meek, D. Heckerman. Using "epitomes" to model genetic diversity. NIPS.
[2] E. Eskin, P. Pevzner. Finding composite regulatory patterns in DNA sequences. Bioinformatics, 2002.
[3] E.P. Xing, M.I. Jordan, R.M. Karp, S. Russell. A hierarchical Bayesian Markovian model for motifs. NIPS 2003.
Motif finding algorithms (time/space complexity comparison):
• SPELLER (Sagot, 1998)
• MITRA (Pevzner, 2002) — space O(nNk)
• CENSUS (Evans, 2003) — space O(nNk)
• Voting (Chin, 2005) — space O(nV(m,k))
• RISOTTO (Pisanti, 2006)
• PMS (Davila, 2007)
All depend on the alphabet size |Σ| through the (k,m) mutational neighborhood.

Number of patterns shared between two substrings a and b: a closed-form count in k, m, |Σ|, and d(a,b).

Main approaches to (k,m)-motif finding (deterministic, complete):
• Enumeration: 1) enumerate all motifs within distance m for each string; 2) output the intersection of all such sets
• Trees & tries: a tree whose leaves/paths represent all possible motifs; 1) iterate over all paths from root to leaves; 2) output the leaves a such that a occurs in every string
• Explicit mapping (hashing): H[a] = number of votes for a (i.e., the number of strings containing a); 1) set H[a] = 0 for all a; 2) iterate over every input string and update H[·]; 3) output a if H[a] = N, i.e., a occurs in every string
Main drawback: these methods do not scale to large-|Σ| inputs, since their complexity depends on the alphabet size.

BIBM 2009, Washington D.C., Nov 1-4, 2009
D;=/'5(.1%*=++:%')%>';;'.%&'$()%!(.:+/#D;=/'5(.1%*=++:%')%>';;'.%&'$()%!(.:+/#
8,05-!-5&35&6!9,(.:*;50<!5(.+,=*1!0,
& "" $$( #'('()()! "$ #"## ()(*'(
*+,+-$('.%O,1'/($H;#%)'/%&'$()%!(.:(.1*+,+-$('.%O,1'/($H;#%)'/%&'$()%!(.:(.1
O,1'/($H;P%&'$()%-".:(:"$+%#+,+-$('.%C)+"#(B,+%(.#$".-+%#+$%+F$/"-$('.GD.=3$K% .157-046-0UAk8$B%<l0,/0@i=6:40J.-?0C44,;.C-62046I761;60.126D0%m02.4-C1;605C:C=6-6:02082A$=<
#E0fL-C.10C046-0,/08@i2<i=6:40LH0&'()$'&*(%+046>6;-.130205,4.-.,140C120:6=,K.130;,::645,12.1304H=L,>40000$E0U,:-0-?60:647>-.13046-0,/08@i2<i=6:4074.130;,71-.1304,:-%E0U;C10-?604,:-620>.4-0-,0;:6C-60>.4-0,/0C>>046I761;640.10J?.;?0@i=6:40C556C:)E0f7-57-0-?60@i=6:40-?C-0C556C:0.106K6:H046I761;6
Q3$=3$K%:627;62046-0SAk8LB><l0,/0@i=6:40C-02.4-C1;60,/0C-0=,4-020/:,=06C;?0.157-04-:.13
;
C ;
aC--6:10-:66
;C ;;
EEE
N6.3?L,:046-
/,:04-:.130#
N6.3?L,:046-0/,:04-:.130$
N6.3?L,:46-0/,:04-:.130%
+,-./046-0+
N,-6026561261;H0,1C>5?CL6-04.T6n
>'.-,3#('.#%".:%!3$3/+%N'/7>'.-,3#('.#%".:%!3$3/+%N'/7#E0O55>.640C40C0#,(("-.,0=6;?C1.4=0-,0C0KC:.6-H0,/00617=6:C-.,1iLC462086DC;-<00=,-./0/.126:4
$E0],=L.1640J.-?05:,LCL.>.4-.;0=,-./0/.126:40C400=,-./0/$0"&"$'(-#(%(/'1)-86E3EB0.157-0466240/,:0+b+bB0b+<
%E0U.31./.;C1->H0456627540.105:C;-.;60=,-./0/.12.13+0;C1046C:;?0/,:0>,136:B0>6440;,146:K620=,-./4
0000000000000-?C105:6K.,74>H0LH0617=6:C-.K60=6-?,24
U.31./.;C1-0456627540.105:C;-.;6n
!$,"!# !$ "" %##-$(.!/& % '()( *+,)-*). /0 1/.'0 *).
+9GSO0.=5:,K62 Z,-.130.=5:,K62
%%%%%%%%%%E.3;+/"$('.JB"#+:%;'$()%)(.:+/#
9E
9E 99E
?+:3-+:%5#%D.=3$?+:3-+:%5#%D.=3$%%%%%%%%%%%%%%%%%%%%%%%%#+$#+$
&'$() &DL?O *J&DL?O 9'$(.1 *J9'$(.1
8*B$< )E(% RPST ("E& RPSURPSU
8##B%< %"( VPU ') VPUVPU
8#%B)< *&$) UPS *'E( WPXWPX
8#&B&< ∞ VYR ##* VTPWVTPW
8#(B!< ∞ SWSZ #&$E* SXPTSXPT
Algorithm comparison on difficult (k,m)-motif discovery cases (protein sequences)

Dataset                                                Motif    Selection   No selection
Lipocalins (Lawrence, 1993)                            (16,6)   ?           ?
Zinc metallopeptidase (Zaslavsky, 2006)                (11,5)   ?           110650
Supersecondary structure (SSS) motifs (Kister, 2006)   (20,4)   ?           561.5
Input size vs. reduced input (feasible set size)
[Table: reduced (feasible) set size for input sizes of 11800, 19800, 39800, 59800, and 99800 k-mers.]
Protein sequences
[Pipeline: protein/DNA sequences → input set S of motif instances → feasible instance set selection → feasible instance set R (|R| << |S|) → starting motif models → motif refinement via enumerative algorithms (MITRA, Voting, etc.) or probabilistic algorithms (EM, MEME, Gibbs, etc.) → final motif model]
Selection Efficiency
[Bar chart: motif finding speedup via input reduction (motif length k=11, max. #mismatches m=3); input reduction factor vs. typical sequence length, growing from 286 to 2500.]
Fast Motif Selection for Biological Sequences
Pavel Kuksa, Vladimir Pavlovic
Department of Computer Science, Rutgers University
Problem
Motif Finding
[Diagram: sequences over alphabet Σ (DNA, protein) containing motif instances (inexact copies, unknown) of an unknown motif]
The (k, m)-motif finding problem
Input: set of strings S, |S| = N, constructed from an alphabet Σ
Output: set M of motifs such that each motif a ∈ Σ^k is at a minimum Hamming distance of at most m from each string in S (i.e., every string contains an instance of a with at most m mismatches)
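The output condition can be checked directly by brute force (an illustrative sketch; `hamming` and `is_motif` are my names, and this is the naive check, not one of the fast algorithms discussed here):

```python
def hamming(a, b):
    """Hamming distance between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def is_motif(a, strings, m):
    """a in Sigma^k is a (k, m)-motif for `strings` if every string
    contains some k-mer within Hamming distance m of a."""
    k = len(a)
    return all(
        any(hamming(a, s[i:i + k]) <= m for i in range(len(s) - k + 1))
        for s in strings
    )
```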
[Bar chart: motif finding speedup (running time) vs. alphabet size (motif length k=11, max. #mismatches m=3); speedup factors of 19, 35, 180, and 6986 as |Σ| grows from 10 to 50.]
[Bar chart: motif finding speedup (running time) vs. alphabet size (motif length k=11, max. #mismatches m=3); speedup factors of 13, 20, 50, 112, and 205 for |Σ| = 10, 15, 20, 50, 1000.]
Motif Finding Experiments

References
[1] N. Jojic, V. Jojic, B. Frey, C. Meek, D. Heckerman. Using "epitomes" to model genetic diversity. NIPS 2007
[2] E. Eskin, P.A. Pevzner. Finding composite regulatory patterns in DNA sequences. Bioinformatics, 2002
[3] E.P. Xing, M.I. Jordan, R.M. Karp, S. Russell. A hierarchical Bayesian Markovian model for motifs. NIPS 2003
O>3,:.-?= G.=60;,=5>6D.-H U5C;6
UabeebS08UC3,-B0#**'<
+9GSO08a6KT16:B0$""$< f81N@<
]bNU^U08bKC14B0$""%< f81N@<
Z,-.1308]?.1B0$""&< f81Z8=B@<<
S9UfGGf08a.4C1-.B0$""!<
a+U08MCK.>CB0$""(<
f81N$9C;87G< f81N$gJ<
f8@1N9C;87G<
f8@1N9C;87G<
f81N9C;87G<
f81N$9C;87G< f81N$<
f81$N9C;87G< f81$N<
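Here V(k, m) denotes the number of k-mers within Hamming distance m of a fixed k-mer (the mismatch neighborhood). Assuming the standard counting identity V(k, m) = Σ_{i=0}^{m} C(k, i)(|Σ| - 1)^i, it can be computed as:

```python
from math import comb

def hamming_ball_size(k, m, sigma):
    """V(k, m): number of k-mers within Hamming distance m of a fixed
    k-mer, over an alphabet of size sigma."""
    return sum(comb(k, i) * (sigma - 1) ** i for i in range(m + 1))
```

For DNA, V(11, 3) = 4984, but for proteins (|Σ| = 20) the same neighborhood already exceeds one million k-mers, which is why the |Σ|-dependent complexities above blow up on protein inputs.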
[Figure: number of patterns shared between two substrings a and b.]
Main approaches to (k, m, Σ, N)-motif finding (deterministic, complete)

Enumeration: 1. enumerate all motifs within distance m of each string s; 2. output the intersection of all such sets

Trees & tries: tree with leaves/paths for all possible motifs a; 1. iterate over all paths from root to leaves; 2. output leaves a such that a occurs in every string

Explicit mapping (hashing): H[a] = # votes for a (i.e., number of strings containing a); 1. set H[a] = 0 for all a; 2. iterate over every input string and update H[·]; 3. output a if H[a] = N, i.e., a occurs in every string
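The explicit-mapping (voting) approach can be sketched in Python (an illustration with my own names; note the neighborhood enumeration is exponential in m and the alphabet size):

```python
from itertools import combinations, product

def neighborhood(kmer, alphabet, m):
    """All k-mers within Hamming distance m of kmer."""
    out = {kmer}
    for d in range(1, m + 1):
        for pos in combinations(range(len(kmer)), d):
            for subs in product(alphabet, repeat=d):
                cand = list(kmer)
                for p, c in zip(pos, subs):
                    cand[p] = c
                out.add(''.join(cand))
    return out

def voting_motifs(strings, alphabet, k, m):
    """H[a] counts how many strings contain a k-mer within distance m
    of a; the motifs are the candidates voted for by all N strings."""
    H = {}
    for s in strings:
        votes = set()  # each string votes at most once per candidate
        for i in range(len(s) - k + 1):
            votes |= neighborhood(s[i:i + k], alphabet, m)
        for a in votes:
            H[a] = H.get(a, 0) + 1
    return {a for a, n in H.items() if n == len(strings)}
```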
Main drawback: these approaches do not handle large-|Σ| (k, m)-motif instances, since their complexity depends strongly on the alphabet size (e.g., as |Σ|^k)
BIBM 2009, Washington D.C., Nov 1-4, 2009
Time efficiency• Finding k-length motifs planted in i.i.d. sequences with up to m substitutions

Table 2: Running time comparison on the challenging instances of the planted motif problem (DNA, |Σ| = 4, N = 20 sequences of length n = 600). Problem instances are denoted by (k, m, |Σ|), where k is the length of the motif implanted with m mismatches.

                Motif problem instances (k, m, |Σ|)
Algorithm       (9,2,4)  (11,3,4)  (13,4,4)  (15,5,4)  (17,6,4)  (19,7,4)
Stemming        0.95     8.8       31        187       1462      8397
MITRA [5]       0.89     17.9      203       1835      4012      n/a
PMSPrune [10]   0.99     10.4      103       858       7743      81010
RISOTTO [6]     1.64     24.6      291       2974      29792     n/a

Table 3: Running time, in seconds, on large-|Σ| inputs. (k, m) instances denote implanted motifs of length k with up to m substitutions.

      (9,2)               (11,3)               (13,4)                 (15,5)
|Σ|   MITRA    Stemming   MITRA      Stemming  MITRA      Stemming    MITRA   Stemming
20    8.39     0.637      1032.17    1.07      28905      5.247       n/a     12.31
50    89.82    0.633      12295.73   0.963     685015     2.244       n/a     11.92
100   265.94   0.645      n/a        0.967     > 1 month  2.227       n/a     11.86
(a) CRP binding sites
(b) Lipocalin motifs
Figure 3: (a) Recognition of CRP binding sites (k = 18, m = 7, |Σ| = 4). (b) Lipocalin motifs (k = 15, m = 7, |Σ| = 20).
superfamily we find long motifs of length 20 (using m = 4 mismatches) corresponding to the secondary structure units strand 1 - loop - strand 2 (VIPPISCPENE[KR]GPFPKNLV) and strand 3 - loop - strand 4 (YSITGQGAD[KNQT]PPVGVFII) (3D SSS motif [23]). Our algorithm finds 36 potential motif instances (out of 330 samples) after the selection (step 1) and takes about 47 seconds (compared to about 600 seconds using the trie traversal). In the immunoglobulin superfamily (C1-set domains), we find a sequence motif of length 19, SSVTLGCLVKGYFPEPVTV, which corresponds to strand 2 - loop - strand 3 secondary structure units (2E SSS motif).
5. CONCLUSIONS
We presented a new deterministic and exhaustive algorithm for finding motifs, the common patterns in sequences. Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially on large-alphabet sequences (e.g., proteins), as we showed on several motif discovery problems in both DNA and protein sequences. The proposed algorithms could be applied to other cases and challenging problems in sequence analysis and mining.
6. ACKNOWLEDGMENT
This research was partially supported by DIMACS, Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University.
7. REFERENCES
[1] Eric P. Xing, Michael I. Jordan, Richard M. Karp, and Stuart Russell. A hierarchical Bayesian Markovian model for motifs in biopolymer sequences. In Proc. of Advances in Neural Information Processing Systems, pages 200–3. MIT Press, 2003.
[2] Pavel A. Pevzner and Sing-Hoi Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 269–278. AAAI Press, 2000.
[3] Jean-Marc Fellous, Paul H. E. Tiesinga, Peter J. Thomas, and Terrence J. Sejnowski. Discovering Spike
DNA sequences (|Σ| = 4, k = 9-19, m = 2-7), runtime [sec]

Large-alphabet setting (|Σ| = 20-100), runtime [sec]
• DNA: 3-10x speedup
• Proteins: 10-10^4x speedups for larger k, m
Motif Discovery: Our Contributions• Algorithm to improve search efficiency for
• large-alphabet inputs• large-length inputs• unlike prior art, does not depend on alphabet size
• Significant speed-ups in practice
• Can search for longer, less conserved motifs
• Outlook:• Applies as a speed-up mechanism to a variety of exact motif
finders (R could be used as an input instead of S)• Combines with probabilistic motif finders as a candidate
selector (e.g., MEME, EM)
Pavel Kuksa, Vladimir Pavlovic. Efficient motif finding algorithms for large-alphabet inputs, BMC Bioinformatics, 2010
Summary: Key Contributions• New framework (efficient algorithms and methods) for
sequence comparison, classification, pattern extraction• Spatial representations and kernels• Unified computational framework for string kernels with
inexact matching• Efficient (alphabet-independent) algorithms for motif
discovery• Benefit I (complexity): fast, alphabet-free sequence
matching (sufficient statistics, spatial representations)• Benefit II (accuracy): enhances performance on many
practical tasks (bio-informatics, text, music)• Extends to richer semi-supervised settings
• neighborhood kernels• abstraction-based kernels
Future Work• non-Hamming distances
• semi-supervised string kernels
• graph, image kernels
• information extraction, e.g., identifying relationships, extracting interesting/relevant sentences from documents
• sequence labeling
Thank you!
Journal Publications• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics, 10(Suppl 4):S2, 2009. Impact factor: 3.78• Pavel Kuksa and Vladimir Pavlovic. Efficient alignment-free DNA barcode analytics. BMC Bioinformatics, 10(Suppl 14):S9, 2009. Impact factor: 3.78• Pavel Kuksa and Vladimir Pavlovic. Efficient motif finding algorithms for large-alphabet inputs. BMC Bioinformatics, 11(Suppl 8):S1, 2010.
Conference papers• Pavel P. Kuksa and Vladimir Pavlovic. Efficient Motif Finding Algorithms for Large-Alphabet Inputs. In BIOKDD, 2010. Acceptance rate: 7/29 regular (24%)• Pavel Kuksa and Vladimir Pavlovic. Fast motif selection for biological sequences. In IEEE International Conference on Bioinformatics and Biomedicine BIBM'09, 2009. Acceptance rate: (44+37)/233 (35%)• Pavel P. Kuksa and Vladimir Pavlovic. Spatial Representation for Efficient Sequence Classification. In ICPR, 2010. Acceptance rate: 385/2140 oral (18%)• Pavel Kuksa and Vladimir Pavlovic. Efficient Alignment-free Barcode Analytics. In Third International Barcode of Life Conference, 2009.• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable Algorithms for String Kernels with Inexact Matching. In NIPS, 2008. Spotlight Presentation. Acceptance rate: 123/1022 (12%)
Conference papers• Pavel P. Kuksa, Yanjun Qi, Bing Bai, Ronan Collobert, Jason Weston, Vladimir Pavlovic, and Xia Ning. Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction. In ECML, 2010. Acceptance rate: 106/658 (16%)• Yanjun Qi, Ronan Collobert, Pavel Kuksa, Koray Kavukcuoglu, and Jason Weston. Combining labeled and unlabeled data with word-class distribution learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management CIKM 2009, pp. 1737–1740, 2009. Acceptance rate: (123+171)/847 (20% short paper)• Yanjun Qi, Pavel P. Kuksa, Ronan Collobert, Kunihiko Sadamasa, Koray Kavukcuoglu, and Jason Weston. Semi-Supervised Sequence Labeling with Self-Learned Features. In Proc. International Conference on Data Mining (ICDM'09), IEEE, 2009. Acceptance rate: 8.9% regular (70/786)• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. On the role of local matching for efficient semi-supervised protein sequence classification. In BIBM, 2008. Acceptance rate: 38/156 (24%)• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. A fast, semi-supervised learning method for protein sequence classification. In 8th International Workshop on Data Mining in Bioinformatics (BIOKDD 2008), pp. 29–37, 2008. Acceptance rate: 8/25 (32%)• Pavel Kuksa and Yanjun Qi. Semi-Supervised Bio-Named Entity Recognition with Word-Codebook Learning. In SDM, 2010. Acceptance rate: 82/351 (23%)
Conference papers• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels. In Computational Systems Bioinformatics: Proceedings of the CSB2008 Conference, pp. 133–143, 2008. Acceptance rate: 30/135 (22%)• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast Protein Homology and Fold Detection with Sparse Spatial Sample Kernels. In 19th International Conference on Pattern Recognition ICPR 2008, 2008. Acceptance rate: 18% (oral). Best paper nominee.• Pavel Kuksa and Vladimir Pavlovic. Fast Barcode-Based Species Identification Using String Kernels. In Second International Barcode of Life Conference, 2007. Acceptance rate: 30% (oral)• Pavel Kuksa and Vladimir Pavlovic. Fast Kernel Methods for SVM Sequence Classifiers. In WABI, pp. 228–239, 2007. Acceptance rate: 37/131 (28%)
Workshop papers• Pavel Kuksa and Vladimir Pavlovic. Efficient Sequence Classification with Spatial Representations. In Snowbird Learning Workshop, April 2010. Oral presentation.• Jason Weston, Ronan Collobert, Frederic Ratle, Hossein Mobahi, Pavel Kuksa, and Koray Kavukcuoglu. Deep Learning via Semi-Supervised Embedding. In ICML 2009 Workshop on Learning Feature Hierarchies, 2009.• Pavel Kuksa and Vladimir Pavlovic. Efficient Discovery of Common Patterns in Sequences. Snowbird Learning Workshop, Clearwater, Florida, April 13-16, 2009.• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. High Performance Sequence Classification with Novel Spatial Sample Embedding. 3rd Annual Machine Learning Symposium, NY, Oct 10, 2008.• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Spatially-constrained sample kernel for sequence classification. Snowbird Learning Workshop, Utah, April 1-4, 2008.• Pavel Kuksa and Vladimir Pavlovic. Kernel methods for DNA barcoding. Snowbird Learning Workshop, San Juan, Puerto Rico, March 2007.
References• Bio-Informatics
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Efficient use of unlabeled data for protein sequence classification: a comparative study. BMC Bioinformatics, 10(Suppl 4):S2, 2009
• Pavel Kuksa and Vladimir Pavlovic. Efficient alignment-free DNA barcode analytics. BMC Bioinformatics, 10(Suppl 14):S9, 2009
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast and Accurate Multi-class Protein Fold Recognition with Spatial Sample Kernels. In Computational Systems Bioinformatics: Proceedings of the CSB2008 Conference, pp. 133–143, 2008
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Fast Protein Homology and Fold Detection with Sparse Spatial Sample Kernels. In 19th International Conference on Pattern Recognition ICPR 2008, 2008.
• String kernels
• Pavel Kuksa, Pai-Hsi Huang, and Vladimir Pavlovic. Scalable Algorithms for String Kernels with Inexact Matching. In NIPS, 2008. Spotlight Presentation
• Pavel Kuksa and Vladimir Pavlovic. Fast Kernel Methods for SVM Sequence Classifiers. In WABI, pp. 228–239, 2007.
• Pavel P. Kuksa and Vladimir Pavlovic. Spatial Representation for Efficient Sequence Classification. In ICPR, 2010.
• NLP
• “Semi-Supervised Abstraction-Augmented String Kernel for Multi-Level Bio-Relation Extraction.”, Pavel Kuksa, Yanjun Qi, Bing Bai, Ronan Collobert, Jason Weston, ECML, 2010
• Pavel Kuksa and Yanjun Qi. Semi-Supervised Bio-Named Entity Recognition with Word-Codebook Learning. In SDM, 2010.
• Yanjun Qi, Pavel P. Kuksa, Ronan Collobert, Kunihiko Sadamasa, Koray Kavukcuoglu, and Jason Weston. Semi-Supervised Sequence Labeling with Self-Learned Features. In Proc. International Conference on Data Mining (ICDM'09), IEEE, 2009.