Evaluation of Techniques for Classifying Biological
Sequences
Authors: Mukund Deshpande and George Karypis
Speaker: Sarah Chan
CSIS DB Seminar
May 31, 2002
Presentation Outline
Introduction
Traditional Approaches (kNN, Markov Models) to Sequence Classification
Feature Based Sequence Classification
Experimental Evaluation
Conclusions
Introduction
The amount of biological sequence data available in public databases is increasing exponentially
• GenBank: 16 billion DNA base-pairs
• PIR: over 230,000 protein sequences
Strong sequence similarity often translates to functional and structural relations
Classification algorithms applied to sequence data can be used to gain valuable insights into the functions and relations of sequences
• E.g. to assign a protein sequence to a protein family
Introduction
K-nearest neighbor, Markov models and hidden Markov models have been used extensively
• They take into account the sequential constraints present in the datasets
Motivation: Few attempts have been made to use traditional machine learning classification algorithms such as decision trees and support vector machines
• They were thought to be unable to model the sequential nature of the datasets
Focus of This Paper
To evaluate some widely used sequence classification algorithms
• K-nearest neighbor
• Markov models
To develop a framework to model sequences such that traditional machine learning algorithms can be easily applied
• Represent each sequence as a vector in a derived feature space, and then use SVMs to build a sequence classifier
Problem Definition: Sequence Classification
A sequence Sr = (x1, x2, ..., xl) is an ordered list of symbols
The alphabet for symbols: known in advance and of fixed size N
Each sequence Sr has a class label Cr
Assumption: Two class labels only (C+, C-)
Goal: To correctly assign a class label to a test sequence
Approach 1: K Nearest Neighbor (KNN) Classifiers
To classify a test sequence Sr
• Locate the K training sequences most similar to Sr
• Assign to Sr the class label that occurs most often among those K sequences
Key task: to compute similarity between two sequences
Approach 1: K Nearest Neighbor (KNN) Classifiers
Alignment score as similarity function
• Compute an optimal alignment between two sequences (by dynamic programming, hence computationally expensive), and then
• Score this alignment: the score is a function of the no. of matched and unmatched symbols in the alignment
Approach 1: K Nearest Neighbor (KNN) Classifiers
Two variations (a sketch of the global variant follows below)
• Global alignment score
  Aligns sequences across their entire length
  Can capture position specific patterns
  Needs to be normalized due to varying sequence lengths
• Local alignment score
  Only portions of the two sequences are aligned
  Can capture small substrings of symbols which are present in the two sequences but not necessarily at the same position
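A minimal sketch of the global variant: KNN with a Needleman-Wunsch global alignment score as the similarity function. The match/mismatch/gap parameters and the normalization by the longer sequence length are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: KNN over a normalized global alignment score.
from collections import Counter

def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch dynamic programming; O(len(s) * len(t))."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + gap,      # gap in t
                           dp[i][j - 1] + gap)      # gap in s
    return dp[n][m] / max(n, m)  # normalize for varying sequence lengths

def knn_classify(test_seq, train_seqs, train_labels, k=3):
    """Majority vote among the k training sequences most similar to test_seq."""
    ranked = sorted(zip(train_seqs, train_labels),
                    key=lambda p: global_alignment_score(test_seq, p[0]),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]
```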
Approach 2.1: Simple Markov Chain Classifiers
To build a simple Markov chain based classification model
• Partition the training sequences according to class labels
• Build a simple Markov chain (M) for each smaller dataset
To classify a test sequence Sr
• Compute the likelihood of Sr being generated by each Markov chain M, i.e. P(Sr | M)
• Assign to Sr the class label associated with the Markov chain that gives the highest likelihood
Approach 2.1: Simple Markov Chain Classifiers
Log-likelihood ratio (for two class problems):
L(Sr) = log( P(Sr | M+) / P(Sr | M-) )
If L(Sr) ≥ 0, then Cr = C+; else Cr = C-
Markov principle (for 1st order Markov chain): each symbol in a sequence depends only on its preceding symbol, so
P(Sr | M) = P(x1) · P(x2 | x1) · P(x3 | x2) · ... · P(xl | xl-1)
Approach 2.1: Simple Markov Chain Classifiers
Transition probability for the pair (xi-1, xi) = P(xi | xi-1)
• Each symbol is associated with a state
• A Transition Probability Matrix (TPM) is built for each class (sketched below)
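A minimal sketch of this classifier under two stated assumptions not specified by the slides: add-one (Laplace) smoothing of the transition counts, and dropping the initial-state term P(x1).

```python
# Minimal sketch of a 1st order Markov chain classifier: one TPM per class,
# classification by the sign of the log-likelihood ratio L(Sr).
import math
from collections import defaultdict

def train_tpm(sequences, alphabet):
    """Estimate P(b | a) for all symbol pairs, with add-one smoothing (assumed)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    tpm = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)
        tpm[a] = {b: (counts[a][b] + 1) / total for b in alphabet}
    return tpm

def log_likelihood(seq, tpm):
    """log P(Sr | M), ignoring the initial-state term P(x1)."""
    return sum(math.log(tpm[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq, tpm_pos, tpm_neg):
    """If L(Sr) >= 0, assign C+; else C-."""
    L = log_likelihood(seq, tpm_pos) - log_likelihood(seq, tpm_neg)
    return "C+" if L >= 0 else "C-"
```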
Approach 2.1: Simple Markov Chain Classifiers
Example
Approach 2.1: Simple Markov Chain Classifiers
Higher (kth) order Markov chain (a sketch follows below)
• Transition probability for a symbol xi is computed by looking at its k preceding symbols
• No. of states = N^k, each associated with a sequence of k symbols
• Size of TPM = N^(k+1) (N^k rows x N columns)
• Pros: Better classification accuracy since they capture longer ordering constraints
• Cons: No. of states grows exponentially with the order → many infrequent states → poor probability estimates
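A sketch of how the state space generalizes to order k: the context becomes the previous k symbols, so the TPM conceptually has N^k rows, though only the contexts actually observed are materialized below. The smoothing is again an assumed illustrative choice.

```python
# Minimal sketch of a kth order chain: the state is the previous k symbols.
from collections import defaultdict

def train_kth_order(sequences, k, alphabet):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1  # k-symbol context -> next symbol
    tpm = {}
    for ctx, nxt in counts.items():            # only observed contexts stored
        total = sum(nxt.values()) + len(alphabet)
        tpm[ctx] = {b: (nxt.get(b, 0) + 1) / total for b in alphabet}
    return tpm
```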
Approach 2.2: Interpolated Markov Models (IMM)
Build a series of Markov chains starting from the 0th order up to the kth order
Transition probability for a symbol:
P(xi | xi-1, xi-2, ..., x1, IMMk) = sum of weighted transition probabilities of the chains of different orders, from the 0th order up to the kth order
• Weights: Often based on the distribution of the different order states in the various order Markov models
• The right weighting method appears to be dataset dependent (see the sketch below)
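A minimal sketch of the interpolated transition probability, reusing train_kth_order() from the previous sketch. The uniform weights and the uniform fallback for unseen contexts are illustrative assumptions; as noted above, the actual weighting scheme is dataset dependent.

```python
# Minimal sketch of IMM: weighted sum of transition probabilities over
# chains of order 0..k, where chains[j] = train_kth_order(seqs, j, alphabet).
def imm_prob(seq, i, chains, alphabet):
    """P(x_i | x_{i-1}, ..., x_1, IMM_k) with assumed uniform weights."""
    k = len(chains) - 1
    weight = 1.0 / (k + 1)
    p = 0.0
    for order in range(k + 1):
        if i < order:                          # not enough history at this order
            continue
        row = chains[order].get(seq[i - order:i])
        p += weight * (row[seq[i]] if row else 1.0 / len(alphabet))
    return p
```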
Approach 2.3: Selective Markov Models (SMM)
Build various order Markov chains
Prune non-discriminatory states from the higher order chains (will explain how)
The conditional probability P(xi | xi-1, xi-2, ..., x1, SMMk) is the probability corresponding to the highest order chain among the remaining states
Approach 2.3: Selective Markov Models (SMM)
Key task: to decide which states are non-discriminatory
Simplest way: use a frequency threshold and prune all states that occur less often than it
Method used in experiment:
• Specify the frequency threshold as a parameter
• A state-transition pair is kept only if it occurs that many times more frequently than its expected frequency under an assumed uniform distribution (sketched below)
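A minimal sketch of this pruning rule; the count layout (context → symbol → count, matching the earlier sketches) and the names are illustrative.

```python
# Minimal sketch of SMM pruning: keep a state-transition pair only if its
# observed count is at least `threshold` times its expected count under a
# uniform distribution over the N symbols.
def prune_states(counts, alphabet_size, threshold):
    kept = {}
    for ctx, transitions in counts.items():
        expected = sum(transitions.values()) / alphabet_size
        survivors = {b: c for b, c in transitions.items()
                     if c >= threshold * expected}
        if survivors:
            kept[ctx] = survivors
    return kept
# At classification time, P(xi | ...) is taken from the highest order chain
# whose (context, symbol) pair survived pruning.
```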
Approach 3: Feature Based Sequence Classification
Sequences are modeled into a form that can be used by traditional machine learning algorithms
Features are extracted that take the sequential nature of the sequences into account; the features are motivated by Markov models
Support vector machines (SVMs) are used as the classifier
Approach 3: Feature Based Sequence Classification
SVM
• A relatively new learning algorithm by Vapnik (1995)
• Objective: Given a training set in a vector space, find the best hyperplane (with max. margin) that separates the two classes
• Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (QP)
• Well-suited for high dimensional data
• Requires lots of memory and CPU time
Approach 3: Feature Based Sequence Classification
SVM – Maximum margin
(a) A separating hyperplane with a small margin. (b) A separating hyperplane with a larger margin. Better generalization is expected from (b).
Approach 3: Feature Based Sequence Classification
SVM – Feature space mapping
Mapping data into a higher dimensional feature space (by using kernel functions) where they are linearly separable.
Approach 3: Feature Based Sequence Classification
Vector space view (simple 1st order Markov chain)
The log-likelihood ratio is equivalent to L(Sr) = u^T w (see the sketch below)
• u and w are of length N^2; each dimension corresponds to a unique pair of symbols
  Element in u: frequency of that symbol pair in the sequence
  Element in w: log-ratio of the conditional probabilities for the + and – classes
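A minimal sketch of this equivalence, reusing train_tpm() from the earlier sketch: u counts the symbol pairs in a sequence, w holds the per-pair log-ratios, and u·w reproduces L(Sr). The paper's approach then replaces w with weights learnt by an SVM.

```python
# Minimal sketch of the vector space view for a 1st order chain.
import math
from itertools import product

def bigram_vector(seq, alphabet):
    """u: frequency of each ordered symbol pair (a, b); N^2 dimensions."""
    counts = {p: 0 for p in product(alphabet, repeat=2)}
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1
    return [counts[p] for p in product(alphabet, repeat=2)]

def weight_vector(tpm_pos, tpm_neg, alphabet):
    """w: log(P(b | a, M+) / P(b | a, M-)) for each ordered pair (a, b)."""
    return [math.log(tpm_pos[a][b] / tpm_neg[a][b])
            for a, b in product(alphabet, repeat=2)]

# sum(x * y for x, y in zip(u, w)) reproduces the log-likelihood ratio L(Sr)
# (up to the initial-state term ignored in the earlier sketch).
```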
Approach 3: Feature Based Sequence Classification
Vector space view – Example (simple 1st order Markov chain)
Approach 3: Feature Based Sequence Classification
Vector space view
• All variants of Markov chains described previously can be transformed in a similar manner
• Dimensionality of the new space:
  For higher order Markov chains: N^(k+1)
  For IMM: N + N^2 + ... + N^(k+1)
  For SMM: no. of non-pruned states
• Each sequence is viewed as a frequency vector
• Allows the use of any traditional classifier that operates on objects represented as multi-dimensional vectors
Experimental Evaluation
5 different datasets, each with 2-3 classes
Table 1
Experimental Evaluation
Methodology
• Performance of the algorithms was measured using classification accuracy
• Ten-way cross validation was used
• Experiments were restricted to two class problems (a sketch of this protocol follows below)
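A sketch of the evaluation protocol under stated assumptions: scikit-learn (not named in the paper) computing 10-fold cross-validated accuracy for a linear SVM over the bigram features from the earlier sketch.

```python
# Minimal sketch: 10-fold cross-validated accuracy of a linear SVM on
# bigram-frequency vectors (bigram_vector() from the earlier sketch).
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def evaluate(sequences, labels, alphabet):
    X = [bigram_vector(s, alphabet) for s in sequences]
    scores = cross_val_score(LinearSVC(), X, labels, cv=10, scoring="accuracy")
    return scores.mean()
```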
KNN Classifiers
“Cosine”
• Sequence: frequency vector of the different symbols in it
• Similarity between sequences: cosine of the two vectors
• Does not take sequential constraints into account (see the sketch below)
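A minimal sketch of this baseline: the symbol-frequency vector has only N dimensions, so all ordering information is discarded before the cosine is taken.

```python
# Minimal sketch of the "Cosine" similarity baseline for KNN.
import math
from collections import Counter

def symbol_vector(seq, alphabet):
    """Frequency of each individual symbol; ordering information is lost."""
    counts = Counter(seq)
    return [counts[a] for a in alphabet]

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0
```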
Table 2
KNN Classifiers
1. ‘Global’ outperforms the other two for all K
2. For PS-HT and PS-TS, performance of ‘Cosine’ is comparable to that of ‘Global’ as limited sequential info. can be exploited
Table 2
KNN Classifiers
3. ‘Local’ performs very poorly, esp. on protein sequences → not good to base classification on only a single substring
4. Accuracy decreases when K increases
Table 2
Simple Markov Chains vs. Their Feature Spaces
1. Accuracy improves with the order of each model
• Only exceptions: For PS-*, accuracy peaks at the 2nd/1st order, as the sequences are very short → higher order models and their feature spaces contain very few examples for calculating transition probabilities
Table 3
Simple Markov Chains vs. Their Feature Spaces
2. SVM achieves higher accuracies than simple Markov chains (often 5-10% improvement)
Table 3
IMM vs. Their Feature Spaces
1. SVM achieves higher accuracies than IMM for most datasets
• Exceptions: For P-*, higher order IMM models do considerably better (no explanation provided)
Table 4
2. Simple Markov chain based classifiers usually outperform IMM
• Only exceptions: PS-*, since the sequences are comparatively short → greater benefit in using different order Markov states
IMM vs. Their Feature Spaces
Table 4
IMM Based Classifiers vs. Simple Markov Chain Based Classifiers
Table 4: IMM Based
Part of Table 3: Simple Markov Chain Based
SMM vs. Their Feature Spaces
Table 5a
The parameter shown is the frequency threshold used in pruning states of the different order Markov chains
Table 5b
Table 5c
1. SVM usually achieves higher accuracies than SMM
2. For many problems SMM achieves higher accuracy as the threshold increases, but the gains are rather small
• Maybe because the pruning strategy is too simple
SMM vs. Their Feature Spaces
Conclusions
1. The SVM classifier used on the feature spaces of the different Markov chains (and their variants) achieves substantially better accuracies than the corresponding Markov chain classifier. The linear classification models learnt by SVM are better than those learnt by the Markov chain based approaches.
Conclusions
2. Proper feature selection can improve accuracy, but an increase in the amount of available info. does not necessarily guarantee it.
• (Except PS-*) The max. accuracy attained by SVM on IMM’s feature spaces is always lower than that attained by it on the feature spaces of simple Markov chains.
• Even with simple frequency based feature selection, as done in SMM, overall accuracy is higher.
3. Position specific info. can be useful for building effective classifiers for biological sequences.
• KNN computing global alignments can take advantage of the relative positions of symbols in the aligned sequences
• Simple experiment: an SVM incorporating info. about the positions of symbols was able to achieve an accuracy > 97%.
Conclusions
Dataset   Highest accuracy   Scheme which achieves the highest accuracy
S-EI      0.9390             KNN (K=5, with global sequence alignment)
P-MS      0.9719             KNN (K=1, with global sequence alignment)
References
Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002.
Ming-Hsuan Yang. Presentation entitled “Gentle Guide to Support Vector Machines”.
Alexander Johannes Smola. Presentation entitled “Support Vector Learning: Concepts and Algorithms”.