Evaluation of Techniques for Classifying Biological Sequences Authors: Mukund Deshpande and George Karypis Speaker: Sarah Chan CSIS DB Seminar May 31, 2002

Evaluation of Techniques for Classifying Biological Sequences

Authors: Mukund Deshpande and George Karypis

Speaker: Sarah Chan

CSIS DB Seminar

May 31, 2002

Presentation Outline

Introduction

Traditional Approaches (KNN, Markov Models) to Sequence Classification

Feature Based Sequence Classification

Experimental Evaluation

Conclusions

Introduction

The number of biological sequences available in public databases is increasing exponentially
• GenBank: 16 billion DNA base-pairs
• PIR: over 230,000 protein sequences

Strong sequence similarity often translates to functional and structural relations

Classification algorithms applied to sequence data can be used to gain valuable insights into the functions and relations of sequences
• E.g. to assign a protein sequence to a protein family

Introduction

K-nearest neighbor, Markov models and hidden Markov models have been used extensively
• They take into account the sequential constraints present in the datasets

Motivation: There have been few attempts to use traditional machine learning classification algorithms such as decision trees and support vector machines
• They were thought to be unable to model the sequential nature of the datasets

Focus of This Paper

To evaluate some widely used sequence classification algorithms
• K-nearest neighbor
• Markov models

To develop a framework to model sequences such that traditional machine learning algorithms can be easily applied
• Represent each sequence as a vector in a derived feature space, and then use SVMs to build a sequence classifier

Problem Definition: Sequence Classification

A sequence $S_r = \{x_1, x_2, x_3, \ldots, x_l\}$ is an ordered list of symbols

The alphabet for symbols: known in advance and of fixed size $N$

Each sequence $S_r$ has a class label $C_r$

Assumption: Two class labels only ($C^+$, $C^-$)

Goal: To correctly assign a class label to a test sequence

Approach 1: K Nearest Neighbor (KNN) Classifiers

To classify a test sequence $S_r$
• Locate the K training sequences most similar to $S_r$
• Assign to $S_r$ the class label which occurs the most among those K sequences

Key task: to compute similarity between two sequences
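The two steps above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy training set and the `shared_positions` similarity are assumptions made for the example, and any similarity function (such as an alignment score) can be plugged in instead.

```python
from collections import Counter

def knn_classify(test_seq, training, similarity, k=3):
    """Label test_seq by majority vote among its k most similar training sequences.

    training   -- list of (sequence, label) pairs
    similarity -- function (seq_a, seq_b) -> number, larger means more similar
    """
    ranked = sorted(training, key=lambda pair: similarity(test_seq, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def shared_positions(a, b):
    # Toy similarity (an assumption for illustration only):
    # the number of positions at which both sequences carry the same symbol.
    return sum(1 for x, y in zip(a, b) if x == y)

training = [("ACGT", "+"), ("ACGA", "+"), ("TTTT", "-"), ("TTGT", "-")]
label = knn_classify("ACGG", training, shared_positions, k=3)
```

Here the three nearest neighbours of "ACGG" are "ACGT", "ACGA" and "TTGT", so the majority label is "+".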

Approach 1: K Nearest Neighbor (KNN) Classifiers

Alignment score as similarity function
• Compute an optimal alignment between two sequences (by dynamic programming, hence computationally expensive), and then
• Score this alignment: the score is a function of the no. of matched and unmatched symbols in the alignment

Approach 1: K Nearest Neighbor (KNN) Classifiers

Two variations
• Global alignment score
  • Align sequences across their entire length
  • Can capture position specific patterns
  • Needs to be normalized due to varying sequence lengths
• Local alignment score
  • Only portions of two sequences are aligned
  • Can capture small substrings of symbols which are present in the two sequences but not necessarily at the same position
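A minimal sketch of both alignment scores, computed by dynamic programming, is given below. The scoring scheme (+1 for a match, -1 for a mismatch, -1 per gap) is an assumption chosen for illustration; the slides do not state which scores were used, and length normalization of the global score is left out.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Score of the best alignment of a and b, found by dynamic programming.

    local=False: global alignment over the entire length of both sequences.
    local=True:  local alignment, i.e. the best-scoring pair of substrings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are charged.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

global_score = alignment_score("TTACGTT", "GGACGGG")
local_score = alignment_score("TTACGTT", "GGACGGG", local=True)
```

The two sequences share only the substring "ACG": the local score rewards it (3), while the global score is pulled down by the mismatching flanks (-1). The table has one cell per pair of positions, which is where the computational cost comes from.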

Approach 2.1: Simple Markov Chain Classifiers

To build a simple Markov chain based classification model
• Partition training sequences according to class labels
• Build a simple Markov chain (M) for each smaller dataset

To classify a test sequence $S_r$
• Compute the likelihood of $S_r$ being generated by each Markov chain M, i.e. $P(S_r \mid M)$
• Assign to $S_r$ the class label associated with the Markov chain that gives the highest likelihood

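Approach 2.1: Simple Markov Chain Classifiers

Log-likelihood ratio (for two class problems), where $M^+$ and $M^-$ are the Markov chains built for the two classes:

$$L(S_r) = \log \frac{P(S_r \mid M^+)}{P(S_r \mid M^-)}$$

If $L(S_r) \ge 0$, then $C_r = C^+$ else $C_r = C^-$

Markov principle (for 1st order Markov chain):

each symbol in a sequence depends only on its preceding symbol, so

$$P(S_r \mid M) = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_{i-1})$$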

Approach 2.1: Simple Markov Chain Classifiers

Transition probability from $x_{i-1}$ to $x_i$: $P(x_i \mid x_{i-1})$

Each symbol is associated with a state

A Transition Probability Matrix (TPM) is built for each class
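A rough sketch of a first order Markov chain classifier follows: one TPM is estimated per class, and a test sequence is labelled by the sign of its log-likelihood ratio. Several details are assumptions made for the example and are not taken from the slides: the add-one pseudocount, ignoring the probability of the first symbol, treating a ratio of exactly zero as positive, and all function names.

```python
import math

def train_tpm(sequences, alphabet, pseudocount=1.0):
    """Estimate P(next | prev) for one class from that class's training sequences."""
    counts = {p: {n: pseudocount for n in alphabet} for p in alphabet}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    tpm = {}
    for prev, row in counts.items():
        total = sum(row.values())
        tpm[prev] = {nxt: c / total for nxt, c in row.items()}
    return tpm

def log_likelihood_ratio(seq, tpm_pos, tpm_neg):
    # Sum of log-ratios of the transition probabilities (first-symbol term ignored).
    return sum(math.log(tpm_pos[p][n] / tpm_neg[p][n]) for p, n in zip(seq, seq[1:]))

def classify(seq, tpm_pos, tpm_neg):
    return "+" if log_likelihood_ratio(seq, tpm_pos, tpm_neg) >= 0 else "-"

alphabet = "AB"
tpm_pos = train_tpm(["ABABAB", "BABABA"], alphabet)      # alternating sequences
tpm_neg = train_tpm(["AAAABBBB", "BBBBAAAA"], alphabet)  # long runs of one symbol
```

With these toy classes, an alternating test sequence such as "ABAB" is assigned to "+" and a run-heavy one such as "AAAB" to "-".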

Approach 2.1: Simple Markov Chain Classifiers

Example

Approach 2.1: Simple Markov Chain Classifiers

Higher ($k$th) order Markov chain
• Transition probability for a symbol $x_l$ is computed by looking at its $k$ preceding symbols
• No. of states = $N^k$, each associated with a sequence of $k$ symbols
• Size of TPM = $N^{k+1}$ ($N^k$ rows x $N$ columns)
• Pros: Better classification accuracy since they capture longer ordering constraints
• Cons: No. of states grows exponentially with the order, so there are many infrequent states and hence poor probability estimates
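The growth in the number of states can be made concrete with a short sketch. The function names are assumptions made for the example; the counts follow directly from the $N^k$ and $N^{k+1}$ formulas above.

```python
from collections import Counter

def kth_order_counts(sequences, k):
    """Count how often each k-symbol state is followed by each symbol."""
    counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[(seq[i - k:i], seq[i])] += 1
    return counts

def tpm_size(n_symbols, k):
    """(number of states, number of TPM entries) for a kth order chain."""
    return n_symbols ** k, n_symbols ** (k + 1)

counts = kth_order_counts(["ACGACG"], k=2)
dna_states, dna_entries = tpm_size(4, 3)           # DNA alphabet, 3rd order
protein_states, protein_entries = tpm_size(20, 3)  # amino acid alphabet, 3rd order
```

A 3rd order chain over 20 amino acids already has 8,000 states and 160,000 TPM entries, while the toy sequence "ACGACG" visits only 3 of the 16 possible 2nd order DNA states. Most entries are therefore estimated from few or no observations.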

Approach 2.2: Interpolated Markov Models (IMM)

Build a series of Markov chains starting from the 0th order up to the $k$th order

Transition probability for a symbol

$P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1, \mathrm{IMM}_k)$ = sum of weighted transition probabilities of the different order chains from 0th order up to $k$th order
• Weights: Often based on distribution of different states in various order Markov models
• The right weighting method appears to be dataset dependent
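The weighted sum can be sketched as below. The fixed weights, the add-one smoothing and the function names are assumptions made for the example; the slides only say that weights are often derived from the distribution of states, which this sketch does not attempt.

```python
from collections import Counter

def order_counts(sequences, k):
    """Counts for an order-k chain: k-symbol contexts and (context, next symbol) pairs."""
    ctx_counts, pair_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            ctx = seq[i - k:i]
            ctx_counts[ctx] += 1
            pair_counts[(ctx, seq[i])] += 1
    return ctx_counts, pair_counts

def imm_probability(history, symbol, models, weights, n_symbols):
    """Weighted sum of the order-0..k estimates of P(symbol | history).

    models[j] holds the counts of the order-j chain; history must contain at
    least as many symbols as the highest order used.
    """
    total = 0.0
    for order, (ctx_counts, pair_counts) in enumerate(models):
        ctx = history[len(history) - order:]  # last `order` symbols, "" for order 0
        # Add-one smoothing (an assumption) keeps unseen transitions above zero.
        p = (pair_counts[(ctx, symbol)] + 1) / (ctx_counts[ctx] + n_symbols)
        total += weights[order] * p
    return total

train = ["ABABAB"]
models = [order_counts(train, k) for k in range(3)]  # orders 0, 1 and 2
weights = [0.2, 0.3, 0.5]                            # hypothetical, sum to 1
p_b = imm_probability("BA", "B", models, weights, n_symbols=2)
p_a = imm_probability("BA", "A", models, weights, n_symbols=2)
```

Because the weights sum to 1 and each order's estimate is a proper distribution, the interpolated values also sum to 1 over the alphabet.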

Approach 2.3: Selective Markov Models (SMM)

Build various order Markov chains

Prune non-discriminatory states from higher order chains (will explain how)

Conditional probability $P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1, \mathrm{SMM}_k)$ is the probability corresponding to the highest order chain among the remaining states

Approach 2.3: Selective Markov Models (SMM)

Key task: to decide which states are non-discriminatory

Simplest way: use a frequency threshold and prune all states which occur less often than it

Method used in experiment:
• Specify the frequency threshold as a parameter
• A state-transition pair is kept only if it occurs that many times more frequently than its expected frequency, when a uniform distribution is assumed
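One possible reading of this pruning rule is sketched below. It assumes that the expected frequency of a state-transition pair under a uniform distribution is the total number of observed pairs divided by the $N^{k+1}$ possible pairs, and that a pair exactly at the threshold is kept; both points, and the function name, are assumptions rather than details confirmed by the slides.

```python
from collections import Counter

def prune_pairs(sequences, k, n_symbols, threshold):
    """Keep the order-k (state, next symbol) pairs that are frequent enough."""
    counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[(seq[i - k:i], seq[i])] += 1
    total = sum(counts.values())
    # Expected count of any single pair if all N^(k+1) pairs were equally likely.
    expected = total / n_symbols ** (k + 1)
    return {pair: c for pair, c in counts.items() if c >= threshold * expected}

sequences = ["AAAAAAAB", "AAAABAAA"]
kept_strict = prune_pairs(sequences, k=1, n_symbols=2, threshold=1.0)
kept_loose = prune_pairs(sequences, k=1, n_symbols=2, threshold=0.5)
```

In this toy data the pair counts are AA = 11, AB = 2 and BA = 1, with an expected count of 3.5. A threshold of 1.0 keeps only AA, and a threshold of 0.5 also keeps AB.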

Approach 3: Feature Based Sequence Classification

Sequences are modeled into a form that can be used by traditional machine learning algorithms

Extraction of features that take sequential nature of sequences into account

Motivated by Markov models, support vector machines (SVMs) are used

Approach 3: Feature Based Sequence Classification

SVM
• A relatively new learning algorithm by Vapnik (1995)
• Objective: Given a training set in a vector space, find the best hyperplane (with max. margin) that separates two classes
• Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (QP)
• Well-suited for high dimensional data
• Requires lots of memory and CPU time
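For reference, the constrained optimization problem is commonly written as follows for the linearly separable (hard-margin) case. The notation is an assumption and does not come from the slides: $\mathbf{z}_i$ are the $n$ training vectors, $y_i \in \{+1, -1\}$ their class labels, and $(\mathbf{w}, b)$ define the separating hyperplane.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2} \lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^T \mathbf{z}_i + b \right) \ge 1, \qquad i = 1, \ldots, n
```

The margin between the two classes is $2 / \lVert \mathbf{w} \rVert$, so minimizing the norm of $\mathbf{w}$ maximizes the margin. The objective is quadratic and the constraints are linear, which is why a QP solver applies.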

Approach 3: Feature Based Sequence Classification

SVM – Maximum margin

(a) A separating hyperplane with a small margin. (b) A separating hyperplane with a larger margin. A better generalization is expected from (b).

Approach 3: Feature Based Sequence Classification

SVM – Feature space mapping

Mapping data into a higher dimensional feature space (by using kernel functions) where they are linearly separable.

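Approach 3: Feature Based Sequence Classification

Vector space view

For a simple 1st order Markov chain, the log-likelihood ratio (ignoring the term for the first symbol), where $M^+$ and $M^-$ are the Markov chains of the two classes,

$$L(S_r) = \sum_{i=2}^{l} \log \frac{P(x_i \mid x_{i-1}, M^+)}{P(x_i \mid x_{i-1}, M^-)}$$

is equivalent to $L(S_r) = u^T w$
• $u$ and $w$ are of length $N^2$; each dimension corresponds to a unique pair of symbols
  • Element in $u$: frequency of the corresponding pair of symbols in the sequence
  • Element in $w$: log-ratio of the conditional probabilities for the + and – classes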
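The equivalence can be checked numerically with a small sketch. The two transition probability matrices below are made-up values for a two-symbol alphabet, and the function names are assumptions made for the example.

```python
import math
from itertools import product

def pair_frequency_vector(seq, alphabet):
    """u: how many times each ordered pair of symbols occurs consecutively in seq."""
    pairs = [a + b for a, b in product(alphabet, repeat=2)]
    index = {p: i for i, p in enumerate(pairs)}
    u = [0] * len(pairs)
    for a, b in zip(seq, seq[1:]):
        u[index[a + b]] += 1
    return u

def weight_vector(tpm_pos, tpm_neg, alphabet):
    """w: log-ratio of the two classes' transition probabilities, one entry per pair."""
    return [math.log(tpm_pos[a][b] / tpm_neg[a][b]) for a, b in product(alphabet, repeat=2)]

alphabet = "AB"
# Hypothetical TPMs, tpm[prev][next] = P(next | prev).
tpm_pos = {"A": {"A": 0.2, "B": 0.8}, "B": {"A": 0.7, "B": 0.3}}
tpm_neg = {"A": {"A": 0.6, "B": 0.4}, "B": {"A": 0.5, "B": 0.5}}

seq = "ABBA"
u = pair_frequency_vector(seq, alphabet)   # dimensions ordered AA, AB, BA, BB
w = weight_vector(tpm_pos, tpm_neg, alphabet)
dot = sum(ui * wi for ui, wi in zip(u, w))
direct = sum(math.log(tpm_pos[a][b] / tpm_neg[a][b]) for a, b in zip(seq, seq[1:]))
```

Both `dot` and `direct` give the same log-likelihood ratio, which shows that the Markov chain classifier is a linear classifier in the pair-frequency space with a fixed weight vector. An SVM can instead learn the weights from the training vectors.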

Approach 3: Feature Based Sequence Classification

Vector space view - Example

(simple 1st order Markov chain)

Approach 3: Feature Based Sequence Classification

Vector space view
• All variants of Markov chains described previously can be transformed in a similar manner
  • Dimensionality of new space:
    For higher order Markov chains: $N^{k+1}$
    For IMM: $N + N^2 + \ldots + N^{k+1}$
    For SMM: no. of non-pruned states
• Each sequence is viewed as a frequency vector
• Allows the use of any traditional classifier that operates on objects represented as multi-dimensional vectors
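The dimensionalities listed above, and the frequency vector itself, can be illustrated as follows. The function names are assumptions made for the example, and the feature vector is stored sparsely as a counter of substrings rather than as a dense array.

```python
from collections import Counter

def feature_space_size(n_symbols, k, variant="simple"):
    """Dimensionality of the feature space derived from a kth order model."""
    if variant == "simple":
        return n_symbols ** (k + 1)
    if variant == "imm":
        # Orders 0..k contribute N, N^2, ..., N^(k+1) dimensions.
        return sum(n_symbols ** (j + 1) for j in range(k + 1))
    raise ValueError("unknown variant: " + variant)

def imm_features(seq, k):
    """Sparse frequency vector: counts of every substring of length 1..k+1."""
    feats = Counter()
    for length in range(1, k + 2):
        for i in range(len(seq) - length + 1):
            feats[seq[i:i + length]] += 1
    return feats

simple_dims = feature_space_size(4, 2)         # 2nd order chain over DNA
imm_dims = feature_space_size(4, 2, "imm")     # IMM up to 2nd order over DNA
features = imm_features("ACGA", k=1)
```

For DNA with k = 2 this gives 64 dimensions for the simple chain and 4 + 16 + 64 = 84 for the IMM. A sparse representation is convenient because a single sequence touches only a few of them.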

Experimental Evaluation

5 different datasets, each with 2-3 classes

Table 1

Experimental Evaluation

Methodology
• Performance of algorithms was measured using classification accuracy
• Ten-way cross validation was used
• Experiments were restricted to two class problems
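A bare-bones sketch of ten-way cross validation with classification accuracy is shown below. How the folds were actually formed (shuffling, stratification) is not stated in the slides, so the simple interleaved split here is an assumption, as are the function names and the trivial majority-label classifier used to exercise the code.

```python
from collections import Counter

def k_fold_indices(n_items, n_folds=10):
    """Yield (train_indices, test_indices) for each of the n_folds splits."""
    folds = [list(range(start, n_items, n_folds)) for start in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

def cross_validated_accuracy(items, labels, train_fn, predict_fn, n_folds=10):
    """Fraction of items labelled correctly when each fold is held out once."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(items), n_folds):
        model = train_fn([items[i] for i in train_idx], [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, items[i]) == labels[i] for i in test_idx)
    return correct / len(items)

# Trivial stand-in classifier: always predict the most common training label.
def train_majority(items, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, item):
    return model

items = list(range(20))
labels = ["+"] * 15 + ["-"] * 5
accuracy = cross_validated_accuracy(items, labels, train_majority, predict_majority)
```

Every item is tested exactly once, by a model that never saw it during training. The majority-label baseline scores 0.75 here, which is simply the share of "+" items.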

KNN Classifiers

“Cosine”
• Sequence: Frequency vector of different symbols in it
• Similarity between sequences: cosine of the two vectors
• Does not take sequential constraints into account

Table 2
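The “Cosine” similarity can be sketched in a few lines. The function name is an assumption made for the example.

```python
import math
from collections import Counter

def cosine_similarity(seq_a, seq_b):
    """Cosine of the angle between the symbol-frequency vectors of two sequences."""
    fa, fb = Counter(seq_a), Counter(seq_b)
    dot = sum(fa[s] * fb[s] for s in fa)
    norm_a = math.sqrt(sum(c * c for c in fa.values()))
    norm_b = math.sqrt(sum(c * c for c in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

same_composition = cosine_similarity("ACGT", "TGCA")
partial_overlap = cosine_similarity("AAB", "ABB")
```

"ACGT" and its reverse "TGCA" have identical symbol counts, so their similarity is 1 even though the order of symbols differs completely. This is the sense in which the measure ignores sequential constraints.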

KNN Classifiers

1. ‘Global’ outperforms the other two for all K

2. For PS-HT and PS-TS, performance of ‘Cosine’ is comparable to that of ‘Global’ as limited sequential info. can be exploited

Table 2

KNN Classifiers

3. ‘Local’ performs very poorly, esp. on protein seq.
• Not good to base classification only on a single substring

4. Accuracy decreases when K increases

Table 2

Simple Markov Chains vs. Their Feature Spaces

1. Accuracy improves with order of each model
• Only exceptions: For PS-*, accuracy peaks at 2nd/1st order, as sequences are very short, so higher order models & their feature spaces contain very few examples for calculating transition probabilities

Table 3

Simple Markov Chains vs. Their Feature Spaces

2. SVM achieves higher accuracies than simple Markov chains (often 5-10% improvement)

Table 3

IMM vs. Their Feature Spaces

1. SVM achieves higher accuracies than IMM for most datasets
• Exceptions: For P-*, higher order IMM models do considerably better (no explanation provided)

Table 4

IMM vs. Their Feature Spaces

2. Simple Markov chain based classifiers usually outperform IMM
• Only exceptions: PS-*, since sequences are comparatively short, so there is greater benefit in using different order Markov states

Table 4

IMM Based Classifiers vs. Simple Markov Chain Based Classifiers

Table 4: IMM Based

Part of Table 3: Simple Markov Chain Based

SMM vs. Their Feature Spaces

Table 5a

Frequency threshold parameter: used in pruning states of different order Markov chains

Table 5b

Table 5c

SMM vs. Their Feature Spaces

1. SVM usually achieves higher accuracies than SMM

2. For many problems SMM achieves higher accuracy when the frequency threshold parameter increases, but the gains are rather small
• Maybe because the pruning strategy is too simple

Conclusions

1. SVM classifier used on the feature spaces of different Markov chains (and their variants) achieves substantially better accuracies than the corresponding Markov chain classifier.
• The linear classification models learnt by SVM are better than those learnt by Markov chain based approaches

Conclusions

2. Proper feature selection can improve accuracy, but an increase in the amount of info. available does not necessarily guarantee it.
• (Except PS-*) The max. accuracy attained by SVM on IMM’s feature spaces is always lower than that attained by it on feature spaces of simple Markov chains.
• Even with simple frequency based feature selection, as done in SMM, overall accuracy is higher.

Conclusions

3. Position specific info. can be useful for building effective classifiers for biological sequences.

Dataset | Highest accuracy | Scheme which achieves the highest accuracy
S-EI | 0.9390 | KNN (K=5, with global sequence alignment)
P-MS | 0.9719 | KNN (K=1, with global sequence alignment)

• KNN by computing global alignments can take advantage of the relative positions of symbols in aligned sequences
• Simple experiment: SVM incorporated with info. about position of symbols was able to achieve an accuracy > 97%.

References

Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002.

Ming-Hsuan Yang. Presentation entitled “Gentle Guide to Support Vector Machines”.

Alexander Johannes Smola. Presentation entitled “Support Vector Learning: Concepts and Algorithms”.