Evaluation of Techniques for Classifying Biological
Sequences
Authors: Mukund Deshpande and George Karypis
Speaker: Sarah Chan
CSIS DB Seminar
May 31, 2002
Presentation Outline
Introduction
Traditional Approaches (kNN, Markov Models) to Sequence Classification
Feature Based Sequence Classification
Experimental Evaluation
Conclusions
Introduction
The amount of biological sequence data available in public databases is increasing exponentially
• GenBank: 16 billion DNA base-pairs
• PIR: over 230,000 protein sequences
Strong sequence similarity often translates to functional and structural relations
Classification algorithms applied to sequence data can be used to gain valuable insights into the functions and relations of sequences
• E.g. to assign a protein sequence to a protein family
Introduction
K-nearest neighbor, Markov models and hidden Markov models have been used extensively
• They take into account the sequential constraints present in the datasets
Motivation: Few attempts have been made to use traditional machine learning classification algorithms such as decision trees and support vector machines
• They were thought to be unable to model the sequential nature of the datasets
Focus of This Paper
To evaluate some widely used sequence classification algorithms
• K-nearest neighbor
• Markov models
To develop a framework to model sequences such that traditional machine learning algorithms can be easily applied
• Represent each sequence as a vector in a derived feature space, and then use SVMs to build a sequence classifier
Problem Definition: Sequence Classification
A sequence Sr = (x1, x2, ..., xl) is an ordered list of symbols
The alphabet for symbols: known in advance and of fixed size N
Each sequence Sr has a class label Cr
Assumption: Two class labels only (C+, C-)
Goal: To correctly assign a class label to a test sequence
Approach 1: K Nearest Neighbor (KNN) Classifiers
To classify a test sequence Sr
• Locate the K training sequences most similar to Sr
• Assign to Sr the class label that occurs most often among those K sequences
Key task: to compute similarity between two sequences
Approach 1: K Nearest Neighbor (KNN) Classifiers
Alignment score as similarity function
• Compute an optimal alignment between two sequences (by dynamic programming, hence computationally expensive), and then
• Score this alignment: the score is a function of the no. of matched and unmatched symbols in the alignment
Approach 1: K Nearest Neighbor (KNN) Classifiers
Two variations (a sketch of the global variant follows below)
• Global alignment score
  Aligns sequences across their entire length
  Can capture position specific patterns
  Needs to be normalized due to varying sequence lengths
• Local alignment score
  Only portions of the two sequences are aligned
  Can capture small substrings of symbols which are present in the two sequences but not necessarily at the same position
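A minimal sketch of the global variant: KNN with a Needleman-Wunsch global alignment score as the similarity function. The match/mismatch/gap parameters and the normalization by the longer sequence length are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch: KNN over a normalized global alignment score.
from collections import Counter

def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch dynamic programming; O(len(s) * len(t))."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + gap,      # gap in t
                           dp[i][j - 1] + gap)      # gap in s
    return dp[n][m] / max(n, m)  # normalize for varying sequence lengths

def knn_classify(test_seq, train_seqs, train_labels, k=3):
    """Majority vote among the k training sequences most similar to test_seq."""
    ranked = sorted(zip(train_seqs, train_labels),
                    key=lambda p: global_alignment_score(test_seq, p[0]),
                    reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]
```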
Approach 2.1: Simple Markov Chain Classifiers
To build a simple Markov chain based classification model
• Partition the training sequences according to class labels
• Build a simple Markov chain (M) for each smaller dataset
To classify a test sequence Sr
• Compute the likelihood of Sr being generated by each Markov chain M, i.e. P(Sr | M)
• Assign to Sr the class label associated with the Markov chain that gives the highest likelihood
Approach 2.1: Simple Markov Chain Classifiers
Log-likelihood ratio (for two class problems):
L(Sr) = log( P(Sr | M+) / P(Sr | M-) )
If L(Sr) ≥ 0, then Cr = C+; else Cr = C-
Markov principle (for 1st order Markov chain): each symbol in a sequence depends only on its preceding symbol, so
P(Sr | M) = P(x1) · P(x2 | x1) · P(x3 | x2) · ... · P(xl | xl-1)
Approach 2.1: Simple Markov Chain Classifiers
Transition probability for the pair (xi-1, xi) = P(xi | xi-1)
• Each symbol is associated with a state
• A Transition Probability Matrix (TPM) is built for each class (sketched below)
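A minimal sketch of this classifier under two stated assumptions not specified by the slides: add-one (Laplace) smoothing of the transition counts, and dropping the initial-state term P(x1).

```python
# Minimal sketch of a 1st order Markov chain classifier: one TPM per class,
# classification by the sign of the log-likelihood ratio L(Sr).
import math
from collections import defaultdict

def train_tpm(sequences, alphabet):
    """Estimate P(b | a) for all symbol pairs, with add-one smoothing (assumed)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    tpm = {}
    for a in alphabet:
        total = sum(counts[a].values()) + len(alphabet)
        tpm[a] = {b: (counts[a][b] + 1) / total for b in alphabet}
    return tpm

def log_likelihood(seq, tpm):
    """log P(Sr | M), ignoring the initial-state term P(x1)."""
    return sum(math.log(tpm[a][b]) for a, b in zip(seq, seq[1:]))

def classify(seq, tpm_pos, tpm_neg):
    """If L(Sr) >= 0, assign C+; else C-."""
    L = log_likelihood(seq, tpm_pos) - log_likelihood(seq, tpm_neg)
    return "C+" if L >= 0 else "C-"
```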
Approach 2.1: Simple Markov Chain Classifiers
Example
Approach 2.1: Simple Markov Chain Classifiers
Higher (kth) order Markov chain (a sketch follows below)
• Transition probability for a symbol xi is computed by looking at its k preceding symbols
• No. of states = N^k, each associated with a sequence of k symbols
• Size of TPM = N^(k+1) (N^k rows x N columns)
• Pros: Better classification accuracy since they capture longer ordering constraints
• Cons: No. of states grows exponentially with the order → many infrequent states → poor probability estimates
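A sketch of how the state space generalizes to order k: the context becomes the previous k symbols, so the TPM conceptually has N^k rows, though only the contexts actually observed are materialized below. The smoothing is again an assumed illustrative choice.

```python
# Minimal sketch of a kth order chain: the state is the previous k symbols.
from collections import defaultdict

def train_kth_order(sequences, k, alphabet):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1  # k-symbol context -> next symbol
    tpm = {}
    for ctx, nxt in counts.items():            # only observed contexts stored
        total = sum(nxt.values()) + len(alphabet)
        tpm[ctx] = {b: (nxt.get(b, 0) + 1) / total for b in alphabet}
    return tpm
```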
Approach 2.2: Interpolated Markov Models (IMM)
Build a series of Markov chains starting from the 0th order up to the kth order
Transition probability for a symbol:
P(xi | xi-1, xi-2, ..., x1, IMMk) = sum of weighted transition probabilities of the chains of different orders, from the 0th order up to the kth order
• Weights: Often based on the distribution of the different order states in the various order Markov models
• The right weighting method appears to be dataset dependent (see the sketch below)
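A minimal sketch of the interpolated transition probability, reusing train_kth_order() from the previous sketch. The uniform weights and the uniform fallback for unseen contexts are illustrative assumptions; as noted above, the actual weighting scheme is dataset dependent.

```python
# Minimal sketch of IMM: weighted sum of transition probabilities over
# chains of order 0..k, where chains[j] = train_kth_order(seqs, j, alphabet).
def imm_prob(seq, i, chains, alphabet):
    """P(x_i | x_{i-1}, ..., x_1, IMM_k) with assumed uniform weights."""
    k = len(chains) - 1
    weight = 1.0 / (k + 1)
    p = 0.0
    for order in range(k + 1):
        if i < order:                          # not enough history at this order
            continue
        row = chains[order].get(seq[i - order:i])
        p += weight * (row[seq[i]] if row else 1.0 / len(alphabet))
    return p
```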
Approach 2.3: Selective Markov Models (SMM)
Build various order Markov chains
Prune non-discriminatory states from the higher order chains (will explain how)
The conditional probability P(xi | xi-1, xi-2, ..., x1, SMMk) is the probability corresponding to the highest order chain among the remaining states
Approach 2.3: Selective Markov Models (SMM)
Key task: to decide which states are non-discriminatory
Simplest way: use a frequency threshold and prune all states that occur less often than it
Method used in experiment:
• Specify the frequency threshold as a parameter
• A state-transition pair is kept only if it occurs that many times more frequently than its expected frequency under an assumed uniform distribution (sketched below)
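A minimal sketch of this pruning rule; the count layout (context → symbol → count, matching the earlier sketches) and the names are illustrative.

```python
# Minimal sketch of SMM pruning: keep a state-transition pair only if its
# observed count is at least `threshold` times its expected count under a
# uniform distribution over the N symbols.
def prune_states(counts, alphabet_size, threshold):
    kept = {}
    for ctx, transitions in counts.items():
        expected = sum(transitions.values()) / alphabet_size
        survivors = {b: c for b, c in transitions.items()
                     if c >= threshold * expected}
        if survivors:
            kept[ctx] = survivors
    return kept
# At classification time, P(xi | ...) is taken from the highest order chain
# whose (context, symbol) pair survived pruning.
```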
Approach 3: Feature Based Sequence Classification
Sequences are modeled into a form that can be used by traditional machine learning algorithms
Features are extracted that take the sequential nature of the sequences into account; the features are motivated by Markov models
Support vector machines (SVMs) are used as the classifier
Approach 3: Feature Based Sequence Classification
SVM
• A relatively new learning algorithm by Vapnik (1995)
• Objective: Given a training set in a vector space, find the best hyperplane (with max. margin) that separates the two classes
• Approach: Formulate a constrained optimization problem, then solve it using constrained quadratic programming (QP)
• Well-suited for high dimensional data
• Requires lots of memory and CPU time
Approach 3: Feature Based Sequence Classification
SVM – Maximum margin
(a) A separating hyperplane with a small margin. (b) A separating hyperplane with a larger margin. Better generalization is expected from (b).
Approach 3: Feature Based Sequence Classification
SVM – Feature space mapping
Mapping data into a higher dimensional feature space (by using kernel functions) where they are linearly separable.
Approach 3: Feature Based Sequence Classification
Vector space view (simple 1st order Markov chain)
The log-likelihood ratio is equivalent to L(Sr) = u^T w (see the sketch below)
• u and w are of length N^2; each dimension corresponds to a unique pair of symbols
  Element in u: frequency of that symbol pair in the sequence
  Element in w: log-ratio of the conditional probabilities for the + and – classes
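A minimal sketch of this equivalence, reusing train_tpm() from the earlier sketch: u counts the symbol pairs in a sequence, w holds the per-pair log-ratios, and u·w reproduces L(Sr). The paper's approach then replaces w with weights learnt by an SVM.

```python
# Minimal sketch of the vector space view for a 1st order chain.
import math
from itertools import product

def bigram_vector(seq, alphabet):
    """u: frequency of each ordered symbol pair (a, b); N^2 dimensions."""
    counts = {p: 0 for p in product(alphabet, repeat=2)}
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1
    return [counts[p] for p in product(alphabet, repeat=2)]

def weight_vector(tpm_pos, tpm_neg, alphabet):
    """w: log(P(b | a, M+) / P(b | a, M-)) for each ordered pair (a, b)."""
    return [math.log(tpm_pos[a][b] / tpm_neg[a][b])
            for a, b in product(alphabet, repeat=2)]

# sum(x * y for x, y in zip(u, w)) reproduces the log-likelihood ratio L(Sr)
# (up to the initial-state term ignored in the earlier sketch).
```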
Approach 3: Feature Based Sequence Classification
Vector space view – Example (simple 1st order Markov chain)
Approach 3: Feature Based Sequence Classification
Vector space view
• All variants of Markov chains described previously can be transformed in a similar manner
• Dimensionality of the new space:
  For higher order Markov chains: N^(k+1)
  For IMM: N + N^2 + ... + N^(k+1)
  For SMM: no. of non-pruned states
• Each sequence is viewed as a frequency vector
• Allows the use of any traditional classifier that operates on objects represented as multi-dimensional vectors
Experimental Evaluation
5 different datasets, each with 2-3 classes
Table 1
Experimental Evaluation
Methodology
• Performance of the algorithms was measured using classification accuracy
• Ten-way cross validation was used
• Experiments were restricted to two class problems (a sketch of this protocol follows below)
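A sketch of the evaluation protocol under stated assumptions: scikit-learn (not named in the paper) computing 10-fold cross-validated accuracy for a linear SVM over the bigram features from the earlier sketch.

```python
# Minimal sketch: 10-fold cross-validated accuracy of a linear SVM on
# bigram-frequency vectors (bigram_vector() from the earlier sketch).
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def evaluate(sequences, labels, alphabet):
    X = [bigram_vector(s, alphabet) for s in sequences]
    scores = cross_val_score(LinearSVC(), X, labels, cv=10, scoring="accuracy")
    return scores.mean()
```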
KNN Classifiers
“Cosine”
• Sequence: frequency vector of the different symbols in it
• Similarity between sequences: cosine of the two vectors
• Does not take sequential constraints into account (see the sketch below)
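A minimal sketch of this baseline: the symbol-frequency vector has only N dimensions, so all ordering information is discarded before the cosine is taken.

```python
# Minimal sketch of the "Cosine" similarity baseline for KNN.
import math
from collections import Counter

def symbol_vector(seq, alphabet):
    """Frequency of each individual symbol; ordering information is lost."""
    counts = Counter(seq)
    return [counts[a] for a in alphabet]

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0
```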
Table 2
KNN Classifiers
1. ‘Global’ outperforms the other two for all K
2. For PS-HT and PS-TS, performance of ‘Cosine’ is comparable to that of ‘Global’ as limited sequential info. can be exploited
Table 2
KNN Classifiers
3. ‘Local’ performs very poorly, esp. on protein sequences → not good to base classification on only a single substring
4. Accuracy decreases when K increases
Table 2
Simple Markov Chains vs. Their Feature Spaces
1. Accuracy improves with the order of each model
• Only exceptions: For PS-*, accuracy peaks at the 2nd/1st order, as the sequences are very short → higher order models and their feature spaces contain very few examples for calculating transition probabilities
Table 3
Simple Markov Chains vs. Their Feature Spaces
2. SVM achieves higher accuracies than simple Markov chains (often 5-10% improvement)
Table 3
IMM vs. Their Feature Spaces
1. SVM achieves higher accuracies than IMM for most datasets
• Exceptions: For P-*, higher order IMM models do considerably better (no explanation provided)
Table 4
2. Simple Markov chain based classifiers usually outperform IMM
• Only exceptions: PS-*, since the sequences are comparatively short → greater benefit in using different order Markov states
IMM vs. Their Feature Spaces
Table 4
IMM Based Classifiers vs. Simple Markov Chain Based Classifiers
Table 4: IMM Based
Part of Table 3: Simple Markov Chain Based
SMM vs. Their Feature Spaces
Table 5a
The parameter shown is the frequency threshold used in pruning states of the different order Markov chains
Table 5b
Table 5c
1. SVM usually achieves higher accuracies than SMM
2. For many problems SMM achieves higher accuracy as the threshold increases, but the gains are rather small
• Maybe because the pruning strategy is too simple
SMM vs. Their Feature Spaces
Conclusions
1. The SVM classifier used on the feature spaces of the different Markov chains (and their variants) achieves substantially better accuracies than the corresponding Markov chain classifier. The linear classification models learnt by SVM are better than those learnt by the Markov chain based approaches.
Conclusions
2. Proper feature selection can improve accuracy, but an increase in the amount of available info. does not necessarily guarantee it.
• (Except PS-*) The max. accuracy attained by SVM on IMM’s feature spaces is always lower than that attained by it on the feature spaces of simple Markov chains.
• Even with simple frequency based feature selection, as done in SMM, overall accuracy is higher.
3. Position specific info. can be useful for building effective classifiers for biological sequences.
• KNN computing global alignments can take advantage of the relative positions of symbols in the aligned sequences
• Simple experiment: an SVM incorporating info. about the positions of symbols was able to achieve an accuracy > 97%.
Conclusions
Dataset   Highest accuracy   Scheme which achieves the highest accuracy
S-EI      0.9390             KNN (K=5, with global sequence alignment)
P-MS      0.9719             KNN (K=1, with global sequence alignment)
References
Mukund Deshpande and George Karypis. Evaluation of Techniques for Classifying Biological Sequences. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 2002.
Ming-Hsuan Yang. Presentation entitled “Gentle Guide to Support Vector Machines”.
Alexander Johannes Smola. Presentation entitled “Support Vector Learning: Concepts and Algorithms”.