Application of latent semantic analysis to protein remote homology detection
Wu Dongyin, 4/13/2015
ABSTRACT
LSA
Related Work on Remote Homology Detection
LSA-based SVM and Data set
Result and Discussion
CONCLUSION
Motivation: Remote homology detection is a central problem in computational biology: the classification of proteins into functional and structural classes given their amino acid sequences.
Results
Discriminative methods such as SVMs are among the most effective.
Explicit feature sets are usually large and may introduce noisy data, which leads to the peaking phenomenon.
We introduce LSA, an efficient feature extraction technique from NLP. The LSA model significantly improves the performance of remote homology detection in comparison with the basic formalisms, and its performance is comparable to complex kernel methods such as SVM-LA and better than other sequence-based methods.
ABSTRACT
Related Work on Remote Homology Detection
pairwise sequence comparison algorithms (dynamic programming): BLAST, FASTA, PSI-BLAST, etc.
generative models for protein families: HMM, etc.
discriminative classifiers: SVM, SVM-Fisher, SVM-k-spectrum, mismatch-SVM, SVM-pairwise, SVM-I-sites, SVM-LA, SVM-SW, etc.
Structure is more conserved than sequence, so detecting very subtle sequence similarities, i.e. remote homology, is important.
Most methods can detect homologs with a high level of similarity, while remote homologs are often difficult to separate from pairs of proteins that share similarities owing to chance: the 'twilight zone'.
The success of an SVM classification method depends on the choice of the feature set used to describe each protein. Most research efforts focus on finding useful representations of protein sequence data for SVM training, using either explicit feature vector representations or kernel functions.
LSA
Latent semantic analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text.
LSA analyzes the relationships between a set of documents and the terms they contain by producing a set of concepts related to both the documents and the terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.
LSA
           c1  c2  c3  c4  c5  m1  m2  m3  m4
human       1   .   .   1   .   .   .   .   .
interface   1   .   1   .   .   .   .   .   .
computer    1   1   .   .   .   .   .   .   .
user        .   1   1   .   1   .   .   .   .
system      .   1   1   2   .   .   .   .   .
response    .   1   .   .   1   .   .   .   .
time        .   1   .   .   1   .   .   .   .
EPS         .   .   1   1   .   .   .   .   .
survey      .   1   .   .   .   .   .   .   1
tree        .   .   .   .   .   1   1   1   .
graph       .   .   .   .   .   .   1   1   1
minor       .   .   .   .   .   .   .   1   1
bag-of-words model: N documents, M words in total, giving a word-document matrix (M × N).
(This representation does not recognize synonymous or related words, and its dimensions are too large.)
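As a sketch, a word-document count matrix like the one above can be built by plain counting. The corpus strings below are hypothetical stand-ins for the documents c1–c5, m1–m4:

```python
from collections import Counter

# Toy corpus (hypothetical document texts; documents are whitespace-split).
docs = {
    "c1": "human interface computer",
    "c2": "computer user system response time survey",
    "c3": "interface user system EPS",
    "c4": "human system system EPS",
    "c5": "user response time",
    "m1": "tree",
    "m2": "graph tree",
    "m3": "graph minor tree",
    "m4": "graph minor survey",
}

# Vocabulary: all distinct words across the corpus.
vocab = sorted({w for text in docs.values() for w in text.split()})

# M x N word-document matrix: rows = words, columns = documents.
W = [[Counter(docs[d].split())[w] for d in docs] for w in vocab]
```

Each cell W[i][j] is the number of times word i occurs in document j; repeated words ("system" in c4) simply get a higher count.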
W ≈ U S Vᵀ

U (M × K)   S (K × K)   Vᵀ (K × N),   with K ≤ R = min(M, N)

LSA
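A minimal numpy sketch of the truncated SVD behind LSA; the count matrix here is random, hypothetical data:

```python
import numpy as np

# Hypothetical M x N word-document count matrix (M = 12 words, N = 9 docs).
rng = np.random.default_rng(0)
W = rng.integers(0, 3, size=(12, 9)).astype(float)

# SVD: W = U S V^T; at most R = min(M, N) nonzero singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Truncate to the K largest singular values (the latent "concepts").
K = 2
W_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]  # best rank-K approximation of W
```

The truncation keeps only the K strongest concepts, which is what discards noise and shrinks the representation from M dimensions to K.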
In the protein setting, the same kind of word-document matrix is built, with protein sequences playing the role of documents.
LSA
For a new document (sequence) that is not in the training set, one would otherwise have to add the unseen document (sequence) to the original training set and recompute the LSA model. Instead, the new vector t can be approximated as

t = dU

where d is the raw term vector of the new document, which plays the same role as the columns of the matrix W.
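A small numpy sketch of this folding-in step; the matrix values are made up:

```python
import numpy as np

# Training word-document matrix W (hypothetical counts): M = 6 words, N = 4 docs.
W = np.array([
    [1., 0., 1., 0.],
    [0., 2., 0., 1.],
    [1., 1., 0., 0.],
    [0., 0., 1., 1.],
    [1., 0., 0., 2.],
    [0., 1., 1., 0.],
])
U, s, Vt = np.linalg.svd(W, full_matrices=False)
K = 2
U_k = U[:, :K]                 # M x K matrix of left singular vectors

# d: raw term-count vector of an unseen document (same role as a column of W).
d = np.array([1., 0., 1., 0., 1., 0.])
t = d @ U_k                    # t = dU: K-dim latent vector, no recomputation
```

The new document is projected into the existing K-dimensional concept space, so the trained model never has to be rebuilt.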
LSA-based SVM and Data set
Structural Classification of Proteins (SCOP) 1.53; sequences taken from the ASTRAL database.
54 families, 4352 distinct sequences. Remote homology is simulated by holding out all members of a target SCOP 1.53 family from a given superfamily.

Three basic building blocks ("words") of proteins:
N-grams: N = 3, giving 20³ = 8000 words.
Patterns: over the alphabet Σ ∪ {'.'}, where Σ is the set of the 20 amino acids and '.' can be any amino acid; χ² selection yields 8000 patterns.
Motifs: limited, highly conserved regions of proteins; 3231 motifs.
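For example, the N-gram "words" (N = 3, so at most 20³ = 8000 distinct words over the 20 amino acids) can be extracted by sliding a window over the sequence; the sequence below is a made-up toy example:

```python
from collections import Counter

def ngram_counts(seq, n=3):
    """Count overlapping n-grams (the 'words') of a protein sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

# Hypothetical toy amino-acid sequence.
counts = ngram_counts("MKVLAAGVKVLA")
```

Stacking one such count vector per sequence (over the whole 8000-word vocabulary) yields the word-document matrix that LSA then factorizes.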
Result and Discussion
Two measures are used to evaluate the experimental results: the receiver operating characteristic (ROC) score, and the median rate of false positives (M-RFP) score: the fraction of negative test sequences that score as high as or better than the median score of the positive sequences.
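Both measures can be computed directly from classifier scores; a small self-contained sketch with invented scores:

```python
import statistics

def roc_score(pos_scores, neg_scores):
    """ROC score: fraction of (positive, negative) score pairs ranked
    correctly; ties count as half."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    hits = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return hits / len(pairs)

def mrfp_score(pos_scores, neg_scores):
    """M-RFP: fraction of negatives scoring as high as or better than the
    median positive score."""
    med = statistics.median(pos_scores)
    return sum(n >= med for n in neg_scores) / len(neg_scores)

pos = [0.9, 0.8, 0.4]              # hypothetical positive test scores
neg = [0.7, 0.3, 0.2, 0.1]         # hypothetical negative test scores
auc = roc_score(pos, neg)          # 11 of 12 pairs ranked correctly
fp = mrfp_score(pos, neg)          # no negative reaches the median (0.8)
```

Higher ROC and lower M-RFP both indicate better separation of remote homologs from chance similarities.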
Result and Discussion
When a family lies in the upper-left area, the method labeled on the y-axis outperforms the method labeled on the x-axis on that family.
Result and Discussion

(SCOP hierarchy diagram: fold → superfamily → family; e.g. fold1 contains superfamily1.1 and superfamily1.2, and superfamily1.1 contains family1.1.1, family1.1.2, family1.1.3. Held-out sets are drawn at three levels.)

1. Family level: positive train 20, positive test 13; negative train & test: 3033 & 1137
2. Superfamily level: positive train 88, positive test 33; negative train & test: 3033 & 1137
3. Fold level: positive train 61, positive test 33; negative train & test: 3033 & 1137
Result and Discussion
In computational efficiency, the LSA-based methods compare favorably with SVM-pairwise and SVM-LA, but unfavorably with the corresponding methods without LSA and with PSI-BLAST.
computational efficiency:

                  vectorization step    optimization step
SVM-pairwise      O(n²l²)               O(n³)
SVM-LA            O(n²l²)               O(n²p)
SVM-Ngram         O(nml)                O(n²m)
SVM-Pattern       O(nml)                O(n²m)
SVM-Motif         O(nml)                O(n²m)
SVM-Ngram-LSA     O(nmt)                O(n²R)
SVM-Pattern-LSA   O(nmt)                O(n²R)
SVM-Motif-LSA     O(nmt)                O(n²R)

n: the number of training examples
l: the length of the longest training sequence
m: the total number of words
t: min(m, n)
p: the length of the representation vector (p = n in SVM-pairwise, p = m in the methods without LSA, p = R in the LSA-based methods)
CONCLUSION
In this paper, the LSA model from natural language processing is successfully applied to protein remote homology detection, and improved performance is obtained in comparison with the basic formalisms.
Each document is represented as a linear combination of hidden abstract concepts, which arise automatically from the SVD mechanism.
LSA defines a transformation between high-dimensional discrete entities (the vocabulary) and a low-dimensional continuous vector space S, the R-dimensional space spanned by the columns of U, leading to noise removal and an efficient representation of the protein sequence.
As a result, the LSA model achieves better performance than the methods without LSA.