Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations
Fabian Schwarzer
Itay Lotan
Motivation
• SRS
1. Sample conformations
2. Create edges between “neighboring” conformations
• Ab-initio structure prediction
1. Generate a large decoy set
2. Cluster based on similarity
When the number of conformations is large, finding neighboring (similar) conformations is costly
Similarity Measures
• Given the backbone Cα atom positions of two conformations, how similar are they?
– Hard to define when comparing two different proteins
– Straightforward when comparing two conformations of the same protein
Similarity Measures
• We are interested in comparing conformations of the same protein.
• Hence there is a trivial correspondence between the two point sets.
• The two most common measures are:
– cRMS deviation
– dRMS deviation
cRMS
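$$\mathrm{cRMS}(P,Q) = \min_{T} \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left\| c_i^P - T c_i^Q \right\|^2}$$
Here c_i^P and c_i^Q are the positions of the i-th Cα atom in conformations P and Q, and T is the rigid body transform that optimally aligns P and Q.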
• cRMS is a metric, but the space is not Euclidean
• There is a closed form solution for T
• Complexity is linear in the number of points (plus a 4×4 eigenvector computation)
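As an illustration of the closed-form alignment, here is a minimal NumPy sketch of cRMS. It uses the SVD-based Kabsch procedure on a 3×3 covariance matrix, which is a swapped-in alternative to the 4×4 eigenvector formulation mentioned above, not necessarily the method used for the talk. The function name `crms` and the (n, 3) array layout are assumptions made for this example.

```python
import numpy as np

def crms(P, Q):
    """cRMS between two (n, 3) coordinate arrays whose rows correspond."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Optimal translation: move both centroids to the origin.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the 3x3 covariance matrix (Kabsch).
    H = Qc.T @ Pc
    U, _, Vt = np.linalg.svd(H)
    # Force a proper rotation (determinant +1), never a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Residual after rotating Q onto P; rows are points, hence Qc @ R.T.
    diff = Pc - Qc @ R.T
    return np.sqrt((diff ** 2).sum() / len(P))
```

The cost is linear in n apart from the fixed-size SVD, matching the complexity stated above.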
dRMS
• A metric over a Euclidean space.
• Complexity is quadratic in the number of points (size of protein)
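$$\mathrm{dRMS}(P,Q) = \sqrt{\frac{2}{n(n-1)}\sum_{i=2}^{n}\sum_{j=1}^{i-1} \left( d_{ij}^P - d_{ij}^Q \right)^2}$$
D is the internal distances matrix:
$$d_{ij} = \left\| c_i - c_j \right\|_2$$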
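The definition can be sketched directly in NumPy. The function names `internal_distances` and `drms` are chosen for this example; the code simply averages the squared differences over the n(n-1)/2 distinct pairs.

```python
import numpy as np

def internal_distances(C):
    """Vector of the n(n-1)/2 pairwise distances of an (n, 3) array."""
    C = np.asarray(C, dtype=float)
    i, j = np.triu_indices(len(C), k=1)  # all index pairs with i < j
    return np.linalg.norm(C[i] - C[j], axis=1)

def drms(P, Q):
    """dRMS between two conformations with corresponding rows."""
    dP = internal_distances(P)
    dQ = internal_distances(Q)
    # The mean over n(n-1)/2 pairs equals the 2 / (n(n-1)) prefactor.
    return np.sqrt(np.mean((dP - dQ) ** 2))
```

Because only the distance vectors are compared, dRMS between two conformations is a scaled Euclidean distance between their `internal_distances` vectors, which is what makes Euclidean search structures applicable in principle.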
k Nearest Neighbors
• Find the k nearest neighbors of every conformation in the set
• Currently the fastest algorithm in practice for high dimensionality is brute force:
For each conformation q in the set
Compute the distance to all other conformations
Find the k nearest conformations
• Complexity is O(n^2 log k), where n is here the number of conformations in the set
• The literature has a number of efficient nearest neighbor algorithms:
– kd-trees are the most prevalent
• We cannot use these algorithms directly:
– They require a Euclidean space (rules out cRMS)
– They are not efficient in high dimensions (rules out dRMS)
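The brute-force loop described above can be sketched as follows. The function name `brute_force_knn` and the `dist` callback are assumptions for this example; any distance function (cRMS, dRMS, or another) can be passed in. A bounded selection with `heapq.nsmallest` gives the log k factor per comparison.

```python
import heapq

def brute_force_knn(X, k, dist):
    """For each item in X, return the indices of its k nearest other items."""
    n = len(X)
    result = []
    for q in range(n):
        # Distance from q to every other item, paired with that item's index.
        candidates = ((dist(X[q], X[j]), j) for j in range(n) if j != q)
        # nsmallest keeps only k candidates at a time: O(n log k) per query.
        nearest = heapq.nsmallest(k, candidates)
        result.append([j for _, j in nearest])
    return result
```

For example, `brute_force_knn([0.0, 1.0, 3.0, 7.0], 2, lambda a, b: abs(a - b))` returns `[[1, 2], [0, 2], [1, 0], [2, 1]]`.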
k Nearest Neighbors
We reduce the dimensionality of dRMS to make kd-trees applicable.
Uniform Simplification
• Cut sequence into m equal subsequences
• Average the coordinates of the Cα atoms in each subsequence
• Use averaged coordinates ai when computing cRMS and dRMS
[Figure: backbone trace with the averaged points a0, a1, a2, ..., am marked]
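A minimal sketch of the averaging step, assuming the conformation is stored as an (n, 3) NumPy array of Cα positions. The function name `average_representation` is hypothetical, and when n is not divisible by m the subsequences here differ in length by at most one, which is one possible reading of "equal subsequences".

```python
import numpy as np

def average_representation(C, m):
    """Replace an (n, 3) chain by the centroids of m consecutive pieces."""
    C = np.asarray(C, dtype=float)
    # array_split cuts along the sequence into m nearly equal consecutive pieces.
    pieces = np.array_split(C, m)
    return np.array([piece.mean(axis=0) for piece in pieces])
```

The resulting (m, 3) array can be passed to the same cRMS or dRMS routine that would be used on the full chain.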
Uniform Simplification - Results
• There is a high correlation between the full and the averaged representations when using cRMS and dRMS:
– Proteins with 60–75 AA: r > 0.95 for m > 12
– Protein with 374 AA: r > 0.95 for m > 16
Even with m = 12, the dimensionality of the internal distances vector used by dRMS (12·11/2 = 66) is too high for a kd-tree to be used. Further reduction is needed.
Proteins (PDB ID, number of residues): 1HTB (374), 4PTI (58), 1R69 (63), 1CTF (68)
Further Reduction using SVD
• We apply SVD to the reduced distance matrices (stacked as vectors)
• We project the reduced matrices onto the important singular vectors to further reduce the size.
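The two steps above can be sketched with NumPy as follows. The function names `fit_svd_projection` and `project` are hypothetical, and subtracting the mean distance vector before the SVD is an assumption of this sketch that the slides do not state.

```python
import numpy as np

def fit_svd_projection(D, r):
    """D is (num_conformations, num_distances): one distance vector per row.

    Returns the mean vector and the r leading right singular vectors.
    """
    D = np.asarray(D, dtype=float)
    mean = D.mean(axis=0)  # centering is an assumption of this sketch
    _, _, Vt = np.linalg.svd(D - mean, full_matrices=False)
    return mean, Vt[:r]

def project(D, mean, basis):
    """Coordinates of each distance vector in the r-dimensional basis."""
    return (np.asarray(D, dtype=float) - mean) @ basis.T
```

Euclidean distances between projected rows then approximate the Euclidean distances between the original distance vectors, which equal dRMS up to a constant factor.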
Further Reduction – Results
• Averaging before creating internal distances vector makes SVD feasible
• For proteins with 60-75 AA, dRMS using only 20 parameters was highly correlated (r > 0.90) with dRMS using full representation.
• 20 dimensions is not too many for kd-trees.
Finding k Nearest Neighbors
• We tested the actual ability of the reduced representation to find NNs
• 80 of the 100 true NNs (using dRMS) were found using the reduced rep. of decoy sets
• Results are better (90 of 100) when the data set contains uniformly sampled conformations
• The maximal relative error was 10% - 20% (0.5Å – 1.5Å)
• The average relative error was < 5%
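One way to compute the "true NNs found" figure is to compare, per conformation, the neighbor list from the full representation with the list from the reduced one. The sketch below is only an illustration of that bookkeeping; the function name `mean_recall` is hypothetical and the slides do not specify how the evaluation was implemented.

```python
import numpy as np

def mean_recall(true_nn, approx_nn):
    """Average fraction of true neighbors that also appear in approx_nn.

    Both arguments are (num_queries, k) arrays of neighbor indices.
    """
    true_nn = np.asarray(true_nn)
    approx_nn = np.asarray(approx_nn)
    k = true_nn.shape[1]
    # Count shared indices per query, ignoring the order within each list.
    hits = [len(set(t) & set(a)) for t, a in zip(true_nn, approx_nn)]
    return float(np.mean(hits)) / k
```

With k = 100, a return value of 0.8 corresponds to "80 of the 100 true NNs".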
Using kd-trees
• We used the ANN implementation (UMD kd-tree software).
• The data set contained 100,000 conformations.
• We want to find 100 NN for each conformation.
Full rep., cRMS (brute force) : ~52h
Ave. rep., cRMS (brute force) : ~35h
Full rep., dRMS (brute force) : ~84h
Ave. rep., dRMS (brute force) : ~4.8h
SVD red. rep., dRMS (brute force) : 41min
SVD red. rep., dRMS (kd-tree) : 19min
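For readers without the ANN library, the kd-tree query step can be sketched with SciPy's `cKDTree` as a stand-in; this is not the ANN API and its timings will differ from those listed above. The function name `kdtree_knn` is hypothetical, and `Y` is assumed to be a (num_conformations, r) array of reduced vectors.

```python
import numpy as np
from scipy.spatial import cKDTree

def kdtree_knn(Y, k):
    """k nearest neighbors of every row of Y under Euclidean distance."""
    Y = np.asarray(Y, dtype=float)
    tree = cKDTree(Y)
    # Ask for k + 1 hits: each point's nearest hit is itself at distance 0
    # (this assumes Y contains no exact duplicate rows).
    dist, idx = tree.query(Y, k=k + 1)
    return idx[:, 1:], dist[:, 1:]
```

The returned indices can be compared against brute-force results on the full representation to check how many true neighbors are recovered.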
Why Does Averaging Work?
• The mean distance of the i’th point from the origin is O(N^0.5) and its stdev is also O(N^0.5).
• There is very high correlation between dRMS using the full distances vector and dRMS using only distances between “highly” separated points
• The amount of distortion added by averaging has a mean of 0 and stdev of O(n^0.5)
Conjecture:
The important differences between two conformations are found in the distances between “highly” separated points. These distances are large and are therefore distorted only a little by averaging.