Efficient Nearest-Neighbor Search in Large Sets of Protein Conformations

Fabian Schwarzer

Itay Lotan

Motivation

• SRS

1. Sample conformations

2. Create edges between “neighboring” conformations

• Ab-initio structure prediction

1. Generate a large decoy set

2. Cluster based on similarity

When the number of conformations is large, finding neighboring (similar) conformations is costly

Similarity Measures

• Given the backbone Cα atom positions of two conformations, how similar are they?

– Hard to define when comparing two different proteins

– Straightforward when comparing two conformations of the same protein.

Similarity Measures

• We are interested in comparing conformations of the same protein.

• Hence there is a trivial correspondence between the two point sets: the i-th Cα atom of one conformation matches the i-th Cα atom of the other.

• The two most common measures are:

– cRMS deviation

– dRMS deviation

cRMS

$$\mathrm{cRMS}(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| c_i^{P} - T c_i^{Q} \right\|^{2}}$$

T is the rigid body transform that optimally aligns P and Q

• cRMS is a metric, but the space is not Euclidean

• There is a closed form solution for T

• Complexity is linear in the number of points (plus a 4×4 eigenvector computation)
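The cRMS definition can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it uses the Kabsch SVD construction for the optimal rotation rather than the 4×4 eigenvector formulation mentioned on the slide, and the function name `crms` and the (n, 3) array layout are assumptions made for the example.

```python
import numpy as np

def crms(P, Q):
    """cRMS between two (n, 3) arrays of corresponding C-alpha positions."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # The optimal translation makes the two centroids coincide.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Kabsch construction: SVD of the 3x3 covariance matrix sum_i q_i p_i^T.
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate Q onto P and take the root-mean-square deviation.
    diff = Pc - Qc @ R.T
    return float(np.sqrt((diff ** 2).sum() / len(P)))
```

Both formulations cost time linear in n for the covariance matrix plus a constant-size decomposition, which matches the complexity stated above.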

dRMS

• A metric over a Euclidean space.

• Complexity is quadratic in the number of points (size of protein)

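$$\mathrm{dRMS}(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( d_{ij}^{P} - d_{ij}^{Q} \right)^{2}}$$

D is the internal distances matrix:

$$d_{ij} = \left\| c_i - c_j \right\|_{2}$$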
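As a minimal sketch of the dRMS computation, assuming conformations are stored as (n, 3) NumPy arrays of Cα positions (the function names `internal_distances` and `drms` are made up for this example):

```python
import numpy as np

def internal_distances(C):
    """Vector of the n(n-1)/2 pairwise distances between the rows of C."""
    C = np.asarray(C, dtype=float)
    i, j = np.triu_indices(len(C), k=1)  # every unordered pair exactly once
    return np.linalg.norm(C[i] - C[j], axis=1)

def drms(P, Q):
    """dRMS: root-mean-square difference of corresponding internal distances."""
    diff = internal_distances(P) - internal_distances(Q)
    # The mean over n(n-1)/2 pairs equals the 2 / (n(n-1)) factor.
    return float(np.sqrt(np.mean(diff ** 2)))
```

Because dRMS is an ordinary Euclidean distance (up to a constant factor) between the two internal-distance vectors, each conformation can be converted to its vector once and then compared without any alignment step.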

k Nearest Neighbors

• Find the k nearest neighbors of every conformation in the set

• Currently the fastest algorithm in practice for high dimensionality is brute force:

For each conformation q in the set:

    Compute the distance to all other conformations

    Find the k nearest conformations

• Complexity is O(n² log k), where n here is the number of conformations

• The literature has a number of efficient nearest-neighbor algorithms:

– kd-trees are the most prevalent

• We cannot use these algorithms directly:

– They require a Euclidean space, which rules out cRMS

– They are not efficient in high dimensions, which rules out dRMS
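The brute-force procedure described on this slide can be sketched as follows. This is an illustration only: the function name `brute_force_knn` and the `dist` callback are assumptions, and `dist` stands for any distance function, for example a cRMS or dRMS routine.

```python
import heapq

def brute_force_knn(items, k, dist):
    """For every item, return the indices of its k nearest other items."""
    neighbors = []
    for a, q in enumerate(items):
        # Distances from q to every other item, paired with the item index.
        candidates = ((dist(q, p), b) for b, p in enumerate(items) if b != a)
        # nsmallest keeps a heap of size k, giving O(n log k) work per query.
        nearest = heapq.nsmallest(k, candidates)
        neighbors.append([b for _, b in nearest])
    return neighbors
```

Running the loop over all n items gives the O(n² log k) total stated above, not counting the cost of each individual distance evaluation.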

k Nearest Neighbors

We reduce the dimensionality of dRMS to make kd-trees applicable.

Uniform Simplification

• Cut sequence into m equal subsequences

• Average the coordinates of the Cα atoms in each subsequence

• Use averaged coordinates ai when computing cRMS and dRMS

[Figure: averaged points a0, a1, a2, ..., am along the backbone]
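The averaging step can be sketched with NumPy as below. The function name `uniform_simplify` is hypothetical, and the use of `np.array_split` is an assumption made so that the sketch also works when the chain length is not an exact multiple of m (segment lengths then differ by at most one); it expects m to be no larger than the number of atoms.

```python
import numpy as np

def uniform_simplify(C, m):
    """Average the (n, 3) C-alpha coordinates over m consecutive segments."""
    C = np.asarray(C, dtype=float)
    # Split the chain into m consecutive, nearly equal subsequences.
    segments = np.array_split(C, m)
    # One averaged point per subsequence gives an (m, 3) array.
    return np.stack([seg.mean(axis=0) for seg in segments])
```

The resulting (m, 3) array can be passed to a cRMS or dRMS routine in place of the full set of Cα positions.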

Uniform Simplification - Results

• There is a high correlation between the full and the averaged representation when using cRMS and dRMS:

– Proteins with 60–75 AA: r > 0.95 for m > 12

– Protein with 374 AA: r > 0.95 for m > 16

Even with m = 12, the internal distances vector used by dRMS has 12·11/2 = 66 dimensions, which is too many for a kd-tree to be used. Further reduction is needed.

Proteins (PDB ID and number of amino acids)

1HTB (374)

4PTI (58)

1R69 (63)

1CTF (68)

Further Reduction using SVD

• We apply SVD to the reduced distance matrices (stacked as vectors)

• We project the reduced matrices onto the most important singular vectors (those with the largest singular values) to further reduce the dimensionality.
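A minimal sketch of this projection, assuming the internal-distance vectors of all conformations are stacked as the rows of a matrix `X`. The function name `svd_reduce` is hypothetical, and the sketch applies SVD to `X` without mean-centering, which is an assumption because the slides do not say whether the data were centered.

```python
import numpy as np

def svd_reduce(X, d):
    """Project the rows of X onto the d leading right singular vectors."""
    X = np.asarray(X, dtype=float)
    # Rows of Vt are orthonormal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    basis = Vt[:d]
    # Each row of the result is the d-dimensional coordinate vector of a row of X.
    return X @ basis.T, basis
```

Euclidean distances between the projected rows approximate the distances between the original rows, and the approximation becomes exact when the data lie in a d-dimensional subspace.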

Further Reduction – Results.

• Averaging before creating internal distances vector makes SVD feasible

• For proteins with 60–75 AA, dRMS using only 20 parameters was highly correlated (r > 0.90) with dRMS using the full representation.

• 20 dimensions is not too many for kd-trees.

Finding k Nearest Neighbors

• We tested the actual ability of the reduced representation to find NNs

• 80 of the 100 true NNs (using dRMS) were found using the reduced representation of decoy sets

• Results are better (90 of 100) when the data set contains uniformly sampled conformations

• The maximal relative error was 10% - 20% (0.5Å – 1.5Å)

• The average relative error was < 5%
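The "fraction of true NNs found" figure can be computed per query as a simple set overlap. The helper below is only an illustrative sketch with a made-up name, `knn_overlap`; it is not taken from the authors' evaluation code.

```python
def knn_overlap(true_neighbors, found_neighbors):
    """Fraction of the true nearest neighbors that also appear in the found set."""
    true_set = set(true_neighbors)
    return len(true_set & set(found_neighbors)) / len(true_set)
```

Averaging this value over all query conformations gives a single score for how well the reduced representation recovers the neighbors found with the full dRMS.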

Using kd-trees

• We used the ANN implementation (UMD kd-tree software).

• The data set contained 100,000 conformations.

• We want to find 100 NN for each conformation.

Representation   Measure   Search        Time
Full             cRMS      brute force   ~52 h
Averaged         cRMS      brute force   ~35 h
Full             dRMS      brute force   ~84 h
Averaged         dRMS      brute force   ~4.8 h
SVD-reduced      dRMS      brute force   41 min
SVD-reduced      dRMS      kd-tree       19 min
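To illustrate how a kd-tree answers k-nearest-neighbor queries on low-dimensional Euclidean vectors, here is a small self-contained sketch. It is a plain exact kd-tree written for this example and does not reproduce the ANN library or its API; the names `build_kdtree` and `knn_query` are hypothetical.

```python
import heapq
import numpy as np

def build_kdtree(points, idx=None, depth=0):
    """Recursively split the point indices at the median along cycling axes."""
    if idx is None:
        idx = np.arange(len(points))
    if len(idx) == 0:
        return None
    axis = depth % points.shape[1]
    order = idx[np.argsort(points[idx, axis])]
    mid = len(order) // 2
    return {
        "index": int(order[mid]),
        "axis": axis,
        "left": build_kdtree(points, order[:mid], depth + 1),
        "right": build_kdtree(points, order[mid + 1:], depth + 1),
    }

def knn_query(node, points, q, k, heap=None):
    """Return a heap of (-distance, index) pairs for the k nearest points to q."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    i, axis = node["index"], node["axis"]
    d = float(np.linalg.norm(points[i] - q))
    if len(heap) < k:
        heapq.heappush(heap, (-d, i))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, i))  # replace the current worst neighbor
    delta = q[axis] - points[i, axis]
    near, far = ("left", "right") if delta < 0 else ("right", "left")
    knn_query(node[near], points, q, k, heap)
    # Visit the far side only if the splitting plane is closer than the
    # current k-th best distance (or fewer than k neighbors are known).
    if len(heap) < k or abs(delta) < -heap[0][0]:
        knn_query(node[far], points, q, k, heap)
    return heap
```

A call such as `knn_query(build_kdtree(Y), Y, Y[0], 101)` would return the query point itself plus its 100 nearest neighbors, where `Y` is assumed to be an array of SVD-reduced vectors. The pruning test is what saves work over brute force, and it loses its effect as the dimension grows, which is why the reduction to about 20 dimensions matters.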

Why Does Averaging Work?

• The mean distance of the i-th point from the origin is O(N^0.5), and its standard deviation is also O(N^0.5).

• There is a very high correlation between dRMS computed from the full distances vector and dRMS computed only from distances between “highly” separated points

• The distortion added by averaging has a mean of 0 and a standard deviation of O(n^0.5)

Conjecture:

The important differences between two conformations are found in the distances between “highly” separated points. These distances are large and are therefore distorted only a little by averaging.
