
  • Res2Vec: Amino acid vector embeddings from 3d-protein structure

    Scott A. Longwell 1 ([email protected]), Tyler C. Shimko 2 ([email protected])

    1Department of Bioengineering, Stanford University, Stanford, CA, 94305; 2Department of Genetics, Stanford University, Stanford, CA, 94305

    Abstract

    Data

    Features

    Models

    Results

    Discussion

    Conclusions/Future Directions

    References

    The twenty naturally-occurring amino acid residues are the building blocks of all proteins. These residues confer distinct physicochemical properties to proteins based on their chemical composition, size, charge, and other properties. Here, inspired by recent work in the field of natural language processing [1], we propose a vector embedding model for amino acids based on their neighbors in 3-dimensional space within folded, active protein structures. We then test the utility of such a model by using the embedded vectors to predict the effect of mutation on protein activity.

    The data for this project were acquired from the RCSB PDB [2] and split into training, validation, and test datasets using the methodology described by ProteinNet (https://github.com/aqlaboratory/proteinnet). The number of example proteins per sequence-similarity threshold and dataset is shown in the table to the right. Note that each protein is composed of hundreds of amino acids.
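
    As an illustrative sketch only (the project itself consumed ProteinNet records, which package the same information), the per-residue identities and alpha-carbon coordinates can be pulled directly from a PDB file with Biopython. The function name ca_coords and the single-model assumption below are ours, not part of the original pipeline.

    # Sketch: extract residue names and alpha-carbon coordinates from a PDB file.
    import numpy as np
    from Bio.PDB import PDBParser  # assumes biopython is installed

    def ca_coords(pdb_path, model_index=0):
        """Return (residue_names, Nx3 array of alpha-carbon coordinates)."""
        structure = PDBParser(QUIET=True).get_structure("prot", pdb_path)
        names, coords = [], []
        for residue in structure[model_index].get_residues():
            if "CA" in residue:                      # skip waters/ligands without an alpha carbon
                names.append(residue.get_resname())  # three-letter amino acid code
                coords.append(residue["CA"].get_coord())
        return names, np.array(coords)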

    To provide a uniform coordinate basis for each focus residue, we centered the coordinate system at the alpha carbon of the residue of interest and transformed the coordinates to adhere to the coordinate system outlined below. This change of basis implicitly corrects for the arbitrary orientation of each deposited structure, making the features rotation-invariant.
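
    The sketch below shows one standard way to build such a local frame from the backbone N, CA, and C atoms and to express neighbor coordinates in it. The specific axis convention here is an illustrative assumption; the poster's exact convention is the one depicted in Figure 1.

    # Sketch of the change of basis: an orthonormal frame anchored at the focus
    # residue's alpha carbon, built from its backbone N, CA, and C atoms.
    import numpy as np

    def local_frame(n_xyz, ca_xyz, c_xyz):
        """Return a 3x3 rotation matrix whose rows are the local x, y, z axes."""
        x = c_xyz - ca_xyz
        x /= np.linalg.norm(x)          # x: along CA -> C (assumed convention)
        v = n_xyz - ca_xyz
        z = np.cross(x, v)
        z /= np.linalg.norm(z)          # z: normal to the N-CA-C plane
        y = np.cross(z, x)              # y: completes the right-handed frame
        return np.stack([x, y, z])

    def to_local(frame, ca_xyz, neighbor_xyz):
        """Translate to the focus alpha carbon and rotate into the local frame."""
        return frame @ (neighbor_xyz - ca_xyz)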

    Many different model architectures were trained, but we ultimately settled on an architecture with 20 hidden units and a fully connected coordinate-combination layer. This model was trained on the 30% sequence-similarity dataset and achieved a training-set average loss of 2.12, a validation-set average loss of 2.14, and a validation-set accuracy of 8.7%. On the test set, the model had an accuracy of 8.9%.
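
    A minimal PyTorch sketch consistent with this description and with Figure 3 is shown below: a two-layer MLP on the context coordinates produces a weight per context residue, the weighted sum of their embeddings forms "x", and a linear layer scores the 20 possible focus residues. The layer names and the embedding size are assumptions, not the authors' exact configuration.

    import torch
    import torch.nn as nn

    class Res2Vec(nn.Module):
        def __init__(self, n_context=10, hidden=20, embed_dim=20, n_residues=20):
            super().__init__()
            self.embed = nn.Embedding(n_residues, embed_dim)
            self.weigher = nn.Sequential(           # ReLU(linear(ReLU(linear(coords))))
                nn.Linear(n_context * 3, hidden), nn.ReLU(),
                nn.Linear(hidden, n_context), nn.ReLU(),
            )
            self.classify = nn.Linear(embed_dim, n_residues)

        def forward(self, context_ids, context_coords):
            # context_ids: (batch, 10) residue indices; context_coords: (batch, 10, 3)
            w = self.weigher(context_coords.flatten(1))              # one weight per context residue
            x = (w.unsqueeze(-1) * self.embed(context_ids)).sum(dim=1)
            return self.classify(x)                                  # logits for CrossEntropyLoss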

    We assessed the performance of the embeddings by training an SVM to predict the effect of mutation on the T4 lysozyme protein [3]. We compared the performance of our embeddings to one-hot encoding vectors and BLOSUM empirical substitution matrices [4].
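
    A hedged sketch of this evaluation with scikit-learn appears below, scored with four-fold cross-validation as in Figure 7. How the wild-type and mutant residue representations are combined into a single feature vector (plain concatenation here) and the data-loading interface are assumptions on our part.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def evaluate(embedding, mutations, labels):
        """embedding: dict residue -> vector; mutations: list of (wt, mut) pairs."""
        X = np.array([np.concatenate([embedding[wt], embedding[mut]])
                      for wt, mut in mutations])
        y = np.array(labels)                           # categorical mutation effect
        scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=4)
        return scores.mean()

    The same routine can be run with one-hot vectors or BLOSUM rows in place of the learned embeddings to reproduce the comparison in Figure 7.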

    [1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv (cs.CL), Jan. 2013.

    [2] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, “The Protein Data Bank,” Nucl. Acids Res., vol. 28, no. 1, pp. 235–242, Jan. 2000.

    [3] W. Torng and R. B. Altman, “3D deep convolutional neural networks for amino acid environment similarity analysis,” BMC Bioinformatics, vol. 18, no. 1, p. 302, Dec. 2017.

    [4] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” PNAS, vol. 89, no. 22, pp. 10915–10919, Nov. 1992.

    Figure 1: The transformed coordinate system shown with respect to the target amino acid. The red arrow indicates the positive direction of the x-axis and the green arrow denotes the positive direction of the y-axis, both in the plane of the page. The blue circle indicates an arrow in the positive direction along the z-axis, perpendicular to the plane of the page.

    Figure 2: The transformed coordinate system shown in place at a specific focus residue. The x, y, and z axes are shown in red, green, and blue, respectively, and a line is drawn between the alpha carbon of the focus residue and the alpha carbon of each of the 10 closest residues. The original PDB coordinate system is shown in the upper left.
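
    A small sketch of that context selection, assuming plain alpha-carbon-to-alpha-carbon Euclidean distances; a KD-tree would serve the same purpose on large structures.

    import numpy as np

    def nearest_context(ca_coords, focus_idx, k=10):
        """ca_coords: (N, 3) alpha-carbon coordinates; return indices of the k nearest residues."""
        d = np.linalg.norm(ca_coords - ca_coords[focus_idx], axis=1)
        d[focus_idx] = np.inf                  # exclude the focus residue itself
        return np.argsort(d)[:k]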

    Figure 6: The first 2 principal components of the embedding vectors for each residue are shown (44.5% and 9.4% of variance explained, respectively). Residues are identified by single-letter code and colored by predominant chemical property.
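
    For reference, such a projection can be produced from the learned 20-by-d embedding matrix with scikit-learn; the helper below is a sketch, not the plotting code used for the figure.

    import numpy as np
    from sklearn.decomposition import PCA

    def project_embeddings(vectors):
        """vectors: (20, embed_dim) learned residue embeddings -> (20, 2) PC scores."""
        pca = PCA(n_components=2)
        coords_2d = pca.fit_transform(np.asarray(vectors))
        return coords_2d, pca.explained_variance_ratio_   # per-component variance explained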

    Figure 5: The confusion matrices of the training set (top) and held-out test set (bottom).

    Figure 4: Train loss (left), validation loss (middle), and validation accuracy (right) traces, measured once per epoch.

    Figure 3: A diagram of the model architecture. The following vector is used to weight the 10 context residues to generate "x": ReLU(linear(ReLU(linear(coords)))).


    The vector embeddings we created capture a sizable portion of the variation in amino acid properties. We applied these embeddings to predict the (categorical) effect of mutations on the T4 lysozyme protein, and our embeddings matched the performance of the current standard in the field, BLOSUM empirical substitution matrices, without the need for computationally expensive sequence alignment. Given further time to improve model performance, we would implement the following changes:

    To improve training:
    - Incorporate information about the angles of context amino acid side chains
    - Use the distance to any atom in the side chain as the context distance metric

    To evaluate utility:
    - Train a regression model for mutation effect prediction, in addition to the SVM described in the Discussion section above

    Figure 7: SVM model accuracy for T4 lysozyme mutation effect prediction. Four-fold cross-validation was employed and the mean accuracy is shown. The formulae for the V_dot and V_freq scores (adapted from [3]) are shown at right.