
Application of generative autoencoder in de novo molecular design

Thomas Blaschke1,2,†, Marcus Olivecrona1, Ola Engkvist1, Jürgen Bajorath2, and Hongming Chen1,†

1Hit Discovery, Discovery Sciences, Innovative Medicines and Early Development Biotech Unit, AstraZeneca R&D Gothenburg, 431 83 Mölndal, Sweden

2Bonn Aachen International Center for Information Technology BIT, University of Bonn, Dahlmannstrasse 2, 53113 Bonn, Germany

[email protected], [email protected]

November 20, 2017

Abstract

A major challenge in computational chemistry is the generation of novel molecular structures with desirable pharmacological and physicochemical properties. In this work, we investigate the potential use of autoencoders, a deep learning methodology, for de novo molecular design. Various generative autoencoders were used to map molecular structures into a continuous latent space and vice versa, and their performance as structure generators was assessed. Our results show that the latent space preserves the chemical similarity principle and can thus be used for the generation of analogue structures. Furthermore, the latent space created by the autoencoders was searched systematically to generate novel compounds with predicted activity against dopamine receptor type 2, and compounds similar to known active compounds that were not included in the training set were identified.

1 Introduction

Over the last decade, deep learning (DL) technology has been successfully applied in various areas of artificial intelligence research. DL has evolved from artificial neural networks and has often shown superior performance compared to other machine learning algorithms in areas such as image or voice recognition and natural language processing. Recently, DL has been successfully applied to different research areas in drug discovery. One notable application has been the use of a fully connected deep neural network (DNN) to build quantitative structure-activity relationship (QSAR) models that have outperformed some commonly used machine learning algorithms. [1] Another interesting application of DL is training various types of neural networks to build generative models for generating novel structures. Segler et al. [2] and Yuan et al. [3] applied a recurrent neural network (RNN) to a large number of chemical structures represented as SMILES strings to obtain generative models which learn the probability distribution of characters in a SMILES string. The resulting models were capable of generating new strings which correspond to chemically meaningful SMILES. Jaques et al. [4] combined an RNN with a reinforcement learning method, deep Q-learning, to train models that can generate molecular structures with desirable values for cLogP and the quantitative estimate of drug-likeness (QED). [5] Olivecrona et al. [6] proposed a policy-based reinforcement learning approach to tune pretrained RNNs for generating molecules with user-defined properties. The method has been successfully applied to inverse QSAR problems, i.e. generating new structures under the constraint of a QSAR model. Such a generative DL approach allows the inverse-QSAR problem to be addressed more directly. The predominant inverse-QSAR approach consists of three steps. [7, 8] First, a forward QSAR model is built to fit biological activity with chemical descriptors. Second, given the forward QSAR function, an inverse-mapping function is defined to map the activity to chemical descriptor values, so that a set of molecular descriptor values resulting in high activity can be obtained.

Figure 1: An autoencoder is a coordinated pair of NNs. The encoder converts a high-dimensional input, e.g. a molecule, into a continuous numerical representation with fixed dimensionality. The decoder reconstructs the input from the numerical representation.

Third, the obtained molecular descriptor values are then translated into new compound structures. How to define an explicit inverse-mapping function transforming descriptor values into chemical structures is a major challenge for such inverse-QSAR methods. Miyao et al. applied Gaussian mixture models to address the inverse-QSAR problem, but their method can only be applied to linear regression or multiple linear regression QSAR models for the inverse-mapping. [9] This largely limits the usability of the method, since most popular machine learning models are non-linear. [10] To address this issue, Miyao et al. recently applied differential evolution to obtain optimized molecular descriptors for non-linear QSAR models, [11] though the identification of novel compounds using virtual screening remained challenging. In contrast, the generative DL approach allows desirable molecules (as determined by the forward QSAR models) to be generated directly, without the need for an explicit inverse-mapping function.

An autoencoder (AE) is a type of NN for unsupervised learning. It first encodes an input variable into latent variables and then decodes the latent variables to reproduce the input information. Gómez-Bombarelli et al. proposed a novel method [12] using a variational autoencoder (VAE) to generate chemical structures. After training the VAE on a large number of compound structures, the resulting latent space becomes a generative model. Sampling from the latent vectors results in chemical structures. By navigating in the latent space one could specifically search for latent points with desired chemical properties. However, a targeted search proved difficult in a case study on designing organic light-emitting diodes. [12] Based upon Gómez-Bombarelli's work, Kadurin et al. [13] used a VAE as a descriptor generator and coupled it with a generative adversarial network (GAN), [14] a special NN architecture, to identify new structures that were proposed to have desired activities. In our current study, we have constructed various types of AE models and compared their performance in structure generation. Furthermore, Bayesian optimization was used to search for new compounds in the latent space guided by user-defined target functions. This strategy was used successfully to generate new structures that were predicted to be active by a QSAR model.

2 Methods

2.1 Neural networks

In this study we combined different NN architectures to generate molecular structures. This section provides background information on all relevant architectures used in this work.

2.1.1 Autoencoder

An autoencoder (AE) is a NN architecture for unsupervised feature extraction. A basic AE consists of an encoder, a decoder and a distance function (Figure 1). The encoder is a NN that maps high-dimensional input data to a lower dimensional representation (latent space), whereas the decoder is a NN that reconstructs the original input from the lower dimensional representation. A distance function quantifies the information loss derived from the deviation between the original input and the reconstructed output. The goal of the training is to minimize the information loss of the reconstruction. Because the target labels for the reconstruction are generated from the input data, the AE is regarded as self-supervised.
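For illustration, a minimal PyTorch sketch of this encoder/decoder/distance-function setup is shown below. The layer sizes and the MSE distance are illustrative assumptions, not the architecture used later in this work.

```python
import torch
import torch.nn as nn

class BasicAutoencoder(nn.Module):
    """Minimal AE: the encoder compresses the input, the decoder reconstructs it."""
    def __init__(self, input_dim=512, latent_dim=56):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.SELU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.SELU(), nn.Linear(128, input_dim))

    def forward(self, x):
        z = self.encoder(x)      # lower dimensional latent representation
        return self.decoder(z)   # reconstruction of the input

model = BasicAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
distance = nn.MSELoss()          # distance function quantifying the information loss

x = torch.rand(32, 512)          # toy batch standing in for encoded molecules
loss = distance(model(x), x)     # reconstruction error to be minimized
loss.backward()
optimizer.step()
```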

Mapping the input to and from a lower dimensional space introduces an information bottleneck, so the AE is forced to extract informative features from the high-dimensional input. The most common dimensionality reduction technique introducing an information bottleneck is principal component analysis (PCA). [15]

Figure 2: Encoding and decoding of a molecule using a variational autoencoder. The encoder converts a molecule structure into a Gaussian distribution deterministically. Given the generated mean and variance, a new point is sampled and fed into the decoder. The decoder then generates a new molecule from the sampled point.

In fact, the basic AE is sometimes referred to as non-linear PCA. [16] The dimensionality reduction performance and the usefulness of the extracted features depend on the input data and the actual architecture of the encoder and decoder NNs. It has been shown that recurrent and convolutional NNs can successfully generate text sequences and molecular structures. [12, 17]

2.1.2 Recurrent Neural Networks

Recurrent NNs (RNNs) are popular architectures with high potential for natural language processing. [18] The idea behind RNNs is to apply the same function to each element of sequential data, with the output depending on the result of the previous step. RNN computations can be rationalized using the concept of a cell. For any given step t, cell t is a result of the previous cell t-1 and the current input x. The content of cell t determines the output of the current step and influences the next cell state. This enables the network to memorize past events and model data according to previous inputs. This concept implicitly assumes that the most recent events are more important than early events, since recent events influence the content of the cell the most. However, this might not be an appropriate assumption for all data sequences; therefore, Hochreiter et al. [19] introduced the Long Short-Term Memory (LSTM) cell. Through a more controlled flow of information, this cell type can decide which previous information to retain and which to discard. The Gated Recurrent Unit (GRU) is a simplified implementation of the LSTM architecture and achieves much of the same effect at a reduced computational cost. [20]

2.1.3 Convolutional Neural Networks

Convolutional NNs (CNNs) are commonly used for pattern recognition in images and feature extraction from text. [21] A CNN consists of an input and an output layer as well as multiple hidden layers. The hidden layers are convolutional, pooling or fully connected layers. Details of the different layer types in CNNs can be found in Simard et al. [22] The key feature of a CNN is the introduction of convolution layers. In each layer, the same convolution operation is applied to the input data, so that each element is replaced by a linear combination of its neighbours. The parameters of the linear combination are referred to as the filter or kernel, whereas the number of considered neighbours is referred to as the filter size or kernel size. The output of applying one filter to the input information is called a feature map. Applying a non-linearity such as the sigmoid function or scaled exponential linear units (SELU) [23] to a feature map makes it possible to model nonlinear data. Furthermore, applying another convolution layer on top of a feature map allows features of spatially separated inputs to be modelled. The pooling layer is used to reduce the size of the feature maps. After passing through multiple convolution and pooling layers, the feature maps are concatenated into fully connected layers, in which every neuron of neighbouring layers is connected, to give the final output value.

2.2 Implementation details

2.2.1 Variational autoencoder

The basic AE described in Section 2.1.1 maps a molecule X into a continuous space z, and the decoder reconstructs the molecule from its continuous representation. However, with this basic definition, the model is not forced to learn a generalized numeric representation of the molecules. Due to the large number of parameters of the NNs and the comparatively small amount of training data, the AE will likely learn an explicit mapping of the training set, and thus the decoder will not be able to decode arbitrary points in the continuous space.

Figure 3: Sequence generation using teacher forcing. The last decoder layer trained with teacher forcing receives two inputs: the output of the previous layer and a character from the previous time step. In training mode, the previous character is equal to the corresponding character from the input sequence, regardless of the probability output. In generation mode the decoder samples a new character at each time step based on the output probability and uses this as input for the next time step.

To avoid learning an explicit mapping, we restrict the model to learning a latent variable from its input data. The variational autoencoder (VAE) [24] provides a formulation in which the continuous representation z is interpreted as a latent variable in a probabilistic generative model. Let p(z) be the prior distribution imposed on the continuous representation, pθ(X|z) the probabilistic decoding distribution and qφ(z|X) the probabilistic encoding distribution (shown in Figure 2).

The parameters of pθ(X|z) and qφ(z|X) can be inferred during the training of the VAE via backpropagation. During training, the reconstruction error of the decoder is reduced by maximizing the log-likelihood pθ(X|z). Simultaneously, the encoder is regularized to approximate the latent variable distribution p(z) by minimizing the Kullback-Leibler divergence DKL(qφ(z|X)||p(z)). If the prior p(z) is chosen to be a multivariate Gaussian distribution with zero mean and unit variance, the training objective to be maximized can be formulated as:

$\mathcal{L} = -D_{KL}(q_\phi(z|X) \,\|\, \mathcal{N}(0, I)) + \mathbb{E}[\log p_\theta(X|z)]$
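As a minimal sketch (tensor names and shapes are illustrative, not the exact implementation used here), the reparameterization step and the negative of this objective, i.e. the quantity actually minimized during training, could be computed in PyTorch as follows; the KL term uses the standard closed form for a standard normal prior.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so that gradients flow through mu and logvar."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(recon_logits, target_tokens, mu, logvar):
    """Negative ELBO: reconstruction negative log-likelihood plus KL to N(0, I)."""
    # Reconstruction term: cross-entropy over the token vocabulary at each position.
    recon_nll = F.cross_entropy(
        recon_logits.view(-1, recon_logits.size(-1)),  # (batch * seq_len, vocab)
        target_tokens.view(-1),                        # (batch * seq_len,)
        reduction="sum")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_nll + kl  # minimizing this maximizes the objective above
```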

Our VAE is implemented using the PyTorch package [25] and closely follows the Gómez-Bombarelli architecture. [12] Like Gómez-Bombarelli et al., we used three CNN layers followed by two fully connected layers as the encoder. Instead of rectified linear units, [24] SELU [23] was used as the non-linear activation function, allowing faster convergence of the model. The output of the last encoder layer is interpreted as the mean and variance of a Gaussian distribution. A random point is sampled using these parameters and used as input for the decoder. The decoder employs a fully connected layer followed by three layers of RNNs built from GRU cells. The last GRU layer defines the probability distribution over all possible outputs at each position in the output sequence. As shown in Figure 3, in training mode the input of the last layer is a concatenation of the output of the previous layer and the token from the target SMILES (in the training set), while in generation mode the input of the last layer is a concatenation of the output of the previous layer and the token sampled from the previous GRU cell (Figure 3). This method is called teacher forcing [26] and is known to increase performance on long text generation tasks. [27] For comparison, we also trained a VAE model without teacher forcing, where the input of the last GRU layer was not influenced by any previous token but depended only on the output of the previous GRU layer in both training and generation mode.
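The sketch below illustrates the teacher forcing idea for a single final GRU layer. It is a simplification under assumed dimensions: the previous-layer output is treated as a fixed per-sequence context vector rather than a per-step output, and the start token is assumed to be index 0.

```python
import torch
import torch.nn as nn

class LastDecoderLayer(nn.Module):
    """Final GRU layer fed with [previous-layer output ; embedding of previous token]."""
    def __init__(self, hidden_dim=256, vocab_size=35):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru_cell = nn.GRUCell(2 * hidden_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_layer_out, target_tokens=None, max_len=120):
        batch = prev_layer_out.size(0)
        h = torch.zeros(batch, self.gru_cell.hidden_size)
        token = torch.zeros(batch, dtype=torch.long)   # assumed start token (index 0)
        logits = []
        for t in range(max_len):
            step_in = torch.cat([prev_layer_out, self.embed(token)], dim=-1)
            h = self.gru_cell(step_in, h)
            step_logits = self.to_vocab(h)
            logits.append(step_logits)
            if target_tokens is not None:
                token = target_tokens[:, t]            # training mode: teacher forcing
            else:
                probs = step_logits.softmax(dim=-1)    # generation mode: sample a token
                token = torch.multinomial(probs, 1).squeeze(-1)
        return torch.stack(logits, dim=1)              # (batch, max_len, vocab)
```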

2.2.2 Adversarial autoencoder

The VAE assumes a simplistic prior, i.e. a Gaussian distribution, on the latent representation to keep the Kullback-Leibler divergence mathematically computable. To evaluate alternative priors we implemented a modified VAE architecture known as the adversarial autoencoder (AAE). [28]

Figure 4: Learning process of an adversarial autoencoder. The encoder converts a molecule directly into a numerical representation. During training the output is not only fed into the decoder but also into a discriminator. The discriminator is trained to distinguish between the output of the encoder and a randomly sampled point from a prior distribution. The encoder is trained to fool the discriminator by mimicking the target prior distribution.

The major difference between the AAE and the VAE is that an additional discriminator NN is added to the architecture to force the output of the encoder qφ(z|X) to follow a specific target distribution, while at the same time the reconstruction error of the decoder is minimized. Figure 4 illustrates the workflow of the AAE model. To regularize the encoder posterior qφ(z|X) to match a specific prior p(z), a discriminator D was introduced. It consists of three fully connected NN layers. The first layer applies only an affine matrix transformation, whereas the second layer additionally applies SELU as a non-linear activation. The third layer applies an affine matrix transformation followed by a sigmoid function. The discriminator D receives the output z of the encoder and randomly sampled points z′ from the target prior as input. The whole training process was carried out in three sequential steps:

1. The encoder and decoder were trained simultaneously to minimize the reconstruction loss of the decoder:

   $\mathcal{L}_{p_\theta} = -\mathbb{E}[\log p_\theta(X|z)]$

2. The discriminator NN D was then trained to correctly distinguish the true input signals z′, generated from the target distribution, from the false signals z generated by the encoder, by minimizing the loss function:

   $\mathcal{L}_{D} = -\left(\log(D(z')) + \log(1 - D(z))\right)$

3. Finally, the encoder NN was trained to fool the discriminator by minimizing another loss function:

   $\mathcal{L}_{q_\phi} = -\log(D(z))$

The three steps were run iteratively for each batch until the reconstruction loss $\mathcal{L}_{p_\theta}$ had converged. For all AAE models, the teacher forcing scheme was always used for the decoder NN.
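A compact sketch of one such per-batch update is given below; encoder, decoder, discriminator, the three optimizers, sample_prior and recon_loss_fn are assumed placeholders for the networks and functions described above, and the decoder is assumed to take the target tokens for teacher forcing.

```python
import torch

def aae_training_step(encoder, decoder, discriminator,
                      enc_opt, dec_opt, disc_opt,
                      x, target_tokens, sample_prior, recon_loss_fn):
    """One AAE mini-batch update: reconstruction, discriminator, then encoder."""
    # Step 1: reconstruction phase -- update encoder and decoder.
    z = encoder(x)
    recon_loss = recon_loss_fn(decoder(z, target_tokens), target_tokens)
    enc_opt.zero_grad(); dec_opt.zero_grad()
    recon_loss.backward()
    enc_opt.step(); dec_opt.step()

    # Step 2: regularization phase -- update the discriminator.
    z_fake = encoder(x).detach()                  # encoder output treated as "fake"
    z_real = sample_prior(z_fake.size())          # samples from the target prior p(z)
    d_loss = -(torch.log(discriminator(z_real)).mean()
               + torch.log(1.0 - discriminator(z_fake)).mean())
    disc_opt.zero_grad()
    d_loss.backward()
    disc_opt.step()

    # Step 3: the encoder tries to fool the discriminator.
    g_loss = -torch.log(discriminator(encoder(x))).mean()
    enc_opt.zero_grad()
    g_loss.backward()
    enc_opt.step()
    return recon_loss.item(), d_loss.item(), g_loss.item()
```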

2.2.3 Bayesian optimization of molecules

To search for molecules with desired properties in the latent space, Bayesian optimization (BO) was implemented using the GPyOpt package. [29] Here a score estimating the probability of being active against a specific target was used as the objective function, which was maximized during the BO. Gaussian process (GP) models [30] were trained with 100 starting points to predict the score of molecules in the latent space. A new point was then selected by sequentially maximizing the expected improvement acquisition function [31] based on the GP model. The new data point was transformed into the corresponding SMILES string [32] using the above-mentioned decoder network and scored accordingly. The new latent point was then added to the GP model as an additional point with its associated score. The process was repeated for 500 iterations to search for optimal solutions.
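A sketch of how such a search can be set up with GPyOpt is shown below. decode_and_score is a hypothetical helper that decodes a latent point into a SMILES and returns its activity score, and the box bounds on the 56-dimensional latent domain are illustrative; since GPyOpt minimizes by default, the score is negated.

```python
import numpy as np
import GPyOpt

LATENT_DIM = 56

def objective(latent_points):
    """GPyOpt minimizes, so return the negated activity score for each latent point."""
    scores = [decode_and_score(z) for z in latent_points]   # hypothetical helper
    return -np.array(scores).reshape(-1, 1)

domain = [{"name": f"z{i}", "type": "continuous", "domain": (-4.0, 4.0)}
          for i in range(LATENT_DIM)]

bo = GPyOpt.methods.BayesianOptimization(
    f=objective,
    domain=domain,
    model_type="GP",
    acquisition_type="EI",            # expected improvement acquisition
    initial_design_numdata=100)       # 100 random starting points

bo.run_optimization(max_iter=500)     # 500 BO iterations
best_latent_point = bo.x_opt
```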

2.2.4 Tokenizing SMILES and generating new SMILES

A SMILES string represents a molecule as a sequence of characters corresponding to atoms as well as special characters denoting the opening and closure of rings and branches. The SMILES string is, in most cases, tokenized based on single characters; atom types that comprise two characters, such as "Cl" and "Br", are treated as one token. This method of tokenization resulted in 35 tokens present in the training data. The SMILES were canonicalized using RDKit [33] and encoded up to a maximum length of 120 tokens. Shorter SMILES were padded with spaces at the end to the same length. The sequence of tokens was converted into a one-hot representation and used as input for the AE. Figure 5 illustrates the tokenization and the one-hot encoding.
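A sketch of this tokenization and one-hot encoding is shown below, using a small illustrative vocabulary rather than the full 35-token set.

```python
import re
import numpy as np

MAX_LEN = 120
# Two-character atom tokens are matched first so "Cl" and "Br" stay single tokens.
TOKEN_PATTERN = re.compile(r"Cl|Br|.")

def tokenize(smiles):
    """Split a SMILES string into tokens, keeping 'Cl' and 'Br' as single tokens."""
    return TOKEN_PATTERN.findall(smiles)

def one_hot_encode(smiles, vocab):
    """Pad to MAX_LEN with spaces and return a (MAX_LEN, len(vocab)) one-hot matrix."""
    tokens = tokenize(smiles)
    tokens += [" "] * (MAX_LEN - len(tokens))          # pad shorter SMILES with spaces
    matrix = np.zeros((MAX_LEN, len(vocab)), dtype=np.float32)
    for position, token in enumerate(tokens):
        matrix[position, vocab.index(token)] = 1.0
    return matrix

# Illustrative reduced vocabulary (in practice all 35 tokens of the training data are used).
vocab = [" ", "Br", "C", "c", "n", "1", "[", "]", "(", ")", "N", "H", "O", "="]
encoded = one_hot_encode("BrCc1c[nH]nc1", vocab)
```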

Figure 5: Different representations of 4-(bromomethyl)-1H-pyrazole (graph, SMILES string BrCc1c[nH]nc1, and one-hot encoding). Exemplary generation of the one-hot representation derived from the SMILES. For simplicity only a reduced vocabulary is shown here, while in practice a larger vocabulary covering all tokens present in the training data is used.

Once an AE is trained, the decoder is used to generate new SMILES from arbitrary points in the latent space. Because the output of the last layer of the decoder is a probability distribution over all possible tokens, the output sequence was sampled 500 times for each latent point, yielding 500 sequences of 120 tokens. Each sequence of tokens was then transformed into a SMILES string and its validity was checked using RDKit. The most frequently sampled valid SMILES was assigned as the final output for the corresponding latent point.
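A sketch of this sampling-and-voting procedure is given below, assuming a hypothetical decoder.sample(latent_point) call that returns one token sequence per invocation.

```python
from collections import Counter
from rdkit import Chem

def decode_latent_point(decoder, latent_point, attempts=500):
    """Sample repeatedly from one latent point and return the most frequent valid SMILES."""
    counts = Counter()
    for _ in range(attempts):
        tokens = decoder.sample(latent_point)            # hypothetical sampling call
        smiles = "".join(tokens).strip()                 # drop the padding spaces
        if Chem.MolFromSmiles(smiles) is not None:       # RDKit validity check
            counts[smiles] += 1
    if not counts:
        return None                                      # no valid SMILES at this point
    return counts.most_common(1)[0][0]
```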

2.2.5 Training of AE models

Various AE models were trained on structures taken from ChEMBL version 22. [34] The SMILES were canonicalized using RDKit and the stereochemistry information was removed for simplicity. We omitted all structures with fewer than 10 heavy atoms and filtered out structures with more than 120 tokens (see Section 2.2.4). Additionally, all compounds reported to be active against the dopamine type 2 receptor (DRD2) were removed from the set. The final set contained approximately 1.3 million unique compounds, from which we used 1.2 million compounds as the training set and the remaining 162,422 compounds as a validation set. All AE models were trained to map to a 56-dimensional latent space. A mini-batch size of 500, a learning rate of 3.1 × 10−4 and the stochastic gradient optimization method ADAM [35] were used to train all models until convergence.
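A sketch of the filtering and canonicalization steps is shown below; raw_smiles (an iterable of ChEMBL SMILES) and drd2_actives (a set of canonical SMILES of DRD2 actives) are hypothetical inputs, and the tokenize helper from the Section 2.2.4 sketch is reused.

```python
from rdkit import Chem

def prepare_training_smiles(raw_smiles, drd2_actives, max_tokens=120, min_heavy_atoms=10):
    """Canonicalize, strip stereochemistry, and filter SMILES as described above."""
    kept = set()
    for smiles in raw_smiles:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None or mol.GetNumHeavyAtoms() < min_heavy_atoms:
            continue
        canonical = Chem.MolToSmiles(mol, isomericSmiles=False)   # drops stereochemistry
        if len(tokenize(canonical)) > max_tokens:                  # tokenizer from Section 2.2.4
            continue
        if canonical in drd2_actives:                              # exclude DRD2 actives
            continue
        kept.add(canonical)
    return sorted(kept)
```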

2.2.6 DRD2 activity model

A crucial objective of de novo molecular design is to generate molecules with a high probability of being active against a given biological target. In the current study, DRD2 was chosen as the target, and the same data set and activity model generated in our previous study [6] were used here. The data set was extracted from ExCAPE-DB [36] and contained 7218 actives (pIC50 > 5) and 343,204 inactives (pIC50 < 5). A support vector machine (SVM) classification model with a Gaussian kernel was built using Scikit-Learn [37] on the DRD2 training set using the extended connectivity fingerprint with a diameter of 6 (ECFP6). [38]
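A sketch of such an activity model is given below, computing ECFP6-like Morgan fingerprints (radius 3) with RDKit and fitting an RBF-kernel SVM with scikit-learn; smiles_list and labels are assumed to come from the ExCAPE-DB extract described above.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.svm import SVC

def ecfp6(smiles, n_bits=2048):
    """Morgan fingerprint with radius 3 (roughly ECFP6) as a numpy array."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def train_drd2_model(smiles_list, labels):
    """Fit an RBF-kernel SVM on ECFP6 fingerprints (labels: 1 = active, 0 = inactive)."""
    X = np.array([ecfp6(s) for s in smiles_list])
    y = np.array(labels)
    svm = SVC(kernel="rbf", probability=True)   # Gaussian kernel with probability estimates
    return svm.fit(X, y)

# Example usage, given a fitted model `svm`:
# p_active = svm.predict_proba([ecfp6("c1ccccc1CCN")])[0, 1]
```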

3 Results and Discussion

In this work, we address three questions. First, can compounds be mapped into a continuous latent space and subsequently reconstructed by an autoencoder NN? Second, does the latent space preserve the chemical similarity principle? Third, can the latent space encoding chemical structures be used to search for and optimize structures with respect to complex properties such as predicted biological activity? In the current study we trained and compared four different AE types: a variational autoencoder that does not use teacher forcing (named NoTeacher VAE), a variational autoencoder that utilizes teacher forcing (named Teacher VAE), and two adversarial autoencoders whose encoders were trained to follow either a Gaussian or a uniform distribution (named Gauss AAE and Uniform AAE, respectively).


Table 1: Reconstruction accuracy for the different AE models.

Model         | Average character reconstruction % in training set (training mode) | Average character reconstruction % in validation set (generation mode) | Valid SMILES % in validation set (generation mode)
NoTeacher VAE | 96.8 | 96.3 | 19.3
Teacher VAE   | 97.4 | 86.2 | 77.6
Gauss AAE     | 98.2 | 89.0 | 77.4
Uniform AAE   | 98.9 | 88.5 | 78.3

3.1 Structure generation

The encoder and decoder of each AE model were trained simultaneously on the training set by minimizing the character reconstruction error of the input SMILES sequences. Once the models were trained, the validation set was first mapped into the latent space via the encoder NN and the structures were then reconstructed through the decoder (i.e. generation mode). To evaluate the performance of the autoencoders, the reconstruction accuracy (percentage of position-to-position correct characters) and the percentage of valid SMILES strings (according to the RDKit SMILES definition) over the whole validation set were examined. The results are shown in Table 1. All methods yield good performance on character reconstruction of the SMILES sequence in training mode, with at least 95% of all characters correct in a reconstructed SMILES string, and all teacher forcing methods display a minor accuracy improvement of about 1-2% compared to the NoTeacher VAE. In generation mode, all teacher forcing based models show decreased accuracy. This is not surprising, since the information from a wrongly sampled character can propagate through the remaining sequence generation and influence the following steps. Both AAE methods achieved higher accuracy than the Teacher VAE, indicating that the decoders of these methods depend more on the information from the latent space than on the previously generated character.

Interestingly, the teacher forcing based models produce a much higher percentage of valid SMILES than the NoTeacher VAE model, even though the NoTeacher VAE has higher character reconstruction accuracy. This means that the reconstruction errors of the NoTeacher VAE model are more likely to result in invalid SMILES. Browsing these invalid SMILES reveals that the NoTeacher VAE model is often unable to generate matching pairs of the branching characters "(" and ")" or ring symbols such as "1", as exemplified in Table 2. Losing this ring-closure and branching information makes it impossible for the NoTeacher VAE model to adapt its remaining sequence generation to remedy the errors.

The AEs that include teacher forcing produce a significantly higher fraction of valid SMILES. They more often finish with a syntactically correct SMILES, as the information about the previously generated character influences the next generation steps. This allows the model to learn the syntax of SMILES and thus gives it a better chance of generating correct sequences. Their lower reconstruction accuracy in generation mode is due to the fact that the generator must continue generating a sequence conditioned on a previously incorrectly sampled character; the error thus propagates through the remaining sequence generation process and creates additional errors. Both adversarial models show slightly higher character reconstruction on the validation set than the Teacher VAE.

Table 2: Exemplary sequence reconstruction.

Model           | Generated sequence | Valid
Target sequence | Cc1ccc2c(c1)sc1c(=O)[nH]c3ccc(C(=O)NCCCN(C)C)cc3c12 | -
NoTeacher VAE   | Cc1ccc2cnc1)sc1c(=O)[nH]c3ccc(C(=O)NCCCN(C)C)c33c12 | No
NoTeacher VAE   | Cc1ccc2c(c1)sc1c(=O)[nH]c3ccc(C(=O)NCCN(C)C)cc3c12 | Yes
Gauss AAE       | Cc1ccc2c(c1)sc1c(=O)[nH]c3ccc(C(=O)NCCCN(C)C)cc3c12 | Yes
Uniform AAE     | Cc1ccc2c(c1)sc1c(=O)[nH]c3ccc(C(=O)NCCCN(C)C)cc3c12 | Yes


3.2 Does the latent space preserve the similarity principle?

To use a generative AE for de novo molecule design, the latent space must preserve the similarity principle, i.e. the more similar structures are, the closer they must be positioned in the latent space. When the similarity principle is preserved in the latent space, searching for new structures based on a query structure becomes feasible.

A known drug, Celecoxib, was chosen as an example for validating the similarity principle in the latent space. It was first mapped into the latent spaces generated by the various AE models, and chemical structures were then sampled from the latent vectors. Figure 6 shows the structures generated from the latent vector of Celecoxib for the different AE models. The AEs generated many analogues of Celecoxib that often differed by only a single atom. To investigate whether the similarity principle also applies over a larger area, we gradually increased the distance between random latent vectors and Celecoxib from 0 to 8 with a step size of 0.1. At each distance bin we sampled 10 random points and generated 500 sequences from each point, resulting in 5000 reconstruction attempts per distance bin. The median ECFP6 Tanimoto similarity between all valid structures and Celecoxib was calculated.
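A simplified sketch of this distance scan is shown below; encode and decode are hypothetical helpers (the latter could be the decode_latent_point routine sketched in Section 2.2.4), and only one structure per random point is evaluated here instead of all valid structures from 500 samples.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def random_point_at_distance(center, distance):
    """Draw a random latent point at a fixed Euclidean distance from a center vector."""
    direction = np.random.randn(center.shape[0])
    direction /= np.linalg.norm(direction)
    return center + distance * direction

def median_similarity_at_distance(query_smiles, encode, decode, distance, n_points=10):
    """Median ECFP6 Tanimoto similarity of structures decoded at a given distance."""
    query_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(query_smiles), radius=3, nBits=2048)
    center = encode(query_smiles)                       # hypothetical encoder call
    similarities = []
    for _ in range(n_points):
        smiles = decode(random_point_at_distance(center, distance))
        if smiles is None:                              # no valid SMILES at this point
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smiles), radius=3, nBits=2048)
        similarities.append(DataStructs.TanimotoSimilarity(query_fp, fp))
    return float(np.median(similarities)) if similarities else None
```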

The results are shown in Figure 7a and demonstrate that compounds with decreasing similarity to Celecoxib generally have larger distances to Celecoxib in all four latent spaces.

Figure 6: Sampled structures at the latent vector corresponding to Celecoxib. The structures are sorted by their relative generation frequencies in descending order from left to right.

This clearly indicates that the similarity principle is preserved in the area surrounding Celecoxib, making it possible to carry out similarity searches in the latent space. Figure 7b shows the relationship between the proportion of valid SMILES and the distance to Celecoxib in the latent spaces. The Uniform AAE has a higher fraction of valid SMILES than the other models. The NoTeacher VAE has a very low fraction of valid SMILES at distances larger than 2, which may explain its much steeper similarity curve in Figure 7a compared to the other models. Celecoxib and many of its close analogues are found in the ChEMBL training set used to derive the AE models. An interesting question is whether we can maintain the smooth latent space around Celecoxib when these structures are excluded from the training set.

Figure 7: (a) Chemical similarity (Tanimoto, ECFP6) of generated structures to Celecoxib in relation to the distance in the latent space. (b) Fraction of valid SMILES generated during the reconstruction.

Figure 8: Results without Celecoxib in the training set. (a) Chemical similarity (Tanimoto, ECFP6) of generated structures to Celecoxib in relation to the distance in the latent space. (b) Fraction of valid SMILES generated during the reconstruction.

To investigate this question, all analogues with a feature class fingerprint of diameter 4 (FCFP4) [38] Tanimoto similarity to Celecoxib larger than 0.5 (1788 molecules in total) were removed from the training set and new models were trained. The same computational process was repeated and the results are shown in Figure 8. Again Celecoxib and close analogues can be successfully reconstructed by all models at very short distances (less than 1) to Celecoxib in the latent space. For the NoTeacher VAE and Teacher VAE models, when the distance to Celecoxib is larger than 4, the fraction of valid SMILES is significantly lower than for the Gauss and Uniform AAE models. Notably, for the Uniform AAE model (shown in Figure 8b), more than 30% of the structures generated at a Euclidean distance of 4 are valid SMILES. At a distance of 6, the percentage of valid SMILES for the Teacher VAE drops to 5%, while the Uniform AAE still reconstructs 20% valid SMILES. This highlights that the Uniform AAE generates the smoothest latent chemical space representation. A high fraction of valid SMILES is useful for searching for structures in the latent space, since non-decodable latent points make the optimization very difficult.

3.3 Target-activity guided structure generation

Given the high fraction of valid SMILES and the smooth latent space of the Uniform AAE, even when Celecoxib is excluded from the training set, an interesting question is whether we can search for novel compounds that are predicted to be active against a specific biological target. This task can be understood as an inverse-QSAR problem, in which one attempts to identify new compounds predicted to be active by a QSAR model. We chose DRD2 as the target, using the same SVM model as above. The model is based on the ECFP6 fingerprint and the output of the SVM classifier is the probability of activity. The score for the Bayesian optimization is defined as follows:

$S(z) = \begin{cases} \text{average } P_{\text{active}} \text{ over all generated active compounds}, & \text{if any actives are generated} \\ \text{average } P_{\text{active}} \text{ over all generated inactive compounds}, & \text{otherwise} \end{cases}$

Here we classify each compound with Pactive > 0.5 as active. During the optimization the model tends to generate structures with large macrocycles with ring sizes larger than eight atoms. This is consistent with the findings reported by Gómez-Bombarelli et al. [12] Such large macrocycles generally have low synthetic feasibility. Thus, we include an explicit filter to remove all generated structures with a ring size larger than eight.
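A sketch of this scoring function is given below, reusing the hypothetical decoder.sample interface from the Section 2.2.4 sketch and the ecfp6 helper and fitted SVM from the Section 2.2.6 sketch; has_large_macrocycle implements the ring-size filter.

```python
import numpy as np
from rdkit import Chem

def has_large_macrocycle(mol, max_ring_size=8):
    """True if any ring in the molecule has more than max_ring_size atoms."""
    return any(len(ring) > max_ring_size for ring in mol.GetRingInfo().AtomRings())

def bo_score(latent_point, decoder, svm, attempts=500):
    """Score S(z): average P_active over generated actives, else over the inactives."""
    probabilities = []
    for _ in range(attempts):
        smiles = "".join(decoder.sample(latent_point)).strip()  # hypothetical sampling call
        mol = Chem.MolFromSmiles(smiles)
        if mol is None or has_large_macrocycle(mol):   # drop invalid SMILES and macrocycles
            continue
        probabilities.append(svm.predict_proba([ecfp6(smiles)])[0, 1])
    if not probabilities:
        return 0.0                                     # non-decodable latent point
    actives = [p for p in probabilities if p > 0.5]
    pool = actives if actives else probabilities       # fall back to the (inactive) samples
    return float(np.mean(pool))
```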

The BO method is used as the search engine to identify solutions with maximal scores. The BO search begins at random starting points and is repeated 80 times to collect multiple latent points with Pactive larger than 0.5. Similar results are obtained in each run. The result of a representative DRD2 search is shown in Figure 9. The BO algorithm consistently finds structures with high Pactive values.

Figure 9: Searching for DRD2 active compounds using the Uniform AAE. The first 100 iterations are randomly sampled points, while the next 500 iterations are determined by Bayesian optimization.

Low score values correspond to points at which no valid SMILES are generated or at which only structures with low activity scores are obtained.

In total, 369,591 compounds are sampled from BO solutions with an average Pactive larger than 0.95, and 11.5% of them have an ECFP6 Tanimoto similarity larger than 0.35 to the nearest validated active. Figure 10 shows some exemplary compounds with high Pactive values together with their nearest neighbours among the validated actives. The generated compounds are predicted to be highly active (Pactive > 0.99) and mostly share the same chemical scaffold with the validated actives. However, the Uniform AAE model does not fully reproduce known actives. The relationship between the probability of finding active compounds (Pactive > 0.5) and the score value at the latent point is shown in Figure 11.

Figure 11: The relationship between the fraction of generated active compounds at specific latent points and the BO score. The fraction of generated actives is the number of actives divided by all 500 reconstruction attempts. The set labelled Random corresponds to randomly selected latent points.

At latent points with higher scores, there is a higher probability of finding compounds predicted to be active by the SVM model. For example, sampling at random points yields only a 0.1% probability of finding active compounds, while sampling at points with a score between 0.9 and 1.0 yields a 19% probability. This indicates that the BO algorithm can efficiently search through the latent space generated by the Uniform AAE model to identify novel active compounds guided by a QSAR model.

4 Conclusion

In our current study, we introduced a generative adversarial autoencoder NN and applied it to inverse QSAR to generate novel chemical structures.

Figure 10: Generated structures from BO compared to the nearest neighbour from the set of validated actives. The validated actives were not present in the training set of the autoencoder. The Tanimoto similarity is calculated using the ECFP6 fingerprint.


To our knowledge, the adversarial autoencoder has previously been applied neither to structure generation nor to inverse QSAR. Unlike other inverse-QSAR methods that rely on back-mapping descriptors to chemical structures, our method utilizes a latent space based generative model to construct novel structures under the guidance of a QSAR model. Four different AE architectures were explored for structure generation. The results indicate that the sequence generation performance for novel compounds relies not only on the decoder architecture but also on the distribution of the latent vectors produced by the encoder. The fraction of valid structures is significantly improved for architectures using teacher forcing. The autoencoder model imposing a uniform distribution on the latent vectors (Uniform AAE) yields the largest fraction of valid structures. The AE-generated latent space preserves the similarity principle locally, as shown by investigating the similarity to a query structure such as Celecoxib. Furthermore, BO was applied to search for structures predicted to be active against DRD2 by a QSAR model. Our results show that novel structures predicted to be active were identified by the BO search, indicating that AEs are a useful approach for tackling inverse-QSAR problems.

Acknowledgements

The authors thank Thierry Kogej, Ernst Ahlberg-Helgee and Christian Tyrchan for useful discussions and general technical assistance.

Competing interests

The authors declare that they have no competing interests.

Funding

Thomas Blaschke has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 676434, "Big Data in Chemistry" ("BIGCHEM", http://bigchem.eu). Marcus Olivecrona, Hongming Chen, and Ola Engkvist are employed by AstraZeneca. Jürgen Bajorath is employed by the University of Bonn. The article reflects only the authors' view and neither the European Commission nor the Research Executive Agency (REA) is responsible for any use that may be made of the information it contains.

References

[1] Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships. Journal of Chemical Information and Modeling 55, 263-274 (2015).

[2] Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks. arXiv:1701.01329 [physics, stat] (2017). URL http://arxiv.org/abs/1701.01329.

[3] Yuan, W. et al. Chemical Space Mimicry for Drug Discovery. Journal of Chemical Information and Modeling 57, 875-882 (2017).

[4] Jaques, N. et al. Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control. arXiv:1611.02796 [cs] (2016). URL http://arxiv.org/abs/1611.02796.

[5] Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature Chemistry 4, 90-98 (2012).

[6] Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. Journal of Cheminformatics 9, 48 (2017).

[7] Churchwell, C. J. et al. The signature molecular descriptor: 3. Inverse-quantitative structure-activity relationship of ICAM-1 inhibitory peptides. Journal of Molecular Graphics and Modelling 22, 263-273 (2004).

[8] Wong, W. W. & Burkowski, F. J. A constructive approach for discovering new drug leads: Using a kernel methodology for the inverse-QSAR problem. Journal of Cheminformatics 1, 4 (2009).

[9] Miyao, T., Kaneko, H. & Funatsu, K. Inverse QSPR/QSAR Analysis for Chemical Structure Generation (from y to x). Journal of Chemical Information and Modeling 56, 286-299 (2016).

[10] Douali, L., Villemin, D. & Cherqaoui, D. Neural Networks: Accurate Nonlinear QSAR Model for HEPT Derivatives. Journal of Chemical Information and Computer Sciences 43, 1200-1207 (2003).

[11] Miyao, T., Funatsu, K. & Bajorath, J. Exploring differential evolution for inverse QSAR analysis. F1000Research 6, 1285 (2017).

[12] Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. arXiv:1610.02415 [physics] (2016). URL http://arxiv.org/abs/1610.02415.

[13] Kadurin, A., Nikolenko, S., Khrabrov, K., Aliper, A. & Zhavoronkov, A. druGAN: An Advanced Generative Adversarial Autoencoder Model for de Novo Generation of New Molecules with Desired Molecular Properties in Silico. Molecular Pharmaceutics (2017). URL http://dx.doi.org/10.1021/acs.molpharmaceut.7b00346.

[14] Goodfellow, I. J. et al. Generative Adversarial Networks. arXiv:1406.2661 [cs, stat] (2014). URL http://arxiv.org/abs/1406.2661.

[15] Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics 2, 433-459 (2010).

[16] Scholz, M., Fraunholz, M. & Selbig, J. Nonlinear Principal Component Analysis: Neural Network Models and Applications. SpringerLink 44-67 (2008).

[17] Duvenaud, D. et al. Convolutional Networks on Graphs for Learning Molecular Fingerprints. arXiv:1509.09292 [cs, stat] (2015). URL http://arxiv.org/abs/1509.09292.

[18] Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016). URL http://www.deeplearningbook.org.

[19] Hochreiter, S. & Schmidhuber, J. Long Short-Term Memory. Neural Computation 9, 1735-1780 (1997).

[20] Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555 [cs] (2014). URL http://arxiv.org/abs/1412.3555.

[21] Lei, T., Barzilay, R. & Jaakkola, T. Molding CNNs for text: non-linear, non-consecutive convolutions. arXiv:1508.04112 [cs] (2015). URL http://arxiv.org/abs/1508.04112.

[22] Simard, P. Y., Steinkraus, D. & Platt, J. C. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 2, ICDAR '03 (IEEE Computer Society, Washington, DC, USA, 2003). URL http://dl.acm.org/citation.cfm?id=938980.939477.

[23] Klambauer, G., Unterthiner, T., Mayr, A. & Hochreiter, S. Self-Normalizing Neural Networks. arXiv:1706.02515 [cs, stat] (2017). URL http://arxiv.org/abs/1706.02515.

[24] Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat] (2013). URL http://arxiv.org/abs/1312.6114.

[25] PyTorch (2017). URL http://pytorch.org/.

[26] Williams, R. J. & Zipser, D. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1, 270-280 (1989).

[27] Jaeger, H. A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Tech. Rep., German National Research Center for Information Technology, GMD Report 159 (2002). URL https://www.pdx.edu/sites/www.pdx.edu.sysc/files/Jaeger_TrainingRNNsTutorial.2005.pdf.

[28] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I. & Frey, B. Adversarial Autoencoders. arXiv:1511.05644 [cs] (2015). URL http://arxiv.org/abs/1511.05644.

[29] The GPyOpt authors. GPyOpt: A Bayesian Optimization framework in Python (2016). URL http://github.com/SheffieldML/GPyOpt.

[30] Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning) (The MIT Press, 2005).

[31] Jones, D. R., Schonlau, M. & Welch, W. J. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization 13, 455-492 (1998).

[32] Daylight Theory: SMILES (2017). URL http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html.

[33] RDKit (2017). URL http://www.rdkit.org/.

[34] Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research 40, D1100-D1107 (2012).

[35] Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] (2014). URL http://arxiv.org/abs/1412.6980.

[36] Sun, J. et al. ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics. Journal of Cheminformatics 9, 17 (2017).

[37] scikit-learn: machine learning in Python, scikit-learn 0.18.2 documentation (2017). URL http://scikit-learn.org/stable/index.html.

[38] Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling 50, 742-754 (2010).
