

Page 1: Leonard Sparring - diva-portal.org

Conquering Chemical Space

Optimization of Docking Libraries through Interconnected

Molecular Features

Leonard Sparring

Degree project in bioinformatics, 2020
Examensarbete i bioinformatik 45 hp till masterexamen, 2020
Biology Education Centre and Department of Cell and Molecular Biology, Uppsala University
Supervisors: Jens Carlsson and Andreas Luttens


Abstract

The development of new pharmaceuticals is a long and arduous process that typically requires more than 10 years from target identification to approved drug. This process often relies on high-throughput screening of molecular libraries. However, this is a costly and time-intensive approach, and the selection of molecules to screen is not obvious, especially in relation to the size of chemical space, which has been estimated to consist of 10^60 compounds. To accelerate this exploration, molecules can be obtained from virtual chemical libraries and tested in silico using molecular docking. Still, such methods are incapable of handling the increasingly colossal virtual libraries, which currently reach into the billions. As the libraries continue to expand, a pre-selection of compounds will be needed to allow accurate docking predictions.

This project aims to investigate whether the search for ligands in vast molecular libraries can be made more efficient with the aid of classifiers extended with the conformal prediction framework. This is also explored in conjunction with a fragment-based approach, where information from smaller molecules is used to predict larger, lead-like molecules.

The methods in this project are retrospectively tested on two clinically relevant G protein-coupled receptor targets, A2A and D2. Both of these targets are involved in devastating diseases, including Parkinson's disease and cancer.

The framework developed in this project has the capacity to reduce a chemical library of > 170 million compounds tenfold, while retaining 80 % of the molecules scoring among the top 1 % of the entire library. Furthermore, it is also capable of finding known ligands. This will allow the reduction of ultra-large chemical libraries to manageable sizes and will allow increased sampling of the selected molecules. Moreover, the framework can be used as a modular extension on top of almost any classifier. The fragment-based approaches that were tested in this project performed unreliably and will be explored further.


Conquering Chemical Space

Popular Science Summary

Leonard Sparring

Drug development is a long and expensive process. The typical first step is the identification of a receptor that is involved in a particular disease process. Thereafter, it usually takes more than 10 years before a drug against the receptor can be developed and approved. The early part of this process often relies on automated screening of molecular libraries. Here, the aim is to find a molecule that binds to the receptor – a ligand – which can then be optimized into a more drug-like compound. This is, however, a costly and time-consuming approach, and the choice of molecules to investigate is not obvious, especially in relation to the size of chemical space, which has been estimated to comprise 10^60 compounds. To accelerate this search, molecules can be tested with computer simulations using so-called molecular docking. In this method, molecules are fitted against 3D models of a receptor; if the molecule and the receptor fit well together, the molecule receives a good score. The molecules for these simulations are obtained from virtual chemical libraries that contain commercially available and pharmacologically relevant molecules. These molecules have not necessarily been synthesized before; instead, they are generated by combining molecular building blocks through a number of permitted chemical reactions.

These libraries are growing and have now reached several billion molecules, sizes that even molecular docking is unable to handle. Now, and especially as the libraries continue to expand, a pre-selection of compounds will be needed, both to reduce the time spent examining irrelevant molecules and to enable more accurate scoring in molecular docking.

This work aims to investigate whether the search for ligands in large molecular libraries can be made more efficient through classifiers extended with the conformal prediction framework. This framework allows control over how many well-docking molecules are retained in the reduced library. The project also investigates a fragment-based approach, in which information from smaller molecules is used to predict larger, more drug-like molecules. The methods in this project are tested retrospectively against two clinically relevant receptors, the adenosine receptor A2A and the dopamine receptor D2. Both of these receptors are interesting targets for devastating diseases, for example Parkinson's disease and cancer.

The method developed in this project has the capacity to reduce a chemical library of > 170 million molecules tenfold, while retaining 80 % of the molecules that rank in the top percentile of the entire library with respect to docking score. Moreover, the method can correctly predict known ligands for both of the tested receptors. This enables the reduction of ultra-large chemical libraries to manageable sizes and will allow increased sampling of the selected molecules. In addition, the method can be used as a modular extension on top of almost any classifier. The fragment-based approaches that were tested in this project were unstable for the two tested receptors, and other methods should be investigated further.

Degree project in bioinformatics, 2020

Examensarbete i bioinformatik 45 hp till masterexamen, 2020

Department of Cell and Molecular Biology, ICM

Supervisors: Jens Carlsson & Andreas Luttens


Contents

1 Introduction
  1.1 Background
    1.1.1 High Throughput Screening & Commercial Chemical Libraries
    1.1.2 Virtual Screening
    1.1.3 Machine Learning in Cheminformatics
    1.1.4 Conformal Prediction
    1.1.5 Adenosine A2A Receptor & Dopamine D2 Receptor
  1.2 Purpose

2 Methods
  2.1 AMCP Architecture & Classification
  2.2 Descriptive Features
  2.3 Retrospective Prediction of Known Ligands & Docked Compounds
  2.4 Lead-Likes Predicted from Fragments
  2.5 Adaptive Learning & Prediction
  2.6 Heavy Atom Count Dependence of Required Training Set

3 Results
  3.1 Retrospective Prediction of Known Ligands & Docked Compounds
  3.2 Lead-Likes Predicted from Fragments
  3.3 Adaptive Learning & Prediction
  3.4 Heavy Atom Count Dependence of Required Training Set

4 Discussion
  4.1 Retrospective Predictions
  4.2 Choice of Features
  4.3 Heavy Atom Count Dependence of Required Training Size
  4.4 Fragment-Based Conformal Prediction
  4.5 Prospective Use
  4.6 AMCP Framework
  4.7 Conclusions & Future Directions

5 Acknowledgements

References

A Appendix: Test Set Distributions

B Appendix: Adaptive Predictions

C Appendix: Predictions by Choice of Descriptive Features


Abbreviations

A2AR      Adenosine A2A Receptor
AMCP      Aggregated Mondrian Conformal Prediction
AUC       Area Under the Curve
CDDD      Continuous and Data-Driven Descriptors
cLogP     Calculated Logarithm of Partition coefficient
D2R       Dopamine D2 Receptor
DUD-E     Directory of Useful Decoys, Enhanced
ECFP4     Extended Connectivity Fingerprint, up to four bonds
FWOK      Fragment Wait-OK
GPCR      G Protein-Coupled Receptor
HTS       High-Throughput Screening
IID       Independently Identically Distributed
InChI     International Chemical Identifier
LSD       Large Scale Docking screen
LLWOK     Lead-Like Wait-OK
NMR       Nuclear Magnetic Resonance
ROC       Receiver Operating Characteristic
SMILES    Simplified Molecular Input Line Entry Specification


1 Introduction

1.1 Background

1.1.1 High Throughput Screening & Commercial Chemical Libraries

Modern drug discovery typically starts with the identification and validation of a biological target that is involved in a particular disease. This initiates the search for active molecules that are capable of binding to the target to modulate its activity. When information about the target is sparse, high throughput screening (HTS) is a common approach in this endeavour: a large molecular library is screened against the target with a suitable assay. The goal is to efficiently find compounds that elicit a response in the assay, so-called hits. These are subsequently refined into analogous lead compounds and, if they show promise in other assays, optimized into more selective drug-like molecules for pre-clinical and clinical testing. This optimization often results in an increase in the lipophilicity and molecular weight of the hit. Typically, libraries of 100,000 to 1,000,000 compounds are screened in HTS campaigns (Brown & Boström 2018), with hit-rates ranging from 0.01 % to 0.14 % (Zhu et al. 2013). A major challenge in HTS is the design of the molecular library, especially when considering the size of chemical space, which has been estimated to consist of 10^60 molecules (Kirkpatrick & Ellis 2004) (see figure 1). The capacity of HTS also appears minuscule in relation to the number of commercially available compounds. These include molecules that have not necessarily been synthesized before but are synthetically tractable. They are catalogued in virtual chemical libraries that attempt to capture all compounds that are available on demand. These libraries are generally generated with a set of building blocks, chemical reactions and limitations on the end-products (Walters 2019). The generated molecules are often confined to Lipinski's rule of five (Lipinski et al. 2001). That is, their molecular weight is restricted to < 500 Da, their calculated octanol-water partition coefficient (cLogP) to < 5, their number of hydrogen-bond donors to at most 5 and their number of hydrogen-bond acceptors to at most 10.
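As a sketch of how such a filter operates, the rule of five can be applied to precomputed descriptors in a few lines of Python, using Lipinski's published cutoffs (molecular weight < 500 Da, cLogP < 5, at most 5 hydrogen-bond donors and 10 acceptors). The molecules and descriptor values below are invented for illustration; in practice a cheminformatics toolkit such as RDKit would calculate them from the actual structures:

```python
# Sketch of a Lipinski rule-of-five filter over precomputed descriptors.
# The descriptor values are hypothetical placeholders, not real molecules.

def passes_rule_of_five(mol):
    """True if a molecule's descriptors satisfy Lipinski's rule of five."""
    return (mol["mol_weight"] < 500       # molecular weight < 500 Da
            and mol["clogp"] < 5          # calculated octanol-water coefficient < 5
            and mol["h_donors"] <= 5      # at most 5 hydrogen-bond donors
            and mol["h_acceptors"] <= 10) # at most 10 hydrogen-bond acceptors

library = [
    {"name": "frag-1", "mol_weight": 152.1, "clogp": 1.2, "h_donors": 1, "h_acceptors": 2},
    {"name": "lead-1", "mol_weight": 341.4, "clogp": 3.7, "h_donors": 2, "h_acceptors": 4},
    {"name": "big-1",  "mol_weight": 612.8, "clogp": 5.9, "h_donors": 3, "h_acceptors": 7},
]

lead_like = [m["name"] for m in library if passes_rule_of_five(m)]
print(lead_like)  # ['frag-1', 'lead-1']
```

In real library generation the filter is applied to the enumerated products of the building-block reactions, so compounds like the oversized "big-1" above never enter the virtual catalogue.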
These libraries are not static in size and have grown dramatically over recent years. One example is the ZINC database (Sterling & Irwin 2015), which aims to catalogue all biologically relevant molecules for sale in 2D and 3D representations. It has grown from ∼120 million compounds in 2015 to more than 900 million compounds in 2020. Likewise, the pharmaceutical vendor Enamine provides four billion molecules in its searchable REAL space (Hoffmann & Gastreich 2019). These virtual chemical libraries and spaces represent a huge opportunity, and challenge, in the field of drug discovery.

[Figure: three nested regions labeled Screened Space, Commercial Space and Chemical Space.]

Figure 1: Chemical space. Chemical space is enormous, and the number of experimentally screened compounds represents only a fraction of commercially available compounds.

One approach to adapting to these sizes is fragment-based drug discovery (see figure 2). Initially, a small library with compounds of ∼150 Da is screened. The number of non-hydrogen atoms of these molecules, also referred to as the heavy atom count (HAC), is typically ≤ 20. Thus, combinatorial expansion of chemical libraries will not affect these molecules to the same extent as it has, and will continue to do, for the larger lead-like and drug-like compounds. Furthermore, after a fragment's binding mode has been determined with X-ray crystallography or nuclear magnetic resonance (NMR), it constitutes a good starting point for chemical modification. Fragments can be used as scaffolds to rationally combine with other chemical groups to promote interactions with other parts of the binding pocket (Murray & Rees 2009).


[Figure: schematic chemical structures illustrating approaches A and B.]

Figure 2: Fragment-based drug discovery. HTS (A) typically aims to find large ligands that interact with most parts of the binding pocket by testing millions of compounds. In fragment-based drug discovery (B), a few fragments are screened to find low-affinity ligands that allow more thorough optimization.

Hann et al. (2001) illuminated the strength of fragment-based drug discovery with a simple model of molecular recognition. They showed that as the complexity of molecules increases, the probability of a molecule making unique complementary interactions with a binding pocket decreases. The opposite trend with complexity holds for measuring that interaction in assays: as the complexity of molecules increases, they can make more interactions with the binding pocket, resulting in increased affinity. Thus, the probability of measuring binding is higher for more complex molecules that do have complementary interactions with the binding pocket. These two probabilities are independent, and the authors multiplied them to define the probability of a useful event (see equation 1). Maximizing this probability produces an optimal complexity for the molecules to be screened (see figure 3).

p(useful event) = p(measuring binding of ligand) · p(molecule matches pocket in a unique way)

Equation 1: Probability of a useful event.
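The argument can be made concrete with a toy calculation. The two curves below are invented monotone functions of a normalized complexity scale (Hann et al. do not prescribe functional forms); multiplying them, as in equation 1, produces a peaked curve with an interior optimum:

```python
import math

# Hypothetical monotone curves over a normalized ligand-complexity scale 0..1:
# the chance of a unique complementary match falls with complexity, while the
# chance of measuring binding (given a match) rises with it.
def p_unique_match(c):
    return math.exp(-3.0 * c)

def p_measure_binding(c):
    return 1.0 - math.exp(-3.0 * c)

def p_useful_event(c):
    # Equation 1: the two probabilities are treated as independent.
    return p_measure_binding(c) * p_unique_match(c)

complexities = [i / 100 for i in range(101)]
best = max(complexities, key=p_useful_event)
print(f"optimal complexity ≈ {best:.2f}")  # prints: optimal complexity ≈ 0.23
```

Neither very simple nor very complex molecules maximize the product; with these particular curves the optimum lands at an intermediate complexity, which is the qualitative point of figure 3.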

It should be noted that complexity is an abstract concept encompassing many molecular features that arguably depend on more factors than size, but HAC and molecular weight can serve as proxies. The authors' model was related to experimental methods in which compounds of 100-250 Da were tested, suggesting this range for experimental fragment-based screens. The study illustrates that a higher fraction of fragments will be ligands to a receptor, but that they will be less potent than ligands from lead-like space and thus less likely to elicit a response in a given assay.


[Figure: schematic curves plotted against ligand complexity: the probability of measuring binding rises, the probability of binding in a unique way falls, and their product, the probability of a useful event, peaks at intermediate complexity.]

Figure 3: The success landscape. Schematic adaptation of the molecular complexity model (Hann et al. 2001).

1.1.2 Virtual Screening

In the age of high-performance computing clusters, virtual screening has become an important toolbox for accessing large virtual libraries. This group of techniques uses information about already established ligands (ligand-based) or the target's structure (structure-based) to guide compound selection for experimental testing. Ligand-based methods are useful when many compounds are already known to interact with a given target but information about the target itself is limited. The premise is that molecules similar to known binders are likely to be binders as well. However, ligand-based methods are generally unable to detect compounds that would interact with the target in a previously unknown manner (Geppert et al. 2010). In contrast, structure-based methods are not as dependent on previously known ligands. These methods are used when the structure of the target has been experimentally solved, through X-ray crystallography, NMR or cryogenic electron microscopy. A protein target's structure can also be accurately modeled if a homologous protein with high amino acid sequence similarity has already been solved. With a 3D model of the target, its interaction with unseen molecules can be simulated by fitting the molecules' 3D representations to the binding pocket and by estimating binding affinities with scoring functions. Physics-based scoring functions attempt to approximate affinities, where a more negative term indicates higher affinity (see equation 2). In this way, structure-based methods are capable of finding novel scaffolds. Due to the nearly exponential increase in experimentally solved protein structures during the 1990s and 2000s (Berman 2012), these methods have gained importance in recent years.

E_docking = E_electrostatics + E_vanderWaals + E_desolvation

Equation 2: Scoring function. Physics-based scoring function for the molecular docking software DOCK3.7.

Molecular docking is a structure-based technique that can emulate HTS in silico. The aim is to filter through a large set of compounds in the computer and to rank them by their docking scores. Only a few compounds among the top-ranking molecules are then selected for experimental testing. Using molecular docking, researchers can thus access a large proportion of the virtual chemical libraries. The success of docking relies on the accuracy of the structural model, the accuracy of the 3D structures of the molecules to be docked, and the scoring function. Their concordance with experimental results is generally estimated by how well they replicate experimentally solved poses and by how well the scoring function enriches active molecules over highly similar inactive molecules, so-called decoys (Mysinger et al. 2012). Such decoys are often obtained from the directory of useful decoys, enhanced (DUD-E) database. The molecular docking software DOCK3.7 (Coleman et al. 2013) is especially geared towards screening large chemical libraries. Building on code first implemented in Fortran in the 1970s, DOCK3.7 is one of the fastest docking programs (Fan et al. 2019). It achieves this speed by assuming a rigid protein structure, by bounding the sampling coordinates and by performing many calculations prior to the docking screen.


For instance, the 3D coordinates of a molecule’s structural conformations are generated beforehand. They can also beobtained from the ZINC database. The protein environment is also represented as a grid with precalculated potentials.

Lyu et al. (2019) investigated the ZINC library for lead-like (molecular weight between 250 and 350 Da) molecules available on demand in two large-scale docking screens (LSD). Docking was performed with 99 million molecules against the enzyme AmpC β-lactamase. After filtering out compounds that were similar to known inhibitors and clustering for similarity, 51 diverse molecules ranking in the top ∼1 % on docking scores were selected for experimental testing. Of the 44 successfully synthesized compounds, 5 were shown to inhibit β-lactamase, a hit-rate of 11 %. Additionally, after probing the analogues, the most potent non-covalent inhibitor of β-lactamase was identified. For the G protein-coupled receptor (GPCR) dopamine receptor D4, 138 million molecules were docked. After similar filtering for established ligands and selection of top-ranking diverse compounds, 589 molecules were selected for experimental testing. 81 new chemotypes were discovered, 31 of them showing sub-micromolar activity.

In another LSD, Stein et al. (2020) docked more than 150 million molecules against the sleep/wake-cycle-regulating GPCR MT1. Its orthosteric binding site is highly similar to that of MT2, and few selective ligands are known. The top-scoring 300,000 molecules were clustered for similarity, and compounds with chemotypes similar to known ligands were removed. After manual selection from the top 10,000 clusters, 38 molecules were experimentally tested. Of these, 15 had activity at MT1 or MT2, a hit rate of 39 %. Two selective MT1 inverse agonists were also found. This study also showed that docking cluster representatives rather than the complete library produced worse docking scores.

In the largest molecular docking screen to date, Gorgulla et al. (2020) docked 330 million molecules from ZINC and more than one billion compounds from Enamine REAL space against KEAP1's NRF2 interaction interface. Three million high-ranking molecules were chosen from a first-stage rigid-target docking screen and rescored in a second-stage side-chain-flexible docking screen. 492 compounds from the top 0.03 % of the second-stage docking and 98 from the top 0.0001 % of the first-stage docking were chosen for experimental validation. 69 of the selected compounds were shown to bind KEAP1 with a fluorescence-based method, a hit rate of 12 %. The authors also showed that, keeping the number of selected compounds constant, the true positive rate (TPR) scales with the size of the docked library. This is not obvious, as larger libraries generally mean that actives are further outnumbered by inactives. This further illustrates the power of docking.

However, for LSDs to be computationally feasible, several sacrifices have to be made. They are computationally demanding, requiring approximately 40,000 CPU-hours to dock the 300 million molecules of the current ZINC lead-like library, according to in-house experience. Additionally, the molecules are sampled only to a limited extent. This problem will become even more accentuated as libraries continue to increase in size. To overcome this challenge, fragment-based approaches have also been tried in silico. For example, fragments from ZINC can be docked and substructure searches can then be performed in the lead-like domain.

1.1.3 Machine Learning in Cheminformatics

Apart from virtual screening methods, other computational approaches have been employed in the search for novel receptor ligands and in cheminformatics in general. Machine learning has successfully been used for predictions of drug metabolism, drug-drug interactions and carcinogenesis. The idea of supervised machine learning is that the relationship between a set of simple molecular descriptors and a numerical value (for regression) or a label (for classification) can be learned by a model. The model's parameters can then be transferred to molecules with other value-combinations of the descriptors, after which their state can be predicted. The recent advent of deep learning has also made an impact in cheminformatics. Broadly, these models generate hidden descriptive features of an object that maximize the model's capacity to make accurate predictions about future objects. This is done by leveraging a voting system of 'neurons' arranged in layers (Lo et al. 2018).

Class imbalance is an important consideration in classification applied to cheminformatics, as the goal is often to predict minority classes from large libraries. For example, finding a selective inhibitor is much like finding a needle in a haystack. This is highly pertinent to virtual screening: the growth of molecular libraries will bring more active molecules, but it will also exacerbate the imbalance between active and non-active compounds.

A common evaluation criterion for classifiers is their accuracy, the fraction of correct predictions. While this may be informative for balanced datasets, it is not useful when the data is highly imbalanced. To illustrate this, imagine a classifier that is presented with the task of discerning positive samples from negative samples. A poor classifier could merely predict all samples as negative, regardless of their properties. Such a classifier actually has an accuracy of 99 % if negative samples outnumber positive samples 99 to 1, despite it being wrong in 100 % of its predictions on positive samples. For these scenarios, precision and recall are valuable class-wise metrics obtained from the confusion matrix (see figure 4). In this example, precision captures the proportion of positives in the set of samples that are predicted as positive. Recall, also known as the true positive rate (TPR), captures the proportion of all positive samples that are predicted as positive. Maximizing both of these metrics is often desired, but they are generally inversely correlated: when the threshold for classifying a sample as positive is strict, a classifier's precision is favoured, while when the threshold is more lenient, recall is favoured. Which of the two is more important depends on the question asked. Both are captured in the F1-score, their harmonic mean. The harmonic mean's strength over its common arithmetic counterpart is that it is more sensitive to extremes: if either the recall or the precision is low, the F1-score will be low as well.

Lastly, the false positive rate (FPR) captures how many of the negative samples are predicted as positives. The TPR and FPR are plotted against each other in the receiver operating characteristic (ROC) space to evaluate a classifier's performance. This plot captures how well a classifier enriches positives over negatives as the threshold for classifying a sample as positive is made increasingly lenient. The area under this curve (AUC) is often used as a single measure of model performance, but it should be interpreted cautiously when there is high class imbalance.

Recall = TPR = TP / (TP + FN)

Precision = TP / (TP + FP)

F1-score = 2 · (Precision · Recall) / (Precision + Recall)

FPR = FP / (FP + TN)

Equation 3: Common machine learning metrics based on the confusion matrix.
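These metrics, and the accuracy pitfall described above, can be reproduced with a short sketch; the confusion-matrix counts are invented for illustration:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall (TPR), F1-score and FPR from counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return accuracy, precision, recall, f1, fpr

# Degenerate classifier on a 99:1 imbalanced set of 1,000 samples:
# predicting everything as negative gives 99 % accuracy but zero recall.
print(metrics(tp=0, fp=0, fn=10, tn=990))  # (0.99, 0.0, 0.0, 0.0, 0.0)

# A classifier that actually retrieves positives: slightly lower accuracy,
# but recall 0.8, precision ~0.29 and F1 ~0.42.
print(metrics(tp=8, fp=20, fn=2, tn=970))
```

The second classifier has a worse accuracy (0.978) than the degenerate one, yet it is clearly the more useful model; this is exactly why the class-wise metrics matter for imbalanced screening data.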

[Figure: a 2x2 confusion matrix (Predicted versus Actual; TP, FP, FN, TN) alongside a ROC plot of TPR against FPR, indicating no-skill, good and bad performance.]

Figure 4: Confusion matrix and ROC plot. The confusion matrix captures the number of true negatives, true positives, false negatives and false positives after classification. It is the basis for most metrics of supervised classification. The ROC plot is used to assess enrichment of true positives over false positives.


1.1.4 Conformal Prediction

Conformal prediction (Vovk et al. 2005) is a framework that can be applied to almost any classifier or regressor. It is designed to output predictions with guaranteed error rates on a per-sample basis. That is, if the accepted error rate is ε, the conformal predictor should give at most a fraction ε of errors. It achieves this by predicting sets of labels. In the case of binary classification, where samples are labeled either '1' or '0', the possible set predictions are {1}, {0}, {1,0} and { }. If the sample's actual label is included in the set, the prediction is considered valid. The error rate is guaranteed in the sense that if the sample's actual label is '1', the predicted set will contain '1' with probability 1 − ε. A useful metric of conformal predictions is the efficiency, the proportion of single-label predictions. There is an inherent reciprocal relationship between the expected error rate ε and the efficiency: at low values of ε, the framework will assign both labels to the majority of the samples, while at high values of ε, it will assign no label to the majority of the samples (see figure 5).
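The set-prediction rule itself is simple once per-label p-values are available: a label enters the prediction set exactly when its p-value exceeds ε. A sketch with invented p-values, mirroring the behaviour described above:

```python
def prediction_set(p_values, epsilon):
    """Conformal set prediction: keep every label whose p-value > epsilon."""
    return {label for label, p in p_values.items() if p > epsilon}

# Hypothetical per-label p-values for one sample.
sample = {"1": 0.52, "0": 0.23}

print(prediction_set(sample, 0.20))  # both labels conform -> {'0', '1'}
print(prediction_set(sample, 0.30))  # only '1' conforms   -> {'1'}
print(prediction_set(sample, 0.60))  # neither conforms    -> set()
```

Sweeping ε from 0 to 1 in this way reproduces the transition shown in figure 5: from mostly {0,1} predictions, through single-label predictions, to empty sets.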

[Figure: a classifier extended with a conformal predictor at ε = 0.2. Four example predictions are shown, e.g. p('1') = 0.52 and p('0') = 0.23, both above ε, giving the valid set {0,1}, and p('1') = 0.05 and p('0') = 0.13, both below ε, giving the erroneous empty set { }. A second panel shows the proportions of the {0} & {1}, {0,1} and { } sets, and the error rate, as functions of ε.]

Figure 5: Conformal prediction. For binary classification, if samples are assigned their actual label, or more labels, the prediction is valid. Samples that are not assigned any label (predicted to the { } set) are always erroneous. As the value of ε increases from 0 to 1, the proportion of the {0,1} set decreases in favor of single-label predictions (higher efficiency). Beyond a certain point, no labels are assigned to the samples; all of them are instead predicted to the { } set.

An important aspect of conformal prediction is the distribution of the training set and the set that is predicted. If both sets are obtained through the same generative process, the sets are said to be independently and identically distributed (IID). Conformal prediction operates under the exchangeability assumption, which is a slightly weaker assumption than IID in that it does not require independence. The typical difference is seen when drawing random samples: drawing with replacement, so that already drawn samples can be drawn again, generates IID datasets, whereas sampling without replacement generates exchangeable sequences. Furthermore, similar distributions for the training and test sets are also important for the underlying classifier's predictive performance.

The first described variation of the framework is the on-line, transductive approach (Shafer & Vovk 2008). This relies on updating the model after each sample has been predicted, thus requiring retraining of the classifier for every new prediction. To achieve computational efficiency, the inductive, 'batch setting' approach was proposed. This method is outlined for the case of binary classification below.

Firstly, the training set is split into a proper training set and a calibration set. The proper training set is used to fit an underlying classifier. The calibration set is used to generate a list of nonconformity scores. To do this, conformal prediction relies on the ability of the classifier to provide a value α, a measure of nonconformity, that aims to capture how dissimilar a given sample is from those that have already been seen. The typical function for generating α is 1 minus the classifier's probability for the given label (see equation 4). However, the conformal predictor's guarantees on the error rate are independent of this function: although the predictions would be uninformative, even a random number generator could be used. The nonconformity function is applied to each sample of the calibration set and the resulting scores are sorted in a list. At the stage of prediction, each label y_j is examined for each sample. The nonconformity score of the sample and label combination is calculated, and the rank of this score among the calibration set's nonconformity scores is reported as the conformal prediction p-value. If the p-value is greater than ε, the sample conforms to the label and the label is added to the sample's prediction set. If the p-value is less than or equal to ε, the label is rejected from the prediction set.

When observing predictions on a sample-per-sample basis, a sample is predicted to belong to the class with the highest p-value. In accordance with conformal terminology, this p-value is the prediction's credibility. The conformal prediction confidence is 1 minus the second highest p-value. Another useful metric is the quality of information, defined by Shen-Shyang Ho and Harry Wechsler (2008) as the largest minus the second largest p-value.
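These per-sample metrics can be sketched in a few lines of Python (an illustrative helper, not part of the thesis implementation; the function name is hypothetical):

```python
def conformal_metrics(p_values):
    """Summarize a sample's conformal p-values: the predicted class is the one
    with the highest p-value (the credibility); confidence is 1 minus the
    second highest p-value; quality of information is their difference."""
    ranked = sorted(p_values.values(), reverse=True)
    return {
        "predicted": max(p_values, key=p_values.get),
        "credibility": ranked[0],
        "confidence": 1 - ranked[1],
        "quality_of_information": ranked[0] - ranked[1],
    }
```

For example, a sample with p-values {0: 0.11, 1: 0.44} is predicted as class 1 with credibility 0.44 and confidence 0.89.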

A caveat of the framework is that the conformal prediction error rate is not guaranteed for each class individually, which can cause unreliable predictions, especially for imbalanced datasets. The mondrian conformal predictor provides a simple solution to ensure class-wise validity. The nonconformity function is unmodified, but the calibration set's nonconformity scores are separated by class and the p-value for each label is deduced from the nonconformity score's rank in each list separately (see equation 5).

α_{i,j} := f(z_i) = 1 − P_h(y_j | x_i)

Equation 4: Typical nonconformity function for the calibration set and for samples to be predicted. For each sample z_i with the label y_j and features x_i, the nonconformity score α is 1 minus the probability for label y_j predicted by model h. For samples to be predicted, the nonconformity scores of all labels are calculated. For samples of the calibration set, only the nonconformity score corresponding to the true label is calculated.

p_{i,y_j} = |{α_{i,j} : α_{i,j} ≥ α_{j,n_j+1}}| / (n_j + 1)

Equation 5: The mondrian conformal prediction p-value. For each sample, the p-value of each label y_j is calculated by dividing the number of nonconformity scores in the current label's calibration set (of size n_j, here extended with the current sample's own score, denoted α_{j,n_j+1}) that are greater than or equal to the current sample's and label's nonconformity score, by the number of samples in the label's calibration set plus one.
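As a concrete sketch of equation 5 and the prediction-set rule (illustrative code with hypothetical function names, not the thesis implementation):

```python
def mondrian_p_value(cal_scores, test_score):
    """Mondrian conformal p-value: the rank of the test sample's nonconformity
    score among the calibration scores of the same class. The +1 in the
    numerator counts the test sample itself."""
    n_j = len(cal_scores)
    greater_equal = sum(1 for a in cal_scores if a >= test_score)
    return (greater_equal + 1) / (n_j + 1)

def prediction_set(p_values, epsilon):
    """Labels whose p-value exceeds the significance epsilon are kept."""
    return {label for label, p in p_values.items() if p > epsilon}
```

With eight calibration scores and a test score of 0.5, three calibration scores of [0.03, 0.11, 0.23, 0.31, 0.49, 0.63, 0.75, 0.81] are at least as large, giving a p-value of 4/9.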

Furthermore, a problem with the inductive approach is that the calibration set is not utilized for training of the classifier. Depending on the size of the training set, this may reduce the predictive performance of the classifier. To counter this issue, several conformal predictors can be aggregated (Carlsson et al. 2014). Also known as cross-conformal predictors, aggregated conformal predictors can be implemented in different ways. Commonly, the training set is split into a proper training set and a calibration set for each classifier to be trained, with the training set randomly shuffled between classifiers. At the subsequent stage of prediction, each sample is predicted by each classifier and the median of the models' conformal prediction p-values is selected. This increases the likelihood that each sample of the training set will influence the predictions. Aggregated conformal predictors achieve higher efficiency, but the error rate is no longer automatically guaranteed. The error rates are instead conditional on the stability of the underlying classifier. If each aggregated model produces similar p-values, the error rate approximates ε; otherwise, error rates are generally less than ε for ε ≤ 0.4 (Linusson et al. 2017).
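The median aggregation step can be sketched as follows (illustrative helper, not the project's actual code):

```python
from statistics import median

def aggregate(per_model_p_values):
    """Aggregated conformal prediction: for each label, take the median of the
    p-values produced by the individually trained models."""
    labels = per_model_p_values[0].keys()
    return {y: median(m[y] for m in per_model_p_values) for y in labels}
```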

Conformal prediction has previously been used in cheminformatics for toxicological modeling and molecular docking. Svensson et al. (2017) used iterative docking and conformal prediction to improve HTS efficiency. Firstly, docking was performed with the DUD-E datasets against their respective targets. The top 1 % scoring compounds were chosen to be screened experimentally. The confirmed actives and non-actives were labeled and used for training of a conformal predictor, after which the remaining samples in the dataset were predicted. Conformal prediction achieved an average hit rate of 7.6 %, compared to a hit rate of 5.2 % achieved through docking alone.

Ahmed et al. (2018) leveraged the conformal prediction framework in an iterative manner to predict molecular docking scores in order to limit the total number of molecules to dock. Firstly, molecular signatures – features that capture molecules' atomistic environments – were generated for a molecular library of ∼2 million compounds. An initial training set was randomly extracted and docked against four receptors. High scoring compounds were assigned the label '1' and low scoring compounds the label '0', using a docking score histogram approach. Compounds between the 'high scoring' and the 'low scoring' bins were not labeled and hence not used for training. The labeled molecules were subsequently used to train a conformal predictor with an underlying support vector machine for classification. The model was used to predict the rest of the library, after which several iterative prediction-docking steps were performed. Firstly, compounds predicted to the {0} set were disregarded for subsequent docking steps. A smaller set of compounds was then randomly selected from the rest of the library to expand the training set. After this, the model was re-trained and the rest of the library predicted again. This was repeated until predictions reached a tolerable level of efficiency. With this method, the authors managed to reduce the number of docked molecules by 62.61 %, while maintaining an average accuracy of 94 % for the 30 best-scoring binders.

1.1.5 Adenosine A2A Receptor & Dopamine D2 Receptor

G protein-coupled receptors (GPCRs) are a diverse group of membrane receptors that respond to a vast variety of stimuli, including ions, odors, hormones and even light. Their activation triggers intracellular signaling cascades that mediate physiological responses throughout the entire mammalian body. GPCRs also represent the most pharmaceutically targeted group of receptors, with > 30 % of all small-molecule drugs having a GPCR as their primary target. Due to their association with the plasma membrane, early structure determination of GPCRs proved difficult. Nevertheless, breakthroughs in structural biology have produced a plethora of high resolution structures that are ideal for structure-based drug discovery (Hauser et al. 2017).

In this project, molecular docking scores against two GPCR targets are investigated: the adenosine A2A receptor (A2AR) and the dopamine D2 receptor (D2R) (see binding modes in figure 6).

A2AR has physiological roles in the regulation of myocardial oxygen demand, immune cell suppression and control of glutamatergic and dopaminergic release in the brain. Non-selective A2AR antagonists, most notably aminophylline, were developed in the 1990s and are used for the treatment of bradyasystolic arrest (Burton et al. 1997). Recent research focuses on the discovery of selective antagonists for the treatment of neurodegenerative disease (Franco & Navarro 2018) and cancer immunotherapy (Leone et al. 2015). A2AR is one of the best structurally characterized protein targets. Ligand binding is generally coordinated by hydrophobic interactions, and selectivity is mainly mediated by a polar interaction with asparagine 253 (PDB accession 4EIY). The distinguishing feature of agonists is hydrogen bonds made deep in the binding pocket, in adenosine's case through its ribose moiety (Carpenter & Lebon 2017).

D2R has predominant roles in neurophysiology and neurobiological disorders. It is highly expressed in the striatum, where it regulates locomotion (Mishra et al. 2018). It is also involved in cognitive flexibility, learning and memory (Cameron et al. 2018). Many first-generation antipsychotic drugs, such as haloperidol and chlorpromazine, target D2R as non-selective antagonists, while selective agonists, such as bromocriptine, are used for the treatment of Parkinson's disease. Additionally, A2AR and D2R have been shown to interact in vivo by dimerization, causing a reduction in D2R activity (Kamiya et al. 2003). The key interaction for ligand binding to D2R is the formation of a salt bridge between aspartic acid 114 (PDB accession 6CM4) and a protonatable amine of the ligand. Similarly to A2AR, several hydrophobic interactions contribute to binding for D2R ligands as well, but the pocket is more constrained (Wang et al. 2018).


[Figure not reproduced: binding-site views with the key residues Asn253 (A2AR) and Asp114 (D2R) and the ribose binding pocket (RBP) marked.]

Figure 6: Binding modes of the antagonist ZMA (PDB accession 4EIY) to A2AR (A) and the antagonist risperidone (PDB accession 6CM4) to D2R (B). The ribose binding pocket (RBP) for A2AR and both targets' key interacting residues are marked.

1.2 Purpose

Finding novel selective ligands to A2AR and D2R could help alleviate suffering for patients with devastating diseases, such as neurodegenerative disorders and cancer. The expansion of molecular libraries provides an enormous resource in this endeavour. However, current methods of exploring such libraries require massive amounts of computational resources, which necessitates a method that scales more gracefully with library size.

To solve this problem, this project aims to investigate the fusion of molecular docking and machine learning models within the conformal prediction framework. The goal is to design a combined machine learning-docking method that has the capacity to find ligands in chemical datasets of billions of compounds. To accomplish this, the final framework must reduce the time spent docking irrelevant compounds while retaining the docking-only approach's sensitivity in finding ligands. Furthermore, this project investigates a fragment-based approach, something that has not yet been tried within the conformal prediction framework. If successful, the framework designed in this project may expedite the early computational drug-discovery process.

Aim 1: Reduce large virtual libraries to manageable sizes for molecular docking, with a controlled loss of found ligands.

Aim 2: Exploit knowledge acquired from fragments to find ligands among lead-like molecules.

2 Methods

2.1 AMCP Architecture & Classification

To allow for parallel building and prediction, to control memory usage, and to support multi-class predictions, a new implementation of an aggregated mondrian conformal prediction (AMCP) framework was written in Python as a collaborative effort. The implementation is briefly described here.

The main AMCP script has two modes, build and predict (see figure 7). Both modes expect an input file with whitespace-delimited values, where each line represents a sample to be used for building or prediction. The first field is the ID of the sample, the second is the label of the sample and the subsequent fields are descriptive features. In the build mode, the entire input file is read into memory. For each model built, the samples are shuffled and split into a proper training set and a calibration set. A classifier is fit to the proper training set and the calibration set is used to generate conformity scores, rather than nonconformity scores. That is, P_h is used directly, instead of 1 − P_h in equation 4. Models are built sequentially and saved in a compressed format, together with the calibration set's conformity scores, in a model directory.

Prediction is performed by loading all models and the calibration sets' conformity scores into memory and instantiating an array of fixed size. Samples are loaded into the array and prediction is performed in batches. All models are used to obtain a conformity score for each sample. The conformity score is compared to each model's respective calibration set's conformity scores. The conformal prediction p-values are then obtained by equation 5, after which their median is written to the output file.

Parallel building and prediction is performed with job arrays. Models can be built separately with the same input file and afterwards aggregated into the same model directory. Parallel prediction can be performed by splitting the input files, predicting and concatenating. For all tests in this project, the random forest classifier implemented in scikit-learn is used, which allows further parallelization of each job across all designated cores.
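A minimal sketch of one member model's build and predict steps, assuming scikit-learn and NumPy; the function names and the 30 % calibration split are illustrative assumptions, not necessarily those of the actual implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build(X, y, calib_fraction=0.3, seed=0):
    """One AMCP member: shuffle, split into proper training and calibration
    sets, fit a random forest and store class-wise (mondrian) conformity
    scores, i.e. P_h of the true label used directly."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_cal = int(calib_fraction * len(X))
    cal, train = order[:n_cal], order[n_cal:]
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X[train], y[train])
    proba = clf.predict_proba(X[cal])
    cal_scores = {c: np.sort(proba[y[cal] == c, i])
                  for i, c in enumerate(clf.classes_)}
    return clf, cal_scores

def predict(clf, cal_scores, X_new):
    """P-values from the rank of each sample's conformity score within the
    matching class's calibration list (equation 5, with conformity scores,
    so low conformity gives a low p-value)."""
    proba = clf.predict_proba(X_new)
    return [{c: (np.sum(cal_scores[c] <= row[i]) + 1) / (len(cal_scores[c]) + 1)
             for i, c in enumerate(clf.classes_)}
            for row in proba]
```

In the full framework, twenty such member models would be built on reshuffled splits and their p-values combined by the median.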

[Figure not reproduced: two panels, 'Building' and 'Predicting'. In the building panel, labeled samples are randomly split into a proper training set and a calibration set, a classifier is fit and class-wise calibration conformity scores are stored, repeated for 20 aggregated models. In the predicting panel, p-values are derived from the conformity scores' positions in the calibration set lists and prediction sets are formed at ε = 0.2.]

Figure 7: Building and predicting with the AMCP framework.

The random forest classifier (Ho 1995, Breiman 2001) was chosen due to its capacity for handling imbalanced datasets. To understand this classifier, its underlying units, decision trees, must first be explained. A decision tree is trained by constructing a binary tree graph, where each internal node represents a true-or-false question about the training samples' properties. For each node, the feature and threshold value that best split the classes are chosen. Nodes are added recursively, top-down, until only one class is represented in each node. Thus, every leaf node is associated with one of the classes. When predicting new samples, they are guided through the tree by their features, and the leaf in which a sample ends up determines its predicted class (see figure 8).

[Figure not reproduced: a small decision tree with internal nodes asking questions such as 'MW > 423 Da?', 'cLogP < 4.8?', 'Hydrogen bond donors > 2?' and '# rings > 2?', and leaves labeled 'Active' or 'Non-active'.]

Figure 8: A simple decision tree. This is the minimal unit of a random forest classifier.

While decision trees are highly interpretable, they are generally inaccurate for predictions of unseen samples. To counter this weakness, random forests are constructed with stochastic components. These classifiers rely on building many decision trees from the original training set by sampling with replacement, so-called bootstrapping. The classifier also selects a random subset of features for building each tree. All of the trees are subsequently used to vote on classes, where the majority vote decides the final classification. For each sample, class-wise probabilities are also assigned, calculated by dividing the number of votes for each class by the number of decision trees used. These are the probabilities that are used as conformity scores for the conformal predictor.
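The vote-based probability can be illustrated with scikit-learn on toy data (hypothetical feature values; with fully grown trees whose leaves are pure, the probability reported by the forest equals the fraction of trees voting for each class):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: two well-separated classes (illustrative values only).
X = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]] * 5)
y = np.array([0, 0, 0, 1, 1, 1] * 5)

rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# Fraction of trees voting for class 1, computed from the hard per-tree votes.
vote_fraction = np.mean([tree.predict(X) for tree in rf.estimators_], axis=0)
```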

2.2 Descriptive Features

Molecules can be represented virtually in several ways. The simplified molecular input line entry specification (SMILES) format represents molecules as interpretable strings. Similarly, the international chemical identifier (InChI) represents molecules in string format, but in a unique manner. These are the predominant 2D representations found in virtual libraries, and from them, more descriptive features can be generated.

In this work, three sets of descriptive features are used. The first is a set of 97 physico-chemical properties, including, for example, the molecular weight, the number of atoms and the cLogP of the molecule. Secondly, extended connectivity fingerprints with diameter 4 (ECFP4s) are used (see figure 9). These represent the presence of particular substructures and are generated algorithmically through the Morgan algorithm (Morgan 1965). Firstly, each non-hydrogen atom is labeled with an integer that represents its element, number of heavy neighbours, number of bonded hydrogens, charge, isotope and whether it is within a ring. These values are hashed into a single 32-bit integer. Next, the atomic identifiers are iteratively updated by collecting all of the neighbouring atoms' bond orders and identifiers into a single array. The bond orders are 1 for single, 2 for double, 3 for triple and 4 for aromatic bonds, and the identifiers are the neighbouring atoms' previously assigned integer values. Before the next iteration, this array is again hashed to a 32-bit integer, which is used to update the identifier of the atom. After a fixed number of iterations, any duplicate identifiers are removed and the integers are represented as a bit vector. In this work, 1024 bits were used as binary features. The diameter of the fingerprint is twice the number of iterations that have been performed (see figure 9).
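The iterative identifier-update idea can be sketched with a toy, pure-Python stand-in (this is not RDKit's actual ECFP implementation; the SHA-1-based 32-bit hash, the input format and the initial atom invariants are illustrative assumptions):

```python
import hashlib

def _hash32(values):
    """Deterministic 32-bit hash of a sequence of integers, standing in for
    the hashing step of the Morgan algorithm."""
    data = ",".join(map(str, values)).encode()
    return int.from_bytes(hashlib.sha1(data).digest()[:4], "big")

def ecfp_like(atoms, bonds, radius=2, n_bits=1024):
    """Toy ECFP-style fingerprint. `atoms` holds initial integer invariants;
    `bonds` maps atom index -> list of (bond_order, neighbour_index)."""
    ids = [_hash32((a,)) for a in atoms]
    seen = set(ids)
    for _ in range(radius):  # fingerprint diameter = 2 * radius
        new_ids = []
        for i in range(len(atoms)):
            env = sorted((order, ids[j]) for order, j in bonds.get(i, []))
            flat = [v for pair in env for v in pair]
            new_ids.append(_hash32([ids[i]] + flat))
        ids = new_ids
        seen.update(ids)
    bits = [0] * n_bits
    for ident in seen:  # fold the 32-bit identifiers into the bit vector
        bits[ident % n_bits] = 1
    return bits
```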


[Figure not reproduced: a molecule and its corresponding bit vector, {...010001000100010...}.]

Figure 9: ECFP4s. Each bit represents a substructure of the molecule up to diameter 4.

Thirdly, continuous and data-driven molecular descriptors (CDDD) (Winter et al. 2019) are used. These features are generated with a pre-trained autoencoder, which leverages deep learning to maximize its performance in translating SMILES to InChIs (see figure 10). The idea is that this task requires the model to learn the properties of the molecule that best describe it. The translation process uses an encoder, forcing the SMILES string to be represented as a 512-element vector of floating point numbers, and a decoder, translating the vector to InChIs. Here, the encoder of the authors' pre-trained model is used to generate the descriptors (see figure 10).

[Figure not reproduced: a SMILES string passing through an encoder to the latent CDDD vector, which a decoder translates to InChI.]

Figure 10: Autoencoder. Latent features generated from an autoencoder.

2.3 Retrospective Prediction of Known Ligands & Docked Compounds

Prior to this work, DOCK3.7 was used to perform LSDs against A2AR and D2R. The docking grids were optimized and validated with satisfactory enrichment of actives over decoys. 3D models of molecules with wait-ok status – that is, commercially available – were obtained from ZINC. Molecules with molecular weights < 250 Da were considered fragments (FWOK) and those between 250 Da and 350 Da were considered lead-like (LLWOK). Both sets were docked with DOCK3.7, using the same settings, and all docking scores were saved. To validate the predictive power of the AMCP framework, the top scoring compounds to A2AR and D2R were predicted retrospectively.

The training set was chosen as a random subset of the combined ZINC FWOK and LLWOK sets, at 18 to 26 heavy atoms. Compounds with docking scores in the top 1 % of the combined FWOK-LLWOK sets were labeled '1' and the rest were labeled '0'. The docking score cutoff was -33.45 for A2AR and -51.53 for D2R. The underlying classifier was a random forest with 100 trees. 20 aggregated models were used, with a training size of 3,000,000 molecules.
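The labeling step can be sketched as follows (illustrative helper; in practice the cutoffs of -33.45 and -51.53 correspond to the 1 % quantile of the combined score distribution):

```python
import numpy as np

def label_top_fraction(scores, fraction=0.01):
    """Label '1' for compounds in the best-scoring (most negative) fraction
    of docking scores, '0' for the rest."""
    scores = np.asarray(scores, dtype=float)
    cutoff = np.quantile(scores, fraction)
    return (scores <= cutoff).astype(int)
```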

Validation was also performed by prediction of known ligands to the two targets. Ligand SMILES were obtained from ChEMBL (ID: CHEMBL251 for A2AR and CHEMBL217 for D2R) with the requirement that the Ki values were below 10 µM. To ensure that none of the predicted ligands were in the FWOK and LLWOK sets, the compounds were compared based on their SMILES after neutralization and removal of salts and stereoisomers. Ligands whose SMILES overlapped with the training set were removed from the ligand set.

For evaluation of the retrospective predictions, the proportion of 'actives' predicted in the {1} set was determined at ε = 0.2. The ROC AUC was calculated both for the known binders (considering all compounds of the FWOK-LLWOK library as inactives) and for the top 1 % scoring compounds (considering all compounds not in the top 1 % as inactives). This was done using three ranking schemes, based on the conformal prediction confidence, 1 − p(0); the credibility, p(1); and the quality of information, p(1) − p(0), respectively.

2.4 Lead-Likes Predicted from Fragments

The aim of this method was to assess whether models trained on fragments could accurately predict lead-like compounds. Approximately 20 % of the ZINC FWOK (MW less than 250 Da) library was chosen as the training set for building AMCP models, a training size of 2,929,118 samples. A ∼10 % subset of the LLWOK set, 30 million samples, was used as the test set. As a control, AMCP models were also built with an equally sized subset of the LLWOK compounds and used for prediction of the same LLWOK test set. For the same reason, the rest of the FWOK set was also predicted with the FWOK-built AMCP models. For each AMCP model built, molecules with docking scores in the top ∼1 % of fragments (a docking score cutoff of -31) were labeled '1' and the rest of the compounds were labeled '0'.

The same labeling strategy was applied to the test sets: molecules with docking scores below -31 were labeled '1' and the rest were labeled '0'. Due to the large file size of the LLWOK library, the CDDD training and test sets were selected after quasi-shuffling. That is, lines, each representing a sample, were extracted from the file in blocks, after which the blocks were shuffled individually and then concatenated back together. This does not ensure exchangeability between training and test sets. For ECFP4 and physico-chemical features, the same training and test sets were used; these were obtained through random sampling from the entire LLWOK library. For each model built, 20 aggregated models were used, each with 100 trees for the random forest classifier.
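The quasi-shuffling procedure can be sketched as follows (an illustrative, in-memory version of what would be done on file blocks):

```python
import random

def quasi_shuffle(lines, block_size, seed=0):
    """Shuffle a large file's lines block-by-block: each fixed-size block is
    shuffled internally and the blocks are concatenated back together. This
    bounds memory use but does not guarantee exchangeability between the
    resulting training and test subsets."""
    rng = random.Random(seed)
    out = []
    for start in range(0, len(lines), block_size):
        block = lines[start:start + block_size]
        rng.shuffle(block)
        out.extend(block)
    return out
```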

2.5 Adaptive Learning & Prediction

Based on the previous method's results (see 3.2), the conformal prediction error rate was considered controlled when predicting compounds approximately three to four heavy atoms larger than the training set, provided ECFP4s were used as features. Hence, several different approaches to adaptive learning and prediction were attempted, all with the same general idea: use a large proportion of the fragments as a training set to predict slightly larger compounds, let these predictions guide which of the larger compounds to 'dock', and use the 'docked' compounds as the training set for predictions of molecules of an even larger size (see figure 11).


[Figure not reproduced: molecules of increasing heavy atom count (HAC), with alternating build and predict steps moving from smaller to larger compounds.]

Figure 11: Adaptive learning. Guiding idea for adaptive learning and prediction.

The docked and scored ZINC FWOK and LLWOK compounds were neutralized, stripped of salts and subsequently distributed into bins of 14-17, 18-21, 22-24 and 25-26 heavy atoms. A random subset was extracted from each bin to be used for a non-adaptive control scheme (see description below). The remaining compounds were used as test sets. If a compound of the test set had a docking score in the top 1 % of the entire combined FWOK and LLWOK libraries, the compound was considered 'active'; otherwise, it was considered 'non-active'. For A2AR, this docking score cutoff was -33.45 and for D2R, it was -51.53. For all methods other than the multi-class methods, which are described explicitly below, training set labeling was performed by labeling 'actives' as '1' and 'non-actives' as '0'. Note that the top 1 % scoring compounds are not uniformly distributed across the HAC bins. This means that when 'actives' were labeled as '1', each step of prediction had a slightly different ratio of 'actives' to 'non-actives' in the training sets. The rationale for using this strategy, as opposed to choosing the top 1 % scoring compounds for each HAC bin individually, was that it would allow more accurate comparisons between the different approaches. In prospective use cases, labeling would instead be based on the initial training set. Each time a prediction set was extracted, it was done at a significance of ε = 0.2.

The first approach was the random-subset method, which was used as a non-adaptive negative control for the subsequent methods. Training was performed with the aforementioned extracted sets. For each HAC bin, 2,000,000 samples were used for training. The models were used to predict the test sets with the same HAC bin as was used for training, which ensured exchangeable distributions. Ten aggregated models were trained for each AMCP model.

Secondly, the {1} set-based method represented the most naïve conformal prediction method. The 14-17 HAC AMCP models were used to predict the 18-21 HAC bin and the compounds predicted in the {1} set were extracted. In prospective use, these compounds are the ones that would be indicated for docking, but in this retrospective scenario, they had already been scored. A 2,000,000 random subset of these 'docked' molecules was then used for training new AMCP models. These models were used for prediction of the 22-24 HAC bin. As previously, compounds predicted in the {1} set were chosen to be 'docked' and a 2,000,000 random subset of these was used to build new AMCP models for prediction of the 25-26 HAC bin. Lastly, the 25-26 HAC compounds predicted in the {1} set were selected to be 'docked'. For each training step, ten aggregated models were used.

The third approach was the confidence-sort method, which was designed to ignore the framework's indications of how many molecules to dock, and instead select a fixed number of compounds that the model is most confident in. The 14-17 HAC bin models were used to predict the 18-21 HAC bin. After this, the predicted samples were sorted according to their confidence (1 − p(0)) and the top 2,000,000 compounds were selected to be 'docked'. The 'docked' compounds were subsequently used for training new AMCP models, for prediction of the 22-24 HAC bin. As in the previous step, the 2,000,000 compounds that were most confidently predicted as 'actives' were 'docked', forming the training set for new AMCP models for the last prediction step. Lastly, the 25-26 HAC bin was predicted and the 2,000,000 compounds that the framework was most confident in were 'docked'. For each training step, twenty aggregated models were used.
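The selection step of the confidence-sort method can be sketched as follows (an illustrative helper; the input record format is assumed):

```python
def confidence_sort(predictions, n_select):
    """Confidence-sort selection: rank samples by the confidence that they
    are 'active' (1 - p(0)) and pick a fixed number for docking, ignoring
    the prediction sets themselves."""
    ranked = sorted(predictions, key=lambda s: 1 - s["p0"], reverse=True)
    return [s["id"] for s in ranked[:n_select]]
```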

Two multi-class approaches with differing labeling strategies were also attempted. In both cases, compounds with docking scores among the bottom 10 % of the combined FWOK and LLWOK sets were labeled '0'. For A2AR, this docking score cutoff was -4.82 and for D2R, it was -17.90. In the first multi-class approach, compounds with docking scores in the top 1 % were labeled '2' (docking score cutoff at -33.45 for A2AR and -51.53 for D2R) and the middle scoring compounds were labeled '1'. In the second multi-class approach, compounds scoring in the top 10 % of the entire FWOK and LLWOK libraries were labeled '2' (docking score cutoff at -27.35 for A2AR and -39.75 for D2R) and the middle 80 % scoring compounds were labeled '1'. The same labeling strategy, with the same docking score cutoffs, was used at each HAC bin step.
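The three-class labeling can be sketched as follows (an illustrative helper; quantiles stand in for the fixed cutoffs listed above):

```python
import numpy as np

def label_three_class(scores, top=0.01, bottom=0.10):
    """'2' for the best-scoring (most negative) top fraction, '0' for the
    worst-scoring bottom fraction, '1' for everything in between."""
    scores = np.asarray(scores, dtype=float)
    top_cutoff = np.quantile(scores, top)
    bottom_cutoff = np.quantile(scores, 1.0 - bottom)
    labels = np.ones(len(scores), dtype=int)
    labels[scores <= top_cutoff] = 2
    labels[scores >= bottom_cutoff] = 0
    return labels
```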

The practical procedure was the same for the two multi-class approaches. Firstly, AMCP models were built on 2,000,000 molecules from the 14-17 HAC bin and used to predict the 18-21 HAC bin. 2,000,000 compounds from the 18-21 HAC bin were then randomly selected from the samples in all prediction sets except the {0} set. That is, the selected compounds included the samples predicted in the {1} set and the {2} set, but also in the { } set, the {1,2} set, the {0,1} set, the {0,2} set and the {0,1,2} set. The rationale for choosing these compounds was that they would be less biased towards 'actives' and therefore present a more accurate representation of the 'active' to 'non-active' ratio for training of the next model. By not training on compounds predicted in the {0} set, the sampling of non-interesting compounds was still reduced to a limited extent.

After the AMCP models had been trained, they were used in two ways. Firstly, the same HAC bin was predicted again and the {2} set was extracted for 'docking'; the compounds predicted in the {2} set are those that the model is most confident will score well. Secondly, these models were used to predict the 22-24 HAC bin. As before, new AMCP models were then built on compounds not predicted in the {0} set. These models were used to predict the 22-24 HAC bin again, and the compounds predicted in the {2} set were extracted for 'docking'. The same idea was then repeated in the last step for the 25-26 HAC bin. While these multi-class approaches may achieve more exchangeable distributions between training and test sets, they do require sampling many compounds that are not expected to score well.

For comparison of these methods, the recall and precision of top 1 % scoring compounds were assessed in the extracted {1} sets for the first three methods: the random-subset method, the {1} set-based method and the confidence-sort method. For the multi-class approaches, the recall and precision of top 1 % scoring compounds were assessed in the extracted {2} sets.
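Recall and precision over an extracted prediction set can be computed as follows (an illustrative helper):

```python
def recall_precision(selected_ids, active_ids):
    """Recall = fraction of true top scorers recovered in the extracted set;
    precision = fraction of the extracted set that are true top scorers."""
    selected, actives = set(selected_ids), set(active_ids)
    true_positives = len(selected & actives)
    return true_positives / len(actives), true_positives / len(selected)
```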

2.6 Heavy Atom Count Dependence of Required Training Set

The purpose of these tests was to assess whether the training set size required for predictive performance increases with the HAC of the compounds. Similarly to section 2.5, the compounds of the FWOK and LLWOK libraries were binned by HAC. AMCP models were trained at 12 different training set sizes for each HAC bin: 1,000, 2,000, 4,000, 8,000, 16,000, 32,000, 64,000, 128,000, 160,000, 320,000, 1,000,000 and 2,000,000 samples. Both the training data and separate test sets were predicted with these models. Three replicates were used for each training size. After prediction, recall was assessed at ε = 0.2.


3 Results

3.1 Retrospective Prediction of Known Ligands & Docked Compounds

The application of the AMCP method for retrospective prediction of A2AR LSD scores allowed for finding 86.8 % of the top 1 % scoring compounds while only evaluating 9.62 % of the entire 18-26 HAC test set. Of the compounds predicted in the {1} set, 9.58 % were part of the top 1 % scoring compounds for A2AR. Similar predictive power was observed for D2R: retrospective prediction of the D2R test set allowed for finding 84.4 % of the top 1 % scoring compounds while only evaluating 9.72 % of the entire 18-26 HAC test set. Of the compounds predicted in the {1} set, 8.45 % were top 1 % scoring compounds for D2R. The error rates for both the active and the non-active class were lower than the chosen significance of ε = 0.2 for both A2AR and D2R (see table 1).

Table 1: Retrospective prediction of top 1 % scoring compounds in FWOK-LLWOK LSDs with ε = 0.2.

                                        A2AR           D2R
Test set size                           179,805,087    219,039,320
Number of top 1 % scoring compounds     1,908,170      2,132,578
{1} set size                            17,293,923     21,296,055
{1} set proportion                      9.62 %         9.72 %
Top 1 % predicted in {1} set            1,657,007      1,799,949
Recall in {1}                           86.8 %         84.4 %
Efficiency                              98.0 %         99.4 %
Precision in {1}                        9.58 %         8.45 %
Error rate class '1'                    13.1 %         14.6 %
Error rate class '0'                    10.8 %         9.07 %
ROC AUC, p(1)                           95.7 %         94.7 %
ROC AUC, 1 − p(0)                       95.7 %         94.7 %
ROC AUC, p(1) − p(0)                    95.7 %         94.7 %

Calibration plots and predictive efficiencies for A2AR are shown in figure 12. These show that the error rates were ≤ ε as ε increased linearly from 0 to 1 in 50 steps. The behaviour of the error rate differed between the classes. Predictions of the majority class (non-actives) were consistently over-conservative, especially so at higher values of ε. The minority class (actives) followed the expected error rate more closely, but had over-conservative error rates at lower values of ε. The proportion of single-label predictions was maximized at ε = 0.18, at 99.97 %. At ε = 0.2, where the {1} set was extracted, it was 97.99 %.


Figure 12: Class-wise error rates (A) and efficiency (B) plotted against significance for retrospective top 1 % predictions of A2AR. For (A), the dotted line indicates the expected error rates. For (B), the dotted line shows the ε that was used for extracting the {1} set.

Calibration plots for retrospective predictions of D2R showed a highly similar pattern to A2AR (see figure 12). Error rates were ≤ ε as ε increased linearly from 0 to 1 in 50 steps. Predictions of the majority class (non-actives) were over-conservative, and the minority class (actives) was mostly only over-conservative at low values of ε.

The scores of the non-active compounds predicted in the {1} set were also investigated. These compounds, although not reaching the docking score cutoff value, were generally among the better scoring (more negative) compounds of the test set. This was true for both A2AR and D2R (see figure 13).

Figure 13: Docking score distributions for the {1} set compared to the test set for A2AR (A) and D2R (B) at ε = 0.2. The vertical lines indicate the thresholds for the top 1 % scoring compounds. For A2AR, the right side of the plot is cut due to a long tail.

Retrospective prediction of known ligands to A2AR showed 100 % efficiency. Each of the ligands was given a single label and none of them were predicted as both ({0,1}) or null ({ }). The recall in the {1} set, 84.2 %, also reflected the expected recall of 80 % at ε = 0.2. Retrospective prediction of known ligands to D2R showed an efficiency of 99.0 % and an error rate that was marginally higher than expected, at 23.3 % (see table 2).


Table 2: Retrospective prediction of known ligands for A2AR and D2R with ε = 0.2.

                                A2AR      D2R
Number of ligands              8,683    9,477
Ligands predicted in {1} set   7,314    7,167
Recall in {1} set             84.2 %   75.6 %
Error rate                    15.8 %   23.3 %
Efficiency                     100 %   99.0 %
ROC AUC, p(1)                 93.4 %   90.2 %
ROC AUC, 1 − p(0)             92.8 %   91.1 %
ROC AUC, p(1) − p(0)          92.8 %   90.2 %

Lastly, the AMCP models’ ability to enrich ’actives’ in the predicted sets was also assessed. This was done after ranking the samples on the p-values. The AMCP model enriched top 1 % scoring compounds over non-top 1 % scoring compounds in the 18-26 HAC test set when ranking predictions on credibility (p(1)), confidence (1 − p(0)) and quality of information (p(1) − p(0)). For these predictions, there was no discernible difference in performance between the ranking schemes with regard to the AUC metric.
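The three ranking schemes follow directly from the class-wise p-values. The sketch below is illustrative only: the p-values are invented and the framework's actual data structures are not assumed.

```python
# Hypothetical class-wise conformal p-values for five compounds:
# p1 = p-value for the 'active' class, p0 = p-value for the 'non-active' class.
p1 = [0.90, 0.40, 0.75, 0.10, 0.55]
p0 = [0.05, 0.10, 0.60, 0.95, 0.30]

credibility = p1                           # similarity to 'actives'
confidence = [1.0 - p for p in p0]         # dissimilarity to 'non-actives'
quality = [a - b for a, b in zip(p1, p0)]  # quality of information


def rank(scores):
    """Indices of compounds, from most to least promising."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


print(rank(credibility))  # -> [0, 2, 4, 1, 3]
print(rank(confidence))   # -> [0, 1, 4, 2, 3]
```

The two schemes can give different orderings, mirroring the target-dependent behaviour observed for A2AR and D2R.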

For enrichment of known ligands, detection of any compound of the combined FWOK-LLWOK set was considered a false positive, including their top 1 % scoring compounds. The ranking schemes performed differently for the two targets: for A2AR, credibility showed the best enrichment, while for D2R, confidence showed the best enrichment. ROC plots of known binders and top 1 % scoring compounds are shown in figure 14.

Figure 14: ROC curves (true positive rate against false positive rate) for top 1 % scoring compounds and known ligands for A2AR (A) and D2R (B), based on credibility (p(1)).

3.2 Lead-Likes Predicted from Fragments

After the AMCP framework had been validated for retrospective predictions, it was tested whether models trained in the fragment domain could be used to predict molecules of the lead-like domain. Considering the quasi-shuffle approach that was used to extract CDDD training and test sets from the LLWOK library, precision and recall values for CDDD-trained models should be compared with the other features’ models cautiously. With this in mind, it is apparent that models built on each set of features perform well when predicting in the same domain as they were trained in. For building and prediction in fragment space, the error rates are consistently < ε. The models are also efficient, with each model labeling more than 85 % of the samples with a single label. For models built with CDDDs, the error rate exceeds ε, likely a result of the aforementioned quasi-shuffling.

When predicting lead-like molecules with models built on fragments, it is evident that the physico-chemical and CDDD models successfully label almost all ’actives’ correctly, with recalls of 99.9 % and 98.8 % respectively. However, they also predict a large proportion of ’non-actives’ in the {1} set, as shown by their low precisions of 3.27 % and 4.98 %. In contrast, the ECFP4 model has the worst recall, 70.9 %, when predicting lead-likes from fragments. This holds both when comparing it to the other features and when comparing it to ECFP4 predictions performed in exchangeable domains. Nonetheless, the drop is not substantial, and the precision, 15.8 %, is in the same range as when ECFP4s are used for building and predicting in the same domain (17.0 % and 20.5 %) (see table 3).

Table 3: F1-scores for predictions made with models built with physico-chemical properties, CDDDs or ECFP4s as features. From the top row: fragments predicted with fragment-trained models, lead-like molecules predicted with lead-like-trained models, and lead-likes predicted with fragment-trained models. The {1} set was extracted with the expected error rate of ε = 0.2.

                          Physico-chemical     CDDD    ECFP4
F1-score Fragm → Fragm              25.4 %   23.1 %   28.4 %
F1-score Lead → Lead                29.2 %   29.4 %   32.9 %
F1-score Fragm → Lead               6.23 %   9.48 %   25.8 %

Figure 15 shows that, as the number of heavy atoms grows, the fragment-based ECFP4 model responds by placing more samples in the null set ({ }). By contrast, fragment-based physico-chemical and CDDD models respond to increased HAC by labeling ’non-actives’ as ’actives’. Overall, this indicates that the fragment-based conformal prediction transferred to the lead-like domain best when using ECFP4s as training features.

Figure 15: Comparison of predictions performed in lead-like space for A2AR with different features and training sets (number of molecules per HAC, split into True {0}, False {0}, False {1}, True {1}, {0,1} and { } predictions). LLWOK predicted from LLWOK with physico-chemical descriptors (A), with ECFP4s (B), and with CDDDs (C); LLWOK predicted from FWOK with physico-chemical descriptors (D), with ECFP4s (E), and with CDDDs (F).

3.3 Adaptive Learning & Prediction

The random subset method achieved near 100 % efficiencies for each HAC bin, both for A2AR and D2R (see figures S4 (B) and S5 (B)). The recall of ’actives’ was consistently above 80 %, which is expected at ε = 0.2 with high efficiencies and when training and test sets are exchangeable. This constitutes the control for the subsequent methods (see table 4 for an overview and tables S3 and S4 for a full comparison of the methods for A2AR and D2R, respectively).


The {1} set-based method had a drop in recall and an increase in precision for each HAC-bin step for both targets. This was likely a result of progressively biasing the training set towards ’actives’ at each step, producing increasingly smaller prediction sets.

By contrast, the confidence-sort method showed an increase in recall at each step for both A2AR and D2R. The precision was inconsistent between the targets. For A2AR, the precision increased when predicting the 22-24 HAC bin from models built on compounds of the 18-21 HAC bin, but decreased when predicting the 25-26 HAC bin with models of the 22-24 HAC bin. For D2R, the precision consistently increased. This discrepancy between targets may be due to there being a higher percentage of top 1 % compounds in higher ranges of HAC for A2AR, while D2R ’actives’ are more evenly distributed across HAC.

As seen in the calibration plots (figures S4 and S5 (E) and (F)), the initial prediction based on a random subset from the 14-17 HAC bin produced similar error rates as when predicting from a random subset. The distribution of the training set was distorted when top-ranking compounds were used for prediction.

The first multi-class approach aimed at maintaining a high degree of exchangeability between training and test sets. The introduction of a second class produced lower efficiencies, especially for predictions of the 22-24 HAC and 25-26 HAC bins, for both targets (see figures S4 and S5 (G) and (H)). Furthermore, the error rates marginally exceeded the expected rate at ε = 0.2. The number of actives found per HAC-bin step was poorly controlled, with recalls increasing from 43.3 % to 79.4 % for A2AR and from 58.0 % to 85.7 % for D2R. The increase in recall at each step was a symptom of an increased {2} set size at each step, as seen by decreased precisions for A2AR. A notable drop in precision was seen for the 25-26 HAC bin.

The second multi-class approach was investigated in order to mitigate the lower efficiencies of top 1 % molecules. By making the docking score threshold for labeling a molecule as ’active’ more lenient, the top 1 % would be further from the middle class. Indeed, more of the compounds with docking scores in the top 1 % were found, as indicated by recall values above 90 % in all three steps for A2AR, and above 83.9 % in all steps for D2R. However, this caused the selection of a higher proportion of ’non-active’ compounds into the {2} set, leading to poor precisions of 5.85 % for A2AR and 6.31 % for D2R.

Table 4: Comparison of different approaches of adaptive prediction for A2AR and D2R.

                       Rand-subset   {1} set   Conf-sort   Multi-class 1   Multi-class 2
A2AR Total recall           85.0 %    64.5 %      60.0 %          48.6 %          93.3 %
A2AR Total precision        9.51 %    12.3 %      17.3 %          20.9 %          5.85 %
A2AR Total F1-score         17.1 %    20.7 %      26.5 %          29.3 %          11.0 %
D2R Total recall            85.1 %    80.3 %      41.5 %         60.03 %          87.4 %
D2R Total precision         7.88 %    6.32 %      14.5 %         14.23 %          6.31 %
D2R Total F1-score         14.42 %    11.7 %      21.5 %          23.0 %          11.8 %

Overall, the random-subset method had the most consistent behaviour. The first multi-class method did achieve the highest F1-scores for both targets. However, this method performed differently for A2AR and D2R, and in prospective predictions its behaviour cannot be guaranteed.

3.4 Heavy Atom Count Dependence of Required Training Set

For both A2AR and D2R, a plateau of diminishing returns for recall was seen after a training size of approximately 320,000 compounds (see figure 16). For A2AR, the plateau was reached earlier for the larger compounds, consisting of 22-24 and 25-26 heavy atoms. For D2R, the opposite pattern was observed: the plateau was reached earlier for compounds of lower HAC, coinciding with higher recall values overall.


Figure 16: Recall for top 1 % scoring compounds in the {1} set as a function of training size for A2AR (A) and D2R (B), per HAC bin (14-17, 18-21, 22-24 and 25-26). Calculated after extracting the {1} set at ε = 0.2.

The precision metric’s dependence on HAC differed between A2AR and D2R. Firstly, an initial plateau was seen at a training size of 320,000 samples for both targets. For D2R, this plateau was more consistent across the HAC bins than for A2AR. For A2AR, the precision continued to increase for the heavier bins, albeit more slowly, up to the maximum training size of 2,000,000 samples (see figure 17). Differently from D2R, the proportion labeled as active for A2AR increased in concordance with an increase of HAC (see appendix A). The heavier, more balanced, bins allowed for greater values of precision.

Figure 17: Precision for top 1 % scoring compounds in the {1} set as a function of training size for A2AR (A) and D2R (B), per HAC bin (14-17, 18-21, 22-24 and 25-26). Calculated after extracting the {1} set at ε = 0.2.

To conclude, both targets reached an initial plateau for recall and precision at similar training sizes. However, the response to increased training sizes differed between the individual HAC bins.

4 Discussion

4.1 Retrospective Predictions

Retrospective prediction of the FWOK and LLWOK 18-26 HAC test sets allowed for finding more than 80 % of top 1 % scoring molecules while reducing the docking size tenfold for both A2AR and D2R. For these predictions, completely exchangeable distributions were used and the error rates were ≤ ε for both targets, as expected with the conformal prediction framework. This conveys the capability of the AMCP framework for subsetting large chemical libraries with controlled error rates. Furthermore, retrospective prediction of known A2AR and D2R ligands showed almost perfect efficiency at ε = 0.2. The error rate for A2AR was controlled under 20 %, while D2R had a marginally higher than expected error rate. This reveals that the AMCP framework’s capability of predicting docking scores can also translate into predictions of biologically relevant compounds for two targets with clearly different binding pockets.

For D2R’s error rate, it is important to stress conformal prediction’s assumption of exchangeability for test and training sets. The known ligands are not sampled from the same underlying distribution as the samples used for training, which weakens the conformal guarantees. Another factor is that these ligands were not docked: despite being ligands, it is possible that they would not be in the top 1 % on docking scores. This highlights another important premise for machine learning models to output reliable predictions: the settings used for the docking software have to be accurate and well validated. For AMCP prediction of ultra-large libraries, the docking software has to be able to produce hit rates that improve with the growth of the library. Otherwise, ligands will be labeled as both actives and non-actives in the training set, resulting in worse precision in the extracted sets.

The ROC plots show a general enrichment of compounds scoring in the top 1 % over other molecules in the LLWOK 18-26 HAC test set. This was true when ranking on credibility (p(1)), confidence (1 − p(0)), or quality of information (p(1) − p(0)). Similarly, known ligands were enriched for each way of ranking on p-values. For A2AR, the best method was to rank on credibility, which intuitively describes how similar the compound is to ’actives’ in the training set. For D2R, the confidence in the prediction enriched actives better; in essence, this describes how dissimilar the compound is to ’non-actives’ in the training set.

4.2 Choice of Features

The transferability of models built in fragment space to lead-like space had the most reliable error rates when using ECFP4s as predictive features. As ECFP4s represent substructures, they do not contain any single feature that is directly correlated to size. By contrast, physico-chemical features use molecular weight and HAC as explicit features. Principal component analysis of the features from the pretrained CDDD model also showed that the first principal component correlates with the molecules’ molecular weights (Winter et al. 2019). It is likely that models built with physico-chemical properties and CDDDs designated high importance to size for their separation of ’actives’ and ’non-actives’. This might have resulted in lead-likes, which per our definition are larger, being labeled as ’active’ simply due to their larger size, disregarding other features. This is partly a reflection of reality, where larger fragment-ligands do have a higher chance of making potent interactions with a target than smaller fragment-ligands do. It is also a reflection of the DOCK3.7 docking software, which in its current state has a bias for larger molecules, whether there are specific interactions or not. The fragment models built with ECFP4s predicted more molecules in the null set as the HAC increased. This is likely a result of the lead-likes simply having more substructures, making them dissimilar both to ’actives’ and ’non-actives’. This is in line with the intuition behind conformal prediction, where samples predicted in the { } set are too dissimilar to both classes of the training samples for the model to assign labels.

4.3 Heavy Atom Count Dependence of Required Training Size

While the error rate is controlled for all training sizes if training and test sets are exchangeable, the predictions are not always efficient. For low training sizes, conformal prediction will control the error rate by assigning samples both labels, predicting them in the {0,1} set, which is always valid. As training sizes increase, the conformal predictor becomes more efficient, such that the added experience allows the predictor to produce more single-label predictions. Recall is the standard machine learning metric that is most closely related to conformal prediction’s error rate. Here, recall is measured on the {1} set predictions. With efficient models, the recall approximates 1 − ε.
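The set-valued predictions described above can be made concrete with a minimal, illustrative function; the function name and example p-values are assumptions for this sketch, not the thesis code.

```python
def prediction_set(p0, p1, eps):
    """Return the conformal prediction set at significance eps.

    A label is included whenever its p-value exceeds eps, so limited
    training experience tends to yield the always-valid {0,1} set.
    """
    labels = set()
    if p0 > eps:
        labels.add(0)
    if p1 > eps:
        labels.add(1)
    return labels


# At eps = 0.2: a confident 'active', a double prediction, a null prediction.
print(prediction_set(p0=0.05, p1=0.80, eps=0.2))  # -> {1}
print(prediction_set(p0=0.50, p1=0.45, eps=0.2))  # -> {0, 1}
print(prediction_set(p0=0.10, p1=0.02, eps=0.2))  # -> set()
```

Only the first case contributes to the {1} set; the efficiency metric is the fraction of samples falling into single-label sets like it.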

For both A2AR and D2R, recall plateaued at approximately 80 % after training was performed with approximately 320,000 samples, consistent with the expected recall of 80 % (1 − ε) at ε = 0.2. There was a discrepancy for precision, where D2R mostly showed a plateau after 320,000 samples, while A2AR did have an increase in precision for larger training sizes when building and predicting in the 22-24 HAC bin and the 25-26 HAC bin (see figures 16 and 17). The discrepancy should be considered in the context of the distributions of ’active’ compounds (see appendix A), which show a difference in class imbalance for the two targets. For both training and testing, molecules were defined as ’active’ if they were part of the top 1 % of the entire combined FWOK and LLWOK libraries. For A2AR, the proportion of ’actives’ increased for each step of HAC, while for D2R, the proportion of ’actives’ was more uniformly distributed across HAC. The distribution of ’actives’ can be related to the binding sites of the targets. The D2R pocket is more constrained and the major driver of ligand affinity is the salt bridge formed with aspartic acid 114. For A2AR, the binding pocket is not as constrained, and while polar interactions with asparagine 253 are a major contributor to affinity, larger molecules will have more opportunities to make more interactions with the pocket.

Although it was not the aim, the tests on the required training size indicate a response of the framework to class balance: larger training sizes may allow higher values of precision if the ’active’ and ’non-active’ classes are more balanced. This claim must be validated for other targets and ranges of imbalance to see whether this behaviour is generalizable.

4.4 Fragment-Based Conformal Prediction

The idea of training in fragment space for prediction in lead-like space is appealing. Since virtual fragment space is smaller and does not grow at the same rate as lead-like and drug-like space, fragments can allow a more complete enumeration of chemical groups while representing a small proportion of the entire commercially available space. Another factor is that their small size allows for sampling of one chemical group without the interference of another.

Additionally, as fragments have fewer possible conformations, they can be docked more quickly. Even so, there are a number of issues that complicate this matter. Firstly, docking scores are generally worse for fragments than for lead-likes. If the threshold for the ’active’ class is set to the top 1 % of the fragments, more than 1 % of lead-likes have ’active’ docking scores. Secondly, features that represent size in one way or another do not allow complete exchangeability between training and test sets, and then error rates cannot be guaranteed under conformal prediction. The contribution of size-correlated properties to docking scores, compared to chemotype-specific interactions, is also not consistent between different protein targets.

Surely, small lead-like molecules do behave more similarly to fragments than large lead-like molecules do, which is why the adaptive methods were investigated. ECFP4s were chosen, as their error rates were less responsive to the increase in HAC. At each HAC-bin step, compounds that were enriched in ’actives’ were chosen as the training set for the next HAC bin. However, this introduced an additional bias, as the proportion of ’actives’ was now further from the proportion of ’actives’ in the predicted set. Although the highest F1-score was achieved with an adaptive method, it required the sampling of many middle-scoring compounds for building additional training sets, and despite this, the error rates were not controlled at each step. It was also not consistent between the two targets. While there likely is a way to optimize these methods for satisfactory precision and recall on A2AR and D2R retrospectively, it is unlikely that this will generalize to other targets. For example, how the error rate changes with respect to HAC, and how biased the extracted sets are to the training sets, may not behave in a consistent manner.

In the end, both the fragment-based and the adaptive methods that were attempted in this work produced unreliable error rates, and there are no guarantees that they will behave similarly as libraries continue to grow in size. It will be especially difficult to make a generalizable adaptive method. Despite this, there are other approaches that may be worth exploring further for machine-learning models built in fragment space.

The reader is asked to recall the probability of a useful event (see equation 1) that was described in relation to molecular complexity by Hann et al. (2001). This probability depends on a molecule making favorable interactions with a target, and on the probability of it having adequate affinity for detection in an assay. Ranking molecules on docking scores is to rank them on an estimate of affinity, given that they would bind. Due to their higher complexity, lead-like compounds have a lower probability of having complementary interactions to a given target. It follows that the top 1 % scoring lead-like molecules will have a lower proportion of real binders than the top 1 % of the fragments will. This conveys that top 1 % score might be the wrong way to classify compounds for extraction for docking screens.

Instead of classifying compounds on their affinities, it might prove relevant to approximate their usefulness. This might facilitate better predictive transfer between the fragment and lead-like domains. One possible procedure is to normalize molecules’ docking scores by their complexities. Practically, complexity can be estimated by HAC. Before labeling the fragments as ’useful’ or ’non-useful’, the docking scores’ response to an increase in HAC can be determined through regression analysis, after which the scores of the molecules can be normalized by this relationship. This would punish the scores of more complex molecules, and pardon those of less complex ones.
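The proposed normalization could be sketched as an ordinary least-squares fit of score against HAC, with the residuals serving as normalized scores. The data below is invented purely for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Invented fragment data: docking scores drift more negative as HAC grows.
hac = [10, 11, 12, 13, 14, 15, 16, 17]
scores = [-15.2, -16.9, -18.1, -20.3, -21.0, -22.8, -24.5, -25.1]

slope, intercept = linear_fit(hac, scores)

# Normalized score = residual from the size trend. This punishes molecules
# that only score well because they are large, and pardons small molecules
# that score well despite their size.
normalized = [s - (slope * h + intercept) for h, s in zip(hac, scores)]

best_raw = hac[scores.index(min(scores))]           # largest molecule wins
best_norm = hac[normalized.index(min(normalized))]  # mid-sized molecule wins
print(best_raw, best_norm)  # -> 17 13
```

On this toy set the largest molecule has the best raw score, but after removing the size trend a 13-heavy-atom molecule ranks first.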

4.5 Prospective Use

Over the course of this project, the method that performed most reliably was to use a training set that was exchangeable with the test set. This approach consistently produced error rates ≤ ε for the two targets, A2AR and D2R, and is currently the proposed method for prospective use of the framework in predicting ultra-large libraries. The idea is to extract a random subset of around 2,000,000 molecules from the library for docking. The top 1 % scoring compounds can be labeled as ’1’ and the rest as ’0’. This labeled set can then serve as a training set for prediction of the rest of the library. If the efficiency (the fraction of single-label predictions) is satisfactory, the compounds predicted in the {1} set can be docked with a reliable estimate of missed top 1 % scoring compounds. The ε can also be scaled to control the number of molecules to dock: if the user accepts a higher error rate, fewer molecules will be predicted in the {1} set. Using this approach, the framework has the capacity to reduce large chemical libraries to manageable sizes, which in turn allows for sampling more conformations of the selected compounds. It may prove especially useful for labs that do not have the capacity to perform LSDs but still need to investigate a large proportion of chemical space. Before prospective use is attempted, the framework must be fine-tuned for the number of aggregated models, the choice of descriptive features and the choice of classifier. For a large library, it is especially important that the features are chosen carefully, as they then only have to be generated once and can subsequently be used for prediction of any dockable target.
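The protocol above can be sketched end-to-end. Everything here is an assumption for illustration: `dock` stands in for a docking run (e.g. DOCK3.7) and `train_amcp` for the AMCP framework; neither name comes from the thesis code.

```python
import random


def prospective_screen(library, dock, train_amcp, eps=0.2, subset_size=2_000_000):
    """Dock a random subset, label its top 1 % scorers as '1', train a
    conformal model, then predict the remainder and return the {1} set."""
    rng = random.Random(0)
    subset = rng.sample(range(len(library)), min(subset_size, len(library)))
    scores = {i: dock(library[i]) for i in subset}
    # Top 1 % cutoff: more negative docking scores are better.
    cutoff = sorted(scores.values())[max(1, len(scores) // 100) - 1]
    labels = [int(scores[i] <= cutoff) for i in subset]
    model = train_amcp([library[i] for i in subset], labels)

    subset_set = set(subset)
    one_set = []
    for i in range(len(library)):
        if i in subset_set:
            continue
        p0, p1 = model.predict_p(library[i])
        if p1 > eps and p0 <= eps:  # single-label 'active' prediction
            one_set.append(i)
    return one_set
```

The compounds in the returned {1} set would then be docked, with ε controlling the trade-off between the size of the set and the expected fraction of missed top scorers.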

There are some caveats of this approach, compared to a fragment-based approach, that should be elaborated on for prediction of expanding molecular libraries. Particularly, it is important to stress an effect of extracting set predictions for imbalanced datasets. Assuming error rates are exactly equal to ε and that predictions are 100 % efficient, the classifier will indeed find 80 % of the top scoring compounds when extracting the {1} set. Likewise, at least 80 % of ’non-active’ molecules will be predicted in the {0} set. It follows that up to 20 % of ’non-actives’ may be predicted in the {1} set. With large class imbalances, the number of falsely labeled ’non-actives’ normally exceeds the number of ’actives’ in the {1} set, in extreme cases causing large {1} sets with small precision. As molecular libraries continue to grow, the top 1 % scoring compounds will consist of more molecules, and the proportion of biologically active molecules among them will be diluted. At a certain point, it may be tempting to instead look at the top 0.1 % scoring compounds. As before, efficient models should predict 80 % of the top 0.1 % molecules in the {1} set. However, changing the threshold will increase the class imbalance, and up to 20 % of the non-active molecules will still be predicted in the {1} set. Thus, the size of the {1} set may not change to a high degree. The underlying classifier may also not handle such large class imbalances particularly well.
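The arithmetic above is easy to check numerically. Under the stated assumptions (error rates exactly ε, 100 % efficiency) and for a hypothetical library of one billion compounds:

```python
def expected_one_set_size(n_total, active_fraction, eps=0.2):
    """Expected {1} set size when error rates are exactly eps and all
    predictions are single-label."""
    n_active = n_total * active_fraction
    n_inactive = n_total - n_active
    true_positives = (1 - eps) * n_active  # 80 % of 'actives' found
    false_positives = eps * n_inactive     # up to 20 % of 'non-actives'
    return true_positives + false_positives


# Tightening the cutoff from top 1 % to top 0.1 % barely shrinks the {1}
# set, because the eps * 'non-actives' term dominates.
print(round(expected_one_set_size(1_000_000_000, 0.01)))   # -> 206000000
print(round(expected_one_set_size(1_000_000_000, 0.001)))  # -> 200600000
```

A tenfold stricter cutoff reduces the expected {1} set by less than 3 % in this scenario, illustrating why the false-positive term dominates for imbalanced libraries.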

An important consideration is also how the method compares to other approaches of navigating through chemical space. It might be tempting to simply use the underlying classifier to eliminate the computational overhead of the conformal prediction framework. This is a reasonable choice if there is a fixed number of molecules that can be docked. Ranking on the nonconformity scores of the classifier should perform similarly well as ranking on the p-values of the AMCP with large training sizes. The defining strength of conformal prediction is of course that the user supplies the accepted error rate, ε, and the conformal predictor indicates the size to dock. Lastly, it would be intriguing to compare the AMCP framework to substructure searches performed after docking fragments, especially in the search for novel chemotypes. It is possible that the AMCP method is less capable of finding novel scaffolds, as molecules with the representative substructures may be too rare to be sampled for the training set.

As has been done previously with conformal prediction for docking score prediction, the AMCP framework can also be extended with iterative approaches that are based on extracting an initial random subset from the prediction domain to guide subsequent training. As shown by Ahmed et al. (2018), the efficiency metric is particularly useful for controlling the training size. After initial training, more samples can be collected and used for training, until the predictions have reached a certain threshold of efficiency, or converged on it. Iterative building might also be useful when using classifiers that are more sensitive to severe class imbalances. In these circumstances, oversampling of the minority class is a common method of boosting the predictive power of the classifier. For each iteration, more of the minority-class samples can be implemented in the training set.

4.6 AMCP Framework

The AMCP codebase was developed with modularity in mind and can easily be extended with other classifiers. After validation, it is intended to be distributed as open source. The framework is agnostic to what it is predicting and building on, as long as the descriptors can be represented as vectors with continuous or integer values. Thus, it can easily be applied in other fields.

A facet of the framework in this work that is important to elaborate on is the over-conservative error rates (see figures 12, 13, 23 and 24). This was not considered a problem because the error rate of the minority class was controlled around ε. As discussed by Johansson et al. (2015), over-conservative error rates are also seen in non-aggregated inductive conformal prediction, mostly for mondrian models. The pattern emerges when nonconformity scores in the calibration set are few. Predicted samples whose nonconformity scores fall between two calibration scores are then given the p-value that corresponds to the more nonconforming instance, which ensures that the error rate is controlled. In this work, the causative factor was not few samples in the calibration set, but rather a low resolution, where many of the calibration set’s conformity scores were exactly the same. Practically, the calibration set’s conformity scores form a sorted list; the low resolution meant that this list contained blocks in which every value was exactly the same. For new samples, the first conformity score of a given block was used to assign the predicted sample’s p-value. With a better resolution, new instances might have been assigned p-values that would correspond to conformity scores in the middle of the block.
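The block effect can be illustrated with a minimal p-value function over a sorted calibration list; this is a simplified, non-Mondrian sketch with invented scores, not the framework's implementation.

```python
import bisect


def conformal_p_value(sorted_cal_scores, test_score):
    """p-value = fraction of calibration nonconformity scores at least as
    large as the test score, counting the test sample itself."""
    n_ge = len(sorted_cal_scores) - bisect.bisect_left(sorted_cal_scores, test_score)
    return (n_ge + 1) / (len(sorted_cal_scores) + 1)


# Low-resolution calibration scores: the repeated 0.3 values form a 'block'.
cal = [0.1, 0.3, 0.3, 0.3, 0.3, 0.3, 0.7, 0.9]

# A test score landing on the block is treated like the start of the block,
# receiving the larger (more conservative) p-value.
print(conformal_p_value(cal, 0.30))  # -> 8/9, over-conservative
print(conformal_p_value(cal, 0.31))  # -> 3/9, just past the block
```

A score inside the tied block inherits the p-value of the block's most conforming end, which is what makes the predictor err on the conservative side.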

This behaviour mostly obscured the otherwise common sigmoidal shape that occurs when predicting with aggregated models. As explained by Linusson et al. (2017), error rates have a tendency to be lower than the expected error rate for ε ≤ 0.4 and greater than the expected error rate for ε > 0.4. This pattern was seen for the minority class with smaller training and test sets.

Although over-conservative conformal predictors do not have a controlled lower bound on the error rate, the error rates do have an upper bound. Pragmatically, this can be useful, as over-conservative predictors can produce better recall than expected. It may also reduce the number of false positives in the {1} set. The downside is that if it is completely uncontrolled for the class of interest, it can indicate very large sizes for the set to be extracted.

4.7 Conclusions & Future Directions

In this project, a new implementation of an aggregated, mondrian, conformal prediction (AMCP) framework has been developed. With the random forest classifier, the framework is capable of reducing chemical libraries tenfold while retaining 80 % of the molecules scoring among the top 1 % of the entire library. This has been shown for two GPCR targets with distinct binding pockets. Furthermore, retrospective prediction showed a clear enrichment of known ligands, with controlled error in the extracted set, indicating that AMCP models trained on docking scores can predict biology. This will allow for reduction of ultra-large chemical libraries to manageable sizes, and will allow increased sampling of selected molecules. The framework can be used as a modular extension on top of almost any classifier and will be optimized for increased predictive power before being used prospectively.

The fragment-based and adaptive approaches attempted in this project performed unreliably and will not be useful for prospective predictions. Nevertheless, owing to the many advantages of fragment-based prediction, further ideas will be investigated, including other ways of classifying the compounds of the training set and techniques for transfer learning.


5 Acknowledgements

I wish to express my sincere appreciation to my supervisor, Jens Carlsson, for his continued faith in me throughout these months. In spite of sometimes wavering progress, he continued to trust me to make the right choices in my explorations of this field, and led me in the right direction when I was lost. I especially appreciate his pragmatic attitude, which provided a stable base when I overcomplicated matters.

I also want to show my utmost gratitude to my closest supervisor, Andreas Luttens, for his tremendous support and guidance in this project, without which its realization would have been a complete impossibility. I ascribe an enormous part of what has been achieved in this work to him.

Furthermore, I am highly indebted to my external advisor, Ulf Norinder, for constantly bringing novel ideas and insights. Your ambition constantly inspired me, and your recommendations have been a great resource in times when progress has been slow. Thank you especially for your comments and support during my frantic days of writing.

I also wish to show my gratitude to my subject reader, Ola Spjuth, for your early directions and for providing foundational research that guided my investigations in this field.

Furthermore, I am highly thankful for being introduced to the current and previous wonderful members of the Jens Carlsson group. This team of talented individuals has continually motivated me to perform at my best and provided me with many useful suggestions when my own ideas have dwindled.

Lastly, I wish to thank my family and friends, for their enormous support and understanding during a demanding period of my life.


References

Ahmed L, Georgiev V, Capuccini M, Toor S, Schaal W, Laure E, Spjuth O. 2018. Efficient iterative virtual screening with Apache Spark and conformal prediction. Journal of Cheminformatics, doi 10.1186/s13321-018-0265-z.

Berman HM. 2012. Creating a community resource for protein science. Protein Science 21: 1587–1596.

Breiman L. 2001. Random forests. Machine Learning 45: 5–32.

Brown DG, Boström J. 2018. Where Do Recent Small Molecule Clinical Development Candidates Come From? Journal of Medicinal Chemistry 61: 9442–9468.

Burton JH, Mass M, Menegazzi JJ, Yealy DM. 1997. Aminophylline as an adjunct to standard advanced cardiac life support in prolonged cardiac arrest. Annals of Emergency Medicine 30: 154–158.

Cameron IG, Wallace DL, Al-Zughoul A, Kayser AS, D'Esposito M. 2018. Effects of tolcapone and bromocriptine on cognitive stability and flexibility. Psychopharmacology 235: 1295–1305.

Carlsson L, Eklund M, Norinder U. 2014. Aggregated conformal prediction. IFIP Advances in Information and Communication Technology 437: 231–240.

Carpenter B, Lebon G. 2017. Human adenosine A2A receptor: Molecular mechanism of ligand binding and activation. Frontiers in Pharmacology, doi 10.3389/fphar.2017.00898.

Coleman RG, Carchia M, Sterling T, Irwin JJ, Shoichet BK. 2013. Ligand Pose and Orientational Sampling in Molecular Docking. PLoS ONE 8: e75992.

Fan J, Fu A, Zhang L. 2019. Progress in molecular docking. Quantitative Biology 7: 83–89.

Franco R, Navarro G. 2018. Adenosine A2A receptor antagonists in neurodegenerative diseases: Huge potential and huge challenges. Frontiers in Psychiatry 9: 68.

Geppert H, Vogt M, Bajorath J. 2010. Current trends in ligand-based virtual screening: molecular representations, data mining methods, new application areas, and performance evaluation. Journal of Chemical Information and Modeling 50: 205–216.

Gorgulla C, Boeszoermenyi A, Wang Z-F, Fischer PD, Coote PW, Padmanabha Das KM, Malets YS, Radchenko DS, Moroz YS, Scott DA, Fackeldey K, Hoffmann M, Iavniuk I, Wagner G, Arthanari H. 2020. An open-source drug discovery platform enables ultra-large virtual screens. Nature 580: 663–668.

Hann MM, Leach AR, Harper G. 2001. Molecular Complexity and Its Impact on the Probability of Finding Leads for Drug Discovery. Journal of Chemical Information and Computer Sciences 41: 856–864.

Hauser AS, Attwood MM, Rask-Andersen M, Schiöth HB, Gloriam DE. 2017. Trends in GPCR drug discovery: New agents, targets and indications. Nature Reviews Drug Discovery 16: 829–842.

Ho SS, Wechsler H. 2008. Query by Transduction. IEEE Transactions on Pattern Analysis and Machine Intelligence 30: 1557–1571.

Ho TK. 1995. Random decision forests. In: Proceedings of the international conference on document analysis and recognition, ICDAR, pp. 278–282. IEEE Computer Society.

Hoffmann T, Gastreich M. 2019. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discovery Today 24: 1148–1156.

Johansson U, Ahlberg E, Boström H, Carlsson L, Linusson H, Sönströd C. 2015. Handling Small Calibration Sets in Mondrian Inductive Conformal Regressors. In: Statistical learning and data sciences, pp. 272–280.

Kamiya T, Saitoh O, Yoshioka K, Nakata H. 2003. Oligomerization of adenosine A2A and dopamine D2 receptors in living cells. Biochemical and Biophysical Research Communications 306: 544–549.

Kirkpatrick P, Ellis C. 2004. Chemical space. Nature 432: 823.


Leone RD, Lo YC, Powell JD. 2015. A2aR antagonists: Next generation checkpoint blockade for cancer immunotherapy. Computational and Structural Biotechnology Journal 13: 265–272.

Linusson H, Norinder U, Boström H, Johansson U, Gammerman A, Vovk V, Luo Z, Papadopoulos H. 2017. On the Calibration of Aggregated Conformal Predictors. Proceedings of Machine Learning Research 60: 1–20.

Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. 2001. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews 64: 4–17.

Lo Y-C, Rensi SE, Torng W, Altman RB. 2018. Machine learning in chemoinformatics and drug discovery. Drug Discovery Today 23: 1538–1546.

Lyu J, Wang S, Balius TE, Singh I, Levit A, Moroz YS, O'Meara MJ, Che T, Algaa E, Tolmachova K, Tolmachev AA, Shoichet BK, Roth BL, Irwin JJ. 2019. Ultra-large library docking for discovering new chemotypes. Nature 566: 224–229.

Mishra A, Singh S, Shukla S. 2018. Physiological and Functional Basis of Dopamine Receptors and Their Role in Neurogenesis: Possible Implication for Parkinson's disease. Journal of Experimental Neuroscience, doi 10.1177/1179069518779829.

Morgan HL. 1965. The Generation of a Unique Machine Description for Chemical Structures—A Technique Developed at Chemical Abstracts Service. Journal of Chemical Documentation 5: 107–113.

Murray CW, Rees DC. 2009. The rise of fragment-based drug discovery. Nature Chemistry 1: 187–192.

Mysinger MM, Carchia M, Irwin JJ, Shoichet BK. 2012. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. Journal of Medicinal Chemistry 55: 6582–6594.

Shafer G, Vovk V. 2008. A Tutorial on Conformal Prediction. Journal of Machine Learning Research 9: 371–421.

Stein RM, Kang HJ, McCorvy JD, Glatfelter GC, Jones AJ, Che T, Slocum S, Huang XP, Savych O, Moroz YS, Stauch B, Johansson LC, Cherezov V, Kenakin T, Irwin JJ, Shoichet BK, Roth BL, Dubocovich ML. 2020. Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature 579: 609–614.

Sterling T, Irwin JJ. 2015. ZINC 15 – Ligand Discovery for Everyone. Journal of Chemical Information and Modeling 55: 2324–2337.

Svensson F, Norinder U, Bender A. 2017. Improving Screening Efficiency through Iterative Screening Using Docking and Conformal Prediction. Journal of Chemical Information and Modeling 57: 439–444.

Vovk V, Gammerman A, Shafer G. 2005. Conformal prediction. In: Algorithmic learning in a random world, pp. 17–51. Springer-Verlag, New York.

Walters WP. 2019. Virtual Chemical Libraries. Journal of Medicinal Chemistry 62: 1116–1124.

Wang S, Che T, Levit A, Shoichet BK, Wacker D, Roth BL. 2018. Structure of the D2 dopamine receptor bound to the atypical antipsychotic drug risperidone. Nature 555: 269–273.

Winter R, Montanari F, Noé F, Clevert DA. 2019. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chemical Science 10: 1692–1701.

Zhu T, Cao S, Su PC, Patel R, Shah D, Chokshi HB, Szukala R, Johnson ME, Hevener KE. 2013. Hit identification and optimization in virtual screening: Practical recommendations based on a critical literature analysis. Journal of Medicinal Chemistry 56: 6560–6572.


A Appendix: Test Set Distributions

Table S1: HAC distribution for A2AR test sets for adaptive predictions and HAC dependence on required training sizes.

HAC   compounds    actives   active proportion (%)
18    10757871      44281    0.412
19    15360333      82485    0.537
20    20563620     141621    0.689
21    24562426     207079    0.843
22    25330845     276763    1.09
23    28986219     326137    1.13
24    28905034     401857    1.39
25    21088811     300053    1.42
26      706457       7889    1.12

Table S2: HAC distribution for D2R test sets for adaptive predictions and HAC dependence on required training sizes.

HAC   compounds    actives   active proportion (%)
18    12138037     148540    1.22
19    18000448     189194    1.05
20    24602124     257839    1.05
21    29973905     303930    1.01
22    31856211     300169    0.942
23    36692899     321805    0.877
24    35732137     330134    0.924
25    25035679     230541    0.921
26      871194      11048    1.27


Figure S1: HAC distribution for A2AR test sets.


Figure S2: HAC distribution for D2R test sets.


B Appendix: Adaptive Predictions

[Plot panels: (A) error rate vs. significance and (B) efficiency vs. significance, both axes spanning 0.0–1.0; legend: Error rate '0', Error rate '1'.]

Figure S3: Class-wise error rates (A) and efficiency (B) for retrospective top 1 % predictions of D2R. For (A), the dotted line indicates the expected error rates. For (B), the dotted line shows the ε that was used for extracting the {1} set.


[Plot panels: error rate vs. significance (A, C, E, G, I) and efficiency vs. significance (B, D, F, H, J), axes spanning 0.0–1.0; legend: 18-21 heavy atoms, 22-24 heavy atoms, 25-26 heavy atoms.]

Figure S4: Calibration plots and efficiencies for adaptive predictions of A2AR. Random subset (A) and (B), {1} set based (C) and (D), confidence-sort based (E) and (F), multi-class 1 (G) and (H), multi-class 2 (I) and (J).


[Plot panels: error rate vs. significance (A, C, E, G, I) and efficiency vs. significance (B, D, F, H, J), axes spanning 0.0–1.0; legend: 18-21 heavy atoms, 22-24 heavy atoms, 25-26 heavy atoms.]

Figure S5: Calibration plots for adaptive predictions of D2R. Random subset (A) and (B), {1} set based (C) and (D), confidence-sort based (E) and (F), multi-class 1 (G) and (H), multi-class 2 (I) and (J).


                             Rand-subset   {1} set     Conf-sort   Multi-class 1   Multi-class 2
Selected Compounds           16,627,130    9,733,996   6,000,000   4,312,725       29,660,826
Recall in 18-21 HAC bin      88.7 %        72.2 %      47.8 %      43.3 %          94.2 %
Recall in 22-24 HAC bin      83.9 %        64.8 %      52.9 %      39.5 %          92.4 %
Recall in 25-26 HAC bin      83.4 %        54.2 %      74.3 %      79.4 %          94.5 %
Precision in 18-21 HAC bin   5.76 %        5.66 %      11.4 %      24.6 %          4.43 %
Precision in 22-24 HAC bin   11.2 %        20.7 %      26.6 %      18.8 %          6.04 %
Precision in 25-26 HAC bin   17.4 %        40.3 %      14.1 %      22.0 %          8.56 %
Total recall                 85.0 %        64.5 %      60.0 %      48.6 %          93.3 %
Total precision              9.51 %        12.3 %      17.3 %      20.9 %          5.85 %
Total F1-score               17.1 %        20.7 %      26.5 %      29.3 %          11.0 %

Table S3: Comparisons of prediction metrics for different approaches of adaptive prediction for A2AR.

                             Rand-subset   {1} set      Conf-sort   Multi-class 1   Multi-class 2
Selected Compounds           22,622,543    26,620,732   6,000,000   8,832,026       29,002,452
Recall in 18-21 HAC bin      86.7 %        91.7 %       42.3 %      58.0 %          91.0 %
Recall in 22-24 HAC bin      83.8 %        73.4 %       33.9 %      55.4 %          83.3 %
Recall in 25-26 HAC bin      84.4 %        65.2 %       68.4 %      85.7 %          89.9 %
Precision in 18-21 HAC bin   8.04 %        4.69 %       19.0 %      21.1 %          6.35 %
Precision in 22-24 HAC bin   7.61 %        8.65 %       16.1 %      16.8 %          6.28 %
Precision in 25-26 HAC bin   8.40 %        16.3 %       8.3 %       6.45 %          6.23 %
Total recall                 85.1 %        80.3 %       41.5 %      60.03 %         87.4 %
Total precision              7.88 %        6.32 %       14.5 %      14.23 %         6.31 %
Total F1-score               14.42 %       11.7 %       21.5 %      23.0 %          11.8 %

Table S4: Comparisons of prediction metrics for different approaches of adaptive prediction for D2R.

C Appendix: Predictions by Choice of Descriptive Features

                        Fragm→Fragm   Lead→Lead   Fragm→Lead
Recall in {1} set (%)   83.7          82.4        99.9
Precision in {1} (%)    14.9          17.8        3.27
F1-score (%)            25.4          29.2        6.33
Efficiency (%)          88.5          94.1        92.3
Error rate (%)          16.5          17.6        96.5

Table S5: Comparisons of models built on physico-chemical properties when predicting fragments with fragment-trained models, lead-likes with lead-like-trained models and when predicting lead-likes with fragment-based models. The {1} set was extracted with ε = 0.2.
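The F1-scores reported in Tables S5–S7 are the harmonic mean of the tabulated recall and precision. A minimal consistency check (the function name is illustrative; small deviations from the tabulated values reflect rounding of the inputs):

```python
def f1(recall_pct, precision_pct):
    # Harmonic mean of recall and precision, both in percent,
    # as reported in the "F1-score (%)" rows of Tables S5-S7.
    return 2 * recall_pct * precision_pct / (recall_pct + precision_pct)

# Fragm -> Fragm column of Table S5: recall 83.7 %, precision 14.9 %.
print(round(f1(83.7, 14.9), 1))  # ~25.3; the tabulated 25.4 % reflects
                                 # rounding of the unrounded inputs
```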


                        Fragm→Fragm   Lead→Lead   Fragm→Lead
Recall in {1} set (%)   83.9          82.8        98.8
Precision in {1} (%)    13.4          17.9        4.98
F1-score (%)            23.1          29.4        9.48
Efficiency (%)          91.5          89.9        84.0
Error rate (%)          14.3          24.8        87.2

Table S6: Comparisons of models built on CDDD features when predicting fragments with fragment-trained models, lead-likes with lead-like-trained models and when predicting lead-likes with fragment-based models. The {1} set was extracted with ε = 0.2.

                        Fragm→Fragm   Lead→Lead   Fragm→Lead
Recall in {1} set (%)   84.3          82.9        70.9
Precision in {1} (%)    17.0          20.5        15.8
F1-score (%)            28.4          32.9        25.8
Efficiency (%)          89.4          94.3        63.4
Error rate (%)          15.0          15.6        48.0

Table S7: Comparisons of models built on ECFP4 features when predicting fragments with fragment-trained models, lead-likes with lead-like-trained models and when predicting lead-likes with fragment-based models. The {1} set was extracted with ε = 0.2.
