12
Mol Divers (2014) 18:829–840 DOI 10.1007/s11030-014-9545-3 FULL-LENGTH PAPER LBVS: an online platform for ligand-based virtual screening using publicly accessible databases Minghao Zheng · Zhihong Liu · Xin Yan · Qianzhi Ding · Qiong Gu · Jun Xu Received: 24 April 2014 / Accepted: 12 August 2014 / Published online: 3 September 2014 © Springer International Publishing Switzerland 2014 Abstract Abundant data on compound bioactivity and publicly accessible chemical databases increase opportuni- ties for ligand-based drug discovery. In order to make full use of the data, an online platform for ligand-based virtual screening (LBVS) using publicly accessible databases has been developed. LBVS adopts Bayesian learning approach to create virtual screening models because of its noise tol- erance, speed, and efficiency in extracting knowledge from data. LBVS currently includes data derived from Bind- ingDB and ChEMBL. Three validation approaches have been employed to evaluate the virtual screening models created from LBVS. The tenfold cross validation results of twenty different LBVS models demonstrate that LBVS achieves an average AUC value of 0.86. Our internal and external testing results indicate that LBVS is predictive for lead identifica- tions. LBVS can be publicly accessed at http://rcdd.sysu.edu. cn/lbvs. Keywords Ligand-based virtual screening · Virtual screening · Public database · Bayesian learning Minghao Zheng and Zhihong Liu have Equally contribution to this work. M. Zheng · Z. Liu · X. Yan · Q. Ding · Q. Gu (B ) · J. Xu (B ) Research Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-sen University, 132 East Circle at University City, Guangzhou 510006, China e-mail: [email protected] J. Xu e-mail: [email protected] Introduction Virtual high-throughput screening (vHTS) is a common process in drug discovery [13]. Numerous successful vir- tual screening campaigns have been reported, and con- firmed with bioassays [3]. Various computational methods, including ligand- and structure-based strategies, were uti- lized to identify leads for a specific target. Usually, ligand- based approaches, such as QSAR, machine learning, are used for lead optimization. Recently, however, ligand-based approaches (such as naïve Bayesian classifiers, support vec- tor machines, decision trees, random forests, clustering meth- ods, and neural networks) are increasingly employed for drug lead identification [47]. Notably, naïve Bayesian learning is widely used in chemoinformatics, since the Laplacian-corrected modified naïve Bayes approach was first applied to kinase inhibitor classification in 2004 [8, 9]. The widely used Bayesian learn- ing yields respectable biological target predictions [10, 11], virtual screenings [8, 1216], and ADMET properties estima- tions [17, 18]. Bayesian learning is widely employed in drug discovery primarily because it has the following advantages. First, Bayesian learning shows the tolerance of noise. Errors in the bioactivity databases [19], the imbalance between active and inactive compounds and the arbitrary definition of bioactivity for activity and inactivity make HTS data noisy and sparse. The characteristic of Bayesian learning enables it to handle HTS data analyses [9, 15, 20]. Second, Bayesian learning is free of over fitting. Bayesian models are not con- structed via a fitting approach, which frees them from the “curse of dimensionality” when using numerous descriptors [8]. Third, Bayesian learning is rapid and efficient in process- ing large amounts of data [8]. Abundant data on compound bioactivity and publicly accessible chemical databases have the characteristics of big 123

LBVS: an online platform for ligand-based virtual screening using publicly accessible databases

  • Upload
    jun

  • View
    218

  • Download
    2

Embed Size (px)

Citation preview

Mol Divers (2014) 18:829–840DOI 10.1007/s11030-014-9545-3

FULL-LENGTH PAPER

LBVS: an online platform for ligand-based virtual screeningusing publicly accessible databases

Minghao Zheng · Zhihong Liu · Xin Yan ·Qianzhi Ding · Qiong Gu · Jun Xu

Received: 24 April 2014 / Accepted: 12 August 2014 / Published online: 3 September 2014© Springer International Publishing Switzerland 2014

Abstract Abundant data on compound bioactivity andpublicly accessible chemical databases increase opportuni-ties for ligand-based drug discovery. In order to make fulluse of the data, an online platform for ligand-based virtualscreening (LBVS) using publicly accessible databases hasbeen developed. LBVS adopts Bayesian learning approachto create virtual screening models because of its noise tol-erance, speed, and efficiency in extracting knowledge fromdata. LBVS currently includes data derived from Bind-ingDB and ChEMBL. Three validation approaches have beenemployed to evaluate the virtual screening models createdfrom LBVS. The tenfold cross validation results of twentydifferent LBVS models demonstrate that LBVS achieves anaverage AUC value of 0.86. Our internal and external testingresults indicate that LBVS is predictive for lead identifica-tions. LBVS can be publicly accessed at http://rcdd.sysu.edu.cn/lbvs.

Keywords Ligand-based virtual screening ·Virtual screening · Public database · Bayesian learning

Minghao Zheng and Zhihong Liu have Equally contribution to thiswork.

M. Zheng · Z. Liu · X. Yan · Q. Ding · Q. Gu (B) · J. Xu (B)Research Center for Drug Discovery, School of PharmaceuticalSciences, Sun Yat-sen University, 132 East Circle at University City,Guangzhou 510006, Chinae-mail: [email protected]

J. Xue-mail: [email protected]

Introduction

Virtual high-throughput screening (vHTS) is a commonprocess in drug discovery [1–3]. Numerous successful vir-tual screening campaigns have been reported, and con-firmed with bioassays [3]. Various computational methods,including ligand- and structure-based strategies, were uti-lized to identify leads for a specific target. Usually, ligand-based approaches, such as QSAR, machine learning, areused for lead optimization. Recently, however, ligand-basedapproaches (such as naïve Bayesian classifiers, support vec-tor machines, decision trees, random forests, clustering meth-ods, and neural networks) are increasingly employed for druglead identification [4–7].

Notably, naïve Bayesian learning is widely used inchemoinformatics, since the Laplacian-corrected modifiednaïve Bayes approach was first applied to kinase inhibitorclassification in 2004 [8,9]. The widely used Bayesian learn-ing yields respectable biological target predictions [10,11],virtual screenings [8,12–16], and ADMET properties estima-tions [17,18]. Bayesian learning is widely employed in drugdiscovery primarily because it has the following advantages.First, Bayesian learning shows the tolerance of noise. Errorsin the bioactivity databases [19], the imbalance betweenactive and inactive compounds and the arbitrary definition ofbioactivity for activity and inactivity make HTS data noisyand sparse. The characteristic of Bayesian learning enablesit to handle HTS data analyses [9,15,20]. Second, Bayesianlearning is free of over fitting. Bayesian models are not con-structed via a fitting approach, which frees them from the“curse of dimensionality” when using numerous descriptors[8]. Third, Bayesian learning is rapid and efficient in process-ing large amounts of data [8].

Abundant data on compound bioactivity and publiclyaccessible chemical databases have the characteristics of big

123

830 Mol Divers (2014) 18:829–840

Table 1 Some popular bioactivity database in drug discovery

Name Compounds Targets Activities Website

BindingDB 427,325 6,589 1,009,290 http://www.bindingdb.org

ChEMBL 1,324,941 9,356 12,077,491 https://www.ebi.ac.uk/chembldb

PubChem 48,781,965 No data 739,658a http://pubchem.ncbi.nlm.nih.gov

Drugbank 7,677 4,092 14,785 http://beta.drugbank.ca

TTD 20,667 2,360 No data http://bidd.nus.edu.sg/group/ttd

The statistic data were from the website of databases by December of 2013a The number denoted here is the number of assays

data, which have drawn industrial, academic, and govern-mental attention [21–23] in recent years. The major chal-lenge is to extract knowledge from the data, and solve drugdiscovery problems. A significant amount of bioactivity dataat the molecular, cellular, and animal levels has been gener-ated due to the advent of high-throughput screening. ZINC,a free database of commercially available compounds forvirtual screening, contains over 35 million purchasable com-pounds [24]. ChEMBL is a publicly accessible database con-taining over 12 million compounds with biological activities[25]. Table 1 summarizes some well-known databases widelyaccessed for drug discovery. There is an increasing need fora platform for ligand-based virtual screening (LBVS) usingthese publicly accessible databases. The platform should takeover the routine tasks, such as importing data from differentsources, cleaning data, and creating predictive model [26].

The flowchart of LBVS is depicted in Fig. 1. Currently,LBVS collects BindingDB and ChEMBL databases for vir-tual screening modeling, employs Bayesian learning to buildvirtual screening models, which are validated by a numberof model validation methods. Atom center fragments (ACFs)

BindingDBChEMBL……

Targetrelateddata

Target-basedData selection

Virtualscreeningmodels

BayesianLearning

Validatedvirtualscreeningmodel

Validating models

Virtuallibrary

Virtualscreening

Virtualhits

Hit selection criteria

Privilegedfragments

Fig. 1 Schematic flow diagram of the LBVS platform. User can selecta target through web browser and the target-related ligands bioactivitydata can be retrieved from the public database BindingDB or ChEMBLand divided into active and inactive sets. Subsequently, naïve Bayesianmodel was constructed online and validated by various metrics. With thevalidated model, a virtual library (a commercial available or in-houselibrary) can be screened to generate virtual hits

are used as structural features for the Bayesian modeling. TheBayesian models are validated with a number of validationmethods, such as tenfold cross-validations.

Materials and methods

Web implementation

The web platform was constructed from three components:the interface, the application server, and the database. Theweb interface was implemented using HTML, CSS andJavaScript, ChemDoodle, and D3.js for visualization. Theapplication server was implemented in the Go language(a new programming language that was initially devel-oped at Google). Bioactivity data from the public data-bases BindingDB (2013/6/13) and ChEMBLdb (release 17)were imported into the MongoDB and MySQL databases.Data for the ligand structure, target name, source organ-ism, and ligand–target affinity parameters (IC50, Ki, andKd) were extracted to create the active and inactive sub-sets. The nanomolar affinity values were used as the stan-dard unit. The user can select the target of interest in bind-ingDB or ChEMBL or alternatively upload a private datasetof bioactivity data in the SDF file format. Three thresh-olds for activity (100, 1,000, and 10,000 nM) were chosenbased on user demand. The common 5- and 10-fold crossvalidation methods can be used, and the atom center frag-ment layer can be set from 1 to 4. The optional test set canbe uploaded for online modeling. The active and inactivesets were generated based on the selected target and thresh-old. The naïve Bayesian classifier and atom center frag-ment generation methods were implemented in-house, anda binary executable program was called using the Go lan-guage. The model validation results, including ROC curves,privileged structures and virtual screening hits, are avail-able for download. Snapshots of the major pages are shownin Fig. 2.

Feature calculations

The structural features for Bayesian learning are the ACFs[27,28], which are computed with an in-house program. Vari-ous ACFs can be generated based on the ACF-layer (the num-

123

Mol Divers (2014) 18:829–840 831

Fig. 2 Snapshots of model building, virtual screening, and results pages

ber of bonds from an edge atom to the central atom). ACF-nwas derived from a heavy atom and n-bonds away neighboratoms. The generation of ACF is illustrated in Fig. 3. Detailedinformation on ACFs can be found in the references [27,28].

Naïve Bayesian classifier

The naïve Bayesian classifier is a method based on the Bayestheorem for conditional probability.

Fig. 3 An example of ACFs generation

123

832 Mol Divers (2014) 18:829–840

Fig. 4 Diversity analysisthrough Scaffold-basedClassification Approach (SCA)displayed online using D3.js.Each compound was mapped asdot. “1” denoted as activecompounds (blue) and “0”denoted as inactive compounds(orange) as defined by user’sthreshold. (Color figure online)

P (A|B) = P (B|A) P (A)

P (B)(1)

Applying this theorem using two classifications in the con-text of activity prediction, the atom center fragments derivedfrom the molecular structure, denoted as x1, x2, . . . , xn ,and yield the active probability, p, and inactive probability,q:

p = p (active|x1, x2, . . . , xn)

= p (x1, x2, . . . , xn|active) p(active)

p(x1, x2, . . . , xn)(2)

q = p (inactive|x1, x2, . . . , xn)

= p (x1, x2, . . . , xn|inactive) p(inactive)

p(x1, x2, . . . , xn)(3)

Because p + q = 1, we can conclude the following:

logp

1 − p= log

∏n

i=1

p(xi |active)

p(xi |inacive)+ log

p(active)

p(inactive),

(4)

where

p (xi |active) = N (xi |active) + 0.5

N (active) + 1.0(5)

p (xi |inactive) = N (xi |inactive) + 0.5

N (inactive) + 1.0(6)

p (active) = N (active)

N (active) + N (inactive)(7)

p (inactive) = N (inactive)

N (active) + N (inactive)(8)

N (active) is the total number of active compounds andN (inactive) is the total number of inactive compounds.N (xi |active) is the number of compounds containing atomcenter fragment xi in the active set. N (xi |inactive) is thenumber of compounds containing xi in the inactive set.Laplace corrections were used to overcome the zero countsissues in Eqs. (5) and (6). The naïve Bayesian classifier isdetailed in references [8,9,12].

Model performance metrics

The classification model performance depends upon the con-fusion matrix, which contained the number of true posi-tives (TP), true negatives (TN), false positives (FP), andfalse negatives (FN). Several widely used metrics derivedfrom the confusion matrix were used to evaluate the model,including accuracy (Eq. 9), precision (Eq. 10), sensitivity (orrecall, Eq. 11), and specificity (Eq. 12), Matthews correla-tion coefficient (MCC, Eq. 13) and received operating char-acteristic (ROC) curve, specifically the area under the ROCcurve. These metrics were all calculated using the LBVSplatform.

Accuracy = TP + TN

TP + FN + TN + FP(9)

Precision = TP

TP + FP(10)

123

Mol Divers (2014) 18:829–840 833

Fig. 5 ROC curves for selected targets retrieved from BindingDB. The tenfold cross validation method was employed with the level 2 ACFs asthe features for Bayesian learning

Sensitivity (Recall) = TP

TP + FN(11)

Specificity = TN

TN + FP(12)

MCC = TP × TN − FP × FN√(TP + FN) (TP + FP)(TN + FN)(TN + FP)

(13)

Results and discussions

Chemical space exploration

The chemical space of the training set directly influencedthe model performance and virtual screening application. Ascaffold-based classification approach [29] was used to ana-lyze the chemical space of the training set, including both

123

834 Mol Divers (2014) 18:829–840

Fig. 6 Enrichment curves for selected targets retrieved from BindingDB. The ten-fold cross validation method was employed with the level 2ACFs as the features for Bayesian learning

the active and inactive sets. A scaffold ID was assigned toeach compound, and the CID value was associated with thecompound’s complexity. The cyclicity measures how cycli-cal a compound is. The JavaScript library D3.js was usedfor online display. Aldose reductase in the BindingDB wasused as an example, and the diversity of inhibitors and non-inhibitors is displayed in Fig. 4.

Model validation

A large number of target-oriented models were developedto validate our naïve Bayesian virtual screening platform. Atotal of 60 models were constructed with 20 targets in Bind-ingDB with three thresholds. Level 2 ACFs were generatedas descriptors, and the tenfold cross validation method was

123

Mol Divers (2014) 18:829–840 835

Table 2 Sensitivity, specificity, and AUC values of the models with different thresholds

Target ID Target name Sensitivity Specificity AUC

100 nM 1,000 nM 10,000 nM 100 nM 1,000 nM 10,000 nM 100 nM 1,000 nM 10,000 nM

1 5-Lipoxygenase 0.90 0.86 0.85 0.68 0.65 0.68 0.87 0.84 0.84

2 Amine oxidase 0.72 0.70 0.76 0.69 0.66 0.63 0.79 0.75 0.77

3 Butyrylcholinesterase 0.92 0.87 0.87 0.75 0.79 0.79 0.91 0.90 0.88

4 CDK2 0.85 0.84 0.83 0.75 0.75 0.77 0.89 0.88 0.89

5 Cholinesterase 0.85 0.84 0.90 0.68 0.65 0.59 0.85 0.85 0.86

6 Cyclooxygenase-2 0.79 0.84 0.85 0.71 0.62 0.60 0.84 0.82 0.80

7 Delta Opioid Receptor 0.73 0.90 0.91 0.87 0.73 0.55 0.91 0.89 0.79

8 Dihydrofolate reductase 0.88 0.83 0.88 0.64 0.66 0.57 0.86 0.85 0.86

9 Dipeptidyl peptidase IV 0.87 0.93 0.92 0.60 0.64 0.74 0.83 0.88 0.91

10 Dopamine Receptor D2 0.86 0.87 0.90 0.70 0.64 0.72 0.86 0.86 0.90

11 Dopamine transporter 0.87 0.90 0.87 0.68 0.71 0.78 0.86 0.87 0.89

12 EGF-R Tyrosine Kinase 0.96 0.91 0.88 0.66 0.68 0.65 0.91 0.89 0.87

13 Hepsin 0.89 0.89 0.88 0.57 0.60 0.57 0.82 0.82 0.82

14 Histone deacetylase 1 0.83 0.82 0.87 0.68 0.69 0.56 0.84 0.84 0.82

15 HIV-1 Reverse Transcriptase 0.90 0.83 0.79 0.66 0.70 0.69 0.86 0.84 0.81

16 Phospholipase A2 0.84 0.91 0.85 0.72 0.62 0.54 0.87 0.87 0.80

17 Protease 0.95 0.95 0.85 0.76 0.75 0.68 0.93 0.94 0.87

18 Protein Kinase C 0.88 0.92 0.91 0.60 0.61 0.68 0.83 0.85 0.87

19 Serotonin transporter 0.82 0.81 0.88 0.75 0.70 0.66 0.87 0.84 0.86

20 Src 0.88 0.90 0.86 0.70 0.66 0.66 0.87 0.88 0.87

used to construct the models. The model performance wasevaluated by the ROC curves (Fig. 5), enrichment curves(Fig. 6), and the model performance criteria, such asarea under ROC curve (AUC), sensitivity, and specificity(Table 2). All of the AUC values for the 20 targets weregreater than 0.8, with the exceptions of amine oxidase (0.77).The averages of sensitivity and specificity for the 60 modelsare 0.86 and 0.68. The average specificity of these modelsis lower than the average sensitivity. The default thresholdof Bayesian score for distinguishing actives from inactivesis 0.5. A user can balance the sensitivity and specificity byadjusting the Bayesian score threshold. The higher thresholdwill result in better specificity. As shown in Fig. 6, enrich-ment curves reflect the LBVS’ hit enrichment capacity. Thecapacity of the hit enrichment changes with different thresh-old (100 > 1,000 > 10,000 nM). The internal validation resultsindicated that the models constructed with the naïve Bayesianclassifiers using ACFs features performed well. Furthermore,the small differences for each target among the three mod-els based on the different thresholds suggest the thresholdvalue had little influence on the model performance, whichshows naïve Bayesian learning’s remarkable tolerance fornoise [15].

Natural products bioactivity prediction

Natural products are essential sources of lead drug discovery[30,31]. With the development of extraction/isolation tech-nologies, a large number of natural products have been iden-tified, and the rapid bioactivity identification or target fishingof natural products is significant for accelerating drug discov-ery. A series of natural products from Morus alba L. [32],Crotalaria ferruginea [33], Toddalia asiatica [34], and gor-gonian Echinogorgia pseudossapo [35] were recently iden-tified by our group as showing inhibitory activity againstphosphodiesterase-4D (PDE4D), which is the drug target forasthma and chronic obstructive pulmonary disease. To testthe LBVS performance, a PDE4D-directed model was con-structed online using bioactivity information from ChEMBLwith default parameters and then used to predict those nat-ural products. An activity threshold of 10µM was used todetermine actives in training sets. With a Bayesian score of0.5, LBVS predicted 22 PDE4D natural product inhibitors,in which fourteen active compounds were correctly pre-dicted, there were two false negatives and five false positives.Detailed information on the natural products and their predic-tion results are shown in Fig. 7. This PDE4D inhibitory activ-

123

836 Mol Divers (2014) 18:829–840

O

O

O

O

H

H

H

1

Source:Morus albaL.IC50_nM: 2900

Score: 0.60

O

O

O

O

O

O

O

O

O

O

O

O

O

O

H

H

H

H

H

H

2

Source:Toddalia asiaticaIC50_nM: 16650

Score: 0.81

O

O

O

O

O

O

O

O

O

O

H

H

3

Source:Toddalia asiaticaIC50_nM: 7810

Score: 0.33

N

O

O

O

O

O

O

4

Source:Toddalia asiaticaIC50_nM: 140

Score: 0.52

O

O

O

O

5

Source:Toddalia asiaticaIC50_nM: 1870

Score: 0.99

O

O

O

O

6

Source:Toddalia asiaticaIC50_nM: 2200

Score: 1.00

O OO

O

7

Source:Toddalia asiaticaIC50_nM: 960

Score: 0.97

O OO

O

8

Source:Toddalia asiaticaIC50_nM: 1530

Score: 0.99

O OO

OH

9

Source:Toddalia asiaticaIC50_nM: 2380

Score: 0.95

O

O

O

O

10

Source:Toddalia asiaticaIC50_nM: 9980

Score: 0.86

O

O

O

O

11

Source:Toddalia asiaticaIC50_nM: 11490

Score: 0.71

O

O

OO

12

Source:Toddalia asiaticaIC50_nM: 6820

Score: 0.99

O

O

OO

13

Source:Toddalia asiaticaIC50_nM: 10220

Score: 0.98

O

O

OOOO

O

14

Source:Toddalia asiaticaIC50_nM: 5510

Score: 1.00

N

N

O

N

N

O

15

Source:Echinogorgia pseudossaIC50_nM: 8500

Score: 0.02

N

NN

ON

O

O

H

16

Source:Echinogorgia pseudossaIC50_nM: 20300

Score: 0.02

O O

O

OO

O

H

H

17

Source:Crotalaria ferrugineaIC50_nM: 4590

Score: 1.00

O O

O

OO

H

H

18

Source:Crotalaria ferrugineaIC50_nM: 2570

Score: 1.00

O

O

O

O

O

H

H

H

19

Source:Crotalaria ferrugineaIC50_nM: 8940

Score: 1.00

O

O

O

O

O

H

H

H

20

Source:Crotalaria ferrugineaIC50_nM: 15000

Score: 1.00

O

O

O

O

O

O

H

H

H

H

21

Source:Crotalaria ferrugineaIC50_nM: 8740

Score: 1.00

O

O

O

O

O

H

H

H

22

Source:Crotalaria ferrugineaIC50_nM: 15000

Score: 1.00

Fig. 7 Prediction of natural products inhibitory activity for PDE4D

123

Mol Divers (2014) 18:829–840 837

Table 3 Reported virtual screening hits validation results

No. Target name VS methods Hits Correctlypredicted

Hit rate (%) Reference

1 Kappa-opioid receptor Docking 4 3 75 [36]

2 Mycobacterium tuberculosisInhA

Pharmacophore, docking 3 2 67 [37]

3 Aldose reductase Docking 26 16 62 [38]

4 Sumoylation activating enzyme1

Docking, similarity 5 5 100 [39]

5 5-HT1A Pharmacophore 10 8 80 [40]

6 TTK Scaffold similarity 8 5 63 [41]

7 5-HT1A QSAR model 9 6 67 [42]

8 Liver X receptor Pharmacophore, shapesimilarity

3 2 67 [43]

9 Hsp90 Docking 4 2 50 [44]

10 5-HT transporter Similarity, pharmacophore,docking

46 30 65 [45]

11 Hepatitis C virus NS5Bpolymerase

Pharmacophore, 3D-QSAR,docking

2 2 100 [46]

12 PDE4D Pharmacophore, docking,molecular dynamics

15 13 87 [47]

ity prediction case study demonstrates the potential power ofour LBVS platform.

Reported virtual screening hits validation

The mission of a QSAR models is to predict the activity ofunknown compounds based on known actives and inactives.To further validate the hits enrichment power of our LBVSplatform in virtual screening, more restricted and larger-scalevalidation were performed using recently reported successfulvirtual screening cases. The target-directed virtual screeningprotocols reported in the Journal of Chemical Informationand Modeling over the last year were collected to evaluatethe hits enrichment power. Notably, non-drug targets anddrug targets with too small of a sample size in bindingDBand ChEMBL were excluded. Both structure-based virtualscreening methods such as docking and molecular dynamics,and LBVS methods such as pharmacophore, similarity, andQSAR modeling approaches were widely applied and suc-cessfully identified the hits. We collected 12 virtual screen-ing campaigns with 11 targets, and the hits for each targetwere extracted for the online LBVS predictions. The virtualscreening information is listed in Table 3. The true hit ratesranged from 50 to 100 % with an average value of 73 %,which suggested the strength of LBVS for virtual screening.The virtual screening validation set only contains the activecompounds. However, it is common for virtual screening thatonly active compounds were reported. The reported virtualhits were validated to demonstrate that the model can enrich

and predict bioactivities for unknown compounds (latest pub-lished active compounds not included in the training set),which is important for identifying new scaffolds for specifictarget. The user is advised to select the top score compoundsin the virtual screening rather than using the score thresholdof 0.5.

Privileged fragments for a target

It is worth to note that LBVS can output privileged frag-ments after the validated Bayesian model is generated for adrug target. These privileged fragments are associated withthe concept of privileged structures in medicinal chemistry[48–50]. A privileged fragment is an ACF with high bioac-tivity contribution value (BCV), which is calculated usingBCV = p (xi |active) /p (xi |inactive). The privileged frag-ments for aldose reductase inhibitors are listed in Fig. 8.These fragments can be used to guide lead optimization, orcompose new scaffolds for lead identification.

Conclusions

LBVS, an online platform for LBVS using publicly accessi-ble databases has been reported in this paper. This platformcan be a solution to manage and mine the increasingly avail-able data of compounds and their bioactivities for virtualscreening purposes. Bayesian learning is the main modelingtool to build target-oriented models for virtual screenings due

123

838 Mol Divers (2014) 18:829–840

Fig. 8 Top-20 privileged fragments for aldose reductase inhibitors

to its capacity in dealing with big data. Since LBVS can out-put privileged fragments for drug targets, it can be used forlead identification and lead optimization.

Acknowledgments This work was supported by a grant from theNational High Technology Research and Development Program ofChina (863 Program) (No. 2012AA020307), the Guangdong Recruit-ment Program of Creative Research Groups (No. 2009010058), theNational Natural Science Foundation of China (No. 81173470), andthe Special Funding Program for the National Supercomputer Centerin Guangzhou (2012Y2-00048/2013Y2-00045, 201200000037).

References

1. Kar S, Roy K (2013) How far can virtual screening take us indrug discovery? Expert Opin Drug Dis 8:245–261. doi:10.1517/17460441.2013.761204

2. Lavecchia A, Di Giovanni C (2013) Virtual screening strategies indrug discovery: a critical review. Curr Med Chem 20:2839–2860.doi:10.2174/09298673113209990001

3. Zhu T, Cao S, Su PC, Patel R, Shah D, Chokshi HB, Szukala R,Johnson ME, Hevener KE (2013) Hit identification and optimiza-tion in virtual screening: practical recommendations based on a crit-ical literature analysis. J Med Chem 56:6560–6572. doi:10.1021/jm301916b

4. Varnek A, Baskin I (2012) Machine learning methods for prop-erty prediction in chemoinformatics: quo vadis? J Chem Inf Model52:1413–1437. doi:10.1021/ci200409x

5. Fukunishi Y (2009) Structure-based drug screening and ligand-based drug screening with machine learning. Comb Chem High TScr 12:397–408. doi:10.2174/138620709788167890

6. Plewczynski D, Spieser SAH, Koch U (2006) Assessing differentclassification methods for virtual screening. J Chem Inf Model46:1098–1106. doi:10.1021/ci050519k

7. Chen B, Harrison RF, Papadatos G, Willett P, Wood DJ, LewellXQ, Greenidge P, Stiefl N (2007) Evaluation of machine-learningmethods for ligand-based virtual screening. J Comput Aided MolDes 21:53–62. doi:10.1007/s10822-006-9096-5

123

Mol Divers (2014) 18:829–840 839

8. Xia X, Maliski EG, Gallant P, Rogers D (2004) Classification ofkinase inhibitors using a Bayesian model. J Med Chem 47:4463–4470. doi:10.1021/jm0303195

9. Mussa HY, Mitchell JB, Glen RC (2013) Full ”Laplacianised” pos-terior naive Bayesian algorithm. J Cheminform 5:37. doi:10.1186/1758-2946-5-37

10. Nidhi Glick M, Davies JW, Jenkins JL (2006) Prediction of bio-logical targets for compounds using multiple-category Bayesianmodels trained on chemogenomics databases. J Chem Inf Model46:1124–1133. doi:10.1021/ci060003g

11. Nigsch F, Bender A, Jenkins JL, Mitchell JB (2008) Ligand-targetprediction using Winnow and naive Bayesian algorithms and theimplications of overall performance statistics. J Chem Inf Model48:2313–2325. doi:10.1021/ci800079x

12. Sun HM (2006) An accurate and interpretable Bayesian classi-fication model for prediction of hERG liability. ChemMedChem1:315–322. doi:10.1002/cmdc.200500047

13. Prathipati P, Ma NL, Keller TH (2008) Global Bayesian modelsfor the prioritization of antitubercular agents. J Chem Inf Model48:2362–2370. doi:10.1021/ci800143n

14. Broccatelli F (2012) QSAR models for P-glycoprotein transportbased on a highly consistent data set. J Chem Inf Model 52:2462–2470. doi:10.1021/ci3002809

15. Klon AE (2009) Bayesian modeling in virtual high throughputscreening. Comb Chem High Throughput Screen 12:469–483.doi:10.2174/138620709788489046

16. Fang JS, Yang RY, Gao L, Zhou D, Yang SQ, Liu AL, Du GH (2013)Predictions of BuChE inhibitors using support vector machine andNaive Bayesian classification techniques in drug discovery. J ChemInf Model 53:3009–3020. doi:10.1021/ci400331p

17. Martins IF, Teixeira AL, Pinheiro L, Falcao AO (2012) A Bayesianapproach to in silico blood-brain barrier penetration modeling. JChem Inf Model 52:1686–1697. doi:10.1021/ci300124c

18. Xu C, Cheng F, Chen L, Du Z, Li W, Liu G, Lee PW, Tang Y (2012)In silico prediction of chemical Ames mutagenicity. J Chem InfModel 52:2840–2847. doi:10.1021/ci300400a

19. Tiikkainen P, Bellis L, Light Y, Franke L (2013) Estimating errorrates in bioactivity databases. J Chem Inf Model 53:2499–2505.doi:10.1021/ci400099q

20. Glick M, Jenkins JL, Nettles JH, Hitchings H, Davies JW (2006)Enrichment of high-throughput screening data with increasing lev-els of noise using support vector machines, recursive partition-ing, and Laplacian-modified naive Bayesian classifiers. J Chem InfModel 46:193–200. doi:10.1021/ci050374h

21. Marx V (2013) The big challenges of big data. Nature 498:255–260. doi:10.1038/498255a

22. Lynch C (2008) Big data: how do your data grow? Nature 455:28–29. doi:10.1038/455028a

23. Chute CG, Ullman-Cullere M, Wood GM, Lin SM, He M, PathakJ (2013) Some experiences and opportunities for big data in trans-lational research. Genet Med 15:802–809. doi:10.1038/gim.2013.121

24. Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG(2012) ZINC: a free tool to discover chemistry for biology. J ChemInf Model 52:1757–1768. doi:10.1021/ci3001277

25. Gaulton A, Bellis LJ, Bento AP, Chambers J, Davies M, Hersey A,Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, OveringtonJP (2012) ChEMBL: a large-scale bioactivity database for drugdiscovery. Nucleic Acids Res 40:D1100–D1107. doi:10.1093/nar/gkr777

26. Patiny L, Borel A (2013) ChemCalc: a building block for tomor-row’s chemical infrastructure. J Chem Inf Model 53:1223–1228.doi:10.1021/ci300563h

27. Kuhne R, Ebert RU, Schuurmann G (2009) Chemical domain ofQSAR models from atom-centered fragments. J Chem Inf Model49:2660–2669. doi:10.1021/ci900313u

28. Yan X, Gu Q, Lu F, Li J, Xu J (2012) GSA: a GPU-acceleratedstructure similarity algorithm and its application in progres-sive virtual screening. Mol Divers 16:759–769. doi:10.1007/s11030-012-9403-0

29. Xu J (2002) A new approach to finding natural chemical structureclasses. J Med Chem 45:5311–5320. doi:10.1021/jm010520k

30. Newman DJ, Cragg GM (2012) Natural products as sources of newdrugs over the 30 years from 1981 to 2010. J Nat Prod 75:311–335.doi:10.1021/np200906s

31. Cragg GM, Newman DJ (2013) Natural products: a continuingsource of novel drug leads. Bba-Gen Subj 1830:3670–3695. doi:10.1016/j.bbagen.2013.02.008

32. Chen SK, Zhao P, Shao YX, Li Z, Zhang CX, Liu PQ, He XX,Luo HB, Hu XP (2012) Moracin M from Morus alba L. is anatural phosphodiesterase-4 inhibitor. Bioorganic Med Chem Lett22:3261–3264. doi:10.1016/j.bmcl.2012.03.026

33. Liu YN, Huang YY, Bao JM, Cai YH, Guo YQ, Liu SN, LuoHB, Yin S (2014) Natural phosphodiesterase-4 (PDE4) inhibitorsfrom Crotalaria ferruginea. Fitoterapia 94:177–182. doi:10.1016/j.fitote.2014.02.010

34. Lin TT, Huang YY, Tang GH, Cheng ZB, Liu X, Luo HB,Yin S (2014) Prenylated coumarins: natural phosphodiesterase-4inhibitors from toddalia asiatica. J Nat Prod 77:955–962. doi:10.1021/np401040d

35. Sun ZH, Cai YH, Fan CQ, Tang GH, Luo HB, Yin S (2014) Sixnew tetraprenylated alkaloids from the South China Sea gorgonianEchinogorgia pseudossapo. Mar Drugs 12:672–681. doi:10.3390/md12020672

36. Negri A, Rives ML, Caspers MJ, Prisinzano TE, Javitch JA, FilizolaM (2013) Discovery of a novel selective kappa-opioid receptoragonist using crystal structure-based virtual screening. J Chem InfModel 53:521–526. doi:10.1021/ci400019t

37. Pauli I, dos Santos RN, Rostirolla DC, Martinelli LK, DucatiRG, Timmers LF, Basso LA, Santos DS, Guido RV, Andricop-ulo AD, Norberto de Souza O (2013) Discovery of new inhibitorsof Mycobacterium tuberculosis InhA enzyme using virtual screen-ing and a 3D-pharmacophore-based approach. J Chem Inf Model53:2390–2401. doi:10.1021/ci400202t

38. Wang L, Gu Q, Zheng X, Ye J, Liu Z, Li J, Hu X, Hagler A,Xu J (2013) Discovery of new selective human aldose reduc-tase inhibitors through virtual screening multiple binding pocketconformations. J Chem Inf Model 53:2409–2422. doi:10.1021/ci400322j

39. Kumar A, Ito A, Hirohama M, Yoshida M, Zhang KY (2013) Identi-fication of sumoylation activating enzyme 1 inhibitors by structure-based virtual screening. J Chem Inf Model 53:809–820. doi:10.1021/ci300618e

40. Xu L, Zhou S, Yu K, Gao B, Jiang H, Zhen X, Fu W (2013)Molecular modeling of the 3D structure of 5-HT(1A)R: discoveryof novel 5-HT(1A)R agonists via dynamic pharmacophore-basedvirtual screening. J Chem Inf Model 53:3202–3211. doi:10.1021/ci400481p

41. Langdon SR, Westwood IM, van Montfort RL, Brown N, Blagg J(2013) Scaffold-focused virtual screening: prospective applicationto the discovery of TTK inhibitors. J Chem Inf Model 53:1100–1112. doi:10.1021/ci400100c

42. Luo M, Wang XS, Roth BL, Golbraikh A, Tropsha A (2014) Appli-cation of quantitative structure-activity relationship models of 5-HT1A receptor binding to virtual screening identifies novel andpotent 5-HT1A ligands. J Chem Inf Model 54:634–647. doi:10.1021/ci400460q

43. Temml V, Voss CV, Dirsch VM, Schuster D (2014) Discovery ofnew liver X receptor agonists by pharmacophore modeling andshape-based virtual screening. J Chem Inf Model 54:367–371.doi:10.1021/ci400682b

123

840 Mol Divers (2014) 18:829–840

44. Caroli A, Ballante F, Wickersham RB 3rd, Corelli F, RagnoR (2014) Hsp90 inhibitors, part 2: combining ligand-based andstructure-based approaches for virtual screening application. JChem Inf Model 54:970–977. doi:10.1021/ci400760a

45. Gabrielsen M, Kurczab R, Siwek A, Wolak M, Ravna AW, Kris-tiansen K, Kufareva I, Abagyan R, Nowak G, Chilmonczyk Z, SylteI, Bojarski AJ (2014) Identification of novel serotonin transportercompounds by virtual screening. J Chem Inf Model 54:933–943.doi:10.1021/ci400742s

46. Therese PJ, Manvar D, Kondepudi S, Battu MB, Sriram D, Basu A,Yogeeswari P, Kaushik-Basu N (2014) Multiple e-pharmacophoremodeling, 3D-QSAR, and high-throughput virtual screening ofhepatitis C virus NS5B polymerase inhibitors. J Chem Inf Model54:539–552. doi:10.1021/ci400644r

47. Li Z, Cai YH, Cheng YK, Lu X, Shao YX, Li X, Liu M, Liu P, LuoHB (2013) Identification of novel phosphodiesterase-4D inhibitorsprescreened by molecular dynamics-augmented modeling and val-idated by bioassay. J Chem Inf Model 53:972–981. doi:10.1021/ci400063s

48. Klekota J, Roth FP (2008) Chemical substructures that enrichfor biological activity. Bioinformatics 24:2518–2525. doi:10.1093/bioinformatics/btn479

49. Costantino L, Barlocco D (2006) Privileged structures as leadsin medicinal chemistry. Curr Med Chem 13:65–85. doi:10.2174/092986706775197999

50. Welsch ME, Snyder SA, Stockwell BR (2010) Privileged scaf-folds for library design and drug discovery. Curr Opin Chem Biol14:347–361. doi:10.1016/j.cbpa.2010.02.018

123