Improved Predictions in Structure-Based Drug Design Using CART and Bayesian Models

Donovan N. Chin & R. Aldrin Denny

Traditional Drug Discovery (insert graph)

In Silico Prediction of ADME (insert graph)

◦ Potency

◦ Absorption

◦ Lead

◦ Drug

◦ Toxicity

◦ Excretion

◦ Metabolism

◦ Distribution

Target → IVY (brute-force virtual screening of very large compound libraries) → Lead Discovery → IVY (use predictive models built from Biogen data for more efficient virtual screening) → Lead Optimization → Candidate

(insert graph)

◦ Potency

◦ Lead

◦ Drug

◦ Toxicity

◦ Excretion

◦ Metabolism

◦ Distribution

◦ Absorption

Goal: Identify the crystallographic binding mode and rank-order ligands with respect to binding to the protein

(insert graph)

Receptor Docking

Ligand Shape

Generate plausible trial binding modes using a docking function, then re-rank the modes with a scoring function

(insert graph)

341 Active

47 Non-Active

(insert graph)

After filtering by Pharmacophore Feature

(insert graph)

(insert functions for)

◦ F_Score*

◦ D_Score

◦ G_Score

◦ PMF_Score

◦ Chem_Score

◦ ICM_Score*

Cell Adhesion Assay (50% Serum)

◦ (insert graph)

Biochemical Adhesion Assay

◦ (insert graph)

Scoring Functions Are Poor More Often Than Not

Receptor Site View → Library Design → FlexXScore → Consensus Score >= 3 (e.g. Contact Map, CLogP, MW, HBOND, Rotatable bonds) → Consensus = 5? If yes, substructure exists? If yes, Pharmacophore < 4.2 Å? If yes, Publish Hit Report
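The decision cascade on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Compound` record, its field names, and the function `passes_hit_cascade` are hypothetical, the consensus score is assumed to be the number of scoring functions (0 to 5) that agree on a compound, and every "no" branch is assumed to mean "not reported", since the slide does not show those branches.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    """Hypothetical record for one docked compound (field names are assumptions)."""
    name: str
    consensus_score: int           # assumed: number of scoring functions that agree (0-5)
    has_substructure: bool         # assumed: required substructure is present
    pharmacophore_distance: float  # assumed: pharmacophore distance in angstroms


def passes_hit_cascade(c, prefilter=3, full_consensus=5, max_distance=4.2):
    """Return True only if the compound survives every check in the cascade."""
    # First gate from the slide: consensus score of at least 3.
    if c.consensus_score < prefilter:
        return False
    # Second gate: full consensus (5). This subsumes the first gate but is
    # kept separate to mirror the order of the flow chart.
    if c.consensus_score != full_consensus:
        return False
    # Third gate: the substructure must exist.
    if not c.has_substructure:
        return False
    # Final gate: pharmacophore distance strictly below 4.2 angstroms.
    return c.pharmacophore_distance < max_distance


candidates = [
    Compound("cmpd_a", 5, True, 3.9),
    Compound("cmpd_b", 4, True, 3.9),
    Compound("cmpd_c", 5, True, 4.2),
]
hit_report = [c.name for c in candidates if passes_hit_cascade(c)]
```

Keeping each gate as its own early return makes it easy to change a threshold or to add handling for a "no" branch once the real workflow is known.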

(insert graph)

Goal: Predict hit/miss class based on presence of features (fingerprints)

Method

◦ Given a set of N samples

◦ Given that some subset A of them are good ("active")

◦ Then we estimate for a new compound: $P(\text{good}) \approx A/N$

◦ Given a set of binary features, for a given feature F (N and A now count only the samples that contain F):

It appears in N samples

It appears in A good samples

Can we estimate $P(\text{good} \mid F) \approx A/N$? (Problem: the error gets worse as N becomes small)

◦ $P'(\text{good} \mid F) = \dfrac{A + P(\text{good})\,k}{N + k}$

$P'(\text{good} \mid F) \to P(\text{good})$ as $N \to 0$

$P'(\text{good} \mid F) \to A/N$ as N becomes large

◦ (If $k = 1/P(\text{good})$, this is the Laplacian correction)
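The corrected estimate on this slide is easy to check numerically. The following is a minimal Python sketch; the function name `laplacian_corrected` and the example counts are illustrative assumptions, not values from the study.

```python
def laplacian_corrected(a_good, n_total, p_good, k=None):
    """Corrected estimate P'(good | F) = (A + P(good) * k) / (N + k).

    a_good  -- A, number of good samples containing the feature
    n_total -- N, number of samples containing the feature
    p_good  -- baseline P(good) over the whole data set
    k       -- weight of the prior; defaults to 1 / P(good),
               which gives the Laplacian correction
    """
    if k is None:
        k = 1.0 / p_good
    return (a_good + p_good * k) / (n_total + k)


# Illustrative baseline (assumed): 10% of all samples are good.
baseline = 0.1

# A feature never seen before falls back to the baseline.
unseen = laplacian_corrected(0, 0, baseline)

# A feature seen once, in a good sample, is only nudged upward
# instead of jumping to the naive estimate A/N = 1.0.
rare = laplacian_corrected(1, 1, baseline)

# A frequently seen feature approaches the naive estimate A/N = 0.5.
common = laplacian_corrected(500, 1000, baseline)
```

The three cases illustrate the limits stated on the slide: with few observations the estimate stays near P(good), and with many it approaches A/N.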

Descriptors (insert)

Advantages

◦ Can describe a huge number of features (up to 4 billion; MDL 1024; Leadscope 27,000)

◦ Contains tertiary and stereochemistry information

◦ Fast

Classification Analysis

◦ Developing Non-Linear Scoring Functions to classify actives and non-actives

◦ (insert graphs)

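◦ Cost function to minimize: Gini impurity, $i(N) = 1 - \sum_j P^2(\omega_j)$, where $P(\omega_j)$ is the fraction of samples at node N that belong to class $\omega_j$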
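As a small illustration of the cost function, the Python sketch below computes the Gini impurity of a node and the size-weighted impurity of a candidate split. The function names and the toy labels are assumptions made for this example; the slides do not say which CART implementation was used.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of one node: 1 minus the sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a split."""
    n = len(left) + len(right)
    if n == 0:
        return 0.0
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Toy node (assumed data): two actives and two non-actives.
parent = ["active", "active", "non-active", "non-active"]

# A split that separates the classes perfectly removes all impurity.
perfect = split_impurity(["active", "active"], ["non-active", "non-active"])

# A split that leaves both children mixed does not improve on the parent.
useless = split_impurity(["active", "non-active"], ["active", "non-active"])
```

A tree builder in the CART style would evaluate candidate splits this way and keep the one with the lowest weighted impurity.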

Training Set Prediction Success

(insert table)

10-fold cross validation

Randomly split training and test sets
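A random 10-fold split can be sketched as follows. This is a generic illustration in plain Python, not the procedure actually used for the table; the function name `k_fold_indices`, the fixed seed, and the sample count are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Indices are shuffled once, dealt into k folds, and each fold
    serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative use with an assumed data set of 25 samples.
splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold, so every compound is predicted once by a model that never saw it during training.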

Significant Improvement in Separating Actives from Non-Actives

(insert graph)

Significant Improvement in Finding Hits Using New SF

Optimal tree identified (insert graph)

No random effects (insert graph)

(insert cluster)

Able to identify different molecular property criteria that lead to hits

(insert graph)

(insert graph)

Size = magnitude of OBA

OBA values cover range of descriptor space

(insert graph)

Choose 1D and 2D descriptors for ease of interpretation and lower "noise"

Build Model (insert graphs) Apply Model

Features found in high OBA

Features found in low OBA

It would be nice if CART offered a similar view

Developed improved scoring functions, using CART and Bayesian models, for separating hits from non-hits in structure-based drug design

Identified key differences in molecular physical properties that led to hits

Built a reasonably predictive OBA model (given the complexity of OBA, however, the method cannot be expected to extend to other systems)

Biogen IDEC

Modeling

◦ Rajiah Denny

◦ Claudio Chuaqui

◦ Juswinder Singh

◦ Herman van Vlijmen

◦ Norman Wang

◦ Anuj Patel

◦ Zhan Deng

Chemistry

◦ Kevin Guckian

◦ Dan Scott

◦ Thomas Durand-Reville

◦ Pat Conlon

◦ Charlie Hammond

◦ Chuck Jewell

Pharmacology

◦ Tonika Bonhert

Technology
