R & CDK: A Sturdy Platform in the Oceans of Chemical Data
Rajarshi Guha
NIH Center for Translational Therapeutics
5th May, 2011
EBI, Hinxton
Background
- Cheminformatics methods since 2003
  - QSAR, diversity analysis, virtual screening, fragments, polypharmacology, networks
- More recently
  - RNAi screening, high content screening
- Extensive use of machine learning
- All tied together with software development
  - User-facing GUI tools
  - Low level programmatic libraries
  - Core developer for the CDK
- Believer and practitioner of Open Source
Why Cheminformatics in R?
- In contrast to bioinformatics (cf. Bioconductor), there is not a whole lot of cheminformatics support for R
- Packages relevant to cheminformatics and chemistry include
  - rcdk, rpubchem, fingerprint
  - bio3d, ChemmineR
- A lot of cheminformatics employs various forms of statistics and machine learning, and R is exactly the environment for that
- We just need to add some chemistry capabilities to it
A much more detailed tutorial on R & cheminformatics was presented at the EBI in 2010
Motivations
- Much of cheminformatics is data modeling and mining
- But the numeric data is derived from chemical structure
- Thus we want to work with
  - molecules and their parts
  - files containing molecules
  - databases of molecules
What is R?
- R is an environment for modeling
  - Contains many prepackaged statistical and mathematical functions
  - Often no need to implement standard methods yourself
- R is a matrix programming language that is good for statistical computing
  - Full fledged, interpreted language
  - Well integrated with statistical functionality
  - More details later
What is R?
- It is possible to use R just for modeling
  - Avoids programming, preferably use a GUI
  - Load data → build model → plot data
- But you can also get much more creative
  - Scripts to process multiple data files
  - Ensemble modeling using different types of models
  - Implement brand new algorithms
- R is good for prototyping algorithms
  - Interpreted, so immediate results
  - Good support for vectorization
    - Faster than explicit loops
    - Analogous to map in Python and Lisp
- Most times, interpreted R is fine, but you can easily integrate C code
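The vectorization point can be illustrated with a minimal base R sketch. It is not from the original slides, and the data and variable names are made up for illustration. The same squared values are computed with an explicit loop, with sapply (the map-like form), and with a fully vectorized expression.

```r
# Illustrative data: a plain numeric vector
x <- c(1.5, 2.0, 3.5, 4.0)

# 1. Explicit loop: fill a preallocated result vector element by element
sq_loop <- numeric(length(x))
for (i in seq_along(x)) {
  sq_loop[i] <- x[i]^2
}

# 2. Map-style: apply a function to each element (cf. map in Python or Lisp)
sq_map <- sapply(x, function(v) v^2)

# 3. Vectorized: arithmetic operators already work on whole vectors
sq_vec <- x^2

# All three approaches give the same values
print(sq_vec)
```

The vectorized form is usually both the shortest and the fastest, since the loop runs in compiled code inside R.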
What is R?
- R integrates with other languages
  - C code can be linked to R and C can also call R functions
  - Java code can be called from R and vice versa. See various packages at rosuda.org
  - Python can be used in R and vice versa using Rpy
- R has excellent support for publication quality graphics
  - See R Graph Gallery for an idea of the graphing capabilities
  - But graphing in R does have a learning curve
- A variety of graphs can be generated
  - 2D plots: scatter, bar, pie, box, violin, parallel coordinate
  - 3D plots: OpenGL support is available
Parallel R
- R itself is not multi-threaded
  - Well suited for embarrassingly parallel problems
- Even then, a number of “large data” problems are not tractable
  - Recent developments on integrating R and Hadoop address this
  - See the RHIPE package
- snow allows distribution of processing on the same machine (multiple CPUs) or multiple machines
  - But see snowfall for a nice set of wrappers around snow
- Also see multicore for a package that focuses on parallel processing on multicore CPUs
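As a hedged illustration of an embarrassingly parallel task, the sketch below uses the parallel package, which ships with current R releases and offers functionality derived from snow and multicore. Its use here is an assumption rather than something shown in the slides, and the toy function and inputs are hypothetical.

```r
library(parallel)

# Toy task (hypothetical): an independent computation per input value
slow_square <- function(v) {
  v^2
}

inputs <- 1:8

# Forking is unavailable on Windows, so fall back to a single worker there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# mclapply is a parallel counterpart of lapply based on forked processes
res_par <- mclapply(inputs, slow_square, mc.cores = n_cores)

# The serial result, for comparison
res_ser <- lapply(inputs, slow_square)

print(identical(res_par, res_ser))
```

Because each input is processed independently, no communication between workers is needed, which is what makes this kind of problem a good fit for R.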
R and databases
- Bindings to a variety of databases are available
  - Mainly RDBMSs but some NoSQL databases are being interfaced
- The R DBI spec lets you write code that is portable over databases
- Note that loading multiple database packages can lead to problems
  - This can happen even when you don’t explicitly load a database package
  - Some Bioconductor packages load the RSQLite package as a dependency, which can interfere with, say, ROracle
Why use a database?
- Don’t have to load bulk CSV or .Rda files each time we start work
- Can index data in RDBMSs so queries can be very fast
- Good way to exchange data between applications (as opposed to .Rda files, which are only useful between R users)
Using the CDK in R
- Based on the rJava package
- Two R packages to install (not counting the dependencies)
- Provides access to a variety of CDK classes and methods
- Idiomatic R
[Diagram: the R programming environment, with rJava linking to the CDK and Jmol, and the packages rcdk, XML, rpubchem and fingerprint]
Accessibility & usability
- Plain R is not necessarily the most “usable” platform
- So rcdk doesn’t really satisfy usability for complete R newbies
- But, if you know R, installation is trivial
> install.packages('rcdk', dependencies=TRUE)
- R specifies a documentation format
- Most packages have quite good documentation, rcdk is no exception
- A tutorial is also available from within R, in addition to the function docs
> vignette('rcdk') # read tutorial
> ls('package:rcdk') # list functions
> ?load.molecules # get help on a function
Reading in data
- The CDK supports a variety of file formats
- rcdk loads all recognized formats, automatically
- Data can be local or remote
mols <- load.molecules( c("data/io/set1.sdf",
                          "data/io/set2.smi",
                          "http://rguha.net/rcdk/remote.sdf"))
- Gives you a list of Java references representing IAtomContainer objects
- For large SDFs use an iterating reader
- Can’t do much with these objects, except via rcdk functions
Writing molecules
- Currently only SDF is supported as an output file format
- By default a multi-molecule SDF will be written
- Properties are not written out as SD tags by default
smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")
mols <- sapply(smis, parse.smiles)
## all molecules in a single file
write.molecules(mols, filename="mols.sdf")
## ensure molecule data is written out
write.molecules(mols, filename="mols.sdf", write.props=TRUE)
## molecules in individual files
write.molecules(mols, filename="mols.sdf", together=FALSE)
Working with molecules
- Currently you can access atoms, bonds, get certain atom properties, 2D/3D coordinates
- Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand
Working with atoms
- Simple elemental analysis
- Identifying flat molecules
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
## elemental analysis
syms <- unlist(lapply(atoms, get.symbol))
round( table(syms)/sum(table(syms)) * 100, 2)
## is the molecule flat?
coords <- do.call("rbind", lapply(atoms, get.point3d))
any(apply(coords, 2, function(x) length(unique(x)) == 1))
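The logic of the flatness test can be tried without rcdk on a hand-made coordinate matrix. The sketch below is an illustration only: the coordinates are invented and the helper name is_axis_flat is hypothetical. It uses the same expression as above, which reports a molecule as flat when one coordinate column holds a single repeated value.

```r
# Hypothetical 3D coordinates: one row per atom, columns are x, y, z
flat_coords <- rbind(c(0.0, 0.0, 0.0),
                     c(1.4, 0.0, 0.0),
                     c(2.1, 1.2, 0.0),
                     c(0.7, 2.4, 0.0))
bent_coords <- flat_coords
bent_coords[4, 3] <- 0.9   # lift one atom out of the plane

# TRUE if any coordinate column holds a single repeated value
is_axis_flat <- function(coords) {
  any(apply(coords, 2, function(x) length(unique(x)) == 1))
}

print(is_axis_flat(flat_coords))
print(is_axis_flat(bent_coords))
```

Note that this check only detects molecules lying in a plane parallel to a pair of coordinate axes (for example 2D coordinates with z = 0); a planar molecule in an arbitrary orientation would not be flagged.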
SMARTS matching
- rcdk supports substructure searches with SMARTS or SMILES
- May not be practical for large collections of molecules because of memory requirements
mols <- sapply(c("CC(C)(C)C",
"c1ccc(Cl)cc1C(=O)O",
"CCC(N)(N)CC"), parse.smiles)
query <- "[#6D2]"
hits <- matches(query, mols)
> print(hits)
CC(C)(C)C c1ccc(Cl)cc1C(=O)O CCC(N)(N)CC
FALSE TRUE TRUE
Visualization
- rcdk supports visualization of 2D structure images in two ways
- First, you can bring up a Swing window
- Second, you can obtain the depiction as a raster image
- Doesn’t work on OS X
mols <- load.molecules("data/dhfr_3d.sd")
## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])
## view a table of molecules
view.molecule.2d(mols[1:10])
Visualization
- The Swing window is a little heavyweight
- It’d be handy to be able to annotate plots with structures
- Or even just make a panel of images that could be saved to a PNG file
- We can make use of rasterImage and rcdk
- As with the Swing window, this won’t work on OS X
m <- parse.smiles("c1ccccc1C(=O)NC")
img <- view.image.2d(m, 200,200)
## start a plot
plot(1:10, 1:10, pch=19)
## overlay the structure
rasterImage(img, 1,8, 3,10)
Molecular descriptors
- Numerical representations of chemical structure features
- Can be based on
  - connectivity
  - 3D coordinates
  - electronic properties
  - combination of the above
- Many descriptors are described and implemented in various forms
- The CDK implements 50 descriptor classes, resulting in ≈ 300 individual descriptor values for a given molecule
Descriptor caveats
- Not all descriptors are optimized for speed
- Some of the topological descriptors employ graph isomorphism, which makes them slow on large molecules
- In general, to ensure that we end up with a rectangular descriptor matrix, calculation failures do not raise exceptions
- Instead, descriptor calculation failures return NA
CDK Descriptor Classes
- The CDK provides 3 packages for descriptor calculations
  - org.openscience.cdk.qsar.descriptors.molecular
  - org.openscience.cdk.qsar.descriptors.atomic
  - org.openscience.cdk.qsar.descriptors.bond
- rcdk only supports molecular descriptors
- Each descriptor is also described by an ontology
  - For rcdk this is used to classify descriptors into groups
Descriptor calculations
- Can evaluate a single descriptor or all available descriptors
- If a descriptor cannot be calculated, NA is returned (so no exceptions thrown)
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
- Just evaluate topological descriptors
- descs will be a data.frame which can then be used in any of the modeling functions
- Column names are the descriptor names provided by the CDK
The QSAR workflow
- Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets
- With the numeric data in hand, we can proceed to modeling
- Before building predictive models, we’d probably explore the dataset
  - Normality of the dependent variable
  - Correlations between descriptors and dependent variable
  - Similarity of subsets
- Then we can go wild and build all the models that R supports
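The exploration steps can be sketched in base R on synthetic data. Everything below is an assumption made for illustration: the descriptor table, the dependent variable and the cleaning rules (drop descriptors containing NA, drop constant descriptors) are not taken from the slides.

```r
set.seed(42)

# Hypothetical descriptor table: rows are molecules, columns are descriptors
descs <- data.frame(d1 = rnorm(50),
                    d2 = rnorm(50),
                    d3 = rep(1, 50),          # constant, carries no information
                    d4 = rep(NA_real_, 50))   # stands in for a failed calculation

# Hypothetical dependent variable, partly driven by d1
y <- 2 * descs$d1 + rnorm(50, sd = 0.5)

# Drop descriptors with any missing value
has_na <- apply(descs, 2, function(x) any(is.na(x)))
descs <- descs[, !has_na, drop = FALSE]

# Drop constant descriptors
varies <- apply(descs, 2, function(x) length(unique(x)) > 1)
descs <- descs[, varies, drop = FALSE]

# Normality of the dependent variable (Shapiro-Wilk test p-value)
print(shapiro.test(y)$p.value)

# Correlation of each remaining descriptor with the dependent variable
print(sapply(descs, function(x) cor(x, y)))
```

With a real descriptor data.frame from eval.desc, the same filtering would be applied before passing the data to a modeling function.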
Accessing fingerprints
- CDK provides several fingerprints
  - Path-based, MACCS, E-State, PubChem
- Access them via get.fingerprint(...)
- Works on one molecule at a time, use lapply to process a list of molecules
- This method works with the fingerprint package
  - Separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE)
  - Uses C to perform similarity calculations
  - Lots of similarity and dissimilarity metrics available
Accessing fingerprints
mols <- load.molecules("data/dhfr_3d.sd")
## get a single fingerprint
fp <- get.fingerprint(mols[[1]], type="maccs")
## process a list of molecules
fplist <- lapply(mols, get.fingerprint, type="maccs")
## or read from file
fplist <- fp.read("data/fp/fp.data", size=1052,
lf=bci.lf, header=TRUE)
- Easy to support new line-oriented fingerprint formats by providing your own line parsing function (e.g., bci.lf)
- See the fingerprint package man pages for more details
Similarity metrics
- The fingerprint package implements 28 similarity and dissimilarity metrics
- All accessed via the distance function
- Implemented in C, but still, large similarity matrix calculations are not a good idea!
## similarity between 2 individual fingerprints
distance(fplist[[1]], fplist[[2]], method="tanimoto")
distance(fplist[[1]], fplist[[2]], method="mt")
## similarity matrix - compare similarity distributions
m1 <- fp.sim.matrix(fplist, method="tanimoto")
m2 <- fp.sim.matrix(fplist, method="carbo")
par(mfrow=c(1,2))
hist(m1, xlim=c(0,1))
hist(m2, xlim=c(0,1))
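For readers unfamiliar with the Tanimoto metric, here is a minimal base R sketch of the calculation. It is not the fingerprint package's implementation; representing a fingerprint as a vector of "on" bit positions, and the function name tanimoto, are assumptions made for illustration.

```r
# A fingerprint is represented here (hypothetically) by the positions of its "on" bits
fp_a <- c(1, 4, 7, 9, 12)
fp_b <- c(1, 4, 8, 9, 15, 20)

# Tanimoto similarity: number of shared on-bits divided by the number of
# bits that are on in either fingerprint
tanimoto <- function(a, b) {
  n_common <- length(intersect(a, b))
  n_union  <- length(union(a, b))
  if (n_union == 0) return(NA_real_)  # undefined when both fingerprints are empty
  n_common / n_union
}

sim <- tanimoto(fp_a, fp_b)
print(sim)
```

Here three bits (1, 4, 9) are shared out of eight distinct on-bits, so the similarity is 3/8.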
Comparing datasets with fingerprints
- We can compare datasets based on fingerprints
- Rather than perform pairwise comparisons, we evaluate the normalized occurrence of each bit, across the dataset
- Gives us an n-D vector, the “bit spectrum”
bitspec <- bit.spectrum(fplist)
plot(bitspec, type="l")
Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
Bit spectrum
[Figure: bit spectrum; x-axis: Bit Position (0 to 150), y-axis: Frequency (0.0 to 1.0)]
- Only makes sense with structural key type fingerprints
- Longer fingerprints give better resolution
- Comparing bit spectra, via any distance metric, allows us to compare datasets in O(n) time, rather than O(n^2) for a pairwise approach
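A minimal base R sketch of the idea, assuming fingerprints are stored as rows of a 0/1 matrix (an assumption for illustration; the fingerprint package uses its own objects, and the helper below is not its bit.spectrum function). Each spectrum needs one pass over its dataset, and the two datasets are then compared with a single vector distance.

```r
# Hypothetical binary fingerprint matrices: rows are molecules, columns are bit positions
set_a <- rbind(c(1, 0, 1, 1),
               c(1, 0, 0, 1),
               c(1, 1, 0, 1))
set_b <- rbind(c(0, 1, 1, 0),
               c(0, 1, 0, 0))

# Bit spectrum: fraction of molecules in the dataset that set each bit
bit_spectrum_sketch <- function(fpmat) colMeans(fpmat)

spec_a <- bit_spectrum_sketch(set_a)
spec_b <- bit_spectrum_sketch(set_b)

# Compare the two datasets with one vector distance (Euclidean here)
d <- sqrt(sum((spec_a - spec_b)^2))

print(spec_a)
print(spec_b)
print(d)
```

Any other vector distance could be substituted for the Euclidean one; the cost stays linear in the number of molecules because no molecule is ever compared with another molecule.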
GSK & ONS Ugi datasets
- A 117K member virtual library of Ugi products was the basis of an ONS project looking for anti-malarials (Jean-Claude Bradley, Drexel University)
- GSK recently published their anti-malarial screening dataset (13K compounds)
- How do the two data sets compare?
GSK & ONS Ugi datasets
- A little easier to identify differences if we take the “difference spectrum”
Comparing fingerprint performance
- Various studies comparing virtual screening methods
- Generally, metric of success is how many actives are retrieved in the top n% of the database
- Can be measured using ROC, enrichment factor, etc.
- Exercise: evaluate performance of CDK fingerprints using enrichment factors
  - Load active and decoy molecules
  - Evaluate fingerprints
  - For each active, evaluate similarity to all other molecules (active and inactive)
  - For each active, determine enrichment at a given percentage of the database screened
For a given query molecule, order dataset by decreasing similarity, look at the top 10% and determine fraction of actives in that top 10%
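The last step of the exercise can be sketched in base R. The similarity values and activity labels below are invented, and the function name enrichment_factor is hypothetical. The sketch reports the enrichment factor in its common form: the fraction of actives in the top slice divided by the fraction of actives in the whole dataset.

```r
# Hypothetical screening data for one query: 20 molecules, 4 of them active
sims <- c(0.91, 0.85, 0.80, 0.74, 0.70, 0.66, 0.61, 0.59, 0.55, 0.52,
          0.48, 0.45, 0.41, 0.38, 0.35, 0.31, 0.28, 0.24, 0.20, 0.15)
is_active <- rep(FALSE, 20)
is_active[c(1, 2, 7, 15)] <- TRUE

enrichment_factor <- function(sims, is_active, frac = 0.10) {
  # size of the top slice, at least one molecule
  n_top <- max(1, ceiling(frac * length(sims)))
  # rank molecules by decreasing similarity to the query
  top <- order(sims, decreasing = TRUE)[seq_len(n_top)]
  hit_rate_top <- mean(is_active[top])   # fraction of actives in the top slice
  hit_rate_all <- mean(is_active)        # fraction of actives overall
  hit_rate_top / hit_rate_all
}

ef10 <- enrichment_factor(sims, is_active, frac = 0.10)
print(ef10)
```

In a real run, sims would come from a similarity calculation between the query fingerprint and every other fingerprint, and the function would be applied once per active query.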
Comparing fingerprint performance
- A good dataset to test this out is the Maximum Unbiased Validation datasets by Rohrer & Baumann
- Derived from 17 PubChem bioassay datasets and designed to avoid analog bias and artificial enrichment
- As a result, 2D fingerprints generally show poor performance on these datasets (by design)
- A comparison of various fingerprints using two of these datasets is available online
Rohrer, S.G. et al., J. Chem. Inf. Model., 2009, 49, 169–184
Good, A. & Oprea, T., J. Comp. Aid. Molec. Des., 2008, 22, 169–178
Verdonk, M.L. et al., J. Chem. Inf. Model., 2004, 44, 793–806
Comparing fingerprint performance
[Figure: Similarity (0.0 to 1.0) plotted for Query Molecules 1 to 30]
Fragment based analysis
- Fragment based analysis can be a useful alternative to clustering, especially for large datasets
- Useful for identifying interesting series
- Many fragmentation schemes are available
  - Exhaustive
  - Rings and ring assemblies
  - Murcko
- The CDK supports fragmentation (still needs work) into Murcko frameworks and ring systems
Getting fragments
- Access to exhaustive and Murcko fragmentation schemes
- Exhaustive fragmentation can take a long time in some cases
- Both have several parameters that allow us to filter fragments
mol <- parse.smiles(
"c1cc(c(cc1c2c(nc(nc2CC)N)N)[N+](=O)[O-])NCc3ccc(cc3)C(=O)N4CCCCC4")
mfrags <- get.murcko.fragments(mol)
xfrags <- get.exhaustive.fragments(mol)
Doing stuff with fragments
- Look at frequency of occurrence of fragments
- Pseudo-cluster a dataset based on fragments
- Compound selection based on fragment membership
- Develop predictive models on fragment members, looking for local SAR
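The first three ideas reduce to simple table operations once each molecule is paired with a fragment SMILES. The base R sketch below uses an invented lookup table (molecule IDs and fragment strings are hypothetical) to show frequency counting, pseudo-clustering and selection.

```r
# Hypothetical lookup table: molecule ID and the SMILES of a framework it contains
ftable <- data.frame(
  mid  = c("m1", "m2", "m3", "m4", "m5", "m6"),
  frag = c("c1ccccc1", "c1ccncc1", "c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"),
  stringsAsFactors = FALSE
)

# Frequency of occurrence of each fragment, most common first
frag_freq <- sort(table(ftable$frag), decreasing = TRUE)
print(frag_freq)

# Pseudo-clusters: molecule IDs grouped by the fragment they share
clusters <- split(ftable$mid, ftable$frag)
print(clusters)

# Compound selection: members of one fragment class
selected <- clusters[["c1ccncc1"]]
print(selected)
```

A molecule with several frameworks would simply appear in several rows of the table and therefore in several pseudo-clusters.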
Fragments & kinase activities
- Consider the Abbott kinase dataset (Metz et al)
- ≈ 1500 structures, 172 targets
- Slice and dice activities based on Murcko framework membership
frags <- lapply(mols, get.murcko.fragments)
fworks <- lapply(frags, function(x) x[[1]]$frameworks)
frag.freq <- data.frame(table(unlist(fworks)))
[Figure: Frequency (0 to 150) plotted against Fragment ID (0 to 1500)]
Fragments & kinase activities
- Explore activity data on a fragment-wise basis
- Compare activity distributions by targets
## build a look up table (frag SMILES -> molecule ID)
ftable <- do.call('rbind', mapply(function(x,y) {
  if (length(y) == 0) y <- NA
  data.frame(mid=x, frag=y)
}, names(fworks), fworks, SIMPLIFY=FALSE))
rownames(ftable) <- NULL
ftable <- subset(ftable, !is.na(frag))
ftable$frag <- as.character(ftable$frag)
## identify molecules containing a fragment
query <- 'c1nccc(n1)c3c[nH]c2ncccc23'
values <- subset(ftable, frag == query)
## activity columns (15 to 186) for the molecules that contain the fragment
depvs <- subset(abbot, PUBCHEM_SID %in% values$mid)[, 15:186]
Matched molecular pairs
- Inspired by Greg’s 1-line SQL query
- But performed over 172 kinase targets
- Slower, especially the similarity matrix calculation
## load molecules and get similarity matrix
mols <- load.molecules('abbot.smi')
fpsims <- fp.sim.matrix(lapply(mols, get.fingerprint, 'extended'))
## identify similar pairs
idxs <- which(fpsims > 0.95, arr.ind=TRUE)
## keep each pair once and ignore diagonal elements
idxs <- idxs[ idxs[,1] > idxs[,2], , drop=FALSE]
## evaluate activity differences; differences below 1 become NA
mps <- t(apply(idxs, 1, function(x) {
  apply(depvs, 2, function(z) {
    d <- abs(z[x[1]] - z[x[2]])
    ifelse(d >= 1, d, NA)
  })
}))
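The pair selection step can be checked on a toy example in base R. The similarity matrix and activity values below are invented for illustration, and only a single target is considered; the indexing is the same as in the code above.

```r
# Hypothetical symmetric similarity matrix for four molecules
fpsims <- matrix(c(1.00, 0.97, 0.40, 0.30,
                   0.97, 1.00, 0.50, 0.96,
                   0.40, 0.50, 1.00, 0.20,
                   0.30, 0.96, 0.20, 1.00), nrow = 4, byrow = TRUE)

# Hypothetical activities (e.g. pIC50) of the same molecules against one target
activity <- c(5.3, 8.6, 6.0, 8.5)

# All (row, column) positions above the similarity threshold
idxs <- which(fpsims > 0.95, arr.ind = TRUE)

# Keep the lower triangle only: drops the diagonal and mirrored duplicates
idxs <- idxs[idxs[, 1] > idxs[, 2], , drop = FALSE]

# Absolute activity difference for each similar pair
delta <- abs(activity[idxs[, 1]] - activity[idxs[, 2]])
mmp <- data.frame(i = idxs[, 1], j = idxs[, 2], delta = delta)
print(mmp)
```

Pairs with high similarity and a large activity difference (molecules 2 and 1 here) are the interesting ones; pairs such as 4 and 2 are similar in both structure and activity.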
Matched molecular pairs
[Figure: example matched molecular pairs, labelled pIC50 = 5.3 and pIC50 = 8.6 for one pair, pIC50 = 5.4 and pIC50 = 8.5 for the other]
Future developments
- One current drawback of the package is that most cheminformatics operations cannot be parallelized
  - Many objects are Java refs so can’t be shared
  - Many CDK methods are not threadsafe
- Data table and depictions
- Streamline I/O and molecule configuration
- Add more atom and bond level operations
- Convert from jobjRef to S4 objects and vice versa
  - Would allow for serialization of CDK data classes
  - Is it worth the effort?
Acknowledgements
- rcdk
  - Miguel Rojas
  - Ranke Johannes
- CDK
  - Egon Willighagen
  - Christoph Steinbeck
  - . . .