Gregory Landrum NIBR Informatics Novartis Institutes for BioMedical Research UK QSAR 2014 Open-source tools for querying and organizing large reaction databases

Open-source tools for querying and organizing large reaction databases

Presentation from the Spring 2014 UK QSAR meeting.

Page 1: Open-source tools for querying and organizing large reaction databases

Gregory Landrum NIBR Informatics

Novartis Institutes for BioMedical Research

UK QSAR 2014

Open-source tools for querying and organizing large reaction databases

Page 2: Open-source tools for querying and organizing large reaction databases

Outline

§ Public data sources and reactions

§ Handling reactions with the RDKit

§ Fingerprints for reactions

§ Validation:
• Machine learning
• Clustering

§ Application: Identifying interesting clusters of reactions

Page 3: Open-source tools for querying and organizing large reaction databases

Public data sources in cheminformatics: an aside at the beginning

Page 4: Open-source tools for querying and organizing large reaction databases

Protein data bank

the exception

• Crystal structures of proteins
• Deposition is mandatory for publishing protein crystal structures

Page 5: Open-source tools for querying and organizing large reaction databases

Pubchem

[Figure: Evolution; panels: Compounds, Assays (non-ChEMBL)]

Collection of molecules from vendors and patents together with some assay data, primarily from NIH-funded screening centers.

Page 6: Open-source tools for querying and organizing large reaction databases

ChEMBL

[Figure: Evolution; panels: Compounds, Activities; axis label: 2009]

Collection of molecules and assay data curated (primarily) from the literature

Page 7: Open-source tools for querying and organizing large reaction databases

What about how we made those molecules?

Public reaction data?

§ The literature:

§ Plenty of data locked up in large commercial databases, very very little in the open

Yan, L. et al. SAR studies of 3-arylpropionic acids as potent and selective agonists of sphingosine-1-phosphate receptor-1 (S1P1) with enhanced pharmacokinetic properties. Bioorganic & Medicinal Chemistry Letters 17, 828–831 (2007).

Page 8: Open-source tools for querying and organizing large reaction databases

An emerging area: chemical reactions

Not just what we made, but how we made it

§ Text-mining applied to open patent data to extract chemical reactions: 1.12 million reactions [1]

§ Reactions classified using namerxn, when possible, into 318 standard types: >599000 classified reactions [2]

[1] Lowe DM: “Extraction of chemical structures and reactions from the literature.” PhD thesis. University of Cambridge: Cambridge, UK; 2012.
[2] Reaction classification from Roger Sayle and Daniel Lowe (NextMove Software) http://nextmovesoftware.com/blog/2014/02/27/unleashing-over-a-million-reactions-into-the-wild/

Lots of reactions, lots of repeats

Page 9: Open-source tools for querying and organizing large reaction databases

More about the classes

Frequency of classes, revisited:

20 most common classes:

Count   Class   Name
44675   2.1.2   Carboxylic acid + amine reaction
39297   1.7.9   Williamson ether synthesis
28194   2.1.1   Amide Schotten-Baumann
26739   1.3.7   Chloro N-arylation
22400   1.6.2   Bromo N-alkylation
20465   7.1.1   Nitro to amino
20405   1.6.4   Chloro N-alkylation
17226   6.2.2   CO2H-Me deprotection
16602   6.1.1   N-Boc deprotection
16021   6.2.1   CO2H-Et deprotection
12952   1.2.1   Aldehyde reductive amination
12250   2.2.3   Sulfonamide Schotten-Baumann
10659   11.9    Separation
 8538   3.1.5   Bromo Suzuki-type coupling
 7261   1.7.7   Mitsunobu aryl ether synthesis
 7102   6.3.7   Methoxy to hydroxy
 7071   3.3.1   Sonogashira coupling
 6472   3.1.1   Bromo Suzuki coupling
 6383   1.8.5   Thioether synthesis
 5791   9.1.6   Hydroxy to chloro

Page 10: Open-source tools for querying and organizing large reaction databases

RDKit: What is it?

§ Open-source C++ toolkit for cheminformatics
§ Wrappers for Python (2.x), Java, C#
§ Functionality:
• 2D and 3D molecular operations
• Descriptor generation for machine learning
• PostgreSQL database cartridge for substructure and similarity searching
• Knime nodes
• IPython integration
• Lucene integration (experimental)
• Supports Mac/Windows/Linux
§ Releases every 6 months
§ Business-friendly BSD license
§ Code: https://github.com/rdkit
§ http://www.rdkit.org

Page 11: Open-source tools for querying and organizing large reaction databases

RDKit: Some features

§ Input/Output: SMILES/SMARTS, SDF, TDT, PDB, SLN [1], Corina mol2 [1]
§ “Cheminformatics”:
• Substructure searching
• Canonical SMILES
• Chirality support (i.e. R/S or E/Z labeling)
• Chemical transformations (e.g. remove matching substructures)
• Chemical reactions
§ 2D depiction, including constrained depiction
§ 2D->3D conversion/conformational analysis via distance geometry
§ UFF and MMFF94 implementation for cleaning up structures
§ Fingerprinting: Daylight-like, atom pairs, topological torsions, Morgan algorithm, “MACCS keys”, etc.
§ Similarity/diversity picking
§ 2D pharmacophores [1]
§ Gasteiger-Marsili charges
§ Hierarchical subgraph/fragment analysis
§ Bemis and Murcko scaffold determination
§ RECAP and BRICS implementations
§ Multi-molecule maximum common substructure
§ Feature maps
§ Shape-based similarity
§ Fraggle similarity (from GSK)
§ Molecule-molecule alignment
§ Open3DAlign implementation
§ Integration with PyMOL for 3D visualization
§ Functional group filtering
§ Salt stripping
§ Molecular descriptor library: Topological (κ3, Balaban J, etc.), Compositional (Number of Rings, Number of Aromatic Heterocycles, etc.), EState, SlogP/SMR (Wildman and Crippen approach), “MOE like” VSA descriptors, Feature-map vectors
§ Machine Learning:
• Clustering (hierarchical)
• Information theory (Shannon entropy, information gain, etc.)
§ Tight integration with the IPython notebook and pandas
§ Integration with the InChI library

[1] These implementations are functional but are not necessarily the best, fastest, or most complete.

Page 12: Open-source tools for querying and organizing large reaction databases

RDKit reaction handling: Basics

From an rxn file:
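The slide's own example is not preserved in this transcript. As a minimal sketch of the basics, the block below builds a reaction from a reaction SMARTS string instead of an rxn file (`AllChem.ReactionFromRxnFile` plays the same role for a file on disk) and applies it to two molecules. The amide-coupling SMARTS and the input molecules are illustrative assumptions, not taken from the slides.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Illustrative amide-bond formation: carboxylic acid + amine bearing an H.
rxn = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')

acid = Chem.MolFromSmiles('CC(=O)O')   # acetic acid
amine = Chem.MolFromSmiles('NC')       # methylamine

# RunReactants returns one tuple of product molecules per way the
# reactant templates can be matched onto the inputs.
product_sets = rxn.RunReactants((acid, amine))

product_smiles = set()
for products in product_sets:
    for product in products:
        Chem.SanitizeMol(product)      # products come back unsanitized
        product_smiles.add(Chem.MolToSmiles(product))

print(len(product_sets), product_smiles)
```

Collecting canonical SMILES in a set removes duplicate products that arise when a template matches symmetric positions.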

Page 13: Open-source tools for querying and organizing large reaction databases

RDKit reaction handling: Virtual protecting groups

The problem:

Introducing the protecting group on amide Ns:

The result:
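The figures for this slide are missing from the transcript. The sketch below illustrates the idea with the RDKit's `_protected` atom property, which excludes an atom from reaction matching. The reaction SMARTS and molecules are assumptions chosen for illustration, not the slide's own example.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

rxn = AllChem.ReactionFromSmarts(
    '[C:1](=[O:2])-[OD1].[N!H0:3]>>[C:1](=[O:2])[N:3]')
acid = Chem.MolFromSmiles('CC(=O)O')
base = Chem.MolFromSmiles('CC(=O)NCCN')   # contains an amide N and an amine N

# The problem: both nitrogens match the amine template.
unprotected = rxn.RunReactants((acid, base))

# Introduce the "protecting group": flag amide (and thioamide) nitrogens
# so that the reaction machinery skips them.
amide_n = Chem.MolFromSmarts('[N;$(NC=[O,S])]')
for match in base.GetSubstructMatches(amide_n):
    base.GetAtomWithIdx(match[0]).SetProp('_protected', '1')

# The result: only the free amine reacts.
protected = rxn.RunReactants((acid, base))
product = protected[0][0]
Chem.SanitizeMol(product)
protected_smiles = Chem.MolToSmiles(product)

print(len(unprotected), len(protected), protected_smiles)
```

The protection lives on the molecule rather than in the reaction definition, so the same generic reaction can be reused with different protection rules.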

Page 14: Open-source tools for querying and organizing large reaction databases

Another approach for tuning specificity: start with the problem again

Page 15: Open-source tools for querying and organizing large reaction databases

Another approach for tuning specificity: and now the solution

Thanks to Holger Claussen (BioSolveIT) for the idea to use atom values for this

Query definitions added as atom values

Page 16: Open-source tools for querying and organizing large reaction databases

Got the reactions, what about reaction fingerprints?

Criteria for them to be useful

§ Question 1: do they contain bits that help distinguish reactions from one another? Test: can we use them with a machine-learning approach to build a reaction classifier?

§ Question 2: are similar reactions similar according to the fingerprints? Test: do related reactions cluster together?

Page 17: Open-source tools for querying and organizing large reaction databases

Similarity applied to reactions

What are we talking about?

§  These two reactions are both type: “1.2.5 Ketone reductive amination”

It’s obvious that these are the same, right?

Page 18: Open-source tools for querying and organizing large reaction databases

Got the reactions, what about reaction fingerprints?

Start simple: use difference fingerprints:

Similar idea here:
1) Ridder, L. & Wagener, M. SyGMa: Combining Expert Knowledge and Empirical Scoring in the Prediction of Metabolites. ChemMedChem 3, 821–832 (2008).
2) Patel, H., Bodkin, M. J., Chen, B. & Gillet, V. J. Knowledge-Based Approach to de Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 49, 1163–1184 (2009).

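$$FP_{\mathrm{Reacts}} = \sum_{i \in \mathrm{Reactants}} FP_i$$

$$FP_{\mathrm{Products}} = \sum_{i \in \mathrm{Products}} FP_i$$

$$FP_{\mathrm{Rxn}} = FP_{\mathrm{Products}} - FP_{\mathrm{Reacts}}$$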
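The three sums above can be sketched in a few lines. This is a minimal illustration, assuming hashed atom-pair count fingerprints folded to 2048 bins and a plain reaction SMILES as input. The helper names (`mol_counts`, `side_counts`, `difference_fp`), the bin count, and the example reaction are hypothetical choices, not the implementation used in the talk.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors

N_BITS = 2048  # assumed fingerprint size

def mol_counts(mol):
    """Hashed atom-pair counts of one molecule as a dense vector."""
    vec = np.zeros(N_BITS, dtype=int)
    fp = rdMolDescriptors.GetHashedAtomPairFingerprint(mol, nBits=N_BITS)
    for idx, count in fp.GetNonzeroElements().items():
        vec[idx] += count
    return vec

def side_counts(smiles_side):
    """Sum the fingerprints of all molecules on one side of the reaction."""
    total = np.zeros(N_BITS, dtype=int)
    for smi in smiles_side.split('.'):
        total += mol_counts(Chem.MolFromSmiles(smi))
    return total

def difference_fp(reaction_smiles):
    """FP_Rxn = FP_Products - FP_Reacts for a 'reactants>>products' string."""
    reactants, products = reaction_smiles.split('>>')
    return side_counts(products) - side_counts(reactants)

fp_diff = difference_fp('CC(=O)O.NC>>CNC(C)=O')
print(np.count_nonzero(fp_diff))
```

Because the fingerprint is a signed count vector, features that survive the reaction unchanged cancel out, leaving mainly the features near the reacting centre.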

Page 19: Open-source tools for querying and organizing large reaction databases

Refine the fingerprints a bit

Text-mined reactions often include reagents or solvents in the reactants

Explore two options for handling this:
1. Decrease the weight of reactant molecules where too many of the bits are not present in the product fingerprint
2. Decrease the weight of reactant molecules where too many atoms are unmapped
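The second option can be sketched as follows, assuming atom-mapped reactant SMILES. The threshold (`max_unmapped`) and the reduced weight (`low_weight`) are hypothetical values picked for illustration; the slides do not give the actual weighting scheme.

```python
from rdkit import Chem

def unmapped_fraction(mol):
    """Fraction of heavy atoms that carry no atom-map number."""
    atoms = list(mol.GetAtoms())
    unmapped = sum(1 for atom in atoms if atom.GetAtomMapNum() == 0)
    return unmapped / len(atoms)

def reactant_weight(mol, max_unmapped=0.5, low_weight=0.1):
    """Down-weight molecules that look like reagents or solvents.

    Both parameters are illustrative assumptions.
    """
    return low_weight if unmapped_fraction(mol) > max_unmapped else 1.0

# Acetyl chloride and methylamine are mapped; dichloromethane is an
# unmapped "reactant" of the kind text mining tends to pick up.
reactant_smiles = ['[CH3:1][C:2](=[O:3])Cl', '[NH2:4][CH3:5]', 'ClCCl']
weights = [reactant_weight(Chem.MolFromSmiles(smi)) for smi in reactant_smiles]
print(weights)
```

Each weight would then multiply the corresponding reactant fingerprint before the reactant sum is formed.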

Page 20: Open-source tools for querying and organizing large reaction databases

Another reaction analysis scheme

Looking at functional group changes

§ Similar idea to the fingerprint analysis: count the numbers of common functional groups in the reactants and products and subtract the one from the other:

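    import numpy as np
    from rdkit.Chem import FunctionalGroups

    # rxn: an RDKit reaction object
    # fgh: the functional-group hierarchy, assumed to have been built beforehand

    # sum the functional-group counts over the reactants
    rfp = None
    for ri in range(rxn.GetNumReactantTemplates()):
        m = rxn.GetReactantTemplate(ri)
        fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
        if rfp is None:
            rfp = fp
        else:
            rfp += fp

    # sum the functional-group counts over the products
    pfp = None
    for ri in range(rxn.GetNumProductTemplates()):
        m = rxn.GetProductTemplate(ri)
        fp = np.array(FunctionalGroups.CreateMolFingerprint(m, fgh))
        if pfp is None:
            pfp = fp
        else:
            pfp += fp

    # functional-group change: products minus reactants
    fp = pfp - rfp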

Page 21: Open-source tools for querying and organizing large reaction databases

Functional groups considered

acidchloride acidchloride_aromatic acidchloride_aliphatic carboxylicacid carboxylicacid_aromatic carboxylicacid_aliphatic carboxylicacid_alphaamino sulfonylchloride sulfonylchloride_aromatic sulfonylchloride_aliphatic amine amine_primary amine_primary_aromatic amine_primary_aliphatic amine_secondary amine_secondary_aromatic amine_secondary_aliphatic amine_tertiary amine_tertiary_aromatic amine_tertiary_aliphatic amine_aromatic amine_aliphatic amine_cyclic boronicacid boronicacid_aromatic boronicacid_aliphatic

isocyanate isocyanate_aromatic isocyanate_aliphatic alcohol alcohol_aromatic alcohol_aliphatic aldehyde aldehyde_aromatic aldehyde_aliphatic halogen halogen_aromatic halogen_aliphatic halogen_notfluorine halogen_notfluorine_aliphatic halogen_notfluorine_aromatic halogen_bromine halogen_bromine_aliphatic halogen_bromine_aromatic halogen_bromine_bromoketone azide azide_aromatic azide_aliphatic nitro nitro_aromatic nitro_aliphatic terminalalkyne

Page 22: Open-source tools for querying and organizing large reaction databases

Functional group changes analyzed

Do the results make sense at all?

Functional Group                 Avg in Reaction   Overall Average
halogen                          -0.98             -0.3
alcohol                          -0.95             -0.12
halogen_notfluorine              -0.89             -0.27
alcohol_aromatic                 -0.67             -0.04
halogen_aliphatic                -0.62             -0.15
halogen_notfluorine_aliphatic    -0.62             -0.14
carboxylicacid                   -0.5              -0.23
halogen_bromine                  -0.42             -0.11
halogen_bromine_aliphatic        -0.39             -0.06
halogen_aromatic                 -0.36             -0.16
alcohol_aliphatic                -0.28             -0.08
halogen_notfluorine_aromatic     -0.27             -0.13
amine                            -0.04             -0.3
amine_aliphatic                  -0.04             -0.27
carboxylicacid_aliphatic         -0.04             -0.08
halogen_bromine_aromatic         -0.03             -0.05
amine_tertiary                   -0.02             -0.06
amine_tertiary_aliphatic         -0.02             -0.08
carboxylicacid_aromatic          -0.02             -0.03
amine_cyclic                     -0.01             -0.02
halogen_bromine_bromoketone      -0.01             0
acidchloride                     0                 -0.07
acidchloride_aliphatic           0                 -0.05
acidchloride_aromatic            0                 -0.02
aldehyde                         0                 -0.04
aldehyde_aliphatic               0                 -0.01
aldehyde_aromatic                0                 -0.03
amine_aromatic                   0                 -0.03
amine_primary                    0                 -0.15
amine_primary_aliphatic          0                 -0.07
amine_primary_aromatic           0                 -0.07
amine_secondary                  0                 -0.04
amine_secondary_aliphatic        0                 -0.07
amine_secondary_aromatic         0                 0.03
amine_tertiary_aromatic          0                 0
azide                            0                 0
azide_aliphatic                  0                 0
azide_aromatic                   0                 0
boronicacid                      0                 -0.03
boronicacid_aliphatic            0                 0
boronicacid_aromatic             0                 -0.03
carboxylicacid_alphaamino        0                 0
isocyanate                       0                 -0.01
isocyanate_aliphatic             0                 0
isocyanate_aromatic              0                 0
nitro                            0                 -0.03
nitro_aliphatic                  0                 0
nitro_aromatic                   0                 -0.03
sulfonylchloride                 0                 -0.02
sulfonylchloride_aliphatic       0                 -0.01
sulfonylchloride_aromatic        0                 -0.01
terminalalkyne                   0                 -0.01

Compare the average deltas for the >39K instances of Williamson ether synthesis

These look sensible

Page 23: Open-source tools for querying and organizing large reaction databases

Are the fingerprints useful?

§ Question 1: do they contain bits that help distinguish reactions from one another? Test: can we use them with a machine-learning approach to build a reaction classifier?

§ Question 2: are similar reactions similar according to the fingerprints? Test: do related reactions cluster together?

Page 24: Open-source tools for querying and organizing large reaction databases

Machine learning and chemical reactions

§ Validation set:
• The 68 reaction types with at least 2000 instances from the patent data set
- “Resolution” reaction types removed (e.g. 11.9 Separation and 11.1 Chiral separation)
- Final: 66 reaction types

§ Process:
• Training set is 200 random instances of each reaction type
• Test set is 800 random instances of each reaction type
• Learning: random forest (scikit-learn)
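The process above can be sketched with scikit-learn. The block uses synthetic integer vectors as stand-ins for reaction fingerprints, since the patent data are not reproduced here; the number of classes, the feature count, the noise model, and the forest settings are all assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes, n_features = 5, 32      # toy sizes; the talk used 66 classes
n_train, n_test = 200, 800         # instances per class, as on the slide

# Synthetic stand-in for difference fingerprints: one integer "profile"
# per class plus small random noise per instance.
centers = rng.integers(-3, 4, size=(n_classes, n_features))

def sample(n_per_class):
    X = np.vstack([center + rng.integers(-1, 2, size=(n_per_class, n_features))
                   for center in centers])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)
```

With real data, `X_train` and `X_test` would hold the reaction fingerprints and `y_train` and `y_test` the assigned reaction classes.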

Page 25: Open-source tools for querying and organizing large reaction databases

Learning reaction classes

Results for test data

Overall:
• Recall: 0.94
• Precision: 0.94
• Accuracy: 0.94
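For reference, the three overall figures can be derived from a confusion matrix as sketched below. The slides do not say how the per-class values were averaged; this sketch assumes macro averaging (an unweighted mean over classes), and the 3-class matrix is made up for illustration.

```python
import numpy as np

# Toy confusion matrix: rows are true classes, columns are predictions.
cm = np.array([[8, 2, 0],
               [1, 9, 0],
               [0, 0, 10]])

true_positives = np.diag(cm)
accuracy = true_positives.sum() / cm.sum()

# Per-class recall: correct predictions / instances of the class (row sums).
recall = (true_positives / cm.sum(axis=1)).mean()

# Per-class precision: correct predictions / predictions of the class
# (column sums).
precision = (true_positives / cm.sum(axis=0)).mean()

print(round(accuracy, 3), round(recall, 3), round(precision, 3))
```

When every class has the same number of test instances, as in the balanced 800-per-class test set, macro-averaged recall equals the overall accuracy.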

For a 66-class classifier, this looks pretty good!

Page 26: Open-source tools for querying and organizing large reaction databases

Learning reaction classes

~94% accuracy; much of the confusion is between related types

[Figure: confusion matrix for test data; labeled classes: Bromo Suzuki coupling, Bromo Suzuki-type coupling, Bromo N-arylation]

Page 27: Open-source tools for querying and organizing large reaction databases

Are the fingerprints useful?

§ Question 1: do they contain bits that help distinguish reactions from one another? Test: can we use them with a machine-learning approach to build a reaction classifier?

§ Question 2: are similar reactions similar according to the fingerprints? Test: do related reactions cluster together?

Page 28: Open-source tools for querying and organizing large reaction databases

Clustering reactions

§ Reaction similarity validation set:
• The 66 most common reaction types from the patent data set
• Look at the homogeneity of clusters with at least 10 members

[Figure: three example reactions, each labeled “1.2.5 Ketone reductive amination”; plot panel labeled “Integration”]

Interpretation: <30% of clusters are <90% homogeneous
Interpretation: <40% of clusters are <80% homogeneous
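One way to compute such homogeneity figures is sketched below. It assumes that a cluster's homogeneity is the fraction of its members belonging to its most common reaction class, which the slides do not define explicitly. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_homogeneity(cluster_ids, labels, min_size=10):
    """Majority-class fraction for each cluster with at least min_size members."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, labels):
        members[cluster_id].append(label)
    return {cluster_id: Counter(labs).most_common(1)[0][1] / len(labs)
            for cluster_id, labs in members.items()
            if len(labs) >= min_size}

def fraction_below(homogeneity, cutoff):
    """Fraction of the retained clusters whose homogeneity is below cutoff."""
    values = list(homogeneity.values())
    return sum(value < cutoff for value in values) / len(values)

# Toy data: cluster 0 is 90% pure, cluster 1 is pure, cluster 2 is too small.
ketone = '1.2.5 Ketone reductive amination'
aldehyde = '1.2.1 Aldehyde reductive amination'
cluster_ids = [0] * 10 + [1] * 12 + [2] * 3
labels = [ketone] * 9 + [aldehyde] + [ketone] * 12 + [aldehyde] * 3

homogeneity = cluster_homogeneity(cluster_ids, labels)
print(homogeneity, fraction_below(homogeneity, 0.95))
```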

Page 29: Open-source tools for querying and organizing large reaction databases

Similarity applied to reactions

Can we help classify the remaining 600K reactions?

§  Starting point: we have a similarity measure that clusters related reactions together

§  We can apply the machine-learning model to the unclassified reactions and see if the original assignment missed any instances

§ We can then look for big clusters of unclassified reactions and (manually) assign classes to them.

Page 30: Open-source tools for querying and organizing large reaction databases

Finding related unclassified reactions

§ Process:
1. Pick 10K random unclassified reactions
2. Cluster using the same fingerprint described above
3. Characterize clusters by average functional-group profile
4. Pick clusters where there is a clear signal

§  An example:

Cluster 12
    amine                      -0.68
    amine_secondary            -0.35
    amine_secondary_aliphatic  -0.35
    amine_aliphatic            -0.61
    aldehyde                   -0.58
    aldehyde_aromatic          -0.58
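Steps 3 and 4 of the process can be sketched with NumPy: average the functional-group delta vectors within each cluster, then keep the groups whose average change is large. The 0.5 cut-off for a "clear signal", the function names, and the toy data are assumptions for illustration.

```python
import numpy as np

def cluster_profiles(deltas, cluster_ids):
    """Average functional-group delta vector for each cluster."""
    deltas = np.asarray(deltas, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    return {int(c): deltas[cluster_ids == c].mean(axis=0)
            for c in np.unique(cluster_ids)}

def clear_signals(profile, names, threshold=0.5):
    """Functional groups whose average change reaches the threshold."""
    return {name: round(float(value), 2)
            for name, value in zip(names, profile)
            if abs(value) >= threshold}

# Toy data: one row per reaction, one column per functional group.
names = ['amine', 'aldehyde', 'halogen']
deltas = [[-1, -1, 0], [-1, 0, 0], [0, -1, 0], [-1, -1, 0],   # cluster 12
          [0, 0, -1], [0, 0, 0]]                              # cluster 3
cluster_ids = [12, 12, 12, 12, 3, 3]

profiles = cluster_profiles(deltas, cluster_ids)
signal_12 = clear_signals(profiles[12], names)
print(signal_12)
```

A cluster that loses both an amine and an aldehyde on average, like this toy cluster 12, is the kind of profile that points to a reductive amination.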

Page 31: Open-source tools for querying and organizing large reaction databases

Example reactions from cluster 12

• Clearly related reactions
• Using this approach we’ve identified a number of reaction classes

Page 32: Open-source tools for querying and organizing large reaction databases

Wrapping up

§ Dataset: 1+ million reactions text-mined from patents (publicly available) with reaction classes assigned

§ Fingerprint: weighted atom-pair delta fingerprints implemented using the RDKit

§ Fingerprint validation:
• Multiclass random-forest classifier ~94% accurate
• Similarity measure works: similar reactions cluster together

§ Combination of clustering + functional-group analysis allows identification of new reaction classes

Page 33: Open-source tools for querying and organizing large reaction databases

Acknowledgements

§ NIBR:
• Anna Pelliccioli
• Sereina Riniker
• Mike Tarselli

§ NextMove Software:
• Roger Sayle
• Daniel Lowe

Page 34: Open-source tools for querying and organizing large reaction databases

Advertising

3rd RDKit User Group Meeting 22-24 October 2014

Merck KGaA, Darmstadt, Germany

Talks, “talktorials”, lightning talks, social activities, and a hackathon on the 24th. Registration: http://goo.gl/z6QzwD Full announcement: http://goo.gl/ZUm2wm We’re looking for speakers. Please contact [email protected]