NIH Virtual Workshop on Reaction Informatics, May 2021
Pistachio
John Mayfield, Ingvar Lagerstedt and Roger Sayle
NextMove Software
“Fantastic reactions and how to use them”
What is Pistachio?
A document-centric database of 13.3 million reactions
Automatically extracted from U.S., European and WIPO patents
JSON and SMILES provided for bulk analysis/model building
Containerised WebApp for exploring and querying the data
The aim is to extract reactions as described in the original document, warts and all
History
Daniel’s PhD Thesis (2012): repository.cam.ac.uk/handle/1810/244727
Daniel Mark Lowe, “Extraction of chemical structures and reactions from the literature”. Department of Chemistry, Pembroke College. This dissertation is submitted for the degree of Doctor of Philosophy, June 2012.
Original Open-Source Project: dan2097/patent-reaction-extraction
Pistachio (13.3 million) nextmovesoftware.com/pistachio
We use an internal fork built using LeadMine instead of OSCAR4. This primarily improves chemical entity and physical quantity recognition, spelling correction, etc.
USPTO CC-Zero Subset (3.7 million): Chemical_reactions_from_US_patents_1976-Sep2016_/5104873
Data Impact
Christos Nicolaou et al. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space. J. Chem. Inf. Model., 2016, 56 (7), pp 1253–1266
Nadine Schneider et al. Big Data from Pharmaceutical Patents: A Computational Analysis of Medicinal Chemists’ Bread and Butter. J. Med. Chem., 2016, 59 (9), pp 4385–4402
Bowen Liu et al. Retrosynthetic reaction prediction using neural sequence-to-sequence models. ACS Cent. Sci., 2017, 3 (10), pp 1103–1113
Philippe Schwaller et al. Molecular transformer: A model for uncertainty-calibrated chemical reaction prediction ACS Cent. Sci., 2019, 5 (9), pp 1572–1583
Connor Coley et al. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Cent. Sci., 2017, 3 (5), pp 434–443
Philippe Schwaller et al. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. Sci. Adv., 2021, 7 (15)
Alessandra Toniato et al. Unassisted noise reduction of chemical reaction datasets. Nat. Mach. Intell. 2021
Amol Thakkar et al. Artificial intelligence and automation in computer aided synthesis planning. React. Chem. Eng., 2021, 6
Important: the same reaction will occur in the application and the grant, in related patents, in sketches and text, and across different authorities (WIPO/EPO/USPTO). Counting unique reactions by RInChI, without any role normalisation, gives ~4.2 million.
Occurrences are often identical but not always: the description, yield or actions can differ.
A solution of 2-(2-hydroxyethyl)-5-methoxy-1-indanone (105 mg, 0.51 mmol) in methanol (2.0 mL) at room temperature was treated with ethyl vinyl ketone (EVK, 0.102 mL) and 0.5M sodium methoxide in methanol (0.204 mL, 0.1 mmol). The mixture was stirred in a capped flask and heated in an oil bath at 60° C. for 8 hours. After cooling, the reaction mixture was diluted with EtOAc (25 mL), washed with 0.2N HCl (15 mL), water (15 mL), and brine (15 mL), dried over MgSO4, filtered, and evaporated under vacuum to afford 2-(2-hydroxyethyl)-5-methoxy-2-(3-oxopentyl)-1-indanone as an oil.
Amy Fried and Robert Wilkening, Merck Sharp & Dohme. Estrogen receptor modulators. US 7151196 B2 [0236] (19-Dec-2006), Example 2, Step 2
A solution of 2-(2-hydroxyethyl)-5-methoxy-1-indanone (105 mg, 0.51 mmol) in methanol (2.0 mL) at room temperature was treated with ethyl vinyl ketone (EVK, 0.102 mL) and 0.5M sodium methoxide in methanol (0.204 mL, 0.1 mmol). The mixture was stirred in a capped flask and heated in an oil bath at 60°C for 8 hours. After cooling, the reaction mixture was diluted with EtOAc (25 mL), washed with 0.2N HCl (15 mL), water (15 mL), and brine (15 mL), dried over MgSO4, filtered, and evaporated under vacuum to afford 2-(2-hydroxyethyl)-5-methoxy-2-(3-oxopentyl)-1-indanone (138 mg, 93% yield) as an oil.
Dann Parker, Ronald Ratcliffe, Kenneth Wildonger and Robert Wilkening, Merck Sharp & Dohme. Estrogen Receptor Modulators. EP 1257264 B1 [0261] (14-Sep-2011), EXAMPLE 34, Step 2
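The two paragraphs above describe the same step, but only the EP text reports a yield. A minimal sketch of grouping such occurrences under one key might look like the following. The record fields and the `key` strings are hypothetical stand-ins (a real key would be a structure-derived identifier such as a RInChI), not Pistachio's actual schema.

```python
from collections import defaultdict

# Hypothetical records: "key" stands in for a structure-derived identifier.
records = [
    {"doc": "US 7151196 B2", "para": "[0236]", "key": "rxn-A", "yield_pct": None},
    {"doc": "EP 1257264 B1", "para": "[0261]", "key": "rxn-A", "yield_pct": 93.0},
]


def group_occurrences(records):
    """Group every occurrence of a reaction under its shared key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["key"]].append(rec)
    return dict(groups)


def best_yield(occurrences):
    """Return the highest reported yield, or None if no occurrence has one."""
    yields = [r["yield_pct"] for r in occurrences if r["yield_pct"] is not None]
    return max(yields) if yields else None


groups = group_occurrences(records)
for key, occurrences in groups.items():
    docs = ", ".join(r["doc"] for r in occurrences)
    print(key, "seen in:", docs, "| best yield:", best_yield(occurrences))
```

Keeping every occurrence, rather than discarding duplicates, preserves the differing descriptions and yields for later use.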
USPTO vs Pistachio Data
• Updated quarterly
• EPO, WIPO patents
• USPTO sketches
• NameRxn Classification/AAM
  – Improved role assignment
  – NameRxn 71.5% coverage
• Example/Step Labels
• Solvent Mixtures
• Solvent associations
• Document Assignees, Targets and Diseases
• Continual tweaks based on feedback

Source              Reactions    Date
U.S. Grant Text     3,366,399    2021-05-18
U.S. Appl. Text     3,629,411    2021-05-13
WIPO PCT Text       1,520,596    2021-05-06
Euro. Grant Text    1,074,590    2021-05-12
Euro. Appl. Text      702,035    2021-05-12
U.S. Grant Sketch   1,211,521    2021-05-18
U.S. Appl. Sketch   1,834,132    2021-05-13
USPTO vs Pistachio Data
Pistachio is a “super-set” of USPTO but not strictly so…
• NameRxn filtering/mapping
• Improved/changed name-to-structure, roles, sectioning
• Structure normalisation differences
• Whack-a-mole/pachinko machine
  – An obvious, sensible change can have unforeseen consequences
NextMove’s pachinko machine
Regression Testing
USPTO vs Pistachio Data
• Reactions from USPTO Application Text, 2001 to 22nd Sep 2016
  – 1,939,253 CC-Zero Subset
  – 2,568,513 Pistachio
• Reactions in common, by comparison key (- is CC-Zero only, + is Pistachio only):

Key             Common       CC-Zero only   Pistachio only
SMILES            458,995    -1,480,258     +2,109,518
~RInChI         1,386,306      -552,947     +1,182,207
norm SMILES     1,465,946      -473,307     +1,102,567
paragraph Id    1,866,314       -72,939       +702,199
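The common/only-in-one-set counts above are plain set arithmetic over whichever key is chosen. A minimal sketch, assuming each dataset has already been reduced to a collection of unique keys (the function names and toy keys are illustrative only):

```python
def compare_keys(cc_zero_keys, pistachio_keys):
    """Count keys shared by both datasets and keys unique to each."""
    a, b = set(cc_zero_keys), set(pistachio_keys)
    return {
        "common": len(a & b),
        "cc_zero_only": len(a - b),
        "pistachio_only": len(b - a),
    }


def deltas(total_a, total_b, common):
    """Given the two totals and the overlap, derive the two 'only' counts."""
    return total_a - common, total_b - common


# Toy keys; real keys would be SMILES, a RInChI-like string or a paragraph Id.
print(compare_keys(["r1", "r2", "r3"], ["r2", "r3", "r4", "r5"]))

# The reported figures are internally consistent with the two totals.
print(deltas(1_939_253, 2_568_513, 458_995))
print(deltas(1_939_253, 2_568_513, 1_866_314))
```

The looser the key (exact SMILES, then normalised structure, then paragraph Id), the larger the overlap, which is the trend the four comparisons show.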
Overview of extraction
Text Extraction
Example from US20020133011A1 [0070]
Sectioning
Tagging/Tokenization
Parsing
Action Phrases
Reaction Assembly
Text Extraction
Examples from WO 2020/239862 A1 (PatentScope OCR)
• Missed break
• Extra break
Text Extraction
• UnitType.Mass
• UnitType.Percent, QuantityType.Yield
• UnitType.Percent, QuantityType.Purity
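To illustrate what these tags mean, here is a toy regex-based tagger. It is only a sketch: the `UnitType` and `QuantityType` names come from the slide, but the pattern, the look-ahead heuristic and the function name are assumptions and do not reflect how LeadMine recognises quantities.

```python
import re
from enum import Enum


class UnitType(Enum):
    Mass = "mass"
    Percent = "percent"


class QuantityType(Enum):
    Yield = "yield"
    Purity = "purity"


# A number followed by a mass unit or a percent sign (toy pattern).
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|kg|g|%)(?![A-Za-z])")


def tag_quantities(text):
    """Return (matched text, UnitType, QuantityType or None) for each quantity."""
    tags = []
    for m in QUANTITY.finditer(text):
        unit_type = UnitType.Percent if m.group(2) == "%" else UnitType.Mass
        quantity_type = None
        if unit_type is UnitType.Percent:
            # Assumed heuristic: the word right after a percentage says what it is.
            following = text[m.end():m.end() + 12].lower().lstrip()
            if following.startswith("yield"):
                quantity_type = QuantityType.Yield
            elif following.startswith(("purity", "pure")):
                quantity_type = QuantityType.Purity
        tags.append((m.group(0), unit_type, quantity_type))
    return tags


tags = tag_quantities("to afford the product (23 g, 74% yield, 98% purity)")
for tag in tags:
    print(tag)
```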
Text Extraction
Type: Add
Compounds:
• ethyl cyanoacetate (mass=13.56 g)
• ethyl 4-fluorocinnamate (mass=19.4 g)
• sodium ethoxide (mass=2.3 g, vol=50 ml)
Conditions:
• 2-3 minutes
• 60° C.
Type: Yield
Compounds:
• 2-cyano-3-(flurophenyl)-glutarate (mass=23 g, yield=74%, purity=98%)
Type: Heat
Conditions:
• 1 hour
Type: Cool
Conditions:
• 5° C.
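The action phrases above map naturally onto a small record type. The sketch below is one possible in-memory representation; the class and field names are hypothetical and are not the schema of Pistachio's JSON.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Compound:
    name: str
    quantities: Dict[str, str] = field(default_factory=dict)


@dataclass
class ActionPhrase:
    action_type: str
    compounds: List[Compound] = field(default_factory=list)
    conditions: List[str] = field(default_factory=list)


actions = [
    ActionPhrase(
        "Add",
        [
            Compound("ethyl cyanoacetate", {"mass": "13.56 g"}),
            Compound("ethyl 4-fluorocinnamate", {"mass": "19.4 g"}),
            Compound("sodium ethoxide", {"mass": "2.3 g", "vol": "50 ml"}),
        ],
        ["2-3 minutes", "60° C."],
    ),
    ActionPhrase("Heat", conditions=["1 hour"]),
    ActionPhrase("Cool", conditions=["5° C."]),
    ActionPhrase(
        "Yield",
        [Compound("2-cyano-3-(flurophenyl)-glutarate",
                  {"mass": "23 g", "yield": "74%", "purity": "98%"})],
    ),
]


def yielded_compounds(actions):
    """Compounds that appear in a Yield action, i.e. candidate products."""
    return [c for a in actions if a.action_type == "Yield" for c in a.compounds]


print([c.name for c in yielded_compounds(actions)])
```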
Text Extraction
Preliminary role assignment is based on the action, the surrounding context and dictionaries (common solvents/catalysts).
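A rough sketch of what a dictionary- and action-based first pass could look like is given below. The dictionaries, role labels and function name are all hypothetical, and the "surrounding context" signal mentioned above is not modelled at all.

```python
# Hypothetical mini-dictionaries; real ones would be far larger.
COMMON_SOLVENTS = {"methanol", "ethanol", "tetrahydrofuran", "dichloromethane", "toluene"}
COMMON_CATALYSTS = {"palladium on carbon"}


def preliminary_role(name: str, action_type: str) -> str:
    """Guess a role from the action type first, then from the dictionaries."""
    if action_type == "Yield":
        return "product"
    key = name.strip().lower()
    if key in COMMON_SOLVENTS:
        return "solvent"
    if key in COMMON_CATALYSTS:
        return "catalyst"
    # Anything else added to the mixture is provisionally a reactant.
    return "reactant"


for name, action in [
    ("ethyl cyanoacetate", "Add"),
    ("Methanol", "Add"),
    ("2-cyano-3-(flurophenyl)-glutarate", "Yield"),
]:
    print(name, "->", preliminary_role(name, action))
```

Because the assignment is only preliminary, a later mapping-based step can still overrule it.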
ChEMU 2020 Evaluation Lab

                             Exact matching                   Relaxed matching
Run                          F1-score  Precision  Recall      F1-score  Precision  Recall
Task 1                       0.8983    0.9042     0.8924      0.9240    0.9301     0.9181
Task 2                       0.8977    0.9441     0.8556      n/a       n/a        n/a
end-2-end                    0.8026    0.8492     0.7609      0.8196    0.8663     0.7777
end-2-end (after deadline)   0.8255    0.8746     0.7816      0.8420    0.8909     0.7983
Nguyen D.Q. et al. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In: Jose J. et al. (eds) Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12036. Springer, Cham. https://doi.org/10.1007/978-3-030-45442-5_74
Daniel Lowe and John Mayfield. Extraction of reactions from patents using grammars. 2020. http://ceur-ws.org/Vol-2696/paper_221.pdf
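The F1-scores in the ChEMU table are the harmonic mean of the precision and recall beside them. The short check below recomputes them; because the inputs are the rounded table values, agreement is only expected to about four decimal places.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the table above.
rows = {
    "Task 1, exact": (0.9042, 0.8924, 0.8983),
    "Task 1, relaxed": (0.9301, 0.9181, 0.9240),
    "Task 2, exact": (0.9441, 0.8556, 0.8977),
    "end-2-end, exact": (0.8492, 0.7609, 0.8026),
    "end-2-end (after deadline), exact": (0.8746, 0.7816, 0.8255),
}

for name, (p, r, reported) in rows.items():
    print(f"{name}: recomputed {f1_score(p, r):.4f}, reported {reported:.4f}")
```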
Sketch Extraction
Example 26, US 9718816 B2 (Step 1, Step 2, Step 3, Step 4, etc.)
NextMove’s Praline
John May, et al. Sketchy Sketches: Hiding Chemistry in Plain Sight. Seventh Joint Sheffield Conference on Cheminformatics. 2016
Overview of Filtering/Mapping
Reaction Filtering - Text
Checks:
• 1 <= Precursors <= 15
• 1 <= Products <= 4
• Product not on left
• Min Product Size = 9
• Definite Reference
Flowchart steps: NameRxn AAM, Indigo AAM, 2 <= Num Precursors <= 15, Rebond, Fix Roles, Calculate Yield
Flowchart outcomes: Reject, Mapped, Sane
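The text-filter checks can be read as a simple predicate over an extracted reaction. The sketch below is an assumption-laden illustration: the record type is hypothetical, product size is assumed to be counted in heavy atoms and applied to every product, and "Definite Reference" is treated as an opaque flag supplied by the extraction step.

```python
from dataclasses import dataclass
from typing import List

MIN_PRODUCT_SIZE = 9  # assumed to be heavy atoms


@dataclass
class ExtractedReaction:
    precursors: List[str]           # precursor structures, e.g. SMILES
    products: List[str]             # product structures
    product_heavy_atoms: List[int]  # precomputed heavy-atom counts per product
    definite_reference: bool        # opaque flag from the extraction step


def filter_reasons(rxn: ExtractedReaction) -> List[str]:
    """Return the list of failed checks; an empty list means the reaction passes."""
    reasons = []
    if not 1 <= len(rxn.precursors) <= 15:
        reasons.append("precursor count")
    if not 1 <= len(rxn.products) <= 4:
        reasons.append("product count")
    if set(rxn.products) & set(rxn.precursors):
        reasons.append("product on left")
    if any(n < MIN_PRODUCT_SIZE for n in rxn.product_heavy_atoms):
        reasons.append("small product")
    if not rxn.definite_reference:
        reasons.append("no definite reference")
    return reasons


ok = ExtractedReaction(["CCC(=O)C=C", "CO"], ["CCCCCCCCCO"], [10], True)
small = ExtractedReaction(["CCCCCCCO"], ["CCCCCCC=O"], [8], True)
print(filter_reasons(ok))
print(filter_reasons(small))
```

Returning the reasons, rather than a bare True/False, makes it easier to see why a reaction was rejected.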
Reaction Filtering - Sketch
Checks:
• Specific Reaction
• Precursors >= 1
• Products >= 1
Flowchart steps: NameRxn AAM, Fix Roles
Flowchart outcomes: Reject, Mapped, Sane
ROLE FIX
1) Move all agents to reactants
2) Atom-Atom Mapping - Michael addition (3.11.92)
3) Move unmapped reactants back to agents
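The three role-fix steps can be sketched as follows. The `mapper` argument is a hypothetical callable standing in for an atom-atom mapper such as NameRxn; the toy mapper below simply looks up a hand-written mapped SMILES, and a component counts as "mapped" if it carries any atom-map number.

```python
import re

ATOM_MAP = re.compile(r":\d+\]")  # e.g. the ":1]" in "[CH2:1]"


def has_atom_map(smiles: str) -> bool:
    return ATOM_MAP.search(smiles) is not None


def fix_roles(reactants, agents, products, mapper):
    # 1) Move all agents to reactants.
    candidates = list(reactants) + list(agents)
    # 2) Atom-atom mapping (delegated to the supplied mapper).
    mapped = mapper(candidates, list(products))
    # 3) Move unmapped reactants back to agents.
    new_reactants = [s for s in mapped if has_atom_map(s)]
    new_agents = [s for s in mapped if not has_atom_map(s)]
    return new_reactants, new_agents


def toy_mapper(candidates, products):
    """Stand-in mapper: only ethyl vinyl ketone is known to contribute atoms."""
    known = {"CCC(=O)C=C": "CCC(=O)[CH:1]=[CH2:2]"}
    return [known.get(s, s) for s in candidates]


# Extraction wrongly put methanol as the reactant and the enone as an agent.
reactants, agents = fix_roles(
    reactants=["CO"],
    agents=["CCC(=O)C=C", "C[O-].[Na+]"],
    products=[],
    mapper=toy_mapper,
)
print(reactants, agents)
```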
Why NameRxn?
• 1,543 rule-based classes - easy to update after a mapping disagreement
• Higher precision/lower recall
• Originally for pharmaceutical ELNs (~80% coverage)
• Pistachio coverage is ~71.5%
  – >77% USPTO appl. text
• Fast, ~380 reactions per second per core
  – A few hours to remap entire database
  – Speed depends on backend
Example class: 4.1.6 Cyclic Beckmann rearrangement
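As a back-of-envelope check on the throughput claim, the arithmetic below combines the ~380 reactions per second per core figure with the 13.3 million reactions quoted earlier. The core count is a hypothetical choice, and a full remap may process more candidates than the reactions finally kept, so this is only a rough lower bound.

```python
REACTIONS = 13_300_000   # database size quoted in the talk
RATE_PER_CORE = 380      # reactions per second per core


def remap_hours(n_reactions: int, rate_per_core: float, cores: int) -> float:
    """Wall-clock hours to map n_reactions, assuming perfect scaling across cores."""
    return n_reactions / (rate_per_core * cores) / 3600


single_core = remap_hours(REACTIONS, RATE_PER_CORE, 1)
four_cores = remap_hours(REACTIONS, RATE_PER_CORE, 4)  # core count is an assumption

print(f"1 core:  {single_core:.1f} h")
print(f"4 cores: {four_cores:.1f} h")
```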
NameRxn - Magic functional groups
NameRxn was originally written as a classification tool; AAM is a by-product.
• For us, no answer is better than a wrong answer
• Lowest number of wrong answers (disagreement with gold standard)
• Yellow bar is so-called “magic group additions”, where a product atom is unmapped:
  – We didn’t know where a group came from
  – Where the group came from was missing
  – Stoichiometry (multiple groups from one reactant)
• Aim to indicate this better in bulk data
• AMAP bench
Arkadii Lin et al. Atom-to-atom mapping: a benchmarking study of popular mapping algorithms and consensus strategies. https://chemrxiv.org/articles/preprint/Atom-to-Atom_Mapping_A_Benchmarking_Study_of_Popular_Mapping_Algorithms_and_Consensus_Strategies/13012679/1
It’s a kind of magic…
Bromo Grignard + nitrile ketone synthesis (3.7.10), EP0200736B1 [0072] Example 1, Step 1
RxnMapper/Indigo
RxnMapper
Water comes from the quenching:
“The reaction mixture is slowly poured into ice cold 10% hydrochloric acid”
“quenched slowly with 2N aq. HCl” (different paragraph)
Symmetry/Stoichiometry
NameRxn - 8.2.2 Sulfanyl to sulfonyl
RxnMapper
US20010000511A1 [0357]
Symmetry/Stoichiometry
Handle by reusing atom-maps in the reactant
US20010000511A1 [0357]
Symmetry/Stoichiometry
RxnMapper
Indigo
US 03674855 A
AMAP BENCH
Indigo
RxnMapper
US20200071310A1 [0487] Example 34
AMAP bench: Changed: 23, Broken: 13, C-C Broken: 7
AMAP bench: Changed: 5, Broken: 3, C-C Broken: 0
Daniel Lowe, Roger Sayle. Evaluating the Quality and Performance of Automatic Atom Mapping Algorithms. 244th ACS National Meeting & Exposition. Aug 2012
Indigo/RxnMapper
8-(3,5-Bis-trifluoromethyl-benzoyl)-3-furan-2-yl-methyl-1-o-tolyl-1,3,8-triaza-spiro[4.5]decane-2,4-dione
AMAP bench: Changed: 4, Broken: 2, C-C Broken: 1
Ambiguous Names
Ambiguous Names
NameRxn/Indigo/RxnMapper
8-(3,5-Bis-trifluoromethyl-benzoyl)-3-furan-2-ylmethyl-1-o-tolyl-1,3,8-triaza-spiro[4.5]decane-2,4-dione
1.2.9 Alcohol + amine condensation
AMAP bench: Changed: 2, Broken: 1, C-C Broken: 0
Case Study
US 2020/0087299 A1, Examples 1 to 33
Case Study
US 2020/0087299 A1, Examples 1 to 33
NameRxn 127/154 (82.5%), Indigo 15/154 (9.7%), Reject 12/154 (7.8%)
Case Study
US 2020/0087299 A1, Examples 1 to 33
NameRxn 132/154 (85.7%), Indigo 10/154 (6.5%), Reject 12/154 (7.8%)
Example 27
Typo: “tert-butyl”
Typo: “tert-butyl”
Example 12
Step 1 Small Product (8 heavy atoms)
Typo: “methyl”
Data Storage Tips
Hierarchical Data
Hierarchical data is indexed in the WebApp: NameRxn Tags, Assignees, Diseases (MeSH), Targets (ChEMBL), IPC Codes.
A simple way of storing and searching the data is to use a nested identifier string, e.g. LIKE '11.%' pulls back all AstraZeneca and related companies:
11 AstraZeneca
11.5 Imperial Chemical Industries
11.7 MedImmune
...
NameRxn is handled slightly differently: we pack the three-level number into an integer:
(lvl1<<24)|(lvl2<<16)|(lvl3&0xffff)
3.1.1 50397185
4.1.42 67174442
Parent queries, e.g. 3.1 (Suzuki coupling), can be handled as a range:
3.1: value >= 50397184 AND value <= 50462720
See also: https://www.postgresql.org/docs/9.1/ltree.html
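The packing expression and the range query for a parent class can be checked with a few lines of Python. The function names are illustrative; `unpack` additionally assumes the second level fits in 8 bits, which the packing expression itself does not enforce. The sketch uses a half-open range, so the upper bound 50462720 (the packed form of 3.2.0) is excluded rather than included.

```python
def pack(lvl1: int, lvl2: int = 0, lvl3: int = 0) -> int:
    """Pack a three-level NameRxn code into one integer, as on the slide."""
    return (lvl1 << 24) | (lvl2 << 16) | (lvl3 & 0xFFFF)


def unpack(code: int):
    """Inverse of pack, assuming lvl2 < 256."""
    return code >> 24, (code >> 16) & 0xFF, code & 0xFFFF


def parent_range(lvl1: int, lvl2: int):
    """Half-open [low, high) range covering every lvl1.lvl2.* class."""
    return pack(lvl1, lvl2), pack(lvl1, lvl2 + 1)


print(pack(3, 1, 1))       # 3.1.1
print(pack(4, 1, 42))      # 4.1.42
print(parent_range(3, 1))  # all of 3.1.*

low, high = parent_range(3, 1)
codes = [pack(3, 1, 1), pack(3, 1, 7), pack(3, 2, 1), pack(4, 1, 42)]
print([unpack(c) for c in codes if low <= c < high])
```

Storing the packed integer means a parent-class query becomes a plain indexed range scan instead of a string match.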
NameRxn concepts and rxno
CINF 13, ACS Fall 2017, Washington, D.C.
1 Heteroatom alkylation and arylation
  .7 O-substitution
    .1 Chan-Lam ether coupling
    .2 Diazomethane esterification
    .3 Ethyl esterification
    .4 Hydroxy to methoxy
    .5 Hydroxy to triflyloxy
    .6 Methyl esterification
    .n
2 Acylation and related processes
  .6 O-acylation to ester
    .1 Ester Schotten-Baumann
    .2 Esterification (generic)
    .3 Fischer-Speier esterification
    .4 Baeyer-Villiger oxidation
    .5 Yamaguchi esterification
    .6 Hydroxy to imidazolecarbonyloxy
    .7 Imidazolecarbonyl to ester
    .8 Hydroxy to acetoxy
    .9 Steglich esterification
    .n
Esterification (7)
Chan-Lam coupling (3)
Schotten-Baumann Reaction (9)
RXNO: http://github.com/rsc-ontologies/rxno
Summary
We always welcome feedback if you spot a mistake!
• It’s a long tail, but many things are simple changes that are fixed when rerun
• Lots of people are “cleaning” the data; we’d rather know what was wrong and whether we can fix it
Plans
• Reaction sketch compound numbers
• Better quality indication
  – Integrate RxnMapper, AMAP bench indicators, boot-strapping sequences
• Handle reactions from non-English patents
• General procedures/example references (currently we only resolve compounds)
Acknowledgements
Daniel Lowe (MineSoft)
Richard Gowers (NextMove Software)