www.guidetopharmacology.org Deuterogate: Causes and consequences of automated extraction of patent-specified virtual deuterated drugs feeding into PubChem. Christopher Southan, IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative Physiology, University of Edinburgh. ACS Boston CINF session: Enabling Machines to "Read" the Chemical Literature: Techniques. http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs

Causes and consequences of automated extraction of patent-specified virtual deuterated drugs

Page 1

www.guidetopharmacology.org

Deuterogate: Causes and consequences of automated extraction of patent-specified virtual deuterated drugs feeding into PubChem

Christopher Southan

IUPHAR/BPS Guide to PHARMACOLOGY, Center for Integrative Physiology, University of Edinburgh

ACS Boston CINF session: Enabling Machines to "Read" the Chemical Literature: Techniques

http://www.slideshare.net/cdsouthan/causes-and-consequences-of-automated-extraction-of-patentspecified-virtual-deuterated-drugs

Page 2

Abstract

The strategy of deuterating drugs to improve clinical profiles via the kinetic isotope effect has been known for over 50 years. However, recent development candidates have been predicated on a surge of opportunistic patent filings between 2008 and 2011. These filings present particular challenges for automated chemical named entity recognition (CNER), which are investigated in this work by comparing the sources of the 80K deuterated compounds inside PubChem. Of these, 45K originate from the patent CNER submissions of SCRIPDB, IBM and SureChEMBL, plus 23K from Thomson Pharma via manual expert curation (MEXC). For CNER there are three options: image extraction, recognition of [2H] in IUPAC text forms, or Complex Work Unit (CWU) molfiles obtained from the USPTO. For images, conversions to structures using OSRA with explicit H and D positions failed. Tests with chemicalize.org and OPSIN established that the text “deuterio” did convert. The SureChEMBL pipeline also handles the “dx” prefix (e.g. methyl-d3). These tests, combined with inspection of SureChEMBL export records, confirmed that deuteration feeding into PubChem from patents was predominantly image-only derived. It was also clear that CWUs had provided the majority of these via molfiles. However, despite conceptually similar pipelines, the three CNER sources showed divergent capture. Importantly, inspection of patents from the three major applicants in the deuteration IP Gold Rush indicated little reduction to practice. The unexpected consequence is that most of the ~25K derivatives in PubChem of ~500 established drugs are virtual (i.e. the structures do not exist). This Achilles heel of CNER will be discussed, since it presents database users with a dilemma: virtual swamping but possible IP significance on the one hand, versus the permanent absence of linked bioactivity data on the other.
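As a rough illustration of the text route, a name-level filter could flag the tokens mentioned in the abstract before any name-to-structure conversion is attempted. This is a minimal sketch only: the function name and the regular expression are assumptions made for illustration, and it does not represent how OPSIN, chemicalize.org or SureChEMBL work internally.

```python
import re

# Tokens named in the abstract: "deuterio" (plus the "deutero" variant),
# the explicit isotope form "[2H]", and the "dx" label as in "methyl-d3".
# The pattern is an illustrative assumption, not any tool's actual grammar.
_DEUTERATION_TOKEN = re.compile(
    r"[Dd]euteri?o"                           # deuterio / deutero
    r"|\[2H\]"                                # explicit isotope label
    r"|(?<![A-Za-z0-9])d\d+(?![A-Za-z0-9])"   # d3, d5, ... as a separate token
)


def has_deuteration_token(name: str) -> bool:
    """Return True if the text contains one of the deuteration tokens."""
    return _DEUTERATION_TOKEN.search(name) is not None


if __name__ == "__main__":
    for text in ("methyl-d3", "trideuteriomethyl", "codeine"):
        print(text, has_deuteration_token(text))
```

The lowercase-only "d" in the last alternative is a deliberate choice so that labels such as "vitamin-D3" are not flagged; a real pipeline would need far more careful handling of such cases.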

Page 3

Introduction

Page 4

Dalbavancin

FDA approved May 2014

Page 5

SciFinder extraction

Page 6

US20090062182: Deuterium-enriched dalbavancin

Page 7

Protia portfolio

Page 8

OSRA: fails on explicit “D-” image > struct

Page 9

The extraction problem for deuts

• Majority of patents are image-only so no conversion

• IUPAC specification of “deutero” and “deuterio” is rare but OPSIN, SureChEMBL and chemicalize.org will do the name-to-structure conversion

• Thomson (Derwent) and SciFinder draw them in manually for conversion

• SureChEMBL, SCRIPDB and IBM use the Complex Work Units from the USPTO

• These include the molfiles drawn by the contractors and are the major source of deuteration in PubChem
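To make the molfile route concrete, the sketch below counts deuterium atoms in a single V2000 molblock. It assumes the standard fixed-column CTfile layout and that deuterium is written either as the atom symbol "D" or as an "M  ISO" entry assigning mass 2 to a hydrogen; which of these conventions the CWU molfiles actually use is not stated in the slides, and the older mass-difference column is ignored here. The function name is hypothetical.

```python
def count_deuterium_in_molfile(molblock: str) -> int:
    """Count deuterium atoms in one V2000 molblock (illustrative sketch)."""
    lines = molblock.splitlines()
    # Line 4 is the counts line; its first three characters hold the atom count.
    n_atoms = int(lines[3][0:3])
    # In each atom line the element symbol occupies columns 32-34 (1-indexed).
    symbols = [line[31:34].strip() for line in lines[4:4 + n_atoms]]
    count = symbols.count("D")
    # "M  ISO" property lines list (atom index, absolute mass) pairs.
    for line in lines[4 + n_atoms:]:
        if line.startswith("M  ISO"):
            fields = line.split()[3:]  # skip "M", "ISO" and the entry count
            for idx, mass in zip(fields[0::2], fields[1::2]):
                if symbols[int(idx) - 1] == "H" and int(mass) == 2:
                    count += 1
    return count
```

A toolkit-based reader would be more robust than column slicing; the point of the sketch is only that deuteration is explicit and machine-readable once a molfile exists, in contrast to the image-only case.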

Page 10

Codeine: the enumeration record from US20080045558

Left panel shows a section from one of approximately 55 pages of images.

Right panel shows the first three examples from the 520 intersect between the 992 CIDs retrieved via the patent number and the 551 from “Same, Connectivity” for codeine (CID 5284371), ranked by Mw.

Thomson Pharma only extracted three examples from this patent
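The intersect described for this slide can be sketched as a set operation on two CID result lists, followed by ranking on molecular weight. The CIDs and weights below are hypothetical placeholders, not the actual PubChem results, and the function name is an assumption.

```python
def intersect_ranked_by_mw(patent_hits, connectivity_hits):
    """Return CIDs present in both result sets, sorted by molecular weight.

    Both arguments map CID -> molecular weight.
    """
    shared = set(patent_hits) & set(connectivity_hits)
    # Sort by weight, using the CID itself as a tie-breaker.
    return sorted(shared, key=lambda cid: (patent_hits[cid], cid))


# Hypothetical placeholder data: CID -> Mw
patent_hits = {101: 302.4, 102: 305.4, 103: 317.5, 104: 290.0}
connectivity_hits = {101: 302.4, 103: 317.5, 105: 299.4}

print(intersect_ranked_by_mw(patent_hits, connectivity_hits))  # [101, 103]
```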

Page 11

SureChEMBL indexing

The first structure in the list, SCHEMBL12905541, corresponds to CID 237918906, which has merged the SureChEMBL SID 237918906 with SCRIPDB SID 141460523.

Page 12

Deuterated source splits

Page 13

Source divergence in deuteration capture

Thomson Pharma (TRP), SCRIPDB (SCR) and SureChEMBL (SCH) have an approximate three-way split, with the union of 64195 covering 81% of PubChem deuteration (77882 in March 2015)
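A source split of this kind can be computed from per-source CID sets. The sketch below uses small toy sets rather than the real submissions; the function names and the data are assumptions for illustration.

```python
def source_split(sources):
    """Summarise overlap between named CID sets (illustrative sketch)."""
    union = set().union(*sources.values())
    in_all = set.intersection(*sources.values())
    unique = {}
    for name, cids in sources.items():
        # Everything contributed by the other sources combined.
        others = set().union(*(s for n, s in sources.items() if n != name))
        unique[name] = len(cids - others)
    return {"union": len(union), "in_all": len(in_all), "unique": unique}


def coverage(union_size, total):
    """Percentage of a reference total covered by the union."""
    return 100.0 * union_size / total


# Toy CID sets standing in for the three sources
toy = {
    "TRP": {1, 2, 3, 4},
    "SCR": {3, 4, 5, 6},
    "SCH": {4, 6, 7, 8},
}
print(source_split(toy))
print(coverage(8, 10))
```

Counting the source-unique compounds separately from the shared ones is what exposes divergent capture: a large unique count per source means the pipelines are not extracting the same structures.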

Page 14

Propagation: UniChem indexing

Page 15

Deuteration over time: patent surge in Thomson Pharma

TRP deuteration in PubChem on a per-year basis (left vertical axis, hatched bars) with patent publication dates taken from the USPTO for Auspex, Concert and Protia combined (right-hand vertical axis, solid lines with triangles).

Page 16

Picking off drug structures

Page 17

SciFinder results indicate invention by consortium

• SciFinder facilitated certain queries orthogonal to PubChem (e.g. assignee query for substances)

• 19841 isotopic substances were derived from 165 Auspex patents

• Concert 6766 from 189

• Protia 1959 from 252

• Remarkably, the substance union query gave 28076 with an intersect of only 30 as deuteration reagents

• This means the assignees somehow contrived to divide up ~600 drug filings (i.e. to avoid each other's claims)
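The arithmetic behind these bullets can be checked directly from the quoted counts. A small sketch, assuming the three per-assignee totals and the union come from comparable SciFinder queries; the variable names are illustrative.

```python
# Counts quoted on the slide
substances = {"Auspex": 19841, "Concert": 6766, "Protia": 1959}
patents = {"Auspex": 165, "Concert": 189, "Protia": 252}
union_size = 28076

total_substances = sum(substances.values())        # 28566
total_patents = sum(patents.values())              # 606, i.e. the "~600" filings
multiply_counted = total_substances - union_size   # 490
overlap_pct = 100.0 * multiply_counted / total_substances

print(total_substances, total_patents, multiply_counted)
print(f"{overlap_pct:.1f}% of per-assignee counts fall in more than one portfolio")
```

The per-assignee totals exceed the union by only 490, under 2% of the summed counts, which is the numerical basis for saying that the three portfolios barely overlap.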

Page 18

Consequences and problems of virtual deuteration

• Classic case of unintended consequences

• Confounding drug analogue searching

• Breaking the PubChem unofficial rule of extant-only compounds

• Extant and virtual structures cannot be computationally separated

• Secondary submitters cause intra-PubChem proliferation

• Persistence as no-data entries

• Proliferation between open databases

• Both commercial sources of patent chemistry and source aggregation projects within pharmaceutical companies will be affected

• Annotation can be confounded (e.g. the attribution of biological study in SciFinder)

• Equivocal IP situation