www.digitalchemistry.co.uk Representing Markush Structures from Patents and Combinatorial Libraries Dr John M. Barnard Scientific Director Digital Chemistry Ltd., UK Presented at EMBL-EBI Industry Programme Workshop on Chemical Registry Systems Hinxton, Cambridge, 10/11 October 2011 Entire presentation Copyright © Digital Chemistry Ltd., 2011


www.digitalchemistry.co.uk

Representing Markush Structures from Patents and Combinatorial Libraries

Dr John M. Barnard

Scientific Director

Digital Chemistry Ltd., UK

Presented at EMBL-EBI Industry Programme Workshop on Chemical Registry Systems

Hinxton, Cambridge, 10/11 October 2011

Entire presentation Copyright © Digital Chemistry Ltd., 2011


Outline

• What are Markush structures?

– where do they occur?

– what types of variability do they include?

• Existing representation formats

– internal and external

• Canonicalization issues

• Proposed InChI Extensions


Markush structures

Dr Eugene Markush (1887-1968) was involved in a legal wrangle with the US Patent Office in 1924

Classes of molecule with common structural features

– represent sets of individual specific molecules (from a handful to many billions, or even infinite numbers)

– best known in chemical patents

– can also be used to represent combinatorial libraries, and in other contexts


Markush structure enumeration

Specific structures can be generated by combinatorial assembly of alternatives for each R-group

[Figure: phenol scaffold (OH) bearing R1 and R2, with R1 = CH3 / CH2CH3 / CH2CH2CH3 and R2 = F / Cl / Br / I, followed by the enumerated specific structures for combinations of these alternatives, etc.]
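The combinatorial assembly described on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenter's software: the `enumerate_markush` helper and the `{R1}`/`{R2}` placeholder template are assumptions made for the example, and the substituents are written as plain SMILES fragments chosen to match the figure.

```python
from itertools import product

def enumerate_markush(scaffold, rgroups):
    """Yield one specific structure per combination of R-group alternatives.

    scaffold: SMILES-like template with {R1}, {R2}, ... placeholders
              (a hypothetical convention used only in this sketch).
    rgroups:  dict mapping each R-group label to its list of alternatives.
    """
    labels = sorted(rgroups)
    for combo in product(*(rgroups[label] for label in labels)):
        yield scaffold.format(**dict(zip(labels, combo)))

# Phenol scaffold with two substituent positions, as in the figure.
scaffold = "Oc1c({R1})c({R2})ccc1"
rgroups = {
    "R1": ["C", "CC", "CCC"],      # methyl, ethyl, propyl
    "R2": ["F", "Cl", "Br", "I"],  # halogens
}

structures = list(enumerate_markush(scaffold, rgroups))
print(len(structures))  # 3 alternatives x 4 alternatives = 12
print(structures[0])    # Oc1c(C)c(F)ccc1
```

The number of specific structures is the product of the number of alternatives for each R-group, which is why real patent Markush structures quickly reach billions of members and cannot be handled by exhaustive enumeration.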


Variability in Markush Structures

• substituent (s-) variation: list of specific alternatives, which can be expressed in many different ways

• position (p-) variation: variable point of attachment

• frequency (f-) variation: multiple occurrence of groups

• homology (h-) variation: generically-described group (e.g. “alkyl”), potentially infinite number of alternatives

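[Figure: example Markush structure containing N, N, O, CH3, OH and a CH2 unit with multiplier M, bearing R1 and R3. Definitions shown: R1 = propyl / CH3CH(Br)CH3 / OMe / *CH2R2; R2 = F; R3 = alkyl C1-6; M = 2-5]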

Four main ways in which variability can be shown in Markush structures


Markush structure representation

• Generally extensions of methods used for specific structures

– connection tables, line notations etc.

– distinction between internal (processing) and external (storage and exchange) formats

– special provision for homology variation

• Variety of proprietary formats, some published

– many with limited capabilities

• No accepted standard


Markush representations – pre 1980

• Mainly based on fragmentation codes

– Derwent WPI code

– IFI/Plenum CLAIMS

– GREMAS code (German-based consortium)

• Some work on extensions to traditional line notations

– Hayward notation

– Wiswesser Line Notation

– ALWIN (ALgorithmically extended WIswesser Notation)

• BASF connection table-based system (E. Meyer et al.)

– designed in late 1950s and fully operational by 1965

Incomplete, ambiguous representations


Sheffield University Project (1979-94)

• Long-running project on patent Markush storage and retrieval, directed by Prof Michael Lynch

• Early part of project concentrated on structure representation

– GENSAL (input and display format)

– parameter lists

• representation of homology-variant groups

– Extended Connection Table Representation (ECTR)

• internal (in-memory) format for processing

• complex AND/OR tree of partial connection tables with links showing logical structure of Markush


GENSAL

Formal language (analogous to programming language, with fully-defined syntax) formalising Markush descriptions in Derwent abstracts

[Figure: Markush scaffold bearing R1, R2, R3 and O]

R1 = H / alkyl <1-4>;

R2 = F / Cl;

R1 + R2 = SD [structure diagram];

R3 = phenyl OSB [2,4,6] <1-2> Cl;

IF R2 = Cl THEN R1 = H.

Annotations on the example:

– Markush scaffold

– Variable-position attachment

– “Statements” to define R-group variables

– Definition using nomenclature (e.g. alkyl, phenyl)

– Parameter list (<1-4>)

– Definition using structure diagram (SD)

– Combined definition of R1 and R2 (forming fused ring)

– “optionally substituted by” (OSB), with positions of substituents ([2,4,6]) and number of substituents (<1-2>)

– Conditional definition (IF ... THEN)


Parameter lists

Represent homology-variant expressions by a set of permitted numerical ranges for structural parameters, e.g. “alkyl”:

– 1-n carbon atoms

– 0 heteroatoms

– 0 double bonds

– 0 triple bonds

– 0-n branch points

– 0 rings

Original GENSAL parameters

C: carbon count

Z: heteroatom count

E: double bond count

Y: triple bond count

Q: quaternary branch points

T: ternary branch points

RC: number of rings

RN: number of ring atoms

RF: number of ring fusions

RA: number of aromatic rings

5-10C unbranched alkyl: C<5-10> Z<0> E<0> Y<0> Q<0> T<0> RC<0>

optionally-aromatic fused heterocycle: Z<1-> RC<2-> RA<0->
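A parameter list of this kind amounts to a set of numeric ranges that a candidate group must satisfy. The sketch below is a rough reading of the notation shown on the slide, not the GENSAL implementation: `parse_parameter_list` and `matches` are hypothetical helpers, a trailing dash (as in `Z<1->`) is assumed to mean "or more", and any count not supplied for a group is assumed to be zero.

```python
import re

def parse_parameter_list(text):
    """Parse e.g. 'C<5-10> Z<0> RC<2->' into {code: (low, high)}.

    high is None for an open-ended range such as Z<1->.
    """
    ranges = {}
    for code, low, dash, high in re.findall(r"([A-Z]+)<(\d+)(-?)(\d*)>", text):
        low = int(low)
        if not dash:
            ranges[code] = (low, low)        # single value, e.g. Z<0>
        elif high:
            ranges[code] = (low, int(high))  # closed range, e.g. C<5-10>
        else:
            ranges[code] = (low, None)       # open range, e.g. Z<1->
    return ranges

def matches(ranges, counts):
    """True if every constrained parameter count lies in its permitted range."""
    for code, (low, high) in ranges.items():
        value = counts.get(code, 0)  # assumption: missing counts are zero
        if value < low or (high is not None and value > high):
            return False
    return True

alkyl = parse_parameter_list("C<5-10> Z<0> E<0> Y<0> Q<0> T<0> RC<0>")
n_hexyl = {"C": 6}
isopropyl = {"C": 3, "T": 1}
print(matches(alkyl, n_hexyl))    # True
print(matches(alkyl, isopropyl))  # False

heterocycle = parse_parameter_list("Z<1-> RC<2-> RA<0->")
print(matches(heterocycle, {"C": 8, "Z": 1, "RC": 2, "RA": 1}))  # True
```

Parameters that do not appear in the list are left unconstrained here, which is why the fused-heterocycle expression accepts any carbon count.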


Markush DARC (1988-)

• First commercial Markush structure system

• Joint development by Questel (software), Derwent and INPI (French Patent Office) (database)

– MMS (Merged Markush Service) database now owned by Thomson Reuters

• Proprietary format for data storage and exchange

– VMN files (binary connection table)

– AMN file (associated text file with parameter list data)

– XML version of VMN file also exists, though involves significant data loss on orientation of R-groups with two or more attachments


Markush DARC Superatoms

• Used to represent homology-variant groups

• Set of 22 predefined groups with mnemonic names

– some represent enumerated lists of elements e.g. HAL

– some represent classes e.g. CHK (alkyl), HEA (heteroaryl)

– some represent structurally-undefined groups e.g. DYE

• Can be qualified by

– attributes (qualitative)

– parameters (quantitative – comparable to GENSAL)


Markush DARC Superatoms


MARPAT (1991-)

• Chemical Abstracts Service's competitor to Markush DARC

• Proprietary software and database

• Input and display format similar to GENSAL

• Internal representation is extension of CAS specific structure format


MARPAT Generic Group Nodes

• Hierarchical set of special atom types used to represent homology-variant groups

• Can be qualified by

– categories (cf Markush DARC attributes)

– attributes (cf Markush DARC parameters)


MDL RGfiles

• A flavour of Molfile

– text-formatted connection table

– various versions

– proprietary to Accelrys (formerly Symyx, MDL)

• Really intended for R-group queries, but widely used for Markush structures

• Significant limitations

– substituent-variation only

– limit of two fixed-position connections for each R-group


SMILES and Extensions

• Daylight's original SMILES can only represent complete molecules

– [*] atoms can be used as dummy atoms (for R-groups and attachment points), and given “isotope” labels

• Digital Chemistry pioneered use of “pseudo” ring closures to assemble complete molecules from Markush building blocks

[Figure: phenol scaffold (OH) bearing R1 and R2, written as Oc1c([1*])c([2*])ccc1]

Oc1c%11c%12ccc1 . C%11 . F%12

Oc1c%11c%12ccc1 . CC%11 . F%12

Oc1c%11c%12ccc1 . C%11 . Br%12
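The "pseudo" ring-closure trick can be reproduced with simple string handling: each attachment point on the scaffold and the connecting atom of its substituent are given the same two-digit ring-closure label, and the pieces are joined with dots. The following sketch is an illustration under stated assumptions, not Digital Chemistry's code: the `assemble` function is hypothetical, attachment points are assumed to be written as `([1*])`, `([2*])`, ... branches, and each fragment is assumed to bond through the last atom written in its SMILES.

```python
def assemble(scaffold, substituents, first_closure=11):
    """Join a scaffold and chosen substituents into one dot-separated SMILES.

    scaffold:     SMILES with attachment points written as ([1*]), ([2*]), ...
    substituents: fragment SMILES; fragment i bonds to attachment point i
                  through its last atom (an assumption of this sketch).
    """
    parts = [scaffold]
    for index, fragment in enumerate(substituents, start=1):
        closure = f"%{first_closure + index - 1}"
        # Replace the dummy-atom branch by a ring-closure label on the ring atom.
        parts[0] = parts[0].replace(f"([{index}*])", closure)
        # Put the matching label on the fragment's last atom.
        parts.append(fragment + closure)
    return ".".join(parts)

scaffold = "Oc1c([1*])c([2*])ccc1"
print(assemble(scaffold, ["C", "F"]))    # Oc1c%11c%12ccc1.C%11.F%12
print(assemble(scaffold, ["CC", "F"]))   # Oc1c%11c%12ccc1.CC%11.F%12
print(assemble(scaffold, ["C", "Br"]))   # Oc1c%11c%12ccc1.C%11.Br%12
```

Because a ring-closure label bonds two atoms regardless of which dot-separated component they sit in, a standard SMILES parser reads each result as a single connected molecule. The fragments never need to be spliced into the scaffold string itself.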


SMILES and Extensions

• Many vendors have added their own non-standard extensions to SMILES to show R-groups and attachment points in Markush structures

– limited consistency between vendors/parsers

• Daylight developed their own CHUCKLES and CHORTLES notations

– primarily for peptide libraries

• “Open SMILES” project could provide forum for agreeing “standard” extensions

– has got bogged down in other issues

• There are issues in representing incomplete aromatic rings, and potentially aromatic rings


Other Line Notations

• Sybyl Line Notation (SLN)

– similar to extended SMILES

– developed by Tripos

– designed to show combinatorial libraries

• ROSDAL

– developed by Beilstein Institute

– primarily a query language

– has some capabilities for showing homology variation

7G1,6-1=-6—9.8=10O;G1=(1&1-2&2;3&1=4&2;5O&2-=11-6,9-12&1,8-14Cl,10-13Cl).


XML, CML etc.

• Various XML-ifications of pre-existing formats have been promoted

– usually just put the original format in an XML wrapper

• CML is pre-eminent among “proper” XML formats for chemistry

– has not yet achieved wide acceptance as a standard

– latest version (2.4) has extensions able to handle polymer repeating units, but no real Markush capabilities

• Digital Chemistry has done some design work on an XML format for Markush structure exchange

– not published or fully implemented


Markush Canonicalisation

• Canonicalisation involves putting a chemical representation into a unique “correct” form

– applying business rules

– renumbering atoms into canonical order

• This becomes a bit more complicated when it comes to Markush structures, which represent “sets” of specific molecules

– obvious rule is that Markushes are equivalent when they cover the same set of specific molecules

– but...


Equivalent Markushes?

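[Figure: two Markush structures on a phenol scaffold. In the first, the scaffold includes a CH2 group as well as R1 and R2, with R1 = H / CH3 / CH2CH3 and R2 = F / Br / Cl / I. In the second, the scaffold bears only R1 and R2, with R1 = CH3 / CH2CH3 / CH2CH2CH3 and R2 = F / I / Cl / Br.]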

The “segmentation problem”
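The segmentation problem can be made concrete by enumerating two differently segmented Markush structures and comparing the sets of specific molecules they cover. This is a toy sketch under assumptions: the `covered_set` helper and the `{R1}`/`{R2}` template convention are hypothetical, and the two templates were deliberately written so that identical molecules produce identical SMILES strings. In general each enumerated molecule would first have to be canonicalised (for example as a canonical SMILES or an InChI) before the sets could be compared.

```python
from itertools import product

def covered_set(scaffold, rgroups):
    """Return the set of specific structures covered by one Markush."""
    labels = sorted(rgroups)
    return {
        scaffold.format(**dict(zip(labels, combo)))
        for combo in product(*(rgroups[label] for label in labels))
    }

# Segmentation A: a CH2 belongs to the scaffold; R1 = H / CH3 / CH2CH3.
# (H is written as an empty string, since SMILES hydrogens are implicit.)
markush_a = covered_set(
    "Oc1c(C{R1})c({R2})ccc1",
    {"R1": ["", "C", "CC"], "R2": ["F", "Br", "Cl", "I"]},
)

# Segmentation B: the CH2 belongs to R1; R1 = CH3 / CH2CH3 / CH2CH2CH3.
markush_b = covered_set(
    "Oc1c({R1})c({R2})ccc1",
    {"R1": ["C", "CC", "CCC"], "R2": ["F", "I", "Cl", "Br"]},
)

print(markush_a == markush_b)  # True: same 12 molecules, different segmentation
```

The two definitions differ in scaffold, in R1 alternatives and in the order of the R2 list, yet they are equivalent under the "same set of specific molecules" rule. Comparing enumerated sets only works for small, fully enumerable Markush structures.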


Equivalent Markushes?

“Extensional” vs. “Intensional” representation

[Figure: two Markush structures on the same phenol scaffold (OH) bearing R1 and R2. The first lists the alternatives explicitly: R1 as individual alkyl groups drawn as structure diagrams, and R2 = F / Cl / Br / I. The second defines them generically: R1 = C1-4 alkyl, R2 = hal.]


Business Rules and Tautomers

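The preferred tautomeric form...

[Figure: a pair of tautomers, one drawn with N-H and O, the other with N and O-H]

… may be difficult to apply in a Markush structure

[Figure: Markush structure containing N, O and R1, with R1 = H, CH3]

Aromaticity detection may also be an issue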


Markush Canonicalisation

• Canonicalising an individual building block (scaffold or R-group alternative) is relatively simple

– “dummy atoms” for attachment points and R-groups

• Could define a sequence for R-group alternatives

– alphanumeric sequence of canonical representations

• Could define a sequence for R-groups

– non-arbitrary R-group labels

Would give you a “canonical Markush” but dependent on arbitrary segmentation (boundaries between scaffold and R-groups) and with limitations on applicability of business rules.
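The ordering steps listed on this slide can be sketched as follows. It is a minimal illustration for one fixed segmentation, not a proposed standard: `canonical_markush` is a hypothetical function, every building block is assumed to be an already-canonical SMILES string, and R-groups are relabelled by the order in which their `[n*]` dummy atoms first appear in the scaffold string (which in turn assumes the scaffold was canonicalised without regard to the original label values).

```python
import re

def canonical_markush(scaffold, rgroups):
    """Return a deterministic text form of a Markush for a fixed segmentation.

    scaffold: SMILES with R-group positions written as [1*], [2*], ...
    rgroups:  dict mapping label (as a string, e.g. "1") to alternatives.
    """
    # Non-arbitrary labels: number R-groups by first appearance in the scaffold.
    order = []
    for label in re.findall(r"\[(\d+)\*\]", scaffold):
        if label not in order:
            order.append(label)
    renumber = {old: str(new) for new, old in enumerate(order, start=1)}
    new_scaffold = re.sub(
        r"\[(\d+)\*\]", lambda m: f"[{renumber[m.group(1)]}*]", scaffold
    )
    # Alphanumeric sequence of the alternatives within each R-group.
    definitions = [
        f"R{renumber[old]}=" + "/".join(sorted(set(rgroups[old])))
        for old in order
    ]
    return ";".join([new_scaffold] + definitions)

a = canonical_markush(
    "Oc1c([1*])c([2*])ccc1",
    {"1": ["CC", "C", "CCC"], "2": ["I", "F", "Cl", "Br"]},
)
b = canonical_markush(
    "Oc1c([7*])c([3*])ccc1",
    {"7": ["C", "CCC", "CC"], "3": ["Br", "Cl", "F", "I"]},
)
print(a)       # Oc1c([1*])c([2*])ccc1;R1=C/CC/CCC;R2=Br/Cl/F/I
print(a == b)  # True
```

The two inputs differ only in R-group labels and in the order of alternatives, so they collapse to the same string. Two Markush structures that cover the same molecules with different scaffold/R-group boundaries would still produce different strings, which is the limitation the slide points out.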


Canonicalisation of homology variation

• Problem here is defining what is to be represented

– canonicalising a parameter list would be fairly simple

• Different existing systems have subtly different representations

– Markush DARC superatoms/attributes/textnote parameters

– MARPAT generic group nodes/categories/attributes

– GENSAL parameter lists

Any standard canonical representation would effectively have to impose its own choice of basic representation, which ideally would be a superset of everyone else's


InChI generic structure extensions

• InChI is now well-established as a canonical representation standard for specific molecules

– unique alphanumeric string identifier

– open-source software for generation

• Working party has been looking at extending standard and software to handle Markush structures

– InChI Trust has approved proposals from Digital Chemistry Ltd for staged implementation

– currently awaiting allocation of funding


InChI generic structure extensions

1. InChIs for groups with external attachments

- InChI for “methyl” etc. as distinct from methane or radical

2. Assembly of InChIs into Markush structure

- arbitrary segmentation means little point in canonicalising this assembly

3. Additional types of variability

- several stages

[Figure: structure diagrams of a scaffold bearing OH and R1, shown with several alternative definitions of R1 built from CH2 and CH3 fragments]