1 Chemical Structure Representation and Search Systems Lecture 1. Oct 28, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software & Consultancy Services Sheffield, UK


1 Chemical Structure Representation and Search Systems

Lecture 1. Oct 28, 2003

John Barnard

Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services

Sheffield, UK


2 Purpose of my 7 lectures

How do you store chemical structures on computer?

What can you do with them there?

How do the computer systems used in chemical informatics work?

Data Structures + Algorithms


3 Lecture topics

Oct 28  Introduction to structure representation; Introduction to graph theory [video link]
Oct 30  Problems of structure representation [video link]
Nov 4   More graph theory; Structure analysis and processing [video link]
Nov 11  Structure searching I [video link]
Nov 13  Structure searching II [video link]
Nov 18  Chemical similarity [Indianapolis]
Nov 20  Cluster analysis etc. [Bloomington]


4 John Barnard

B.Sc. in Biochemistry (Birmingham, UK)
M.Sc. and Ph.D. in Information Studies (Sheffield, UK)
Has run chemical informatics software development and consultancy business since 1985
• Barnard Chemical Information (BCI) Ltd
• http://www.bci.gb.com

Adjunct Professor of Informatics at Indiana University


5 Lecture 1: Topics to be Covered

Structure representations and computers
• structure diagrams
• nomenclature
• line notations
• connection tables

Introduction to Graph Theory


6 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

C9H11NO3 (molecular formula of tyrosine, the example molecule used in this lecture)


7 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

[Figure: structure diagram of tyrosine]


8 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types (aromatic ring identification)
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

[Figure: structure diagram of tyrosine]


9 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

[Figure: structure diagram of tyrosine]


10 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

[Figure: structure diagram of tyrosine drawn in charged form, with an NH3+ group]


11 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms

[Figure: structure diagram of tyrosine with a carbon-14 (C14) isotope label]


12 Representing a chemical structure

How much information do you want to include?
• atoms present
• connections between atoms
  o bond types
• stereochemical configuration
• charges
• isotopes
• 3D-coordinates for atoms


13 2D structure diagram

chemists’ “natural language”
used by most computer systems for display
shows topology, optionally stereochemistry
several commonly-used computer programs allow input/editing of structure diagrams
• ISIS/Draw (MDL): http://www.mdl.com/downloads/downloadable/index.jsp
• ChemDraw (CambridgeSoft): http://www.cambridgesoft.com/products/
• GRINS/JavaGRINS (Daylight): http://www.daylight.com/products/javatools.html
• MarvinSketch: http://www.chemaxon.com/marvin/


14 2D structure diagram

provides 2D pictorial representation of chemical structure
• display on screen
• cut/paste/embed in Word document etc.
inter-convert with other forms for further processing
• database searching
• structure analysis
• property prediction
• database analysis


15 Chemical Nomenclature

name that can be used to identify a substance
• potentially important for legislation
represents chemical structure as text string
• which can (sometimes) be pronounced
trivial names
• usually short and easy to pronounce
• do not usually give much information about structure
systematic names
• usually long and difficult to pronounce
• usually describe structure in considerable detail


16 Trivial and Systematic Names

Trivial name:
• tyrosine
Systematic names:
• β-(p-hydroxyphenyl)alanine
• α-amino-p-hydroxyhydrocinnamic acid

[Figure: structure diagram of tyrosine]


17 Systematic Names

several systems under continual revision and extension
• IUPAC
• Chemical Abstracts (lecture from Dr Davis on Sep 9)
• some special systems designed by individuals
not usually designed for computer processing
• programs exist both to read (translate) and to generate systematic names from computer formats
  o http://www.beilstein.com/products/autonom/anm2000.shtml
  o http://www.acdlabs.com/products/name_lab/
have arguably outlived their usefulness
• IUPAC “IChI” (IUPAC Chemical Identifier) project


18 Registry Numbers

unique identifiers for compounds or substances
• catalogue number
most chemical databases have them
• Chemical Abstracts
• Beilstein
• private compound registries in pharmaceutical companies
usually just “idiot numbers”
• no chemical information
may have hierarchical structure
• parent compound, stereoisomer, salt, batch
need to decide what is a separate compound


19 Line Notations

represent structures as compact linear string of alphanumeric symbols
easily handled by computer
• compact storage
• easily transmitted over a network
allow rapid manual coding/decoding by trained users
• much faster for input than using a structure drawing program


20 Line Notations: SMILES

Simplified Molecular Input Line Entry System
developed by Dave Weininger (Daylight)

OC(=O)C(N)CC1=CC=C(O)C=C1

[Figure: structure diagram of tyrosine, with the ring-closure position labelled 1]


21 Simplified SMILES encoding rules

atoms are shown by atomic symbols: B, C, N, O, F, P, S, Cl, Br, I
hydrogen atoms are assumed to fill spare valencies
adjacent atoms are connected by single bonds
• double bonds are shown '=', triple bonds are '#'
branching is indicated by parentheses
ring closures are shown by pairs of matching digits

Full rules: http://www.daylight.com/smiles/smiles-intro.html
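The simplified rules above are enough to read the tyrosine SMILES from the previous slide. The following is a minimal sketch, not the Daylight implementation: it handles only the listed atom symbols, the bond symbols, parentheses and single-digit ring closures, and the table of standard valences is an assumption made for this illustration.

```python
# Simplified SMILES reader covering only the rules listed on the slide.
# VALENCE holds assumed standard valences used to fill in hydrogens.
VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "F": 1,
           "P": 3, "S": 2, "Cl": 1, "Br": 1, "I": 1}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_smiles(smiles):
    """Return (atoms, bonds): element symbols and (i, j, order) tuples."""
    atoms, bonds = [], []
    prev = None          # atom the next atom will be bonded to
    branch_stack = []    # atoms to return to when a branch closes
    ring_open = {}       # ring digit -> (atom index, bond order)
    pending = 1          # order of the next bond (single by default)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        symbol = smiles[i:i + 2] if smiles[i:i + 2] in ("Cl", "Br") else ch
        if symbol in VALENCE:
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending))
            prev, pending = len(atoms) - 1, 1
            i += len(symbol)
            continue
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:   # second occurrence closes the ring
                start, order = ring_open.pop(ch)
                bonds.append((start, prev, max(order, pending)))
            else:                 # first occurrence opens it
                ring_open[ch] = (prev, pending)
            pending = 1
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    return atoms, bonds


def hydrogen_counts(atoms, bonds):
    """Fill spare valencies with implicit hydrogens."""
    used = [0] * len(atoms)
    for i, j, order in bonds:
        used[i] += order
        used[j] += order
    return [max(VALENCE[a] - u, 0) for a, u in zip(atoms, used)]


atoms, bonds = parse_smiles("OC(=O)C(N)CC1=CC=C(O)C=C1")  # tyrosine
hydrogens = hydrogen_counts(atoms, bonds)
print(len(atoms), len(bonds), sum(hydrogens))
```

For tyrosine this reports 13 heavy atoms, 13 bonds and 11 implicit hydrogens. Aromatic lower-case atoms, charges, isotopes and stereochemistry are deliberately left out; the full rules at the URL above cover them.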


22 Other line notations

ROSDAL (Beilstein): Representation Of Structure Diagram Arranged Linearly

1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O

Sybyl Line Notation (Tripos)

OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1

Wiswesser Line Notation (WLN) (obsolete)

QVYZ1R DQ

[Figure: structure diagram of tyrosine with numbered atoms]


23 Connection Tables (CTs)

main form of structure representation in computer systems
• list atoms and bonds (and other data) as a table
many different formats
• “internal” CTs (in memory)
  o algorithmic processing
• “external” CTs (disk files)
  o archival storage
  o data exchange between programs


24 “Redundant” Connection Table

Atom  Element  H count  Neighbours (atom number, bond order)
 1.   O        1         2 1
 2.   C        0         1 1    3 2    4 1
 3.   O        0         2 2
 4.   C        1         2 1    5 1    6 1
 5.   N        2         4 1
 6.   C        2         4 1    7 1
 7.   C        0         6 1    8 2   12 1
 8.   C        1         7 2    9 1
 9.   C        1         8 1   10 2
10.   C        0         9 2   11 1   13 1
11.   C        1        10 1   12 2
12.   C        1        11 2    7 1
13.   O        1        10 1

[Figure: structure diagram of tyrosine with atoms numbered 1 to 13]


25 Internal Connection Table

usually “redundant”
• every bond shown twice, once for each atom
implemented as array of records
record for each atom might store
• atomic type
• hydrogen count
• formal charge
• 2D display co-ordinates
• bonds to neighbouring atoms
• etc.
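A record-per-atom table of this kind can be sketched as follows. This is an illustration only: the `AtomRecord` class, its field names and the `build_table` helper are hypothetical, and a dictionary keyed by atom number stands in for the array of records. The data are the tyrosine atoms and bonds from the “redundant” connection table on slide 24.

```python
from dataclasses import dataclass, field


@dataclass
class AtomRecord:
    """One record per atom (hypothetical field names)."""
    element: str
    h_count: int = 0
    charge: int = 0
    xy: tuple = (0.0, 0.0)                          # 2D display co-ordinates
    neighbours: list = field(default_factory=list)  # (atom number, bond order)


def build_table(atoms, bonds):
    """Build a redundant table: each bond is entered on both of its atoms."""
    table = {n: AtomRecord(element, h) for n, (element, h) in atoms.items()}
    for a, b, order in bonds:
        table[a].neighbours.append((b, order))
        table[b].neighbours.append((a, order))
    return table


# Tyrosine heavy atoms: atom number -> (element, hydrogen count)
ATOMS = {1: ("O", 1), 2: ("C", 0), 3: ("O", 0), 4: ("C", 1), 5: ("N", 2),
         6: ("C", 2), 7: ("C", 0), 8: ("C", 1), 9: ("C", 1), 10: ("C", 0),
         11: ("C", 1), 12: ("C", 1), 13: ("O", 1)}
# Each bond listed once: (atom, atom, bond order)
BONDS = [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
         (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
         (12, 7, 1), (10, 13, 1)]

table = build_table(ATOMS, BONDS)
for number, record in table.items():
    print(number, record.element, record.h_count, record.neighbours)
```

Storing every bond twice costs memory, but it means the neighbours of any atom can be read directly from its record without scanning a separate bond list, which is what most graph algorithms need.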


26 MDL Connection Table

proprietary file format developed by MDL
• http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
de facto standard for exchange of datasets
several different flavours and versions
• Molfile (single molecule)
• SDfile (set of molecules and data)
• RGfile (Markush structure)
• Rxnfile (single reaction)
• RDfile (set of reactions with data)
separates atoms and bonds into separate blocks


27 New MDL File Formats

Since this lecture was delivered on Oct 28, 2003, MDL have published details of a new file format called “XDfile”
• XML-based data format for transferring structure/reaction information with associated data
• built around existing MDL connection table formats
• can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
• can incorporate SMILES strings
Details available in MDL documentation at:
• http://www.mdl.com/downloads/public/ctfile/ctfile.jsp


28 MDL Connection Table

Header Block
• data on molecule name and file origin
• counts of atoms and bonds etc.

Tyrosine

-ISIS- 08220120432D

13 13 0 0 0 0 0 0 0 0999 V2000


29 MDL Connection Table

Atoms block
• one line per atom
• specifies X,Y,Z-coords, atom symbol, isotope, charge, stereo code etc.

    0.2459   -1.4736    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5815   -1.4724    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9944   -2.1872    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5810   -2.9037    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2495   -2.9008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6586   -2.1854    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4836   -2.1830    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9042   -2.1792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1027   -2.1870    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -3.1359   -1.1516    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9070   -2.1847    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4070   -2.6845    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4989   -1.5618    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0


30 MDL Connection Table

Bonds block
• one line per bond (each bond shown once)
• specifies row numbers for atoms, and codes for bond type, bond stereochemistry etc.

  1  2  2  0  0  0  0
  6  7  1  0  0  0  0
  3  4  2  0  0  0  0
  3  8  1  0  0  0  0
  4  5  1  0  0  0  0
  9 10  1  0  0  0  0
  2  3  1  0  0  0  0
  9 11  1  0  0  0  0
  5  6  2  0  0  0  0
 11 12  1  0  0  0  0
  6  1  1  0  0  0  0
 11 13  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
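The three blocks shown on slides 28 to 30 can be read back with a few lines of code. The sketch below is a simplification, not MDL's reader: `parse_molfile` is a hypothetical helper, the counts line is read by fixed columns (the atom and bond counts occupy the first two three-character fields), and atom and bond lines are split on whitespace, which works for this example but not for every V2000 file because the format is fixed-width. The exact spacing of the embedded text is assumed.

```python
MOLFILE = """Tyrosine
  -ISIS-  08220120432D

 13 13  0  0  0  0  0  0  0  0999 V2000
    0.2459   -1.4736    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5815   -1.4724    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.9944   -2.1872    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5810   -2.9037    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2495   -2.9008    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.6586   -2.1854    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4836   -2.1830    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.9042   -2.1792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.1027   -2.1870    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -3.1359   -1.1516    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9070   -2.1847    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4070   -2.6845    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -4.4989   -1.5618    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  6  7  1  0  0  0  0
  3  4  2  0  0  0  0
  3  8  1  0  0  0  0
  4  5  1  0  0  0  0
  9 10  1  0  0  0  0
  2  3  1  0  0  0  0
  9 11  1  0  0  0  0
  5  6  2  0  0  0  0
 11 12  1  0  0  0  0
  6  1  1  0  0  0  0
 11 13  2  0  0  0  0
  8  9  1  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Read name, atoms and bonds from a V2000 Molfile (simplified)."""
    lines = text.splitlines()
    name = lines[0].strip()
    counts = lines[3]                  # 3 header lines, then the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        f = line.split()
        atoms.append({"xyz": tuple(float(v) for v in f[:3]),
                      "symbol": f[3],
                      "stereo_parity": int(f[6])})
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        a, b, order = (int(v) for v in line.split()[:3])
        bonds.append((a, b, order))    # atom row numbers are 1-based
    return name, atoms, bonds


name, atoms, bonds = parse_molfile(MOLFILE)
print(name, len(atoms), len(bonds))
```

Unlike the redundant internal table, each bond appears only once here, so a program that needs neighbour lists has to build them after reading the file.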


31 Standard Connection Table Formats

different vendors have proprietary CT formats
many attempts to establish agreed “standard” formats
• no real general success
• different user communities have failed to coordinate efforts
• some standards exist in restricted areas
SMILES and MDL CT formats widely used
most popular programs read/write several different formats


32 Standard Connection Table Formats

Standard Molecular Data (SMD) format
• never gained wide acceptance
Protein Data Bank (PDB) format
Crystallographic Information File (CIF/mmCIF)
Molecular Information File (MIF)
• developed from SMD and compatible with CIF
Chemical Exchange Format (CXF)
• Chemical Abstracts Service


33 Standard Connection Table Formats

Chemical Markup Language (CML)
• uses principles of the eXtensible Markup Language (XML) protocol for data exchange using the Internet
• http://www.xml-cml.org
Chemical EXchange (CEX)
• exchange protocol for TCP/IP networks developed collaboratively by several organizations
• http://www.cgl.ucsf.edu/cex
Chemical MIME
• incorporates several popular formats into protocols for exchange of molecular structures as e-mail attachments
• http://www.ch.ic.ac.uk/chemime/


34 IUPAC Chemical Identifier (IChI)

Project being undertaken by International Union of Pure and Applied Chemistry
Intended to provide unique identifier for compounds, but with “chemical intelligence”
• based on connection table
• “canonicalised” (see lecture 3 on November 4)
• compacted to short alphanumerical string
http://www.iupac.org/projects/2000/2000-025-1-800.html
see also Dr Nicklaus’s lecture on Oct 16


35 Topological Graph Theory

branch of mathematics
• particularly useful in chemical informatics and in computer science generally
study of “graphs” which consist of
• a set of “nodes”
• a set of “edges” joining pairs of nodes


36 Properties of graphs

graphs are only about connectivity
• spatial position of nodes is irrelevant
• length of edges is irrelevant
• crossing edges are irrelevant


37 Properties of Graphs

nodes and edges can be “coloured” to distinguish them

[Figure: structure diagram of tyrosine]


38 Structure Diagrams as Graphs

2D structure diagrams very like topological graphs
• atoms correspond to nodes
• bonds correspond to edges
terminal hydrogen atoms are not normally shown as separate nodes (“implicit” hydrogens)
• reduces number of nodes by ~50%
• “hydrogen count” information used to colour neighbouring “heavy atom”
• separate nodes sometimes used for “special” hydrogens
  o deuterium, tritium
  o hydrogen bonded to more than one other atom
  o hydrogens attached to stereocentres
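The saving from implicit hydrogens can be checked on the tyrosine example. In this sketch each heavy-atom node is coloured with its element and hydrogen count (data taken from the connection table on slide 24), and `expand_hydrogens`, a hypothetical helper, turns the counts back into explicit hydrogen nodes so that the two graph sizes can be compared.

```python
# Hydrogen-suppressed graph of tyrosine.
# Node number -> (element, hydrogen count); the pair acts as the node colour.
NODES = {1: ("O", 1), 2: ("C", 0), 3: ("O", 0), 4: ("C", 1), 5: ("N", 2),
         6: ("C", 2), 7: ("C", 0), 8: ("C", 1), 9: ("C", 1), 10: ("C", 0),
         11: ("C", 1), 12: ("C", 1), 13: ("O", 1)}
# Edges between heavy atoms (bond orders omitted for this count).
EDGES = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8),
         (8, 9), (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]


def expand_hydrogens(nodes, edges):
    """Turn every implicit hydrogen into an explicit node and edge."""
    full_nodes = {n: element for n, (element, _h) in nodes.items()}
    full_edges = list(edges)
    next_id = max(nodes) + 1
    for n, (_element, h_count) in nodes.items():
        for _ in range(h_count):
            full_nodes[next_id] = "H"
            full_edges.append((n, next_id))
            next_id += 1
    return full_nodes, full_edges


full_nodes, full_edges = expand_hydrogens(NODES, EDGES)
saving = 1 - len(NODES) / len(full_nodes)
print(len(NODES), len(full_nodes), round(saving, 2))
```

For tyrosine the heavy-atom graph has 13 nodes against 24 with explicit hydrogens, a reduction of about 46%, in line with the ~50% figure quoted above.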


39 Advantages of using graphs

mathematical theory is well understood
graphs can be easily represented in computers
• many useful algorithms are known
identical graphs correspond to identical molecules
different graphs correspond to different molecules


40 Disadvantages of using graphs

analogy between chemical structures and graphs is not perfect
• identical graphs do not always mean identical molecules
• different graphs do not always mean different molecules
realities of chemical structures cause problems
• aromaticity
• stereochemistry
• tautomerism
• coordination compounds
• multi-centre bonds
• inorganic compounds
• macromolecules
• polymers
• incompletely-defined substances
many graph algorithms are inherently slow


41 Lecture 1: Conclusions

There are lots of ways of storing a chemical structure in a computer
• including different amounts of information
Most important ones are
• line notations (e.g. SMILES)
• connection tables (e.g. MDL Molfile)
• nomenclature
Structure diagrams used for input/output
Chemical structures can be regarded as topological graphs


42 Lecture 2: Topics to be Covered

Special problems of structure representation
• aromaticity and tautomerism
• multi-centre bonds
• stereochemistry and coordination compounds
• inorganic compounds
• macromolecules and polymers
• incompletely-defined substances
• Markush structures


43 Further reading

• A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics, Dordrecht: Kluwer, 2003

• J. Gasteiger and T. Engel (eds.), Chemoinformatics: A Textbook, Wiley-VCH, 2003

• J. Gasteiger (ed.), Handbook of Chemoinformatics: From Data to Knowledge, Wiley-VCH, 2003

o Vol 1, Chapter II (Representation of chemical compounds)