Syntax-directed translation of organic chemical formulas into their two-dimensional representation

SYNTAX-DIRECTED TRANSLATION OF ORGANIC CHEMICAL FORMULAS INTO THEIR TWO-DIMENSIONAL

REPRESENTATION

NEAL CARPENTER Feldberg Computer Center, Brandeis University, Waltham, MA02154, U.S.A.

(Received 23 May 1975)

Abstract-This paper describes a computer program which translates standard organic chemistry formulas into the,ir two-dimensional representation. The syntactic rules describing the input language as well as the semantic actions which perform the translation when a rule is recognized are described in the paper. The translator is operational on a PDP-10 computer and displays the formulas using a plotter.

(1) INTRODUCTION

In 1957 the International Union of Pure and Applied Chemistry (IUPAC) standardized a nomenclature for organic compounds which has been widely accepted among chemists (Hendrickson et al., 1970; RCNOC, 1960). This nomenclature provides the organic chemist with a basis for identification of numerous compounds. A strict nomenclature system is obviously required to eliminate ambiguities in characterizing a chemical formula. The IUPAC system satisfies this requirement.

Syntax-directed translation (Gries, 1971) can be advan- tageously used to transform formulas in the IUPAC nomenclature into their standard two-dimensional representation. The purpose of this paper is to describe how this translation can be accomplished. The availability of translation schemes of this kind constitute the first step towards the automatic formulation of organic reactions by computer.

(2) TRANSLATION

The syntactic rules for defining organic compounds according to the to the IUPAC nomenclature are presented in Table 1. According to these rules, one can define-compound names such as:

(1) 3-CHLORO-3-AMINO-2-PENT-4-ENONE (2) 2,CDIBROMO-CYCLO HEXANE (3) J-HEPTYNE.

The above strings (implicitly) contain the following information necessary for the translation:

(a) the length of the carbon chain (e.g. PENT indicating a carbon chain of length 5 in (l), HEX meaning a chain length of 6in (2));

(b) one or more unit prefixes, indicating the number of the carbon to which any functional* group is attached, followed by the functional group itself (e.g. a chlorine atom and amino group are attached to carbon 3 in (1));

(c) the carbon number and type of unsaturationt, which composes the primary suffix (e.g. a double bond at carbon 4 in (1); a triple bond at carbon 3 in (3));

(d) a secondary suffix again providing points of attachment of functional groups (e.g. carbon 2 is a carbonyi in (1));

*Functional groups are chemical groups which are attached to the carbon skeleton and are sites of chemical reactivity or function.

Wnsaturation refers to that state of a chemical compound which contdns double or triple bonds.

(e) singular or multiple functionality (e.g. 2 bromine atoms are present in (2)).

The translation process involves the execution of semantic actions once the elements of a rule are recognized while parsing (Gries, 1971). These semantic actions are informalLy described in Table 2.

The data structure utilized by the translator consists of a 2-dimensional table having four columns. Each time a new piece of information is translated, the corresponding entry is made in the next row of the table. The first column contains the position of attachment of any functional group. Column 2 is used to store the type of multiple bond (double or triple bond) if any exist in the given molecule. Column 3 contains the functional group itself. Column 4 stores secondary suffixes which, when present, would disturb the conventional numbering of carbon atoms in this particular method of display (e.g. COOH, CONH2, etc.). Besides the information stored in the table, the additional item needed to draw the formula is the length of the carbon chain. This length is determined by the semantic action attached to rule 7 (see Tables 1 and 2). The actual drawing proceeds in three steps. The carbon skeleton is first drawn followed by the attachment of the functional groups (if any exist) on the specified carbon atom. Lastly, multiple bonds are drawn into the molecule.

In order to illustrate the translation process, consider the input string: 2-HYDROXYJ,%DI CHLORO-1,4-OCT ADI-EN-6YN E. Notice that the blanks placed between basic symbols in this formula are automatically intro- duced as separators by the lexical scanner which is called by the syntactical analyzer (Gries, 1971). This string is parsed as shown in Fig. 1 and its translated 2-dimensional representation is displayed in Fig. 2. The corresponding data structure is given in Table 3. Notice that according to the semantic rules of Table 2 syntactically correct but semantically incorrect formulai are detected by the translator.

The described translation scheme has been implemented on a PDP-10 computer using a compiler generator (Cohen ef al., 1972), the semantic actions of Table 2 having been coded into FORTRAN. The reader will find in Pig. 3 several examples which illustrate the current capabilities of the translator.

(3) FINAL REMARKS

Along with the translation of IUPAC names to their two-dimensional representation, the reverse translation- the parsing of structural formulas to their corresponding

25

26 N. CARPENTER

Table 1

NO.

I

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16 17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

Rule

ccompound>::= <prafixa <carbon> <suffix> 1 <carbon> csufflux

<carbon>::- zcycllc=alkyl> 1

CUnSatUratfon>

wycllc-aIkyIr::= calkyl> 1 CYCLD <alkyl>

sa,ky,>::= HETH~ETH(PROP~BUT~PENl~HEX~HEPT~OCT~NON~DEC

cunsaturstlon>::= ~txo~rcCyClic=alkyl> w 1 .two'~ <pointer~~~cyslic=alkyl~ ATRI 1 <polnterrG<cycllc-alkyla 1 <cycllc=alkyl> i

<prec,n>t:= <prefln> anitprecix~ 1 <unitprefIx> I <unlabeled unitprefix>

<unitprePlx>:!. <pointer> = <functional gro"p,l ) <9roup Of functionat groups,z

<unlabeled unitprefix>::= <functional group, =

<s"ff,r>::= cprlmary suffix> <secOnd.ary suffix> / <unsaturated group> <secondary suffix> I cprimary suffix> <suffix group, <secondary suffix>I

<unsaturated group> csufflx group> <secondary suffix,

<unsaturated gro"pr ::- <unsaturated gtw"p> <suffix group>

<primary suffix> 1

csuffix group, <primary suffix>)

=<pr1mary suffix>

<primary suffix,:!- ANlENlfi

<secondary suffix>;:- =IONEIALiOIC ACIDlOlC AMIDElOATEl

ONITRILE(OVLIE

<S"ff,X group>::= L<t*icer I :ctllr, ces I - -ancer

<functional group>::= HVOROXV~FLUORO~CHLOROIBRONO~IODO~

ANINO~KETO~NITRO~CVANO~METHVLlMETHoXV~

ETHVL~ETHOXY~PROPYLlEUTYLlBUTOXYl

PENTVLlPROPoXV~

<group Of functlonal *Po"p">::- ctwicez <functional 9roup> / <thrice> <functional group,

<twice>::- <tno>~ x

etllr,ce>: :. <two> L epo1nter> r yQ conccs::= <pointerr; <pointer>::- intag*r

<two>::= <pointerr ~ ipolnterr

Remarks: 1. Basic symbols are underlined.

2. The variable sneezer denotes an unsigned in;;;~,x defined IS in most programming languages. variable Is recognized by a lexical sca"ne~.

3. The above grammar doer not contain self-embedding metavariabler and it Is therefore representable in a finlte-state form. Thus the semantic actions ;;;‘uie;ave been dr,ven by a f,nlte=state recognizer:

, the avaIlability of a compiler generator "as a ma,or factor in the decirlon to implement the

tranS1ItQr based on the abow BNF rules.

Syntax-directed translation of organic chemical formulas 27

Table 2 Rule Number ACtlOllS

6 No actlon is performed but notice that although cyclic compounds are not graphically dIsPlayed, they are accepted sentences in the language.

7 The length of the carbon chatn is determined.

36 Make new entry in column 1.

30 SQeCifiC functionality is entered in COlUnn 3. If there is no entry in tolumn 1 for that row, enter a '1' by defauIt (the first carbon is conventl~nally used rf no others are specified).

8,9 Checks are performed to ensure previous entries in column 1 when multiple functionality is recognized.

25 Make corresponding entires In column 2 af data str"ct"le.

26 Flag either column 3 or 4 depending upon the type of secondary suffix.

132 The information contained In the matrix Is pro- cessed and graphlcally displayed using a plotter. Procensing involves rearranging carbon attachment numbers, the length of the carbon chain and checkfng to ensure that there are 4 bonds to each carbon atom. If theSe requirements are not met. an mrror message is given. (The input string is syntactIcally correct but semantically incorrect.)

cdrefixs

<groupbf functional

#I3

\

groups,

#31

I/ cunit preflxz

/----+= _.

)_

II <

1 I . 4 _ ocr ADI _ EN _ 6 _ YN E

#I

Fig. 1.

Fig. 2.

chemical names was also implemented. Examples of this inverse translation are given in Fig. 4, Checking is again performed to ensure the detection of syntactically correct but semantically incorrect strings (e.g. CHY-CHz(CH,) = CW.

It should be mentioned that both translation schemes have their limitations. The IUPAC nomenclature system cannot be used to describe numerous compounds due to the abundance of common names such as benzene, naphthalene, etc., but these names can be stored in a data

Table 3

36,30

36.30

36,30

36,8,25

36,8,25

36,25

N. CARPENTER

NHz

2-HYDROXY-2-BROMO-3.5 DI CHLORO-%METHYL-%AMINO- 1,4-OCT ADI-EN-f%YN E

BR

Z-BROMO-3-CHLORO-t- HYDROXY-7,7,7-TRI NITRO- 3,5-HEFT AD1 EN OIC AMIDE

2-CHLORO-3-METHYL- 3-HYDROXY-4-BROMO- 2-BUT EN E 3-PENT EN OIC ACID

2-KETO-PENT AN E

Br

Ti?--- //

2-KETO-3-HYDROXY-3- 1,l ,l-TRI HYDROXY-

BROMO-4-HEX YN E PENT AN E

Fig. 3.

COOH-~H~-CH(N~~KH(CN)-CH~ W fl CH+CHINHZ)-CH(CL)kCOOH n W CH?.=CH-CRC-C#C-CH=CHZ n

3-NImO+-CYANO-PENTANOIC-ACID 2-CHLORO-3-AMINO-BUTANOIC-ACID I.TI)CTADI-EN-3.5-YNE

CH3-CH(OH)-CH[OH)-CW3 n m CH3<HKH3)-C(CL)=CH-C(BR)~BRl .CH3 n W CH210H~-CH~CLI-CH~C2HS)-CONH2 n

2.3~DIWDROXY-BUTANE 3CHLOR(t5.5~0lBROMO-~METHYL3-HEXENE 4-HYDROXY-3-CHLORO-2-ETHYL-ElUTAfWlC+.blDE

Fig.4.

base with their appropriate intermediate matrices which enable the drawing of their two-dimensional representation. Similarly, the grammar which describes the reverse translation does not span the wide range of structural formulas-only those formulas which are systematically convertible to IUPAC names are acceptable. However, this grammar could easily be modified to accept and transform any structural formula into its corresponding data structure.

usage in that context, the current programs can also be used as an educational tool in introductory chemistry courses.

It is felt that the ‘described translation schemes can be easily implemented and conveniently used as input- output parts of general organic chemistry programs designed to simulate chemical reactions. To this effect, it should be inentioned’ that the internal data structure utilized by the translator can be readily transformed into an equivalent form (described in Hendrickson, 1971) and which is suitable for specifying reactions. Besides its

Acknowledgemenfs-The author wishes to thank Prof. Jacques Cohen for his guidance in the preparation of this paper and to Prof. James B. Hendrickson whose conrse in Organic Chemistry inspired me to do the work descdbed in this paper.

REFERENCES

Cohen, J. (1972), A Compiler Generator, Brandeis Univ. Cities, D. (1971), Compiler Construction for Digital Computers,

New York, Wiley. Hendrickson, J. B. (%‘I), 1. Amer. Chem. Sot. 93, 6847. Hendrickson, ,J. ‘B., Cram, D. J. & Hammond, G. S. (l%‘O),

Organic Chemistty, New York, McGraw-Hill. Report of the Commission on the Nomenclature of Organic

Chemistry, (1960), _I. Amer. Chem. Sot. 82, 5546.

Documents

Syntax-directed translation of organic chemical formulas into their two-dimensional representation