This article was downloaded by: [Soongsil University] On: 22 April 2012, At: 16:31 Publisher: Taylor & Francis Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK SAR and QSAR in Environmental Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/gsar20 yaInChI: Modified InChI string scheme for line notation of chemical structures Y.S. Cho a , K.T. No b & K.-H. Cho a a Department of Bioinformatics and Research Center for Integrative Basic Science, SoongSil University, Seoul, Korea b Department of Biotechnology, Yonsei University, Seoul, Korea Available online: 02 Apr 2012 To cite this article: Y.S. Cho, K.T. No & K.-H. Cho (2012): yaInChI: Modified InChI string scheme for line notation of chemical structures, SAR and QSAR in Environmental Research, 23:3-4, 237-255 To link to this article: http://dx.doi.org/10.1080/1062936X.2012.657677 PLEASE SCROLL DOWN FOR ARTICLE Full terms and conditions of use: http://www.tandfonline.com/page/terms-and- conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.


SAR and QSAR in Environmental Research, Vol. 23, Nos. 3–4, April–June 2012, 237–255

yaInChI: Modified InChI string scheme for line notation of chemical structures$£

Y.S. Cho (a), K.T. No (b) and K.-H. Cho (a)*

(a) Department of Bioinformatics and Research Center for Integrative Basic Science, SoongSil University, Seoul, Korea; (b) Department of Biotechnology, Yonsei University, Seoul, Korea

(Received 30 August 2011; in final form 19 October 2011)

A modified InChI (International Chemical Identifier) string scheme, yaInChI (yet another InChI), is suggested as a method for including the structural information of a given molecule, making it straightforward and more easily readable. The yaInChI scheme is applicable for checking structural identity with higher sensitivity and for generating three-dimensional (3-D) structures from the one-dimensional (1-D) string with less ambiguity than the general InChI method. The modifications in yaInChI cover non-rotatable single bonds, stereochemistry of organometallic compounds, allene and cumulene, and parity of atoms with a lone pair. Additionally, yaInChI better preserves the original information of the given input file (SDF) using the protonation information, hydrogen count +1, and original bond type, which are not considered or only restrictively considered in InChI and SMILES. When yaInChI is used to perform a duplication check on a 3-D chemical structure database, Ligand.Info, it shows more discriminating power than InChI. The structural information provided by yaInChI is in a compact format, making it a promising solution for handling large chemical structure databases.

Keywords: line notation; duplication check; InChI; chemical database; SMILES

1. Introduction

The ‘One compound-One name conundrum’, the problem of providing a unique name for each compound, has attracted major interest in chemistry and related fields. Presently, SMILES [1,2] and InChI [3–5] strings are the most widely used to provide compound names (or line notations). The original SMILES scheme was designed by Arthur and David Weininger in the 1980s [2] and has since been modified by others, affording several different algorithms for producing SMILES strings [6]. The SMILES strings share the same representation, but the generated strings differ from one algorithm to another depending on the string generation method and the canonicalization algorithm. The canonicalization algorithm is used to assign unique numbers to atoms in a molecule, independent of the order of atoms in a given input file. Additionally, SMILES is difficult to apply to molecules with complicated structures.

*Corresponding author. Email: [email protected]

$Dedicated to the memory of Professor Corwin H. Hansch (1918–2011).

£Presented at CMTPI 2011: Computational Methods in Toxicology and Pharmacology Integrating Internet Resources (Maribor, Slovenia, 3–7 September 2011).

ISSN 1062–936X print/ISSN 1029–046X online

© 2012 Taylor & Francis

http://dx.doi.org/10.1080/1062936X.2012.657677

http://www.tandfonline.com


In contrast, InChI was developed in cooperation with the International Union of Pure and Applied Chemistry (IUPAC) and the National Institute of Standards and Technology (NIST). InChI is the latest method for describing chemical structures as strings, and it overcomes the ambiguity associated with SMILES strings. The InChI string is derived from the chemical structure and uses a unique, layered and tautomer-friendly characterization. InChI was designed to assign one name to one compound using a universal canonical numbering system, affording a unique string. The layered format serves a variety of aims and allows tautomeric forms to be described within the InChI string. Moreover, compared with SMILES, InChI can easily be applied to molecules with complicated structures. However, the InChI string is not easily readable because InChI represents all bond types with a single dash (-), which does not provide the number or location of double or triple bonds in a molecule. This requires the user to understand orbital and valence theories, to know the number of hydrogen atoms attached to the central atom and to identify the charge in order to estimate bond types. Additionally, it is very difficult to determine the number of rings and their sizes using the InChI string. Both the InChI and SMILES systems have several limitations in describing chemical structures: non-rotatable single bonds, allene or cumulene, parity of atoms that have three branches and one lone pair such as amines, stereochemistry of metal-containing compounds, and generation of three-dimensional (3-D) structures from one-dimensional (1-D) strings. The advantages and disadvantages of SMILES and InChI are listed in Table 1 [7].

Table 1. Characteristics of SMILES and InChI methods.

[Structure diagram of the example compound not reproduced.]

| | SMILES | InChI |
|---|---|---|
| String | CCCS(=O)(=O)[N]1=NC=c2sc(C)cc2=B1O | InChI=1S/C9H13BN2O3S2/c1-3-4-17(14,15)12-10(13)8-5-7(2)16-9(8)6-11-12/h5-6,13H,3-4H2,1-2H3 |
| Characteristics | not unique; bond type is explicitly expressed; no tautomer information; difficult to generate string; ring information | unique; multiple layer information; supports tautomer information; low human readability; no ring information |

Large chemical databases often contain different compounds that share the same name, or the same compound stored under different names or IDs, which makes the database inefficient and difficult to use. The best way of checking molecule identity is by converting the 3-D structures to a 1-D string and then comparing the outputs; however, doing so requires more sophisticated methods. We suggest a modified InChI scheme, yaInChI, to overcome the limitations of the current InChI (ver. 1.03) and to include as much structural information as possible, in order to limit the ambiguity that is present in other methods. However, some ambiguity is necessary for compatibility with different conventions. The yaInChI method could be beneficial for purposes such as checking structural identity, improving readability and generating 3-D structures from the corresponding 1-D string. yaInChI is available at http://ebio.ssu.ac.kr/yaInChI.

2. Methods

To avoid ambiguity and to enhance the readability of InChI, we propose a modified version of InChI called yaInChI. yaInChI was developed based on the InChI scheme, which means yaInChI inherits most of the layers from InChI. Additionally, yaInChI contains a few more layers such as /bt, /nr, /en, /mt and /mh, and modifies the /c, /t, /fh, /p and /q layers to provide more structural information. An outline of the yaInChI layers and a comparison to InChI are presented in Table 2. The main purpose of the yaInChI system is to include more structural information in a structure file. With the yaInChI string, one can compare molecules very easily and convert them from a 1-D string to 3-D structures with less ambiguity.

2.1 Input file format

The input for yaInChI is the standard SDF (structure data file) format [8], except for the extra atom information in the eighth column of the atom block, which is not used in the standard SDF format (Figure 1). The standard SDF format does not present tautomer information, so some modifications were necessary. Tautomer-related mobile hydrogen information from a tautomer-detection program is reported in the eighth column. The tautomer information can be obtained from various algorithms [9].

The InChI system calculates the mobile hydrogen using an intrinsic tautomer detection algorithm based on balanced network searches (BNS) [10]. However, the accuracy of tautomer detection algorithms is still controversial, so instead of using a tautomer detection algorithm, yaInChI uses the tautomer information from the input file. Atoms that belong to the same mobile hydrogen group have the same number in the eighth column, as shown in Figure 1. If the tautomer detection program can distinguish the stability of the tautomers, information such as 1A, 1B and 1C, where the number represents the tautomer group and the letter represents the order of tautomer stability, can be added by the user. All of the x, y and z coordinates are required for optimum performance because some stereochemical outputs are estimated from the coordinates; for example, the configurations of metal-containing compounds and the four types of special double bond stereochemistry, which are described in the next section.

2.2 Stereochemistry of special double bonds

Generally, the stereochemistry of some special double bonds such as allene or cumulene, and of non-rotatable single bonds, is expressed as cis or trans based on the assumption that all atoms involved in the stereochemistry are planar. However, the dihedral angle of the molecule is sometimes closer to −90° or +90° than to 0° or 180°. If one molecule has a dihedral angle of 89° and another has 91°, they will end up as cis and trans conformations, respectively, with the prototypical cis-trans definition.


Instead, the yaInChI system uses four stereochemistry definitions to represent the stereochemistry in these special cases. Though seemingly complicated, the information is very useful for building a 3-D structure from a given 1-D string. The stereochemistry definitions used in this paper are presented in Table 3. Additionally, the yaInChI method does not represent the stereochemistry (/b, /en and /nr layers) of atoms in rings with fewer than seven members. The rings in a molecule are identified using RP-Path [11].
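The four-way classification of Table 3 can be illustrated with a minimal Python sketch. The function name `stereo_symbol` is hypothetical, and the treatment of the exact boundary values (45°, 135° and so on) follows one reading of the printed table, so it should be taken as an assumption rather than as the authors' implementation.

```python
def stereo_symbol(angle_deg: float) -> str:
    """Map a dihedral angle in degrees to one of the four Table 3 symbols."""
    # Wrap any input into the half-open interval (-180, 180].
    t = angle_deg % 360.0          # now in [0, 360)
    if t > 180.0:
        t -= 360.0                 # now in (-180, 180]
    if -45.0 < t <= 45.0:
        return "="                 # near-planar, cis-like
    if 45.0 < t <= 135.0:
        return "+"                 # roughly +90 degrees
    if -135.0 < t <= -45.0:
        return "-"                 # roughly -90 degrees
    return "%"                     # near-planar, trans-like


for angle in (89.0, 91.0, 0.0, 180.0, -90.0):
    print(angle, stereo_symbol(angle))
```

With this scheme, dihedral angles of 89° and 91° fall in the same class, which is the situation the cis-trans definition handles poorly.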

Table 2. Identification of yaInChI layers and comparison with InChI.

| Layer group | Layer | Meaning of layer | Difference between yaInChI and InChI | Included in canonicalization |
|---|---|---|---|---|
| 1. Main layer | /f | chemical formula | Not changed | No |
| | /c | connectivity | Modified, yaInChI specific /c layer | |
| | /h | hydrogen (include mobile hydrogen) | Not modified, but takes information from given SDF file | |
| 2. Charge layer | /q | net charge | Modified, net charge of molecule | No |
| | /p | protonation | Modified, information of all protonated atoms | No |
| 3. Stereo layer | /b | cis–trans double bond | Not changed | |
| | /en | allene or cumulene | Added, structural information of series of double bonds | |
| | /t | parity | Modified, includes atoms having three different branches with lone pair and four branches having three or four different branches | |
| | /nr | non-rotatable bond | Added, structural information of non-rotatable single bond | |
| | /mt | metal connectivity | Added, structural information of metal connectivity | |
| | /m | parity inverted to obtain relative stereo | Deleted | No |
| | /s | stereo type | Deleted | No |
| 4. Extra layer | /i | isotope | Not changed | |
| | /mh | tautomer-specific hydrogen | Added, original tautomer specific hydrogen information | No |
| | /fh | hydrogen count +1 | Added, original value of hydrogen count +1 column | No |
| | /bt | bond table | Added, bond information of given input | No |


Figure 1. SDF format column information. [Figure not reproduced.]

Table 3. Stereochemistry definitions for special double bonds.

| Dihedral angle (T) | Symbol |
|---|---|
| +45° < T ≤ 135° | + |
| −135° < T ≤ −45° | - |
| −45° < T ≤ 45° | = |
| T ≤ −135° or 135° < T | % |

2.3 Connectivity layer, /c

The /c layer contains the unique atom numbers and their connection table values based on the canonicalization process (described in Section 2.11). In InChI, the atom that has not only the smallest number of branches but also the smallest canonical number (see Section 2.11 for canonical numbers) is selected as the starting atom. The remaining atoms are ordered from the smallest canonical number using the connection table. In contrast, the /c layer of yaInChI uses the longest path among the all-pairs shortest paths found with the Floyd and Warshall algorithm [12] as the main connectivity string (main chain), providing a rough estimate of the molecular length. If the longest paths are of the same length, the path containing end-point atoms with the smallest number of branches is selected as the main chain. Then, the branch connectivity strings are added to the front of the main chain using a similar method as above based on the connection table data. The newly generated strings are merged into the previously generated string using parentheses. This process is repeated until all information of the connection table is used. Rings are expressed using the same number twice. In Table 4, ‘1(3-7(8)4)2-6-4’ means that atom numbers 1, 3, 7, 4, 6 and 2 make a six-membered ring, and ‘1-2-6-4-5’ is the longest path in the molecule. With this scheme, the length of the molecule, the number of rings and their sizes, the number of branches and the overall molecule shape can be visualized.
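The main-chain search described for the /c layer can be sketched as follows. This is a minimal illustration, not the authors' code: the bond list is read off the Table 4 connectivity string, the function names are hypothetical, and the tie-breaking rule for paths of equal length is deliberately left out, so the function only reports the candidate end-point pairs.

```python
from itertools import combinations

# Bonds read off the Table 4 connectivity string '1(3-7(8)4)2-6-4-5'.
BONDS = [(1, 3), (3, 7), (7, 8), (7, 4), (1, 2), (2, 6), (6, 4), (4, 5)]


def floyd_warshall(bonds):
    """Return the atom list and {(i, j): bonds on the shortest path}."""
    atoms = sorted({a for bond in bonds for a in bond})
    inf = float("inf")
    dist = {(i, j): (0 if i == j else inf) for i in atoms for j in atoms}
    for a, b in bonds:
        dist[a, b] = dist[b, a] = 1
    for k in atoms:
        for i in atoms:
            for j in atoms:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return atoms, dist


def main_chain_candidates(bonds):
    """Longest shortest-path length and the end-point pairs that reach it."""
    atoms, dist = floyd_warshall(bonds)
    longest = max(dist[i, j] for i, j in combinations(atoms, 2))
    ends = [(i, j) for i, j in combinations(atoms, 2) if dist[i, j] == longest]
    return longest, ends


length, end_points = main_chain_candidates(BONDS)
print(length, end_points)
```

For this molecule the longest shortest path has four bonds, and atoms 1 and 5, the ends of ‘1-2-6-4-5’, are among the candidate end points.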

2.4 Charge layer, /q and /p

The /q and /p layers in yaInChI store information on the net charge and the protons, respectively, of a given molecule. These definitions differ from those of the InChI system. InChI changes the charged state and bond types by adding extra protons to radicals, disconnecting salts and metals, and recalculating the formal charges according to the new state in the normalization step [13]. This process limits the original charge distribution information that InChI retains. The yaInChI system also uses a normalization step to neutralize the molecule charge distribution, but it maintains salt and metal structures, takes tautomer information from the input file and represents the original charge distribution in the /p layer. The original charge distribution information is provided by the SDF (the ‘proton’ column of the atom block). The use of the /p layer for keeping the protonation information of molecules (a) and (b) is shown in Table 4. InChI generates the same string for both (a) and (b); however, yaInChI generates a different string for each molecule according to the information contained in the input file. Information in /p could affect the /mh and /bt layers; therefore, when the /p, /mh and /bt layers are eliminated, yaInChI generates the same string for both molecules.

For the duplication check, the /p, /mt and /bt layers were not considered, but they could be included depending on the desired sensitivity. The net charge of the molecule in Table 4, column (a), is zero and therefore the /q layer does not appear in the string.

Table 4. Representation of charge information in yaInChI.(1)

[Structure diagrams not reproduced: (a) and (b) share the skeleton C1, C2, C3, C4, N5, N6, N7, O8; (a) is drawn with formal charges +1 and −1, (b) with formal charges 0 and 0.]

| Notation | Molecule | String |
|---|---|---|
| yaInChI | (a) | yaInChI=/fC4H5N3O/c1(3-7(8)4)2-6-4-5/h1-3H,(H2,5,6)/mh5H2/p7Y3,8Y5/bt44441441 |
| yaInChI | (b) | yaInChI=/fC4H5N3O/c1(3-7(8)4)2-6-4-5/h1-3H,(H2,5,6)/mh5-6H/bt21122112 |
| InChI | (a) and (b) | InChI=1S/C4H5N3O/c5-4-6-2-1-3-7(4)8/h1-3H,(H2,5,6) |

(1) Additionally, eliminating the /p, /mh and /bt layers from yaInChI results in the same output as InChI.

2.5 Cumulene layer, /en

InChI and SMILES can manage the stereochemistry of cumulene structures. The InChI system uses even and odd numbers of double bonds to determine the stereochemistry. An even number of double bonds is reported in the /t layer (parity layer) and suggests a tetrahedral structure, whereas an odd number of double bonds is reported in the /b layer (cis–trans layer) and indicates the cis–trans conformation [14]. However, in some cases, a cumulene could have a cis or trans conformation even though it has an even number of double bonds, and could have a tetrahedral conformation with an odd number of double bonds, due to the steric constraints of the entire molecule. Unfortunately, InChI cannot separate those cases correctly. Therefore, yaInChI utilizes the /en layer to avoid any uncertainty related to cumulene (Tables 5 and 6). The information in the /en layer is calculated from the dihedral angle defined by the two end atoms of the double bond series and, at each end, the attached atom with the larger canonical number. For example, the dihedral angle of the four atoms C1-C3-C12-C11 in Table 6 was measured. The angle definition of the /en layer is presented in Table 3. The information in the /en layer consists of two numbers and one symbol between them, which represent the atoms at both ends of the double bond series and one of the four types of stereochemistry, respectively.

Table 5. Misrepresentation of allene stereochemistry using InChI for (a) 1-[2-(1,1,4,8-Tetramethyl-nona-2,3,7-trienyl)-oxazolidin-3-yl]-ethanone and (b) 2-methyl-2,3-pentadien-1-amine.

[Structure diagrams not reproduced.]

| Notation | Molecule | String |
|---|---|---|
| yaInChI | (a) | yaInChI=/fC18H29NO2/c1-14(2)8-7-9-15(3)10-11-18(5,6)17(21-13-12-19)19-16(20)4/h8,11,17H,7,9,12-13H2,1-6H3/en11^15/t17-/nr19+16/bt111111112122111112111 |
| yaInChI | (b) | yaInChI=/fC6H11N/c1-3-4-6(2)5-7/h3H,5,7H2,1-2H3/en3^6/bt112211 |
| InChI | (a) | InChI=1S/C18H29NO2/c1-14(2)8-7-9-15(3)10-11-18(5,6)17-19(16(4)20)12-13-21-17/h8,11,17H,7,9,12-13H2,1-6H3/t10?,17-/m1/s1 |
| InChI | (b) | InChI=1S/C6H11N/c1-3-4-6(2)5-7/h3H,5,7H2,1-2H3 |
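Because the /en layer is derived from a dihedral angle measured on the input coordinates, a short sketch of that measurement may be useful. The code below is a generic signed-dihedral calculation, not taken from yaInChI; the function name and the sign convention (positive when the far atom is rotated clockwise as seen along the central axis) are assumptions. For a cumulene, `p1` and `p2` would be the two end atoms of the double bond series and `p0` and `p3` their chosen substituents.

```python
import math


def _sub(p, q):
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees, in the range [-180, 180]."""
    b0 = _sub(p0, p1)              # from p1 to p0
    b1 = _sub(p2, p1)              # central axis
    b2 = _sub(p3, p2)              # from p2 to p3
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Project b0 and b2 onto the plane perpendicular to the axis.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Idealized test geometry: substituents at right angles around the z axis.
print(dihedral_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

The resulting angle would then be mapped to one of the four symbols of Table 3.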

2.6 Parity layer, /t

The concept of parity is similar to chirality and provides information pertaining to the spatial direction of the four branches attached to the centre atom. Parity uses the canonical numbers of atoms instead of weights or branch priority. In InChI, only atoms having four different branches or centre atoms of even numbers of double bonds (cumulene) can have parity and are expressed in the /t layer, but in yaInChI, any sp3 atom, for example, an atom with three branches and one lone pair, is also included. The lone pair cannot change its position freely and thus is included in the parity layer, /t. The yaInChI system indicates the parity of atoms having either four or three different types of branches, to provide selectivity in situations such as N15 (Table 6). C13 (Table 6) has only three different types of branches; however, without displaying the parity on C13, the two molecules are not distinguishable. For reference, the /m and /s layers (the stereo options in InChI) are not used in yaInChI because they are subordinate to the /t layer, and stereoisomerism related to the /t layer can be distinguished without them. The symbols following the atom numbers, ‘+’ and ‘-’, indicate clockwise and counter-clockwise spatial arrangements of atoms with increasing canonical numbers, respectively. The lone pair has the lowest priority.

Table 6. Example of /en and /t layers for hypothetical isomers.(1)

[Structure diagrams not reproduced: (a) and (b) are two isomers built from atoms C1 to C13, N14 and N15.]

| Notation | Molecule | String |
|---|---|---|
| yaInChI | (a) | yaInChI=/fC13H22N2/c1-3-4-12(11-14-13-8-10-15)5-6-13-7-9-15-2/h3,14H,5-11H2,1-2H3/en3%12/t13-,15-/bt1122111111111111 |
| yaInChI | (b) | yaInChI=/fC13H22N2/c1-3-4-12(11-14-13-8-10-15)5-6-13-7-9-15-2/h3,14H,5-11H2,1-2H3/en3^12/t13-,15Y/bt1122111111111111 |
| InChI | (a) and (b) | InChI=1S/C13H22N2/c1-3-4-12-5-6-13(14-11-12)7-9-15(2)10-8-13/h3,14H,5-11H2,1-2H3 |

(1) The lone pair of N15 in (a) is closer to N14, whereas the lone pair of N15 in (b) is closer to C6. The yaInChI strings for these molecules are different in the /en and /t layers; however, InChI considers (a) and (b) to be the same.


2.7 Non-rotatable single bond layer, /nr

A non-rotatable single bond is a single bond that cannot rotate freely, such as a peptide bond in proteins. The peptide bond, C–N, is presented as a single bond in the SDF format, but it cannot rotate freely because the sp2–sp2 hybridization gives it double bond character. Molecules can have different stereochemistry around the C–N bond, cis and trans, with different properties. However, SMILES and InChI do not handle non-rotatable single bonds and consider these molecules to be the same. In contrast, the yaInChI system uses the /nr layer, which provides information about non-rotatable single bonds, including the amide group and sp2 carbons connected to three nitrogen atoms as in hydroxyl arginine. Because non-rotatable single bonds can have angles close to 90° and −90°, the /nr layer follows the four types of stereochemistry described in Section 2.2. Further, non-rotatable single bonds can exist in various forms within the same molecule; for example, an amide can transform to an imidic acid by tautomerization (Table 7). Therefore, yaInChI gives the same string for both (a) the imidic acid in the cis form and (b) the amide in the cis form, with non-rotatable single bond information. The InChI system indicates the cis form for both cases; however, it cannot distinguish the stereochemistry around the non-rotatable bonds if they are different, as in (c), the amide in the trans form.

The information in the /nr layer consists of two numbers and one symbol between the numbers, which represent the two atoms at both ends of the non-rotatable single bond and one of the stereochemistry cases, respectively.

2.8 Metal connectivity layer, /mt

In InChI, all metal atoms of organometallic compounds are disconnected in the main layer and are not considered as a part of the molecule. With the ‘reconnect’ option, the user is able to manage the metal connectivity but not the stereochemistry of the metal atom.

Table 7. Example of /nr layer related to tautomers of N-methylacetamide.(1)

[Structure diagrams not reproduced: each form is drawn with atoms C1, C2, C3, N4 and O5.]

| Notation | Molecule | String |
|---|---|---|
| yaInChI | (a) Imidic acid cis form | yaInChI=/fC3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5)/nr4^3/mh5H/bt1121 |
| yaInChI | (b) Amide cis form | yaInChI=/fC3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5)/nr4^3/mh4H/bt1112 |
| yaInChI | (c) Amide trans form | yaInChI=/fC3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5)/nr4%3/mh4H/bt1112 |
| InChI | (a), (b) and (c) | InChI=1S/C3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5) |

(1) The compounds are the same in (a) and (b), whereas (c) is a different compound from (a) and (b) because of the non-rotatable single bond. The user can determine the level of identification sensitivity by including or excluding the /nr and /mh layers.
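The idea of tuning the sensitivity of a duplication check by including or excluding layers can be sketched in Python. This is an illustrative sketch only: the helper names `split_layers` and `comparison_key` are hypothetical, the list of layer tags is taken from Table 2, and the three strings are copied as they are printed in Table 7.

```python
# Layer tags named in Table 2; two-letter tags are tried before one-letter ones.
TWO_LETTER = ("en", "nr", "mt", "mh", "fh", "bt")
ONE_LETTER = ("f", "c", "h", "q", "p", "b", "t", "i")


def split_layers(yainchi: str):
    """Split 'yaInChI=/f.../c...' into a list of (tag, payload) pairs."""
    prefix, _, body = yainchi.partition("=/")
    if prefix != "yaInChI":
        raise ValueError("not a yaInChI string")
    layers = []
    for chunk in body.split("/"):
        if chunk[:2] in TWO_LETTER:
            layers.append((chunk[:2], chunk[2:]))
        elif chunk[:1] in ONE_LETTER:
            layers.append((chunk[:1], chunk[1:]))
        else:
            raise ValueError(f"unknown layer: {chunk!r}")
    return layers


def comparison_key(yainchi: str, ignore=()):
    """Rebuild the string without the ignored layers, for duplicate checks."""
    kept = [tag + payload for tag, payload in split_layers(yainchi)
            if tag not in ignore]
    return "yaInChI=/" + "/".join(kept)


a = "yaInChI=/fC3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5)/nr4^3/mh5H/bt1121"
b = "yaInChI=/fC3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5)/nr4^3/mh4H/bt1112"
c = "yaInChI=/fC3H7NO/c1-3(5)4-2/h1-2H3,(H,4,5)/nr4%3/mh4H/bt1112"

# Ignoring the tautomer-specific layers merges (a) and (b) but keeps (c) apart.
print(comparison_key(a, ignore=("mh", "bt")) == comparison_key(b, ignore=("mh", "bt")))
print(comparison_key(b, ignore=("mh", "bt")) == comparison_key(c, ignore=("mh", "bt")))
```

Adding "nr" to the ignored layers would make all three strings compare equal, which corresponds to the lower sensitivity of plain InChI in this example.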


In contrast, yaInChI considers the stereochemistry of metals, distinguishes molecules having different metal connectivity and preserves the original structural information. Table 8 shows an example of organometallic compounds with different stereochemistry. Metals in molecules could have various hybridization states and geometries (shapes). The yaInChI system was devised to treat metals with up to six bonds, which could have nine different shapes in total (Table 9). The stereochemistry of distorted molecules was estimated using the provided atomic coordinates and was fitted to one of the nine shapes.

The first number in the /mt layer indicates the canonical number of the metal (centre atom) and the numbers after ‘:’ indicate the atoms attached to the centre atom. In the case of two and three branches, the different symbols between the numbers, such as ‘-’, ‘=’ and ‘_’, indicate different shapes. In the case of two, three and four branches, the first number after ‘:’ is always the smallest number among the attached atoms (in this case 1), the second number is the next atom in the clockwise direction, and so on. With five and six branches, the numbers in parentheses indicate the atoms in the plane, starting from the smallest number and followed by the other numbers in the clockwise direction. The number before ‘(’ is the axial atom with the smaller canonical number and the number after ‘)’ is the axial atom with the larger canonical number. The atoms in the plane and in the axial direction are estimated from the given atomic coordinates.
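The /mt notation can be read mechanically, as the following sketch illustrates. It is not the authors' parser: the function name `parse_mt` is hypothetical, and the mapping from separator symbol to shape name is an assumption taken from the column order of Table 9.

```python
import re

# (number of branches, separator) -> shape name, following the order of Table 9.
SHAPES = {
    (2, "-"): "bent", (2, "="): "linear",
    (3, "-"): "horn", (3, "="): "trigonal planar",
    (4, "-"): "tetrahedral", (4, "="): "plane", (4, "_"): "pyramid",
}

_AXIAL = re.compile(r"^(\d+):(\d+)\(([^()]+)\)(\d+)$")
_SIMPLE = re.compile(r"^(\d+):(.+)$")


def parse_mt(payload: str):
    """Parse one /mt entry such as '5:1_2_3_4' or '7:1(2=3=4=5)6'."""
    m = _AXIAL.match(payload)
    if m:  # five or six branches: axial(in-plane atoms)axial
        plane = [int(n) for n in re.findall(r"\d+", m.group(3))]
        shape = {3: "trigonal bipyramid", 4: "octahedral"}.get(len(plane), "unknown")
        return {"centre": int(m.group(1)), "shape": shape,
                "axial": (int(m.group(2)), int(m.group(4))), "plane": plane}
    m = _SIMPLE.match(payload)
    if not m:
        raise ValueError(f"cannot parse /mt entry: {payload!r}")
    atoms = [int(n) for n in re.findall(r"\d+", m.group(2))]
    separators = set(re.findall(r"[^\d]", m.group(2)))
    sep = separators.pop() if len(separators) == 1 else None
    return {"centre": int(m.group(1)),
            "shape": SHAPES.get((len(atoms), sep), "unknown"),
            "atoms": atoms}


print(parse_mt("7:1(2=3=4=5)6"))
print(parse_mt("5:1_2_3_4"))
```

The axial and in-plane atoms recovered this way are the kind of information a 3-D structure generator would need to place the ligands around the metal.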

One of the purposes of yaInChI is to use the string to generate a 3-D structure. If the initial structure is poor, the 3-D structure may have the wrong geometry regardless of whether energy minimization is used. This layer provides useful information when generating a 3-D structure from the 1-D string.

Table 8. yaInChI displays stereochemistry of metal-containing compounds.

[Structure diagrams not reproduced: (a) and (b) are two arrangements of the ligand atoms N1, N2, O3, O4 and P5 around Ru6.]

| Notation | Molecule | String |
|---|---|---|
| yaInChI | (a) | yaInChI=/fH8N2O2PRu/c1-6(3,4,5)2/h3-4H,1-2,5H2/q+2/mt6:1(2^3^5)4/p6+2/bt11111 |
| yaInChI | (b) | yaInChI=/fH8N2O2PRu/c1-6(3,4,5)2/h3-4H,1-2,5H2/q+2/mt6:3(1^2^5)4/p6+2/bt11111 |
| InChI (OB) | (a) and (b) | InChI=1S/2H2N.2H2O.H2P.Ru/h5*1H2;/q2*-1;;;-1;+7/p-2 |
| InChI (OB) with ‘reconnect’ option | (a) and (b) | InChI=1/2H2N.2H2O.H2P.Ru/h5*1H2;/q2*-1;;;-1;+7/p-2/rH8N2O2PRu/c1-6(2,3,4)5/h3-4H,1-2,5H2/q+2 |

Table 9. Types of hybridization of metal-containing compounds and notation using /mt layer.

[Shape diagrams not reproduced.]

| Number of branches | Metal connectivity type | /mt notation |
|---|---|---|
| 2 | Bent | 3:1-2 |
| 2 | Linear | 3:1=2 |
| 3 | Horn | 4:1-2-3 |
| 3 | Trigonal planar | 4:1=2=3 |
| 4 | Tetrahedral | 5:1-2-3-4 |
| 4 | Plane | 5:1=2=3=4 |
| 4 | Pyramid | 5:1_2_3_4 |
| 5 | Trigonal bipyramid | 6:1(2=3=4)5 |
| 6 | Octahedral | 7:1(2=3=4=5)6 |

2.9 Extra hydrogen layer, /mh and /fh

The extra hydrogen layer of yaInChI consists of two parts, /mh and /fh. The /mh layer represents the tautomer-specific hydrogen, which means the location of the hydrogen among the paired atoms in the tautomer parentheses (/h layer). A tautomer is an isomer of an organic compound that readily converts from one form to another at room temperature.

Representing extra hydrogen information due to tautomerization in yaInChI is somewhat different from InChI. The InChI system represents the mobile hydrogen groups for tautomers in the /h layer by placing paired atoms in parentheses. For example, ‘(H2,5,6)’ indicates that two hydrogen atoms are connected to the ‘N5’ or ‘N6’ atom and that these hydrogens can migrate from one location to the other (Table 4). InChI calculates mobile hydrogen using an intrinsic BNS-based tautomer detection algorithm [13]; however, the accuracy of tautomer-detection algorithms is questionable. The yaInChI system uses the tautomer information provided in the input file, as explained in Section 2.1. If a tautomer detection program is elaborate enough to distinguish the stability of tautomers, one can add more information such as 1A, 1B and 1C in the column. In that case, the numbers in the /mh layer follow the order of stability, not the order of canonical numbers. Various tautomer detection algorithms could be used according to the user’s choice.
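The mobile hydrogen groups written in parentheses in the /h layer can be extracted with a small regular expression. The sketch below is illustrative only; the function name is hypothetical and only the simple ‘(Hn,atom,atom,…)’ form quoted in the text is handled.

```python
import re

# Matches groups such as '(H2,5,6)' or '(H,4,5)'; a missing count means one hydrogen.
_MOBILE = re.compile(r"\(H(\d*),([\d,]+)\)")


def mobile_hydrogen_groups(h_layer: str):
    """Return [(hydrogen_count, [atom numbers])] for each mobile hydrogen group."""
    groups = []
    for count, atoms in _MOBILE.findall(h_layer):
        groups.append((int(count) if count else 1,
                       [int(a) for a in atoms.split(",")]))
    return groups


print(mobile_hydrogen_groups("1-3H,(H2,5,6)"))   # /h layer of Table 4
print(mobile_hydrogen_groups("1-2H3,(H,4,5)"))   # /h layer of Table 7
```

The /mh layer then records where, among the atoms of such a group, the hydrogen actually sits in the input structure.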

The /fh layer contains information on the hydrogen count +1 column (the fourth column of the extra atom information in the atom block; Figure 1), which gives the number of excess hydrogen atoms. InChI does not contain the information of the hydrogen count +1 column (Table 10); however, yaInChI includes this information and thus remains faithful to the input file.

2.10 Bond type layer, /bt

In the InChI scheme, the various types of bonds in a molecule are not explicitly presentedbecause it is impossible to present defined bond types when a molecule has tautomers andvarious protonated states. Bond types could be calculated with given information such as

Table 10. Representation of /fh layer in yaInChI.¹

[Structure diagrams (a) and (b): atoms labelled C1 to C7 and N8; in (a) an H is drawn on N8.]

(a) SDF atom and bond blocks:

  8  8  0  0  0  0  0  0  0  0999 V2000
    4.9974   -2.5284    0.3886 C   0  0  0  0  0  0  0  0
    4.0236   -1.4336    0.5516 C   0  0  0  0  0  0  0  0
    4.4382   -0.2142    0.9226 N   0  0  0  2  0  0  0  0
    3.3820    0.6084    1.0004 C   0  0  0  0  0  0  0  0
    3.5628    2.0101    1.4080 C   0  0  0  0  0  0  0  0
    2.2107   -0.0952    0.6709 C   0  0  0  0  0  0  0  0
    2.6317   -1.4059    0.3775 C   0  0  0  0  0  0  0  0
    1.7871   -2.5412   -0.0647 C   0  0  0  0  0  0  0  0
  1  2  1  0  0  0
  2  3  4  0  0  0
  3  4  4  0  0  0
  4  5  1  0  0  0
  4  6  4  0  0  0
  6  7  4  0  0  0
  2  7  4  0  0  0
  7  8  1  0  0  0

(b) SDF atom and bond blocks:

  8  8  0  0  0  0  0  0  0  0999 V2000
    4.9974   -2.5284    0.3886 C   0  0  0  0  0  0  0  0
    4.0236   -1.4336    0.5516 C   0  0  0  0  0  0  0  0
    4.4382   -0.2142    0.9226 N   0  0  0  0  0  0  0  0
    3.3820    0.6084    1.0004 C   0  0  0  0  0  0  0  0
    3.5628    2.0101    1.4080 C   0  0  0  0  0  0  0  0
    2.2107   -0.0952    0.6709 C   0  0  0  0  0  0  0  0
    2.6317   -1.4059    0.3775 C   0  0  0  0  0  0  0  0
    1.7871   -2.5412   -0.0647 C   0  0  0  0  0  0  0  0
  1  2  1  0  0  0
  2  3  4  0  0  0
  3  4  4  0  0  0
  4  5  1  0  0  0
  4  6  4  0  0  0
  6  7  4  0  0  0
  2  7  4  0  0  0
  7  8  1  0  0  0

yaInChI (a): yaInChI=/fC7H11N/c1-5(7-3)4-6(8-7)2/h4,8H,1-3H3/fh8H/bt11144444

yaInChI (b): yaInChI=/fC7H10N/c1-5(7-3)4-6(8-7)2/h4H,1-3H3/bt11144444

InChI (a) and (b): InChI=1S/C7H11N/c1-5-4-6(2)8-7(5)3/h4,8H,1-3H3

¹ The yaInChI system reads the hydrogen count+1 value from a column of the SDF format. The N8 atom of molecule (a) has '2' in the hydrogen count+1 column in the SDF, which means one excess hydrogen, whereas (b) does not. Neither InChI (IUPAC) nor InChI (OB) considers this information, so InChI generates the same string for both molecules.


atom types, number of attached hydrogen atoms and the charged state. However, in molecules with complicated structures, assigning bond types is ambiguous and aromaticity calculation from non-aromatic bond types is difficult.

The yaInChI system presents the bond types of the molecule in the /bt layer. With the /bt layer, the yaInChI string conserves the original bond type information, including tautomer-specific forms and charged states. The /bt layer is generated from the bond information in the SDF format. To generate the layer, the bonds are sorted in ascending lexicographical order: within each bond, the two atoms are first ordered by atom number, and the resulting atom pairs are then compared lexicographically, so that, for example, (1,2) < (2,3) and (3,4) < (3,5). The numbers in the bond type column (the third column of the bond information; Figure 1) range from 1 to 8 and follow the SDF format definition (1 = single, 2 = double, 3 = triple, 4 = aromatic, 5 = single or double, 6 = single or aromatic, 7 = double or aromatic, and 8 = any). The atoms joined by each bond do not need to be listed in the /bt layer because this information can be extracted from the connectivity layer, /c.
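The sorting rule can be illustrated with a minimal Python sketch. It assumes that the bonds are already expressed in canonical atom numbers (here read off the /c layer of the Table 10 molecule, /c1-5(7-3)4-6(8-7)2) and that each bond is given as a hypothetical (atom, atom, type) tuple; the function name `bt_layer` is illustrative and is not part of the yaInChI software.

```python
def bt_layer(bonds):
    """Build a /bt-style digit string from (atom1, atom2, bond_type) tuples."""
    # Order the two atoms inside each bond by atom number.
    normalized = [(min(a, b), max(a, b), t) for a, b, t in bonds]
    # Sort the atom pairs lexicographically, e.g. (1,2) < (2,3) and (3,4) < (3,5).
    normalized.sort(key=lambda bond: (bond[0], bond[1]))
    # Emit only the bond types; the pairs themselves are recoverable from /c.
    return "".join(str(t) for _, _, t in normalized)


# Bonds implied by /c1-5(7-3)4-6(8-7)2: three single bonds to the methyl
# carbons (atoms 1, 2, 3) and five aromatic bonds in the ring (atoms 4 to 8).
bonds = [
    (1, 5, 1), (5, 7, 4), (7, 3, 1), (5, 4, 4),
    (4, 6, 4), (6, 8, 4), (8, 7, 4), (6, 2, 1),
]

print(bt_layer(bonds))  # 11144444
```

The result agrees with the /bt11144444 layer listed for this molecule in Table 10.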

The purpose of the /bt layer is to display the bond type of a molecule given by SDF so that the information could be used to convert from a 1-D string to a 3-D structure with less ambiguity. Because the /bt layer may vary in different tautomer or protonation states, even with the bond type represented in the SDF file, this layer could be eliminated for duplication checks. Table 4 shows an example of a molecule with two different /bt layers.

2.11 Modified canonicalization algorithm

It is necessary to generate the same 1-D string for a molecule, even if the atoms are entered into the SDF in a different order. To do that, InChI uses a canonicalization algorithm, which algorithmically generates a set of unique atom labels (canonical numbers) [14]. The InChI canonicalization algorithm consists of four major steps:

(A) After removing all hydrogen atoms, atoms are labelled by considering atom name and number of connections.

(B) After adding hydrogen atoms to heavy atoms except mobile hydrogens, all atoms are re-labelled with hydrogen connection.

(C) After adding isotopic composition to the structure, the atoms are re-ordered.

(D) Finally the canonical numbers are obtained by considering stereochemistry.

The yaInChI system has several extra layers compared with InChI and therefore requires a modified canonicalization algorithm. Though similar to the InChI algorithm, the new algorithm includes the /en, /nr and /mt layers in Major Step D [14], together with the /b and /t layers (see Table 2). For the /en and /nr layers, the priority of symbols is '+' > '-' > '=' > '%', and for the /mt layer, the priority is '-' > '=' > '_'. (See Table 9 for the definition of '-', '=' and '_' in the /mt layer.)

Other added or modified layers such as /p, /mh, /fh and /bt were not included in the modified canonicalization algorithm because they represent different states rather than different molecules. Therefore, when applying the yaInChI string to canonicalization and duplication checks, the /p, /mh, /fh and /bt layers should not be considered. However, to better preserve the original information of the given input file, these layers should be included in the yaInChI string. The /q layer was also removed from the modified


canonicalization algorithm because it represents the state of the molecule and not a specific atom, but it could be included in the duplication check.
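A minimal Python sketch of how the state-dependent layers could be dropped before a duplication check is given below. It assumes that layers are separated by '/' and can be recognized by their prefixes, which holds for the example strings in this paper; the helper name `duplication_key` is hypothetical and does not come from the yaInChI software.

```python
# Prefixes of the layers that describe states rather than molecules.
STATE_LAYER_PREFIXES = ("p", "mh", "fh", "bt")


def duplication_key(yainchi):
    """Return the yaInChI string without the /p, /mh, /fh and /bt layers."""
    header, *layers = yainchi.split("/")
    # The formula layer starts with 'f' plus an uppercase element symbol,
    # so it is not confused with the lowercase 'fh' prefix.
    kept = [layer for layer in layers
            if not layer.startswith(STATE_LAYER_PREFIXES)]
    return "/".join([header, *kept])


full = "yaInChI=/fC7H11N/c1-5(7-3)4-6(8-7)2/h4,8H,1-3H3/fh8H/bt11144444"
print(duplication_key(full))
# yaInChI=/fC7H11N/c1-5(7-3)4-6(8-7)2/h4,8H,1-3H3
```

The full string is still kept alongside the key, so the original input information is not lost.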

2.12 Test chemical dataset

The Ligand.Info Meta Database (ver. 1.02) [15], which contains 1,159,274 molecules, was used to test the ability of yaInChI. Entries that lacked 3-D coordinates (1,244) or contained multiple molecules (16,298) were removed. Entries with the atomic symbol 'A' (947) were also eliminated because OpenBabel™ (ver. 2.3.0) [16], with which we wanted to compare, could not handle these molecules. This left 1,140,785 molecules. Finally, two hypothetical molecules (Table 8) were added because there were no examples in Ligand.Info that related to different organometallic stereochemistry, giving a test chemical dataset of 1,140,787 molecules. Ligand.Info was chosen because it provides a convincing collection of biologically active compounds with 3-D structures and because duplication checks using SMILES have been reported for it. Those results indicated that 1,016,389 compounds out of 1,159,224 compounds (87.7%) were unique [15].
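The dataset bookkeeping can be checked with a few lines of Python. This is only an arithmetic sketch using the counts quoted in this section; the variable names are illustrative.

```python
initial = 1_159_274  # molecules in Ligand.Info Meta Database (ver. 1.02)

removed = {
    "no 3-D coordinates": 1_244,
    "multiple molecules": 16_298,
    "atomic symbol 'A'": 947,
}

after_filtering = initial - sum(removed.values())
final_dataset = after_filtering + 2  # two hypothetical molecules from Table 8

print(after_filtering)  # 1140785
print(final_dataset)    # 1140787
```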

3. Results and discussion

The InChI system was originally developed in cooperation with IUPAC and NIST, and then implemented in other software. The original InChI (IUPAC) has several discrepancies between the manual [13] and the software (ver. 1.03). The InChI scheme implemented in OpenBabel™ (ver. 2.3.0), InChI(OB), has fewer bugs and was therefore used to test the yaInChI system for duplication.

In general, the yaInChI method is similar to InChI. However, yaInChI was developed to contain more information on stereochemistry by using the /en, /t, /nr and /mt layers, and to improve the amount of structural information by utilizing charge (/q and /p), extra hydrogen (/fh), tautomer (/mh), connectivity (/c) and bond type (/bt) layers. As mentioned previously, the /p, /fh, /mh and /bt layers were not considered in the yaInChI duplication check because they represent different states, not different molecules. In the case of the InChI system, the /p layer was included in the duplication check because it is defined differently.

The InChI and yaInChI methods generally consider tautomeric structures to be the same molecule. Because yaInChI takes the tautomer information from the input file, tautomer information from InChI(OB) was used for fair comparison. The test chemical dataset described in Section 2.12 was used to check redundancy. The results from the duplication check for both methods are presented in Table 11 and shown in Figure 2.

Table 11. Duplication check results for InChI and yaInChI.¹

Method     Total       Unique group   Duplicated group
InChI      1,140,787   998,016        142,771
yaInChI    1,140,787   998,076        142,711

¹ InChI strings are not modified, and the yaInChI strings do not include the /mh, /p, /fh and /bt layers for the duplication check.
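A duplication check of this kind can be sketched in Python by grouping molecules on their identifier strings. The sketch assumes that the "unique group" count is the number of distinct strings and that the "duplicated group" count is the remainder, which is consistent with the totals in Table 11 but is our reading rather than a stated definition; the function name and the toy records are hypothetical.

```python
from collections import defaultdict


def duplication_check(records):
    """Group (molecule_id, identifier_string) pairs and count duplicates."""
    groups = defaultdict(list)
    for mol_id, key in records:
        groups[key].append(mol_id)
    total = sum(len(ids) for ids in groups.values())
    unique = len(groups)
    return {"total": total, "unique": unique, "duplicated": total - unique}


# Toy input: m1 and m2 share a string, m3 is distinct.
toy = [("m1", "yaInChI=/fCH4/c1/h1H4"),
       ("m2", "yaInChI=/fCH4/c1/h1H4"),
       ("m3", "yaInChI=/fC2H6/c1-2/h1-2H3")]

print(duplication_check(toy))
# {'total': 3, 'unique': 2, 'duplicated': 1}
```

Under this reading, the yaInChI row of Table 11 satisfies 1,140,787 - 998,076 = 142,711.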


The number of unique molecules produced using yaInChI was larger than that produced using InChI because of the enhanced stereochemical representations. The yaInChI-specific layers, such as /t, /nr and /mt, distinguish the stereochemistry in greater detail. Tables 8 and 12 show some examples that InChI(OB) cannot distinguish correctly.

The yaInChI and InChI systems differed in 95 cases (Table 13), but the numerical difference between them is 94 cases (17 + 77 in Figure 2). In one case, yaInChI classified a group of four molecules into a, a, a and b, whereas InChI classified them into a, a, b and b. Although both methods classified the molecules into two groups, the groupings were different. The numbers of cases were 24, 1, 1, 3, 15 and 51, which were related to non-rotatable bonds (/nr), parity (/t), metal connectivity (/mt), charge (/q and /p layers of InChI), hydrogen information (/h) and aromaticity, respectively (Table 13). The differences relating to the /nr, /t and /mt layers were expected because InChI either does not consider the information (/nr and /mt) or gives it a different meaning (/t).
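The case of the four molecules shows that equal group counts do not imply equal groupings. A minimal Python sketch makes this concrete; the labels a and b are those used in the text, and the helper name `partition` is hypothetical.

```python
def partition(labels):
    """Turn a list of group labels into a set of frozensets of positions."""
    groups = {}
    for position, label in enumerate(labels):
        groups.setdefault(label, set()).add(position)
    return {frozenset(members) for members in groups.values()}


yainchi_groups = partition(["a", "a", "a", "b"])
inchi_groups = partition(["a", "a", "b", "b"])

# Both methods yield two groups, yet the groups contain different molecules.
print(len(yainchi_groups), len(inchi_groups))  # 2 2
print(yainchi_groups == inchi_groups)          # False
```

A comparison based only on the number of groups would therefore miss this difference.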

Table 12. Misrepresentation in hybridization and hydrogen number for InChI(OB).¹

[Structure diagrams (a) and (b): atoms labelled C1 to C13, N14 and N15.]

yaInChI (a): yaInChI=/fC13H14N2/c1(3-7-11(15-12)9)5-9-13(14)10(12-8-4)6-2-4/h2,4,6,8H,1,3,5,7H2,(H2,14,15)/mh14H2/bt11441414144444441

yaInChI (b): yaInChI=/fC13H10N2/c1(3-7-11(15-12)9)5-9-13(14)10(12-8-4)6-2-4/h1-8H,(H2,14,15)/mh14H2/bt44444444444444441

InChI (a) and (b): InChI=1S/C13H14N2/c14-13-9-5-1-3-7-11(9)15-12-8-4-2-6-10(12)13/h1,3,5,7H,2,4,6,8H2,(H2,14,15)

¹ The InChI system regards the two different molecules as the same, but (a) has sp3 carbons and (b) has no sp3 carbon. These molecules have 14 and 10 hydrogen atoms, respectively.

Figure 2. Venn diagram of duplication-check results. The number of cases for both InChI and yaInChI, the number of InChI-specific unique cases and the number of yaInChI-specific unique cases are 997,999, 17 and 77, respectively.


Molecules with a metal atom have a different charge string (/q and /p in InChI) in some cases because of the different normalization steps in yaInChI (Table 14).

The 15 cases related to the /h layer fall into two categories: 14 cases were related to hydrogen count+1 and one case was related to hydrogen number and location. The 14 cases arose because InChI does not utilize the information in the 'hydrogen count+1' column, whereas yaInChI represents it in the /h and /fh layers. Information from the 'hydrogen count+1' column could be implicitly indicated in the /h layer in InChI(OB), but was explicitly indicated using the yaInChI system.

The one remaining difference in the /h layer is due to a miscalculation of hydrogen atoms in the InChI run. After careful visual inspection, we concluded that the number and locations of hydrogen atoms given by yaInChI were correct.

The 51 differences related to aromaticity were due to misidentification of aromaticity by InChI(OB) (Table 12). The difference in aromaticity between the two methods was identified by comparing the number of hydrogen atoms attached to atoms in aromatic bonds. Determination of the aromaticity of a molecule was very complicated in some cases. Further investigation was completed by comparing the results with InChI (IUPAC). In 48 of the 51 cases, our result was identical to InChI (IUPAC); in the three remaining cases, InChI (IUPAC) displayed error messages. It seems that InChI (OB) has a bug in its aromaticity calculation. The yaInChI system takes the aromaticity information from the SDF file rather than calculating it.

The yaInChI string has a layered structure similar to that of the InChI system, which allows the level of sensitivity to be adjusted for the purpose of a duplication check. For example, the stereochemistry information layers (/b, /en, /t, /nr, /mt) can be included in or excluded from the duplication check. Excluding these layers provides similar results for both methods. The yaInChI system provides higher structural sensitivity and achieves various levels of sensitivity depending on the layers included. Several features were added to the basic InChI system, including non-rotatable single bonds, metal connectivity, stereochemistry of organometallic compounds, allene and cumulene, and atom parity with lone pairs. All of the protonation information, including hydrogen count+1 and original bond type, was incorporated into the yaInChI system to better preserve the original information. These were not considered, or were restricted, in the InChI and SMILES systems. The hydrogen count+1 and original bond type are not essential information, but are useful for generating 3-D structures from 1-D strings. Additionally, the yaInChI method uses four classes of stereochemistry notation for non-rotatable double bonds and allene or cumulene, which provides valuable molecular geometry information.

Table 13. Different cases between yaInChI and InChI (OB).

Cases                                                      Number of cases
Non-rotatable bond (/nr)                                   24
Parity (/t)                                                1
Metal connectivity (/mt)                                   1
Charge (/q)                                                3
Hydrogen information (/h): hydrogen count+1                14
Hydrogen information (/h): hydrogen number and location    1
Aromaticity                                                51
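The counts reported in Table 11, Table 13 and Figure 2 can be cross-checked with a short Python sketch. Only the figures quoted in the paper are used; the variable names are illustrative.

```python
# Table 13: number of differing cases per category.
table13 = {
    "non-rotatable bond (/nr)": 24,
    "parity (/t)": 1,
    "metal connectivity (/mt)": 1,
    "charge (/q)": 3,
    "hydrogen count+1": 14,
    "hydrogen number and location": 1,
    "aromaticity": 51,
}

# Figure 2: Venn diagram regions.
common, inchi_only, yainchi_only = 997_999, 17, 77

# Table 11: unique groups and dataset size.
inchi_unique, yainchi_unique, total = 998_016, 998_076, 1_140_787

print(sum(table13.values()))                   # 95
print(inchi_only + yainchi_only)               # 94
print(common + inchi_only == inchi_unique)     # True
print(common + yainchi_only == yainchi_unique) # True
print(round(100 * yainchi_unique / total, 1))  # 87.5
```

The Venn regions reproduce the unique-group totals of Table 11, and the yaInChI unique fraction matches the ca. 87.5% quoted for Ligand.Info.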


The duplication check results indicate that the yaInChI system was more discriminating than the other methods. Based on the yaInChI studies, the Ligand.Info and PubChem compound databases were ca. 87.5% (998,076 compounds) and 93% unique, respectively. The yaInChI system shows promise as a useful and efficient tool to eliminate redundancy in large

Table 14. Example of different normalization steps in yaInChI and InChI.¹

[Structure diagrams (a) and (b): atoms labelled C1 to C26, O27 to O30 and Sn31. In (a), O27, O28, O29 and O30 carry positive charges and C21 and C22 carry negative charges; in (b), only O27 and O30 carry positive charges.]

InChI(OB) (a): InChI=1S/2C7H6O2.2C6H5.Sn/c2*8-6-4-2-1-3-5-7(6)9;2*1-2-4-6-5-3-1;/h2*1-5H,(H,8,9);2*1-5H;/q;;2*-1;+4

InChI(OB) (b): InChI=1S/2C7H6O2.2C6H5.Sn/c2*8-6-4-2-1-3-5-7(6)9;2*1-2-4-6-5-3-1;/h2*1-5H,(H,8,9);2*1-5H;/q;;;;+4/p-2

yaInChI (a): yaInChI=/fC26H20O4Sn/c3(10-18-24-23)9-17-23-27-31(21-13-5-1-7-15-21,22-14-6-2-8-16-22,28-24,30-26-20-12-4)29-25(26)19-11-4/h1-20H/q+2/mt31:21(22=27=28=29)30/p21Y5,22Y5,27Y3,28-3,29Y3,30Y3/bt444412214444211244441221111211121111

yaInChI (b): yaInChI=/fC26H20O4Sn/c3(10-18-24-23)9-17-23-27-31(21-13-5-1-7-15-21,22-14-6-2-8-16-22,28-24,30-26-20-12-4)29-25(26)19-11-4/h1-20H/q+2/mt31:21(22=27=28=29)30/p27Y3,30Y3/bt444412214444211244441221111211121111

InChI (IUPAC) (a): InChI=1S/2C7H6O2.2C6H5.Sn/c2*8-6-4-2-1-3-5-7(6)9;2*1-2-4-6-5-3-1;/h2*1-5H,(H,8,9);2*1-5H;/q;;2*-1;+4

InChI (IUPAC) (b): InChI=1S/2C7H6O2.2C6H5.Sn/c2*8-6-4-2-1-3-5-7(6)9;2*1-2-4-6-5-3-1;/h2*1-5H,(H,8,9);2*1-5H;/q;;;;+4/p-2

InChI(OB) (reconnected metal) (a): InChI=1/2C7H6O2.2C6H5.Sn/c2*8-6-4-2-1-3-5-7(6)9;2*1-2-4-6-5-3-1;/h2*1-5H,(H,8,9);2*1-5H;/q;;2*-1;+4/rC26H22O4Sn/c1-5-13-21(14-6-1)31(22-15-7-2-8-16-22,27-23-17-9-3-10-18-24(23)28-31)29-25-19-11-4-12-20-26(25)30-31/h1-20,27,29H/q+2

InChI(OB) (reconnected metal) (b): InChI=1/2C7H6O2.2C6H5.Sn/c2*8-6-4-2-1-3-5-7(6)9;2*1-2-4-6-5-3-1;/h2*1-5H,(H,8,9);2*1-5H;/q;;;;+4/p-2/rC26H20O4Sn/c1-5-13-21(14-6-1)31(22-15-7-2-8-16-22,27-23-17-9-3-10-18-24(23)28-31)29-25-19-11-4-12-20-26(25)30-31/h1-20H/q+2

¹ The yaInChI system considers both molecules to be the same after charge normalization, whereas InChI does not.


chemical databases, which is very important in the fields of chemoinformatics and drug discovery.

4. Conclusions

The yaInChI system was developed to incorporate as much structural information as possible for a given molecule. The yaInChI method uses more layers than the prototypical InChI system, which allows structural identity in a large chemical database to be inspected with higher sensitivity. We applied yaInChI to several compound databases and found that yaInChI provided superior duplication results compared to InChI. Furthermore, with the /bt, /fh and modified /c layers, the yaInChI string is easier to read than InChI. Because yaInChI contains more structural information in a compact format than other methods, it could be used to generate 3-D structures with less ambiguity, which is important for large chemical databases that are affected by duplication. The yaInChI system reported here provides a readable, straightforward output with improved structural sensitivity. These advances show promise for future applications of yaInChI in large chemical databases.

Acknowledgements

This work was supported by the Human Resources Development of the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Ministry of Knowledge Economy, Republic of Korea (No. 20104010100610) and a grant of the Korea Healthcare Technology R & D Project, Ministry for Health, Welfare & Family Affairs, Republic of Korea (No. A100096).

References

[1] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 28 (1988), pp. 31–36.

[2] D. Weininger, A. Weininger, and J.L. Weininger, SMILES. 2. Algorithm for generation of unique SMILES notation, J. Chem. Inf. Comput. Sci. 29 (1989), pp. 97–101.

[3] S.R. Heller and A.D. McNaught, The IUPAC international chemical identifier (InChI), Chem. Int. 31 (2009), pp. 7–9.

[4] IUPAC, The IUPAC International Chemical Identifier (InChI™); available at http://www.iupac.org/inchi (last accessed July 2011).

[5] Murray-Rust Research Group, University of Cambridge, The Unofficial InChI FAQ; available at http://wwmm.ch.cam.ac.uk/inchifaq/ (last accessed July 2011).

[6] Daylight Chemical Information Systems Inc., SMILES – A Simplified Chemical Language; available at http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html (last accessed July 2011).

[7] B. Kosata, Comparison of InChI to other chemical formats; available at http://inchi.info/ (last accessed July 2011).

[8] A. Dalby, J.G. Nourse, W.D. Hounshell, A.K.I. Gushurst, D.L. Grier, B.A. Leland, and J. Laufer, Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited, J. Chem. Inf. Comput. Sci. 32 (1992), pp. 244–255.

[9] W.A. Warr, Tautomerism in chemical information management systems, J. Comput. Aid. Mol. Des. 24 (2010), pp. 497–520.


[10] W. Kocay and D. Stone, An algorithm for balanced flows, J. Comb. Math. Comb. Comput. 19 (1995), pp. 3–31.

[11] C.J. Lee, Y.M. Kang, K.H. Cho, and K.T. No, A robust method for searching the smallest set of smallest rings with a path-included distance matrix, Proc. Natl. Acad. Sci. U. S. A. 106 (2009), pp. 17355–17358.

[12] R.W. Floyd, Algorithm 97: Shortest Path, Commun. Ass. Comput. Mach. 5 (1962), p. 345.

[13] S.E. Stein, S.R. Heller, and D.V. Tchekhovskoi, IUPAC International Chemical Identifier (InChI) InChI version 1, software version 1.03 (2010) Technical Manual; available at http://www.iupac.org/inchi/download/version1.03/INCHI-1-DOC.zip (last accessed July 2011).

[14] R. Apodaca, InChI canonicalization algorithm; available at http://depth-first.com/articles/2006/08/12/inchi-canonicalization-algorithm (last accessed July 2011).

[15] M. von Grotthuss, G. Koczyk, J. Pas, L.S. Wyrwicz, and L. Rychlewski, Ligand.info small-molecule meta-database, Comb. Chem. High. Throughput Screening 7 (2004), pp. 757–761.

[16] The Open Babel Package, version 2.3.0; available at http://openbabel.org/wiki/Main_Page (last accessed July 2011).
