DOI: 10.1002/minf.201100110

Automatic Perception of Chemical Similarities Between Metabolic Pathways

Diogo A. R. S. Latino*[a, b] and João Aires-de-Sousa*[a]

Dedicated to Professor Fernando M. S. Silva Fernandes

1 Introduction

With the advent of the genomic era, information is becoming available for a large number of metabolic pathways, and their full reconstruction is envisaged for a great diversity of organisms. The information concerning the respective genes, proteins, metabolites and chemical reactions is available in databases such as KEGG and MetaCyc.[1,2] New chemo- and bioinformatics approaches are needed to analyze these data, improving our understanding of the biochemical machinery of organisms. The numerical representation of these different levels of information enables automatic learning methods to assist in the process.

Metabolic pathways are at the crossroad between the chemical world of small molecules and the biological world of enzymes, genes and regulation. Methods for the codification and automatic comparison of metabolic pathways are required to identify key differences and similarities among organisms, with application, e.g., in the discovery of antimicrobial agents, in the design of biosynthetic processes, or in evolutionary studies identifying common pieces of biochemical machinery in different organisms.[3,4] Moreover, the analysis of results of horizontal gene transfer between organisms of different phylogenetic domains seems to indicate that phylogenetic reconstruction cannot be done exclusively from genomic data. Methods to assess the similarity of metabolic pathways and reactomes of different organisms can therefore assist in phylogenetic reconstruction tasks.

Several methods have been proposed for the alignment and comparison of metabolic pathways.[5–7] Giuliani et al.[8] represented a metabolic pathway by a matrix. In their approach, each element of the matrix indicates whether or not one metabolite is transformed into another, and metabolic pathways are compared by comparing the matrices. Pinter et al.[9] proposed a method for the alignment of metabolic pathways that advantageously takes into account similarities between enzymatic reactions classified in terms of EC (Enzyme Commission) numbers. In this case, a metabolic pathway is represented by a mathematical graph and each node in the graph represents an enzyme. More recently, the same lab[10] presented a study that applies the previous approach for the alignment of metabolic pathways to the analysis of co-evolution of metabolic pathways and to the inference of phylogeny from metabolic pathways.

[a] D. A. R. S. Latino, J. Aires-de-Sousa
REQUIMTE and CQFB, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa
Monte de Caparica, 2829-516 Caparica, Portugal
fax: (+351) 212948550
*e-mail: [email protected], [email protected]

[b] D. A. R. S. Latino
CCMM, Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade de Lisboa
Campo Grande, 1749-016 Lisboa, Portugal

Supporting Information for this article is available on the WWW under http://dx.doi.org/10.1002/minf.201100110.

Abstract: Metabolic pathways are at the crossroad between the chemical world of small molecules and the biological world of enzymes, genes and regulation. Methods for their processing are therefore required for a great variety of applications. The work presented here reports a new method to encode metabolic pathways and reactomes of organisms based on the MOLMAP approach. Pathways are represented from features of the metabolites involved in their reactions, which makes it possible to automatically perceive chemical similarities without making use of EC numbers. MOLMAP descriptors are based on atomic topological and physicochemical features of the bonds involved in reactions. The results show that self-organizing maps (SOM) can be trained with MOLMAPs of pathways to automatically recognize similarities between pathways of the same type of metabolism. The study also illustrates the possibility of applying the MOLMAP methodology at progressively higher levels of complexity, bridging chemical and biological information, and going all the way from atomic properties to the classification of organisms.

Keywords: Bioinformatics · Chemoinformatics · Metabolic pathways · Neural networks · Self-organizing maps

Mol. Inf. 2012, 31, 135–144 © 2012 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim


Also concerning phylogenetic similarity, Clemente et al.[11,12] calculated the structural similarity of metabolic pathways for a set of organisms based on the similarity of the enzymes that participate in the metabolic pathways. Application of clustering methods enabled the reconstruction of robust phylogenetic trees from metabolic pathways without genomic data.

Mazurie et al.[13] represented metabolomes with a Network of Interacting Pathways (NIP). The information concerning the metabolome of an organism is compressed in such a way that metabolic pathways are overlapped. Nodes represent pathways and links represent the metabolites exchanged by the pathways. The NIP method makes it possible to extract information concerning evolutionary events and to accurately predict phylogenetic distances between species.

Chang et al.[14] represented an organism by a vector of substrate-product relationships. This codification is used as input to obtain the phylogenetic trees. A metabolic pathway is represented as a hypergraph of substrate-product relationships for each enzymatic reaction. This approach could be used, combined with genome-based phylogenetic reconstruction methods, to assist in a better understanding of species-environment interactions.

Despite the recent advances, representations of pathways generally do not use information about the physicochemical properties of the reaction centers and thus do not take into account their intrinsic organic chemistry features. Macchiarulo et al.[15] mapped human metabolic pathways in the chemical space of small molecules to study the extent of clustering and overlap of pathways, to determine the proximity of a metabolic pathway to a given small molecule, and to estimate the proximity of drugs to human metabolic pathways.

Here we report a chemoinformatics approach to encode and compare metabolic pathways based on an extension of the MOLMAP (MOLecular Map of Atom-level Properties) strategy to represent chemical reactions.[16] MOLMAPs were originally developed to represent the chemical bonds existing in a molecule. A Kohonen Self-Organizing Map (SOM) is trained with a diversity of bonds that are distributed over a 2D surface of neurons, according to similarities in their features. Topological and physicochemical bond features were used. The pattern of neurons activated by the bonds of a molecule is a representation of that molecule. In a chemical reaction, the difference between the MOLMAPs of the products and the MOLMAPs of the reactants was proposed to represent the structural changes occurring in the reaction. It is a numerical fixed-length code of a reaction that can be further processed by machine learning methods.[17–19]

The further application of this method to the representation and mapping of other levels of metabolic information is here reported and illustrated. Using the MOLMAP concept, a new SOM is trained with a data set of metabolic reactions, and the pattern of neurons activated by the reactions of a metabolic pathway is a descriptor of that pathway. With a fixed-length descriptor of pathways, SOMs can then be trained to classify pathways. Furthermore, the pattern of neurons activated by the reactions of an organism can be used as a representation of the biochemical machinery of the organism, and this descriptor can be used to classify organisms. SOMs are sequentially built to organize three different levels of information: maps of metabolic pathways (or organisms) on top of maps of chemical reactions, and these on top of maps of chemical bonds. It must be emphasized that the three SOMs are independent of each other, have different goals and are trained with different types of objects (chemical bonds, chemical reactions and metabolic pathways or organisms) (Scheme 1).

2 Methodology and Computational Details

2.1 Data Sets

Metabolites in MDL .mol format and metabolic reactions were extracted from the KEGG LIGAND database[1,20,21] (release 40.0). The information concerning the reactions that participate in each metabolic pathway was extracted from the KEGG PATHWAY database.

The data set of enzymatic reactions used was validated in previous studies.[18,19] The original data set from the KEGG LIGAND database consists of 6810 reactions: 445 reactions listed with more than one assigned EC number, 959 reactions without an assigned EC number, 536 reactions with incomplete EC numbers and 4870 reactions with one EC number. The data set of metabolic reactions was pre-processed in the following way. In some reactions, general fragment symbols were replaced, such as 'X' by a chlorine atom, or 'R' by methyl, adenine, cytosine or another type of fragment depending on the reaction. Reactions were removed if they involved a compound for which the software employed for the calculation of descriptors produced no output (STANDARDIZER and CXCALC tools from the JChem package, ChemAxon, Budapest, Hungary).

Only the pathways with more than 75 % of the reactions encoded were used. Types of metabolism with a small number of metabolic pathways were not included in the experiments.
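The 75 % selection rule can be illustrated with a short sketch. The pathway and reaction identifiers below are invented toy values, and the function names are hypothetical; only the strict "more than 75 %" threshold comes from the text.

```python
def encoded_fraction(pathway_reactions, encoded_reactions):
    """Fraction of a pathway's reactions for which a descriptor is available."""
    reactions = set(pathway_reactions)
    if not reactions:
        return 0.0
    return len(reactions & set(encoded_reactions)) / len(reactions)


def select_pathways(pathways, encoded_reactions, threshold=0.75):
    """Keep only pathways with MORE than `threshold` of their reactions encoded."""
    return {
        pid: rxns
        for pid, rxns in pathways.items()
        if encoded_fraction(rxns, encoded_reactions) > threshold
    }


# Toy data (hypothetical identifiers, not taken from KEGG).
pathways = {
    "P1": ["R1", "R2", "R3", "R4"],  # 4 of 4 encoded -> kept
    "P2": ["R1", "R5", "R6", "R7"],  # 3 of 4 encoded (exactly 75 %) -> dropped
}
encoded = {"R1", "R2", "R3", "R4", "R5", "R6"}
selected = select_pathways(pathways, encoded)
```

In this toy case a pathway with exactly 75 % of its reactions encoded is excluded, because the criterion is read here as a strict inequality.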

The KEGG classification of pathways was followed, with metabolic pathways classified into six different types of metabolism. The experiments were performed with 92 metabolic pathways: 16 of carbohydrate metabolism, 12 of lipid metabolism, 18 of xenobiotics biodegradation and metabolism, 24 of amino acid metabolism and metabolism of other amino acids (considered in the same class), 12 of biosynthesis of secondary metabolites and 10 of metabolism of cofactors and vitamins.

The experiments with reactomes of organisms were performed with the lists of metabolic reactions for each organism extracted from the KEGG GENES database (release 41.0), restricted to fully sequenced prokaryotes, and classification was considered at the third taxonomy level. Table 1 presents the number of organisms by group.

A total of 347 organisms, classified in 16 groups, were used in the experiments to map and classify organisms in terms of their third level of taxonomy.

2.2 Kohonen Self-Organizing Maps

SOMs[22] are the key machine learning method in this study. They were used for three independent tasks: the classification of chemical bonds for the generation of MOLMAP molecular descriptors, the classification of metabolic reactions for the generation of MOLMAP-based metabolic pathway descriptors, and the classification of metabolic pathways (or organisms). It should be emphasized that the three SOMs are independent and were trained with different objects (chemical bonds, chemical reactions and metabolic pathways/organisms).

A Kohonen SOM[22] is an unsupervised method that maps multidimensional objects, on the basis of their descriptors, into a 2D surface (a map) consisting of a grid of neurons. SOMs are able to reveal similarities between objects in such a way that similar objects are mapped into the same or neighboring neurons in the map.

SOMs with toroidal topology were used in the experiments presented here. A linearly decreasing triangular scaling function was used in the training, with an initial learning rate of 0.1 and an initial learning span between 3 and 7. The winning neuron was selected using the minimum Euclidean distance between the input vector and the neuron weights. In most of the experiments, 50 or 75 cycles were used to perform the training, with the learning span and the learning rate decreasing linearly to zero. SOMs were implemented throughout this study with an in-house developed Java application derived from the JATOON Java applets.[23]
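The training scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the in-house Java application: the random weight initialization, the use of the Chebyshev (square-ring) distance on the grid, the exact form of the triangular function, and all function names are assumptions made for the example.

```python
import numpy as np


def toroidal_grid_distance(rows, cols, winner):
    """Chebyshev distance of every neuron to `winner` on a wrap-around grid.

    Assumption: neighbourhoods are square rings around the winning neuron.
    """
    r = np.abs(np.arange(rows) - winner[0])
    c = np.abs(np.arange(cols) - winner[1])
    r = np.minimum(r, rows - r)  # wrap top <-> bottom
    c = np.minimum(c, cols - c)  # wrap left <-> right
    return np.maximum(r[:, None], c[None, :])


def find_winner(weights, x):
    """Grid position of the neuron with minimum Euclidean distance to x."""
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)


def train_som(data, rows, cols, cycles=50, rate0=0.1, span0=3, seed=0):
    """Train a toroidal SOM; rate and span decrease linearly over the cycles."""
    rng = np.random.default_rng(seed)
    # Assumption: weights start as uniform random numbers in [0, 1).
    weights = rng.random((rows, cols, data.shape[1]))
    for cycle in range(cycles):
        decay = 1.0 - cycle / cycles
        rate = rate0 * decay
        span = span0 * decay
        for x in data[rng.permutation(len(data))]:
            winner = find_winner(weights, x)
            dist = toroidal_grid_distance(rows, cols, winner)
            # Triangular scaling: 1 at the winner, falling linearly to 0
            # just beyond the current learning span.
            scale = np.clip(1.0 - dist / (span + 1.0), 0.0, None)
            weights += rate * scale[:, :, None] * (x - weights)
    return weights
```

A call such as `train_som(descriptors, rows=12, cols=12, cycles=50, span0=3)` would correspond to the map size used later for pathways, with `descriptors` a hypothetical array holding one row per object.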

Scheme 1. Simplified illustration of the procedure to classify metabolic pathways using a hierarchy of SOMs trained with different types of information (pathways, reactions, and bonds).

Table 1. Number of organisms in each group defined at the third taxonomy level.

| Classification at the third taxonomy level | Class[a] | Number[b] |
|---|---|---|
| Prokaryotes/Bacteria/Proteobacteria | A | 165 |
| Prokaryotes/Bacteria/Acidobacteria | B | 1 |
| Prokaryotes/Bacteria/Firmicutes | C | 81 |
| Prokaryotes/Bacteria/Actinobacteria | D | 23 |
| Prokaryotes/Bacteria/Fusobacteria/Fusobacteria | E | 1 |
| Prokaryotes/Bacteria/Planctomyces/Planctomyces | F | 1 |
| Prokaryotes/Bacteria/Chlamydia | G | 10 |
| Prokaryotes/Bacteria/Spirochete | H | 6 |
| Prokaryotes/Bacteria/Cyanobacteria | I | 17 |
| Prokaryotes/Bacteria/Bacteroides | J | 5 |
| Prokaryotes/Bacteria/Green Sulfur Bacteria | K | 3 |
| Prokaryotes/Bacteria/Green Non Sulfur Bacteria | L | 2 |
| Prokaryotes/Bacteria/Deinococcus-thermus | M | 4 |
| Prokaryotes/Bacteria/Hyperthermophilic Bacteria | N | 2 |
| Prokaryotes/Archaea/Euryarchaeota | O | 21 |
| Prokaryotes/Archaea/Crenarchaeota | P | 5 |

[a] Class: Labels used in Kohonen SOMs. [b] Number: Number of organisms in each taxonomic class.


2.3 MOLMAP Reaction Descriptors

The generation of MOLMAP molecular descriptors is based on a SOM that distributes chemical bonds through a grid of neurons. The chemical bonds are represented by topological and physicochemical features (Table S1 in Supporting Information). The SOM is trained with a diverse set of chemical bonds representative of the metabolic reaction chemical space.

After the training of the Kohonen SOM, the bonds existing in a molecule can be represented as a whole by mapping all the bonds of that molecule onto the SOM previously trained with a diversity of bonds. The pattern of activated neurons is a representation of the bonds available in the molecule, and it was used as a molecular descriptor (MOLMAP). From this molecular descriptor, a reaction MOLMAP is obtained as the difference between the molecular MOLMAPs of the products and those of the reactants. It represents the changes operated by the reaction and can be interpreted as a fingerprint of the reaction. Further details on MOLMAP reaction descriptors are available in the literature.[16–19]
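As a hedged illustration of the difference operation, the sketch below treats each molecular MOLMAP as a plain numeric vector and sums the vectors on each side of the reaction before subtracting. The three-element vectors are toy values chosen for readability (real molecular MOLMAPs in this work have 625 elements), and the function name is hypothetical.

```python
import numpy as np


def reaction_molmap(reactant_molmaps, product_molmaps):
    """Reaction descriptor = sum of product MOLMAPs - sum of reactant MOLMAPs."""
    products = np.sum(np.asarray(product_molmaps, dtype=float), axis=0)
    reactants = np.sum(np.asarray(reactant_molmaps, dtype=float), axis=0)
    return products - reactants


# Toy molecular MOLMAPs with three neurons each (illustrative values only).
reactants = [[1, 0, 2], [0, 1, 0]]
products = [[1, 1, 0]]
descriptor = reaction_molmap(reactants, products)
```

Neurons activated equally on both sides cancel out, so only the bonds that change contribute to the descriptor; negative entries correspond to bond types lost in the reaction and positive entries to bond types formed.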

2.4 MOLMAP-Based Encoding of Metabolic Pathways

The metabolic pathway descriptor was derived from physicochemical and topological atomic features of metabolites. MOLMAPs of metabolites and metabolic reactions were first calculated and were then processed to derive MOLMAP descriptors of metabolic pathways. Scheme 2 shows an illustration of the procedure.

The following steps describe the generation of the metabolic pathway descriptor from atomic features of metabolites:

1. A data set of 1568 bonds, from a diverse representative set of molecules involved in enzymatic reactions, was extracted from metabolites using the Ward method. For further details concerning the Ward method and the selection of the data set of bonds see the literature.[18]

2. The original molfiles of the compounds were treated with the JChem Standardizer tool to add hydrogens, aromatize and clean stereochemistry. Then empirical physicochemical and topological descriptors were calculated for all compounds from properties calculated with JChem.

3. The physicochemical descriptors were z-normalized based on the whole data set of chemical bonds to make them equally relevant. Topological descriptors 1–14 and 35–41 were multiplied by 3 (Table S1 in Supporting Information).

4. A Kohonen SOM of size 25 × 25 was trained with the data set of 1568 chemical bonds represented by a set of 55 physicochemical and topological descriptors (Table S1 in Supporting Information). The objects of this SOM are chemical bonds represented by physicochemical and topological bond properties.

5. All the bonds in one compound were submitted to the previously trained SOM (SOM A in Scheme 2). The pattern of neurons activated by the bonds of one structure is a molecular descriptor, the molecular MOLMAP (step a. in Scheme 2).

6. The frequency of activation is counted for each neuron. The map (matrix) is then transformed into a vector by concatenation of columns. To account for the relationship between the similarity of chemical bonds and proximity in the map, a value of 0.3 was added to each neuron, multiplied by the number of times a neighbour was activated by a chemical bond.

7. Steps 5–6 were repeated for all compounds that participate in metabolic reactions.

8. The MOLMAP reaction descriptors were generated for all available metabolic reactions as the difference between the MOLMAPs of the products and the MOLMAPs of the reactants (step c. in Scheme 2). If the reaction has more than one reactant, the MOLMAPs of all reactants are summed. The same is done for the products.

9. A new and independent Kohonen SOM of size 49 × 49 is trained with all metabolic reactions available, from all metabolic pathways (SOM B in Scheme 2), encoded in Steps 1–8. The data were obtained and processed as previously for the automatic prediction of EC numbers,[17–19] but now also included reactions with no full EC number assigned and reactions with more than one EC number. The objects of this SOM are metabolic reactions represented by MOLMAP reaction descriptors of size 625 (step d. in Scheme 2).

10. After the training, all the metabolic reactions in one metabolic pathway are mapped on the newly trained Kohonen SOM (step e. in Scheme 2). Each metabolic reaction activates one neuron in the map. In this step, reactions are represented with the direction(s) reported in the KEGG pathway.

11. The pattern of activated neurons is a representation of the metabolic reactions available in that pathway, i.e., a descriptor of the pathway.

12. The pattern of activated neurons is encoded numerically for computational processing. Each neuron is given a value equal to the number of times it was activated by metabolic reactions (step e. in Scheme 2).

13. The map (a matrix) is transformed into a vector by concatenation of columns. To account for the relationship between the similarity of metabolic reactions and proximity in the map, a value of 0.3 was added to each neuron, multiplied by the number of times a neighbour was activated by a reaction.
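The descriptor scaling of Step 3 can be sketched as follows. This is only an illustration under assumptions: the descriptor matrix is random toy data rather than JChem output, the column numbering is taken to follow Table S1 with 1-based indices, and a guard for constant columns is added that the text does not mention. The function and constant names are hypothetical.

```python
import numpy as np

# Topological descriptors 1-14 and 35-41 (1-based in the text), as 0-based columns.
TOPOLOGICAL = [i - 1 for i in list(range(1, 15)) + list(range(35, 42))]


def scale_bond_descriptors(X, topological=TOPOLOGICAL, factor=3.0):
    """z-normalize physicochemical columns; multiply topological columns by `factor`."""
    X = np.asarray(X, dtype=float)
    is_topo = np.zeros(X.shape[1], dtype=bool)
    is_topo[topological] = True
    out = X.copy()
    phys = X[:, ~is_topo]
    std = phys.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant columns at zero after centering
    out[:, ~is_topo] = (phys - phys.mean(axis=0)) / std
    out[:, is_topo] *= factor
    return out


# Toy matrix: 10 bonds x 55 descriptors (random values, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 55))
scaled = scale_bond_descriptors(X_demo)
```

The mean and standard deviation are computed over the whole bond data set, as the text states, so the same statistics would have to be reused when scaling bonds of new compounds.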

All the metabolic pathways are encoded and could now be processed by a new Kohonen SOM (SOM C in Scheme 2). In this third SOM the objects are metabolic pathways represented by MOLMAP descriptors of size 2401 (step f. in Scheme 2).
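The numerical encoding of Steps 12 and 13 can be sketched as below. Two points are assumptions rather than statements from the text: the neighbourhood is taken to be the eight adjacent neurons, and it wraps around the edges because the maps are toroidal. The function name and the toy winner positions are hypothetical.

```python
import numpy as np


def molmap_vector(winners, rows, cols, neighbour_weight=0.3):
    """Encode a pattern of activated neurons as a fixed-length vector.

    winners: iterable of (row, col) positions, one per mapped object.
    """
    counts = np.zeros((rows, cols))
    for r, c in winners:
        counts[r, c] += 1  # frequency of activation
    # Number of activations in the 8 neighbouring neurons, with wrap-around.
    neighbour_hits = np.zeros_like(counts)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr, dc) != (0, 0):
                neighbour_hits += np.roll(counts, shift=(dr, dc), axis=(0, 1))
    pattern = counts + neighbour_weight * neighbour_hits
    return pattern.flatten(order="F")  # concatenation of columns


# Toy 4 x 4 map: two objects hit neuron (0, 0), one hits neuron (2, 2).
toy = molmap_vector([(0, 0), (0, 0), (2, 2)], rows=4, cols=4)
# On a 49 x 49 map the same call yields a vector of 49 * 49 = 2401 values.
pathway_descriptor = molmap_vector([(10, 20), (10, 21)], rows=49, cols=49)
```

In the toy map, neuron (1, 1) is adjacent to both activated neurons and receives 0.3 × 3 = 0.9, which shows how the neighbour term spreads similarity information to empty neurons.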


2.5 MOLMAP-Based Reactome Descriptors

The Kohonen SOM trained with the diversity of metabolic reactions makes it possible to map not only the chemical reactions of a metabolic pathway but also the entire reactome of an organism.

The methodology to generate MOLMAP-based reactome descriptors was the same as described in Subsection 2.4 for pathways. The difference was in Steps 10–13. Instead of mapping the reactions of a metabolic pathway, all reactions participating in the metabolism of an organism were mapped on the Kohonen SOM previously trained with metabolic reactions. Here the reactions were represented in the direction written in KEGG REACTION. The pattern of neurons activated by the metabolic reactions of that organism can be interpreted as a fingerprint of the organism based on its chemical reactions, i.e., a reactome MOLMAP descriptor. This reactome descriptor encodes, in a fixed-length numerical code, the biochemical machinery of an organism. Organisms were encoded with MOLMAPs of size 29 × 29 = 841.

Scheme 2. Simplified illustration of the procedure leading to the representation and mapping/classification of metabolic pathways.

3 Results and Discussion

3.1 Mapping of Metabolic Pathways

Experiments were performed with 92 metabolic pathways, classified into six different types of metabolism: 16 of carbohydrate metabolism, 12 of lipid metabolism, 18 of xenobiotics biodegradation and metabolism, 24 of amino acid metabolism and metabolism of other amino acids (considered in the same class), 12 of biosynthesis of secondary metabolites and 10 of metabolism of cofactors and vitamins.

All metabolic pathways were encoded with MOLMAPs of size 49 × 49 = 2401 and were submitted to a new independent SOM of size 12 × 12 (step f. in Scheme 2). In this SOM (C), the objects are metabolic pathways. SOMs were implemented with toroidal topology: neurons on the left are considered adjacent to those on the right, and top neurons are adjacent to bottom neurons. The full list of metabolic pathways and the corresponding predictions and mapping coordinates are provided in Table S2 of Supporting Information.

The association between the classes of metabolism and the distribution of pathways on the map was investigated (Figure 1). Well-defined regions for specific types of metabolism can be observed, indicating the ability of the SOM to automatically perceive similarities between pathways of the same class. For example, pathways C030, C040, C051, C052, C053 and C500, all from carbohydrate metabolism, were mapped in a well-defined region of the map without neighbors of other types of metabolism. Most of these pathways are related to each other. In the class of "amino acid metabolism and other amino acid metabolism", pathways A280, A350, A360 and A380 occupy a separate region, in agreement with the fact that all of them belong to the "amino acid metabolism" class in the most recent version of KEGG. The pathways of lipid metabolism, xenobiotics biodegradation and metabolism, and biosynthesis of secondary metabolites were also mapped in well-defined regions of the map. It is worth mentioning that SOMs learn with unsupervised training, i.e., the algorithm does not use the information about classes during the training process. Objects (pathways) are mapped exclusively on the basis of their features (MOLMAPs), and only afterwards are the classes considered for classifying (coloring) neurons.
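The post-training coloring of neurons can be sketched as a majority vote per neuron, with ties marked as conflicts as stated in the caption of Figure 1. The function name, the "conflict" marker and the toy mappings are illustrative assumptions and do not reproduce the actual map.

```python
from collections import Counter, defaultdict


def label_neurons(mapped):
    """Assign each occupied neuron the majority class of the objects mapped onto it.

    mapped: iterable of (neuron, class_label) pairs.
    A tie for the most frequent class is reported as "conflict".
    """
    by_neuron = defaultdict(Counter)
    for neuron, label in mapped:
        by_neuron[neuron][label] += 1
    labels = {}
    for neuron, counts in by_neuron.items():
        ranked = counts.most_common()
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        labels[neuron] = "conflict" if tie else ranked[0][0]
    return labels


# Toy mappings: (neuron position, class of the pathway mapped onto it).
toy_mapping = [
    ((0, 0), "C"), ((0, 0), "C"), ((0, 0), "A"),  # majority class C
    ((1, 2), "A"), ((1, 2), "B"),                  # tie -> conflict
]
neuron_labels = label_neurons(toy_mapping)
```

Class labels enter only at this stage; the weights of the map are fixed before any label is consulted, which is what makes the observed clustering a result of the descriptors alone.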

In these preliminary studies with metabolic pathways, we used a previous version of the KEGG database. This makes it possible to compare the updates in the organization of pathways made by experts since the release date with the similarities between pathways encoded by our method and perceived by the SOMs. After analysis of the obtained map and comparison with the currently available description of pathways in the KEGG database, some interesting cases emerged. For example, pathways A271 and A272, formerly methionine metabolism and cysteine metabolism, are now collected in a single pathway ("cysteine and methionine metabolism", KEGG code 00270), and they were mapped in adjacent neurons (B:8 and B:9) in our map. Similarly, both pathways C530 (amino sugar metabolism) and C520 (nucleotide metabolism) activate neuron L:5; pathway 00520 is now named "amino sugar and nucleotide sugar metabolism" and pathway 00530 was deleted. In another example, pathways L140 and L150 were mapped in neuron E:6, and they were formerly named "C21-steroid hormone metabolism" and "androgen and estrogen metabolism". Currently pathway L150 has disappeared, while 00140 is named "steroid hormone metabolism" and comprises the metabolism of three groups of steroids (C21 steroids of glucocorticoids and mineralocorticoids, C19 steroids of androgens, and C18 steroids of estrogens).

Some of the formal conflicts between types of metabolism correspond to chemical similarities between metabolic pathways classified in different types of metabolism. For example, pathway C660, C5-branched dibasic acid metabolism (carbohydrate metabolism), and pathway A290, valine, leucine and isoleucine metabolism (amino acid metabolism), activate neuron H:8. In this case the two pathways share a common reaction, R03896 with EC number 4.2.1.35, and two reactions with the same EC number, R04673 (EC 2.2.1.6) in pathway A290 and R00226 (EC 2.2.1.6) in pathway C660. Moreover, the two pathways have four reactions with similar EC numbers: R01213 (2.3.3.13), R04001 (4.2.1.33), R03968 (4.2.1.33) and R00996 (4.3.1.19) in pathway A290, compared with R00932 (2.3.3.11), R02491 (4.2.1.56), R03693 (4.2.1.34) and R02696 (4.3.1.2) in pathway C660. The two pathways are related: one of the final metabolites of C5-branched dibasic acid metabolism, (S)-2-acetolactate, also participates in valine, leucine and isoleucine biosynthesis. Scheme 3 shows the similarity between the metabolic reactions of these pathways.

It should be pointed out that the mapping of two pathways of different metabolisms into the same neuron does not imply that the entire pathways are similar. Overlapping or very similar fragments of the pathways (two or three consecutive reactions) often suffice for both pathways to activate the same neuron.

The Euclidean distance between the neuron weights and the descriptors of the metabolic pathways that activate the neuron can sometimes highlight a less reliable mapping. For example, both pathways A440 and B521 (pathways of different types of metabolism and with no relation) activate neuron B:12, the first presenting a higher Euclidean distance between descriptor vector and neuron weights. That was in accordance with the fact that most neighboring neurons were activated by pathways of class B. The same happens in neuron L:1, with pathway V130 in conflict with X351, X363 and X642. Pathway V130 presents a Euclidean distance to the neuron much larger than the other three pathways. A high similarity between the query pathway and the neuron weights generally corresponds to a common type of metabolism in adjacent neurons, even in neurons with conflicts.
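The distance check described above can be sketched as follows. The weight vector and descriptors are invented three-element toy values (real pathway descriptors have 2401 elements), and the labels P1 and P2 are placeholders rather than pathways from the map.

```python
import numpy as np


def distances_to_neuron(neuron_weights, descriptors):
    """Euclidean distance between a neuron's weights and each object mapped onto it."""
    w = np.asarray(neuron_weights, dtype=float)
    return {
        name: float(np.linalg.norm(np.asarray(vec, dtype=float) - w))
        for name, vec in descriptors.items()
    }


# Toy neuron shared by two objects (illustrative values only).
weights = [1.0, 0.0, 0.0]
mapped = {"P1": [1.0, 0.0, 0.5], "P2": [0.0, 2.0, 0.0]}
dist = distances_to_neuron(weights, mapped)
least_reliable = max(dist, key=dist.get)  # object farthest from the neuron weights
```

In a shared neuron, the object with the largest distance is the natural candidate for a less reliable mapping, in line with the A440 and V130 cases discussed above.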

The SOMs also exhibited an ability to perceive andencode, in a certain way, the relation between pathways,also for pathways of different types of metabolism. The or-ganization of pathways in the map was compared with theinformation available in KEGG for inter-pathways relations,and many cases were found of related pathways mappedinto the same or adjacent neurons – pathway A350 (G : 1)and pathway B950 (F : 1), pathway A360 (I : 2) and pathwayX362 (I : 1), pathways A280 (I : 4) and C640 (I : 5), pathwayC620 (I : 5) and pathway C640 (I : 5), pathway A410 (K : 9)

and pathways V770 (J : 10) and A330 (K : 8). In the case ofconflicting pathways A440 and B521 no such a relation wasfound.

The MOLMAP-based similarity between two reactions (as applied in this work) is derived from the similarity between the covalent bonds that change, break and are created in the two reactions. Bonds are compared in terms of their physicochemical properties, as well as their substructural environment. On the other hand, the KEGG classification of metabolisms does not necessarily reflect chemical similarities, as it is strongly guided by biological functions. This work shows that a chemically oriented MOLMAP-based clustering of metabolic pathways reflects their KEGG classification to a large extent.

Figure 1. Toroidal surface of a 12 × 12 Kohonen SOM trained with 92 metabolic pathways encoded by metabolic pathway MOLMAPs of size 2401. After the training, each occupied neuron was colored according to the type of metabolism of the metabolic pathways that were mapped onto it. Black neurons correspond to conflicts (tie in the major class).


3.2 Mapping of Reactomes of Organisms

The SOM B in Scheme 2, trained with a diversity of metabolic reactions, allows mapping not only the chemical reactions of a metabolic pathway but also the entire reactome of an organism. Applying the MOLMAP concept, the pattern of neurons activated by the reactions of an organism can now be used to describe, compare and classify organisms. Some exploratory exercises were performed to test this idea.

All reactions participating in the metabolism of an organism were mapped on SOM B (Scheme 2). The pattern of activated neurons was interpreted as a fingerprint of the organism based on its chemical reactions, i.e. the MOLMAP descriptor of the organism. Experiments were performed with the lists of metabolic reactions for each organism extracted from the KEGG GENES database.
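The construction of such a fingerprint can be sketched as follows. The sketch assumes a trained reaction SOM stored as a NumPy array of shape (rows, columns, features) and simply counts how many reactions activate each neuron; the function name `reactome_fingerprint`, the random weights, the 29 × 29 grid (chosen only because it gives 841 components) and the use of raw counts are illustrative assumptions, not the exact encoding used in this work.

```python
import numpy as np

def reactome_fingerprint(weights, reactions):
    """Map every reaction onto its winning neuron and return the flattened hit counts."""
    rows, cols, _ = weights.shape
    counts = np.zeros((rows, cols), dtype=int)
    for x in reactions:
        dists = np.linalg.norm(weights - x, axis=2)   # distance to every neuron
        r, c = np.unravel_index(np.argmin(dists), dists.shape)
        counts[r, c] += 1                             # this neuron is activated
    return counts.ravel()                             # one component per neuron

# Stand-in for a trained reaction SOM: random weights, 5 features per neuron.
rng = np.random.default_rng(0)
weights = rng.random((29, 29, 5))

# Three hypothetical reaction descriptors; two of them coincide with neuron (0, 1).
reactions = np.stack([weights[0, 1], weights[0, 1], weights[3, 7]])
fingerprint = reactome_fingerprint(weights, reactions)
```

The resulting vector has one component per neuron of the reaction map, so organisms with different numbers of reactions are all described by vectors of the same length and can be submitted to a second SOM.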

Organisms were restricted to fully sequenced Prokaryotes, and the KEGG classification was considered at the third taxonomy level (Table 1). A total of 347 organisms of 16 taxonomic types were encoded and submitted to a new independent SOM of size 25 × 25. After the training, the activated neurons were colored according to the third taxonomy level of the organisms, and the result is displayed in Figure 2.
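The coloring of neurons by majority class, with ties marked as conflicts, can be sketched in a few lines. The function name `color_neurons`, the `"conflict"` marker and the toy assignments are hypothetical; only the majority rule and the tie criterion follow the figure captions.

```python
from collections import Counter

def color_neurons(assignments):
    """Assign to each neuron its majority class, or 'conflict' when the top classes tie.

    assignments: iterable of (neuron, class_label) pairs, one per mapped object.
    """
    per_neuron = {}
    for neuron, label in assignments:
        per_neuron.setdefault(neuron, Counter())[label] += 1
    colors = {}
    for neuron, counts in per_neuron.items():
        ranked = counts.most_common()
        tie = len(ranked) > 1 and ranked[0][1] == ranked[1][1]
        colors[neuron] = "conflict" if tie else ranked[0][0]
    return colors

# Toy example: class letters and neuron coordinates are illustrative only.
toy_hits = [((0, 0), "O"), ((0, 0), "O"), ((0, 0), "P"),
            ((1, 2), "A"), ((1, 2), "B"),
            ((2, 2), "C")]
toy_colors = color_neurons(toy_hits)
```

Neurons that receive no object are absent from the result, which corresponds to the uncolored (empty) neurons of the map.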

A remarkable clustering according to the taxonomy was observed for several types of organisms. Clustering according to the previous level of classification can also be observed in some cases, for example with Euryarchaeota (class O) and Crenarchaeota (class P), the only two classes of Archaea organisms, mapped in the same region of the map. The full list of organisms and the corresponding predictions and mapping coordinates are provided in Table S3 of the Supporting Information.

Similarly to the strategy followed in the experiments with metabolic pathways, a previous version of KEGG GENES was used. Again, the comparison of the mapping with the current classification in KEGG demonstrates the potential of this automatic procedure. The organism Symbiobacterium thermophilum, sth, was labeled in our data set as Actinobacteria. However, this organism activated neuron M:10 in the neighborhood of Proteobacteria and Firmicutes, in a region with no other Actinobacteria organisms. Indeed, Lyashenko et al.[24] reported that none of the signature genes used in their studies for Actinobacteria have orthologs in Symbiobacterium while the signature genes for Firmicutes do, which is evidence that Symbiobacterium belongs to Firmicutes. Symbiobacterium thermophilum was changed to Firmicutes in a later release of KEGG. Additionally, the former Firmicutes are currently separated into Firmicutes and Tenericutes. Figure 3 shows the remapping of these organisms, labeled differently for Firmicutes and Tenericutes, where the separation between the two groups is evident.

It should be pointed out that the obtained clustering of organisms according to their taxonomic classification can be affected by the way the genomes were annotated. In fact, a genome being 100 % sequenced does not mean that the reactome is 100 % complete, i.e. not all reactions that occur in an organism are available. It can happen that the biochemical machinery of two organisms appears very similar as a consequence of the similarities between genomes, since genomes are often annotated by comparison with similar genes in other organisms. The mapping also depends on the reactions that are listed in the database for each organism. Therefore, the results reflect not only the biologically significant similarities and differences between organisms, but also the contents of the database.

Scheme 3. Example of similar reactions in two pathways mapped in the same neuron. Met. Path.: label of metabolic pathway (A290: valine, leucine and isoleucine metabolism; C660: C5-branched dibasic acid metabolism). KEGG ID: ID of metabolic reactions in the KEGG REACTION database.

4 Conclusions

This work demonstrates a new approach to the encoding and automatic perception of chemical similarities between metabolic pathways, without relying on EC numbers. It can be used to compare different metabolic pathways and metabolic pathways of different organisms. Further refinements of the method are foreseen, making possible the application to more complex tasks, e.g. to automatically compare the same metabolic pathway of different organisms or to assist in phylogenetic reconstruction of organisms.

The representation and mapping of organisms in terms of their MOLMAP-based reactomes was demonstrated, and good general concordance with the established taxonomic classification was observed. However, it must be emphasized that this application should be seen as an exercise limited by the quality of the available data. Fully sequenced genomes do not correspond to complete metabolomes or reactomes of the organisms. Reactomes are necessarily incomplete and may be biased by artifacts related to the building of the database.

Figure 2. Toroidal surface of a 25 × 25 Kohonen SOM trained with reactomes of 347 organisms encoded by reactome MOLMAPs of size 841. After the training, each neuron was colored according to the third taxonomic level of the organisms that were mapped onto it. The black neurons are conflicts (tie in the major class).

The study shows the possibility of applying the MOLMAP methodology at progressively higher levels of complexity, bridging chemical and biological information, and going all the way from atomic properties to the classification of organisms.

Acknowledgements

The authors thank ChemAxon Ltd (Budapest, Hungary) for access to JChem and Marvin software and Kyoto University Bioinformatics Center (Kyoto, Japan) for access to the KEGG Database. The authors acknowledge FCTMES (Fundação para a Ciência e a Tecnologia), Ministério da Ciência, Tecnologia e Ensino Superior, Lisbon, Portugal, for financial support.

References

[1] M. Kanehisa, S. Goto, Nucleic Acids Res. 2000, 28, 27 – 30.

[2] R. Caspi, H. Foerster, C. S. Fulcher, P. Kaipa, M. Krummenacker, M. Latendresse, S. Paley, S. Rhee, A. G. Shearer, C. Tissier, T. C. Walk, P. Zhang, P. D. Karp, Nucleic Acids Res. 2008, 36, D623 – D631.

[3] M. Kanehisa, S. Goto, M. Hattori, K. F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, M. Hirakawa, Nucleic Acids Res. 2006, 34, D354 – D357.

[4] M. Ferrer, F. Martinez-Abarca, P. Golyshin, Curr. Opin. Biotechnol. 2005, 16, 588 – 593.

[5] H. Ogata, W. Fujibuchi, S. Goto, M. Kanehisa, Nucleic Acids Res. 2000, 28, 4021 – 4028.

[6] M. Koyutürk, A. Grama, W. Szpankowski, Bioinformatics 2004, 20, i200 – i207.

[7] M. Chen, R. Hofestadt, Appl. Bioinformatics 2004, 3, 241 – 252.

[8] K. Tun, P. K. Dhar, M. C. Palumbo, A. Giuliani, BMC Bioinformatics 2006, 7, 24.

[9] R. Y. Pinter, O. Rokhlenko, E. Yeger-Lotem, M. Ziv-Ukelson, Bioinformatics 2005, 21, 3401 – 3408.

[10] A. Mano, T. Tuller, O. Béjà, R. Y. Pinter, BMC Bioinformatics 2010, 11, S38.

[11] J. C. Clemente, K. Satou, G. Valiente, Genome Inform. 2005, 16, 45 – 55.

[12] J. C. Clemente, K. Satou, G. Valiente, Bioinformatics 2007, 23, e110 – e115.

[13] A. Mazurie, D. Bonchev, B. Schwikowski, G. A. Buck, Bioinformatics 2008, 24, 2579 – 2585.

[14] C.-W. Chang, P.-C. Lyu, M. Arita, BMC Bioinformatics 2011, 12, S27.

[15] A. Macchiarulo, J. M. Thornton, I. Nobeli, J. Chem. Inf. Model. 2009, 49, 2272 – 2289.

[16] Q.-Y. Zhang, J. Aires-de-Sousa, J. Chem. Inf. Model. 2005, 45, 1775 – 1783.

[17] D. A. R. S. Latino, J. Aires-de-Sousa, Angew. Chem. Int. Ed. 2006, 45, 2066 – 2069.

[18] D. A. R. S. Latino, Q.-Y. Zhang, J. Aires-de-Sousa, Bioinformatics 2008, 24, 2236 – 2244.

[19] D. A. R. S. Latino, J. Aires-de-Sousa, J. Chem. Inf. Model. 2009, 49, 1839 – 1846.

[20] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, M. Hattori, Nucleic Acids Res. 2004, 32, D277 – D280.

[21] S. Goto, T. Nishioka, M. Kanehisa, Bioinformatics 1998, 14, 591 – 599.

[22] T. Kohonen, Self-Organizing Maps, Springer-Verlag New York, Inc., Secaucus, NJ, 1997.

[23] J. Aires-de-Sousa, Chemom. Intell. Lab. Syst. 2002, 61, 167 – 173.

[24] M. Bern, D. Goldberg, E. Lyashenko, Nucleic Acids Res. 2006, 34, 4342 – 4353.

Received: August 5, 2011

Accepted: December 12, 2011

Published online: February 8, 2012

Figure 3. Toroidal surface of a 25 × 25 Kohonen SOM trained with reactomes of 347 organisms encoded by reactome MOLMAPs of size 841. Neurons belonging to Firmicutes were colored according to the new classification of the organisms as Firmicutes (neurons with diagonal lines) and Tenericutes (neurons with horizontal lines).
