
CINTI 2013 • 14th IEEE International Symposium on Computational Intelligence and Informatics • 19–21 November, 2013 • Budapest, Hungary • 978-1-4799-0197-5/13/$31.00 ©2013 IEEE


Agreement assessment of biochemical pathway models by structural analysis of their intersection

Tatjana Rubina, Martins Mednis, and Egils Stalidzans
Biosystems Group, Department of Computer Systems, Latvia University of Agriculture, Liela iela 2, LV-3001, Jelgava, Latvia

Abstract—During model development, it is an advantage to be able to assess the quality of available models, either to select the best one or to find suitable parts of a published model for building a new one. Differences or contradictions between reconstructions can indicate the level of agreement between different authors about the topic of interest. The intersecting part of the models can also reveal differences in the scope of the models. Two pairs of models from the BioCyc database were analyzed: 1) the Escherichia coli models ecol199310cyc and ecol316407cyc and 2) the Saccharomyces cerevisiae models iND750 and iLL672. The ModeRator software tool is used to compare the models and generate their intersection model. The structural parameters of the models are analyzed with the BINESA software. The study reveals very different parameters for the intersections of the E. coli and the S. cerevisiae model pairs. The intersection of the E. coli models, which were built by the same group of authors, is selected as an example of high agreement between models and can be interpreted as the consensus part of the two initial models. The intersection of the S. cerevisiae models demonstrates very different structural properties, and the intersection model would not be able to function even after significant improvement. Structural analysis of the pairs of original models and their intersections is performed to determine which structural parameters can be used to detect poor agreement between a pair of models. It is concluded that automated comparison and intersection generation of two models can give fast insight into the similarity of the models and thus into the consensus level in modelling the metabolism of a particular organism. This approach can also be used to find similarities between models of different organisms. Automation of intersection creation and of structural analysis are the enabling technologies of this approach.

Index Terms—stoichiometric model, intersection, structure analysis.

I. INTRODUCTION

The rapid development of sequencing techniques enables relatively fast reconstruction of biochemical reaction networks in many organisms. The reconstruction process for metabolic networks is well developed [1], [2] and has been implemented for a number of different organisms [3], [4], including human [5], [6]. Several of these networks are available online in different databases: Kyoto Encyclopedia of Genes and Genomes (KEGG) [7], [8], EcoCyc [9], BioCyc [10], BiGG [4] and metaTIGER [11]. The available reconstructions and models are growing both in number and in size (the number of interactions within a reconstruction or model).

When modelling a particular biochemical network, it is an advantage to assess the scope and quality of already published models quickly, either to find the most suitable one or to find suitable parts or modules for building a new model. The differences or contradictions between reconstructions, especially genome-scale reconstructions, give insight into the scope of the models and the level of agreement between different authors about the topic of interest. The level of agreement can be estimated by analysing the intersection model: a model that consists of the identical reactions found in the compared models.

The comparison of models is a complicated task because of the different habits of model builders and the growing size of models. Metabolites and compartments are often named in different ways. Some reconstructions or models include the formulas of metabolites, but many give no indication of the formula, so the name of the metabolite becomes the most valuable parameter for comparison. It is quite common practice among researchers to ignore water molecules and hydrogen ions in reactions, assuming that water and hydrogen are always available. Reactions written in this manner will most likely be unbalanced, but that does not change the essence of the reactions. Ignoring water and hydrogen has become popular both in developing [1], [2] and in visualizing [12] reconstructions. The automated software tool ModeRator [13] can be used for comparison and intersection of models, taking into account the modelling habits mentioned above.

Structural, or topological, analysis can be used for fast detection of similarities and differences in reconstructions or models. The structure of cellular biochemical networks has some characteristic features, and one can assume that a model not following those features may be incorrect. One property of biochemical networks is their scale-free degree distribution. Most metabolites in scale-free networks have only a few links, while a small set of high-degree hubs participates in dozens of interactions. A scale-free network has a power-law degree distribution, $P(k) \sim k^{-y}$. When the power is $2 \le y \le 3$, the hubs play a significant role in the network [14]. The structure and dynamics of these networks are independent of the network size as measured by the number of elements in the network (Zhang, 2009).

Another property is the small-world effect, which states that any two elements can be connected via a short path of a few links. According to the formal definition, in a small-world network: 1) most elements have a low connection degree, and the degree distribution follows a power law, also referred to as scale-freeness; 2) high-degree elements, called hubs, dominate the network, and most elements are clustered around hubs; and 3) the average path length remains the theoretical minimum [15]. Scale-free networks are ultra-small: their path length is much shorter than that predicted by the small-world effect (Zhang, 2009). Within the cell, this ultra-small effect was first documented for metabolism, where paths of only three or four reactions can link most pairs of metabolites [14]. Cellular networks can be characterized by the average clustering coefficient, which is significantly larger in most real biochemical networks than in a random network of equivalent size and degree distribution [16]. Metabolic networks offer striking evidence for this: their average clustering coefficient is independent of the network size, in contrast to module-free scale-free networks, for which it decreases [17], [14]. For example, protein interaction networks have a high average clustering coefficient [18], which implies that the network comprises a collection of modules (Zhang, 2009) and can be used to verify the existence of modular structures.

Several tools are available for structural analysis of biochemical networks. They can be used to calculate a number of simple and complex topological parameters and to detect network motifs and features [19], [20].

In this study, structural analysis of pairs of models of the same organism and of their intersections demonstrates cases of highly similar and of different models. Different structural parameters are determined for all models to identify those that give indications about the quality of the intersection models. Those parameters can be used for fast analysis of model intersections, assessing the quality of the intersection and the level of agreement between the compared models.

II. MATERIALS AND METHODS

Two pairs of models are compared: 1) the Escherichia coli models ecol199310cyc (named Eco 1 in this study) and ecol316407cyc (named Eco 2 in this study) from the BioCyc database (http://www.biocyc.org/) and 2) the Saccharomyces cerevisiae models iLL672 (named Sce 1 in this study), developed by Lars Kuepfer [21], and iND750 (named Sce 2 in this study), developed by Natalie Duarte [22]. The iLL672 reconstruction of S. cerevisiae metabolism is based on the previous reconstruction iFF708 [23]. Biomass reactions are excluded from the analysis because biomass formation is a complicated process described in the form of a large lumped biochemical reaction that different authors can treat in very different ways.

ModeRator [13], a comparison tool for stoichiometric models, is used to compare the models and generate their intersection model. The intersection was found using the following parameters: ratio: 70%; edit distance: 10; ignore letter case: enabled; filter by compartments: enabled. The comparison of the models used in this research is described in detail by Mednis and Aurich [24]. Water and hydrogen molecules are ignored. By default, reactions are compared substrates-1 to substrates-2 and products-1 to products-2. When the cross check option is enabled, reaction sides are compared as in the default mode and additionally substrates-1 to products-2 and products-1 to substrates-2. The metabolites of the E. coli models were compared by their identifiers. The BINESA software [25] (www.biosystems.lv/binesa) is used for the structural analysis. The S. cerevisiae models were converted from COBRA-compatible xls files to SBML files to achieve compatibility with BINESA. The intersection models of the E. coli and the S. cerevisiae models are named Eco i/s and Sce i/s, respectively. To illustrate the network structure of the examined models, they were converted to GV and PNG files using the GraphViz software.
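The comparison rules described above (case-insensitive name matching with a 70% similarity ratio, ignoring water and hydrogen, optional cross check of reaction sides) can be illustrated with a minimal Python sketch. This is not ModeRator's implementation: the similarity measure (`difflib.SequenceMatcher.ratio`), the spellings collected in `IGNORED`, the greedy pairing of names, and all function names are assumptions made for illustration, and the edit-distance and compartment filters are not modelled.

```python
from difflib import SequenceMatcher

# Assumed spellings of water and hydrogen; real models use many variants.
IGNORED = {"h2o", "h", "h+"}

def normalise(side):
    """Lower-case the metabolite names of one reaction side, dropping water/hydrogen."""
    return [name.lower() for name in side if name.lower() not in IGNORED]

def names_match(a, b, min_ratio=0.70):
    """True if two names are at least `min_ratio` similar (difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio() >= min_ratio

def sides_match(side1, side2, min_ratio=0.70):
    """Greedily pair every metabolite of side1 with an unused metabolite of side2."""
    left, unused = normalise(side1), normalise(side2)
    if len(left) != len(unused):
        return False
    for name in left:
        hit = next((m for m in unused if names_match(name, m, min_ratio)), None)
        if hit is None:
            return False
        unused.remove(hit)
    return True

def reactions_match(r1, r2, cross_check=False):
    """A reaction is a (substrates, products) pair of name lists."""
    direct = sides_match(r1[0], r2[0]) and sides_match(r1[1], r2[1])
    if direct or not cross_check:
        return direct
    # Cross check: also compare substrates-1 to products-2 and products-1 to substrates-2.
    return sides_match(r1[0], r2[1]) and sides_match(r1[1], r2[0])

def intersection(model1, model2, cross_check=False):
    """Reactions of model1 that have a matching reaction in model2."""
    return [r1 for r1 in model1
            if any(reactions_match(r1, r2, cross_check) for r2 in model2)]

# Toy models (hypothetical reactions, not taken from the analysed models).
model_a = [(["ATP", "H2O"], ["ADP", "Pi"]), (["glucose"], ["pyruvate"])]
model_b = [(["atp"], ["adp", "pi", "H"]), (["pyruvate"], ["glucose"])]

common = intersection(model_a, model_b)
common_cross = intersection(model_a, model_b, cross_check=True)
print(len(common), len(common_cross))
```

In this toy example the first reaction matches once water and hydrogen are dropped, while the second one, written in the opposite direction in the second model, is only recognized when the cross check is enabled.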

The intersection of the E. coli model pair (Eco i/s) is taken as an example of a high-quality intersection, indicating high agreement of the original models in the intersecting part. The reason for the high agreement is that the models were built by the same research group. The S. cerevisiae models were built by different groups of researchers, and their intersection would not be able to function even after significant improvements. Therefore the intersection of the S. cerevisiae models (Sce i/s) is chosen as a sample of low-agreement models.

The structural parameters of the models were analyzed with the BINESA software using the following settings: a maximum of 10 reactants per reaction and a maximum of 10 products per reaction. The following parameters were analyzed: the number of elements, the number of connected elements, the number of isolated elements, the number of reactions, the number of links, the average degree, the incoming and outgoing degree, the average incoming and outgoing degree, the incoming and outgoing degree distribution, the clustering coefficient, the average clustering coefficient, the number of neighbours and the average number of neighbours.

III. CALCULATION

The formulas used to calculate the structural parameters are described below. The degree $k_i$ of a network element $i$ (1) is the number of edges, or links, that it has with the other network elements, that is, the number of links incident with element $i$:

$$k_i = \sum_{j}^{n} k_{ij} \tag{1}$$

The incoming degree (2) is the number of links that point to the network element $i$, so the sum runs over incoming links only:

$$k_{+i} = \sum_{j}^{n} k_{ij} \tag{2}$$

The outgoing degree (3) is the number of links that start at the network element $i$, so the sum runs over outgoing links only:

$$k_{i+} = \sum_{j}^{n} k_{ij} \tag{3}$$

The average degree of the network (4) is the average of the degrees of all individual network elements:

$$K = \frac{\sum_{i=1}^{N} k_i}{N}, \tag{4}$$

where $k_i$ is the degree of element $i$ and $N$ is the total number of elements.
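Formulas (1)–(4) can be illustrated with a short Python sketch. It assumes that every substrate of a reaction is linked to every product of that reaction and that the degree of an element is the sum of its incoming and outgoing degrees; both are assumptions of this example and not necessarily the way BINESA counts links. The function names and the toy reactions are hypothetical.

```python
from collections import Counter

def links_from_reactions(reactions):
    """Assumption: each substrate is linked to each product of the same reaction."""
    return [(s, p) for substrates, products in reactions
            for s in substrates for p in products]

def degrees(links):
    """Return (incoming, outgoing, total) degree per element for directed links."""
    in_deg, out_deg = Counter(), Counter()
    for source, target in links:
        out_deg[source] += 1   # link starts at `source`
        in_deg[target] += 1    # link points to `target`
    elements = set(in_deg) | set(out_deg)
    total = {e: in_deg[e] + out_deg[e] for e in elements}
    return in_deg, out_deg, total

# Toy network: A + B -> C, C -> D
reactions = [(["A", "B"], ["C"]), (["C"], ["D"])]
links = links_from_reactions(reactions)        # A->C, B->C, C->D
in_deg, out_deg, total = degrees(links)

n_elements = len(total)
average_degree = sum(total.values()) / n_elements      # formula (4)
average_in = sum(in_deg.values()) / n_elements
average_out = sum(out_deg.values()) / n_elements
print(total["C"], average_degree, average_in, average_out)
```

Under these assumptions the first reaction contributes two links, which illustrates how one reaction can include several links.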

The degree distribution is the number of elements with degree $k$ ($k = 1, 2, 3, \ldots, n$). For a directed network, the degree distribution is separated into an incoming and an outgoing degree distribution. The statistical model for the degree distribution is the following:

$$P(K = k) = \frac{N_k}{N}, \tag{5}$$

where $N_k$ is the number of elements with degree $k = 1, 2, \ldots, n$ and $N$ is the total number of elements.
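As a small illustration of formula (5), the following Python sketch turns a mapping from elements to degrees into the empirical distribution $P(K = k) = N_k / N$. The function name and the toy degrees are hypothetical.

```python
from collections import Counter

def degree_distribution(degree_by_element):
    """Return {k: N_k / N} for the degrees that occur in the network."""
    n_total = len(degree_by_element)
    counts = Counter(degree_by_element.values())   # N_k for every observed k
    return {k: counts[k] / n_total for k in sorted(counts)}

# Toy degrees of four elements
toy_degrees = {"A": 1, "B": 1, "C": 3, "D": 1}
distribution = degree_distribution(toy_degrees)
print(distribution)
```

The same function can be applied separately to the incoming and the outgoing degrees of a directed network.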

The number of neighbours $q_i$ is the number of elements that are directly linked to element $i$. The average number of neighbours is the average of the numbers of neighbours of all individual network elements:

$$Q = \frac{\sum_{i=1}^{N} q_i}{N} \tag{6}$$

The clustering coefficient $C_i$ is the ratio of the number of existing links between the neighbours of element $i$ to the maximum number of possible links between them:

$$C_i = \frac{2 n_i}{q_i (q_i - 1)}, \tag{7}$$

where $q_i$ is the number of neighbours of element $i$ and $n_i$ is the number of links between these $q_i$ neighbours.

The average clustering coefficient is the average of the individual clustering coefficients of all network elements:

$$C = \frac{\sum_{i=1}^{N} C_i}{N} \tag{8}$$
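Formulas (6)–(8) can be sketched in Python as follows. The example treats links as undirected when collecting neighbours and reports a clustering coefficient of 0 for elements with fewer than two neighbours, where formula (7) is undefined; both choices are assumptions of this sketch, and the function names and toy links are hypothetical.

```python
from itertools import combinations

def neighbour_sets(links):
    """Undirected neighbour sets built from (source, target) links."""
    neighbours = {}
    for a, b in links:
        if a == b:
            continue                      # self-links are skipped in this sketch
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours

def clustering_coefficient(neighbours, element):
    """Formula (7): 2 * n_i / (q_i * (q_i - 1))."""
    q = len(neighbours[element])
    if q < 2:
        return 0.0                        # assumption: undefined case reported as 0
    n_links = sum(1 for u, v in combinations(neighbours[element], 2)
                  if v in neighbours[u])
    return 2 * n_links / (q * (q - 1))

# Toy network: a triangle A-B-C with an extra element D attached to C
toy_links = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
nb = neighbour_sets(toy_links)

coefficients = {e: clustering_coefficient(nb, e) for e in nb}
average_neighbours = sum(len(s) for s in nb.values()) / len(nb)     # formula (6)
average_clustering = sum(coefficients.values()) / len(coefficients)  # formula (8)
print(coefficients, average_neighbours, average_clustering)
```

In the toy network, A and B have a coefficient of 1 because their two neighbours are linked, C has 1/3 because only one of the three possible links between its neighbours exists, and D has 0.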

This study analyses metabolic network models; for this reason, metabolites are considered as network elements and reactions as network links. In several cases, one reaction can include many links.

IV. RESULTS

The summary of the structural parameters of the E. coli and the S. cerevisiae models shows that the intersection models differ in size from the original model pairs (Table I), while some structural parameters remain relatively similar.

The distributions of the structural parameters (the incoming and outgoing degree (Fig. 1), the number of neighbours (Figs. 2 and 3) and the clustering coefficient (Fig. 4)) are determined by the BINESA software. The distribution of the clustering coefficient over several classes for the E. coli and the S. cerevisiae models is summarized in Table II.

V. DISCUSSION

The lumped reactions of biomass formation were excluded from the structural analysis because they do not represent metabolic reactions and would change the overall impression of the similarity of the model structures. Excluding the lumped reactions from the comparison also directly influences the results when one model represents a lumped reaction in a detailed way, because the similarity then cannot be recognized.

The models can be compared by several groups of parameters. The numbers of metabolites and reactions demonstrate the size of the models and the level of model agreement. The average structural parameters (the average degree, the average incoming degree, the average outgoing degree, the average number of neighbours and the average clustering coefficient) make it possible to compare models of different sizes. The distributions of the structural parameters, such as the incoming and outgoing degree, the number of neighbours and the clustering coefficient, indicate the interconnectivity of the network metabolites. Each of these parameter groups is analysed under the following subheadings.

A. Number of metabolites and reactions

Both Eco 1 and Sce 1 are the smaller models of their pairs (Table I). Still, the number of metabolites in the intersection model, expressed as a percentage of Eco 1/Eco 2 and Sce 1/Sce 2, shows a big difference: 71/55 for Eco i/s and 24/15 for Sce i/s. As a result, the Sce i/s model is proportionally much smaller than the compared models (Table I). The number of matched reactions shows an even larger difference. About half of the reactions are in the intersection model in the case of E. coli, while only about every tenth reaction is recognized as common in the S. cerevisiae models. Still, an analysis of the numbers of metabolites and reactions alone cannot reveal whether a small intersection model is caused by a small overlap between the models or by low agreement of the model authors about the same processes.
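The percentages quoted above follow directly from the counts in Table I. The short Python sketch below recomputes them as a worked check; the dictionary layout and the helper name `percent` are choices of this example only.

```python
def percent(part, whole):
    """Share of `part` in `whole`, rounded to a whole percent."""
    return round(100 * part / whole)

# Counts taken from Table I: (first model, second model, intersection)
table_one = {
    "E. coli metabolites":       (1314, 1704, 932),
    "E. coli reactions":         (1252, 1552, 731),
    "E. coli links":             (6540, 8031, 3367),
    "S. cerevisiae metabolites": (674, 1061, 162),
    "S. cerevisiae reactions":   (903, 1267, 88),
    "S. cerevisiae links":       (2671, 6399, 234),
}

shares = {
    name: (percent(common, first), percent(common, second))
    for name, (first, second, common) in table_one.items()
}

for name, (of_first, of_second) in shares.items():
    print(f"{name}: {of_first}/{of_second} % of the first/second model")
```

The recomputed values agree with the bracketed percentages of Table I, for example 71/55 for the E. coli metabolites and 10/7 for the S. cerevisiae reactions.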

B. Average structural parameters

The average parameters, such as the average degree, the average incoming degree, the average outgoing degree, the average number of neighbours and the average clustering coefficient (Table I), do not depend on the size of the models. Therefore these parameters have similar values for the initial models: the average degree is within the range 4.04–4.96, the average incoming degree within 2.11–2.51, the average outgoing degree within 1.88–2.45, the average number of neighbours within 5.83–7.37 and the average clustering coefficient within 0.15–0.21. Both intersection models have values below the ranges of the initial models. All the average structural parameters of Eco i/s are up to twice as large as those of the Sce i/s model, indicating a lower connectivity of the Sce i/s metabolites that correlates with the smaller number of reactions. The exception is the average clustering coefficient, for which both intersection models have very similar values: 0.14 (Eco i/s) and 0.12 (Sce i/s). The average clustering coefficient of the intersection models is at least 20% smaller than that of the initial models. At the same time, it is much higher than the average clustering coefficient of random graphs that represent small-world networks [26].

Therefore low values of the average degree, the average incoming degree, the average outgoing degree and the average number of neighbours can be used as a fast indicator of poor quality of an intersection model.

C. Distributions of structural parameters

1) Incoming/outgoing degrees: The distributions of the incoming and outgoing degrees and the corresponding trendlines (Fig. 1) demonstrate a power-law distribution. All three E. coli models, Eco 1, Eco 2 and Eco i/s, have a similar percentage of metabolites with one or two links (71.4, 75.6 and 78.8, respectively) and of metabolites with more than ten links (4.3, 3.7 and 3.7, respectively). Thus, the E. coli networks under consideration have a few hubs that hold together low-degree metabolites (whose degree is less than 6 [27], [28]) and are scale-free. The main 10 hubs of the examined models are ATP, ADP, PPI, Pi, NAD, NADH, CARBON 45 DIOXIDE, NADP, NADPH and S 45 ADENOSYSMETHIONINE.

[Figure 1, panels: (a) Incoming degree, (b) Outgoing degree, (c) Incoming degree, (d) Outgoing degree]

Fig. 1: Distribution of the incoming degrees of the E. coli (a) and S. cerevisiae (c) models and of the outgoing degrees of the E. coli (b) and S. cerevisiae (d) models. The figure shows the probability that a selected metabolite has the incoming degree k. Of the commonly used functional dependencies, power-law trendlines describe the statistical trend of the incoming and outgoing degree distributions best, with a power in the interval −3 < y < −2. The power law explains 98% (Eco 1), 93% (Eco 2), 95% (Eco i/s), 95% (Sce 1), 94% (Sce 2) and 91% (Sce i/s) of the distribution data of the incoming degrees. However, the adjustable coefficient of the power function of the S. cerevisiae intersection model is 2 times (Sce 1) and 1.7 times (Sce 2) smaller than that of the initial models. The power law explains 97% (Eco 1), 98% (Eco 2), 97% (Eco i/s), 89% (Sce 1), 95% (Sce 2) and 91% (Sce i/s) of the distribution data of the outgoing degrees. However, the adjustable coefficient of the power function of the S. cerevisiae intersection model is 2 times (Sce 1) and 2.2 times (Sce 2) smaller than that of the initial models.

TABLE I: Structural parameters of the E. coli and S. cerevisiae models

| Parameter | Eco 1 | Eco 2 | Eco i/s | Sce 1 | Sce 2 | Sce i/s |
|---|---|---|---|---|---|---|
| Number of metabolites | 1314 | 1704 | 932 (71/55)* | 674 | 1061 | 162 (24/15)** |
| Number of connected metabolites | 1314 | 1704 | 932 | 674 | 1061 | 162 |
| Number of isolated metabolites | 0 | 0 | 0 | 0 | 0 | 0 |
| Number of links | 6540 | 8031 | 3367 (51/42)* | 2671 | 6399 | 234 (9/4)** |
| Number of reactions | 1252 | 1552 | 731 (58/47)* | 903 | 1267 | 88 (10/7)** |
| Average degree | 4.26 | 4.04 | 3.3 | 4.31 | 4.96 | 1.69 |
| Average in-degree | 2.26 | 2.16 | 1.77 | 2.11 | 2.51 | 0.84 |
| Average out-degree | 2 | 1.88 | 1.53 | 2.2 | 2.45 | 0.85 |
| Average number of neighbours | 6.3 | 5.98 | 5.05 | 5.83 | 7.37 | 2.38 |
| Average clustering coefficient | 0.2 | 0.19 | 0.14 | 0.15 | 0.21 | 0.12 |

* percent of Eco 1/Eco 2
** percent of Sce 1/Sce 2

[Figure 2, panels: (a) and (b), horizontal axis: Metabolite]

Fig. 2: Number of neighbours of the metabolites of the E. coli (a) and S. cerevisiae (b) models in decreasing order. Power-law trendlines describe the statistical trend of the number of neighbours best, with a power that tends to −1. The power law explains 97% (Eco 1), 97% (Eco 2), 96% (Eco i/s), 98% (Sce 1), 99% (Sce 2) and 90% (Sce i/s) of the distribution data of the number of neighbours. However, the adjustable coefficient of the power function of the S. cerevisiae intersection model is 9.3 times (Sce 1) and 21.5 times (Sce 2) smaller than that of the initial models.

In the case of the S. cerevisiae models, the Sce i/s model has 92.6% of metabolites with one or two links, compared to 58.6% and 63.2% for Sce 1 and Sce 2, respectively. The percentage of hubs with more than ten links is just 1.2 for Sce i/s, compared to 5.6 and 6.0 for Sce 1 and Sce 2, respectively. This clearly shows the low interconnectivity of the Sce i/s model. The main 10 hubs of Sce 1 and Sce 2 are ATP c, ADP c, Pi c, H e, NADP c, NADHP c, PPI c, CO2 c, GLU c and AMP c. The intersection model contains only three of these hubs: ATP c, ADP c and GLU c. Thus, a high percentage of low-degree metabolites and a low percentage of high-degree metabolites can indicate low quality of a model.
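The two indicators used above, the share of metabolites with one or two links and the share of hubs with more than ten links, are simple to compute once the degree of every metabolite is known. The Python sketch below shows one way to do it; the function name and the toy degrees are hypothetical and are not data from the analysed models.

```python
def degree_shares(degree_by_metabolite, low_max=2, hub_min=11):
    """Percentage of low-degree metabolites (1..low_max links) and of hubs (>= hub_min links)."""
    n_total = len(degree_by_metabolite)
    n_low = sum(1 for k in degree_by_metabolite.values() if 1 <= k <= low_max)
    n_hub = sum(1 for k in degree_by_metabolite.values() if k >= hub_min)
    return 100 * n_low / n_total, 100 * n_hub / n_total

# Toy degrees: one hub and several weakly connected metabolites
toy_degrees = {"atp": 14, "adp": 6, "m1": 1, "m2": 2, "m3": 1,
               "m4": 2, "m5": 3, "m6": 1, "m7": 2, "m8": 1}
low_share, hub_share = degree_shares(toy_degrees)
print(f"low-degree: {low_share:.1f} %, hubs: {hub_share:.1f} %")
```

A model with a very high low-degree share and a very small hub share would, following the reasoning above, be a candidate for a poor-quality intersection.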

The distributions of the incoming and outgoing degrees and the trendlines of the E. coli and S. cerevisiae models have a similar form and can be approximated by a power law (Fig. 1). The power law describes the statistical trend of the examined parameters better than, for example, linear, exponential, polynomial and other functional dependencies. The trendlines are power functions with a negative fractional exponent, which describe the statistical trend of the incoming and outgoing degree distributions best with a power within the interval −3 < y < −2. The trendlines of the E. coli intersection model are much closer to those of the original models than the trendline of the S. cerevisiae intersection model. Still, the power of the trendlines is very similar for all the analyzed models, and therefore the distribution of the incoming and outgoing degrees does not reflect the quality of the intersection models.

[Figure 3, panels: (a) and (b), horizontal axis: Metabolite]

Fig. 3: Distribution of the number of neighbours in the E. coli (a) and S. cerevisiae (b) models.

TABLE II: Clustering coefficient of the metabolites of the E. coli and S. cerevisiae models

| Clustering coefficient | Eco 1 | % | Eco 2 | % | Eco i/s | % | Sce 1 | % | Sce 2 | % | Sce i/s | % |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 537 | 41 | 814 | 48 | 525 | 59 | 312 | 46 | 416 | 39 | 129 | 80 |
| (0; 0.2) | 204 | 15 | 215 | 13 | 140 | 15 | 174 | 26 | 205 | 19 | 6 | 4 |
| [0.2; 0.5) | 414 | 31 | 462 | 27 | 74 | 8 | 136 | 20 | 315 | 30 | 7 | 4 |
| [0.5; 0.7) | 104 | 8 | 136 | 8 | 40 | 4 | 31 | 5 | 65 | 6 | 8 | 5 |
| [0.7; 1) | 21 | 2 | 35 | 2 | 6 | 1 | 4 | 1 | 15 | 2 | 0 | 0 |
| 1 | 34 | 3 | 42 | 2 | 23 | 3 | 17 | 3 | 45 | 4 | 12 | 7 |

Therefore the percentages of metabolites with low interconnectivity (one or two links) and of hubs (more than ten links) can be used to determine the quality of the intersection models, while the approximation of the distributions by a power law is less informative.
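A power-law trendline of the kind discussed in this subsection can be obtained by ordinary least squares on log-transformed data. The Python sketch below is an illustration only: it assumes that the trendline has the form $P(k) = c\,k^{y}$ and that the coefficient of determination is evaluated in log-log space, which may differ from the way the trendlines in Fig. 1 were produced. The function name and the synthetic data are hypothetical.

```python
import math

def fit_power_law(ks, ps):
    """Least-squares fit of p = c * k**y in log-log space; returns (c, y, r_squared)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = s_xy / s_xx                       # power y of the trendline
    log_c = mean_y - exponent * mean_x
    ss_res = sum((y - (log_c + exponent * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.exp(log_c), exponent, 1 - ss_res / ss_tot

# Synthetic distribution that follows p = 0.5 * k**-2.5 exactly
ks = [1, 2, 4, 8, 16]
ps = [0.5 * k ** -2.5 for k in ks]
coefficient, exponent, r_squared = fit_power_law(ks, ps)
print(coefficient, exponent, r_squared)
```

On real degree distributions the fitted power and the adjustable coefficient $c$ can then be compared between the original models and their intersection, as is done in the caption of Fig. 1.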

2) Number of neighbours: The distribution curves of the initial models in the region of a high number of neighbours (Fig. 2) are very similar. Larger models have more neighbours. The Sce i/s model has a very low number of neighbours, reflecting a very low number of interactions in the network. All three models of both E. coli and S. cerevisiae show the same tendency in the number of neighbours of the highly connected metabolites. The power law describes the changes in the number of neighbours with a coefficient of determination $R^2 \ge 0.9$ and a power $a \le -0.5$ that tends to $-1$. The main difference between the S. cerevisiae intersection model and the others is its much smaller number of neighbours. Still, analysing the distribution of the neighbours of the highly interconnected metabolites cannot reveal whether a small intersection model is caused by a small overlap of the model scopes or by low agreement of the model authors about the same processes.

The distribution curves in the region of a low number of neighbours (Fig. 3) reveal interesting differences between the E. coli and the S. cerevisiae models. In the case of E. coli (Fig. 3a), the distributions of all three models are very similar and proportional to the size of the models. The distribution curves of the S. cerevisiae models are very different: there is no clear peak at two neighbours in any of the models. The model Sce 1 even has a local minimum in the number of metabolites with two neighbours, where all the E. coli models have a maximum. The distribution curve of the intersection model Sce i/s is shaped differently from all the other curves, with similar values for one and for two neighbours. Because the curves of the original pairs of models differ in shape, the distribution curves in the region of a low number of neighbours cannot be used to determine the quality of the intersection models.

3) Clustering coefficients: The clustering coefficient values in the E.coli and the S.cerevisiae models (Fig. 4, Table II) indicate that most of the model metabolites have a clustering coefficient below 0.2. In the case of E.coli, that is 56.4%, 60.4% and 71.4% of the metabolites for the Eco 1, Eco 2 and Eco i/s models, respectively. In the case of S.cerevisiae, that is 72.1%, 58.5% and 83.3% for the Sce 1, Sce 2 and Sce i/s models, respectively. High clustering coefficients (above 0.7) are found for 4.2%, 4.5% and 3.1% of the metabolites in the Eco 1, Eco 2 and Eco i/s models, respectively, and for 3.1%, 5.7% and 7.4% in the Sce 1, Sce 2 and Sce i/s models, respectively. The distributions of the clustering coefficients show similar values and therefore cannot be used to determine the quality of the intersection models.
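The shares of metabolites below 0.2 and above 0.7 can be illustrated with the local clustering coefficient in the sense of Watts and Strogatz, C = 2E / (k(k − 1)), where k is the number of neighbours of a node and E the number of links among those neighbours. The sketch below assumes an undirected metabolite graph without self-links and assigns C = 0 to nodes with fewer than two neighbours; whether BINESA uses exactly these conventions is not stated here, and the example graph is hypothetical.

```python
from itertools import combinations


def local_clustering(adjacency):
    """Local clustering coefficient for every node.

    adjacency : dict mapping a node to the set of its neighbours
                (undirected graph, no self-links).
    """
    coeffs = {}
    for node, nbrs in adjacency.items():
        k = len(nbrs)
        if k < 2:
            coeffs[node] = 0.0  # convention assumed for this sketch
            continue
        links = sum(1 for u, v in combinations(nbrs, 2)
                    if v in adjacency.get(u, ()))
        coeffs[node] = 2.0 * links / (k * (k - 1))
    return coeffs


def share(coeffs, predicate):
    """Percentage of nodes whose coefficient satisfies the predicate."""
    values = list(coeffs.values())
    return 100.0 * sum(1 for c in values if predicate(c)) / len(values)


# Hypothetical graph: a triangle A-B-C with an extra node D attached to A.
graph = {"A": {"B", "C", "D"}, "B": {"A", "C"}, "C": {"A", "B"}, "D": {"A"}}
coeffs = local_clustering(graph)
below = share(coeffs, lambda c: c < 0.2)
above = share(coeffs, lambda c: c > 0.7)
print(coeffs, below, above)
```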

Fig. 4: Clustering coefficient of network metabolites in the E.coli (a) and S.cerevisiae (b) models in decreasing order (horizontal axis: metabolite).

CONCLUSIONS

Automated generation of the intersection of two models, combined with its structural analysis, can give an indication of the agreement level between metabolic models of a particular organism. Because the intersection is generated automatically, identical reactions may go unrecognized owing to different names and different representations of formulas. Therefore, a more intensive manual study of the reactions with a relatively low similarity may increase the accuracy of the intersection models. The use of standardized metabolite names, or information about the name generation method, would help to reveal similarity more accurately. Lumped reactions, and the biomass reaction in particular, make it difficult to recognize equal fragments of the models. Artificial lumping of the reactions may be proposed to avoid false negative cases.
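The naming problem described above can be illustrated with a simple string-similarity screen that flags metabolite names which differ as written but are nearly identical after normalisation. This is only a stand-in built on Python's standard difflib module and is not the algorithm used by ModeRator; the normalisation rule, the threshold of 0.8 and the example names are assumptions.

```python
from difflib import SequenceMatcher


def normalise(name):
    """Lower-case a name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def candidate_matches(names_a, names_b, threshold=0.8):
    """Pairs whose spelling differs but whose similarity reaches the threshold."""
    pairs = []
    for a in names_a:
        for b in names_b:
            if a == b:
                continue  # identical names need no manual review
            score = round(similarity(a, b), 3)
            if score >= threshold:
                pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical metabolite names from two models.
model_a = ["glucose 6-phosphate", "pyruvate"]
model_b = ["glucose-6-phosphate", "acetate"]
flagged = candidate_matches(model_a, model_b)
print(flagged)
```

Pairs flagged in this way are candidates for the manual study mentioned above rather than confirmed matches.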

Some structural parameters of the intersection are good indicators of the agreement level between the models: they behave differently for the high-agreement intersection model (E.coli) and the low-agreement intersection model (S.cerevisiae). Low agreement of a model pair, resulting in a fragmented, poor-quality intersection model, is indicated by low values of the average degree, the average incoming degree, the average outgoing degree and the average number of neighbours. Low agreement of a model pair can also be detected from the distribution of the incoming and outgoing degrees of the metabolites: a high percentage of low-interconnectivity metabolites (one or two links) and a low percentage of hubs (more than ten links).
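The four averages listed above can be sketched for a directed metabolite graph given as a list of (source, target) links. The definitions used here are assumptions made for illustration: the degree of a metabolite is taken as the sum of its incoming and outgoing links, and its neighbours are the distinct metabolites linked to it in either direction. The function name and the example links are hypothetical.

```python
def degree_summary(edges):
    """Average in-degree, out-degree, degree and neighbour count.

    edges : iterable of (source, target) pairs of a directed graph;
            duplicate links are counted once.
    """
    out_nbrs, in_nbrs = {}, {}
    for s, t in edges:
        out_nbrs.setdefault(s, set()).add(t)
        out_nbrs.setdefault(t, set())
        in_nbrs.setdefault(t, set()).add(s)
        in_nbrs.setdefault(s, set())
    nodes = list(out_nbrs)
    n = len(nodes)
    if n == 0:
        raise ValueError("graph has no links")
    avg_in = sum(len(in_nbrs[m]) for m in nodes) / n
    avg_out = sum(len(out_nbrs[m]) for m in nodes) / n
    avg_nbrs = sum(len(in_nbrs[m] | out_nbrs[m]) for m in nodes) / n
    return {
        "avg_in": avg_in,
        "avg_out": avg_out,
        "avg_degree": avg_in + avg_out,  # assumed: degree = in + out
        "avg_neighbours": avg_nbrs,
    }


# Hypothetical links between three metabolites.
summary = degree_summary([("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")])
print(summary)
```

Computing the same summary for both initial models and for their intersection allows the values to be compared side by side.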

Some parameters only indicate the size of the intersection model compared to the initial ones. A given size can be interpreted either as a high agreement in a small part of the models or as a low agreement in a larger part of the models. Therefore, intersection model parameters such as the number of metabolites and the number of reactions cannot be used to assess the level of agreement.

The number of metabolites, the number of reactions, the power-law approximation of the incoming/outgoing degrees, the distribution of the neighbours and the distribution of the clustering coefficients of the intersection model are weak indicators of the agreement level of the two initial models.

A structural analysis of the model intersections can also be applied to find similarities between the models of different organisms.

ACKNOWLEDGEMENTS

This work is partly funded by a project of European Structural Fund Nr. 2009/0207/1DP/1.1.1.2.0/09/APIA/VIAA/128 “Latvian Interdisciplinary Interuniversity Scientific Group of Systems Biology”, www.sysbio.lv.

This work and academic study are funded by the project “Support for doctoral studies in LUA” /2009/0180/1DP/1.1.2.1.2/09/IPIA/VIAA/017, agreements Nr. 04.4-08/EF2.PD.68 (T. Rubina) and 04.4-08/EF2.D3.30 (M. Mednis), and partly funded by the Latvian Council of Science grant Nr. 09.1578 “Design and analysis of models for description of biological and software systems”.

REFERENCES

[1] J. Schellenberger, R. Que, R. M. T. Fleming, I. Thiele, J. D. Orth, A. M. Feist, D. C. Zielinski, A. Bordbar, N. E. Lewis, S. Rahmanian, J. Kang, D. R. Hyduke, and B. O. Palsson, “Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox v2.0.” Nature Protocols, vol. 6, pp. 1290–307, 2011.

[2] I. Thiele and B. O. Palsson, “A protocol for generating a high-quality genome-scale metabolic reconstruction.” Nature Protocols, vol. 5, no. 1, pp. 93–121, Jan. 2010.

[3] B. O. Palsson, Systems Biology: Properties of reconstructed networks. Cambridge University Press, 2006.

[4] J. Schellenberger, J. O. Park, T. M. Conrad, and B. O. Palsson, “BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions,” BMC Bioinformatics, vol. 11, p. 213, 2010.

[5] N. C. Duarte, S. A. Becker, N. Jamshidi, I. Thiele, M. L. Mo, T. D. Vo, R. Srivas, and B. O. Palsson, “Global reconstruction of the human metabolic network based on genomic and bibliomic data.” Proceedings of the National Academy of Sciences of the United States of America, vol. 104, no. 6, pp. 1777–82, Feb. 2007.

[6] I. Thiele, N. Swainston, R. M. T. Fleming, A. Hoppe, S. Sahoo, M. K. Aurich, H. Haraldsdottir, M. L. Mo, O. Rolfsson, M. D. Stobbe, S. G. Thorleifsson, R. Agren, C. Bolling, S. Bordel, A. K. Chavali, P. Dobson, W. B. Dunn, L. Endler, D. Hala, M. Hucka, D. Hull, D. Jameson, N. Jamshidi, J. J. Jonsson, N. Juty, S. Keating, I. Nookaew, N. Le Novere, N. Malys, A. Mazein, J. A. Papin, N. D. Price, E. Selkov, M. I. Sigurdsson, E. Simeonidis, N. Sonnenschein, K. Smallbone, A. Sorokin, J. H. G. M. van Beek, D. Weichart, I. Goryanin, J. Nielsen, H. V. Westerhoff, D. B. Kell, P. Mendes, and B. O. Palsson, “A community-driven global reconstruction of human metabolism,” Nature Biotechnology, vol. 31, no. 5, pp. 419–25, May 2013.

[7] M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, and M. Tanabe, “KEGG for integration and interpretation of large-scale molecular data sets.” Nucleic Acids Research, vol. 40, no. Database issue, pp. D109–14, 2011.

[8] M. Kanehisa and S. Goto, “KEGG: Kyoto Encyclopedia of Genes and Genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.

[9] I. M. Keseler, J. Collado-Vides, A. Santos-Zavaleta, M. Peralta-Gil, S. Gama-Castro, L. Muniz Rascado, C. Bonavides-Martinez, S. Paley, M. Krummenacker, T. Altman, P. Kaipa, A. Spaulding, J. Pacheco, M. Latendresse, C. Fulcher, M. Sarker, A. G. Shearer, A. Mackie, I. Paulsen, R. P. Gunsalus, and P. D. Karp, “EcoCyc: a comprehensive database of Escherichia coli biology.” Nucleic Acids Research, vol. 39, no. Database issue, pp. D583–90, Jan. 2011.

[10] P. D. Karp, C. A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and N. Lopez-Bigas, “Expansion of the BioCyc collection of pathway/genome databases to 160 genomes.” Nucleic Acids Research, vol. 33, no. 19, pp. 6083–6089, Jan. 2005.

[11] J. W. Whitaker, I. Letunic, G. A. McConkey, and D. R. Westhead, “metaTIGER: a metabolic evolution resource,” Nucleic Acids Research, vol. 37, no. Database issue, pp. D531–D538, 2009.

[12] A. Kostromins and E. Stalidzans, “Paint4Net: COBRA Toolbox extension for visualization of stoichiometric models of metabolism,” Biosystems, 2012.

[13] M. Mednis, V. Brusbardis, and V. Galvanauskas, “Comparison of genome-scale reconstructions using ModeRator,” in 13th IEEE International Symposium on Computational Intelligence and Informatics, Budapest, 2012, pp. 79–84.

[14] A.-L. Barabasi and Z. N. Oltvai, “Network biology: understanding the cell’s functional organization.” Nature Reviews Genetics, vol. 5, no. 2, pp. 101–113, 2004.

[15] M. Arita, “The metabolic world of Escherichia coli is not small,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, pp. 1543–1547, 2004.

[16] D. J. Watts and S. H. Strogatz, “Collective dynamics of ‘small-world’ networks.” Nature, vol. 393, pp. 440–2, 1998.

[17] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and A. L. Barabasi, “Hierarchical organization of modularity in metabolic networks,” Science, vol. 297, pp. 1551–1555, 2002.

[18] A. Wagner, “The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes.” Molecular Biology and Evolution, vol. 18, pp. 1283–1292, 2001.

[19] T. Rubina, “Tools for analysis of biochemical network topology,” Biosystems and Information Technology, vol. 1, no. 1, pp. 25–31, 2012. [Online]. Available: http://bit-journal.eu/publications/2012-11/bit id 121101 rubina.pdf

[20] T. Rubina and E. Stalidzans, “Software tools for structure analysis of biochemical networks,” in Proceedings of Applied Information and Communication Technologies, Jelgava, 2010, pp. 33–49.

[21] L. Kuepfer, U. Sauer, and L. M. Blank, “Metabolic functions of duplicate genes in Saccharomyces cerevisiae.” Genome Research, vol. 15, no. 10, pp. 1421–30, Oct. 2005.

[22] N. C. Duarte, M. J. Herrgård, and B. O. Palsson, “Reconstruction and validation of Saccharomyces cerevisiae iND750, a fully compartmentalized genome-scale metabolic model.” Genome Research, vol. 14, no. 7, pp. 1298–309, Jul. 2004.

[23] J. Forster, I. Famili, P. Fu, B. O. Palsson, and J. Nielsen, “Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network.” Genome Research, vol. 13, no. 2, pp. 244–53, Feb. 2003.

[24] M. Mednis and M. K. Aurich, “Application of string similarity ratio and edit distance in automatic metabolite reconciliation comparing reconstructions and models,” Biosystems and Information Technology, vol. 1, no. 1, pp. 14–18, 2012.

[25] T. Rubina and E. Stalidzans, “BINESA a software tool for evolution modelling of biochemical networks structure,” in Proceedings of 14th IEEE International Symposium on Computational Intelligence and Informatics, Budapest, Hungary, 2013.

[26] A. Wagner and D. A. Fell, “The small world inside large metabolic networks,” Proceedings. Biological sciences / The Royal Society, vol. 268, no. 1478, pp. 1803–10, Sep. 2001.

[27] T. Hase, H. Tanaka, Y. Suzuki, S. Nakagawa, and H. Kitano, “Structure of Protein Interaction Networks and Their Implications on Drug Design,” PLoS Computational Biology, vol. 5, p. 9, 2009.

[28] A. Patil and H. Nakamura, “Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks.” FEBS Letters, vol. 580, pp. 2041–2045, 2006.

CINTI 2013 • 14th IEEE International Symposium on Computational Intelligence and Informatics • 19–21 November, 2013 • Budapest, Hungary