Physica A 385 (2007) 750–764
www.elsevier.com/locate/physa
doi:10.1016/j.physa.2007.07.028

Empirical analysis of the evolution of a scientific collaboration network

Marco Tomassini*, Leslie Luthi

Information Systems Department, University of Lausanne, Switzerland

*Corresponding author. E-mail address: [email protected] (M. Tomassini).

Received 2 May 2007; received in revised form 19 June 2007
Available online 25 July 2007

Abstract

We present an analysis of the temporal evolution of a scientific coauthorship network, the genetic programming network. We find evidence that the network grows according to preferential attachment, with a slightly sublinear rate. We empirically find how a giant component forms and develops, and we characterize the network by several other time-varying quantities: the mean degree, the clustering coefficient, the average path length, and the degree distribution. We find that the first three statistics increase over time in the growing network; the degree distribution tends to stabilize toward an exponentially truncated power-law. We finally suggest an effective network interpretation that takes into account the aging of collaboration relationships.

© 2007 Elsevier B.V. All rights reserved.

PACS: 89.65.-s; 89.75.-k; 89.75.Fb

Keywords: Network evolution; Preferential attachment; Scientific collaboration; Social networks

1. Introduction and previous work

In recent years, thanks to the increasing availability of machine-readable data, many large networks have been empirically analyzed in detail in several disciplines including communications and information networks, biological, social, and technological networks. In many cases it has been found that these networks have small diameters and high clustering. In other words, any node is relatively close to any other node, and the local connection structure will not be random, but rather shaped by social or other forces [1,2]. The origins and the evolution of such networks have been the object of intensive research and there exist several models that can be used to explain the experimentally observed data. However, while models abound by now and the theory is rather well developed, the analysis has concentrated on static networks, i.e. networks that are, or are considered to be, in a steady state. However, to test models on network formation and evolution, one needs to study actual networks for which time-resolved data do exist, and these are more difficult to find. Some works have dealt with this kind of problem in the last few years. Noteworthy among them are the following


investigations: Newman’s study on scientific collaboration networks [3], and the investigations by Barabasi et al. [4] and Jeong et al. [5] on the growth of coauthorship, citation, Internet, and actor collaboration networks. Other interesting studies have targeted the web [6,7], potential energy surfaces [8], social interactions represented by e-mail exchanges [9] and the Internet encyclopedia Wikipedia [10]. Some of these networks are technological, such as the Internet, while others have a more social flavor. Citation networks, the web, and Wikipedia cannot be considered social networks in the proper sense, although they do support communication and information transmission in social contexts. On the other hand, scientific collaboration networks, e-mail networks, and the actor network usually imply underlying social ties with their associated costs and thus they are considered at least good proxies for social networks. Some networks are directed (the web, Wikipedia, citation networks) while others are undirected (coauthorship networks, actors, Internet, potential energy surfaces).

Most of the graphs in the above-mentioned works, as well as many others, have a measured degree distribution $P(k)$ (see footnote 1) that is either a power-law $P(k) \propto k^{-\gamma}$, or a power-law with an exponential cutoff. This means that there is a non-negligible probability in these graphs that some vertices have high connectivity. Several growing models have been proposed to account for these topological features, most of them being based on some form of preferential attachment. Preferential attachment means that when new nodes join the graph linking to existing nodes $j$, the rate $\Delta k_j/\Delta t$ is an increasing function of the degree $k_j$ of $j$. Some models assume this function to be linear [11], while in other cases it has been assumed to depend on a different power of $k_j$ [7,12]. In general, we have that the probability $P(k_j)$ with which an edge belonging to a new node connects to an existing node $j$ of degree $k_j$ will be

$$P(k_j) = \frac{k_j^{\alpha}}{\sum_i k_i^{\alpha}},$$

where the sum is over all vertices $i$ already present in the graph. Thus the rate of increase of node degree will be $\Delta k/\Delta t \propto k^{\alpha}$.

For $\alpha = 1$ the rate is linear and the model reduces to the familiar Barabasi–Albert construction [11], which yields a power-law degree distribution $P(k)$. For $\alpha < 1$ the preferential attachment is sublinear and $P(k)$ is a stretched exponential [12]. For $\alpha > 1$ a single node gets almost all the edges, with the rest having an exponential distribution of the degrees. Therefore, to know which kind of preferential attachment, if any, is at work in a particular growing network, one needs to study empirically networks for which the time at which new nodes entered the graph and new edges formed is known.
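The attachment rule above can be illustrated with a minimal Python sketch. It is not the procedure used in this paper: the function names, the two-node seed graph, and the choice of one edge per newcomer are assumptions made only for illustration.

```python
import random


def attachment_probabilities(degrees, alpha):
    """Return P(k_j) = k_j**alpha / sum_i k_i**alpha for each existing node."""
    weights = [k ** alpha for k in degrees]
    total = sum(weights)
    return [w / total for w in weights]


def grow(n_nodes, alpha, seed=0):
    """Grow a graph node by node; each newcomer brings exactly one edge (assumption)."""
    rng = random.Random(seed)
    degrees = [1, 1]  # hypothetical seed graph: two nodes joined by one edge
    while len(degrees) < n_nodes:
        probs = attachment_probabilities(degrees, alpha)
        target = rng.choices(range(len(degrees)), weights=probs, k=1)[0]
        degrees[target] += 1   # the chosen existing node gains a link
        degrees.append(1)      # the newcomer starts with degree 1
    return degrees


if __name__ == "__main__":
    # Compare the largest degree reached under sublinear, linear and superlinear kernels.
    for alpha in (0.5, 1.0, 1.5):
        print(alpha, max(grow(2000, alpha)))
```

Running the script for several values of `alpha` gives a rough feel for how the largest hub grows with the exponent; a single run is of course no substitute for the analytical results cited in the text.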

The following conclusions have been reached in Refs. [3,5,10]. First of all, preferential attachment appears to be present in all the studied networks, although some follow an almost linear growth (Internet, citations and Wikipedia [5,10]), while others appear to grow at a sublinear rate ($\alpha < 1$) and give rise to stretched exponential distributions. The latter case is present in some coauthorship networks and actor collaborations [3,5]. Since the coauthorship network studied in Ref. [4] shows a power-law degree distribution, the apparent contradiction has been explained in Ref. [5] by the presence of another linear preferential attachment mechanism involving the appearance of new internal edges among existing nodes.

From these results, it appears that social networks differ somewhat in their growing properties from information networks such as the web, Wikipedia and citation networks. This may be due to the fact that making new links in the latter is essentially free of cost, while there is some cost in the former related to the necessity of becoming acquainted with some other individual in the network before an association is possible.

In the present study we investigate the evolution of another scientific collaboration network, the genetic programming (GP) coauthorship network. The GP bibliography, created and maintained by W.B. Langdon and by S. Gustafson (see footnote 2), is a database that contains almost all the papers published in the GP field since its inception around 1986. This database is smaller than those that have been studied previously [4,13–15], but it has the advantage of being essentially complete. Moreover, different authors with the same initials and surnames, and the same author spelled differently, are rare occurrences, while this is a source of some error in the larger databases. We are thus in a position to study the growth of the collaboration network from the very beginning, which can be revealing. In contrast, previous studies were limited to a given time window due to the enormous amount of data and, sometimes, to the lack of reliable data at the beginning of the network history. The static structure of the network as of January 2007, including the study of its communities, has been fully described in Ref. [16]. Its main features are summarized in the next section for the sake of completeness. In the present work we focus on the analysis of the time evolution of general network properties, especially the behavior of connected components, and the processes that lead to network growth, including preferential attachment and clustering.

Footnote 1: There are two distinct degree distributions in directed networks: one for the incoming links $P(k_{in})$ and another for the outgoing links $P(k_{out})$.

Footnote 2: http://www.cs.bham.ac.uk/~wbl/biblio/

2. The frozen collaboration network in 2006

We treat the GP social network as a graph where each node is a GP researcher, i.e. someone who has at least one entry in the bibliography. There is a connection between two people if they have coauthored one or more papers, or if they have coedited at least one book or proceedings volume. In order to characterize the strength of the interaction, one could attribute a weight $w_{ij}$ to a link $\langle ij \rangle$ with a value proportional, say, to the number of papers coauthored by $i$ and $j$. However, in the following analysis we use the simple unweighted network.

As of the start of 2007, there is a total of $N = 2809$ connected nodes, i.e. authors who have at least one collaborator, and a total of $E = 5853$ edges (collaborations) in the GP coauthorship network. The graph is thus sparse since $E \propto N$. There are 367 isolated vertices, which represent authors who have not collaborated with others to the extent of coauthoring a paper. Isolated vertices are ignored in the graph statistics. In the rest of the section we summarize the main quantitative aspects of the network.

The average number of collaborators per author, i.e. the mean degree $\langle k \rangle$ of the coauthorship graph, is 4.17 and the network turns out to be degree assortative, with a correlation coefficient of 0.13.

The distribution of the number of collaborators per author, i.e. the degree distribution, does not appear to be a pure power-law. Rather, the distribution shows a power-law regime in the first part followed by an exponential decay in the tail. This is quite common. In fact, several measured social networks do not follow a power-law degree distribution [14,17] and are best fitted either by an exponential degree distribution $P(k) \sim e^{-k/\langle k \rangle}$ or by an exponentially truncated power-law of the type $P(k) \sim k^{-\gamma} e^{-k/k_c}$, where $k_c$ represents a critical connectivity and $\langle k \rangle$ is the average degree.

The average clustering coefficient is $\langle C \rangle = 0.665$, whereas we would expect $0.0015$ for a random graph with the same number of vertices and edges.
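As a quick arithmetic check, the mean degree and the random-graph clustering value quoted above follow directly from $N$ and $E$. The sketch below assumes the usual Erdős–Rényi estimate, in which the expected clustering equals the edge probability $p = 2E/[N(N-1)]$; the function names are illustrative and not taken from the paper.

```python
def mean_degree(n_nodes, n_edges):
    """Each edge contributes to the degree of two nodes, so <k> = 2E / N."""
    return 2 * n_edges / n_nodes


def random_graph_clustering(n_nodes, n_edges):
    """Expected clustering of a random graph with the same N and E:
    the probability p that any given pair of nodes is linked (assumed G(N, p) estimate)."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


if __name__ == "__main__":
    N, E = 2809, 5853  # values quoted in the text
    print(round(mean_degree(N, E), 2))              # 4.17
    print(round(random_graph_clustering(N, E), 4))  # 0.0015
```

The measured value of 0.665 is therefore several hundred times larger than the random-graph expectation.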

The average path length of the giant component of the GP collaboration graph is 4.74. The diameter, i.e. the longest among all the shortest paths, is 12.

Table 1 summarizes the results of this section and compares them with those of some other collaboration networks.

Table 1. Basic statistics for some scientific collaboration networks

| | GP | SPIRES | Medline | Mathematics | NCSTRL | Physics |
|---|---|---|---|---|---|---|
| Total number of papers | 4564 | 66,652 | 2,163,923 | 1,600,000 | 13,169 | 98,502 |
| Total number of authors | 2809 | 56,627 | 1,520,251 | 253,339 | 11,994 | 52,909 |
| Average papers per author | 3.16 | 11.6 | 6.4 | 7 | 2.55 | 5.1 |
| Average authors per paper | 2.25 | 8.96 | 3.754 | 1.5 | 2.22 | 2.53 |
| Average collaborators per author | 4.17 | 173 | 18.1 | 2.94 | 3.59 | 9.7 |
| Size of the giant component (%) | 36.5 | 88.7 | 92.6 | 82.0 | 57.2 | 85.0 |
| Clustering coefficient | 0.665 | 0.726 | 0.066 | 0.15 | 0.496 | 0.43 |
| Average path length | 4.74 | 4.0 | 4.6 | 7.73 | 9.7 | 5.9 |

GP is the genetic programming bibliography at the start of 2007. SPIRES is a data set of papers in high-energy physics. Medline is a database of articles on biomedical research. Mathematics comprises articles from Mathematical Reviews. NCSTRL is a database of preprints in computer science. Physics has been assembled from papers posted on the Physics E-print Archive. Details about these databases can be found in Refs. [13–16].


3. Network evolution

In this section we investigate the time evolution of several interesting aspects of the collaboration network.

3.1. Authors and collaborations

Fig. 1 shows the cumulated number of vertices (authors) and edges (collaborations) for the whole period of network existence up to the end of 2006. Collaborations grow faster than authors as new edges come not only from collaborations involving new authors but also from authors who were already in the network. Furthermore, a new paper involving $n$ new authors generates $n(n-1)/2$ new collaborations.
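The $n(n-1)/2$ count comes from linking every pair of coauthors of a paper. A minimal sketch of how an unweighted coauthorship edge set could be built from author lists follows; the input format (one list of author names per paper) and the function name are assumptions for illustration only.

```python
from itertools import combinations


def coauthorship_edges(papers):
    """Return the set of undirected author pairs implied by a list of papers.

    `papers` is an iterable of author-name lists (hypothetical input format).
    Repeated collaborations collapse into one edge, as in an unweighted network.
    """
    edges = set()
    for authors in papers:
        # Every pair of distinct coauthors on one paper is one collaboration.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges
```

For a single paper with four previously unseen authors this yields $4 \cdot 3 / 2 = 6$ edges, which is why the number of collaborations can grow faster than the number of authors.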

3.2. Evolution of the largest component

In Poisson random graphs there is a critical value of the average degree $\langle k \rangle$ above which, for $N \to \infty$, $N$ being the number of vertices, there is a sudden appearance of a giant component. In summary, for $\langle k \rangle < 1$, all components are of size $O(\log N)$, but as $\langle k \rangle$ reaches 1 the system undergoes a phase transition whence one giant component appears connecting a finite fraction of all vertices, i.e. $O(N)$, while all other components are smaller with size $O(\log N)$ [18]. While the above results are strictly true in the infinite graph limit for Poisson random graphs, it has been shown that a similar phenomenon is present in more general models of random graphs having scale-free or other arbitrary degree distributions [19–21].
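The transition at $\langle k \rangle = 1$ can be observed numerically. The sketch below is only an illustration under stated assumptions: it places roughly $N \langle k \rangle / 2$ edges between uniformly random vertex pairs (a stand-in for a Poisson random graph) and tracks components with a union-find structure; none of this is part of the paper's analysis.

```python
import random
from collections import Counter


def largest_component_fraction(n, mean_degree, seed=0):
    """Fraction of vertices in the largest component of a random graph with
    n vertices and about n * mean_degree / 2 uniformly random edges."""
    rng = random.Random(seed)
    parent = list(range(n))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _ in range(int(n * mean_degree / 2)):
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:  # skip self-loops; repeated pairs are harmless
            parent[find(a)] = find(b)

    sizes = Counter(find(v) for v in range(n))
    return max(sizes.values()) / n


if __name__ == "__main__":
    for c in (0.5, 0.9, 1.1, 2.0):
        print(c, largest_component_fraction(5000, c))
```

Below the critical mean degree the largest component stays a tiny fraction of the graph, while above it a finite fraction of all vertices is connected.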

In social networks such as the one presented here the emergence of a largest connected cluster has also been found empirically [4,13,14], although to our knowledge, no rigorous explanation has yet been given. One observes that a collaboration network is fragmented into many connected components, which may correspond to discipline boundaries, or geographical and location boundaries, or both. There are also human behaviors that may result in fragmentation such as researchers who almost never collaborate with others or groups of people who collaborate solely within the group. Whatever the reasons, it is interesting to watch the behavior of collaborating clusters as time goes by, an analysis that we can perform on the data starting from the very beginning of the network existence and that can hopefully cast some light on the giant component emergence process.

Fig. 2 shows the birth and the evolution of the largest component and of the second largest expressed in terms of number of nodes (a) and relative size (b). After a transient period between 1986 and 1996 in which there are only small components and the size fluctuates, one can see the appearance of the giant component around 1996–1997. This interestingly coincides with the first general conference of the field which took place in 1996, producing a high number of papers and, as a consequence, many new collaborations. In this way many isolated individuals and groups were led to collaborate and thus to be part of a large growing network. Since then the component size has been growing steadily, reaching about 37% of the total by the end of 2006. This figure is smaller than analogous values for other collaboration networks, which range between 57% and 92% (see Table 1), and it suggests that the largest component is still growing and has not yet reached its asymptotic value. One possible explanation is that the present collaboration network is relatively young and there are isolated individuals and groups of researchers who are likely to become part of the largest component in the next few years. In addition, researchers are continuously entering the field, for example by writing papers during their PhD studies with senior researchers who are already in the giant component. The results differ somewhat from those found by Barabasi et al. [4]. In Ref. [4] it was suggested that the giant component appears very early in the collaboration process, which is not the case in the GP coauthorship network. However, they used data from a period well after the initial stage, in which many scientists had already collaborated but appear to be disconnected in the data set. It is thus difficult to draw conclusions on the giant component formation from their data.

Fig. 1. Evolution of the number of authors and collaborations during the period 1986–2006. The inset shows in detail the years 1996–2006 during which the growth rate of the number of authors is linear, while the number of collaborations grows quadratically.

Fig. 2. Evolution of the size of the biggest and second biggest components; (a) absolute size; (b) relative size.

Fig. 3 indicates that the number of components took off around 1993 and increases linearly as the network is a non-equilibrium open system. We can also observe that, around year 1993, the increase in the number of components parallels the increase in the number of authors and collaborations (Fig. 1). However, the emergence of a giant component comes a few years later (around 1997), which lends credence to our remark above stating that the giant component does not form immediately.

Finally, Fig. 4 shows the cumulative distribution of the component sizes. Statistics are not reliable in the first few years as there are too few data. However, in the last period the curves in Fig. 4 can be fitted rather well by a straight line on a log–log scale which means that the distributions follow a power-law.

3.3. Average degree

Fig. 5 depicts the evolution of the average degree. As expected, the average degree increases with time, indicating that, on average, more links are added to the network than new authors, as confirmed by Fig. 1. This result is in agreement with the findings of Barabasi et al. [4]. The "spike" in the middle-upper part of the figure is due to a paper published in 1995 with 12 coauthors. It is also interesting to note in this figure that the average degree jumps from about 2 to about 4 at the time of the giant component formation.

3.4. Degree distribution

In this section we report results on the evolution of the degree distribution of the collaboration graph. As noted in Section 2, the degree distribution of the cumulated network up to the end of 2006 cannot be fitted by a single power-law. The observed distribution is more likely to obey an exponentially truncated power law or stretched exponential as previously found for some other collaboration networks [3,14]. This seems to be a more widespread observation for social networks not just limited to those of the scientific collaboration type. For instance, recently proposed models of social networks give rise to such a kind of distribution [22,23]. On the other hand, only by introducing a concept of node fitness can these distributions be reduced to the scale-free type, as shown in Ref. [24]. Barabasi et al. [4] found that their data on two collaboration networks indicate that the graphs are scale free. However, they identified two different scaling regimes with slightly different power-law exponents. Of course, whether the decaying behavior is due to an exponential cutoff or to a different power-law regime is a matter of debate. Nevertheless, Barabasi et al. provided quantitative arguments in support of the latter interpretation.

Fig. 3. Evolution of the number of components present in the network.

Fig. 4. Evolution of the cumulative component size distribution. Log–log scales. Number of components is plotted every other year from 1986. The largest component is excluded from the plot starting from its emergence around 1996.

In Fig. 6, we see that data are insufficient and give noisy results until the last few years during which the shape of the distribution does not change significantly. Fig. 6(b) shows the distribution at the end of the even years from 2000 to 2006 where $k$ has been normalized by its average value $\langle k \rangle$ to show the common shape of the curves.

Fig. 5. Evolution of the average degree of the network.

Fig. 6. Evolution of the cumulative degree distribution. Curves are traced every other year from 1996 in (a). Curves in (b) refer to the period 2000–2006 and are normalized by the average degree. Log–log scales.

3.5. Clustering coefficient

The temporal behavior of the average clustering coefficient $C$ is reported in Fig. 7 for the whole network and for the giant component in (a), and normalized with respect to the values obtained in the equivalent random graphs in (b). The first thing to observe is that $C$ is unreliable, being based on many small components, until the appearance of the large component. After 1996, it increases slightly and tends to reach an asymptotic value around 0.7. Although this is a striking observation, it is probably too early to be confident that it will not change in the future. In contrast to this, Barabasi et al. [4] found in their study that $C$ decreases with time (see footnote 3).

How could this difference be explained? In qualitative terms, we can recognize a number of rate processes that contribute to clustering in a network. Some of them will tend to increase $C$, while others will have a negative effect on it.

Footnote 3: We used the same definition of $C$ as in Ref. [4], i.e. $C_i = 2N_i/[k_i(k_i - 1)]$ and $C = \langle C_i \rangle_i$.
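The definition in footnote 3 translates directly into code. The following is a minimal sketch, assuming the graph is stored as a dictionary mapping each node to the set of its neighbors and that nodes with fewer than two neighbors are assigned $C_i = 0$; both choices are illustrative conventions, not statements about how the values in this paper were computed.

```python
def local_clustering(adj, i):
    """C_i = 2 N_i / (k_i (k_i - 1)), with N_i the number of edges among i's neighbors."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    # Count each neighbor pair (u, v) once by requiring u < v.
    n_i = sum(1 for u in neighbors for v in neighbors if u < v and v in adj[u])
    return 2.0 * n_i / (k * (k - 1))


def average_clustering(adj):
    """C = <C_i>_i, the mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, i) for i in adj) / len(adj)


if __name__ == "__main__":
    # A triangle (0, 1, 2) with one pendant node 3 attached to node 0.
    adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
    print(average_clustering(adj))
```

In the small example, node 0 has three neighbors with one link among them, so its coefficient is 1/3, while nodes 1 and 2 each sit in a closed triangle.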

Fig. 7. Evolution of the clustering coefficient of the network (a). Evolution of the ratio $C/C(\mathrm{random})$ (b).

Take for example a collaboration between two persons $i$ and $j$ who are already in the network but have never coauthored a paper together. Let us try to have a better understanding of the effect the new link has on the clustering coefficient of the two individuals and on the average clustering of the network. On the one hand, if author $i$ had $m$ mutual acquaintances with author $j$ prior to their collaboration, we can calculate the increase or decrease of its clustering coefficient $\Delta C_i$ due to the new link:

$$
\begin{aligned}
\Delta C_i &= \frac{2(N_i + m)}{(k_i + 1)k_i} - C_i = \frac{2N_i}{(k_i + 1)k_i} + \frac{2m}{(k_i + 1)k_i} - C_i \\
&= \frac{2N_i}{(k_i + 1)k_i}\,\frac{k_i - 1}{k_i - 1} + \frac{2m}{(k_i + 1)k_i} - C_i \\
&= C_i\,\frac{k_i - 1}{k_i + 1} - C_i + \frac{2m}{(k_i + 1)k_i} \\
&= C_i\left(\frac{-2}{k_i + 1}\right) + \frac{2m}{(k_i + 1)k_i},
\end{aligned}
$$

where $k_i$ is $i$'s degree, $C_i$ its clustering coefficient and $N_i$ the number of edges between its neighbors, all prior to its coauthorship with $j$. Thus,

$$\Delta C_i \geq 0 \iff C_i \leq \frac{m}{k_i} \iff m \geq C_i k_i, \qquad (1)$$

meaning that $i$ will witness an increase of its clustering coefficient thanks to a collaboration with $j$ only if at least a fraction $C_i$ of its $k_i$ neighbors are already acquainted with $j$. Otherwise, the clustering coefficient will decrease. For example, an author who has $C \simeq 2/3$ (which is equal to the average clustering coefficient of the GP network, see Table 1) making a new collaboration with a researcher $j$ would need at least $2/3$ of its neighbors already knowing $j$ to increase its clustering coefficient. Therefore, even though there is a higher probability that a new link will connect two authors with already several mutual acquaintances than two that share only a few [3], there is still a good chance that this new link will reduce the clustering coefficient of both collaborators, especially if they have a high degree.
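The closed form for $\Delta C_i$ and the threshold in Eq. (1) can be checked with exact rational arithmetic. The sketch below is a sanity check under the definitions in the text (it requires $k_i \geq 2$); the function names are illustrative.

```python
from fractions import Fraction


def delta_c_direct(n_i, k_i, m):
    """New minus old clustering of node i after it gains neighbor j,
    who already knows m of i's k_i neighbors (requires k_i >= 2)."""
    old = Fraction(2 * n_i, k_i * (k_i - 1))
    new = Fraction(2 * (n_i + m), (k_i + 1) * k_i)
    return new - old


def delta_c_closed(n_i, k_i, m):
    """Closed form from the text: C_i * (-2 / (k_i + 1)) + 2 m / ((k_i + 1) k_i)."""
    c_i = Fraction(2 * n_i, k_i * (k_i - 1))
    return c_i * Fraction(-2, k_i + 1) + Fraction(2 * m, (k_i + 1) * k_i)


if __name__ == "__main__":
    # An author with k_i = 6 neighbors and N_i = 10 links among them has C_i = 2/3,
    # so the break-even number of mutual acquaintances is C_i * k_i = 4.
    for m in (3, 4, 5):
        print(m, delta_c_direct(10, 6, m))
```

With these example values the change is negative for $m = 3$, exactly zero for $m = 4$, and positive for $m = 5$, in line with Eq. (1).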

On the other hand, this same link will increase the clustering coefficient of each of the $m$ mutual neighbors of $i$ and $j$, generating all together a difference of

$$\Delta C_{ij} = \sum_{l=1}^{m} \Delta C_{ij,l} = \sum_{l=1}^{m} \frac{2}{k_{ij,l}(k_{ij,l} - 1)},$$

where $k_{ij,l}$ is the degree of the $l$th mutual neighbor of $i$ and $j$. The total variation of the clustering coefficient of the graph $\Delta C_G$ due to the new collaboration between $i$ and $j$ is given by

$$\Delta C_G = \Delta C_i + \Delta C_j + \Delta C_{ij}.$$
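The decomposition of $\Delta C_G$ can be verified by brute force on a small graph: add the edge, recompute every local coefficient, and compare with the three-term prediction. In the sketch below, $\Delta C_G$ is taken to be the change in the sum of the local coefficients (the change in the average is this value divided by the number of nodes); that reading, the adjacency-set representation, and the example graph are assumptions for illustration.

```python
from fractions import Fraction


def clustering(adj, v):
    """C_v = 2 N_v / (k_v (k_v - 1)); taken as 0 when k_v < 2 (assumed convention)."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return Fraction(0)
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return Fraction(2 * links, k * (k - 1))


def predicted_change(adj, i, j):
    """Delta C_i + Delta C_j + Delta C_ij from the formulas in the text
    (i and j must each already have at least one neighbor)."""
    mutual = adj[i] & adj[j]
    m = len(mutual)
    total = Fraction(0)
    for v in (i, j):
        k = len(adj[v])
        total += clustering(adj, v) * Fraction(-2, k + 1) + Fraction(2 * m, (k + 1) * k)
    total += sum(Fraction(2, len(adj[l]) * (len(adj[l]) - 1)) for l in mutual)
    return total


def measured_change(adj, i, j):
    """Sum of local clustering coefficients after adding edge (i, j), minus before."""
    before = sum(clustering(adj, v) for v in adj)
    new_adj = {v: set(n) for v, n in adj.items()}
    new_adj[i].add(j)
    new_adj[j].add(i)
    after = sum(clustering(new_adj, v) for v in new_adj)
    return after - before


if __name__ == "__main__":
    # Nodes 0 and 1 share the neighbors 2 and 3; node 4 is attached to node 1 only.
    adj = {0: {2, 3}, 1: {2, 3, 4}, 2: {0, 1, 3}, 3: {0, 1, 2}, 4: {1}}
    print(predicted_change(adj, 0, 1), measured_change(adj, 0, 1))
```

In this example both functions return 5/6: node 0 is unchanged, node 1 gains 1/6, and each of the two mutual neighbors gains 1/3.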

Now depending on the values of $m$, $k_i$, $k_j$ and $k_{ij,l}$ with $1 \leq l \leq m$, the variation of the clustering coefficient of the graph can be positive or negative. If $i$ and $j$ have too few mutual acquaintances or if the different degrees are quite big, there could be a decrease of the average clustering coefficient.

Finally, note that new researchers coauthoring a paper with at least two coauthors will join the network with a clustering coefficient of 1, hence necessarily increasing the average clustering coefficient of the network.

Now, a possible explanation of the difference between the slightly increasing network clustering coefficient and the result of Barabasi et al. can be found by studying the proportions of the new types of collaborations. Fig. 8 clearly indicates that among the new collaborations in the GP field, more of them are between a new researcher and an already present one than between two authors already in the network. Consequently, the network average clustering coefficient will probably tend to increase. However, Barabasi et al. found that in the mathematics and neuro-science coauthorship networks, the majority of the new collaborations are among existing researchers, which could indeed explain the decrease of the overall clustering coefficient in these cases.

Eq. (1) suggests that the higher the degree of a node, the harder it is for it to increase or simply maintain its clustering coefficient. Fig. 9 confirms this effect as node degree and the average clustering coefficient of the nodes having this degree appear to be anticorrelated.

3.6. Path length

The GP collaboration network is a small world since the average path length $L$, defined as the average of all possible shortest paths between pairs of nodes, is $O(\log N)$ in the number of vertices $N$ in the giant component (see Table 1). Fig. 10 clearly indicates that the average separation within the giant component increases with time. Values of $L$ before 1996 are not meaningful, as the network was essentially composed of small isolated components, and we do not know whether the values during the last two or three years are near an asymptote. We find different results than the authors of Ref. [4] who observe that $L$ decreases in time for the databases they studied.

Because of the size of the systems, sampling was used in Ref. [4], which could be the source of some error, while we were able to perform an exact calculation of $L$. Moreover, Barabasi et al. wrote in their conclusions that their result is likely to be caused by the incomplete data set they studied.
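For a graph of a few thousand nodes, an exact value of $L$ can be obtained by running a breadth-first search from every node. The sketch below shows one straightforward way to do this for a connected, unweighted graph stored as a dictionary of neighbor sets; it is an illustration of the idea and not necessarily the implementation used for the results reported here.

```python
from collections import deque


def average_path_length(adj):
    """Exact mean shortest-path length over all ordered pairs of distinct nodes.

    `adj` maps each node to the set of its neighbors and is assumed to describe
    a single connected component (for example the giant component).
    """
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs
```

The cost is one search per node, i.e. $O(N(N + E))$ overall, which is modest for a graph with $N = 2809$ and $E = 5853$ but explains why sampling is used for much larger databases.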

In a rather speculative manner, the fact that $L$ increases during the first part of the graph evolution might be due to the relative importance of new links at that time. In Fig. 8, we saw that links between new researchers and authors already in the network are more common than links between two already present authors. The former will in general tend to lengthen the paths, while the latter should have the opposite effect. It is thus possible that $L$ increases at the beginning and becomes more stable at a later time when both processes creating new links equilibrate.

Fig. 8. Histogram of the different types of new collaborations (see text).

Fig. 9. Average clustering coefficient as a function of the degree.

Fig. 10. Evolution of the average path length of the largest component.

4. Testing for preferential attachment

In this section the data are used to test whether the preferential attachment hypothesis (see Section 1) during network growth can be confirmed. As stated in the introduction, preferential attachment has been empirically found in the evolution of several network types, including scientific collaboration networks [3–5,10]. However, there are some differences in the functional form of the effect. In some cases it appears to be quite close to linear, while in other cases it has been found to be sublinear.

Papers in the GP database, as in most other cases, are dated only to the nearest year. This makes the precise time ordering of the collaboration uncertain. However, even pre-print databases where entries are time-stamped with a daily granularity do not necessarily reflect the actual beginning of a collaboration. For example, some researchers may work faster than others or delay publication, or suspend a collaboration temporarily and so on which will somehow influence the actual times. Nevertheless, daily data are obviously preferable to annual ones when available as they would more closely represent the actual order of collaborations.

A new link appearing in the network can be of three types: links connecting two old authors (an old author is an author who was already present in the network prior to the new link), those connecting an old author and a new one, and finally those between two new authors. For the sake of simplicity, let us call the first kind of links "internal" and the second kind "external". To capture a possible preferential attachment mechanism, we concentrate solely on external links. When studying preferential attachment, we do not take into account internal links since we believe several other factors compose the overall probability that a new collaboration between two old authors will take place. Indeed, besides preferential attachment, the number of acquaintances the two old authors have in common also increases the probability that they will collaborate in the future [3]. In Barabasi et al. [4] this attachment is studied as a different, independent process in which attachment probability is proportional to the product of the degrees of the nodes involved. The third category of links is clearly not relevant.
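The three link types can be separated mechanically once the set of authors present before a given year is known. The sketch below is a hypothetical helper, not code from the paper: it assumes `new_edges` contains only author pairs that were not yet linked, and the label "new-new" for the third category is introduced here for convenience.

```python
def classify_new_links(existing_authors, new_edges):
    """Split the edges appearing in one year into the three types in the text.

    `existing_authors` is the set of authors present before the year starts;
    `new_edges` is an iterable of (author, author) pairs first seen that year.
    """
    kinds = {"internal": [], "external": [], "new-new": []}
    for a, b in new_edges:
        # Count how many endpoints were already in the network: 0, 1 or 2.
        n_old = (a in existing_authors) + (b in existing_authors)
        key = ("new-new", "external", "internal")[n_old]
        kinds[key].append((a, b))
    return kinds
```

Only the edges filed under "external" would enter a preferential attachment measurement of the kind described in the text.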

To test for preferential attachment, we use a method proposed by Newman in Ref. [3]. We define the relative probability $R_k$ that a new author entering the network at year $y$ collaborates with an author who has previously coauthored papers with $k$ researchers. The corresponding absolute probability $P_k(y)$ that the newcomer connects to an author with degree $k$ is $P_k(y) = R_k\, n_k(y-1)/N(y-1)$, where $n_k(y-1)$ is the number of authors with degree $k$ at year $y-1$ and $N(y-1)$ is the number of authors in the network at year $y-1$. $R_k$ can be estimated by drawing a histogram of the degrees $k$ of the researchers to which each new author is connected, in which each sample is weighted by a factor of $N(y-1)/n_k(y-1)$. Due to the relatively small number of vertices of the GP network compared with other coauthorship networks studied, we measure $R_k$ over the last three years (2004–2006). Links that appear at year $y$ between two old authors and those between two new ones are added just before proceeding to year $y+1$. Furthermore, to improve statistics we plot $r_k = \int_1^k R_{k'}\,dk'$ instead of $R_k$ in Fig. 11. If there is no preferential attachment, $R_k$ should be equal to 1. Otherwise, it should be an increasing function of $k$. As a matter of fact, if we believe that the degree distribution of the network follows a power law, $R_k$ should even increase linearly with $k$ and hence $r_k \sim k^{\nu+1}$ with $\nu = 1$. From the data plotted in the figure we find $\nu = 0.76 \pm 0.03$, which would imply a sublinear rate. This is in qualitative agreement with the empirical degree distribution (see Section 3.4), which has an exponential decay and cannot be well fitted by a power law. The authors of Ref. [5] found a similar value ($0.79 \pm 0.1$) for the coauthorship network of neuroscience and $0.81 \pm 0.1$ for the actor network, when considering only links between old and new authors. On the other hand, they obtained values close to 1 for the citation network and for the Internet. Newman's results for two other coauthorship networks were about 0.89 for the Los Alamos archive and about 1.04 for a biological collaboration network (Medline). For Wikipedia [10], the result is about 0.9 for incoming as well as outgoing links, since in this case the graph is directed. Thus, the present study indicates a sublinear preferential attachment behavior, which seems to be in line with other evolving scientific collaboration networks.

Fig. 11. Cumulated relative probability of preferential attachment (see text).

As said above, the data have a yearly granularity. Thus, adding all the links at once might be the source of some systematic error which is difficult to assess. In order to get an idea of the magnitude of the error, we have also proceeded in the reverse order, i.e. we have first added all the internal links, followed by the external ones. In this way, we obtain $\nu \simeq 0.88$. Finally, we have also added all the links that appeared in the studied period in random order, which gives $\nu \simeq 0.85$. Thus, we see that the precise order in which links are added to the growing network does not change the main conclusions.
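A minimal Python sketch of the estimation procedure described in this section is given below. It rests on several simplifying assumptions that are not in the original text: a single degree snapshot at the previous year is used (so the within-year ordering of links is ignored), the weighted histogram is divided by the number of external links so that the relative probability equals 1 under uniform attachment, and the integral defining the cumulated quantity is replaced by a running sum. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def relative_attachment(degree_before, external_links):
    """Estimate R_k from one year of external links.

    degree_before: dict author -> degree at year y-1 (old authors only).
    external_links: list of (old_author, new_author) pairs appearing in year y.
    """
    n_k = Counter(degree_before.values())   # n_k(y-1)
    N = len(degree_before)                  # N(y-1)
    hist = Counter()
    for old, _new in external_links:
        k = degree_before[old]
        hist[k] += N / n_k[k]               # sample weight N(y-1)/n_k(y-1)
    M = len(external_links)
    # Assumption of this sketch: dividing by M gives R_k = 1 for uniform attachment.
    return {k: w / M for k, w in hist.items()}

def fit_nu(R):
    """Fit r_k ~ k^(nu+1) by least squares on a log-log scale and return nu.

    Needs at least two distinct degrees k >= 1.
    """
    ks = sorted(k for k in R if k >= 1)
    r, total = [], 0.0
    for k in ks:
        total += R[k]                       # running sum stands in for the integral
        r.append(total)
    xs = [math.log(k) for k in ks]
    ys = [math.log(v) for v in r]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope - 1.0

# Hypothetical toy data: three old authors, four newcomers attaching to them.
degrees = {"a1": 1, "a2": 1, "a3": 2}
links = [("a1", "x1"), ("a3", "x2"), ("a3", "x3"), ("a2", "x4")]
R = relative_attachment(degrees, links)
```

Repeating the fit after processing the links in a different order (external first, internal first, or random) is one way to probe the sensitivity to the yearly granularity, although this snapshot-based sketch would first need to be extended to update degrees as links are added.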

5. An effective network?

The analysis presented above as well as those of Refs. [3–5,10] assume that, once a collaboration between two researchers has been established (a reference to another page in the Wikipedia case), it stays there forever. This seems to be a safe first approximation as scientists generally remain active for quite a long time, at least for the time windows used in Refs. [3–5]. However, when one looks closely at how these collaborations are established and maintained, one sees that the rate at which collaborations cease or are no longer active is not negligible after all, at least as far as the GP database is concerned. First of all, over a long period of time people may retire or change fields, which makes them effectively unavailable for further collaboration; therefore, their links should no longer be taken into account in order to have a more correct representation of the effective network. Moreover, and this is particularly important in the GP community but should also be present in other fields, PhD students coauthor papers with their directors and colleagues during their graduate studies but often leave thereafter to work in industry or elsewhere. In other words, up to now, the view we have of these networks might be called “historical”, in the sense that all the collaborations, however brief or temporary, are recorded forever. This view has its logic and is important in its own way; however, it might produce an inaccurate description of the network at any given time.

In order to cope with the above limitations, we suggest the notion of an “effective” network, which is the network that comprises only the real or possible collaborations. However, it is difficult in practice to estimate which collaborations are active in the coauthorship network. To get an approximate idea, we have used sliding time windows of different lengths. We consider that if two authors are connected by a collaboration in a given time window and do not collaborate again for the duration of the window size, then the collaboration has ceased, even though it could reappear at a later time. Thus, the effective network is the graph that is constructed by accumulating all vertices and edges present during the time window. Fig. 12 shows the ratio of the number of deleted links that reappear after a given time to the total number of deleted links, for different window sizes. It seems that a time window of three or four years is adequate, as fewer than 2% of the deleted links actually reappear.
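The sliding-window construction can be sketched in Python as follows. This is only one possible reading of the procedure: a link is assumed to be present in a given year if the two authors coauthored at least once in the window ending at that year, and a deleted link counts as reappearing when a later joint paper follows a gap longer than the window. The data layout (`collab_years`, mapping an author pair to the years of its joint papers) and the function names are hypothetical.

```python
def effective_edges(collab_years, year, window):
    """Edges with at least one joint paper in the `window` years ending at `year`."""
    lo = year - window + 1
    return {edge for edge, years in collab_years.items()
            if any(lo <= t <= year for t in years)}

def reappearance_ratio(collab_years, window, end_year):
    """Fraction of deleted links that later reappear, for a given window size."""
    deleted = reappeared = 0
    for years in collab_years.values():
        ts = sorted(set(years))
        for earlier, later in zip(ts, ts[1:]):
            if later - earlier > window:      # link dropped out, then came back
                deleted += 1
                reappeared += 1
        if ts[-1] + window <= end_year:       # link dropped out and never came back
            deleted += 1
    return reappeared / deleted if deleted else 0.0

# Hypothetical toy data: years in which each pair of authors published together.
collab = {
    ("A", "B"): [2000, 2001, 2005],
    ("A", "C"): [2000],
    ("B", "D"): [2004, 2006],
}
ratio = reappearance_ratio(collab, window=3, end_year=2006)
active_2006 = effective_edges(collab, year=2006, window=3)
```

Evaluating `reappearance_ratio` for several window sizes would give a curve of the kind reported in Fig. 12, under the assumptions stated above.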

The notion of an effective network has an influence on a number of graph features. For instance, the average degree is lower than that of the historical network, going from 4.17 to about 3.8, depending on the window size, for the whole graph. On the other hand, if one considers only the giant component, the average degree increases from 5.8 to about 6.5, which suggests that the majority of the nodes and links removed concern low-degree nodes, such as PhD students leaving the community. The effect of suppressing authors and collaborations on the five most connected nodes is shown in Fig. 13. Fig. 13(a) refers to the historical network, while the curves in Fig. 13(b) concern the same nodes but for the effective network with a three-year time window. We notice that the degrees are significantly lower in the effective network and that they oscillate instead of continuously growing as in the historical network. In social terms, it seems that maintaining a collaboration has a cost and, as a consequence, only a limited number can be active in a given time frame. This can be seen, for example, for author 1, who keeps acquiring new collaborations (Fig. 13(a)), but does so at the expense of progressively losing older collaborations (Fig. 13(b)). Also, author 3 is practically no longer active toward the end of the period, while he was a highly connected one only a few years earlier.
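The contrast between the two views of an author's degree can be illustrated with a short Python sketch. It assumes the same hypothetical data layout as a mapping from author pairs to publication years; with `window=None` every past collaboration is kept (historical view), otherwise only collaborations inside the sliding window are counted (effective view). The function name and the toy data are illustrative only.

```python
def degree_over_time(collab_years, author, years, window=None):
    """Degree of `author` in each year of `years`.

    window=None keeps every past collaboration (historical network);
    an integer window keeps only collaborations from the last `window` years.
    """
    series = []
    for y in years:
        lo = None if window is None else y - window + 1
        degree = 0
        for (a, b), ts in collab_years.items():
            if author not in (a, b):
                continue
            if any(t <= y and (lo is None or t >= lo) for t in ts):
                degree += 1
        series.append(degree)
    return series

# Hypothetical toy data for a single well-connected author "A".
collab = {
    ("A", "B"): [2000],
    ("A", "C"): [2001, 2002],
    ("A", "D"): [2004],
    ("A", "E"): [2005],
}
years = range(2000, 2006)
historical = degree_over_time(collab, "A", years)
effective = degree_over_time(collab, "A", years, window=2)
```

In this toy example the historical degree can only grow, whereas the effective degree rises and falls as older collaborations leave the window.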

Fig. 12. Ratio of links that reappear over the total number of deleted links as a function of the window size. The experimental points are well fitted by an exponential of the type $0.18 \times e^{-0.68\,w}$, where $w$ denotes the window size (dashed line).

Fig. 13. Evolution of the degree of the top five collaborators. Historical network (a); effective network with a window size of three years (b). Note the different vertical scales in (a) and (b).

With respect to graph growth, the effective network at the end of the period (end of 2006) contains only about half the nodes and edges of the historical one. However, the graph still grows, but at a slower pace, which means that the aging and removal of nodes is a second-order effect with respect to the arrival of new nodes and new links. As for the giant component, it is still there in the effective network and it emerges at the same time, but it comprises a lower fraction of the whole graph (23% instead of 36.5%).
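The quantities compared in this section (fraction of nodes in the largest component, mean degree of the whole graph, and mean degree inside the largest component) can be computed with a plain breadth-first search. The Python sketch below uses only the standard library; the function name, the edge-list layout, and the toy graph are assumptions made for illustration, and the graph is assumed to be non-empty.

```python
from collections import defaultdict, deque

def largest_component_stats(edges, nodes=()):
    """Return (fraction of nodes in the largest component,
    mean degree of the whole graph, mean degree in the largest component)."""
    adj = defaultdict(set)
    for n in nodes:
        adj.setdefault(n, set())            # register isolated authors
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:                        # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    mean_all = sum(len(adj[n]) for n in adj) / len(adj)
    mean_best = sum(len(adj[n]) for n in best) / len(best)
    return len(best) / len(adj), mean_all, mean_best

# Hypothetical toy graph: a triangle, a separate pair, and one isolated author.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
frac, mean_all, mean_best = largest_component_stats(edges, nodes=["F"])
```

Applying the same function to the historical edge list and to the edge list of the effective network would allow the two views to be compared on equal terms.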

Finally, we find a slightly higher value for the clustering coefficient in the effective network (about 0.70 instead of 0.66). We did not study the preferential attachment rate in the effective network, but we hypothesize that preferential attachment is still at work with perhaps a slightly different intensity; this would be an interesting feature to study.

In conclusion, the idea of an effective network does not radically change the historical view, although it does influence some of its features and might offer a view closer to the sociologist's. We believe that the results would be similar for other coauthorship networks. However, for social networks in which the represented relationships have a more transient character, the addition of nodes and links as well as node and link aging processes might all be required ingredients for a faithful model of network growth and dynamics; see, for instance, Refs. [9,25].


6. Conclusions

In this work we have studied the temporal evolution of a collaboration network among researchers in the genetic programming field. The distinguishing feature with respect to other similar investigations is that, thanks to the availability of data, we were able to study the network growth from the very beginning and over a 20-year period, which is particularly helpful for observing the formation of the giant component, a process that appears to be still incomplete in the present case. Moreover, the largest component does not form immediately but rather after the mean degree $\langle k \rangle$ reaches a value around 3, which is of course an idiosyncratic feature of this particular network and cannot be generalized. We have studied the variation with time of several graph quantities and, as in Ref. [4], the results show that they are all time dependent, although there seems to be a tendency toward asymptotic values in the last few years. This is the case for the clustering coefficient, the average path length, and the degree distribution. However, in contrast with the results reported in Ref. [4], we find that the clustering coefficient slightly increases and that the average path length also tends to increase as time and network size increase. Some possible explanations for these differences were suggested.

One of the main goals of the study was to test the preferential attachment hypothesis on another collaboration network. From this point of view, the empirical results tend to confirm previous observations [3–5], i.e. that preferential attachment holds and that the attachment rate appears to be slightly sublinear in the case of coauthorship networks.

Finally, we have argued for an effective view of a social network, in which nodes and links may also disappear as a consequence of aging and other obsolescence processes, instead of being always present. This view could be even more significant for other social networked processes where acquaintances have shorter lifetimes. These observations could be helpful for the design of better models of social network growth and dynamics.

Acknowledgments

We would like to thank W. B. Langdon, who kindly made the data used in this study available to us. We also thank an anonymous reviewer for his useful comments on the manuscript.

References

[1] R. Albert, A.-L. Barabasi, Statistical mechanics of complex networks, Rev. Mod. Phys. 74 (2002) 47–97.

[2] M.E.J. Newman, The structure and function of complex networks, SIAM Rev. 45 (2003) 167–256.

[3] M.E.J. Newman, Clustering and preferential attachment in growing networks, Phys. Rev. E 64 (2001) 025102.

[4] A.-L. Barabasi, H. Jeong, Z. Neda, E. Ravasz, A. Schubert, T. Vicsek, Evolution of the social network of scientific collaborations, Physica A 311 (2002) 590–614.

[5] H. Jeong, Z. Neda, A.-L. Barabasi, Measuring preferential attachment in evolving networks, Europhysics Lett. 61 (2003) 567–572.

[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener, Graph structure in the web, Comput. Networks 33 (2000) 309–320.

[7] A.-L. Barabasi, R. Albert, H. Jeong, Scale-free characteristics of random networks: the topology of the World Wide Web, Physica A 281 (2000) 69–77.

[8] C.P. Massen, J.P.K. Doye, A self-consistent approach to measure preferential attachment in networks and its application to an inherent structure network, Physica A 377 (2007) 351–362.

[9] G. Kossinets, D.J. Watts, Empirical analysis of an evolving social network, Science 311 (2006) 88–90.

[10] A. Capocci, V.D.P. Servedio, F. Colaiori, L.S. Buriol, D. Donato, S. Leonardi, G. Caldarelli, Preferential attachment in the growth of social networks: the Internet encyclopedia Wikipedia, Phys. Rev. E 74 (2006) 036116.

[11] A.-L. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509–512.

[12] P.L. Krapivsky, S. Redner, F. Leyvraz, Connectivity of growing random networks, Phys. Rev. Lett. 85 (2000) 4629–4632.

[13] J.W. Grossman, The evolution of the mathematical research collaboration graph, Congress. Numer. 158 (2002) 201–212.

[14] M.E.J. Newman, Scientific collaboration networks. I. Network construction and fundamental results, Phys. Rev. E 64 (2001) 016131.

[15] M.E.J. Newman, Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality, Phys. Rev. E 64 (2001) 016132.

[16] L. Luthi, M. Tomassini, M. Giacobini, W.B. Langdon, The genetic programming collaboration network and its communities, in: D. Thierens, et al. (Eds.), Proceedings of the Genetic and Evolutionary Computation Conference GECCO'07, ACM Press, 2007, pp. 1643–1650.


[17] L.A.N. Amaral, A. Scala, M. Barthelemy, H.E. Stanley, Classes of small-world networks, Proc. Natl. Acad. Sci. USA 97 (21) (2000) 11149–11152.

[18] B. Bollobas, Random Graphs, Cambridge University Press, Cambridge, UK, 2001.

[19] M. Molloy, B. Reed, A critical point for random graphs with a given degree sequence, Random Struct. Algorithms 6 (1995) 161–179.

[20] W. Aiello, F. Chung, L. Lu, A random graph model for massive graphs, in: Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, Association for Computing Machinery, New York, 2000, pp. 171–180.

[21] M.E.J. Newman, S.H. Strogatz, D.J. Watts, Random graphs with arbitrary degree distributions and their applications, Phys. Rev. E 64 (2001) 026118.

[22] M.C. Gonzalez, P.G. Lind, H.J. Herrmann, System of mobile agents to model social networks, Phys. Rev. Lett. 96 (2006) 088702.

[23] R. Toivonen, J.P. Onnela, J. Saramaki, J. Hyvonen, K. Kaski, A model for social networks, Physica A 371 (2006) 851–860.

[24] M.C. Gonzalez, P.G. Lind, H.J. Herrmann, Model of mobile agents for sexual interactions networks, Eur. Phys. J. B 49 (2006) 371–376.

[25] H. Ebel, J. Davidsen, S. Bornholdt, Dynamics of social networks, Complexity 8 (2003) 24–27.