Email networks and the spread of computer viruses

RAPID COMMUNICATIONS

PHYSICAL REVIEW E 66, 035101~R! ~2002!

Email networks and the spread of computer viruses

M. E. J. Newman,1 Stephanie Forrest,1,2 and Justin Balthrop21Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501

2Department of Computer Science, University of New Mexico, Albuquerque, New Mexico 87131-1386~Received 2 July 2002; published 10 September 2002!

Many computer viruses spread via electronic mail, making use of computer users’ email address books as asource for email addresses of new victims. These address books form a directed social network of connectionsbetween individuals over which the virus spreads. Here we investigate empirically the structure of this networkusing data drawn from a large computer installation, and discuss the implications of this structure for theunderstanding and prevention of computer virus epidemics.

DOI: 10.1103/PhysRevE.66.035101 PACS number~s!: 89.75.Hc, 89.20.Hh, 87.23.Ge, 87.19.Xx

ndc

f

otuec

-aic

lfcieshe.e-rudr

ofar

odta

foeroncaeea8

thncies

rea

ionssota

terirusllythehisn-v-

fall

is

thatis.

mailis-

areingn a

ofuss

t-

The structure of various networks, including social acomputer networks, has been a subject of considerable reinterest in the physics literature@1,2#. The spread of infectionis an area of special interest@3–6#, including the spread ohuman diseases and also computer viruses@7,8#, which arethe topic of this paper. We present an empirical analysisthe networks over which computer viruses spread and ssome possible control strategies for preventing virus inftions.

Currently, the primary vehicle for transmission of computer viruses is electronic mail. Viruses typically arrive oncomputer as an attachment to an email message, whwhen activated by the user, sends further copies of itseother recipients. The email addresses of these other reents are usually obtained by examining an email ‘‘addrbook,’’ a file in which the user for convenience stores temail addresses of his or her regular correspondentspointed out by Lloyd and May@5#, these address books crate a network of computer users over which the vispreads. One can visualize this network as a set of norepresenting computer users, with a link running from useAto userB if B’s email address appears inA’s address book.This network is entirely distinct from the physical networkoptical fibers and other connections over which datatransferred between computers@21#. The network over whichan email virus spreads is a social network of personal cnections between computer users. If we are to understanmechanisms by which viruses spread, we need to undersfirst the structure of this social network.

We have analyzed address book data in 20 commonmats, gathered from a large university computer system sing 27 841 users, and thereby reconstructed the corresping network of computer users. Because email virusesonly be transmitted if computer users actually read themail, all data were discarded for users who had not rtheir email in the previous 90 days, leaving a total of 16 8in the network.

The network necessarily omits any connections fromoutside world to users inside the network, since there isway to find out about such connections other than by colleing data from external users. A similar issue arises in studof the structure of the Worldwide web, in which hyperlinkto a website from other sites cannot easily be discoveConnections to users from outside the observed network

1063-651X/2002/66~3!/035101~4!/$20.00 66 0351

ent

fdy-

h,topi-s

As

ses

e

n-thend

r-v-d-n

ird

1

eot-s

d.re

important because it is presumably along these connectthat viral infection initially arrives. Thus our data can tell uabout the spread of viruses within a community, but nabout how those viruses arrive in the first place. Frompractical standpoint, however, there is little that compusystem administrators can do to control the spread of a vin the world at large. Consequently, their efforts are usuafocused on minimizing damage once the infection enterscomputer system for which they have responsibility. For treason, we have also eliminated from our network all conectionsto users outside the network, which are many, leaing a network composed only of those connections thatwithin the set of users studied.

An important property of our email network is that itdirected. That is, each edge~i.e., line! joining two vertices inthe network has a direction. Just becauseB’s email addressappears inA’s address book does not necessarily meanthe reverse is also true, although, as we will see, it oftenThe directed nature of the network makes the spread of eviruses qualitatively different from the spread of human deases, for which most types of disease-causing contactsundirected. As we will see, there are a variety of interestphenomena that are peculiar to the spread of infection odirected network.

Table I provides a summary of the statistical propertiesour email network. In the remainder of this paper we disc

TABLE I. Summary of statistical properties of the email nework.

Number of vertices 16881Number with address books 4581Number with nonzero in- or out-degrees 10110Mean number of entries per address book 12.45Mean degreez ~either in or out! 3.38Correlation coeff. of in- and out-degrees 0.529

Clustering coefficient 0.168Expected clustering on random graph 0.017

Total number of edges 57029Number of edges that point both ways 13176Fraction pointing both ways~reciprocity! 0.231Expected reciprocity on random graph 0.00095

©2002 The American Physical Society01-1

us

aes

athanru

iess

sethtus

Ineee

th

eiaed

ser

tad

-

entellesentit isrgeanyin-

r, itavehe

rily

velg a

ngeur

s ofr in

ede. In., if

gen-on

is

ob-alueeryis

ourusepliesin

andet-ersd

erotant

ksdi-r a

d a

hene


NEWMAN, FORREST, AND BALTHROP PHYSICAL REVIEW E66, 035101~R! ~2002!

in detail the network structure and its implications for virspread.

The first thing we notice about our network is that quitesmall fraction of the 16 000 vertices actually have addrbooks—around a quarter@22#. However, a majority of thevertices in the network are nonetheless connected to oneother, by edges leading either in or out of the vertex, or boAbout 10 000 vertices, or 59%, are connected to otherstherefore are at risk of either receiving or passing on viinfections.

The mean degreez of a vertex is 3.38.~Recall that thedegree of a vertex is the number of edges to which itconnected.! In a directed network such as this one, vertichave both an in-degree and an out-degree. The meanthese numbers are the same, since every edge that beginvertex must end at some other vertex. Thusz is both themean in- and out-degree. As a rough rule of thumb, viruspread when the mean out-degree of a vertex is greater1, since in this regime each infection received by a compuis on average passed on to more than one other compThus it appears that our network of computer users is eadense enough to spread infection.

Also of interest is the distribution of vertex degrees.Fig. 1 we show cumulative histograms of in- and out-degrfor our network. Both distributions are markedly faster dcaying than the power-law degree distributions seen in otechnological networks such as the Internet@9# and theWorldwide web @10,11#. In fact, as the figure shows, thcumulative distributions are well fit by a simple exponentfor the in-degree and a stretched exponential with expon12 for the out-degree. These correspond to noncumulativetributions pj;exp(2j/j0) for the in-degree and pk

;(1/Ak)exp(2Ak/k0) for the out-degree withj 058.57(9)and k054.18(3). ~Free fits to stretched exponential formgive values of 1.034 and 0.493 for the two exponents, vclose to the values of 1 and12 assumed here.! Interestingly,both these degree distributions are known to occur in cermodels of growing networks—the pure exponential in moels with random edge assignment@12# and the stretched ex

FIG. 1. In- and out-degree distributions for our network. Tsolid lines represent fits to the exponential and stretched expotial forms discussed in the text.

03510

s

n-.d

s

ssofat a

san

erter.ily

s-er

lntis-

y

in-

ponential in models with sublinear preferential attachm@13#. Thus the observed distributions would probably be wfit by a growth model in which the source of added edgwas chosen according to a sublinear preferential attachmand the destination at random. This seems reasonable:natural to suppose that individuals who already have laaddress books would be more likely to add to them thindividuals who do not, but it is not clear if there is anmechanism that would favor making new connections todividuals with a high in-degree.

Regardless of the precise degree distribution, howeveis clear that there are a few vertices in the network that ha very high degree. This has important implications for tspread of infection in the network@4,14,15#, a point whichwe discuss further below.

The in- and out-degrees of a vertex are not necessaindependent, but may be correlated~or anticorrelated!, andone should therefore really consider a joint distributionpjkof in-degreej and out-degreek @16#. Although this quantity isdifficult to represent visually, one can get an idea of the leof correlation between in- and out-degrees by calculatincorrelation coefficient for the two, given byr 5(( jk jkpjk2z2)/(s insout), where s in and sout are the correspondingstandard deviations. This quantity takes values in the ra21<r<1, depending on the level of correlation. For onetwork, we find its value to ber 50.53, indicating that thetwo degrees are strongly correlated—the email addresseindividuals who have large address books tend to appeathe address books of many others.

Another important statistical property peculiar to directnetworks is the ‘‘reciprocity’’@17#. Reciprocity measures thfraction of edges between vertices that point both waysthe network studied here, the reciprocity is about 0.23, i.ethere is an edge pointing from vertexA to vertex B, thenthere is a 23% probability that there will also be an edfrom B to A. We can also calculate the reciprocity on a radom network, and in terms of the joint degree distributipjk defined above, we find that the expected value(nz)21( jk jkpjk , which gives 9.4931024 for the presentnetwork, several orders of magnitude smaller than theserved value. This strongly suggests that the observed vis not the result of a pure chance association of vertices. Vlikely we are observing social phenomena at work—therea heightened chance that you will have a person in yaddress book if they have you in theirs, presumably becathe presence of a person’s address in an address book imsome kind of social connection between the two peoplequestion, which in many cases goes both ways.

Bidirectional edges can be thought of as undirected,the email network can be thought of as a ‘‘semidirected nwork,’’ a graph in which some edges are directed and othare undirected.~Technically one might define a semidirectenetwork as one in which the reciprocity does not tend to zasn becomes large, but instead tends to a nonzero consvalue.! It seems likely that many other real-world networthat are formally directed networks are in fact really semirected. For example, we have calculated the reciprocity fo269 504-vertex subset of the Worldwide web@10#, which is adirected network of web pages and hyperlinks, and foun

n-

1-2

n

meo

uu

, a

rehastenl-

rx

cdgpr

eere

anthnicsa

reanm

ia

s atns

aseofer-

e

sesec-la-forith

atti-

ntpi-husf ourof

ansut-

heirsen-sisw

et-rk.

eryn in

as

estede-

ocesran-


EMAIL NETWORKS AND THE SPREAD OF COMPUTER VIRUSES PHYSICAL REVIEW E66, 035101~R! ~2002!

value of 0.57, where the expected value on the correspoing random graph would be 1.231024, indicating that theWeb is probably also a semidirected graph.

We turn now to the specific issue of the spread of coputer virus infections over email networks. A virtue of thapproach taken here is that, since we have the entire netwavailable, we can study infection dynamics directly withorelying on approximate techniques such as differential eqtion models, statistical deduction, or computer simulationin most studies of human diseases. Here we make the mpessimistic assumption about email viruses, that they spwith essentially 100% efficiency. That is, we assume tthey ruthlessly send copies of themselves to everyone liin an address book, and that no recipients are immunviruses because of antivirus software or other precautio~The real-world situation is unlikely to be this bad; our caculations give a worst-case scenario.!

Consider then an email network of the type studied heSince the network is directed, there does not necessarily ea path that could carry a virus from vertexA to vertexB,even ifA andB are connected by edges in the network, sinthe virus can, in general, only pass one way along each eThe large-scale structure of a directed network can be resented by the ‘‘bowtie diagram’’ of Broderet al. @11# de-picted in Fig. 2. A strongly connected component of the nwork is defined to be any subset of vertices in which evvertex can be reached from every other. Typically the nwork has one giant strongly connected component~GSCC!which contains a significant fraction of the entire network,well as a number of smaller strongly connected componeThe GSCC is represented by the circular middle part ofbowtie in the figure. Then there is a giant in-componewhich comprises the GSCC plus those vertices from whthe GSCC can be reached but which cannot themselvereached from the GSCC. We can think of the latter setbeing the vertices ‘‘upstream’’ of the GSCC. They are repsented by the left part of the bowtie. There is also a giout-component consisting of the GSCC plus ‘‘downstreavertices~the right part of the bowtie!. In addition, there maybe small groups of vertices that are connected to the gcomponents but are not part of them~sometimes called ‘‘ten-

FIG. 2. The structure and relative sizes of the components ofemail network.

03510

d-

-

rkta-s

ostadt

edtos.

e.ist

ee.e-

t-yt-

sts.et,hbes-t

’’

nt

drils’’ ! or that are not connected to the giant componentall. For our email network, the sizes of these various portioare given in Fig. 2. As we can see, the bowtie is in this cquite asymmetric, with many more vertices downstreamthe GSCC than upstream of it. Most of the downstream vtices are vertices that have a zero out-degree themselves~i.e.,no address book! but which are pointed to by members of thGSCC.

We can apply these insights to the spread of email viruas follows. We concentrate on the giant components; inftions in the small components will not spread to the popution at large—it is the giant component that is responsiblelarge-scale virus epidemics. A virus outbreak that starts wa single vertex will become an epidemic if and only if thvertex falls in the giant in-component. The number of verces infected in such an epidemic~making the pessimisticassumptions above! is equal at least to the size of the giaout-component. It may be slightly larger than this if the edemic starts in the region upstream of the GSCC and taffects some vertices there also. For the particular case onetwork, we find that epidemics have a minimum size9108 vertices and a maximum size of 9132, which methat about 54% of the network is at risk from epidemic obreaks.

So how can we prevent these epidemics or reduce tsize? Current virus prevention strategies correspond estially to random ‘‘vaccination’’ of computers using antivirusoftware@23#. Our network data however suggest that thisan ineffective way of combating infection. In Fig. 3 we sho~dotted line! the maximum possible outbreak size in our nwork as vertices are removed at random from the netwoAs the figure shows, the outbreak size drops only vslowly as vertices are removed, a result similar to that seeother networks@14,18,19#.

On the other hand, previous work on other networks hshown that often a very effective strategy istargetedremovalof vertices, i.e., identification and removal of the verticmost responsible for the spread of infection. For undirecnetworks, simply removing the vertices with the highest dgree often works well@11,18,19#. A similar but slightly more

urFIG. 3. The maximum outbreak size on our network, as verti

in the giant in-component are progressively removed either atdom ~dotted line! or in decreasing order of their out-degree~solidline!.

1-3

e.s

ing

as0

ndaep

utWanannuuin

acghgup

-ieldili-

ell-toetheFor

rowec-

therorof

riseourmese-

the

forhisch,


NEWMAN, FORREST, AND BALTHROP PHYSICAL REVIEW E66, 035101~R! ~2002!

sophisticated strategy looks promising in the present casFig. 3 ~solid line!, we show the result of removing verticefrom the giant in-component of our network in decreasorder of the out-degree~i.e., of address book size!. As thefigure shows, the maximum size of the epidemic in this cdeclines sharply as vertices are removed, until about the 1mark, beyond which the epidemic is negligibly small afurther removal achieves little. This suggests that if we cprotect a suitably selected 10% of the vertices in the nwork, almost all vertices would become immune to an edemic.

In this paper, we have analyzed data on the structurethe network formed by the email address books of compusers; it is over this network that email viruses spread.have simulated the effect on virus propagation of both rdom and targeted vaccination of vertices and find that rdom vaccination, which is roughly equivalent to current ativirus precautions, is expected to have little effect on virspread. Targeted vaccination, on the other hand, looks mmore promising. This suggests that we should be developvirus control strategies that take network structure intocount. Similar concepts could also be used to identify hirisk vertices in the network and determine priority orderinfor security upgrades. Because it is often infeasible tograde all hosts in a network simultaneously~especially if the

om

go

-

e

03510

In

e%

nt-i-

ofere--

-schg--

s-

upgrades require hardware modifications!, and because upgrades are routine and continual, such a strategy could ya substantial benefit in terms of reduced network vulnerabties. For environments that use centralized and wprotected address books~e.g., to store addresses of interestan entire community!, the kind of analysis performed hercould potentially be useful in analyzing and managingtrade-offs between local and centralized address books.example, how large can a locally stored address book gbefore it becomes worthwhile to accord it the same prottions and restrictions as centralized databases?

The ideas considered here may also be applicable to osocial networks that are exploitable by computer virusesworms. Email networks are the most obvious examplesuch a network today, but other electronic services giveto social networks as well. The techniques employed inanalysis of email networks could readily be applied in soof these new settings, and more speculatively, might be uful as a guide for engineering new network services infuture.

The authors thank Jeff Gassaway and George Kelbleyproviding the data used for the analyses in this paper. Twork was supported in part by the Office of Naval Researthe NSF, DARPA, and by the Intel Corporation.

.

. E

tts,

t

e-singusly

ionus

deme

on

@1# R. Albert and A.-L. Baraba´si, Rev. Mod. Phys.74, 47 ~2002!.@2# S.N. Dorogovtsev and J.F.F. Mendes, Adv. Phys.51, 1079

~2002!.@3# C. Moore and M.E.J. Newman, Phys. Rev. E61, 5678~2000!.@4# R. Pastor-Satorras and A. Vespignani, Phys. Rev. Lett.86,

3200 ~2001!.@5# A.L. Lloyd and R.M. May, Science292, 1316~2001!.@6# D.J. Watts, Proc. Natl. Acad. Sci. U.S.A.99, 5766~2002!.@7# W.H. Murray, Computers and Security7, 139 ~1988!.@8# J.O. Kephart, S.R. White, and D.M. Chess, IEEE Spectrum30,

20 ~1993!.@9# M. Faloutsos, P. Faloutsos, and C. Faloutsos, Comput. C

mun. Rev.29, 251 ~1999!.@10# R. Albert, H. Jeong, and A.-L. Baraba´si, Nature~London! 401,

130 ~1999!.@11# A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Raja

palan, R. Stata, A. Tomkins, and J. Wiener, Comput. Netw.33,309 ~2000!.

@12# D.S. Callaway, J.E. Hopcroft, J.M. Kleinberg, M.E.J. Newman, and S.H. Strogatz, Phys. Rev. E64, 041902~2001!.

@13# P.L. Krapivsky, S. Redner, and F. Leyvraz, Phys. Rev. Lett.85,4629 ~2000!.

@14# R. Cohen, K. Erez, D. ben-Avraham, and S. Havlin, Phys. RLett. 85, 4626~2000!.

-

-

v.

@15# F. Liljeros, C.R. Edling, L.A.N. Amaral, H.E. Stanley, and YAberg, Nature~London! 411, 907 ~2001!.

@16# M.E.J. Newman, S.H. Strogatz, and D.J. Watts, Phys. Rev64, 026118~2001!.

@17# S. Wasserman and K. Faust,Social Network Analysis~Cam-bridge University Press, Cambridge, 1994!.

@18# R. Albert, H. Jeong, and A.-L. Baraba´si, Nature~London! 406,378 ~2000!.

@19# D.S. Callaway, M.E.J. Newman, S.H. Strogatz, and D.J. WaPhys. Rev. Lett.85, 5468~2000!.

@20# H. Ebel, L.-I. Mielsch, and S. Bornholdt, e-princond-mat/0201476.

@21# It is important also to distinguish between the network dscribed here and networks of actual email messages pasbetween computer users, which have been studied previoby, for example, Ebelet al. @20#. While the latter network iscertainly of use in understanding patterns of communicatbetween computer users, it is only indirectly relevant to virspread, since most email messages do not carry viruses.

@22# This is a lower bound. Although data were collected for a wivariety of address book formats, there are probably still sothat were missed.

@23# One should bear in mind that antivirus software works onlyknown viruses, not new or unknown strains.

1-4

Documents

Email networks and the spread of computer viruses