The one-hop neighborhood of Paul Erdös - UMD Computer … · 2013-11-12 · The one-hop neighborhood of Paul Erdös Souvik Bhattacherjee ([email protected]) Introduction Paul Erdös

The one-hop neighborhood of Paul Erdös

Souvik Bhattacherjee ([email protected])

Introduction

Paul Erdös (26 March 1913 – 20 September 1996) was a prolific Hungarian mathematician of the 20th

century, who spent a significant portion of his life living out of a suitcase and writing papers with those of his

colleagues willing to give him room and board. He published more papers than any other mathematician

in history.

The idea of the Erdös number was created by his fellow mathematicians as a humorous tribute to his

enormous output as one of the most prolific modern writers of mathematical papers [1]. An Erdös

number of 1 is assigned to anyone who has published at least one mathematical paper with the

celebrated mathematician. Similarly, joint publications with someone with an Erdös number of 1 yield

an Erdös number of 2. Erdös himself has the number 0. The Erdös number has gained prominence in

scientific circles as one of the important metrics for judging the mathematical prowess of a

mathematician.
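
The definition above amounts to a shortest-path distance from Erdös in the coauthorship graph, so it can be computed with a breadth-first search. The following is a minimal sketch under that reading; the function name `erdos_numbers` and the toy author labels are hypothetical and do not come from the datasets used in this project.

```python
from collections import deque

def erdos_numbers(coauthors, source="Erdos"):
    """Return the hop distance of every reachable author from `source`.

    `coauthors` maps an author to the set of people they have published with.
    """
    numbers = {source: 0}
    queue = deque([source])
    while queue:
        author = queue.popleft()
        for other in coauthors.get(author, ()):
            if other not in numbers:          # first visit gives the shortest distance
                numbers[other] = numbers[author] + 1
                queue.append(other)
    return numbers

# Hypothetical toy graph: A and B wrote with Erdos, C with A, D with C, E with nobody.
toy = {
    "Erdos": {"A", "B"},
    "A": {"Erdos", "C"},
    "B": {"Erdos"},
    "C": {"A", "D"},
    "D": {"C"},
    "E": set(),
}
numbers = erdos_numbers(toy)
```

Authors who cannot be reached from Erdös (such as `E` in the toy graph) simply do not appear in the result, which corresponds to having no finite Erdös number.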

In this project, we try to understand the collaboration network of authors having an Erdös number of 1

using NodeXL, a popular visualization tool for network analysis.

Dataset & Preprocessing

We obtained the data for this project from the Erdös Number Project [2]. Two datasets were used,

which are described below:

1. Erdos0 - This dataset lists all authors who have written a joint paper with Paul Erdös (i.e., who

have Erdös number 1). It is in alphabetical order and shows the date of first collaboration, as

well as the number of papers that each person has written with Erdös. There are currently 511

names on this list [3].

2. Erdos1graph - This dataset contains the adjacency lists for the induced subgraph of the collaboration

graph on all Erdös coauthors, as of 2007. In other words, its vertices are people with Erdös

number 1, and are joined by an edge if they have published a joint paper (with or without other

collaborators). Paul Erdös himself and people with Erdös number 2 are not included. In

addition, it contains the number of Erdös number 2 authors that each author in this list has

collaborated with [4].

We had to preprocess the Erdos0 list to insert a 1 wherever an entry did not state the

number of publications between that author and Erdös. For Erdos1graph, the

adjacency lists had to be converted to an undirected graph with no duplicate edges (taking the lower

triangular part of the adjacency matrix). Also, the author names and related information in each row had to be parsed separately and

joined with the author list from the Erdos0 dataset.
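
The two preprocessing steps can be sketched as follows. This is an illustration only: it assumes the files have already been parsed into Python structures (the actual file layouts are not reproduced here), and the names `undirected_edges`, `fill_missing_counts` and the toy records are hypothetical. Keeping each pair in one canonical order plays the same role as keeping one triangle of the adjacency matrix.

```python
def undirected_edges(adjacency):
    """Collapse adjacency lists into a sorted list of unique undirected edges."""
    edges = set()
    for author, neighbors in adjacency.items():
        for other in neighbors:
            if author != other:
                # Store each pair in one canonical order so (u, v) and (v, u) coincide.
                edges.add((min(author, other), max(author, other)))
    return sorted(edges)

def fill_missing_counts(rows, default=1):
    """Replace a missing joint-paper count (None) with `default`."""
    return [(name, year, default if count is None else count)
            for name, year, count in rows]

# Hypothetical parsed input: every edge appears once in each endpoint's list.
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
edges = undirected_edges(adjacency)

# Hypothetical Erdos0-style rows: (name, year of first joint paper, number of papers).
rows = fill_missing_counts([("A", 1950, 3), ("B", 1962, None)])
```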

Results and Analysis

We analyze the edge list of the Erdös #1 coauthors. There are 511 vertices excluding Erdös and 3208

edges in total. In some analyses we also include Erdös, where we felt it was necessary to enhance the

visualization.

Headline 1: There are 2 types of authors that Erdös wrote papers with: those

who also wrote papers among themselves and those who did not

Figure 1: Graph of Erdös #1 coauthors (grouped by connected components)

We group the coauthors having Erdös #1 by connected components; the results can be seen in

Figure 1. There are 42 connected components, of which the largest has 466 authors,

or 91.19% of the total number of authors in this network. The remaining authors, as can be seen

from Figure 1, are either isolated or form a component of size at most 2. The diameter of the largest

connected component is 10. The presence of quite a number of isolated authors prompts us to

explore the properties of those authors further (presented later).
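
For readers who want to reproduce this grouping outside NodeXL, connected components can be found with a breadth-first search over the edge list. The sketch below uses a hypothetical toy graph and a hypothetical helper name, `connected_components`; the last line only rechecks the percentage quoted above from the reported counts (466 of 511).

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the components of an undirected graph, largest first."""
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical toy graph: one component of size 3, one pair, one isolated node.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(nodes, edges)
sizes = [len(c) for c in components]
largest_share = 100 * sizes[0] / len(nodes)

# Share of the largest component using the counts reported in the text.
reported_share = round(100 * 466 / 511, 2)
```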

We analyze the graph further by grouping the vertices into clique motifs of size 4 or more. It is easy to see

that the cliques would all be formed within the large connected component, as all the other components

have a size less than 4. We found that the largest clique is a 7-clique. Apart from this, the component

contains one 6-clique, four 5-cliques and nineteen 4-cliques.

Figure 2: Graph of Erdös #1 coauthors (grouped by clique motif of size 4 or more)
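
NodeXL performs the clique grouping internally, and its exact procedure is not described here. As a rough stand-in, maximal cliques can be enumerated with the basic Bron-Kerbosch recursion and then filtered by size. The sketch below is written under that assumption, with a hypothetical toy graph; it is not claimed to reproduce NodeXL's motif counts.

```python
def maximal_cliques(neighbors):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(clique))    # nothing can extend this clique
            return
        for node in list(candidates):
            expand(clique | {node},
                   candidates & neighbors[node],
                   excluded & neighbors[node])
            candidates = candidates - {node}  # node is done as a candidate
            excluded = excluded | {node}      # avoid reporting the same clique twice

    expand(set(), set(neighbors), set())
    return cliques

# Hypothetical toy graph: a 4-clique {a, b, c, d}, a pendant edge d-e, an isolated z.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e")]
neighbors = {n: set() for n in "abcdez"}
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

cliques = maximal_cliques(neighbors)
large = [c for c in cliques if len(c) >= 4]   # keep cliques of size 4 or more
```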

Headline 2: a) Frank Harary* and Noga Alon coauthored actively with both

Erdös #1 and Erdös #2 authors whereas Peter Salamon coauthored only with

Erdös #2 authors

We lay out the graph of Erdös #1 coauthors with the X-axis representing the # of Erdös #1 coauthors and

the Y-axis representing the # of Erdös #2 coauthors in Figure 3. The authors at the extreme right of the

X-axis (circled) are also the ones positioned highest along the Y-axis. They are Frank

Harary* (44 Erdös #1 coauthors and 271 Erdös #2 coauthors) and Noga Alon (51 Erdös #1 coauthors and

228 Erdös #2 coauthors). We also notice Peter Salamon (circled) at the extreme left of the X-axis, who

stands out among the rest of the authors in the same region with 113 Erdös #2 coauthors.

b) Lee Albert Rubel* plays a central role in this network with a comparatively

lower number of Erdös #1 coauthors

In the same graph, we size the vertices by their degree and color them by betweenness

centrality. A blue node (circled in red) near the middle of the X-axis catches our attention. To observe

the node in detail, we use dynamic filtering to retain the 10 nodes with the highest

betweenness centrality (Figure 4). This node is particularly interesting because it combines a

comparatively low degree with a significant betweenness centrality value. Upon careful scrutiny,

we found that this author has the lowest degree among the top-10 authors but ranks 4th in

betweenness centrality. He coauthored with 3 of the most prolific Erdös #1 authors: Ernst Gabor

Straus (rank 3), Carl Bernard Pomerance (rank 5) and Zoltan Furedi (rank 8), who in turn collaborated

with the top coauthors in this graph. Thus, even with a comparatively low degree of 12, this node plays a

central role in the coauthor network; the next-lowest degree among the top-10 authors is 26.

Figure 3: Graph of Erdös #1 coauthors (X-axis: # of Erdös #1 coauthors, Y-axis: # of Erdös #2

coauthors)

Figure 4: Graph of Erdös #1 coauthors (Top-10 ordered by betweenness centrality)
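
The effect described above, a vertex with low degree but high betweenness centrality, can be reproduced on a small example. The sketch below computes unnormalized betweenness for an unweighted, undirected graph using Brandes' algorithm; NodeXL may scale or normalize its values differently, and the toy graph with its labels is hypothetical.

```python
from collections import deque

def betweenness(neighbors):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = {v: 0.0 for v in neighbors}
    for s in neighbors:
        dist = {s: 0}
        paths = {v: 0 for v in neighbors}     # number of shortest paths from s
        paths[s] = 1
        preds = {v: [] for v in neighbors}    # predecessors on shortest paths
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in neighbors[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in neighbors}
        for w in reversed(order):             # accumulate from the farthest nodes back
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != s:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical toy graph: two hubs with three leaves each, joined only through "b".
edges = [("h1", "l1"), ("h1", "l2"), ("h1", "l3"),
         ("h2", "m1"), ("h2", "m2"), ("h2", "m3"),
         ("h1", "b"), ("b", "h2")]
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

centrality = betweenness(neighbors)
degree = {v: len(adj) for v, adj in neighbors.items()}
```

In this toy graph the bridge vertex `b` has degree 2, half that of either hub, yet every shortest path between the two sides passes through it, so its betweenness is close to that of the hubs.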

Headline 3: Most of the Erdös #1 authors who did not collaborate with any

Erdös #1 author also collaborated less with Erdös #2 authors

Figure 5: Graph of Erdös and the Erdös #1 coauthors who did not coauthor any paper with another Erdös #1

author

We construct the graph of isolated authors in the Erdös #1 collaboration graph, keeping Erdös in this

case so that the graph has edges (Figure 5). The vertices (representing authors) are labeled by their

names and the edges are labeled by the year in which the corresponding author first published a paper

with Erdös. The edge width is determined by the total number of publications that the author has with

Erdös, with 3 as the maximum edge weight in this graph. The size of a vertex represents the number

of Erdös #2 coauthors that the author has collaborated with. The color of a vertex indicates whether

the author is living (blue) or deceased (orange). We also order the authors (manually) by the year in

which they first published a paper with Erdös, with the year increasing in a clockwise fashion.

We observe from Figure 5 that most of the vertices are very small, indicating that these

isolated authors also collaborated less with Erdös #2 authors, the notable exceptions being Peter

Salamon, Solomon Marcus and Alfred Tarski*, who collaborated with 113, 33 and 26 Erdös #2 authors,

respectively. This visualization also helps us identify the oldest collaborator of Erdös in this graph who is still

living: Joseph Lehner.

Headline 4: Birds of a feather flock together: The top Erdös #1 collaborators

also collaborated highly among themselves

Figure 6: Graph of Erdös #1 coauthors having 30 or more collaborators (among Erdös #1 authors)

The collaboration graph of Erdös #1 coauthors having 30 or more collaborators is presented in Figure 6. We

observe that this graph is densely connected, indicating that the authors in this graph also collaborated

highly with each other. Figure 7 clusters these 14 authors using the Girvan-Newman clustering algorithm.

We observe that there is a 6-clique and a 4-clique, which further establishes the high connectivity of this

network.

Figure 7: Graph of Erdös #1 coauthors having 30 or more collaborators (clustered using the

Girvan-Newman clustering algorithm)

Headline 5: Erdös #1 authors having high collaborations with Erdös #2 authors

did not collaborate highly among themselves

We construct the graph of Erdös #1 authors who have collaborated with 100 or more Erdös #2

authors (Figure 8). The size of a node represents the number of Erdös #2 coauthors that the

author has. The coloring is based on the actual degree of the node in the Erdös #1 collaboration

graph (Figure 1). The observation here is that the graph in Figure 8 is much less densely connected than the one in

Figure 6. This implies that these authors do not collaborate highly among themselves. In fact,

Saharon Shelah does not have any collaboration in this graph although he has collaborated with 15

other Erdös #1 authors. Peter Salamon did not collaborate with any of the Erdös #1

coauthors, so his isolation here is no surprise.

Figure 8: Graph of Erdös #1 coauthors having 100 or more Erdös #2 collaborators
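
The comparison between Figure 6 and Figure 8 is made visually in this report. One way to put a number on it would be the edge density 2m / (n(n - 1)) of each filtered subgraph. The sketch below shows that calculation on a hypothetical toy graph; the helper names are assumptions, and no density values for Figures 6 or 8 are claimed. The last line applies the formula to the vertex and edge counts reported for the full network (511 and 3208).

```python
def induced_edges(edges, keep):
    """Edges of the subgraph induced by the vertex set `keep`."""
    keep = set(keep)
    return [(u, v) for u, v in edges if u in keep and v in keep]

def density(n, m):
    """Fraction of possible undirected edges that are present."""
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))

# Hypothetical toy graph: a triangle a-b-c with one pendant vertex on each corner.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("b", "e"), ("c", "f")]
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Filter by a degree threshold, analogous to "30 or more collaborators".
top = [v for v, d in degree.items() if d >= 3]
top_density = density(len(top), len(induced_edges(edges, top)))
full_density = density(len(degree), len(edges))

# Density of the whole Erdös #1 graph from the counts reported in the text.
overall = density(511, 3208)
```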

NodeXL Critiques

NodeXL is a great tool for handling graphs, especially because it is integrated with

Microsoft Excel. I have had the chance to use Pajek (another network analysis tool) before but haven't

found it to be as flexible as NodeXL. The features of NodeXL that interest me the most are the Grouping

options, especially the cluster and motif groupings. The Graph Metrics and Autofill options were equally

useful. However, there are a few things that I feel need more attention (as I found out during the course

of my NodeXL usage); they are listed below:

1. The user needs to handle isolated nodes manually in order to display them. If there are a lot of

isolated nodes in the graph, this becomes problematic.

2. The legends occupy a large portion of the screen below the graph display (shown in

Figure 6 and Figure 8), which is wasteful.

3. The dynamic filter does not change the attributes of the graph dynamically. Consider, for

example, a large graph that is filtered on some vertex attribute (say, betweenness centrality)

while the vertex size depends on the degree of the vertex. The vertices present in the filtered

graph might have low degrees now, but their sizes still reflect their original degrees.

In some cases the original degree might be what is required, but an option could be offered to

the user so that the vertex properties change dynamically as well.

4. It would be useful if the edges in a graph could be laid out in some order in a Star layout (which is

one of the most common layouts). Although the Fruchterman-Reingold layout does give the

graph a Star shape, it does not have the option of ordering the edges. This idea comes from

the clock glyph designs studied earlier in this course. In this case, I had to lay out the edges

manually (Figure 5).

References

1. Erdös Number. http://en.wikipedia.org/wiki/Erd%C5%91s_number

2. The Erdös Number Project. http://www.oakland.edu/enp/

3. Erdos0 dataset. https://files.oakland.edu/users/grossman/enp/Erdos0.html

4. Erdos1graph dataset. https://files.oakland.edu/users/grossman/enp/erdos1graph.html
