The network structure of R packages on CRAN and BioConductor
Andrie de Vries, @RevoAndrie
Joseph Rickert, @RevoJoe
JSM 2015, Seattle
Background
• R is an incredibly successful open source software project
• R has a large, thriving community, with tens of thousands of contributors and over 7K contributed packages on CRAN, BioConductor and GitHub
• How do you begin to find what you are looking for?
• Before designing search algorithms, it is reasonable to study the structure of CRAN and BioConductor
Modeling CRAN and BioConductor
Hypothesis:
Having different management structures:
• CRAN: almost anything goes
• BioConductor: focused and centrally managed
CRAN and BioConductor have discernibly different package network structures.
Objectives of this study:
• Explore the network graph of CRAN and BioConductor
• Characterize their respective network structures
• Develop preliminary models to look for structural differences
Explore the network graph of CRAN and BioConductor
A network of package dependencies
[Figure: package dependency network graphs, one panel for CRAN and one for BioConductor]
*Note: Colour indicates communities found by the walktrap algorithm, but has no common meaning across the two networks
Characterize their respective network structures
Graph statistics
     nodes edges average.path.length assortativity.degree no.clusters cluster.coef
cran  6867 14749                2.72               -0.082        1573        0.015
bioc  1552  5756                1.95               -0.078          70        0.060
• Observe:
• CRAN is ~4.5 times larger than BioConductor
• But CRAN has ~20 times more clusters, i.e. many more, but smaller, clusters
• This indicates that BioConductor is in fact more strongly clustered, as confirmed by its higher transitivity (clustering) coefficient
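A minimal sketch of how statistics like those in the table can be computed, written in Python with only the standard library rather than the R tooling that the column names suggest. The package names and edges are hypothetical toy data. It is an assumption here that no.clusters counts connected components and that cluster.coef is the global transitivity (closed triples divided by all connected triples).

```python
from itertools import combinations


def build_graph(nodes, edges):
    """Undirected adjacency sets; isolated nodes are kept as empty sets."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    return adj


def count_components(adj):
    """Number of connected components, found by depth-first search."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return count


def transitivity(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = triples = 0
    for nbs in adj.values():
        k = len(nbs)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(nbs, 2) if v in adj[u])
    return closed / triples if triples else 0.0


def graph_stats(adj):
    return {
        "nodes": len(adj),
        "edges": sum(len(nbs) for nbs in adj.values()) // 2,
        "no.clusters": count_components(adj),
        "cluster.coef": transitivity(adj),
    }


# Hypothetical toy data: one triangle, one pendant package, one isolated package.
packages = ["pkgA", "pkgB", "pkgC", "pkgD", "pkgE"]
dependencies = [("pkgA", "pkgB"), ("pkgB", "pkgC"), ("pkgA", "pkgC"), ("pkgC", "pkgD")]
stats = graph_stats(build_graph(packages, dependencies))
print(stats)
```

Under these assumptions an isolated package such as pkgE counts as its own component, so a network with many unconnected packages reports many clusters.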
Bootstrapped cluster coefficient
Bootstrap sample: n = 1000, size of each subgraph = 500 nodes, no replacement
Two-sample Kolmogorov-Smirnov test
data: CRAN and BioConductor
D = 0.643, p-value < 2.2e-16
alternative hypothesis: two-sided
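The bootstrap described on this slide can be sketched as follows, in Python with the standard library. This is one plausible reading, not the authors' script: each bootstrap sample draws a fixed number of nodes without replacement, takes the induced subgraph, and records its transitivity. The graph, the sizes and the helper names (`toy_graph`, `bootstrap_transitivity`) are hypothetical and much smaller than the n = 1000 samples of 500 nodes quoted above.

```python
import random
from itertools import combinations


def transitivity(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = triples = 0
    for nbs in adj.values():
        k = len(nbs)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(nbs, 2) if v in adj[u])
    return closed / triples if triples else 0.0


def induced_subgraph(adj, keep):
    """Keep only the chosen nodes and the edges between them."""
    keep = set(keep)
    return {n: adj[n] & keep for n in keep}


def bootstrap_transitivity(adj, n_samples, subgraph_size, seed=0):
    """Transitivity of n_samples random induced subgraphs."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    coefs = []
    for _ in range(n_samples):
        sample = rng.sample(nodes, subgraph_size)  # nodes drawn without replacement
        coefs.append(transitivity(induced_subgraph(adj, sample)))
    return coefs


def toy_graph(n_nodes, n_edges, seed):
    """Hypothetical random undirected graph standing in for a package network."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n_nodes)}
    while sum(len(v) for v in adj.values()) // 2 < n_edges:
        a, b = rng.sample(range(n_nodes), 2)
        adj[a].add(b)
        adj[b].add(a)
    return adj


graph = toy_graph(60, 200, seed=1)
coefs = bootstrap_transitivity(graph, n_samples=200, subgraph_size=30)
print(min(coefs), sum(coefs) / len(coefs), max(coefs))
```

Running the same procedure on two networks gives two samples of coefficients, which is the kind of input a two-sample Kolmogorov-Smirnov test compares.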
Analysis of degree distribution
Notice the difference at degree = 0
     power.law.fit power.law.xmin power.law.KS.p
cran          2.55              5          0.061
bioc          2.59              9          0.632
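For readers unfamiliar with power-law fitting, here is a hedged sketch of one common estimator of the exponent: the approximate maximum-likelihood formula for discrete data, alpha = 1 + n / sum(ln(x_i / (xmin - 0.5))), applied to degrees at or above xmin. It is not necessarily the estimator behind the power.law.fit column, and the degree list is hypothetical.

```python
import math


def power_law_alpha(degrees, xmin):
    """Approximate discrete MLE of the power-law exponent for the tail x >= xmin."""
    if xmin < 1:
        raise ValueError("xmin must be at least 1")
    tail = [d for d in degrees if d >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(d / (xmin - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical degree sample; the zeros stand for unconnected packages.
degrees = [0, 0, 0, 1, 1, 1, 2, 2, 3, 5, 8, 13]
alpha = power_law_alpha(degrees, xmin=1)
print(round(alpha, 3))
```

Nodes below xmin, including those with degree 0, do not enter the fit at all, so the fitted exponent says nothing about the difference at degree = 0.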
Comparing degree distribution
Degree distribution of CRAN and BioConductor
Two-sample Kolmogorov-Smirnov test
D^+ = 0.19943, p-value < 2.2e-16
alternative hypothesis: the CDF of x lies above that of y
Resampling from degree distribution
• The original network samples are comparatively large, so a test is almost certain to detect differences
• Sub-sampling allows us to look at a finer level of detail
Typical small sample, n = 100
P-value distribution
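The resampling idea can be sketched as below, in Python with the standard library: repeatedly draw small subsamples from each degree sample, run a two-sample Kolmogorov-Smirnov test on each pair, and collect the p-values. The sketch assumes a two-sided test and uses the asymptotic Kolmogorov series for the p-value, which is only approximate for discrete data with ties such as degrees. The two degree samples are synthetic and hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sided KS statistic: largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in set(xs) | set(ys)
    )


def ks_pvalue(d, n, m):
    """Asymptotic two-sided p-value: 2 * sum((-1)^(k-1) * exp(-2 k^2 lam^2))."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam == 0.0:
        return 1.0
    total = 0.0
    for k in range(1, 1001):
        term = math.exp(-2.0 * (k * lam) ** 2)
        total += term if k % 2 else -term
        if term < 1e-12:  # remaining terms are negligible
            break
    return min(1.0, max(0.0, 2.0 * total))


def subsample_pvalues(x, y, size, repeats, seed=0):
    """KS p-values for repeated subsamples drawn without replacement."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(repeats):
        sx, sy = rng.sample(x, size), rng.sample(y, size)
        pvals.append(ks_pvalue(ks_statistic(sx, sy), size, size))
    return pvals


# Hypothetical heavy-tailed degree samples standing in for the two networks.
rng = random.Random(42)
degrees_a = [int(rng.paretovariate(1.5)) - 1 for _ in range(2000)]
degrees_b = [int(rng.paretovariate(1.2)) for _ in range(800)]
pvals = subsample_pvalues(degrees_a, degrees_b, size=100, repeats=200)
print(sum(p < 0.05 for p in pvals) / len(pvals))
```

The printed value is the share of subsample pairs with p < 0.05, a one-number summary of the p-value distribution.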
Develop preliminary models to look for structural differences
Exponential Random Graph Modeling (ERGM)
Formula: bioc_net ~ edges + degree(c(1, 2))
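In an exponential random graph model, the probability of a network is proportional to exp(theta · g(y)), where g(y) is a vector of network statistics named on the right-hand side of the formula. A minimal Python sketch of the statistics this formula implies, assuming an undirected network: the edge count, plus the number of nodes with degree 1 and with degree 2. The toy network and the coefficient values in `theta` are hypothetical, not fitted results from the study.

```python
def ergm_statistics(nodes, edges, degrees=(1, 2)):
    """Statistics for `edges + degree(c(1, 2))` on an undirected network."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    n_edges = sum(len(nbs) for nbs in adj.values()) // 2
    degree_counts = [sum(1 for nbs in adj.values() if len(nbs) == d) for d in degrees]
    return [n_edges] + degree_counts


def log_weight(theta, stats):
    """Unnormalized log-probability theta . g(y) of a network."""
    return sum(t * s for t, s in zip(theta, stats))


# Hypothetical toy network: a triangle, one pendant node, one isolated node.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
stats = ergm_statistics(nodes, edges)  # [edges, degree1, degree2]

theta = [-2.0, 0.5, 0.3]  # made-up coefficients, for illustration only
print(stats, log_weight(theta, stats))
```

Fitting the model means estimating theta from the observed network; this sketch only evaluates the statistics and the weight for given coefficients.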
Conclusions:
• The network structures of CRAN and BioConductor are detectably different
• The large number of unconnected packages is a dominant feature of CRAN
• Large communities form around infrastructure and tools packages on CRAN
• Preliminary modeling indicates that feature-driven random graph models will be productive
Next steps: join the project!!
Scripts available at: https://github.com/andrie/cran-network-structure