The network structure of R packages on CRAN and BioConductor

Andrie de Vries, [email protected], @RevoAndrie
Joseph Rickert, [email protected], @RevoJoe

JSM 2015, Seattle


Background

• R is an incredibly successful open source software project
• R has a large, thriving community with tens of thousands of contributors and over 7K contributed packages on CRAN, BioConductor and GitHub
• How do you begin to find what you are looking for?
• Before designing search algorithms, it is reasonable to study the structure of CRAN and BioConductor

Modeling CRAN and BioConductor

Hypothesis:

Having different management structures:
• CRAN: almost anything goes
• BioConductor: focused and centrally managed

CRAN and BioConductor have discernibly different package network structures.

Objectives of this study:

• Explore the network graph of CRAN and BioConductor
• Characterize their respective network structures
• Develop preliminary models to look for structural differences

Explore the network graph of CRAN and BioConductor

A network of package dependencies
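As a rough illustration of what such a network looks like as data, the sketch below turns a small dependency table into an edge list and a degree count. The package names and dependency lists are invented for the example, not taken from CRAN or BioConductor, and Python is used only to keep the sketch self-contained; the authors' own scripts are separate.

```python
# Hypothetical dependency table: package -> packages it depends on.
# Illustrative only; not real CRAN or BioConductor metadata.
deps = {
    "ggplot2": ["scales", "plyr"],
    "scales": ["plyr"],
    "plyr": [],
    "lonelypkg": [],  # a package with no dependency links at all
}


def build_edges(dep_table):
    """Return a sorted list of (package, dependency) edges."""
    return sorted((pkg, dep) for pkg, ds in dep_table.items() for dep in ds)


def total_degree(dep_table):
    """Count links touching each package (in + out), keeping isolated ones."""
    deg = {pkg: 0 for pkg in dep_table}
    for pkg, dep in build_edges(dep_table):
        deg[pkg] += 1
        deg[dep] = deg.get(dep, 0) + 1
    return deg


edges = build_edges(deps)
deg = total_degree(deps)
isolated = sorted(p for p, d in deg.items() if d == 0)
print(edges)
print(isolated)
```

Packages with degree 0 are kept explicitly because they belong to the network even though no edge mentions them.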

[Figure: dependency network graphs of CRAN and BioConductor]

[Figure: network graphs of CRAN and BioConductor, coloured by community]

*Note: Colour indicates communities found by the walktrap algorithm, but has no common meaning in the two networks


Characterize their respective network structures

Graph statistics

      nodes  edges  average.path.length  assortativity.degree  no.clusters  cluster.coef
cran   6867  14749                 2.72                -0.082         1573         0.015
bioc   1552   5756                 1.95                -0.078           70         0.060

• Observe:
  • CRAN is ~4.5 times larger than BioConductor
  • But CRAN has ~20 times more clusters, i.e. many more, but smaller, clusters
  • This indicates that BioConductor is in fact more strongly clustered, as confirmed by the higher transitivity (clustering) coefficient
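For readers unfamiliar with the transitivity (clustering) coefficient, here is a minimal Python sketch of the global version: the share of connected triples that close into triangles. The toy graph is invented. The component count is included on the assumption that "no.clusters" counts connected components, which the slide does not state.

```python
from itertools import combinations


def to_adjacency(nodes, edges):
    """Undirected adjacency sets; isolated nodes are kept."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def transitivity(adj):
    """Global clustering coefficient: closed triples / connected triples."""
    closed = triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2  # triples centred on this node
        closed += sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return closed / triples if triples else float("nan")


def count_components(adj):
    """Number of connected components (assumed meaning of 'no.clusters')."""
    seen, n = set(), 0
    for start in adj:
        if start in seen:
            continue
        n += 1
        seen.add(start)
        stack = [start]
        while stack:
            for y in adj[stack.pop()]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return n


# Toy graph: a triangle a-b-c, a pendant node d, and an isolated node e.
adj = to_adjacency("abcde", [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
print(transitivity(adj), count_components(adj))
```

Each triangle is counted once per corner in `closed`, so dividing by the number of connected triples gives the usual 3 x triangles / triples ratio.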

Bootstrapped cluster coefficient

Bootstrap sample: n = 1000, size of each subgraph = 500 nodes, no replacement

Two-sample Kolmogorov-Smirnov test

data: CRAN and BioConductor
D = 0.643, p-value < 2.2e-16
alternative hypothesis: two-sided
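The resampling idea can be sketched as follows: repeatedly draw a node subset without replacement, compute the clustering coefficient of the induced subgraph, then compare the two resulting distributions with the two-sample KS statistic. This is a Python sketch under stated assumptions, not the study's code. The two toy graphs, the seed and the sample sizes are invented (and much smaller than the slide's n = 1000 and 500 nodes), and subgraphs without any connected triple are scored 0.0 as a simplification.

```python
import random
from bisect import bisect_right
from itertools import combinations


def transitivity(adj):
    """Global clustering coefficient; 0.0 if there are no connected triples."""
    closed = triples = 0
    for nbrs in adj.values():
        k = len(nbrs)
        triples += k * (k - 1) // 2
        closed += sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return closed / triples if triples else 0.0


def induced_subgraph(adj, keep):
    """Keep only the chosen nodes and the edges among them."""
    keep = set(keep)
    return {n: adj[n] & keep for n in keep}


def bootstrap_transitivity(adj, n_samples, size, rng):
    """Clustering coefficient of n_samples random induced subgraphs."""
    nodes = sorted(adj)
    return [
        transitivity(induced_subgraph(adj, rng.sample(nodes, size)))  # no replacement
        for _ in range(n_samples)
    ]


def ks_statistic(x, y):
    """Two-sided two-sample KS statistic D = max |F_x(t) - F_y(t)|."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in set(xs) | set(ys)
    )


def triangles_graph(n_triangles):
    """Toy strongly clustered graph: disjoint triangles."""
    adj = {}
    for i in range(n_triangles):
        a, b, c = 3 * i, 3 * i + 1, 3 * i + 2
        adj[a], adj[b], adj[c] = {b, c}, {a, c}, {a, b}
    return adj


def cycle_graph(n):
    """Toy unclustered graph: a single ring with no triangles."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


rng = random.Random(2015)
clustered = bootstrap_transitivity(triangles_graph(40), 200, 60, rng)
unclustered = bootstrap_transitivity(cycle_graph(120), 200, 60, rng)
d = ks_statistic(clustered, unclustered)
print(round(d, 3))
```

The sketch stops at the statistic D; turning D into a p-value needs the Kolmogorov distribution, which a statistics library would normally supply.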

Analysis of degree distribution

Notice the difference at degree = 0

      power.law.fit  power.law.xmin  power.law.KS.p
cran           2.55               5           0.061
bioc           2.59               9           0.632
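To give a feel for how a power-law exponent can be estimated from a degree sequence once a lower cut-off xmin is chosen, the Python sketch below uses a common closed-form approximation to the discrete maximum-likelihood estimator, alpha ≈ 1 + n / Σ ln(x_i / (xmin − 0.5)). Whether the figures in the table were produced by exactly this estimator is not stated on the slide, so treat it as an assumption; the degree sequence is invented.

```python
import math


def power_law_alpha(degrees, xmin):
    """Approximate MLE of the exponent, using only degrees >= xmin."""
    if xmin < 1:
        raise ValueError("xmin must be at least 1")
    tail = [d for d in degrees if d >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    log_sum = sum(math.log(d / (xmin - 0.5)) for d in tail)
    return 1.0 + len(tail) / log_sum


# Invented degree sequence; values below xmin are ignored by the fit.
degrees = [0, 0, 1, 1, 2, 3, 5, 5, 6, 8, 13, 21]
alpha = power_law_alpha(degrees, xmin=5)
print(round(alpha, 2))
```

Only the tail at or above xmin enters the estimate, which is why the choice of xmin (5 for CRAN, 9 for BioConductor in the table) matters as much as the exponent itself.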

Comparing degree distribution

Degree distribution of CRAN and BioConductor

Two-sample Kolmogorov-Smirnov test

D^+ = 0.19943, p-value < 2.2e-16

alternative hypothesis: the CDF of x lies above that of y
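The one-sided statistic D^+ measures how far the empirical CDF of the first sample rises above that of the second, D^+ = max over t of F_x(t) − F_y(t). The Python sketch below computes it for two invented degree sequences. Which network plays the role of x in the reported test is not stated on the slide, so the labels here are placeholders.

```python
from bisect import bisect_right


def ks_d_plus(x, y):
    """One-sided two-sample KS statistic: max_t (F_x(t) - F_y(t))."""
    xs, ys = sorted(x), sorted(y)
    best = 0.0
    for t in sorted(set(xs) | set(ys)):
        diff = bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys)
        best = max(best, diff)
    return best


# Invented degree sequences: x has many low-degree nodes, y has few.
x = [0, 0, 0, 1, 2]
y = [1, 2, 2, 3, 4]
print(ks_d_plus(x, y), ks_d_plus(y, x))
```

A sample with many nodes at degree 0 accumulates probability mass early, so its CDF sits above the other one and D^+ is large in that direction and zero in the reverse direction.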

Resampling from degree distribution

• The original networks are comparatively large samples, so a test is almost certain to detect differences
• Sub-sampling allows us to look at a finer level of detail

[Figure panels: typical small sample (n = 100); p-value distribution]

Develop preliminary models to look for structural differences

Exponential Random Graph Modeling (ERGM)

Formula: bioc_net ~ edges + degree(c(1, 2))
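An ERGM assigns each graph a probability proportional to exp(θ · g(y)), where g(y) is a vector of network statistics named by the formula. For the formula above, those statistics are the number of edges and the number of nodes with degree exactly 1 and exactly 2. The Python sketch below computes g(y) and the unnormalised log-weight for a toy graph. The graph and the coefficient values are invented, and the network is treated as undirected, which is an assumption.

```python
def ergm_statistics(nodes, edges, degrees=(1, 2)):
    """Return [edge count, #nodes with degree 1, #nodes with degree 2]."""
    deg = {n: 0 for n in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    degree_counts = [sum(1 for d in deg.values() if d == k) for k in degrees]
    return [len(edges)] + degree_counts


def log_weight(stats, theta):
    """Unnormalised log-probability theta . g(y) of a graph."""
    return sum(t * s for t, s in zip(theta, stats))


# Toy undirected graph: triangle a-b-c, pendant d, isolated e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

stats = ergm_statistics(nodes, edges)
theta = (-1.0, 0.5, 0.25)  # hypothetical coefficients, not fitted values
print(stats, log_weight(stats, theta))
```

Fitting the model means finding the θ under which graphs with the observed statistics are most likely; the sketch covers only the statistics, not the estimation.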

Conclusions:

• The network structures of CRAN and BioConductor are detectably different
• The large number of unconnected packages is a dominant feature of CRAN
• Large communities form around infrastructure and tools packages on CRAN
• Preliminary modeling indicates that feature-driven random graph models will be productive

Next steps: join the project!!

Scripts available at: https://github.com/andrie/cran-network-structure

[email protected], @RevoAndrie
[email protected], @RevoJoe