Data Science at the ATI and BL Web Archiving

Data Science at the Alan Turing Institute and British Library Web Archiving

Dr Scott A. Hale @computermacgyve http://scott.hale.us

1. Fifteen Years of British Universities on the Web

2. Live versus Archive

3. Full Stack Data Science

Mapping the UK Webspace:

Scott A. Hale, Taha Yasseri, Josh Cowls, Eric T. Meyer, Ralph Schroeder, Helen Margetts

Fifteen Years of British Universities on the Web

With our thanks to Ning Wang, Adham Tamer, Andreas Kaltenbrunner, and our reviewers.

WebSci 2014, https://arxiv.org/abs/1405.2856

Few longitudinal studies of the Web

To what extent can online proxies reproduce traditional measures?

Does physical distance matter for universities online?

30 TB compressed data

6.2 TB metadata and links

2.5 TB temporal links

Grouped to 3rd level domain (e.g., ox.ac.uk)

Grouped pages crawled at similar times (within 1,000 seconds)

Edge weight between any two domains for a given year is the largest number of hyperlinks between those two domains for any group that year
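The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the input is assumed to be (crawl time, source URL, target URL) records with full URLs including a scheme, the names third_level_domain and yearly_edge_weights are hypothetical, and a "group" is assumed to close once a capture falls more than 1,000 seconds after the first capture in that group.

```python
from collections import defaultdict
from datetime import datetime
from urllib.parse import urlsplit


def third_level_domain(url):
    """Reduce a URL's host to its last three labels, e.g. www.cs.ox.ac.uk -> ox.ac.uk."""
    host = (urlsplit(url).hostname or "").lower()
    return ".".join(host.split(".")[-3:])


def yearly_edge_weights(links, window=1000):
    """links: iterable of (crawl_time, source_url, target_url); crawl_time is a datetime.

    Returns {(year, source_domain, target_domain): weight}, where weight is the
    largest hyperlink count observed in any single crawl group of that year.
    """
    by_pair = defaultdict(list)
    for crawl_time, source, target in links:
        pair = (third_level_domain(source), third_level_domain(target))
        by_pair[pair].append(crawl_time)

    weights = {}
    for pair, times in by_pair.items():
        times.sort()
        group_start, count = times[0], 0
        for t in times:
            # Assumption: start a new group when the capture is more than
            # `window` seconds after the group's first capture, or in a new year.
            if (t - group_start).total_seconds() > window or t.year != group_start.year:
                group_start, count = t, 0
            count += 1
            key = (group_start.year, *pair)
            weights[key] = max(weights.get(key, 0), count)
    return weights
```

Taking the maximum over groups, rather than the sum, avoids counting the same hyperlink once per re-crawl when a domain is captured several times in one year.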

cam.ac.uk

ox.ac.uk

(2005, 2), (2006, 8), ..., (2010, 13)

Colour ~ intensity

\sigma_{ij} = \frac{s_{ij}}{s_i^{out} s_j^{in}}

s_{ij} = s_i^{out} s_j^{in}
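A small sketch of the normalisation may help. It divides each edge weight s_ij by the product of the source's out-strength and the target's in-strength. The assumption here, not stated on the slide, is that s_i^out is the sum of weights on edges leaving i and s_j^in is the sum of weights on edges entering j; the function name normalised_weights is hypothetical.

```python
def normalised_weights(weights):
    """weights: {(source, target): s_ij}. Returns {(source, target): sigma_ij}."""
    out_strength, in_strength = {}, {}
    for (i, j), s in weights.items():
        # Assumed definitions: out-strength sums over outgoing edges,
        # in-strength sums over incoming edges.
        out_strength[i] = out_strength.get(i, 0) + s
        in_strength[j] = in_strength.get(j, 0) + s
    return {
        (i, j): s / (out_strength[i] * in_strength[j])
        for (i, j), s in weights.items()
        if s > 0  # zero-weight edges are skipped to avoid dividing by zero
    }
```

Under these definitions, a high value marks a pair of domains that link to each other more than their overall linking activity alone would suggest.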

𝑟0.28

University affiliations weakly reflected

Correlation between network centrality and league table rankings increasing

Physical distance still important
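One way to compare a network centrality score with a league table position is a rank correlation. The sketch below uses Spearman's rank correlation purely as an illustration; the slides do not say which correlation measure or centrality measure was used, and the function names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Computing spearman(centrality, league_position) separately for each year would give one coefficient per year, which is the kind of series in which an increasing correlation could be observed. Note that a league position of 1 is best, so a strong relationship may appear as a strongly negative coefficient.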

Completeness

Variable timing of captures

Boundary effects (.uk): not really an issue for .ac.uk

Live versus Archive:

Scott A. Hale, Grant Blank, & Victoria D. Alexander

Comparing a Web Archive to a Population of Webpages

In Web as History, R. Schroeder and N. Brügger (Eds.), London: UCL Press.

Ainsworth et al. (2013): 35-90% of the Web is archived

Unclear how much of any website is archived

Comparison of 1996-2013 JISC data for tripadvisor.co.uk to the live Web

Why? Can determine when new webpages are added.

24% of TripAdvisor’s London attractions were in the JISC/Internet Archive data

Archived pages are biased toward prominent pages; they are not a random sample
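The coverage comparison can be sketched as a set intersection between the pages found on the live site and the pages present in the archive. This is an illustration only: the URL normalisation rules (ignore scheme, a leading "www." and trailing slashes) are assumptions, and the names normalise and archive_coverage are hypothetical.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Assumed matching rule: compare host and path only, ignoring scheme,
    a leading 'www.', and trailing slashes."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def archive_coverage(live_urls, archived_urls):
    """Fraction of live pages that also appear in the archive."""
    live = {normalise(u) for u in live_urls}
    archived = {normalise(u) for u in archived_urls}
    if not live:
        raise ValueError("live_urls must not be empty")
    return len(live & archived) / len(live)
```

For example, with four live pages of which one is archived, archive_coverage returns 0.25. Comparing attributes of the matched pages against the unmatched ones is then a way to check whether the archived subset looks like a random sample.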

Full Stack Data Science

Methods to discover and evaluate whether a site not in .uk is ‘British’

More complete crawls / machine-readable metadata on what is not crawled

For social science research

Appropriate network null models for missing/biased data

Rich and accessible metadata
