Data Science at the Alan Turing Institute and British Library Web Archiving Dr Scott A. Hale @computermacgyve http://scott.hale.us


Page 1: Data Science at the ATI and BL Web Archiving

Data Science at the Alan Turing Institute and British Library Web Archiving

Dr Scott A. Hale @computermacgyve http://scott.hale.us

Page 2: Data Science at the ATI and BL Web Archiving

1. Fifteen Years of British Universities on the Web

2. Live versus Archive

3. Full Stack Data Science

Page 3: Data Science at the ATI and BL Web Archiving

Mapping the UK Webspace:

Scott A. Hale, Taha Yasseri, Josh Cowls, Eric T. Meyer, Ralph Schroeder, Helen Margetts

Fifteen Years of British Universities on the Web

With our thanks to Ning Wang, Adham Tamer, Andreas Kaltenbrunner, and our reviewers.

WebSci 2014, https://arxiv.org/abs/1405.2856

Page 4: Data Science at the ATI and BL Web Archiving

Few longitudinal studies of the Web

To what extent can online proxies reproduce traditional measures?

Does physical distance matter for universities online?

Page 5: Data Science at the ATI and BL Web Archiving

30 TB compressed data

6.2 TB metadata and links

2.5 TB temporal links

Page 6: Data Science at the ATI and BL Web Archiving

Grouped to third-level domain (e.g., ox.ac.uk)

Grouped pages crawled at similar times (within 1,000 seconds)

Edge weight between any two domains for a given year is the largest number of hyperlinks between those two domains for any group that year

Example (year, weight) series for the edge between cam.ac.uk and ox.ac.uk: (2005, 2), (2006, 8), ..., (2010, 13)
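
The grouping and weighting rules on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the input layout `(source_host, target_host, unix_timestamp)`, the function names, the reading of "within 1,000 seconds" as fixed 1,000-second buckets, and the naive "last three labels" domain rule are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 1000  # value from the slide; fixed buckets are an assumption


def third_level_domain(host):
    """Collapse a hostname to its last three labels (www.cs.ox.ac.uk -> ox.ac.uk)."""
    return ".".join(host.lower().rstrip(".").split(".")[-3:])


def yearly_edge_weights(links):
    """Build yearly edge weights from individual hyperlinks.

    links: iterable of (source_host, target_host, unix_timestamp),
           one tuple per hyperlink (hypothetical input layout).
    Returns {(source_domain, target_domain): {year: weight}}, where the
    weight is the largest per-group link count seen in that year.
    """
    counts = defaultdict(int)
    for src, dst, ts in links:
        pair = (third_level_domain(src), third_level_domain(dst))
        year = datetime.fromtimestamp(ts, tz=timezone.utc).year
        bucket = int(ts) // WINDOW_SECONDS  # crawl-time group
        counts[(pair, year, bucket)] += 1

    weights = defaultdict(dict)
    for (pair, year, _bucket), n in counts.items():
        # Keep the maximum over all groups in the same year.
        weights[pair][year] = max(n, weights[pair].get(year, 0))
    return {pair: dict(sorted(by_year.items())) for pair, by_year in weights.items()}
```

Taking the maximum over groups, rather than the sum, avoids counting the same hyperlink again each time a page is re-crawled within a year.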

Page 7: Data Science at the ATI and BL Web Archiving
Page 8: Data Science at the ATI and BL Web Archiving

Colour ~ intensity

Page 9: Data Science at the ATI and BL Web Archiving

$\sigma_{ij} = \dfrac{s_{ij}}{s_i^{\mathrm{out}}\, s_j^{\mathrm{in}}}$

$s_{ij} = \dfrac{s_i^{\mathrm{out}}\, s_j^{\mathrm{in}}}{r^{0.28}}$
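
As a rough illustration of the normalisation on this slide, the sketch below divides each link weight s_ij by the product of the source's out-strength and the target's in-strength. The slide does not define the strengths; taking them as the sums of outgoing and incoming weights is an assumption, as are the function name and the dictionary input.

```python
from collections import defaultdict


def normalised_weights(s):
    """Compute sigma_ij = s_ij / (s_i_out * s_j_in).

    s: dict {(i, j): weight s_ij} for a directed, weighted network.
    Assumption: s_i_out is the sum of weights leaving i and
    s_j_in is the sum of weights arriving at j.
    """
    out_strength = defaultdict(float)
    in_strength = defaultdict(float)
    for (i, j), w in s.items():
        out_strength[i] += w
        in_strength[j] += w

    # Only positive weights are kept, so both strengths are non-zero.
    return {
        (i, j): w / (out_strength[i] * in_strength[j])
        for (i, j), w in s.items()
        if w > 0
    }
```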

Page 10: Data Science at the ATI and BL Web Archiving

University affiliations weakly reflected

Correlation between network centrality and league table rankings increasing

Physical distance still important
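
One way to compare a centrality score with a league-table position is a rank correlation. The sketch below uses Spearman's coefficient; the slides do not say which correlation measure was used, so this choice and the function names are assumptions for illustration only.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            result[k] = average
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

If league-table position 1 is best and higher centrality is better, agreement between the two shows up as a negative coefficient.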

Page 11: Data Science at the ATI and BL Web Archiving

Completeness

Variable timing of captures

Boundary effects (.uk)
  ◦ Not really an issue for .ac.uk

Page 12: Data Science at the ATI and BL Web Archiving

Live versus Archive:

Scott A. Hale, Grant Blank, & Victoria D. Alexander

Comparing a Web Archive to a Population of Webpages

In Web as History, R. Schroeder and N. Brügger (Eds.), London: UCL Press.

Page 13: Data Science at the ATI and BL Web Archiving

Ainsworth et al. (2013): 35–90% of the Web is archived

Unclear how much of any website is archived

Comparison of 1996-2013 JISC data for tripadvisor.co.uk to the live Web

Why? Can determine when new webpages are added.

Page 14: Data Science at the ATI and BL Web Archiving
Page 15: Data Science at the ATI and BL Web Archiving
Page 16: Data Science at the ATI and BL Web Archiving

24% of TripAdvisor’s London attractions were in the JISC/Internet Archive data

Archived pages are biased toward prominent ones

Not a random sample
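
A coverage figure of this kind is the share of live pages that also appear in the archive. The sketch below shows that arithmetic only; the URL normalisation (ignoring scheme, letter case of the host, trailing slash, query and fragment) and the function names are assumptions, not the procedure used in the study.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to host + path so trivially different forms match."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    return host + path


def coverage(live_urls, archived_urls):
    """Fraction of distinct live URLs that also occur in the archive."""
    live = {normalise(u) for u in live_urls}
    if not live:
        raise ValueError("live_urls must not be empty")
    archived = {normalise(u) for u in archived_urls}
    return len(live & archived) / len(live)
```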

Page 17: Data Science at the ATI and BL Web Archiving

Full Stack Data Science

Page 18: Data Science at the ATI and BL Web Archiving

Methods to discover and evaluate whether a site not in .uk is ‘British’

More complete crawls / machine-readable metadata on what is not crawled

For social science research:

Appropriate network null models for missing/biased data

Rich and accessible metadata

Page 19: Data Science at the ATI and BL Web Archiving

Data Science at the Alan Turing Institute and British Library Web Archiving

Dr Scott A. Hale @computermacgyve http://scott.hale.us