Building a Collection of the Historical UK Web for scholarly use


Helen Hockx-Yu

Head of Web Archiving, British Library

www.bl.uk 2

The UK Web Domain

4th largest TLD after .com, .de and .net

Over 10 million registered .uk domains

UK organisations also use non-.uk domain names (eg .com or .org) – scale unknown

Non-print Legal Deposit (since April 2013) applies to:

the open (freely available) web: .uk and other UK-published (non-.uk) websites, such as .com, .org…

also e-journals, e-books, news web pages and other digital publications, either by harvesting or mutual agreement on other delivery methods


Web Archiving at the British Library

Collect UK digital heritage and provide continued access to archived web resources

Started web archiving in 2003: Open UK Web Archive

Selective, topical collections and key sites

Consortium sharing infrastructure and development effort; agreement on who collects what

Curating collections with organisations and researchers

Archiving UK Web for non-print Legal Deposit since April 2013: Legal Deposit UK Web Archive

Comprehensive national archive with on-site access only

Joint responsibility of six Legal Deposit Libraries (LDLs)

http://www.webarchive.org.uk/ukwa/


Collecting strategy for websites

[Diagram labels: Domain Crawl; News; Key sites; Events; Special collections]

Domain crawl:

• Broad sweep of UK domain

• Once or twice a year

Events & key sites and news:

• Events of UK interest

• High value, high impact sites

• National & regional news

Special Collection:

• Focused, thematic collections

• Support priority subjects


UK websites – territoriality explained

An online work is considered as “published in the UK”, and therefore in scope for Legal Deposit, if it meets either of the following criteria:

(a) it is made available to the public from a website with a domain name which relates to the United Kingdom or to a place within the United Kingdom; or

(b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom

The Legal Deposit Libraries (Non-Print Works) Regulations, 2013


Territoriality - implementation

All websites with a .uk domain name

Including embedded content (eg CSS, images) regardless of where it is hosted

Non-.uk websites have to meet at least one criterion:

UK hosting: check external IP geo-location database and add in-scope URLs to the fetch chain

UK postal address

Correspondence

Professional judgement
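
The scoping rules above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the crawler's actual logic: `ScopeEvidence`, `in_scope`, the `embedded_in_uk_page` flag and the `geo_lookup` callable are hypothetical names, and a real deployment would query an actual IP geo-location database rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlsplit


@dataclass
class ScopeEvidence:
    """Manually recorded evidence for a non-.uk site (hypothetical record)."""
    uk_postal_address: bool = False
    correspondence: bool = False
    professional_judgement: bool = False


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").rstrip(".").lower()


def in_scope(
    url: str,
    geo_lookup: Callable[[str], Optional[str]],
    evidence: Optional[ScopeEvidence] = None,
    embedded_in_uk_page: bool = False,
) -> bool:
    """Decide whether a URL falls within the territorial scope described above.

    geo_lookup maps a host name to an ISO country code (assumed interface).
    """
    host = host_of(url)
    if not host:
        return False
    # Every website with a .uk domain name is in scope.
    if host.endswith(".uk"):
        return True
    # Embedded content (CSS, images) of an in-scope page, wherever it is hosted.
    if embedded_in_uk_page:
        return True
    # Non-.uk sites need at least one criterion, starting with UK hosting.
    if geo_lookup(host) == "GB":
        return True
    if evidence is not None:
        return (
            evidence.uk_postal_address
            or evidence.correspondence
            or evidence.professional_judgement
        )
    return False
```

For example, `in_scope("http://example.com/", {"example.com": "GB"}.get)` returns `True` because the stand-in lookup reports UK hosting, while the same URL with a lookup that returns `"US"` and no recorded evidence is out of scope.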


UK Domain Crawl

2013 domain crawl stats

3.86 million seeds

1.9 billion URLs (web pages, docs, images)

~31TB

Duration: 70 days

2014 domain crawl

90 million seeds (starting URLs)

Started on 19th June 2014

Collected 52TB of data by 9th December (incl. 4.4GB of viruses & 3TB of homepage screenshots)

Nearly 2 million non-.uk domains
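
As a rough sense of scale, the 2013 figures imply the averages computed below. This is a back-of-envelope sketch only: it assumes the crawl ran continuously for the full 70 days and that 1TB means 10^12 bytes, neither of which is stated on the slide, and the input figures are themselves rounded.

```python
# Figures from the 2013 domain crawl slide (approximate).
URLS_2013 = 1.9e9    # URLs collected
BYTES_2013 = 31e12   # ~31TB, assuming decimal terabytes
DAYS_2013 = 70       # crawl duration


def crawl_averages(urls, total_bytes, days):
    """Return average throughput and object size for a crawl."""
    seconds = days * 24 * 60 * 60
    return {
        "urls_per_day": urls / days,
        "urls_per_second": urls / seconds,
        "bytes_per_url": total_bytes / urls,
    }


averages = crawl_averages(URLS_2013, BYTES_2013, DAYS_2013)
for name, value in averages.items():
    print(f"{name}: {value:,.0f}")
```

Under those assumptions the crawl averaged roughly 27 million URLs per day, a little over 300 URLs per second, and about 16KB per URL.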


The “access” paradoxes

Completeness versus openness of web archives

Legal Deposit national collections have restricted access

Document-centred versus data-driven

Essentially a scale issue

Pre-selected or defined collections are not relevant to all researchers; relevant content is difficult to find in a large-scale web archive

Arbitrary (national) boundaries are often irrelevant to the research question, but most heritage institutions operate within certain geographical areas

…


Web archive as historical document


Collaboration with researchers

Building collections

Researchers’ involvement in scoping collections, selecting and describing websites

Creation of specific (narrow) topical collections

Formulating research questions

Brainstorming sessions, workshops, discussions, surveys etc.

Lack of awareness & baseline knowledge

Challenging: you don’t know what you don’t know

Co-development of access services

This is changing how we collect and store data


JISC UK Web Domain dataset (1996-2013)

Collaboration between the Internet Archive (IA), the Joint Information Systems Committee (JISC) and the British Library

Extracted copies of UK websites from the Internet Archive’s collection

1st tranche: 1996 – 2010, 30TB, 2.5 billion URLs

2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)

Research agreement between JISC and IA, upholding IA’s Terms of Use

Access via IA’s Wayback Machine

Allows replication / extraction of derivative or secondary datasets

BL hosts the dataset on behalf of JISC

Data used by research projects

Institute of Historical Research project: Analytical Access to the Domain Dark Archive (AADDA) (http://www.history.ac.uk/projects/digital/AADDA)

Oxford Internet Institute project: Big data for political science (http://www.oii.ox.ac.uk/research/projects/?id=88)




Completed work

Analytical Access to the Domain Dark Archive Project

Use cases & experimental UI

Demonstrating the Value of the UK Web Domain Dataset for Social Science Research

Analysis of link graph

Paper accepted for WebSci’14: Mapping the UK Webspace: Fifteen Years of British Universities on the Web (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2435481)

MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher: Creating and Analysing (http://julesmataly.com/documents/MA_thesis_jules_mataly.pdf)

Secondary datasets under open licence

Format profile, Geoindex, Host Link Graph (http://data.webarchive.org.uk/opendata/ukwa.ds.2/)


Exploring Host Link Graph

Courtesy of Peter Webster, Rainer Simon and Jules Mataly
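
A host link graph of this kind can be explored with a few lines of code. The sketch below assumes a hypothetical tab-separated layout of year, source host, target host and link count per line; the published dataset's actual layout may differ and should be checked against its documentation. The sample rows and the `links_by_year` helper are invented for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows: year, source host, target host, link count.
SAMPLE = (
    "2004\tbl.uk\texample.ac.uk\t12\n"
    "2004\texample.org.uk\tbl.uk\t5\n"
    "2005\tbl.uk\texample.ac.uk\t20\n"
    "2005\texample.co.uk\tbl.uk\t7\n"
    "2005\texample.co.uk\texample.org.uk\t3\n"
)


def links_by_year(lines, host):
    """Sum inbound and outbound link counts per year for one host."""
    totals = defaultdict(lambda: {"in": 0, "out": 0})
    for year, source, target, count in csv.reader(lines, delimiter="\t"):
        if source == host:
            totals[int(year)]["out"] += int(count)
        if target == host:
            totals[int(year)]["in"] += int(count)
    return dict(totals)


result = links_by_year(io.StringIO(SAMPLE), "bl.uk")
for year in sorted(result):
    print(year, result[year])
```

Per-year totals like these are the kind of series a link visualisation can be built from; rows that do not mention the chosen host are simply skipped.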


Visualising links (to and from bl.uk)

Interactive version

How it is done

https://www.youtube.com/watch?v=rX6Hix19_No

http://nbviewer.ipython.org/github/anjackson/keeping-codes/blob/gh-pages/experiments/Visualising Link Dynamics.ipynb/


Evolution of the UK web (2004-2013)


Memento service
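
For context, Memento (RFC 7089) lets a client ask a TimeGate for the archived copy of a URL closest to a chosen date by sending an `Accept-Datetime` request header. The sketch below only builds such a request without sending it. The TimeGate address is a placeholder rather than the British Library's actual endpoint, and appending the target URL to the TimeGate address is a common convention assumed here, not something the slide specifies.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Placeholder TimeGate address (hypothetical, for illustration only).
TIMEGATE = "http://timegate.example.org/timegate/"


def memento_request(target_url, when):
    """Return the TimeGate URL and headers for a datetime negotiation request.

    `when` must be a timezone-aware datetime; it is converted to UTC because
    Accept-Datetime values are expressed in GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE + target_url, headers


url, headers = memento_request(
    "http://www.bl.uk/",
    datetime(2007, 5, 31, 20, 35, tzinfo=timezone.utc),
)
print(url)
print(headers["Accept-Datetime"])  # Thu, 31 May 2007 20:35:00 GMT
```

A TimeGate that implements the protocol would normally answer with a redirect to the selected archived copy, which any HTTP client can then follow.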


Big UK Domain Data for Arts and Humanities

Funded by the UK Arts and Humanities Research Council as one of the 21 “Big Data” projects

Collaboration between the Institute of Historical Research, Oxford Internet Institute, British Library and Aarhus University

Develop a theoretical and methodological framework for the study of web archives

Build on AADDA: researchers and the BL co-produce access tools

A major study of the history of UK web space from 1996 to 2013 + sub-projects covering a range of disciplines

Also an online training course and peer-reviewed journal articles


Web archiving researcher bursaries


Shine

Query building

Corpus formation and handling

Annotation and curation

In-corpus analysis

Whole-dataset analysis


What’s in it for us?

Helps researchers understand the value of web archives and explore new ways of using these for scholarly research

Allows BL to obtain hands-on experience with indexing and processing large-scale web archive datasets

Prototype analytics and visualisations can be applied to our own Legal Deposit collection

Enables BL to participate in various UK, European and international projects

Helps curators understand characteristics of large-scale digital corpora

Improves the way we collect and store web archives


The go-to state

Web archives for reference AND for analytics

Base-line knowledge self-explanatory

Focus on national events for curated collections; provide means to assemble research corpora

Link to what we do not have

Offer a bag of tools to support scholarly use

Exploit open licences, changes to copyright law

Online access to selected websites, metadata and secondary datasets (http://data.webarchive.org.uk/opendata)

The British Library Collection Development Policy for websites (http://www.bl.uk/aboutus/stratpolprog/digi/webarch/bl_collection_development_policy_v3-0.pdf)

Lobbying – review of Non-print Legal Deposit Regulations in 2018

Education
