Building a Collection of the Historical UK Web for Scholarly Use
Helen Hockx-Yu
Head of Web Archiving, British Library
www.bl.uk 2
The UK Web Domain
4th largest TLD after .com, .de and .net
Over 10 million registered .uk domains
UK organisations also use non .uk domain
names (eg .com or .org) – scale unknown
Non-print Legal Deposit (since April 2013) applies to
the open (freely available) web: .uk and other UK-published (non
.uk) websites, such as .com, .org…
also e-journals, e-books, news web pages and other digital
publications, either by harvesting or mutual agreement on other
delivery methods
Web Archiving at the British Library
Collect UK digital heritage and provide continued access to archived
web resources
Started web archiving in 2003: Open UK Web Archive
Selective, topical collections and key sites
Consortium sharing infrastructure and development effort;
agreement on who collects what
Curating collections with organisations and researchers
Archiving UK Web for non-print Legal Deposit since April 2013: Legal
Deposit UK Web Archive
Comprehensive national archive with on-site access only
Joint responsibility of six Legal Deposit Libraries (LDLs)
Collecting strategy for websites
Domain crawl:
• Broad sweep of UK domain
• Once or twice a year
Events & key sites and news:
• Events of UK interest
• High value, high impact sites
• National & regional news
Special Collection:
• Focused, thematic collections
• Support priority subjects
UK websites – territoriality explained
An online work is considered as “published in the UK” and therefore in
scope for Legal Deposit, if it meets either of the following criteria:
(a) it is made available to the public from a website with a domain
name which relates to the United Kingdom or to a place within the United
Kingdom; or
(b) it is made available to the public by a person and any of that
person’s activities relating to the creation or the publication of the work
take place within the United Kingdom
The Legal Deposit Libraries (Non-Print Works) Regulations, 2013
Territoriality - implementation
All websites with a .uk domain name
Including embedded content (eg CSS, images) regardless of
where it is hosted
Non .uk websites have to meet at least one criterion
UK Hosting: check external IP geo-location database and
add in-scope URLs to the fetch-chain
UK postal address
Correspondence
Professional judgement
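The scoping rules above can be sketched as a small check. This is only an illustration, not the Library's crawler code: `is_in_scope`, `hosted_in_uk` and `curator_approved` are hypothetical names, the geo-location lookup is passed in as a stand-in for the external IP geo-location database, and the manual criteria (UK postal address, correspondence, professional judgement) are assumed to be recorded as a set of approved hosts.

```python
from urllib.parse import urlsplit


def is_in_scope(url, hosted_in_uk=lambda host: False, curator_approved=frozenset()):
    """Return True if the URL looks in scope under the rules on the slide.

    hosted_in_uk: hypothetical callable standing in for a geo-IP lookup.
    curator_approved: hypothetical set of hosts accepted manually
    (UK postal address, correspondence, professional judgement).
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    # Rule 1: every website with a .uk domain name is in scope.
    if host.endswith(".uk"):
        return True
    # Rule 2: non .uk hosts need to meet at least one criterion.
    return hosted_in_uk(host) or host in curator_approved
```

Embedded content of an in-scope page (eg CSS, images) is fetched regardless of host, so a real crawler would bypass this check for such resources; the sketch leaves that out.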
UK Domain Crawl
2013 domain crawl stats
3.86 million seeds
1.9 billion URLs (web pages,
docs, images)
~31TB
Duration: 70 days
2014 domain crawl
90 million seeds (starting URLs)
Started on 19th June 2014
Collected 52TB of data by 9th
December (incl. 4.4GB of
viruses & 3TB of homepage
screenshots)
Nearly 2 million non .uk
domains
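To give a feel for the scale, the 2013 figures above can be turned into rough averages. This is a back-of-envelope sketch that assumes decimal terabytes (1TB = 10^12 bytes) and a constant crawl rate over the 70 days; neither assumption comes from the slides. It works out at roughly 16 KB per URL and about 27 million URLs per day.

```python
# 2013 domain crawl figures as quoted on the slide.
TOTAL_URLS = 1.9e9        # web pages, docs, images
TOTAL_BYTES = 31e12       # ~31TB, assuming decimal terabytes
DURATION_DAYS = 70

# Average stored size per URL, in bytes.
avg_bytes_per_url = TOTAL_BYTES / TOTAL_URLS

# Average crawl rate, assuming it was constant over the whole run.
urls_per_day = TOTAL_URLS / DURATION_DAYS
urls_per_second = urls_per_day / (24 * 60 * 60)

print(f"average size per URL: {avg_bytes_per_url / 1000:.1f} KB")
print(f"average rate: {urls_per_day / 1e6:.1f} million URLs/day "
      f"({urls_per_second:.0f} URLs/second)")
```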
The “access” paradoxes
Completeness versus openness of web archives
Legal Deposit national collections have restricted access
Document-centred versus data-driven
Essentially a scale issue
Pre-selected or defined collections not relevant to all researchers;
difficulty in finding relevant content in large scale web archive.
Arbitrary (national) boundaries often irrelevant to research question
but most heritage institutions operate within certain geographical
areas
…
Web archive as historical document
www.bl.uk 10
Collaboration with researchers
Building collections
Researchers’ involvement in
scoping collections, selecting
and describing websites
Creation of specific (narrow)
topical collections
Formulating research question
Brain-storm sessions, workshops, discussion, surveys etc.
Lack of awareness & baseline knowledge
Challenging: you don’t know what you don’t know
Co-development of access services
This is changing how we collect and store data
JISC UK Web Domain dataset (1996-2013)
Collaboration between the Internet Archive (IA), the Joint Information Systems
Committee (JISC) and the British Library
Extracted copies of UK websites from the Internet Archive’s collection
1st tranche : 1996 – 2010, 30TB, 2.5 billion URLs
2nd tranche: 2010 – April 2013, 27.5TB, 1.5 billion URLs (estimated)
Research agreement between JISC and IA, upholding IA’s Terms of Use
Access via IA’s Wayback Machine
Allows replication / extraction of derivative or secondary datasets
BL hosts the dataset on behalf of JISC
Data used by research projects
Institute of Historical Research project: Analytical Access to the Domain Dark
Archive (AADDA)
Oxford Internet Institute project: Big data for political science
Completed work
Analytical Access to the Domain Dark Archive Project
Use cases & experimental UI
Demonstrating the Value of the UK Web Domain Dataset for Social
Science Research
Analysis of link graph
Paper accepted for WebSci’14: Mapping the UK Webspace:
Fifteen Years of British Universities on the Web
MA thesis by Jules Mataly: The Three Truths of Margaret Thatcher:
Creating and Analysing
Secondary datasets under open licence
Format profile, Geoindex, Host Link Graph
Exploring Host Link Graph
Courtesy of Peter Webster, Rainer Simon and Jules Mataly
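A host link graph collapses page-to-page links into host-to-host edges with counts. The sketch below shows the idea only; the input format (pairs of source and target URLs) and the names `host_of` and `host_link_graph` are assumptions for illustration, not the format or code of the published secondary dataset.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def host_link_graph(page_links):
    """Aggregate (source_url, target_url) pairs into host-level edge counts."""
    edges = Counter()
    for source_url, target_url in page_links:
        src, dst = host_of(source_url), host_of(target_url)
        # Skip unparseable URLs and links that stay within one host.
        if src and dst and src != dst:
            edges[(src, dst)] += 1
    return edges


links = [
    ("http://www.bl.uk/a", "http://www.jisc.ac.uk/"),
    ("http://www.bl.uk/b", "http://www.jisc.ac.uk/x"),
    ("http://www.bl.uk/a", "http://www.bl.uk/b"),
]
graph = host_link_graph(links)
```

Grouping the pairs by crawl year before counting would give one graph per year, which is the kind of input a visualisation of links over time needs.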
Visualising links (to and from bl.uk)
Interactive version
How it is done
Evolution of the UK web (2004 -2013)
Memento service
Big UK Domain Data for Arts and Humanities
Funded by the UK Arts and Humanities Research Council as one of
the 21 “Big Data” projects
Collaboration between the Institute of Historical Research, Oxford
Internet Institute, British Library and Aarhus University
Develop theoretical and methodological framework for the study of
web archives
Build on AADDA: researchers and the BL co-produce access tools
A major study of the history of UK web space from 1996 to 2013 +
sub-projects covering a range of disciplines
Also an online training course and peer-reviewed journal articles.
Web archiving researcher bursaries
Query building
Corpus formation and
handling
Annotation and curation
In-corpus analysis
Whole-dataset analysis
Shine
What’s in it for us?
Helps researchers understand the value of web archives and explore new
ways of using these for scholarly research
Allows BL to obtain hands-on experience with indexing and processing
large scale web archive datasets
Prototype analytics and visualisations can be applied to our own Legal
Deposit collection
Enables BL to participate in various UK, European and international
projects
Helps curators understand characteristics of large scale digital corpora
Improves the way we collect and store web archives
Web archives for reference AND for
analytics
Base-line knowledge self-explanatory
Focus on national events for curated
collections; provide means to assemble
research corpora
Link to what we do not have
Offer a bag of tools to support scholarly use
The go-to state
Exploit open licences, changes to copyright law
Online access to selected websites, metadata and secondary datasets
The British Library Collection Development Policy for websites
Lobbying – review of Non-print Legal Deposit Regulations in 2018