Presentations from Oxford Internet Institute, the Internet Archive, and Hanzo Archives Ltd presenting the results of a JISC-NEH funded transatlantic digitisation project.
Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10–4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH
Afternoon Event:
1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities
OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer)
The Internet Archive: Extracting the data (Molly Bragg)
Hanzo Archives Ltd.: Working with the data (Mark Middleton)
Discussion and questions
Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238
Selecting and Analysing the WWI and WWII collections
Christine MadsenEric Meyer
19 March 2009
Why WWI and WWII?
Many branches of the humanities
History, journalism, art, art history, advertising, literature, poetry, political science, military history
Why WWI and WWII?
Well-rounded set of materials
Why WWI and WWII?
Language
Doc types
Top-level domains
Secondary domains
Building the Collection
Supplemented with keyword searches in the Archive
Selected from the live web
Building the Collection
Seeds are:
the website or portion of the website that you plan to include in your collection
[Diagram: Initial Collection — Seeds 1–3]
Building the Collection
[Diagram: Seeds 1–6, each linking out to additional www sites]
Expanded Collection
A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site
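The link-discovery idea behind seed expansion can be sketched in a few lines of Python. This is a simplified illustration only, not the Archive's actual crawler; the sample HTML snippet is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

# Hypothetical seed page content, for illustration only
seed_url = "http://www.greatwar.co.uk/westfront/Somme/index.htm"
html = '<a href="battle.htm">Battle</a> <a href="http://example.org/ww1">WW1</a>'

parser = LinkExtractor(seed_url)
parser.feed(html)

# Links pointing off the seed's own host are candidates for expanding the collection
seed_host = urlparse(seed_url).netloc
candidates = {u for u in parser.links if urlparse(u).netloc != seed_host}
print(sorted(candidates))  # → ['http://example.org/ww1']
```

Relative links stay within the seed site; absolute links to other hosts are the ones that let a curator discover new candidate seeds.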
Building the Collection
Started with WWI
Too small (under 1,000,000 pages/objects)
Target was 250 million
Building the Collection
Expanded to WWII
Final collection: 5,362,425 unique URLs
Building the Collection
‘World War One’
‘World War I’
‘First world war’
‘World War II’
‘World War Two’
‘the great war’
‘Première Guerre Mondiale’
‘zweiter Weltkrieg’
Building the Collection
Record links from first 20 pages of search
Following links [include dead links]
Returning to ‘hub’ sites for further analysis
Building the Collection
http://www.greatwar.co.uk/westfront/Somme/index.htm
http://www.greatwar.co.uk
Expanding scope
Building the Collection
memory.loc.gov/ammem/collections/maps/wwii/index.html
www.memory.loc.gov/ammem/collections/maps/wwii/
Expanding scope
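The scope-expansion step above — moving from a single deep page to a broader directory or whole site — can be sketched with the standard library's `urllib.parse`. This is a hypothetical helper illustrating the curatorial decision, not the project's actual tooling.

```python
from urllib.parse import urlparse

def broaden_seed(url, levels=1):
    """Drop the last `levels` path components so the seed covers a
    wider slice of the site (e.g. index.html -> its directory)."""
    parts = urlparse(url)
    path = [p for p in parts.path.split("/") if p]
    kept = path[:-levels] if levels < len(path) else []
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(kept) + ("/" if kept else "")

url = "http://memory.loc.gov/ammem/collections/maps/wwii/index.html"
print(broaden_seed(url))            # → http://memory.loc.gov/ammem/collections/maps/wwii/
print(broaden_seed(url, levels=4))  # → http://memory.loc.gov/ammem/
```

Each extra level trades precision for coverage — exactly the trade-off the curators faced when deciding how much of a site to include.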
Building the Collection
www.eyewitnesstohistory.com/ <= don’t want whole site
www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm
Dealing with illogical or flat directory structures
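The two scoping strategies above — directory prefixes for hierarchical sites, explicit page lists for flat ones — can be combined in a single scope test. A minimal sketch, with hypothetical example URLs; the real collection tooling is not shown in these slides.

```python
def in_scope(url, prefixes=(), allowlist=()):
    """A URL is in scope if it matches a directory prefix (hierarchical
    sites) or appears in an explicit page allowlist (flat sites, where a
    whole-site prefix would pull in unrelated content)."""
    return url in allowlist or any(url.startswith(p) for p in prefixes)

# Hierarchical site: one prefix captures the whole relevant section
assert in_scope("http://www.greatwar.co.uk/westfront/Somme/index.htm",
                prefixes=("http://www.greatwar.co.uk/",))

# Flat site: enumerate only the wanted pages
allow = {
    "http://www.eyewitnesstohistory.com/dday.html",
    "http://www.eyewitnesstohistory.com/dunkirk.htm",
}
assert in_scope("http://www.eyewitnesstohistory.com/dday.html", allowlist=allow)
assert not in_scope("http://www.eyewitnesstohistory.com/menu.htm", allowlist=allow)
```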
Building the Collection
• Stop when most results are redundant
• Narrow in on more specific topics
Building the Collection
• Materials in foreign languages
– Focused on German sites
– Consider local conventions, not just translations
WWII (zweiter Weltkrieg)
The period of National Socialism (Zeit des Nationalsozialismus)
The period in which the Nazis ruled (Nazizeit)
• Other foreign languages were included, but not sought after
Belarusian; Catalan/Valencian; Chamorro; Czech; Danish; German; Dzongkha; English; Spanish/Castilian; Finnish; French; Hebrew; Hungarian; Italian; Japanese; Luba-Katanga; Dutch/Flemish; Polish; Portuguese; Russian; Slovenian; Turkish; Ukrainian; Chinese
Building the Collection
The World Wide Web of Humanities “Extracting The Data”
St Anne's College, OxfordMarch 19, 2009
Molly Bragg, Partner Specialist
Web Group
The Internet Archive
Agenda
Brief Introduction to IA’s Web Archives
Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study
Recommendations for Future Research and Tools Development Efforts
Brief Introduction to IA’s Web Archives
The Archive’s combined collections receive over 6 million downloads a day!
www.archive.org
The Internet Archive is…
Web Pages Educational Courseware Films & Videos Music & Spoken Word Books & Texts Software Images
A digital library of ~4 petabytes of information
IA Web Archives
1.6+ petabytes of primary data (compressed)
150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present
Includes captures from every domain
Encompasses content in over 40 languages
As of 2009, IA will add ½ to 1 petabyte of data to these collections each year.
Discipline Specific Data Extraction from Longitudinal Web Archives:
The WWWoH Case Study
WWWoH Case Study
http://neh-access.archive.org/neh/
WWWoH Case Study
Unique URLs in the collection: 5,362,425
Total number of captures: 23,006,857
Captures span: May, 1996 to Aug, 2008
Total size of compressed data: ~250 GB
The Data Extraction Process
Oxford Internet Institute selected relevant sites/URLs
Identified all captures related to the seeds
Identified all files embedded in each capture (on & off seed domains) for extraction
Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
The Data Extraction Process
Relevant URLs not identified as seeds were not extracted.
Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection.
Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links.
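One middle ground between harvesting all outbound links and fully manual curation is to auto-accept links whose host is already a seed and queue everything else for review. A sketch of that policy, with hypothetical data; this is an assumption about a workable compromise, not IA's actual pipeline.

```python
from urllib.parse import urlparse

def triage_outlinks(outlinks, seed_hosts):
    """Split outbound links into auto-accepted (host already a seed)
    and queued-for-review (everything else)."""
    accepted, review = [], []
    for url in outlinks:
        (accepted if urlparse(url).netloc in seed_hosts else review).append(url)
    return accepted, review

seed_hosts = {"www.greatwar.co.uk", "memory.loc.gov"}
outlinks = [
    "http://www.greatwar.co.uk/somme.htm",
    "http://example.org/unrelated",
]
accepted, review = triage_outlinks(outlinks, seed_hosts)
print(accepted)  # → ['http://www.greatwar.co.uk/somme.htm']
print(review)    # → ['http://example.org/unrelated']
```

This keeps the automated step conservative (no new hosts enter silently) while shrinking the manual review queue to genuinely new hosts.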
WWWoH Case Study: WWI
Number of Seeds: 2263
Unique Hosts: 906
Number of Links: 143+ million
WWI: Example
WWWoH Case Study: WWII
Number of Seeds: 2592
Unique Hosts: 1475
Number of Links: 252+ million
WWII: Example
Challenges
Identifying subject matter-specific resources of interest for an extraction, and then automating those procedures
Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise
Available tools for collection building and access are too technically focused for the average humanities scholar
Recommendations for Future Research and Tools Development Efforts
Implications for Future Research
Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest
Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input
Ideas/Concepts to Explore: Nomination Tools
Opportunities
Extractions make it easier for humanities scholars to locate and assemble source materials of interest.
These collections can accelerate and/or augment discipline-specific research efforts.
Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another.
Thank You!
http://neh-access.archive.org/neh/
Molly Bragg, Partner Specialist
The Internet Archive, Web Group
www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
Search and Analysis of Data in WWWoH
Mark Middleton
Agenda
Brief introduction to Hanzo
Open Source Search-Tools: a toolkit for implementing analytical applications using web archives
WWWoH — working with the data
Recommendations for future research
Recommendations for future tools development
WWWoH Tools Deliverables
Introduction to Hanzo
Hanzo Archives Limited
Web Archiving Services
Company websites and intranets
Litigation support
E-Discovery
IP protection
Focus on legally defensible web archives of exceptional quality
Very advanced crawlers and access tools: dynamic html, video, flash, web 2.0
Some public archives
Mainly closed archives
Hanzo Archiving Technology
Need advanced capabilities very quickly — continuous product innovation
Rapid development of tools
Create research and open source projects to promote mainstream awareness of web archives and web archiving technology
Open source projects include
WARC Tools
Search Tools
WWWoH and Development of Open Source Search-Tools
Objectives
Deliver an open source search engine for web archives that is simple to extend, easy to install and deploy
Integrate with WARC Tools, the open source web archive file manipulation tools (Hanzo and IIPC)
Extend the search engine with interesting directives and options
Extend the search engine to provide data to analytical tools, develop an API, tools, and exemplar analytical tools
Encourage third party analytical tools to use web archives as their data repository
Migrate WWWoH extraction from ARC to WARC and ingest into Search Tools
Full Text Search
Implemented FT search on top of WARC Tools — the toolkit for manipulating ISO-28500 WARC files
Reviewed several options: Java Lucene (and clones), Xapian, DB indexing (Sphinx, OpenFTS), etc.
Criteria: vibrant development community, extensible (searching web archives is different: temporal dimension, duplicate handling, etc.), fast and full-featured (boolean, time queries, ability to index multiple fields, query language)
Component Architecture
Full text search engine based on Open Source Ferret
Knowledge Base stores search results
Python application with Django model and Django WUI
Memcache
Plug-in architecture to support multiple analytical applications
Ferret
Ferret is FAST, both indexing and searching
Highly scalable, up to 100 million documents on a single CPU
Supports distributed search
Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields
Ferret Query Language
[Chart: indexing speed (documents/s), Ferret vs Lucene — http://ferret.davebalmain.com/trac/wiki/FerretVsLucene]
Advanced Search
url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’
date: [2001 2002] -- search within date range
tag: wwwoh -- search content with the tag ‘wwwoh’
title: (+wilfred +owen) -- search for Wilfred and Owen within the title
domain: fr -- restrict search to within .fr domain
Working with the Data
Migrating ARC to WARC
Data extracted from IA in ARC files
Hanzo WARC Tools and Search Tools projects combined enabled us to migrate ARC to WARC files (WARC is the new ISO standard):
Some challenges: broken ARCs, scale, etc.
3,264 WARC files
Programmable Access to Data
WARC Tools and Search Tools provide a rich collection of programmable tools to enable analytics tools developers to use web archives:
Object-oriented C, REST API, fast iterators
Command lines for manipulating WARCs, indexing, searching
Web applications for browsing, searching, demonstrator analytics
C/C++, Python, Ruby, Perl, … and if you need to, Java, C#
Demonstration: the web applications
http://wwwoh.hanzoarchives.com/
Analytical Tools
Frequency Tables for:
Domains, MIME Types, Countries
Graphing Tools:
GUESS -- an exploratory data analysis and visualization tool for graphs and networks
Graphviz -- makes diagrams in several formats: images and SVG for web pages, Postscript; or display in an interactive graph browser
Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs and to layout hyperbolic trees
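The frequency tables listed above can be built directly from capture metadata with the standard library's `collections.Counter`. A toy sketch with made-up capture records; the field layout is an assumption, not the Search Tools schema.

```python
from collections import Counter
from urllib.parse import urlparse

# Toy (url, mime_type) records standing in for real archive metadata
captures = [
    ("http://www.greatwar.co.uk/somme.htm", "text/html"),
    ("http://www.greatwar.co.uk/map.gif", "image/gif"),
    ("http://memory.loc.gov/ammem/wwii/", "text/html"),
]

# One counter per table: domains and MIME types
domain_freq = Counter(urlparse(u).netloc for u, _ in captures)
mime_freq = Counter(m for _, m in captures)

print(domain_freq.most_common())  # → [('www.greatwar.co.uk', 2), ('memory.loc.gov', 1)]
print(mime_freq.most_common())    # → [('text/html', 2), ('image/gif', 1)]
```

A country table would follow the same pattern once each host is mapped to a country (e.g. by top-level domain).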
Graphing Tools
Recommendations for Future Research and Tools Development
Future Research
Faster, richer analytics
Rich API for analytics, to be developed in collaboration with IA, other archives, and IIPC
Temporal analytics and techniques
Link and network graphing and analytics
Enhance outreach/dissemination to the mainstream development community and research community
Future Tools Development
Multi-machine indexing and application engine
Tighter integration of graphing tools, with more user parameters and configurations
Temporal analysis (animation of link graphs over time)
Enhance WARC Tools integration and investigate interoperability with other IIPC toolsets
Developer documentation
Analyst/researcher documentation
Installation tools for Linux, Mac OS X and Windows XP/Vista
Deliverables at End March 2009
Deliverables
The Search Tools project home is http://code.google.com/p/search-tools/
Source code
Documentation
Issue management
Mailing list
The WARC Tools project home is http://code.google.com/p/warc-tools/
The prototype application is http://wwwoh.hanzoarchives.com/
Thank You
Hanzo Archives Limited
+44 20 8816 8226
www.hanzoarchives.com