Presentations from Oxford Internet Institute, the Internet Archive, and Hanzo Archives Ltd presenting the results of a JISC-NEH funded transatlantic digitisation project.
Slides from Humanities on the Web: Is it working?
Date: Thursday, 19 March 2009, 10–4
Location: Oxford University, Oxford, UK
Webcast URL: http://webcast.oii.ox.ac.uk/?view=Webcast&ID=20090319_275
Slide URL: http://www.slideshare.net/etmeyer/WWWoH
Afternoon Event:
1:30 – 2:45: JISC/NEH Transatlantic Digitisation Collaboration Programme in conjunction with the Internet Archive: The World Wide Web of Humanities
OII: Selecting and analysing the sample WWI and WWII collections (Christine Madsen & Dr. Eric Meyer)
The Internet Archive: Extracting the data (Molly Bragg)
Hanzo Archives Ltd.: Working with the data (Mark Middleton)
Discussion and questions
Full details: http://www.oii.ox.ac.uk/events/details.cfm?id=238
Selecting and Analysing the WWI and WWII collections
Christine MadsenEric Meyer
19 March 2009
Why WWI and WWII?
Many branches of the humanities
History, journalism, art, art history, advertising, literature, poetry, political science, military history
Why WWI and WWII?
Well-rounded set of materials
Why WWI and WWII?
Language
Doc types
Top-level domains
Secondary domains
Building the Collection
Supplemented with keyword searches in the Archive
Selected from the live web
Building the Collection
Seeds are:
the website or portion of the website that you plan to include in your collection
[Diagram: Initial Collection — Seeds 1–3]
Building the Collection
[Diagram: Seeds 1–6, each linking out to additional www sites]
Expanded Collection
A seed is also a web site from which additional sites can be discovered via the hyperlinks of the site
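The link-discovery idea behind seed expansion can be sketched in a few lines of Python. This is a simplified illustration only, not the Archive's actual crawler; the sample HTML snippet is hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href=...> in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))

# Hypothetical seed page content, for illustration only
seed_url = "http://www.greatwar.co.uk/westfront/Somme/index.htm"
html = '<a href="battle.htm">Battle</a> <a href="http://example.org/ww1">WW1</a>'

parser = LinkExtractor(seed_url)
parser.feed(html)

# Links pointing off the seed's own host are candidates for expanding the collection
seed_host = urlparse(seed_url).netloc
candidates = {u for u in parser.links if urlparse(u).netloc != seed_host}
print(sorted(candidates))  # → ['http://example.org/ww1']
```

Relative links stay within the seed site; absolute links to other hosts are the ones that let a curator discover new candidate seeds.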
Building the Collection
Started with WWI
Too small (under 1,000,000 pages/objects)
Target was 250 million
Building the Collection
Expanded to WWII
Final collection: 5,362,425 unique URLs
Building the Collection
‘World War One’
‘World War I’
‘First world war’
‘World War II’
‘World War Two’
‘the great war’
‘Première Guerre Mondiale’
‘zweiter Weltkrieg’
Building the Collection
Record links from first 20 pages of search
Following links [include dead links]
Returning to ‘hub’ sites for further analysis
Building the Collection
http://www.greatwar.co.uk/westfront/Somme/index.htm
http://www.greatwar.co.uk
Expanding scope
Building the Collection
memory.loc.gov/ammem/collections/maps/wwii/index.html
www.memory.loc.gov/ammem/collections/maps/wwii/
Expanding scope
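The scope-expansion step above — moving from a single deep page to a broader directory or whole site — can be sketched with the standard library's `urllib.parse`. This is a hypothetical helper illustrating the curatorial decision, not the project's actual tooling.

```python
from urllib.parse import urlparse

def broaden_seed(url, levels=1):
    """Drop the last `levels` path components so the seed covers a
    wider slice of the site (e.g. index.html -> its directory)."""
    parts = urlparse(url)
    path = [p for p in parts.path.split("/") if p]
    kept = path[:-levels] if levels < len(path) else []
    return f"{parts.scheme}://{parts.netloc}/" + "/".join(kept) + ("/" if kept else "")

url = "http://memory.loc.gov/ammem/collections/maps/wwii/index.html"
print(broaden_seed(url))            # → http://memory.loc.gov/ammem/collections/maps/wwii/
print(broaden_seed(url, levels=4))  # → http://memory.loc.gov/ammem/
```

Each extra level trades precision for coverage — exactly the trade-off the curators faced when deciding how much of a site to include.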
Building the Collection
www.eyewitnesstohistory.com/ <= don’t want whole site
www.eyewitnesstohistory.com/blitzkrieg.htm
www.eyewitnesstohistory.com/dday.html
www.eyewitnesstohistory.com/midway.htm
www.eyewitnesstohistory.com/airbattle.htm
www.eyewitnesstohistory.com/dunkirk.htm
www.eyewitnesstohistory.com/francesurrenders.htm
Dealing with illogical or flat directory structures
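The two scoping strategies above — directory prefixes for hierarchical sites, explicit page lists for flat ones — can be combined in a single scope test. A minimal sketch, with hypothetical example URLs; the real collection tooling is not shown in these slides.

```python
def in_scope(url, prefixes=(), allowlist=()):
    """A URL is in scope if it matches a directory prefix (hierarchical
    sites) or appears in an explicit page allowlist (flat sites, where a
    whole-site prefix would pull in unrelated content)."""
    return url in allowlist or any(url.startswith(p) for p in prefixes)

# Hierarchical site: one prefix captures the whole relevant section
assert in_scope("http://www.greatwar.co.uk/westfront/Somme/index.htm",
                prefixes=("http://www.greatwar.co.uk/",))

# Flat site: enumerate only the wanted pages
allow = {
    "http://www.eyewitnesstohistory.com/dday.html",
    "http://www.eyewitnesstohistory.com/dunkirk.htm",
}
assert in_scope("http://www.eyewitnesstohistory.com/dday.html", allowlist=allow)
assert not in_scope("http://www.eyewitnesstohistory.com/menu.htm", allowlist=allow)
```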
Building the Collection
• Stop when most results are redundant
• Narrow in on more specific topics
Building the Collection
• Materials in foreign languages
– Focused on German sites
– Consider local conventions, not just translations
WWII (zweiter Weltkrieg)
The period of National Socialism (Zeit des Nationalsozialismus)
The period in which the Nazis ruled (Nazizeit)
• Other foreign languages were included, but not sought after
Belarusian; Catalan/Valencian; Chamorro; Czech; Danish; German; Dzongkha; English; Spanish/Castilian; Finnish; French; Hebrew; Hungarian; Italian; Japanese; Luba-Katanga; Dutch/Flemish; Polish; Portuguese; Russian; Slovenian; Turkish; Ukrainian; Chinese
Building the Collection
The World Wide Web of Humanities “Extracting The Data”
St Anne's College, OxfordMarch 19, 2009
Molly Bragg, Partner Specialist
Web Group
The Internet Archive
Agenda
Brief Introduction to IA’s Web Archives
Discipline Specific Data Extraction from Longitudinal Web Archives: The WWWoH Case Study
Recommendations for Future Research and Tools Development Efforts
Brief Introduction to IA’s Web Archives
The Archive’s combined collections receive over 6 million downloads a day!
www.archive.org
The Internet Archive is…
Web Pages Educational Courseware Films & Videos Music & Spoken Word Books & Texts Software Images
A digital library of ~4 petabytes of information
IA Web Archives
1.6+ petabytes of primary data (compressed)
150+ billion URIs, culled from 85+ million sites, harvested from 1996 to the present
Includes captures from every domain
Encompasses content in over 40 languages
As of 2009, IA will add ½ to 1 petabyte of data to these collections each year.
Discipline Specific Data Extraction from Longitudinal Web Archives:
The WWWoH Case Study
WWWoH Case Study
http://neh-access.archive.org/neh/
WWWoH Case Study
Unique URLs in the collection: 5,362,425
Total number of captures: 23,006,857
Captures span: May, 1996 to Aug, 2008
Total size of compressed data: ~250 GB
The Data Extraction Process
Oxford Internet Institute selected relevant sites/URLs
Identified all captures related to the seeds
Identified all files embedded in each capture (on & off seed domains) for extraction
Attempted to locate additional candidate seed URLs/domains for inclusion in the collection using outbound link data
The Data Extraction Process
Relevant URLs not identified as seeds were not extracted.
Automatically harvesting ALL outbound links can capture relevant non-seed URLs; however, it can also introduce a large amount of extraneous content into the collection.
Manually curating outbound links excludes non-relevant content; however, it can be an overwhelming task due to the volume of links.
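One middle ground between harvesting all outbound links and fully manual curation is to auto-accept links whose host is already a seed and queue everything else for review. A sketch of that policy, with hypothetical data; this is an assumption about a workable compromise, not IA's actual pipeline.

```python
from urllib.parse import urlparse

def triage_outlinks(outlinks, seed_hosts):
    """Split outbound links into auto-accepted (host already a seed)
    and queued-for-review (everything else)."""
    accepted, review = [], []
    for url in outlinks:
        (accepted if urlparse(url).netloc in seed_hosts else review).append(url)
    return accepted, review

seed_hosts = {"www.greatwar.co.uk", "memory.loc.gov"}
outlinks = [
    "http://www.greatwar.co.uk/somme.htm",
    "http://example.org/unrelated",
]
accepted, review = triage_outlinks(outlinks, seed_hosts)
print(accepted)  # → ['http://www.greatwar.co.uk/somme.htm']
print(review)    # → ['http://example.org/unrelated']
```

This keeps the automated step conservative (no new hosts enter silently) while shrinking the manual review queue to genuinely new hosts.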
WWWoH Case Study: WWI
Number of Seeds: 2263
Unique Hosts: 906
Number of Links: 143+ million
WWI: Example
WWWoH Case Study: WWII
Number of Seeds: 2592
Unique Hosts: 1475
Number of Links: 252+ million
WWII: Example
Challenges
Identifying subject matter-specific resources of interest for an extraction, and then automating those procedures
Tools are missing from the workflow that might make the initial scoping of an extraction easier to define and revise
Available tools for collection building and access are too technically focused for the average humanities scholar
Recommendations for Future Research and Tools Development Efforts
Implications for Future Research
Need link and web graphing tools that use inbound and outbound link data to identify further resources of interest
Need to experiment with a more diverse range of UI navigational paradigms that address the dimension of time and curatorial input
Ideas/Concepts to Explore: Nomination Tools
Opportunities
Extractions make it easier for humanities scholars to locate and assemble source materials of interest.
These collections can accelerate and/or augment discipline-specific research efforts.
Extractions can encourage distributed collaboration and cooperation between entities who might not otherwise be aware of one another.
Thank You!
http://neh-access.archive.org/neh/
Molly Bragg, Partner Specialist
The Internet Archive, Web Group
www.hanzoarchives.com ◀ Copyright © 2009 Hanzo Archives Limited.
Search and Analysis of Data in WWWoH
Mark Middleton
Agenda
Brief introduction to Hanzo
Open Source Search-Tools: a toolkit for implementing analytical applications using web archives
WWWoH — working with the data
Recommendations for future research
Recommendations for future tools development
WWWoH Tools Deliverables
Introduction to Hanzo
Hanzo Archives Limited
Web Archiving Services
Company websites and intranets
Litigation support
E-Discovery
IP protection
Focus on legally defensible web archives of exceptional quality
Very advanced crawlers and access tools: dynamic html, video, flash, web 2.0
Some public archives
Mainly closed archives
Hanzo Archiving Technology
Need advanced capabilities very quickly — continuous product innovation
Rapid development of tools
Create research and open source projects to promote mainstream awareness of web archives and web archiving technology
Open source projects include
WARC Tools
Search Tools
WWWoH and Development of Open Source Search-Tools
Objectives
Deliver an open source search engine for web archives that is simple to extend, easy to install and deploy
Integrate with WARC Tools, the open source web archive file manipulation tools (Hanzo and IIPC)
Extend the search engine with interesting directives and options
Extend the search engine to provide data to analytical tools, develop an API, tools, and exemplar analytical tools
Encourage third party analytical tools to use web archives as their data repository
Migrate WWWoH extraction from ARC to WARC and ingest into Search Tools
Full Text Search
Implemented FT search on top of WARC Tools — the toolkit for manipulating ISO-28500 WARC files
Reviewed several options: Java Lucene (and clones), Xapian, DB indexing (Sphinx, OpenFTS), etc.
Criteria: vibrant development community, extensible (searching web archives is different: temporal dimension, duplicate handling, etc.), fast and full-featured (boolean, time queries, ability to index multiple fields, query language)
Component Architecture
Full text search engine based on Open Source Ferret
Knowledge Base stores search results
Python application with Django model and Django WUI
Memcache
Plug-in architecture to support multiple analytical applications
Ferret
Ferret is FAST, both indexing and searching
Highly scalable, up to 100 million documents on a single CPU
Supports distributed search
Phrase search, proximity ranking, stemming in several languages, stopwords, multiple document fields
Ferret Query Language
[Chart: indexing speed (documents/s), Ferret vs Lucene — http://ferret.davebalmain.com/trac/wiki/FerretVsLucene]
Advanced Search
url: (+bbc +wwii) -- search for URLs containing both ‘bbc’ and ‘wwii’
date: [2001 2002] -- search within date range
tag: wwwoh -- search content with the tag ‘wwwoh’
title: (+wilfred +owen) -- search for Wilfred and Owen within the title
domain: fr -- restrict search to within .fr domain
Working with the Data
Migrating ARC to WARC
Data extracted from IA in ARC files
Hanzo WARC Tools and Search Tools projects combined enabled us to migrate ARC to WARC files (WARC is the new ISO standard):
Some challenges: broken ARCs, scale, etc.
3,264 WARC files
Programmable Access to Data
WARC Tools and Search Tools provide a rich collection of programmable tools to enable analytics tools developers to use web archives:
Object-oriented C, REST API, fast iterators
Command lines for manipulating WARCs, indexing, searching
Web applications for browsing, searching, demonstrator analytics
C/C++, Python, Ruby, Perl, … and if you need to, Java, C#
Demonstration: the web applications
http://wwwoh.hanzoarchives.com/
Analytical Tools
Frequency Tables for:
Domains, MIME Types, Countries
Graphing Tools:
GUESS -- an exploratory data analysis and visualization tool for graphs and networks
Graphviz -- makes diagrams in several formats: images and SVG for web pages, Postscript; or display in an interactive graph browser
Hypergraph -- provides visualisation of hyperbolic geometry, to handle graphs and to layout hyperbolic trees
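The frequency tables listed above can be built directly from capture metadata with the standard library's `collections.Counter`. A toy sketch with made-up capture records; the field layout is an assumption, not the Search Tools schema.

```python
from collections import Counter
from urllib.parse import urlparse

# Toy (url, mime_type) records standing in for real archive metadata
captures = [
    ("http://www.greatwar.co.uk/somme.htm", "text/html"),
    ("http://www.greatwar.co.uk/map.gif", "image/gif"),
    ("http://memory.loc.gov/ammem/wwii/", "text/html"),
]

# One counter per table: domains and MIME types
domain_freq = Counter(urlparse(u).netloc for u, _ in captures)
mime_freq = Counter(m for _, m in captures)

print(domain_freq.most_common())  # → [('www.greatwar.co.uk', 2), ('memory.loc.gov', 1)]
print(mime_freq.most_common())    # → [('text/html', 2), ('image/gif', 1)]
```

A country table would follow the same pattern once each host is mapped to a country (e.g. by top-level domain).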
Graphing Tools
Recommendations for Future Research and Tools Development
Future Research
Faster, richer analytics
Rich API for analytics, to be developed in collaboration with IA, other archives, and IIPC
Temporal analytics and techniques
Link and network graphing and analytics
Enhance outreach/dissemination to the mainstream development community and research community
Future Tools Development
Multi-machine indexing and application engine
Tighter integration of graphing tools, with more user parameters and configurations
Temporal analysis (animation of link graphs over time)
Enhance WARC Tools integration and investigate interoperability with other IIPC toolsets
Developer documentation
Analyst/researcher documentation
Installation tools for Linux, Mac OS X and Windows XP/Vista
Deliverables at End March 2009
Deliverables
The Search Tools project home is http://code.google.com/p/search-tools/
Source code
Documentation
Issue management
Mailing list
The WARC Tools project home is http://code.google.com/p/warc-tools/
The prototype application is http://wwwoh.hanzoarchives.com/
Thank You
Hanzo Archives Limited
+44 20 8816 8226
www.hanzoarchives.com