Here are the slides for the talk I gave at the Digital Humanities 2014 conference in Lausanne, Switzerland. The paper abstract is at http://dharchive.org/paper/DH2014/Paper-83.xml.
Ian Milligan (@ianmilligan1) Assistant Professor of History [email protected]
Clustering Search to Navigate A Case
Study of the Canadian World Wide Web as a
Historical Resource
Why?
Historians need to think about Computational Methods in an era of
web archives.
est. HOLDINGS:
INTERNET ARCHIVE: ~10,240 TB
LIBRARY of CONGRESS: ~200 TB
The 80TB Wide Web Scrape
[March - December 2011]
Wayback Machine
or WARC files?
Building a .ca sample:
622,365 distinct URLs / 8,512,275 overall URLs =
7.31% in case study
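The share quoted above can be checked with a couple of lines of Python. This is only a sketch of the arithmetic on the slide; the variable names are illustrative and not taken from the talk.

```python
# Figures as given on the slide.
distinct_urls = 622_365
overall_urls = 8_512_275

# Share of the overall URLs that falls into the .ca case study.
share = distinct_urls / overall_urls
print(f"{share:.2%}")
```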
WARC Web ARChive file format
ISO 28500:2009
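To make the format concrete: a WARC file is a sequence of records, each with a version line, named header fields, a blank line, and a content block whose size is given by Content-Length. The sketch below is a minimal reader written for illustration, not the WARC-Tools code used in the talk. It assumes an uncompressed stream (real WARC files are usually gzipped) and ignores multi-line header values; `read_warc_records` and the sample record are hypothetical.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")

        # Named fields run until the first empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A made-up, simplified record for demonstration. In a real response record
# the payload would also carry the HTTP response headers.
body = b"<html>hello</html>"
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.ca/\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```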
filesdump.py available at https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File WARC-Tools/Lynx (warcfilter.py,
warchtmlindex.py and filesdump.py)
Indexing
CDX Files (finding aids)
Full Text Index
Clustering Workbench
Other sorts of text
analysis
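As a rough illustration of how a CDX file can serve as a finding aid: each line describes one captured resource, and the first line is a legend of single-letter field codes. The sketch below reads that legend and filters captures to hosts ending in .ca. It is an assumption-laden example, not the workflow from the talk: `parse_cdx`, `is_ca` and the sample lines are hypothetical, and the code relies on the common convention that `a` holds the original URL.

```python
from urllib.parse import urlsplit


def parse_cdx(lines):
    """Yield one dict per capture, keyed by the field letters in the legend."""
    lines = iter(lines)
    header = next(lines).rstrip("\r\n")
    delimiter = header[0]  # the legend's first character is the delimiter
    fields = header[1:].split(delimiter)
    if fields[0] != "CDX":
        raise ValueError("missing CDX legend line")
    letters = fields[1:]
    for line in lines:
        line = line.rstrip("\r\n")
        if line:
            yield dict(zip(letters, line.split(delimiter)))


def is_ca(url):
    """True when the URL's host sits under the .ca top-level domain."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".ca")


# Made-up index lines for demonstration only.
sample = [
    " CDX N b a m s k r M S V g",
    "ca,example)/ 20110401120000 http://example.ca/ text/html 200 ABC123 - - 512 0 sample.warc.gz",
    "com,example)/ 20110402120000 http://example.com/ text/html 200 DEF456 - - 600 512 sample.warc.gz",
]

hits = [row["a"] for row in parse_cdx(sample) if is_ca(row["a"])]
print(hits)
```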
https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel
WARC File WARC-Tools/Lynx (warchtmlindex.py and filesdump.py)
Indexing
Downside is you still have to know what you’re looking for.
Playing with images?
Ian Milligan Assistant Professor of History [email protected]
Thanks (to you all and to funders).
http://ianmilligan.ca/