Ian Milligan (@ianmilligan1) Assistant Professor of History [email protected] Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource


Here are the slides for the talk I gave at the Digital Humanities 2014 conference in Lausanne, Switzerland. The paper abstract is available at http://dharchive.org/paper/DH2014/Paper-83.xml.


Page 1: Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource

Ian Milligan (@ianmilligan1) Assistant Professor of History [email protected]

Clustering Search to Navigate A Case Study of the Canadian World Wide Web as a Historical Resource

Page 2

Why?

Historians need to think about Computational Methods in an era of web archives.

Page 3

est. HOLDINGS:

INTERNET ARCHIVE: ~10,240 TB

LIBRARY of CONGRESS: ~200 TB

Page 4

The 80TB Wide Web Scrape

[March - December 2011]

Page 5

Wayback Machine or WARC files?

Page 6

Building a .ca sample:

622,365 distinct URLs / 8,512,275 overall URLs = 7.31% in case study
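The share quoted on this slide can be checked with a couple of lines of arithmetic. This is only a worked check of the two counts given above; the variable names are my own.

```python
# Counts taken directly from the slide.
distinct_urls = 622_365
overall_urls = 8_512_275

# Share of the overall URLs that fall in the case study sample.
share = distinct_urls / overall_urls
formatted = f"{share:.2%}"
print(formatted)  # 7.31%
```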

Page 7

WARC: Web ARChive file format

ISO 28500:2009
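To make the format concrete, here is a minimal sketch of reading records from an uncompressed WARC stream. It is not taken from the WARC-Tools scripts used in the talk. It assumes each record is a `WARC/` version line, named header fields, a blank line, and then `Content-Length` bytes of payload. The sample bytes and the function name `read_warc_records` are made up for illustration.

```python
import io

def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        if not version.startswith(b"WARC/"):
            raise ValueError(f"unexpected line: {version!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The payload length is declared in the record's own headers.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

# Invented two-record sample, for illustration only.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.ca/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: 2\r\n"
    b"\r\n"
    b"ok"
    b"\r\n\r\n"
)

records = list(read_warc_records(io.BytesIO(SAMPLE)))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```

WARC files in practice are usually gzip-compressed, so a real reader would need to decompress first; a maintained WARC library is the safer choice for actual archives.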

Page 8

filesdump.py available at https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel

WARC File

WARC-Tools/Lynx (warcfilter.py, warchtmlindex.py and filesdump.py)

Indexing

CDX Files (finding aids)
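As a rough illustration of how a CDX file works as a finding aid, the sketch below parses a space-delimited CDX listing into dictionaries. It assumes the common 11-column header ` CDX N b a m s k r M S V g`. The readable column names in `LEGEND` are my own labels for those letters, and the sample row is invented rather than taken from the dataset.

```python
# Assumed mapping from CDX legend letters to readable names (labels are mine).
LEGEND = {
    "N": "urlkey", "b": "timestamp", "a": "original", "m": "mimetype",
    "s": "statuscode", "k": "digest", "r": "redirect", "M": "metatags",
    "S": "length", "V": "offset", "g": "filename",
}

def parse_cdx(text):
    """Parse a CDX listing whose first line is the ' CDX ...' legend."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    if header[0] != "CDX":
        raise ValueError("missing CDX header line")
    columns = [LEGEND.get(letter, letter) for letter in header[1:]]
    return [dict(zip(columns, ln.split())) for ln in lines[1:]]

# Invented sample, for illustration only.
SAMPLE_CDX = """ CDX N b a m s k r M S V g
ca,example)/ 20110401120000 http://example.ca/ text/html 200 ABCDEFGHIJKLMNOPQRSTUVWXYZ234567 - - 1234 5678 sample.warc.gz
"""

rows = parse_cdx(SAMPLE_CDX)
for row in rows:
    # Each row says which file holds a capture and where it starts.
    print(row["original"], row["filename"], row["offset"])
```

Each row points from a URL and capture time to a file name and byte offset, which is what lets the index act as a finding aid for the much larger WARC files.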

Page 9

Full Text Index

Clustering Workbench

Other sorts of text analysis
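A full text index is, at its simplest, an inverted index that maps each word to the documents containing it. The sketch below is a toy version of that idea, not the indexing software used for this project. The function names and sample documents are hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.lower().split():
            word = raw.strip(PUNCTUATION)
            if word:
                index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every word in the query."""
    terms = [t.strip(PUNCTUATION) for t in query.lower().split()]
    terms = [t for t in terms if t]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Invented sample documents, for illustration only.
docs = {
    "page1": "Canadian hockey news",
    "page2": "Hockey scores, Toronto",
    "page3": "Maple syrup",
}
index = build_index(docs)
print(sorted(search(index, "hockey")))
print(sorted(search(index, "hockey toronto")))
```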

Page 14

https://github.com/ianmilligan1/Historian-WARC-1/tree/master/WARC/warc-tools-mandel

WARC File

WARC-Tools/Lynx (warchtmlindex.py and filesdump.py)

Indexing

Page 15

Downside is you still have to know what you’re looking for.
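One way around this limitation is to group documents under labels drawn from the documents themselves, so a reader can browse topics without starting from a query. The sketch below is a deliberately naive stand-in for that idea: it files each document under whichever of its terms is shared by the most documents. It is not the clustering workbench from the talk, and the stopword list, function names and sample documents are all hypothetical.

```python
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "of", "and", "in", "to", "for"}

def tokens(text):
    """Lowercased alphabetic words, minus a tiny stopword list."""
    return {w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS}

def group_by_shared_term(docs):
    """Label each document with its term that appears in the most documents."""
    doc_terms = {doc_id: tokens(text) for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for terms in doc_terms.values() for term in terms)
    groups = defaultdict(list)
    for doc_id, terms in doc_terms.items():
        if not terms:
            groups["(other)"].append(doc_id)
            continue
        # Highest document frequency wins; ties are broken alphabetically.
        label = min(terms, key=lambda t: (-df[t], t))
        groups[label].append(doc_id)
    return dict(groups)

# Invented sample documents, for illustration only.
docs = {
    "a": "hockey scores in toronto",
    "b": "toronto hockey league",
    "c": "maple syrup recipes",
    "d": "syrup festival in quebec",
}
groups = group_by_shared_term(docs)
print(groups)
```

Real clustering tools use far more robust label selection, but the principle of deriving browsable groups from the text itself is the same.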

Page 17

Playing with images?

Page 18

Ian Milligan Assistant Professor of History [email protected]

Thanks (to you all and to funders).


http://ianmilligan.ca/