Towards Multidimensional Web Archive Access (IIPC 2016)

Towards Multidimensional Web Archive Access

Creating & Analyzing Representations of Aggregated Web Content

Hugo Huurdeman Thaer Samar Jaap Kamps Arjen de Vries

Introduction

• Web archives:

• exceptionally rich potential scholarly data source

• Important: temporal & hierarchical aspects

• however, current access is usually at the single-page level

Introduction

• Focus: how can we provide insights into the multidimensional aspects of the archive?

• i.e. moving from singular representations of time-stamped pages to larger aggregated representations

• Illustrated by previous work on scholarly access & examples from Dutch Web archive


1 Scholars’ Needs: literature analysis

1.1 Exploratory Study

• Exploratory analysis of scholars’ research tasks (journal papers) [see: Huurdeman15]

• scholars using temporal Web data

• Focus on corpus generation, analysis and dissemination


1.1 Exploratory Study

• Method:

• querying EBSCOhost using the CMMC (Communication & Mass Media Complete) and LISTA (Library, Information Science & Technology Abstracts) databases

• selecting all journal papers (2007-2015) which contain longitudinal analyses (excl. computer science papers)

1.2 Results: Scholars’ Corpora

• Observation:

• Of the 18 resulting papers, most scholars did not use institutional Web archives as their data source

• Corpus definition:

• 1. by selecting webpages or websites, e.g. based on authoritative lists (13)

• 2. by querying regular search engines (5)

• 3. by taking a sample of webpages (4)

• or a combination thereof

1.3 Results: Dimensions

• Some research examples:

• quality of answers in question-answering sites over time (Chua et al., 2013)

• hyperlinking in news websites across time (Karlsson et al., 2015)

• electoral web spheres at election times (Xenos & Bennett, 2007)

• Various hierarchical and temporal dimensions

1.3.1 Results: Hierarchical Dimension

• Level of analysis (based on Brügger, 2013):

• page element: 4 papers (22%), e.g. mission statements

• web page: 6 papers (33%), e.g. blog pages

• web site*: 7 papers (39%), e.g. political actors’ sites

• web sphere: 1 paper (6%), e.g. electoral web sphere


1.3.2 Results: Temporal Dimension

[Figure: number of papers per temporal scope (timepoints, singular timerange, multiple timeranges), drawn on a 2000-2010 timeline; counts shown (#Papers): 5 (28%), 8 (44%), 5 (28%)]

1.3 Dimensions: Wrapup

• Scholars’ focus: not just on pages, but also on page elements, web sites and web spheres

• at timepoints, singular timerange, multiple timeranges

• Various ways to define a corpus

• queries, samples and selections (e.g. URL lists)

• How are these needs reflected in Web archive data and access functionality?

2 Dimensions of the Web archive: data and access

2.1 Web Archive Data

• Usually stored in (W)ARC files

• each containing one or more (W)ARC records

• resources of various kinds
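A minimal sketch of how records could be read from an uncompressed WARC stream, using only the Python standard library. The sample records are hypothetical and simplified (real records carry further headers such as WARC-Record-ID, real files are often gzip-compressed, and the older ARC format differs); `read_warc_records` and `make_record` are illustrative names, not part of any archive toolkit.

```python
import io

def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of stream
        if not line.strip():
            continue                     # blank separator lines between records
        headers = {"version": line.strip().decode("utf-8")}  # e.g. "WARC/1.0"
        while True:
            line = stream.readline()
            if not line.strip():
                break                    # an empty line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content block is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload

def make_record(url, date, body):
    """Build one simplified, hypothetical WARC record for the demo below."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {url}\r\n"
        f"WARC-Date: {date}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"

sample = (
    make_record("http://example.org/", "2010-07-22T10:00:00Z", b"<html>home</html>")
    + make_record("http://example.org/logo.jpg", "2010-07-22T10:00:05Z", b"\xff\xd8fake-jpeg")
)

records = list(read_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], headers["WARC-Date"], len(payload))
```

Each record stands alone in the stream: a page and the image it embeds are stored as separate records of different kinds.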

2.1 Data: Dimensions

• (1) temporal dimension

• versions of Web content accumulated over time

• timestamped (W)ARC records

• crawl dates

• last-modified dates

[Figure: timeline of archived versions accumulating over time; year labels 1997, 2000, 2004, 2008, 2012, 2016]
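As an illustration of the two kinds of dates listed above, the sketch below parses a crawl date (WARC-Date header, ISO 8601 in UTC) and an HTTP Last-Modified date, then counts captured versions per year. The capture list is hypothetical example data, and the function names are assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_crawl_date(warc_date):
    """Parse a WARC-Date header such as '2012-05-20T08:15:00Z' (UTC)."""
    parsed = datetime.strptime(warc_date, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

def parse_last_modified(http_date):
    """Parse an HTTP Last-Modified header such as 'Tue, 15 May 2012 09:00:00 GMT'."""
    return parsedate_to_datetime(http_date)

# Hypothetical captures: (URL, crawl date, Last-Modified header).
captures = [
    ("http://example.org/", "2012-05-20T08:15:00Z", "Tue, 15 May 2012 09:00:00 GMT"),
    ("http://example.org/", "2013-04-13T08:15:00Z", "Tue, 15 May 2012 09:00:00 GMT"),
    ("http://example.org/", "2013-06-11T08:15:00Z", "Mon, 10 Jun 2013 17:30:00 GMT"),
]

versions_per_year = defaultdict(int)
for url, warc_date, last_modified in captures:
    crawled = parse_crawl_date(warc_date)
    modified = parse_last_modified(last_modified)
    versions_per_year[crawled.year] += 1
    # The gap between the two dates shows how old the content was when crawled.
    print(url, crawled.date(), "last modified", (crawled - modified).days, "days before crawl")

print(dict(versions_per_year))
```

The crawl date says when the archive saw a page, while the last-modified date hints at when the content itself changed; the two can differ widely.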

2.2 Data: Dimensions

• (2) hierarchical dimension

• “web sphere, web site, web page, page element”

• stored in (W)ARC files

• as “flat” (W)ARC records

[Figure: tree from one Web sphere down to Websites, Pages and Elements]

• Web sphere: e.g. set of websites; category of sites

• Website: e.g. all pages under a host or domain; all homepages; all homepages+1

• Element: e.g. .css, .jpg file

• Issue: delineating the granularities
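One possible way to recover a hierarchy from flat records is to group URLs by host and path depth. The sketch below is an assumption-laden illustration: it treats a host (minus a leading "www.") as a website, counts path segments as depth (so "homepages+1" becomes depth 0 or 1), and uses a hypothetical list of file extensions to tell page elements from pages.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath

# Assumption: these extensions mark embedded page elements rather than pages.
ELEMENT_EXTENSIONS = {".css", ".js", ".jpg", ".png", ".gif"}

def classify(url):
    """Return (site, depth, level) for one archived URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    site = host[4:] if host.startswith("www.") else host
    depth = len([segment for segment in parts.path.split("/") if segment])
    extension = posixpath.splitext(parts.path)[1].lower()
    level = "element" if extension in ELEMENT_EXTENSIONS else "page"
    return site, depth, level

def build_hierarchy(urls, max_depth=1):
    """Group flat URLs per site, keeping only pages up to max_depth."""
    sites = defaultdict(lambda: {"page": [], "element": []})
    for url in urls:
        site, depth, level = classify(url)
        if level == "page" and depth > max_depth:
            continue                     # deeper pages fall outside homepages+1
        sites[site][level].append(url)
    return dict(sites)

urls = [
    "http://www.example.org/",
    "http://example.org/about",
    "http://example.org/about/team/2015",
    "http://example.org/style.css",
    "http://example.net/",
]
hierarchy = build_hierarchy(urls)
for site, levels in hierarchy.items():
    print(site, len(levels["page"]), "pages,", len(levels["element"]), "elements")
```

Every choice here (what counts as a site, how depth is measured, which files are elements) is one answer to the delineation issue, not the only one.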

2.3 Access: current limits

• Open question: how to support these dimensions?

• current support in interfaces:

• most: selecting URLs, timestamps (Wayback Machine)

• many: querying contents of the archive, temporal filters

• few: selecting categories, facet filters

• usually still page-level results, i.e. individual pages

• How to provide aggregated results using different hierarchical and temporal dimensions?

• scaling from page to site and ‘sphere’ level

• moving from single timestamp to time periods

[Figure: hierarchical levels (page element, web page, web site, web sphere) set against a timeline (2000, 2005, 2010)]
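To make the idea of aggregated results concrete, the following sketch rolls page-level hits up to site level and to time periods. The hit list, the `aggregate_hits` function and the year/quarter buckets are hypothetical; they illustrate the scaling described above rather than any existing archive interface.

```python
from collections import Counter
from urllib.parse import urlsplit

def aggregate_hits(hits, period="year"):
    """Count page-level hits per (site, period); hits are (url, 'YYYYMMDD') pairs."""
    counts = Counter()
    for url, stamp in hits:
        site = urlsplit(url).hostname
        year, month = int(stamp[:4]), int(stamp[4:6])
        if period == "year":
            bucket = str(year)
        elif period == "quarter":
            bucket = f"{year}Q{(month - 1) // 3 + 1}"
        else:
            raise ValueError(f"unsupported period: {period}")
        counts[(site, bucket)] += 1
    return counts

# Hypothetical page-level results, as a search interface might return them.
hits = [
    ("http://example.org/", "20100722"),
    ("http://example.org/news", "20100816"),
    ("http://example.org/", "20110413"),
    ("http://example.net/", "20100817"),
]

per_year = aggregate_hits(hits, period="year")
per_quarter = aggregate_hits(hits, period="quarter")
print(per_year)
print(per_quarter)
```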

3 Exploring Aggregations: Aggregated representations in the Dutch Web archive

Flickr: koninklijkebibliotheek

Statistics:

• 10,000+ websites

• 35,000+ harvests

• 16+ terabytes

National Library of the Netherlands: Web archive since 2007

3.1 Data: extraction and processing

• extracting all homepages + all pages 1 level deep

• matching with seedlist, adding KB metadata

• cleaning, processing, data enrichment (e.g. NER)

• generating aggregations (~900K XML files)

[Figure labels: single page, single pages, site summary]
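The seedlist-matching step could look roughly like the sketch below. The seedlist entries, the metadata fields (`title`, `category`) and the page dictionaries are invented for illustration; the actual KB seedlist format is not described in these slides.

```python
from urllib.parse import urlsplit

# Hypothetical seedlist: host -> curator metadata.
seedlist = {
    "example.org": {"title": "Example Museum", "category": "24"},
    "example.net": {"title": "Example News", "category": "16"},
}

def match_seed(url, seeds):
    """Return the seedlist entry for a URL's host, or None if the host is not a seed."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return seeds.get(host)

def enrich(pages, seeds):
    """Attach seedlist metadata to each extracted page; drop pages without a seed."""
    enriched = []
    for page in pages:
        metadata = match_seed(page["url"], seeds)
        if metadata is not None:
            enriched.append({**page, **metadata})
    return enriched

# Hypothetical extracted pages (homepage and pages one level deep).
pages = [
    {"url": "http://www.example.org/", "harvest": "20100722"},
    {"url": "http://example.net/sport", "harvest": "20100816"},
    {"url": "http://unlisted.example/", "harvest": "20100817"},
]

enriched = enrich(pages, seedlist)
for page in enriched:
    print(page["url"], page["title"], page["category"])
```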

3.2 Potential Use: Explorations

• Potential for analysis and visualization

• Examples via Dutch Web archive

• I. (aggregated) degree of change: hierarchical

• homepages+1, ssdeep (content text, links, images)

• II. (aggregated) content summaries: temporal

• homepages+1, tf-idf

3.2.1 Examining aggregated degree of change

[Figure: degree of change per harvest for http://eyefilm.nl, by harvest date (July 2010 to May 2015); series: content, links, images, overall; two redesigns marked]

Example: eyefilm.nl (2010-2015)
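A degree-of-change score between two consecutive harvests can be sketched as follows. Note the substitution: the slides name ssdeep (fuzzy hashing), whereas this sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, so the numbers will differ from ssdeep's. Taking the overall score as the mean of the three aspects is also an assumption, and the harvest data is hypothetical.

```python
from difflib import SequenceMatcher

ASPECTS = ("content", "links", "images")

def change_score(old, new):
    """Return 0 (identical) to 100 (nothing in common) for two strings."""
    similarity = SequenceMatcher(None, old, new).ratio()
    return round(100 * (1 - similarity))

def harvest_change(previous, current):
    """Compare two harvests aspect by aspect; each aspect is a list of strings."""
    scores = {
        aspect: change_score("\n".join(previous[aspect]), "\n".join(current[aspect]))
        for aspect in ASPECTS
    }
    # Assumption: overall change is the plain mean of the three aspects.
    scores["overall"] = round(sum(scores[aspect] for aspect in ASPECTS) / len(ASPECTS))
    return scores

# Hypothetical harvests of one site's homepage.
harvest_2010 = {
    "content": ["Welcome to the film institute", "Opening hours"],
    "links": ["/agenda", "/collection"],
    "images": ["/img/logo.png"],
}
harvest_2012 = {
    "content": ["Welcome to the new film museum", "Plan your visit"],
    "links": ["/agenda", "/collection", "/shop"],
    "images": ["/img/logo.png"],
}

print(harvest_change(harvest_2010, harvest_2012))
```

A spike across all aspects at once would be consistent with a redesign, whereas change confined to the content aspect suggests routine updates.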

[Figure: degree of change per harvest for http://escherinhetpaleis.nl, by harvest date; series: content, links, images, overall]

Example: escherinhetpaleis.nl (2010-2015)

[Figure: hierarchical levels (page element, web page, web site, web sphere) over time (2010-2015), annotated “UNESCO classifications”]

[Figure: change rate by type of site; x-axis: UNESCO category codes (01 to 31); series: average change of content, images, links, and combined; labelled categories: Meteorology, Law & government, History, Sports, Agriculture]

Changes per UNESCO category (all per-quarter harvests, n ≈ 600, 2009-2015)
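Averaging change scores per category, as in the chart above, amounts to a simple group-by. The rows below are invented numbers, the category codes are placeholders, and computing "combined" as the mean of the three per-aspect averages is an assumption of this sketch.

```python
from collections import defaultdict
from statistics import mean

ASPECTS = ("content", "images", "links")

def mean_change_by_category(rows):
    """Average per-site change scores within each category."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for aspect in ASPECTS:
            grouped[row["category"]][aspect].append(row[aspect])
    result = {}
    for category, values in grouped.items():
        averages = {aspect: mean(values[aspect]) for aspect in ASPECTS}
        # Assumption: "combined" is the mean of the three aspect averages.
        averages["combined"] = mean([averages[aspect] for aspect in ASPECTS])
        result[category] = averages
    return result

# Hypothetical per-site change scores (0-100), one row per site.
rows = [
    {"site": "a.example", "category": "16", "content": 10, "images": 0, "links": 20},
    {"site": "b.example", "category": "16", "content": 30, "images": 20, "links": 40},
    {"site": "c.example", "category": "24", "content": 50, "images": 50, "links": 50},
]

averages = mean_change_by_category(rows)
for category, values in sorted(averages.items()):
    print(category, values)
```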





[Figure: change rate for all sites, per quarter (2009Q3 to 2015Q1)]

Change rate (all per-quarter harvests, 2009-2015)





3.2.2 Examining aggregated content summaries

3.2.2 Exploring Content Summaries

• Examine textual contents of a website

• for example, nu.nl

• most popular Dutch news site (Alexa, 2016)

• daily crawls by KB

• Exploration: different temporal site-level summarizations
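A site-level content summary per time period can be sketched with a plain tf-idf computation, where each period's aggregated text counts as one document. The texts below are invented placeholders rather than real nu.nl content, and the tokenizer and `top_n` cut-off are assumptions of this sketch.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())

def tfidf_summaries(texts_by_period, top_n=2):
    """Return the top_n highest-scoring terms for each period's aggregated text."""
    term_counts = {period: Counter(tokenize(text)) for period, text in texts_by_period.items()}
    n_periods = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())   # number of periods in which a term occurs
    summaries = {}
    for period, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_periods / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        summaries[period] = ranked[:top_n]
    return summaries

# Hypothetical aggregated text per year for one site.
texts_by_period = {
    "2013": "budget cabinet budget storm",
    "2014": "football cabinet football flight",
    "2015": "refugees cabinet refugees europe",
}

summaries = tfidf_summaries(texts_by_period)
for period, terms in summaries.items():
    print(period, terms)
```

Terms that occur in every period get a score of zero, so the summary favours what is distinctive for a period. The same function works for daily, monthly or yearly buckets, depending on how the texts are grouped.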

[Figures: site-level content summaries for http://nu.nl at different temporal granularities: yearly (2014, 2015), monthly (Jan ’13 to Dec ’13) and daily (2012); named-entity (NER) summaries for organizations, persons and places, per year (2012-2015)]



3.2.3 Next: combining approaches

4 Conclusion: Towards Multidimensional Web Archive Access

4.1 Conclusion

• Gap between researchers’ needs and data/access

• Researchers’ needs

• rich access, e.g. different analytical levels, temporal ranges

• Archive access

• mainly access at single page level (URLs and queries)

• Calls for new approaches to provide access to aggregated contents

• temporally and hierarchically

4.2 Our approach

• Starting from a selection instead of a query

• Potential to support exploratory stages of (re)search

• Potential to support analysis and comparisons

• Issues: which levels of a website to summarize

• experimental focus on homepages and underlying pages

• deeper layers: additional richness, additional issues

• custom file formats vs standardized formats

• Integration into access interfaces


4.3 Ongoing and Future Work

• Further extending our approach; integration into WebARTist toolset

• providing new ways to explore material in the archive (without using queries)

• Creating aggregated representations of unarchived contents

• see “Lost but Not Forgotten: Finding Pages on the Unarchived Web” (2015)

[Figure labels: “Corpus Creation”, “Analysis”, “Dissemination”]

References

• Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria Journal, 25(1).

• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.

• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt (April 19, 2015)

• Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273.

• Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology, 65(11), 2195–2209. http://doi.org/10.1002/asi.23099

• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.

• Huurdeman, H. (2015). Towards Research Engines: Supporting Search Stages in Web archives. Presented at Web Archives as Scholarly Sources conference, Aarhus University, Denmark.

• Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries.

• Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM.

• Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.

• Rogers, R. (2013). Digital Methods. MIT Press.

Thanks & Acknowledgements

• The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Sanna Kumpulainen

• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.

• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).

webarchiving.nl

@webart12

@timelessfuture
