Towards Multidimensional Web Archive Access
Creating & Analyzing Representations of Aggregated Web Content
Hugo Huurdeman Thaer Samar Jaap Kamps Arjen de Vries
Introduction
• Web archives:
  • exceptionally rich potential scholarly data source
• Important: temporal & hierarchical aspects
  • however, current access usually at single page level
Introduction
• Focus: how can we provide insights into the multidimensional aspects of the archive?
  • i.e. moving from singular representations of time-stamped pages to larger aggregated representations
• Illustrated by previous work on scholarly access & examples from Dutch Web archive
1 Scholars’ Needs: literature analysis
1.1 Exploratory Study
• Exploratory analysis of scholars’ research tasks (journal papers) [see: Huurdeman15]
• scholars using temporal Web data
• Focus on corpus generation, analysis and dissemination
1.1 Exploratory Study
• Method:
• querying EBSCOhost using the CMMC (Communication & Mass Media Complete), and LISTA (Library, Information Science & Technology Abstracts) databases
• selecting all journal papers (2007-2015) which contain longitudinal analyses (excl. computer science papers)
1.2 Results: Scholars’ Corpora
• Observation:
  • Of the 18 resulting papers, most scholars did not use institutional Web archives as their data source
• Corpus definition:
  • 1. by selecting webpages or websites, e.g. based on authoritative lists (13)
  • 2. by querying regular search engines (5)
  • 3. by taking a sample of webpages (4)
  • or a combination thereof
1.3 Results: Dimensions
• Some research examples:
  • quality of answers in question-answering sites over time (Chua et al., 2013)
  • hyperlinking in news websites across time (Karlsson et al., 2015)
  • electoral web spheres at election times (Xenos & Bennett, 2007)
• Various hierarchical and temporal dimensions
1.3.1 Results: Hierarchical Dimension
• Level of analysis (based on Brügger, 2013):
  • page element (4) (22%), e.g. mission statements
  • web page (6) (33%), e.g. blog pages
  • web site* (7) (39%), e.g. political actors’ sites
  • web sphere (1) (6%), e.g. electoral web sphere
[Figure: number of papers per level of analysis]
1.3.2 Results: Temporal Dimension
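[Figure: temporal scope of the papers on a 2000-2010 timeline. Categories: timepoints, singular timerange, multiple timeranges. #Papers, in the order listed on the slide: 5 (28%), 8 (44%), 5 (28%)]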
1.3 Dimensions: Wrapup
• Scholars’ focus: not just on pages, but also on page elements, web sites and web spheres
  • at timepoints, singular timerange, multiple timeranges
• Various ways to define a corpus
  • queries, samples and selections (e.g. URL lists)
• How are these needs reflected in Web archive data and access functionality?
2 Dimensions of the Web archive: data and access
2.1 Web Archive Data
• Usually stored in (W)ARC files
  • each containing one or more (W)ARC records
  • resources of various kinds
2.1 Data: Dimensions
• (1) temporal dimension
  • versions of Web content accumulated over time
  • timestamped (W)ARC records
  • crawl dates
  • last-modified dates
[Figure: timeline of archived versions, 1997 to 2016]
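One way to move from single timestamps to time periods is to bucket crawl dates. The sketch below is only an illustration: it assumes timestamps in the YYYYMMDD form seen in the harvest dates of this deck (longer, 14-digit timestamps are truncated) and groups them into quarter labels such as 2009Q3. The function names are hypothetical and not part of any archive toolkit.

```python
from collections import defaultdict
from datetime import datetime


def quarter_label(timestamp: str) -> str:
    """Map a YYYYMMDD (or longer) timestamp to a label such as '2009Q3'."""
    day = datetime.strptime(timestamp[:8], "%Y%m%d")
    return f"{day.year}Q{(day.month - 1) // 3 + 1}"


def group_by_quarter(timestamps):
    """Group crawl timestamps into quarterly buckets, keeping input order."""
    buckets = defaultdict(list)
    for ts in timestamps:
        buckets[quarter_label(ts)].append(ts)
    return dict(buckets)


# Harvest dates taken from the charts later in this deck.
harvests = ["20100722", "20100816", "20110413", "20110610"]
print(group_by_quarter(harvests))
```

The same idea extends to months or years by changing the label function; which granularity is useful depends on the crawl frequency of the site.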
2.2 Data: Dimensions
• (2) hierarchical dimension
  • “web sphere, web site, web page, page element”
  • stored in (W)ARC files
  • as “flat” (W)ARC records
[Figure: hierarchy tree, web sphere > website > page > element]
  • web sphere: e.g. set of websites; category of sites
  • website: e.g. all pages under a host or domain; all homepages; all homepages+1
  • element: e.g. .css, .jpg file
Issue: delineating the granularities
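To make the delineation issue concrete, here is a minimal sketch of one possible rule for recovering the site and page levels from flat URLs. It assumes that a web site can be keyed by its host name and that "homepages+1" can be approximated by URL path depth; the slides may instead mean link distance, so this is an assumption. The function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Use the host name (without a leading 'www.') as the web site key."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def depth_of(url: str) -> int:
    """Count non-empty path segments: 0 for a homepage, 1 for 'homepage+1'."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def homepage_plus_one(urls):
    """Keep homepages and pages one level below them, grouped per site."""
    sites = {}
    for url in urls:
        if depth_of(url) <= 1:
            sites.setdefault(site_of(url), []).append(url)
    return sites


# Hypothetical URLs, for illustration only.
urls = [
    "http://www.example.org/",
    "http://www.example.org/about",
    "http://www.example.org/about/team/history",
    "http://example.net/news",
]
print(homepage_plus_one(urls))
```

Other delineations (registered domain instead of host, or a curated list of sites per web sphere) would change only `site_of`, which is why the choice of granularity is worth making explicit.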
2.3 Access: current limits
• Open question: how to support these dimensions?
• current support in interfaces:
  • most: Selecting URLs, timestamps (Wayback Machine)
  • many: Querying contents of the archive, temporal filters
  • few: Selecting categories, facet filters
• usually still page-level results, i.e. individual pages
• How to provide aggregated results using different hierarchical and temporal dimensions?
  • scaling from page to site and ‘sphere’ level
  • moving from single timestamp to time periods
[Figure: hierarchical levels (page element, web page, web site, web sphere) against time, 2000 to 2010]
3 Exploring Aggregations: Aggregated representations in the Dutch Web archive
Flickr: koninklijkebibliotheek
Statistics:
• 10,000+ websites
• 35,000+ harvests
• 16+ Terabyte
National Library of the Netherlands: Web archive since 2007
3.1 Data: extraction and processing
• extracting all homepages + all pages 1 level deep
• matching with seedlist, adding KB metadata
• cleaning, processing, data enrichment (e.g. NER)
• generate aggregations
• ~900K XML files
[Figure: processing pipeline from single pages to site summary]
3.2 Potential Use: Explorations
• Potential for analysis and visualization
• Examples via Dutch Web archive
  • I. (aggregated) degree of change: hierarchical
    • homepages+1, ssdeep (content text, links, images)
  • II. (aggregated) content summaries: temporal
    • homepages + 1, tf-idf
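The degree-of-change idea can be sketched as follows. The slides name ssdeep (fuzzy hashing) as the measure; to keep the example runnable with the standard library only, this sketch swaps in `difflib.SequenceMatcher` as a stand-in similarity, so its scores will not match ssdeep scores. The "overall" value as an unweighted mean of the three features is also an assumption, as are the function names and sample data.

```python
from difflib import SequenceMatcher


def degree_of_change(old: str, new: str) -> float:
    """Change between two versions on a 0-100 scale (0 means identical)."""
    similarity = SequenceMatcher(None, old, new).ratio()
    return round(100 * (1 - similarity), 1)


def site_change(previous: dict, current: dict) -> dict:
    """Compare two harvests feature by feature and add an unweighted mean."""
    scores = {
        name: degree_of_change(previous[name], current[name])
        for name in previous
    }
    # Assumption: 'overall' is the plain average of the feature scores.
    scores["overall"] = round(sum(scores.values()) / len(scores), 1)
    return scores


# Hypothetical harvests of one site, reduced to three text features.
previous = {
    "content": "Welcome to the museum",
    "links": "/about /visit",
    "images": "logo.png",
}
current = {
    "content": "Welcome to the museum",
    "links": "/about /visit /shop",
    "images": "banner.jpg",
}
print(site_change(previous, current))
```

Running this for every pair of consecutive harvests gives one score series per feature, which is the kind of data plotted in the per-site charts.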
3.2.1 Examining aggregated degree of change
[Figure: degree of change per harvest for eyefilm.nl, 22 July 2010 to 18 May 2015; y-axis 0 to 120; series: content, links, images, overall; two points annotated “redesign”]
Example: eyefilm.nl (2010-2015)
[Figure: degree of change per harvest, 26 February 2009 to 10 May 2015; y-axis 0 to 100; four series]
Example: escherinhetpaleis.nl (2010-2015)
[Figure: hierarchical levels (page element, web page, web site, web sphere) against time, 2010 to 2015]
Changerate (type of site)
[Figure: average change (content, images, links, combined) per UNESCO classification code, 01 to 31; y-axis 0 to 60; annotated categories: Meteorology, Law & government, History, Sports, Agriculture]
Changes per UNESCO category (all per-quarter harvests, n=~600, 2009-2015)
Changerate (all sites)
[Figure: average change (content, links, images, combined) per quarter, 2009Q3 to 2015Q1; y-axis 0 to 35]
Changerate (all per-quarter harvests, 2009-2015)
3.2.2 Examining aggregated content summaries
3.2.2 Exploring Content Summaries
• Examine textual contents of a website
• for example, nu.nl
• most popular Dutch news site (Alexa, 2016)
• daily crawls by KB
• Exploration: different temporal site-level summarizations
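A temporal site-level summary based on tf-idf could look roughly like the sketch below. It assumes that all text harvested from a site within one period is treated as a single document, that tokens are split on whitespace, and that tf-idf uses relative term frequency times the log of inverse document frequency; the slides do not specify these details. The function name and the toy texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_summaries(period_texts: dict, top_n: int = 3) -> dict:
    """Return the top_n highest-scoring terms for each time period."""
    term_counts = {
        period: Counter(text.lower().split())
        for period, text in period_texts.items()
    }
    n_periods = len(term_counts)
    # Document frequency: in how many periods does each term occur?
    doc_freq = Counter(
        term for counts in term_counts.values() for term in counts
    )
    summaries = {}
    for period, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_periods / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (descending), then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        summaries[period] = ranked[:top_n]
    return summaries


# Toy texts standing in for one year of site content each.
texts = {
    "2013": "election election cabinet news",
    "2014": "olympics olympics sochi news",
    "2015": "budget budget cabinet news",
}
print(tfidf_summaries(texts, top_n=2))
```

Terms that occur in every period (here "news") score zero, so each summary highlights what is distinctive for that period rather than what is always present on the site.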
[Figures: site-level content summaries of nu.nl at different temporal granularities: yearly (2014, 2015), monthly (Jan’13 to Dec’13), daily (2012); named-entity summaries per year (2012 to 2015) for organizations, persons and places (NER)]
3.2.3 Next: combining approaches
4 Conclusion: Towards Multidimensional Web Archive Access
4.1 Conclusion
• Gap between researchers’ needs and data/access
• Researchers’ needs
• rich access, e.g. different analytical levels, temporal ranges
• Archive access
• mainly access at single page level (URLs and queries)
• Calls for new approaches to provide access to aggregated contents
• temporally and hierarchically
4.2 Our approach
• Starting from a selection instead of a query
• Potential support for exploratory stages of (re)search
• Potential support for analysis and comparisons
• Issues: which levels of a website to summarize
• experimental focus on homepages and underlying pages
• deeper layers: additional richness, additional issues
• custom file formats vs standardized formats
• Integration into access interfaces
4.3 Ongoing and Future Work
• Further extending our approach; integration into WebARTist toolset
• providing new ways to explore material in the archive (without using queries)
• Creating aggregated representations of unarchived contents
• see “Lost but Not Forgotten: Finding Pages on the Unarchived Web” (2015)
[Figure: research stages: “Corpus Creation”, “Analysis”, “Dissemination”]
References
• Ben-David, A., & Huurdeman, H. (2014). Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria, 25(1).
• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321.
• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt (April 19, 2015)
• Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library & Information Science Research, 21(2), 247–273.
• Dougherty, M., & Meyer, E. T. (2014). Community, tools, and practices in web archiving: The state-of-the-art in relation to social science and humanities research needs. Journal of the Association for Information Science and Technology, 65(11), 2195–2209. http://doi.org/10.1002/asi.23099
• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127.
• Huurdeman, H. (2015). Towards Research Engines: Supporting Search Stages in Web archives. Presented at Web Archives as Scholarly Sources conference, Aarhus University, Denmark.
• Huurdeman, H., Kamps, J., Samar, T., de Vries, A., Ben-David, A., & Rogers, R. (2015). Finding Pages in the Unarchived Web. International Journal on Digital Libraries.
• Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM.
• Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587.
• Rogers, R. (2013). Digital Methods. MIT Press.
Thanks & Acknowledgements
• The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Hugo Huurdeman, Thaer Samar, Anat Ben-David, Sanna Kumpulainen
• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands.
• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).
webarchiving.nl
@webart12
@timelessfuture