Towards Research Engines: Supporting Search Stages in Web Archives (2015)

Preview:

Citation preview

WebART project Web Archive Retrieval Tools

Jaap Kamps, Richard Rogers, Arjen de Vries Hildelies Balk, René Voorburg

!Thaer Samar, Hugo Huurdeman, Sanna Kumpulainen

Flickr: LucViatour

!Hugo Huurdeman!

University of Amsterdam!huurdeman@uva.nl!

!!!

Towards Research Engines: Supporting Search Stages in Web Archives

webarchiving.nl

Web Archives as Scholarly Sources conference, Aarhus University, 10 June 2015

Introduction

• Web archives preserve the fast-changing Web

• By now containing Petabytes of valuable Web data !

• This could be a valuable resource, however, archives have not frequently been used for research !

• Several underlying reasons exist. Here, the focus is on potential limitations in access

Flickr: laughingsquid

The concept of ‘task-sharing’

• We look at the concept of task-sharing (Beaulieu, 1999) !• i.e. how should we design web

archive access systems to better facilitate task-sharing between scholar and system? !

• Bottom-up approach: looking at scholars’ use of Web data, and how currents systems support scholars’ needs

scholar

research task

system

1 Scholars’ use of web data!& current support

1.1 Study: scholars’ research phases

• Exploratory analysis of scholars’ research tasks (journal papers)!• scholars using temporal Web data !

• Use research phases as a ‘lens’ to analyze these papers

artist:

1.1 Background: Research Phases

• Various scholars have defined different stages occurring in research tasks (Bronstein ’07; Chu ’99; Meho & Tibbo ’03) !

• Specifically, Brügger (2014) has defined several research phases relevant to web archive research:

1. Corpus creation

2. Analysis

3. Dissemination

1.2 Study: scholars’ research phases

• Method:!• querying EBSCOhost using the CMMC (Communication & Mass

Media Complete), and LISTA (Library, Information Science & Technology Abstracts) databases !

• selecting all journal papers (2007-2015) which contain longitudinal analyses (excluding computer science papers)

1.2 Study: literature corpus overview

• 18 papers (17 distinct first authors) !

• Main areas: • Information Science • Communication • New Media • Political Science

1.2 Study: literature corpus overview

• Observation: various ways of corpus definition, analysis and dissemination in journal papers !

• However, most papers in this literature set did not use Web archives as a data source !

• Corresponds to large gap potential community addressed by web archives & small group actually using them thus far (Dougherty & Meyer, 2014)

1.3.1 Study results: Corpus definition phase

• 1. selecting webpages or websites, e.g. based on authoritative lists (13) !

• 2. querying regular search engines (5) !

• 3. taking a sample of webpages (4) !

• Often: combination of methods

e.g. the term ‘informetrics’ (Bar-Ilan, 2009), descriptors of youth movements (Xenos & Bennet, 2007)

e.g. a list of insurance companies (Waite and Harrison, 2007)

e.g. one week per month (Li et al, 2014) ; to reduce large size of corpus, or data bias (John, 2013)

1.3.1 Study results: Corpus definition phase

Query

Selection

Sample

Query

Selection

Sample

➤➤➤

13

5 1

3

4

• Current support: • Most: Selecting URLs (Wayback Machine) • Many: Querying the contents of the archive • Few: Selecting (predefined) categories • Very few: Sampling contents of the archive

• Current limitations: • Defining, saving & sharing of corpora • Document-centric access methods [Hockx-Yu, 14] • Limitations of search [Ben-David & Huurdeman,14]

1.3.2 Results: Analysis phase (1/2)

• Content analysis (66.7%)!• manual coding

• coding schemes, at times based on existing frameworks !

• Content analysis (22.2%) • automatic

• existing/customly developed tools !

• Network analysis (11.1%)!• issue crawler, link

classifications

1.3.2 Results: Analysis phase (2/2)

• Level of analysis:(b/o Brügger, 2013)!!• page element (4) (22%)

• e.g. mission statements • web page (6) (33%)

• e.g. blog pages • web site* (7) (39%)

• e.g. political actors’ sites • web sphere (1) (6%)

• e.g. youth web sphere

web sphere (1)

website (7)

page element (4)

webpage (8)

• Current support • Very few: analysis (n-gram,

trends), export options

• Current limitations: • Generally not applicable to custom corpora • No ways to define granularity of results • Often have to resort to script-based analysis tools • Lack of integrated content analysis, coding support, ..

1.3.2 Support: Analysis phase

1.3.3 Results: Dissemination phase

• Tables (16) !

• Graphs (10) !

• Link networks (1) !

• Model (1)

1.3.3 Support: Dissemination phase

• Current limitations • Set of visualizations

depends on archive • Generally not applicable

to user-defined corpora

• Current support • some visualization options

(n-gram, tag clouds)

1.4 Summary

• Observation: omissions in current support for corpus creation, analysis and dissemination in a research context !

• Opportunities arise to increase task-sharing in future systems

scholar

research task

system

2 From Search to Research engines

2.1 Supporting the flow (1/2)

• How to integrate this varied set of features into an integrated access system?

• with a high usability and without cognitive overload !!!!!!!

• Traditional approach: “Complex” interface integrating all functionality

Search

?

Dunne

Dunne et al, 2012

2.1 Supporting the flow (2/2)

• Our approach: Divide functionality per (research) stage !

• Inspired by ongoing work on supporting the flow of Web and book search in multistage interfaces, based on cognitive models of the search process [Huurdeman & Kamps, 2014; Huurdeman, Kamps, Koolen & Kumpulainen, 2015]

Search

Corpus Creation

Search

Visualization

Search

Analysis

2.2 Current research prototypes: b/o Dutch Web archive

• National Library of the Netherlands (KB) !!

• Selective Web archive (2007-now)!• 10+ Terabyte (25,000+ harvests)

!• Idea: modular system

2.2.1 Supporting research phases: corpus creation

• faceted search interface • different modalities to

explore results • possibility to

• save (complex) queries

• save results • categorize

Search

Corpus Creation

Saved queries

2.2.1 Supporting research phases: corpus creation

• Further customization ’Under the hood’: define search strategy • via visual building blocks • flexibility in defining a

corpus (determine selection, ranking, queries, etc)[De Vries et al, 2010]see also: spinque.com

Search

Corpus Creation

2.2.2 Supporting research phases: analysis

• Analysis interface !• edit/annotate

dataset • search &

browse dataset • analyze

Search

Analysis

2.2.3 Supporting research phases: dissemination

• Visualization interface!• based on RAW

(raw.densitydesign.org) • visualize datasets

(graphs and visualizations)

Search

Dissemination

2.3 Caveats & discussion

• Looking at access aspects • not at underlying data & its properties • next step: contextualizing ‘completeness’ of

results [see Huurdeman, Kamps, Samar, De Vries, Ben-David & Rogers, 2015]

!• Slightly utopian vision: not all analysis

can be supported • generic versus specific approaches • towards ‘toolmaker’s tools’ !

• Different archives offer different toolsets • Importance of sharing (open-source) and

collaboration!

2.4 Conclusion

• Exploratory analysis of scholars’ choices related to corpus definition, analysis and dissemination!!

• These choices revealed a number of limitations of current access interfaces !

• Therefore, we propose a more fluid approach, moving from mere search to ‘research engines’

Wayback Machine

Search engine

‘Research’ engine

webarchiving.nl

@webart12

Thanks & Acknowledgements

• The WebART team (’12-’16): Jaap Kamps, Richard Rogers, Arjen de Vries, Thaer Samar, Sanna Kumpulainen; and Anat Ben-David. !

• We gratefully acknowledge the collaboration with the Dutch Web Archive of the National Library of the Netherlands. !

• This research was supported by the Netherlands Organization for Scientific Research (WebART project, NWO CATCH # 640.005.001).

References• Beaulieu, M. (2000). Interaction in information searching and retrieval. Journal of Documentation, 56(4), 431–439. • Ben-David A. & Huurdeman H. (2014). Web Archive Search as Research: Methodological and Theoretical

Implications. Alexandria Journal, Volume 25, No. 1 (2014) • Bronstein, J. (n.d.). The role of the research phase in information seeking behaviour of Jewish scholars: a

modification of Ellis’s behavioural characteristics. Retrieved April 20, 2015, from http://www.informationr.net/ir/12-3/paper318.html

• Brügger, N. (2014). Concluding Remarks. International Internet Preservation Consortium General Consortium. Paris, France. Retrieved from: http://netpreserve.org/sites/default/files/attachments/Brugger.ppt (April 19, 2015)

• Brügger, N. (2013). Historical Network Analysis of the Web. Social Science Computer Review, 31(3), 306–321 • Chu, C. M. (1999). Literary critics at work and their information needs: A research-phases model. Library &

Information Science Research, 21(2), 247–273. • Dunne, C., Shneiderman, B., Gove, R., Klavans, J., & Dorr, B. (2012). Rapid understanding of scientific paper

collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351–2369.

• Hockx-Yu, H. (2014). Access and Scholarly Use of Web Archives. Alexandria, 25(1-2), 113–127. • Huurdeman H., Kamps J., Samar T., de Vries A., Ben-David A., Rogers R. (2015). Finding Pages in the Unarchived

Web. International Journal on Digital Libraries. • Huurdeman H., Kamps J., Koolen M., Kumpulainen, S. (forthcoming). The Value of Multistage Interfaces for Book

Search. CEUR-WS. • Huurdeman, H., & Kamps, J. (2014). From Multistage Information-seeking Models to Multistage Search Systems. In

Proceedings of the 5th Information Interaction in Context Symposium (pp. 145–154). New York, NY, USA: ACM. • Meho, L. I., & Tibbo, H. R. (2003). Modeling the information-seeking behavior of social scientists: Ellis’s study

revisited. Journal of the American Society for Information Science and Technology, 54(6), 570–587. • Rogers R. (2013). Digital Methods. MIT Press 2013 • de Vries A., Alink W., Cornacchia R. (2010). Search by Strategy. Proc. ESAIR '10

!Hugo Huurdeman!

University of Amsterdam!huurdeman@uva.nl!

!!!

Towards Research Engines: Supporting Search Stages in Web Archives

webarchiving.nl

Web Archives as Scholarly Sources conference, Aarhus University, 10 June 2015

Recommended