Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability. Nicholas Taylor (@nullhandle), Web Archiving Service Manager, Stanford University Libraries. Archives 2016, 209 - Balancing Quality of Life and Quality Assurance, August 4, 2016.

Page 1

Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Sustainability

Nicholas Taylor (@nullhandle)
Web Archiving Service Manager
Stanford University Libraries

Archives 2016
209 - Balancing Quality of Life and Quality Assurance
August 4, 2016

Page 2

QA panelists

Dory Bower, Government Publishing Office

Lori Donovan, Internet Archive / Archive-It

Dallas Pillen, Bentley Historical Library

Nicholas Taylor, Stanford University Libraries

Alex Thurman, Columbia University Libraries

Page 3

balancing QA + quality of life?

“Tab Tatham "junk. balance scales."” by ▓▒░ TORLEY ░▒▓ under CC BY-SA 2.0

Page 4

overheard re: QA @ SAA 2015

we set and forget; I’m just glad we’re doing something

did more QA at the beginning but, well, I don’t really look at the reports any more

steady, ongoing QA is challenging

occasionally I set aside a lunch hour to do some QA

my strategy right now is to let the big schools figure it out

Page 5

2015 SAA WebArchRT discussion

• if you could only apply 3 QA practices to your web archives, which 3?

• do you apply different QA practices to web archives created for different use cases?

• how do you ensure that staff time allocated to QA is best spent?

Page 6

quality assurance in the lifecycle

Archive-It: “The Web Archiving Life Cycle Model”

Page 7

quality assurance, expansively

typical QA

• parsing robots.txt

• scoping rules

• object count limits

• test crawling

• inspecting archived site

• reviewing reports

• patch crawling

and more

• seed selection

• assessing live site

• capture tool selection

• crawl scheduling

• crawl duration limits

• monitoring crawl

• archivability advocacy

• training
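One item from the typical QA list, parsing robots.txt, can be sketched with Python's standard library. This is only an illustrative sketch, not a tool described in the talk: the helper name `blocked_urls`, the `example-crawler` user agent, and the sample rules and URLs are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text disallows for user_agent.

    Hypothetical helper for illustration; not part of any archiving tool.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Sample rules and URLs are made up for the example.
    sample_robots = "User-agent: *\nDisallow: /private/\n"
    sample_urls = [
        "https://example.org/",
        "https://example.org/private/report.html",
    ]
    for url in blocked_urls(sample_robots, "example-crawler", sample_urls):
        print("blocked:", url)
```

Running a check like this before a test crawl gives an early hint of which parts of a seed a robots.txt-respecting crawler would skip.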

Page 8

3rd highest desired skill

[Bar chart, axis 0% to 80%. Categories: Appraisal + Selection; Archiving Tools; Collaboration + Communication; Domain Expertise; Metadata; Quality Assurance; Software Development; Web Technologies; Other]

NDSA: “2015 NDSA Web Archiving Survey”

Page 9

low perceived programmatic progress

[Bar chart, axis 0% to 60%. Categories: Vision + Objectives; Policy; Resources + Workflow; Risk Management; Appraisal + Selection; Scoping; Data Capture; QA + Analysis; Storage + Organization; Preservation; Metadata/Description; Access/Use/Reuse]

NDSA: “2015 NDSA Web Archiving Survey”

Page 10

greatest collaboration interest

[Bar chart, axis 0% to 70%. Categories: Policy + Risk Management; Capture Configuration; Collaborative Collection Dev; Input on APIs + Standards; Metadata Standards; QA Techniques + Strategies; Tool Dev; Other]

NDSA: “2015 NDSA Web Archiving Survey”

Page 12

web archiving at Stanford

• 7 Archive-It accounts

• Heritrix, Webrecorder

• local preservation, discovery, access

• program manager, curators, students

• tens of collections

• thousands of seeds

Internet Archive: “Stanford University Homepage”

Page 13

quality assurance goals

• maximize impact + efficiency of QA efforts

• enable diverse, distributed, + approachable contributions

• calibrate investments in quality based on tool capabilities

“Goals” by Eric Peacock under CC BY-NC-SA 2.0

Page 14

capture, behavior, appearance

[Diagram labels: appearance, behavior, capture]

NYARC: “I. Introduction - NYARC Documentation”

Page 16

in practice

care more about…

• report data

• crawl finishing

• 4xx, 5xx, complete robots.txt block

• plausible duration

• plausible object counts

• scoping out extraneous content

• new seeds

care less about…

• visual inspection

• reviewing every capture

• appearance fidelity

• behavior fidelity

• partial content out of scope

• partial content blocked by robots.txt

• ongoing seeds
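The report-based checks in the "care more about" list lend themselves to automation. The following is a minimal sketch under stated assumptions: the `SeedReport` fields, the `triage` function, and every threshold are hypothetical and do not reflect the report format of Archive-It, Heritrix, or any other tool.

```python
from dataclasses import dataclass


@dataclass
class SeedReport:
    """Hypothetical per-seed summary; field names are assumptions."""
    seed: str
    crawl_finished: bool
    seed_status: int        # HTTP status returned for the seed URL
    robots_blocked: bool    # True if robots.txt blocks the seed completely
    duration_seconds: int
    object_count: int


def triage(report, min_seconds=5, max_seconds=3 * 24 * 3600,
           min_objects=10, max_objects=100_000):
    """Return a list of reasons a crawl deserves a closer look.

    The default thresholds are placeholders; plausible ranges depend on the seed.
    """
    flags = []
    if not report.crawl_finished:
        flags.append("crawl did not finish")
    if 400 <= report.seed_status < 600:
        flags.append(f"seed returned HTTP {report.seed_status}")
    if report.robots_blocked:
        flags.append("seed completely blocked by robots.txt")
    if not min_seconds <= report.duration_seconds <= max_seconds:
        flags.append("implausible duration")
    if not min_objects <= report.object_count <= max_objects:
        flags.append("implausible object count")
    return flags


if __name__ == "__main__":
    # Made-up example: an unfinished crawl that captured almost nothing.
    report = SeedReport("https://example.org/", False, 404, False, 2, 1)
    for flag in triage(report):
        print(report.seed, "->", flag)
```

Checks like these only surface the crawls worth a closer look; they say nothing about appearance or behavior fidelity, which matches the priorities listed above.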