WARC standard revision workshop
Clément Oury
IIPC General Assembly open workshops
Stanford, April 28th, 2015
IIPC General Assembly – Stanford – April 28th, 2015
Summary of the presentation
Current status of the WARC standard
The revision process
Identify, discuss and prioritize revision needs
Set up an organization and agenda for further work
The WARC format
A container format designed to store any kind of digital content
  – Along with relevant metadata
  – Extension of the ARC format designed in 1996
WARC improvements
  – Assigns a unique identifier to each record
  – New record types:
    • To describe the harvesting process: warcinfo, request, response, metadata records
    • To store information on deduplication: revisit records
    • To store segmented files: continuation records
    • To record outputs of a file format migration: conversion records
    • To record non-web material: resource records
  – New named fields for each record
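As an illustration of the record layout described above (version line, named fields, blank line, content block), here is a minimal Python sketch that serializes one record. The function name `build_warc_record` and the example URI are assumptions made for this illustration; only the field names and the overall layout come from the WARC format itself.

```python
import uuid
from datetime import datetime, timezone

CRLF = "\r\n"


def build_warc_record(record_type, payload, target_uri=None,
                      content_type="application/octet-stream",
                      extra_fields=None):
    """Serialize one WARC/1.0 record as bytes (hypothetical helper)."""
    fields = [
        ("WARC-Type", record_type),
        # Each record gets its own unique identifier.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date",
         datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
    ]
    if target_uri:
        fields.append(("WARC-Target-URI", target_uri))
    fields.append(("Content-Type", content_type))
    fields.extend(extra_fields or [])
    # Content-Length counts the bytes of the content block only.
    fields.append(("Content-Length", str(len(payload))))

    header = "WARC/1.0" + CRLF
    header += "".join(f"{name}: {value}{CRLF}" for name, value in fields)
    header += CRLF  # blank line separates the named fields from the block
    # A record ends with two CRLFs after the content block.
    return header.encode("utf-8") + payload + (CRLF * 2).encode("ascii")


# A "resource" record for non-web material (the URI is an example value).
record = build_warc_record("resource", b"hello",
                           target_uri="file:///example/hello.txt",
                           content_type="text/plain")
```

Writing the result of several such calls one after another to a file gives the general shape of a WARC file; real tools add per-record compression, digests and a leading warcinfo record.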
Usage of WARC format
Widely adopted by the web archiving community
  – Most institutions have switched from ARC to WARC format
  – Harvesting: Heritrix, Wget, WARCreate
  – Data management/preservation: JWAT, JHOVE2
  – Indexing and access: Solr, OpenWayback
But also adopted beyond the web archiving community
  – To store e-periodicals and e-books: LOCKSS project
  – To store all files ingested in a long-term repository: Danish Bit Repository
Some usage issues discussed in the WARC implementation guidelines
The WARC standard
Published as ISO 28500 on May 15th, 2009
  – Standardization process had started in 2006
  – Mainly carried out by IIPC members under the ISO umbrella
ISO group: TC 46 / SC 4 / WG 12
  – TC 46: Information and documentation
  – SC 4: Technical interoperability
  – WG 12: WARC file format
ISO standards are generally reviewed after 5 years
  – ISO members voted in 2014 in favor of the revision
The revision process
A maximum period of 36 months
A two-step approach
  – IIPC draft / IIPC WG
  – ISO validated standard / ISO WG
Proposed agenda in 2015
  – WARC revision workshop: now!
  – June: presentation of revision process during TC 46 meeting
  – May-September: first IIPC draft
  – October (?): ISO WG meeting
The revision process – why?
Amend or improve the current standard, on several topics
  – clarify potential ambiguities or inconsistencies in the standard;
  – offer better solutions to record some information, e.g. by adding new named fields or even new record types;
  – take into account some needs not identified when the original standard was designed (e.g. use of WARC for documents other than web archives);
  – perform minor editorial revisions.
Afterwards, no change possible until the next revision!
Revision needs – active discussions
Clarification
  – Is it allowed to add new named fields?
    • New record types are allowed…
    • But nothing is indicated on new named fields
Two new named fields for deduplication
  – WARC-Refers-To-Target-URI
  – WARC-Refers-To-Date
A proposal to record screenshots?
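To make the deduplication proposal concrete, the following Python sketch shows how a harvester might choose between a response record and a revisit record, and where the two proposed fields would appear. The function `dedup_fields`, the in-memory `seen` index and the example URIs are assumptions for illustration only. The "sha1:" plus Base32 digest form follows a common crawler convention rather than a requirement, and the profile URI is the identical-payload-digest profile defined for WARC 1.0.

```python
import base64
import hashlib

# Revisit profile for "same payload digest as an earlier capture" (WARC 1.0).
IDENTICAL_PAYLOAD_PROFILE = (
    "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest"
)


def dedup_fields(seen, uri, date, payload):
    """Return the named fields for a capture (hypothetical helper).

    `seen` maps a payload digest to the (URI, date) of its first capture.
    """
    digest = "sha1:" + base64.b32encode(
        hashlib.sha1(payload).digest()).decode("ascii")
    if digest in seen:
        first_uri, first_date = seen[digest]
        # Payload already stored: point back to the original capture
        # using the two proposed fields instead of storing it again.
        return [
            ("WARC-Type", "revisit"),
            ("WARC-Target-URI", uri),
            ("WARC-Date", date),
            ("WARC-Profile", IDENTICAL_PAYLOAD_PROFILE),
            ("WARC-Payload-Digest", digest),
            ("WARC-Refers-To-Target-URI", first_uri),
            ("WARC-Refers-To-Date", first_date),
        ]
    seen[digest] = (uri, date)
    return [
        ("WARC-Type", "response"),
        ("WARC-Target-URI", uri),
        ("WARC-Date", date),
        ("WARC-Payload-Digest", digest),
    ]


seen = {}
first = dict(dedup_fields(seen, "http://example.org/logo.png",
                          "2015-04-01T10:00:00Z", b"PNGDATA"))
second = dict(dedup_fields(seen, "http://example.org/img/logo.png",
                           "2015-04-28T09:30:00Z", b"PNGDATA"))
```

With these two fields, an access tool can locate the original payload by URI and date alone, without needing to resolve a record identifier across WARC files.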
Revision needs – WARC for data mining
WAT: Web Archive Transformation
  – Specified by Internet Archive to store metadata extracted from WARC files
  – Metadata (HTML headers, HTML metadata, links…) recorded in metadata records with a JSON structure
WET: WARC Encapsulated Text
  – Designed by Common Crawl
  – Contains only text content extracted from WARC files
Official recommendation as informative appendix?
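The idea behind WAT (metadata extracted from a capture and stored as JSON) can be sketched in a few lines of Python. The JSON keys below (`target_uri`, `title`, `links`) are a simplified, hypothetical layout chosen for this example and are not the actual WAT schema; the class and function names are likewise assumptions.

```python
import json
from html.parser import HTMLParser


class LinkAndTitleParser(HTMLParser):
    """Collect the page title and the href of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(target_uri, html):
    """Return a JSON string suitable as the block of a metadata record."""
    parser = LinkAndTitleParser()
    parser.feed(html)
    parser.close()
    return json.dumps({
        "target_uri": target_uri,       # hypothetical key names
        "title": parser.title.strip(),
        "links": parser.links,
    }, sort_keys=True)


sample_html = ("<html><head><title> Demo </title></head>"
               "<body><a href='/a'>A</a><a name='x'>no link</a></body></html>")
metadata_json = extract_metadata("http://example.org/", sample_html)
```

A WET-style derivative would follow the same pattern but keep only the text passed to `handle_data`, stored in a conversion record rather than a metadata record.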
Revision needs – open questions
Is the WARC format suited for non-web material?
Is the WARC format suited for server-side archiving?
How to improve the use of unique IDs?