Repository Development Center Office of Strategic Initiatives Releasing Open Source at the Library of Congress Leslie Johnston 2009 LITA Forum


Repository Development Center
Office of Strategic Initiatives

Releasing Open Source at the Library of Congress

Leslie Johnston
2009 LITA Forum


STARTING DOWN A PATH TOWARDS BETTER CONTROL

• What are our most basic needs?
• What is the first step?
• How do we know what we have, where it is, and who it belongs to?
• How do we get files – new and legacy – from where they are to where they need to be?


IDENTIFYING THE TRANSFER PROBLEM SPACE

As part of its first phase of repository development, the Library of Congress is working on solutions for a category of activities that we refer to as “Transfer.” At a high level, we define transfer as including the following human- and machine-performed tasks:
• Adding digital content to the collections, whether from an external partner or created at LC;
• Moving digital content between storage systems (external and internal);
• Reviewing digital files for fixity, quality and/or authoritativeness; and
• Inventorying and recording transfer life cycle events for digital files.


RECENT TRANSFER EXPERIENCE

During 2008 the Library of Congress received:
• 30 TB from NDIIPP preservation partners,
• 20 TB in Web Capture crawls to preserve identified web sites,
• 30 TB from National Digital Newspaper Program (NDNP) partners, and
• 1 TB from World Digital Library partners.

• From 20 MB to over 2 TB in a single transfer retrieved over the network.
• Dozens of hard drives with licensed, partner and vendor supplied content.
• All forms of content, some to be dark archived for preservation, some limited to Library use, and some to be made publicly available.
• There is also newly internally digitized content that has to be managed.


DEVELOP A STANDARD AND TOOLS TO OPTIMIZE TRANSFERS

Motivating use cases:
• Transfer of content internally and between preservation partners.
• Long-term storage of content.

Needs:
• Minimally self-identifying and self-describing packages.
• Support for error detection and transfer optimization.

Characteristics:
• Low overhead.
• Content-type agnostic.
• Supported by off-the-shelf, easily supported tools.

BagIt: A Packaging Specification for File Transfers

Supports minimally self-identifying and self-describing packages with support for error detection and transfer optimization.

http://www.digitalpreservation.gov/library/resources/tools/docs/bagitspec.pdf


WHAT’S IN A BAG?

Package description: bag-info.txt

/data directory with contents

Manifest of contents with checksums
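The three parts listed above can be illustrated with a minimal Python sketch that writes a bag to disk. This is an illustration only, not the Library's tooling: the function name make_bag, the choice of SHA-256, and the bagit.txt declaration file with its placeholder version string are assumptions that do not appear on the slide.

```python
import hashlib
from pathlib import Path
from typing import Mapping


def make_bag(bag_dir: Path, payload: Mapping[str, bytes],
             info: Mapping[str, str], algorithm: str = "sha256") -> None:
    """Write a minimal bag: declaration, bag-info.txt, data/ and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for rel_name in sorted(payload):
        content = payload[rel_name]
        target = data_dir / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        # Each manifest entry pairs a checksum with a path relative to the bag root.
        digest = hashlib.new(algorithm, content).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel_name}")

    # Declaration file (assumption: not shown on the slide; version is a placeholder).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.96\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8")
    # Package description as simple "Label: value" lines.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8")
    # Manifest of contents with checksums.
    (bag_dir / f"manifest-{algorithm}.txt").write_text(
        "".join(line + "\n" for line in manifest_lines), encoding="utf-8")
```

Calling make_bag(Path("mybag"), {"page1.tif": image_bytes}, {"Source-Organization": "Example Library"}) would produce mybag/bagit.txt, mybag/bag-info.txt, mybag/manifest-sha256.txt and mybag/data/page1.tif, where image_bytes and the metadata value are hypothetical.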


TRANSFER TOOL DEVELOPMENT

• Parallel Retriever script: Efficient package transfer.
• Validation script: Validates Bags against the BagIt specification.
• VerifyIt script: Verifies that files are uncorrupted.
• BagIt Java Library (BIL): Used for application and command line tool development.
• Bagger Desktop application: Graphical desktop tool to create/update/validate Bags.
• LocDrop Web application: Supports partner registration of transfers, whether shipping a hard drive or sending over the network.
• Inventory System: Record lifecycle events for packages of Bags and files.
• Workflow Tools

To promote the use of BagIt in the Library and outside, tools were required to make the specification easy to use.
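The difference between validating a Bag (is the expected structure present?) and verifying it (do the files still match their checksums?) can be sketched in Python as follows. This is not the Library's Validation or VerifyIt script; the function names are hypothetical, the structural check covers only the data directory and manifest, and the manifest is assumed to be named manifest-<algorithm>.txt with one "checksum path" entry per line.

```python
import hashlib
from pathlib import Path
from typing import List


def validate_structure(bag_dir: Path) -> List[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    if not (bag_dir / "data").is_dir():
        problems.append("missing data/ directory")
    if not list(bag_dir.glob("manifest-*.txt")):
        problems.append("no manifest-<algorithm>.txt file")
    return problems


def verify_fixity(bag_dir: Path) -> List[str]:
    """Recompute manifest checksums; return paths that are missing or altered."""
    failures = []
    for manifest in sorted(bag_dir.glob("manifest-*.txt")):
        # The algorithm is read from the file name, e.g. manifest-sha256.txt.
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            expected, rel_path = line.split(None, 1)
            rel_path = rel_path.strip()
            target = bag_dir / rel_path
            if not target.is_file():
                failures.append(rel_path)
                continue
            # Whole-file read keeps the sketch short; large files would be
            # hashed in chunks instead.
            actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if actual != expected.lower():
                failures.append(rel_path)
    return failures
```

A receiving process might run validate_structure first and only go on to the more expensive verify_fixity when no structural problems are reported.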


TRANSFER TOOL DEVELOPMENT: BAGGER

Bagger Graphical Bag Authoring Tool

• Allows users to create generic Bags or Bags that meet specified project profiles.
• Provides project-specific templates that enforce project Bag descriptive metadata requirements.
• Built on top of the BagIt Java Library.
• Presents a range of options for compressed serialization and complete versus “holey” bags.
• Java Web Start version automatically checks for the most recent version to keep the tool updated.
• Standalone version is bundled with all necessary software and runs without requiring installation privileges.
• Runs on a PC or Mac.
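In the BagIt specification, a “holey” bag omits some payload files and lists them in a fetch.txt file, one "URL LENGTH FILENAME" entry per line, so that the receiver can retrieve them (the job a tool such as the Parallel Retriever performs). The Python sketch below parses such a file only; the names FetchEntry and parse_fetch are hypothetical, the example URLs are invented, and no downloading is attempted.

```python
from typing import List, NamedTuple, Optional


class FetchEntry(NamedTuple):
    url: str                # where the missing payload file can be retrieved
    length: Optional[int]   # size in bytes, or None when given as "-"
    path: str               # destination path inside the bag, e.g. data/a.tif


def parse_fetch(text: str) -> List[FetchEntry]:
    """Parse the contents of a fetch.txt file into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split into at most three fields; the path is everything after the
        # second run of whitespace.
        url, length, path = line.split(None, 2)
        size = None if length == "-" else int(length)
        entries.append(FetchEntry(url, size, path.strip()))
    return entries
```

Each returned entry gives a retriever what it needs to fill one "hole": a source URL, an optional expected size, and the path where the file belongs.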


USING BAGGER

Create and select a profile

Add files to the /data directory

Enter bag-info metadata


USING BAGGER

Completed bag with generated manifest


LOCDROP TOOL DEVELOPMENT

LocDrop is designed to support notification for transfers of content into the Library of Congress both from outside the Library and within the Library itself. The application currently lets you register network and physical media transfers (hard drives, CDs, DVDs, etc.) that the Library will retrieve. In later versions we expect to add the ability to launch network transfers directly.

LocDrop will simplify the processes to track content we expect to receive. Over time, we expect to connect this application to related services that will continually improve how we manage the transfer and receipt of materials from all sources.


USING LOCDROP

Register the information needed to track data shipments to and from the Library


USING LOCDROP

Register the information needed for the Library to retrieve network transfers


INVENTORY TOOL DEVELOPMENT

Record Package Events

Examples of Package Events include “Package Received Events,” which are recorded when a project receives a package; and “Package Accepted Events,” which are recorded when a project accepts curatorial responsibility for a package.

Record File Events

Examples of File Events include “File Copy Events,” which are recorded when a package is copied from one File Location to another; and “Quality Review Events,” which are recorded when quality review is performed.

For legacy collections the Inventory Tool can be pointed at existing file systems and directories to package, checksum, and record life cycle events to bring the files under initial control.

The Inventory Tool is implemented on top of our BIL Java Library.
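The event model described above can be sketched as a small append-only log in Python. This is a hypothetical illustration, not the Inventory Tool's actual data model (which is built on the Java library); the class names, fields, and package identifiers are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Event:
    kind: str        # e.g. "Package Received", "File Copy", "Quality Review"
    package_id: str  # hypothetical identifier for the package concerned
    detail: str = ""
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Inventory:
    events: List[Event] = field(default_factory=list)

    def record(self, kind: str, package_id: str, detail: str = "") -> Event:
        """Append a life cycle event; existing events are never modified."""
        event = Event(kind, package_id, detail)
        self.events.append(event)
        return event

    def history(self, package_id: str) -> List[Event]:
        """Return the events for one package in the order they were recorded."""
        return [e for e in self.events if e.package_id == package_id]
```

For example, recording "Package Received" and then "Package Accepted" for the same package yields a two-entry history that answers what happened to the package and when.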


USING THE INVENTORY TOOL

Running an Inventory operation


USING THE INVENTORY TOOL

Searching the Inventory, plus auditing, file count, space usage, and project-specific Inventory reports


WORKFLOW DEVELOPMENT

The Transfer components and Inventory Tool are tied together through multiple project-based Workflow systems.

Through case study development with stakeholders we identify the data flow and tasks to be performed.

Workflow tasks formalized through the system include transfer, validation by a format validation application, manual quality review inspection, and file copying to archival storage and production storage.

A workflow UI allows users to initiate, monitor and administer processes, and to notify the workflow engine of the outcome of manual tasks, including task completion.
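The idea of a workflow that waits on each task and is told its outcome can be sketched in Python as a simple linear sequence of steps. This is a hypothetical illustration only: the class name, the strictly linear ordering, and the step labels (adapted from the tasks named above) are assumptions, and the Library's project-based workflow systems are not described in enough detail here to reproduce.

```python
from typing import List, Optional

# Step labels adapted from the slide; the linear order is an assumption.
DEFAULT_STEPS = [
    "transfer",
    "format validation",
    "manual quality review",
    "copy to archival storage",
    "copy to production storage",
]


class Workflow:
    def __init__(self, package_id: str, steps: Optional[List[str]] = None):
        self.package_id = package_id
        self.steps = list(steps if steps is not None else DEFAULT_STEPS)
        self.position = 0
        self.failed_step: Optional[str] = None

    @property
    def current_step(self) -> Optional[str]:
        """The task being waited on, or None if the workflow has ended."""
        if self.failed_step is not None or self.position >= len(self.steps):
            return None
        return self.steps[self.position]

    def report(self, succeeded: bool) -> None:
        """Notify the workflow of the outcome of the current task."""
        step = self.current_step
        if step is None:
            raise RuntimeError("workflow is not waiting on any task")
        if succeeded:
            self.position += 1
        else:
            self.failed_step = step

    @property
    def status(self) -> str:
        if self.failed_step is not None:
            return f"failed at {self.failed_step}"
        if self.position >= len(self.steps):
            return "complete"
        return f"waiting on {self.steps[self.position]}"
```

A reviewer finishing a manual quality review would call report(True) or report(False), which is the kind of notification the workflow UI passes to the engine.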


RUNNING A WORKFLOW

Starting, searching, and monitoring workflows


RUNNING A WORKFLOW

Updating an in-progress workflow


INITIATING THE OPEN SOURCE RELEASE

It was decided that the three utility scripts – the key tools needed for the movement and validation of Bagged content – should be the first candidates for open source release.

The scripts were submitted to the Office of General Counsel at the Library for review. This review included close scrutiny by the attorneys in the office for everything from purpose (automating a process) to originality (determining that no code came from any other licensed sources) to authorship (Library staff versus Library contractors).

Because contractual obligations with a contracting company prohibited a straightforward public domain release, the three scripts were released on SourceForge in December 2008 under a BSD license. http://sourceforge.net/projects/loc-xferutils/


CONTINUING THE OPEN SOURCE RELEASE

The next vital release had to be BIL (the BagIt Library), a Java library developed to support Bag services.

A barrier to uptake of the BagIt specification was the lack of a way to automate the Bagging process and to support the development of tools. BIL supports key functionality such as creating, manipulating, validating, and verifying Bags, as well as the uploading of Bags using the SWORD deposit protocol.

The review of BIL for open source release by the Office of General Counsel was a more complex affair. There was a single author who was a Library staff member, but there were thirteen bundled dependencies each with their own licenses to be reviewed.

BIL was released into the public domain with the understanding that those licenses restricted any bundling of BIL and its dependencies into new tools by others, but in no way restricted the release.

BIL was released as both compiled and source code in June 2009.


MANAGING THE RELEASE

At the time of both releases the Library made a conscious decision to just release the code, and not take advantage of the SourceForge functionality that supports the committing of code back into the project.

These were three relatively simple scripts and it seemed to make the most sense to release them and let others work with them or use them to model their own development.

No one was available at the time who could devote the effort needed to manage a full-blown open source project.

The scripts can be updated by anyone in the community for their use. The Library has committed to releasing its updates to BIL. Updates to the source code are expected and welcome through the Digital Curation group.


UPCOMING RELEASES

The Bagger application is nearing the completion of its development and partner testing. Bagger is meant to provide a graphical desktop tool for the Bagging of content, ideally requiring no client-side IT support or infrastructure.

It is implemented as a Java Web Start application for use across platforms, as well as a standalone version with its own bundled, stripped-down JRE, and supports the aggregation of files into Bag packages, including the creation of checksum manifests and Bag information files. It is developed on top of BIL.

The Bagger review includes the proposed release of three variants – the Java Web Start version, and standalone versions for the PC and Mac – as well as the source code.

The review encompasses a number of bundled dependencies, including the redistribution license for Java.


BUILDING A COMMUNITY

The BagIt specification was posted on the Library of Congress and California Digital Library sites and as an Internet “Request for Comment” (RFC).

The BagIt specification will also be released on SourceForge to promote wider dissemination, discussion, and community building.

BagIt and the tools have been promoted to partners from three different initiatives, blogged, tweeted, shared on Facebook, presented at conferences, described in the Library’s Digital Preservation Newsletter, described in email sent to listservs, discussed in a Google group, and written up in journal articles.

The team launched a Digital Curation Google group in part to support the activities of this increasingly participatory community and encourage open, public discussion. http://groups.google.com/group/digital-curation

The most effective strategy for building a community was adoption by the NDIIPP partner institutions. NDIIPP strongly encouraged partners to “bag” their content for their preservation transfers to the Library.


BUILDING A COMMUNITY

The Library moved into new modes of promotion and community building, including development of an introductory video featuring Brian Vargas, one of the authors of the specification.

http://www.digitalpreservation.gov/videos/bagit0609.html


SUCCESSES FOR THE RELEASE

How is the success of this initiative measured?
• There have been close to 300 downloads from the SourceForge site.
• The Google group has over 120 participants.
• A significant percentage of the 130 NDIIPP partners have used the BagIt specification in their preservation transfers to the Library.

The Library recently became aware of the open source Ruby BagIt, a Ruby Gem released in early 2009 to support use of the specification.

http://rubyforge.org/projects/bagit/


OUTCOMES FOR THE LIBRARY

The Library's first Open Source software release. http://sourceforge.net/projects/loc-xferutils/

BagIt is in use with multiple NDIIPP partners, in the eDeposit pilot project, and for the packaging and transport of file packages internally.

Gradual development of graphical workflow tools for all active projects

The transfer of partner content has informed the Library’s own preservation efforts, building our understanding about what we need to know about files and what events in their life cycle we need to record and track.

The Inventory Tool will support the Library's initial efforts in a file-level preservation audit.

Put all tools and services into full production during 2009


Questions?

Leslie Johnston
