111
Update at CNI, April 14, 2015 Chip German, APTrust Andrew Diamond, APTrust Nathan Tallman, University of Cincinnati Linda Newman, University of Cincinnati Jamie Little, University of Miami APTrust Web Site: aptrust.org

APTrust CNI Seattle April 2015

Embed Size (px)

Citation preview

Page 1: APTrust CNI Seattle April 2015

Update at CNI, April 14, 2015

Chip German, APTrustAndrew Diamond, APTrust

Nathan Tallman, University of CincinnatiLinda Newman, University of Cincinnati

Jamie Little, University of Miami

APTrust Web Site: aptrust.org

Page 2: APTrust CNI Seattle April 2015

Sustaining Members:

Columbia UniversityIndiana UniversityJohns Hopkins UniversityNorth Carolina State UniversityPenn StateSyracuse UniversityUniversity of ChicagoUniversity of Cincinnati

University of ConnecticutUniversity of MarylandUniversity of MiamiUniversity of MichiganUniversity of North CarolinaUniversity of Notre DameUniversity of VirginiaVirginia Tech

Page 3: APTrust CNI Seattle April 2015

• Quick description of APTrust: The Academic Preservation Trust (APTrust) is a consortium of higher education institutions committed to providing both a preservation repository for digital content and collaboratively developed services related to that content.  The APTrust repository accepts digital materials in all formats from member institutions, and provides redundant storage in the cloud.  It is managed and operated by the University of Virginia and is both a deposit location and a replicating node for the Digital Preservation Network (DPN).

• 39 of the registrants for CNI in Seattle are from APTrust sustaining-member institutions

Page 4: APTrust CNI Seattle April 2015

APTrustHow it works

Andrew DiamondSenior Preservation & Systems Engineer

Page 5: APTrust CNI Seattle April 2015
Page 6: APTrust CNI Seattle April 2015
Page 7: APTrust CNI Seattle April 2015
Page 8: APTrust CNI Seattle April 2015
Page 9: APTrust CNI Seattle April 2015

Behind the Scenes, Step by Step

Page 10: APTrust CNI Seattle April 2015
Page 11: APTrust CNI Seattle April 2015
Page 12: APTrust CNI Seattle April 2015
Page 13: APTrust CNI Seattle April 2015
Page 14: APTrust CNI Seattle April 2015
Page 15: APTrust CNI Seattle April 2015
Page 16: APTrust CNI Seattle April 2015

What You See,Step by Step

Page 17: APTrust CNI Seattle April 2015

Validation Pending

Page 18: APTrust CNI Seattle April 2015

Validation Succeeded (Detail)

Page 19: APTrust CNI Seattle April 2015

Validation Failed (Detail)

Page 20: APTrust CNI Seattle April 2015

Storage Pending

Page 21: APTrust CNI Seattle April 2015

Record Pending

Page 22: APTrust CNI Seattle April 2015

Record Succeeded

Page 23: APTrust CNI Seattle April 2015

List of Ingested Objects

Page 24: APTrust CNI Seattle April 2015

Object Detail

Page 25: APTrust CNI Seattle April 2015

List of Files

Page 26: APTrust CNI Seattle April 2015

File Detail

Page 27: APTrust CNI Seattle April 2015
Page 28: APTrust CNI Seattle April 2015
Page 29: APTrust CNI Seattle April 2015
Page 30: APTrust CNI Seattle April 2015

Pending Tasks: Restore & Delete

Page 31: APTrust CNI Seattle April 2015

Completed Tasks: Restore & Delete

Page 32: APTrust CNI Seattle April 2015

Detail of Completed Restore

Page 33: APTrust CNI Seattle April 2015

Detail of Completed Delete

Page 34: APTrust CNI Seattle April 2015

A Note on Bag Storage & Restoration

Page 35: APTrust CNI Seattle April 2015

Bag Storage

Page 36: APTrust CNI Seattle April 2015

Bag Restoration

Page 37: APTrust CNI Seattle April 2015

Where’s my data?

Page 38: APTrust CNI Seattle April 2015

On multiple disks in Northern Virginia.

S3

Glacier

On hard disk, tape or optical disk in Oregon.

Page 39: APTrust CNI Seattle April 2015

S3 = Simple Storage Service

Page 40: APTrust CNI Seattle April 2015

S3 = Simple Storage Service

99.999999999% Durability (9 nines)

Page 41: APTrust CNI Seattle April 2015

S3 = Simple Storage Service

99.999999999% Durability (9 nines)

Each file stored on multiple devices in multiple data centers

Page 42: APTrust CNI Seattle April 2015

S3 = Simple Storage Service

99.999999999% Durability (9 nines)

Each file stored on multiple devices in multiple data centersFixity check on transfer

Page 43: APTrust CNI Seattle April 2015

S3 = Simple Storage Service

99.999999999% Durability (9 nines)

Each file stored on multiple devices in multiple data centersFixity check on transfer

Ongoing data integrity checks & self-healing

Page 44: APTrust CNI Seattle April 2015

Glacier

Long-term storage with 99.999999999% durability (11 nines)

Not intended for frequent access

Takes five hours or so to restore a file

Page 45: APTrust CNI Seattle April 2015

Chances of losing an archived object?

Page 46: APTrust CNI Seattle April 2015

Chances of losing an archived object?

1 in 100,000,000,000

S3

Page 47: APTrust CNI Seattle April 2015

Chances of losing an archived object?

1 in 100,000,000,000

S3

Glacier

1 in 10,000,000,000,000

Page 48: APTrust CNI Seattle April 2015

Chances of losing an archived object?

1 in 100,000,000,000S3

Glacier1 in 10,000,000,000,000

1 in 1,000,000,000,000,000,000,000,000 chance of losing both to separate events

Page 49: APTrust CNI Seattle April 2015

Chances of losing an archived object?

1 in 100,000,000,000S3

Glacier1 in 10,000,000,000,000

1 in 1,000,000,000,000,000,000,000,000 chance of losing both to separate events

Very Slim

Page 50: APTrust CNI Seattle April 2015

Protecting Against FailureDuring Ingest

Page 51: APTrust CNI Seattle April 2015

Protecting Against FailureDuring Ingest

Microservices with state

Resume intelligently after failure

Page 52: APTrust CNI Seattle April 2015
Page 53: APTrust CNI Seattle April 2015
Page 54: APTrust CNI Seattle April 2015
Page 55: APTrust CNI Seattle April 2015
Page 56: APTrust CNI Seattle April 2015
Page 57: APTrust CNI Seattle April 2015

Protecting Against FailureAfter Ingest

Page 58: APTrust CNI Seattle April 2015

Protecting Against FailureAfter Ingest

Objects & Files

Metadata & Events

Page 59: APTrust CNI Seattle April 2015
Page 60: APTrust CNI Seattle April 2015
Page 61: APTrust CNI Seattle April 2015
Page 62: APTrust CNI Seattle April 2015
Page 63: APTrust CNI Seattle April 2015

What if...

Page 64: APTrust CNI Seattle April 2015

What if ingest services stop?

They restart themselves and pick up where they left off.

Page 65: APTrust CNI Seattle April 2015

What if ingest server dies?

We spin up a new one in a few hours.

Page 66: APTrust CNI Seattle April 2015

What if Fedora service stops?

We restart it manually.

Page 67: APTrust CNI Seattle April 2015

What if Fedora server dies?

We can build a new one from the data on the hot backup in Oregon.

Page 68: APTrust CNI Seattle April 2015

What if we lose all of Glacier in Oregon?

We have all files in S3 in Virginia. Ingest services and Fedora are not affected.

Page 69: APTrust CNI Seattle April 2015

What if we lose the whole Virginia data center?

Fedora IngestServices

Metadata and Events

Files and Objects

S3

Page 70: APTrust CNI Seattle April 2015

What if we lose the whole Virginia data center?

IngestServices

We can rebuild services from code and config files in hours.

Page 71: APTrust CNI Seattle April 2015

What if we lose the whole Virginia data center?

We can restore all files from Glacier in Oregon.

Files and Objects

S3

Page 72: APTrust CNI Seattle April 2015

What if we lose the whole Virginia data center?

We can restore all metadata and events from the Fedora backup in Oregon.Fedora

Metadata and Events

Page 73: APTrust CNI Seattle April 2015

Metadata stored with files

Page 74: APTrust CNI Seattle April 2015

Attached metadata for last-ditch recovery

Page 75: APTrust CNI Seattle April 2015

To lose everything...

Page 76: APTrust CNI Seattle April 2015

To lose everything...

A major catastrophic event in Virginia and Oregon, or...

Page 77: APTrust CNI Seattle April 2015

To lose everything...

Don’t pay the AWS bill

A major catastrophic event in Virginia and Oregon, or...

Page 78: APTrust CNI Seattle April 2015

Partner Tools

Page 79: APTrust CNI Seattle April 2015

Partner ToolsBag ManagementUploadDownloadList Delete

Page 80: APTrust CNI Seattle April 2015

Partner ToolsBag ManagementUploadDownloadList Delete

...or use open-source AWS libraries for Java, Python, Ruby, etc.

Page 81: APTrust CNI Seattle April 2015

Partner ToolsBag Management

Page 82: APTrust CNI Seattle April 2015

Partner ToolsBag Validation

Validate a bag before you upload it.

Get specific details about what to fix.

Page 83: APTrust CNI Seattle April 2015

Partner ToolsBag Validation

Page 84: APTrust CNI Seattle April 2015

Partner ToolsBag Management & Validation

Download from the APTrust wiki

https://sites.google.com/a/aptrust.org/aptrust-wiki/home/partner-tools

Page 85: APTrust CNI Seattle April 2015

APTrust

Repositoryhttps://repository.aptrust.org

Test Sitehttp://test.aptrust.org

Wikihttps://sites.google.com/a/aptrust.org/aptrust-wiki/

[email protected]

Page 86: APTrust CNI Seattle April 2015

Learning to RunUniversity of Cincinnati Libraries path to

contributing content to APTrust

Page 87: APTrust CNI Seattle April 2015

Digital Collection Environment● 3 digital repositories

o DSpace -- UC Digital Resource Commonso LUNA -- UC Libraries Image and Media Collectionso Hydra -- Scholar@UC Institutional Repository

● 1 website● 1 Storage Area Network file system

Page 88: APTrust CNI Seattle April 2015

Workflow Considerations● Preservation vs. Recovery

o platform-specific bags or platform-agnostic bags?o derivative files

● Bagging levelo record level or collection level?

Page 89: APTrust CNI Seattle April 2015

UC Libraries Workflow Choices● Recovery is as important as preservation

o Include platform specific files and derivatives● Track platform provenance

o use bag naming convention or additional fields in bag-info.text

● Track collection membershipo use bag naming convention or additional fields in

bag-info.text

Page 90: APTrust CNI Seattle April 2015

Bagging (crawling)

1st iteration● Start with DSpace, Cincinnati Subway

and Street Improvements collection● Replication Task Suite to produce Bags

o one bag per record● Bash scripts to transform to APTrust Bags● Bash scripts to upload to test repository● Worked well, but…

Scripts available at https://github.com/ntallman/APTrust_tools.

Page 91: APTrust CNI Seattle April 2015

Bagging (walking)

2nd iteration● Same starting place● Export collection as Simple Archive

Formato ~ 8,800 records, ~ 375 GB

● Create bags using bagit-pythono 250 GB is bag-size upper limit

Page 92: APTrust CNI Seattle April 2015

Bagging (walking)

2nd iteration (continued)● Collection had three series

o two series under 250 GB▪ each series was put into a single bag

o one series over 250 GB▪ used a multi-part bag to send to APTrust,

APTrust merges split bags into one

● Bash script to upload bags to production repository

Page 93: APTrust CNI Seattle April 2015

Bagging (running)

3rd iteration● Testing of tools created by Jamie Little of

University of Miami

● Jamie can tell you more but…

Page 94: APTrust CNI Seattle April 2015

Long-Term Strategy (marathons)● Bag for preservation and recovery● For most content

o one bag per collection▪ except for Scholar@UC

o manual curation and bagging▪ except for Scholar@UC

● Use bag-info.txt to capture platform provenance and collection membership

Page 95: APTrust CNI Seattle April 2015

at

Page 96: APTrust CNI Seattle April 2015

UM Digital Collections Environment● File system based long-term storage● CONTENTdm for public access

Page 97: APTrust CNI Seattle April 2015

Creating a workflow● Began at the command line working with

LOC’s BagIt Java Library ● Bash scripts● Our initial testing used technical criteria

(size, structural complexity, file types)

Page 98: APTrust CNI Seattle April 2015

What to put in the bag?● CONTENTdm metadata● FITS technical metadata● We maintained our file system structure in

the bags

Page 99: APTrust CNI Seattle April 2015

Beginning APTrust Testing● Began uploading whole collections as bags● Early attempts at a user interface using

Zenity

Page 100: APTrust CNI Seattle April 2015

UM’s APTrust Automation● Sinatra based server application● HTML & JavaScript interface

Page 101: APTrust CNI Seattle April 2015

Starting Screen

Page 102: APTrust CNI Seattle April 2015

Choosing a Repository ID

Page 103: APTrust CNI Seattle April 2015

Choosing Files

Page 104: APTrust CNI Seattle April 2015

Viewing Download Results

Page 105: APTrust CNI Seattle April 2015

Final Upload

Page 106: APTrust CNI Seattle April 2015

Bagging Strategies

Moved from bag per collection strategy to bag per file

Changed interface to allow selection of individual files

Page 107: APTrust CNI Seattle April 2015

APTrust Benefits for Developers● Wide range of open source tools available● Uploading to APTrust is no different than

uploading to S3

Page 108: APTrust CNI Seattle April 2015

Where to find us

github.com/UMiamiLibraries

http://merrick.library.miami.edu/

Page 109: APTrust CNI Seattle April 2015

Why are we doing this again?● Libraries’ and Archives’ historic commitment to long-

term preservation is the raison d'être, a core justification, for our support of repositories. It’s why our stakeholders trust us and will place their content in our repositories.

● We expect APTrust to be our long-term preservation repository to meet our responsibilities to our stakeholders.

Page 110: APTrust CNI Seattle April 2015

Why are we doing this again?● We know from some experience, and reports from our

peers, that backup procedures, while still essential, are fraught with holes.

● We expect APTrust to be our failover if any near-term restore from backup proves incomplete.

Page 111: APTrust CNI Seattle April 2015

Linda NewmanHead, Digital Collections and [email protected]

UC Libraries, Digital Collections and Repositories

Nathan TallmanDigital Content Strategist

[email protected]

http://digital.libraries.uc.edu https://scholar.uc.edu https://github.com/uclibs