HATHITRUST A Shared Digital Repository The HathiTrust Digital Repository: Under the hood SI 625 April 20, 2015 Jeremy York, Assistant Director, HathiTrust Unless otherwise noted, these slides and their contents are licensed under a Creative Commons Attribution Unported License .


Page 1

HATHITRUST A Shared Digital Repository

The HathiTrust Digital Repository: Under the hood

SI 625 April 20, 2015

Jeremy York, Assistant Director, HathiTrust

Unless otherwise noted, these slides and their contents are licensed under a Creative Commons Attribution Unported License.

Page 2

Outline

• Introduction
• Underlying Ideas
• Repository and Services

Page 3

Introduction

Page 4

HathiTrust Members

Allegheny College
American University of Beirut
Arizona State University
Auburn University
Baylor University
Boston College
Boston University
Brandeis University
Brown University
California Digital Library
Carnegie Mellon University
Case Western Reserve
Colby College
Columbia University
Cornell University
Dartmouth College
Duke University
Emory University
Getty Research Institute
Georgetown University
Georgia Tech
Harvard University Library
Indiana University
Iowa State University
Johns Hopkins University
Kansas State University
Lafayette College
Library of Congress
Massachusetts Institute of Technology
McGill University
Michigan State University
Montana State University
Mount Holyoke College
New York Public Library
New York University
North Carolina Central University
North Carolina State University
Northeastern University
Northwestern University
The Ohio State University
Oklahoma State University
Penn State
Princeton University
Purdue University
Rutgers University
Stanford University
State University System of Florida
Swarthmore College
Syracuse University
Temple University
Texas A&M University
Texas Tech
Tufts University
Universidad Complutense de Madrid
University of Alabama
University of Alberta
University of Arizona
University of British Columbia
University of Calgary
University of California (Berkeley, Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San Francisco, Santa Barbara, Santa Cruz)
The University of Chicago
University of Connecticut
University of Delaware
University of Houston
University of Illinois
University of Illinois at Chicago
The University of Iowa
University of Kansas
University of Maine
University of Maryland
University of Massachusetts, Amherst
University of Miami
University of Michigan
University of Minnesota
University of Missouri
University of Nebraska-Lincoln
University of New Mexico
The University of North Carolina at Chapel Hill
University of Notre Dame
University of Oklahoma
University of Pennsylvania
University of Pittsburgh
University of Queensland
University of Tennessee, Knoxville
University of Texas
University of Utah
University of Vermont
University of Virginia
University of Washington
University of Wisconsin-Madison
Utah State University
Vanderbilt University
Virginia Tech
Wake Forest University
Washington University
Yale University Library

Page 5

Digital Repository

• Launched 2008
• Initial focus on digitized book and journal content
  – 13.3 million total volumes
  – 6.7 million book titles
  – 350,000 serial titles
  – 5 million public domain (~38%)

Page 6

The Name

• The meaning behind the name
  – Hathi (hah-tee): Hindi for elephant
  – Big, strong
  – Never forgets, wise
  – Secure
  – Trustworthy

Page 7

Mission

To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge

Page 8

Universal Library

Common Goal

Single Entity, Many Partners

HathiTrust

Page 9

Collections and Collaboration

• Comprehensive collection
  – Preservation…with Access

• Shared strategies
  – Copyright
  – Collection management, development
  – Preservation
  – Discovery / Use
  – Bibliographic Indeterminacy
  – Efficient user services

• Public Good

Page 10

Content

Page 11

[Chart: horizontal axis runs from 10/1/08 to 1/1/15; vertical axis runs from 0% to 100%. The legend lists the following contributors with their counts.]

1. Michigan 4,712,752
2. California 3,612,596
3. Harvard 838,115
4. Wisconsin 561,094
5. Indiana 529,601
6. Cornell 510,286
7. Penn State 388,713
8. Illinois 329,136
9. NYPL 294,883
10. Princeton 252,837
11. Minnesota 193,124
12. Madrid 117,291
13. Library of Congress 108,892
14. Keio University 90,112

Page 12

Dates

2000-2009: 9%
1990-1999: 13%
1980-1989: 14%
1970-1979: 12%
1960-1969: 10%
1950-1959: 5%
1940-1949: 3%
1930-1939: 4%
1920-1929: 4%
1910-1919: 5%
1900-1909: 5%
1850-1899: 12%
1800-1849: 3%
1700-1799: 0.01%
1600-1699: 0.01%
0-1500: 0.04%

Page 13

Language Distribution (1)

English: 58%
German: 11%
French: 8%
Spanish: 5%
Chinese: 4%
Russian: 4%
Japanese: 4%
Italian: 3%
Arabic: 2%
Latin: 2%

The top 10 languages make up ~87% of all content

Page 14

Language Distribution (2)

Portuguese: 7%
Undetermined: 7%
Polish: 7%
Dutch: 6%
Hebrew: 5%
Hindi: 4%
Swedish: 4%
Indonesian-for-Bill-Only!: 4%
Korean: 3%
Danish: 3%
Czech: 3%
Turkish: 3%
Thai: 3%
Urdu: 3%
Croatian: 2%
Hungarian: 2%
Persian: 2%
Norwegian: 2%
Tamil: 2%
No linguistic content: 2%
Bengali: 2%
Ukrainian: 2%
Sanskrit: 2%
Greek, Modern (1453-): 2%
Serbian: 1%
Romanian: 1%
Bulgarian: 1%
Greek, Ancient (to 1453): 1%
Vietnamese: 1%
Armenian: 1%
Marathi: 1%
Catalan: 1%
Panjabi: 1%
Finnish: 1%
Telugu: 1%
Multiple languages: 1%
Malay: 1%
Slovak: 1%
Slovenian: 1%
Malayalam: 1%
Yiddish: 1%

The next 40 languages make up ~12% of total

Page 15

HathiTrust and other e-databases

[Bar chart: Journals and Books counts for Elsevier, Amazon, HathiTrust, HathiTrust*, EBL, YBP, and EBSCO; axis from 0 to 8,000,000]

Page 16

Content Distribution

Limited View: 62%
Public Domain: 18%
Public Domain (US): 14%
US Fed GovDocs: 5%
Creative Commons: 0.08%
Open Access: 0.06%

Page 17

Underlying Ideas

Page 18

Underlying ideas

• Community
• Scale
• Access and Preservation
• Openness

Page 19

Community

Page 20

Community

Page 21

Community

• OAIS
• TRAC
• METS and PREMIS
• Repository Practices
  – Content
  – Reference
  – Fixity

Page 22

Scale

• Mission
  – To contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge
• Strategy
  – “Co-owned and managed”

Page 23

Preservation and Access

• We engage in preservation for purposes of access
• “Light” archive benefits
  – Access to materials
  – Checks on integrity
  – Best chance for content to be used and valued, preserved

Page 24

Openness

• Repository centralized...open
• Formats
• Software
• Organizational structure

Page 25

Underlying ideas

Page 26

Underlying ideas

Experience

Page 27

What’s Missing?

• What should be included in the AIP?
• What should be validated?
• How should content be identified?
• How to operate at scale: managing preservation information (PREMIS) and access information in a rational way at scale
• ...

Page 28

Repository Philosophy/Design

• OAIS/TRAC

• Consistency

• Standardization

• Simplicity (in design, not function)

• Practicality

• Sustainability

Page 29

[Architecture diagram. Labels: Source (Bibliographic Data, Content Package); Ingest; Storage (Michigan, Indiana); Data Management (Bib Data, Rights Data, Holdings Data); Access (Catalog, Full-text Search, PageTurner, APIs, Collections, Datasets); TDR]

Page 30

Building the Digital Repository

• Shared infrastructure
  – Centralized
    • Administration: ingest, validation, content integrity
    • Functionality: full-text search, viewing, print on demand
  – Geographically distributed
    • In terms of location, coding, service development, digitization, content preparation

Page 31

[Architecture diagram. Labels: Source (Bibliographic Data, Content Package); Ingest; Storage (Michigan, Indiana); Data Management (Bib Data, Rights Data, Holdings Data); Access (Catalog, Full-text Search, PageTurner, APIs, Collections, Datasets)]

Page 32

Content

• Selection of content for digitization and preservation
• Types of materials
• Technology
  – Largely uniform in technical characteristics
  – 3 formats
    • ITU G4 TIFF
    • JPEG2000
    • Unicode (with and without coordinates)

Page 33

Content Package

[Diagram labels: images, text, and Source METS in a Zip; HTMETS]

Page 34

[Diagram labels: Source, Bibliographic Data, Content Package, Ingest]

Rigorous validation to ensure conformance with specifications:
• Resolution, image metadata
• Barcode
• Fixity
• Consistency
• Well-formedness
• Prepare archival package

Page 35

[Diagram labels: Source, Bibliographic Data, Content Package, Ingest]

More about ingest
• New Digitization
• Existing Digitization
• http://www.hathitrust.org/ingest

Ingest checklist
• Deposit Forms
• Bibliographic metadata specifications
• http://www.hathitrust.org/ingest_checklist

Ingest tools
• Tools for validating, remediating, packaging
• Detailed content specifications
• http://www.hathitrust.org/ingest_tools

Deposit Guidelines
• Policies
• http://www.hathitrust.org/deposit_guidelines

Example METS files and METS profile
• http://www.hathitrust.org/digital_object_specifications

Page 36

[Architecture diagram. Labels: Source (Bibliographic Data, Content Package); Ingest; Storage (Michigan, Indiana); Data Management (Bib Data, Rights Data, Holdings Data); Access (Catalog, Full-text Search, PageTurner, APIs, Collections, Datasets)]

Page 37

[Diagram labels: Data Management (Bib Data, Rights Data, Holdings Data)]

Bibliographic Data
• Inventory
• Loading and updating records
• Duplicate detection and collation
• Source of information for VuFind catalog, APIs
• Rights determination (automated, and support for manual review)

Page 38

[Diagram labels: Data Management (Bib Data, Rights Data, Holdings Data)]

namespace  id              attr  reason  source  user      time                 note
Inu        30000000078026  2     1       1       Jhovater  2009-10-15 23:30:23  NULL

Page 39

Rights Attributes

id  name         type       dscr
1   pd           copyright  public domain
2   ic           copyright  in-copyright
3   opb          copyright  out-of-print and brittle (implies in-copyright)
4   orph         copyright  copyright-orphaned (implies in-copyright)
5   und          copyright  undetermined copyright status
6   umall        access     available to UM affiliates and walk-in patrons (all campuses)
7   world        access     available to everyone in the world
8   nobody       access     available to nobody; blocked for all users
9   pdus         copyright  public domain only when viewed in the US
10  cc-by        copyright  Creative Commons Attribution
11  cc-by-nd     copyright  Creative Commons Attribution-NoDerivatives
12  cc-by-nc-nd  copyright  Creative Commons Attribution-NonCommercial-NoDerivatives
13  cc-by-nc     copyright  Creative Commons Attribution-NonCommercial
14  cc-by-nc-sa  copyright  Creative Commons Attribution-NonCommercial-ShareAlike
15  cc-by-sa     copyright  Creative Commons Attribution-ShareAlike
16  orphcand     copyright  orphan candidate - in 90-day holding period (implies in-copyright)
17  cc-zero      copyright  Creative Commons Zero license (implies pd)
18  und-world    copyright  undetermined copyright status and permitted as world-viewable by the depositor
19  ic-us        copyright  in copyright in the US

Page 40

Rights Determination Reason Codes

id  name  dscr
1   bib   bibliographically-derived by automatic processes
2   ncn   no printed copyright notice
3   con   contractual agreement with copyright holder on file
4   ddd   due diligence documentation on file
5   man   manual access control override; see note for details
6   pvt   private personal information visible
7   ren   copyright renewal research was conducted
8   nfi   needs further investigation (copyright research partially complete; an ambiguous, unclear, or other time-consuming situation was encountered)
9   cdpp  title page or verso contain copyright date and/or place of publication information not in bib record
10  cip   condition review and in-print status research was conducted
11  unp   unpublished work
12  gfv   Google viewability set at VIEW_FULL
13  crms  derived from multiple reviews in the Copyright Review Management System (CRMS) via an internal resolution policy; consult CRMS records for details
14  add   author death date research was conducted or notification was received from authoritative source
15  exp   expiration of copyright term for non-US work with corporate author
16  del   deleted from repository; see note for details
17  gatt  non-US public domain work restored to in-copyright in the US by GATT
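
As a rough illustration of how a rights row relates to the attribute and reason code tables, the Python sketch below decodes the sample row shown on the rights data slide. The names `ATTRIBUTES`, `REASONS`, `RightsRow`, and `describe` are hypothetical and are not HathiTrust code; only a few codes are included, and the namespace and user are written in lowercase here as an assumption.

```python
from typing import NamedTuple

# Subset of the attribute and reason codes listed on the slides.
ATTRIBUTES = {1: "pd", 2: "ic", 5: "und", 9: "pdus"}
REASONS = {1: "bib", 2: "ncn", 7: "ren", 13: "crms"}


class RightsRow(NamedTuple):
    """One row of the rights table, using the column names from the slide."""
    namespace: str
    id: str
    attr: int
    reason: int
    source: int
    user: str
    time: str
    note: str


def describe(row: RightsRow) -> str:
    """Return a readable summary: '<namespace>.<id>: <attribute> (<reason>)'."""
    attr = ATTRIBUTES.get(row.attr, f"attr {row.attr}")
    reason = REASONS.get(row.reason, f"reason {row.reason}")
    return f"{row.namespace}.{row.id}: {attr} ({reason})"


# The sample row from the slide (lowercased namespace and user are assumptions).
sample = RightsRow("inu", "30000000078026", 2, 1, 1, "jhovater",
                   "2009-10-15 23:30:23", "NULL")
print(describe(sample))
```

Read this way, the sample row says the volume is in-copyright (attribute 2) and that the determination was bibliographically derived by automatic processes (reason 1).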

Page 41

Access Determinations

• Automated
• Manual

Page 42

Automatic Rights Determination

• Conducted on all works at time of ingest and when records are modified
  – Public domain worldwide
    • US works published before 1923, US federal government publications, non-US works published prior to 1873
  – Public domain in the United States
    • Non-US works published prior to 1923
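
The date rules on this slide can be sketched as a small function. This is a minimal illustration, not the repository's actual logic: the function name, its parameters, and the fallback values (`ic` for everything not covered by the listed rules, `und` when no date is known) are assumptions; only the 1923 and 1873 cutoffs and the government-document rule come from the slide.

```python
from typing import Optional


def automatic_rights(pub_year: Optional[int], us_work: bool,
                     us_fed_doc: bool = False) -> str:
    """Return a rights attribute name using the cutoffs stated on the slide."""
    if us_fed_doc:
        return "pd"            # US federal government publications
    if pub_year is None:
        return "und"           # assumption: no date means undetermined
    if us_work:
        # US works published before 1923 are public domain worldwide.
        return "pd" if pub_year < 1923 else "ic"   # "ic" fallback is an assumption
    if pub_year < 1873:
        return "pd"            # non-US works published prior to 1873
    if pub_year < 1923:
        return "pdus"          # non-US works prior to 1923: public domain in the US
    return "ic"                # assumption: everything else stays closed


for args in [(1900, True), (1900, False), (1860, False), (1950, True)]:
    print(args, automatic_rights(*args))
```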

Page 43

Manual Rights Determination

• IMLS-funded CRMS project
  – CRMS-US
    • 2008: US-published works 1923-1963
    • Staff at 4 partner institutions
  – CRMS-World
    • 2011: Expanded to non-US works
    • Staff at 16 partner institutions
  – Double review with additional expert review for conflicts
  – Compliance with copyright formalities
  – As of March 2015: 511,520 reviewed, 270,979 opened
• Rights Holder Permissions

Page 44

• System of Precedence

Rights Database

Bibliographic (automatic)

Manual

Page 45

[Diagram labels: Data Management (Bib Data, Rights Data, Holdings Data)]

Single-part monographs: OCLC #; Local system ID; Timestamp; Holding Status; Condition

Multi-part monographs: Include enumeration and chronology

Serials: OCLC #; Local system ID; Timestamp; ISSN

Page 46

[Architecture diagram. Labels: Source (Bibliographic Data, Content Package); Ingest; Storage (Michigan, Indiana); Data Management (Bib Data, Rights Data, Holdings Data); Access (Catalog, Full-text Search, PageTurner, APIs, Collections, Datasets)]

Page 47

Storage

Michigan, Indiana

• Reliability – ensure integrity
• Redundancy – in single and multiple sites
• Scalability – including ease of management
• Accessibility – for repository processes and services
• Platform-independence – for data/object management

Page 48

Storage

Michigan, Indiana

EMC Isilon storage
• Disk-based
• Load-balancing and fail-over
• Internal redundancy (N+3)
• Efficient, reliable replication (daily)
• Scalable (single file system up to 5 petabytes)

Page 49

Storage

Michigan, Indiana

Object integrity
• Continual checks on data integrity
• Detection and repair of corrupt disk sectors
• Fixity checks on ingest
• Periodic checks on fixity of all objects

Page 50

Architecture & Management

[Diagram labels: images, text, and Source METS in a Zip; HTMETS]

../uc1/pairtree_root/b3/54/34/86/b34543486
b34543486.zip
b34543486.mets.xml

Page 53

Architecture & Management

[Diagram labels: images, text, and Source METS in a Zip; HTMETS]

../uc1/pairtree_root/b3/54/34/86/b34543486
b34543486.zip
b34543486.mets.xml

Example ids:
wu.89094366434
mdp.39015037375253
uc2.ark:/1390/t26973133
miua.aaj0523.1950.001
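
To make the directory layout concrete, here is a minimal Python sketch that maps an identifier of the form `namespace.local_id` to a pairtree-style path. It assumes the repository follows the character-cleaning rules of the public Pairtree specification (hex-encoding a few reserved characters, then mapping `/`, `:`, and `.` to `=`, `+`, and `,`); the function names are hypothetical, and the slides themselves show only the resulting layout.

```python
def pairtree_clean(identifier: str) -> str:
    """Apply Pairtree identifier cleaning (assumed to match the repository's rules)."""
    encoded = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        # Step 1: hex-encode non-visible bytes and the reserved characters.
        if byte < 0x21 or byte > 0x7E or ch in '"*+,<=>?\\^|':
            encoded.append(f"^{byte:02x}")
        else:
            encoded.append(ch)
    # Step 2: single-character substitutions for / : and .
    return "".join(encoded).translate(str.maketrans("/:.", "=+,"))


def object_path(htid: str) -> str:
    """Build '<namespace>/pairtree_root/<2-char segments>/<cleaned id>'."""
    namespace, local_id = htid.split(".", 1)   # namespace ends at the first dot
    cleaned = pairtree_clean(local_id)
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join([namespace, "pairtree_root", *segments, cleaned])


print(object_path("mdp.39015037375253"))
print(object_path("dul1.ark:/13960/t13n2vj0t"))
```

Under these assumptions the zip and METS files for an object would sit in the final directory, named after the cleaned identifier.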

Page 54

Architecture & Management

• Reference
  – Ability to locate objects definitively and reliably over time among other objects (Task Force on Archiving of Digital Information, 1996)
  – Identification of objects
  – Structure of the repository
  – Embedding of identifiers
  – Permanent URLs
  – Version dates

Page 55

Architecture & Management

[Diagram labels: images, text, and Source METS in a Zip; HTMETS]

../uc1/pairtree_root/b3/54/34/86/b34543486
b34543486.zip
b34543486.mets.xml

Page 56

What is METS?

• Metadata Encoding and Transmission Standard

• Administrative (including preservation), Technical, and Structural metadata

Page 57

Why METS?

• Can serve as Archival Information Package and a Dissemination Information Package

• Designed to record the relationship between pieces of complex digital objects

• Can be created automatically as texts are loaded or reloaded

• Preservation actions (PREMIS)

Page 58

Metadata Framework

• Details and specifications at repository level
  – Object specifications / Validation criteria
  – Page-tagging
• Variations at object level
  – Files missing
  – Non-valid files
  – Incorrect file checksums

http://www.hathitrust.org/digital_object_specifications

Page 59

Architecture & Management

[Diagram labels: images, text, and Source METS in a Zip; HTMETS]

../uc1/pairtree_root/b3/54/34/86/b34543486
b34543486.zip
b34543486.mets.xml

Page 60

PREMIS Metadata

Object Entity

<PREMIS:object xsi:type="PREMIS:representation">
  <PREMIS:objectIdentifier>
    <PREMIS:objectIdentifierType>identifier</PREMIS:objectIdentifierType>
    <PREMIS:objectIdentifierValue>dul1.ark:/13960/t13n2vj0t</PREMIS:objectIdentifierValue>
  </PREMIS:objectIdentifier>
  <PREMIS:significantProperties>
    <PREMIS:significantPropertiesType>file count</PREMIS:significantPropertiesType>
    <PREMIS:significantPropertiesValue>960</PREMIS:significantPropertiesValue>
  </PREMIS:significantProperties>
  <PREMIS:significantProperties>
    <PREMIS:significantPropertiesType>page count</PREMIS:significantPropertiesType>
    <PREMIS:significantPropertiesValue>320</PREMIS:significantPropertiesValue>
  </PREMIS:significantProperties>
</PREMIS:object>

Event Entity

<PREMIS:event>
  <PREMIS:eventIdentifier>
    <PREMIS:eventIdentifierType>UUID</PREMIS:eventIdentifierType>
    <PREMIS:eventIdentifierValue>9af6a994-f6fe-3a61-ac0e-be793d347edb</PREMIS:eventIdentifierValue>
  </PREMIS:eventIdentifier>
  <PREMIS:eventType>package inspection</PREMIS:eventType>
  <PREMIS:eventDateTime>2011-10-25T20:37:51Z</PREMIS:eventDateTime>
  <PREMIS:eventDetail>Inspection of download package for missing files</PREMIS:eventDetail>
  <PREMIS:eventOutcomeInformation>
    <PREMIS:eventOutcome>warning</PREMIS:eventOutcome>
    <PREMIS:eventOutcomeDetail>
      <PREMIS:eventOutcomeDetailNote>files missing</PREMIS:eventOutcomeDetailNote>
      <PREMIS:eventOutcomeDetailExtension>
        <HT:fileList status="missing">
          <HT:file>islandoradventur00whit_scanfactors.xml</HT:file>
        </HT:fileList>
      </PREMIS:eventOutcomeDetailExtension>
    </PREMIS:eventOutcomeDetail>
  </PREMIS:eventOutcomeInformation>
  <PREMIS:linkingAgentIdentifier>
    <PREMIS:linkingAgentIdentifierType>MARC21 Code</PREMIS:linkingAgentIdentifierType>
    <PREMIS:linkingAgentIdentifierValue>MiU</PREMIS:linkingAgentIdentifierValue>
    <PREMIS:linkingAgentRole>Executor</PREMIS:linkingAgentRole>
  </PREMIS:linkingAgentIdentifier>
  <PREMIS:linkingAgentIdentifier>
    <PREMIS:linkingAgentIdentifierType>tool</PREMIS:linkingAgentIdentifierType>
    <PREMIS:linkingAgentIdentifierValue>feedd.pl 0.9.17</PREMIS:linkingAgentIdentifierValue>
    <PREMIS:linkingAgentRole>software</PREMIS:linkingAgentRole>
  </PREMIS:linkingAgentIdentifier>
</PREMIS:event>
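
A short Python sketch shows how an event like the one on this slide might be read back out of a METS file. It uses an abbreviated copy of the event, and it assumes the `PREMIS` prefix is bound to the PREMIS 2.x namespace URI `info:lc/xmlns/premis-v2`, since the slide excerpt does not show the namespace declarations. The function name `summarize_event` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace binding; the slide excerpt omits the xmlns declarations.
NS = {"PREMIS": "info:lc/xmlns/premis-v2"}

# Abbreviated version of the event entity shown on the slide.
SAMPLE = """<PREMIS:event xmlns:PREMIS="info:lc/xmlns/premis-v2">
  <PREMIS:eventType>package inspection</PREMIS:eventType>
  <PREMIS:eventDateTime>2011-10-25T20:37:51Z</PREMIS:eventDateTime>
  <PREMIS:eventOutcomeInformation>
    <PREMIS:eventOutcome>warning</PREMIS:eventOutcome>
  </PREMIS:eventOutcomeInformation>
</PREMIS:event>"""


def summarize_event(xml_text: str) -> dict:
    """Extract the event type, timestamp, and outcome from one PREMIS event."""
    event = ET.fromstring(xml_text)
    return {
        "type": event.findtext("PREMIS:eventType", namespaces=NS),
        "when": event.findtext("PREMIS:eventDateTime", namespaces=NS),
        "outcome": event.findtext(
            "PREMIS:eventOutcomeInformation/PREMIS:eventOutcome", namespaces=NS
        ),
    }


print(summarize_event(SAMPLE))
```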

Page 61

capture: Initial capture (digitization) of item
file rename: File renaming to HathiTrust conventions
image modification: Replace boilerplate images with blank images
image compression: Conversion of raw scans to compressed TIFF and JPEG2000
image header modification: Modification of image headers to meet HathiTrust conventions
ingestion: Ingestion of object package into the repository
message digest calculation: Calculation of page-level MD5 checksums (refers to checksum calculations performed prior to content submission to HathiTrust when these checksums are available)
validation: Validation of technical characteristics of image and OCR files
ocr split: Detail is package type specific, e.g.:
  a) Extraction of plain-text OCR from ALTO XML
  b) Split OCR into one plain text OCR file per page
  c) Splitting of IA XML OCR into one plain text OCR file and one XML file (with coordinates) per page
package inspection: Inspection of download package for missing files
page feature mapping: Mapping of original page feature tags to HathiTrust tags
fixity check: Validation of MD5 checksums of content files
zip archive creation: Compression of content files and source METS into zip archive
zip file message digest calculation: Calculation of MD5 checksum for zip archive
source mets creation: Creation of source METS file
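
The "fixity check" event above, validating MD5 checksums of content files, can be illustrated with a small Python sketch. This is a generic example, not the ingest tooling itself: the function names are hypothetical, and the expected checksums are assumed to arrive as a simple mapping of file name to hex digest.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_failures(directory, expected):
    """Return the names of files that are missing or whose MD5 differs.

    `expected` maps file names to the hex digests recorded for them.
    """
    failures = []
    for name, checksum in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of(path) != checksum.lower():
            failures.append(name)
    return failures
```

An empty result means every listed file is present and matches its recorded checksum; the same routine could in principle be rerun periodically over stored objects.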

Page 62

Provenance

• Strategies
  – Original source
  – Agent of digitization
  – Administrative metadata (provenance and preservation)

Page 63

Security

• Data integrity
  – Checksum validation, digital object provenance
• Physical security
  – Biometric door systems, locked racks
• Network security
  – Firewalling, vulnerability scanning
• Application security
  – Developer best practices, input validation
• Access control

Page 64

Authentication

• Shibboleth
  – Login with organization
  – Attributes released to Service Provider
  – Authorize access
  – http://www.hathitrust.org/shibboleth

Page 65

[Architecture diagram. Labels: Source (Bibliographic Data, Content Package); Ingest; Storage (Michigan, Indiana); Data Management (Bib Data, Rights Data, Holdings Data); Access (Catalog, Full-text Search, PageTurner, APIs, Collections, Datasets)]

Page 66

APIs

• Bibliographic API
  – Volume and rights information
  – MARC records
  – http://www.hathitrust.org/bib_api
• OAI
  – http://www.hathitrust.org/data
• “Hathifiles”
  – http://www.hathitrust.org/hathifiles
• Data API
  – Volume and rights information
  – Page images
  – OCR
  – http://www.hathitrust.org/data_api
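
As a hedged illustration of a Bibliographic API lookup, the Python sketch below only builds a request URL. The base URL and the `brief`/`full` plus identifier-type path pattern are assumptions based on the publicly documented API and should be checked against the documentation linked on this slide; the function name `bib_api_url` and the OCLC number are illustrative only.

```python
from urllib.parse import quote

# Assumed base URL for the Bibliographic API; verify against the API docs.
BASE = "https://catalog.hathitrust.org/api/volumes"


def bib_api_url(id_type: str, value, detail: str = "brief") -> str:
    """Build a lookup URL such as .../brief/oclc/<number>.json."""
    if detail not in ("brief", "full"):
        raise ValueError("detail must be 'brief' or 'full'")
    return f"{BASE}/{detail}/{id_type}/{quote(str(value), safe='')}.json"


# Illustrative OCLC number; substitute a real identifier before requesting.
print(bib_api_url("oclc", 424023))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.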

Page 67

Computational Access

• Distribution of datasets
  – http://www.hathitrust.org/datasets
• Non-Google-digitized Dataset (540,000+)
  – PD, PDUS, Open Access
  – Signed researcher statement
• Google-digitized (4.8 million+)
  – PD, PDUS, Open Access
  – Agreement between institution and Google
  – Brief proposal
    • Characterize texts
    • Provide ids (custom sets possible)
    • Research, results, use of results
  – Signed researcher statement

Page 68

HTRC

• http://www.hathitrust.org/htrc
• HathiTrust Research Center
  – Developed collaboratively by Indiana University and University of Illinois; launched July 2011
  – Enables computational access to public domain and open access materials; working to support in-copyright materials as well
  – Secure Environment – bring researchers to the data
  – Build services and tools that facilitate research by digital humanities and informatics communities
  – Advanced Collaborative Support
    • RFP: http://www.hathitrust.org/htrc/acs-rfp
    • Awards: http://www.hathitrust.org/htrc_acs_awards_spring2015

Page 69

How to find out more

• About: http://www.hathitrust.org/about
• Twitter: http://twitter.com/hathitrust
• Facebook: http://www.facebook.com/hathitrust
• Monthly newsletter:
  – http://www.hathitrust.org/updates
  – RSS: http://www.hathitrust.org/updates_rss
• Contact us: [email protected]
• Blogs: http://www.hathitrust.org/blogs
  – Large-scale Search
  – Perspectives from HathiTrust
• Resources
  – A Preservation Infrastructure Built to Last: Preservation, Community, and HathiTrust
    • http://www.hathitrust.org/documents/york-MemoftheWorld-201209.pdf
  – PREMIS 2.0 Implementation:
    • http://bit.ly/1O8Fokz

Page 70

Thank you!