HATHITRUST A Shared Digital Repository HathiTrust: Aspiring to Build the Universal Library UKSG Annual Conference March 26-28, 2012 Jeremy York, Project Librarian, HathiTrust
1. HATHITRUST A Shared Digital Repository. HathiTrust: Aspiring
to Build the Universal Library. UKSG Annual Conference, March 26-28,
2012. Jeremy York, Project Librarian, HathiTrust
2. Partnership: Arizona State University; Baylor University; Boston
College; Boston University; California Digital Library; Columbia
University; Cornell University; Dartmouth College; Duke University;
Emory University; Florida State University; Getty Research Institute;
Harvard University Library; Indiana University; Johns Hopkins
University; Lafayette College; Library of Congress; Massachusetts
Institute of Technology; McGill University; Michigan State University;
New York Public Library; New York University; North Carolina Central
University; North Carolina State University; Northwestern University;
The Ohio State University; The Pennsylvania State University;
Princeton University; Purdue University; Stanford University; Texas
A&M University; Universidad Complutense de Madrid; University of
Arizona; University of Calgary; University of California (Berkeley,
Davis, Irvine, Los Angeles, Merced, Riverside, San Diego, San
Francisco, Santa Barbara, Santa Cruz); The University of Chicago;
University of Connecticut; University of Florida; University of
Illinois; University of Illinois at Chicago; The University of Iowa;
University of Maryland; University of Miami; University of Michigan;
University of Minnesota; University of Missouri; University of
Nebraska-Lincoln; The University of North Carolina at Chapel Hill;
University of Notre Dame; University of Pennsylvania; University of
Pittsburgh; University of Utah; University of Virginia; University of
Washington; University of Wisconsin-Madison; Utah State University;
Washington University; Yale University Library
3. Digital Repository: Launched 2008. Initial focus on digitized
book and journal content. 10,109,919 total volumes; 5,372,755 book
titles; 266,540 serial titles; 2,802,347 public domain (~28%)
4. The Name: the meaning behind the name. Hathi (hah-tee): Hindi
for elephant. Big, strong. Never forgets, wise. Secure.
Trustworthy.
5. Mission To contribute to the common good by collecting,
organizing, preserving, communicating, and sharing the record of
human knowledge
6. HathiTrust Universal Library: Common Goal; Single Entity, Many
Partners
7. Collections and Collaboration: Comprehensive collection -
Preservation with Access. Shared strategies: Copyright; Collection
management, development; Preservation; Discovery / Use. Bibliographic
Indeterminacy. Efficient user services. Public Good
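8. Content Distribution [Pie chart. "Public Domain" totals 28%, made
up of slices of 14%, 10% and 4% labeled Public Domain (worldwide),
Public Domain (US) and U.S. Federal Government Documents (4%); the
pairing of the 14% and 10% slices with their labels is not recoverable
from the extracted text. Open Access .1%; Creative Commons .01%. The
remaining slice is 72%.]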
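9. Content Sources [Pie chart of contributing institutions. Largest
shares: Michigan 45%, California 33%. The remaining contributors each
account for roughly 5% or less: LC, Minnesota, Yale, UNC-Chapel Hill,
Harvard, Madrid, Virginia, Utah State, Indiana, Chicago, NCSU,
Columbia, Northwestern, Duke, Princeton, Illinois, Purdue, Penn State,
NYPL, Cornell, Wisconsin. Individual small percentages are not
recoverable from the extracted text.]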
11. Language Distribution (1): The top 10 languages make up ~86% of
all content. [Pie chart: English 48%, German 9%, French 7%, Spanish
5%, Chinese 4%, Russian 4%, Japanese 3%; Arabic, Latin and Italian
account for 3%, 2% and 1% in an order not recoverable from the
extracted text; Remaining Languages 14%.]
12. Language Distribution (2): The next 40 languages make up ~13% of
total. [Pie chart; slice labels: Bulgarian, Armenian, Ancient Greek,
Panjabi, Catalan, Malayalam, Multiple, Sanskrit, Ukrainian, Serbian,
Marathi, Malay, Undetermined, Romanian, Telugu, Finnish, Slovak,
Vietnamese, Greek, Polish, Hungarian, Portuguese, Norwegian, Dutch,
Music, Bengali, Tamil, Hebrew, Persian, Hindi, Unknown, Czech,
Indonesian, Thai, Korean, Turkish, Urdu, Danish, Swedish, Croatian.
Each slice is between 1% and 7% of this group; the individual
percentages are not recoverable from the extracted text.]
13. Preservation with Access Cost effective preservation and
access services Preservation TRAC-certified Robust infrastructure
Long-term commitments on digital content facilitate planning,
decision-making
14. Collective Work: Working Groups and Committees. Executive
Committee: budget/finances, decision-making. Strategic Advisory Board:
guidance on policy, planning. Operational groups: Communications, User
Support, User Experience. Strategic groups: Collections, Discovery
Interface, Full-text Search. Distributed work: driven by needs of
institutions; leverage across the partnership. Projects, Grant Work,
Ingest Specifications, PageTurner, Bibliographic Data Management
15. HathiTrust Functional Framework [Diagram; the column layout did
not survive extraction, so activities are listed without their
columns. Functional areas named: Governance, Enterprise Management,
Communication, Bibliographic Data Management, Repository
Administration, Rights Management, Collection Development, Quality
Assurance, e-Commerce, Content Ingest, Content Access, User Services,
Outreach, Legal. Activities named: budget, finances; coordination with
partner institutions; decision-making; project management; policy;
planning; data management; entity description (record-level); object
identification (item-level); hardware configuration and maintenance
(content storage, backup, integrity checks, deletion); web and
application server configuration and maintenance; hardware selection
and replacement; data availability; security; disaster recovery;
logging; copyright determination; copyright review; copyright
information management (database); rightsholder permissions;
permissions; digital expansion beyond books and journals (born-digital,
images and maps, audio); selection of content (for non-Google volume
ingest and pilot projects); Cloud Library (effect of digital on
print); content and metadata specifications; processes for ensuring
content integrity; transformation; validation; quality review;
certification; repository evaluation and audit (e.g., DRAMBORA, TRAC);
PageTurner; Print on Demand; Collection Builder; large-scale search;
Research Center; bibliographic catalog; APIs; usability; user support
(helpdesk); surveys; general inquiries; project website; monthly
newsletter; advocacy; papers and presentations; communication with
potential partners; risk management (use of materials); partner
agreements; financial contributions of partners.]
16. Constitutional Convention: October 2011; 52 partners; 3-year
review overseen by SAB. Ballot Proposals: Print monograph storage;
Approval Process for development initiatives; U.S. Government
Documents; Fee-for-service content deposit; Governance
17. Emerging Governance: 12-member Board of Governors; 3-member
Executive Committee; Executive Director. 6 seats to founding
institutions: 2 California; 2 CIC (minus Indiana and Michigan); 1
Indiana; 1 Michigan. Voting (March 1 - March 15). Announcement of
Results March 30. Begin work April 16, 2012
18. Preservation with Access Cost effective preservation and
access services Preservation TRAC-certified Robust infrastructure
Long-term commitments on digital content facilitate planning,
decision-making
19. Preservation with Access (2) Discovery Bibliographic and
full-text search of all materials Extended discovery (ProQuest,
EBSCO, OCLC, Ex Libris) Mechanisms for local loading of
records
20. Preservation with Access (3) Access and Use Public domain
and open access works Full download of materials where possible*
Print on demand Collections and APIs Research Center* Lawful uses
of in-copyright works*
21. Lawful uses Access to users who have print disabilities
Section 108 uses of materials Access to orphan works
22. Terms of Access Available to students, faculty, staff of
partnering institutions On library premises or authenticated into
HathiTrust Partner libraries own a print copy One simultaneous user
per print copy owned Users must be on U.S. soil One page at a time
download
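The terms on this slide read as a set of conditions that must all hold before a volume is opened to a user. A minimal sketch of that check follows, assuming each condition can be reduced to a simple field on a request; the class, field names and messages are hypothetical illustrations, not HathiTrust's actual implementation. The one-page-at-a-time download limit is not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AccessRequest:
    # Hypothetical fields that mirror the conditions listed on the slide.
    partner_affiliated: bool            # student, faculty or staff of a partner
    on_premises_or_authenticated: bool  # in the library or logged in
    on_us_soil: bool
    print_copies_owned: int             # print copies held by the user's library
    current_readers: int                # users already viewing this volume


def unmet_conditions(req: AccessRequest) -> List[str]:
    """Return the slide's conditions that this request fails, in slide order."""
    unmet = []
    if not req.partner_affiliated:
        unmet.append("user must belong to a partnering institution")
    if not req.on_premises_or_authenticated:
        unmet.append("user must be on library premises or authenticated")
    if req.print_copies_owned < 1:
        unmet.append("partner library must own a print copy")
    elif req.current_readers >= req.print_copies_owned:
        unmet.append("one simultaneous user per print copy owned")
    if not req.on_us_soil:
        unmet.append("users must be on U.S. soil")
    return unmet


def may_access(req: AccessRequest) -> bool:
    """Access is allowed only when every condition is met."""
    return not unmet_conditions(req)
```

Returning the list of unmet conditions, rather than a bare yes or no, makes it possible to tell a user which term blocked access.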
23. How do we facilitate uses? Fundamental issues of
Identification Description Rights
24. Approach Collective problems as collective Web of
relationships Rights Records Digital Volumes Libraries Print
Volumes
25. Bibliographic Data Normalization of bibliographic data
University of Michigan Efficiency California Digital Library
26. Copyright Bibliographic metadata Automatic and manual
rights determination
27. Automatic Rights Determination: Conducted on all works at
time of ingest and when records are modified. Public domain
worldwide: US works published before 1923; US federal government
publications; non-US works published prior to 1872. Public domain in
the United States: non-US works published prior to 1923
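The automatic rules on this slide amount to a small decision function over a bibliographic record. The sketch below assumes a record carries a publication year, a US/non-US flag and a federal-document flag; those field names, the status labels and the "not determined automatically" fallback are assumptions for illustration, since the slide names only the two public domain outcomes.

```python
from dataclasses import dataclass

PD_WORLD = "public domain worldwide"
PD_US = "public domain in the United States"
UNDETERMINED = "not determined automatically"  # assumed fallback label


@dataclass(frozen=True)
class BibRecord:
    # Hypothetical fields; a real record would derive these from its metadata.
    pub_year: int
    us_published: bool
    us_federal_document: bool = False


def automatic_rights(record: BibRecord) -> str:
    """Apply the slide's date and place rules to one record."""
    if record.us_federal_document:
        return PD_WORLD
    if record.us_published:
        # US works published before 1923.
        return PD_WORLD if record.pub_year < 1923 else UNDETERMINED
    # Non-US works: before 1872 worldwide, 1872-1922 in the US only.
    if record.pub_year < 1872:
        return PD_WORLD
    if record.pub_year < 1923:
        return PD_US
    return UNDETERMINED
```

For example, `automatic_rights(BibRecord(1900, us_published=False))` returns the US-only public domain status, while the same year for a US work returns the worldwide status.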
28. Manual Rights Determination: IMLS-funded CRMS project.
US-published works 1923-1963: conformance with formalities. Expanding
to non-US works. Double-blind review with expert review for
conflicts. Staff at 4 HathiTrust partner institutions (15 will take
part in non-US). As of February 2012: ~190,000 reviewed, more than
100,000 opened. Rights Holder Permissions
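The double-blind step can be pictured as two independent determinations that stand when they agree and go to an expert when they conflict. This is a minimal sketch under that reading of the slide; the function name, the string labels and the "needs expert review" marker are hypothetical, not taken from CRMS.

```python
from typing import Optional

NEEDS_EXPERT = "needs expert review"  # assumed marker for unresolved conflicts


def resolve_review(first: str, second: str, expert: Optional[str] = None) -> str:
    """Combine two independent determinations for one volume.

    Agreement stands on its own; a conflict is settled by the expert's
    determination, or flagged if no expert has reviewed it yet.
    """
    if first == second:
        return first
    if expert is None:
        return NEEDS_EXPERT
    return expert
```

For example, `resolve_review("public domain", "in copyright")` returns the flag for expert review, and passing `expert="in copyright"` settles the conflict.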
29. Breakdown of HathiTrust book corpus by publication
date. Source: Bibliographic Indeterminacy and the Scale of Problems
and Opportunities of "Rights" in Digital Collection Building,
2/2011
30. Breakdown of HathiTrust book corpus by publication
date
31. Copyright status of books published pre-1923 and US
works published 1923-1963
32. Copyright status of books published pre-1923 and US
works published 1923-1963 Pre-1872 ~ 5%
33. Copyright status of books published pre-1923 and US
works published 1923-1963 Pre-1872 ~ 5% Public Domain in the US
34. Copyright status of books published pre-1923 and US works
published 1923-1963? Pre-1872 ~ 5% Public Domain in the US
35. Copyright status of books published pre-1923 and US
works published 1923-1963
36. Copyright status of books published pre-1923 and US works
published 1923-1963 In Print?
37. Collection Management, Development Overlap
38. A global change in the library environment: Academic
print book collection already substantially duplicated in mass
digitized book corpus. [Chart: % of Titles in Local Collection
(y-axis, 0% to 60%) against Rank in 2008 ARL Investment Index (x-axis,
0 to 120). June 2009 median duplication: 19%. June 2010 median
duplication: 31%.]
39. Digitized Books in Shared Repositories: ~75% of mass digitized
corpus is backed up in one or more shared print repositories. [Chart:
Unique Titles (y-axis, 0 to 3,500,000) by month, Sep-09 to Jun-10.
Series: mass digitized books in Hathi digital repository; mass
digitized books in shared print repositories. Callouts: ~3.5M titles
and ~2.5M.]
40. Collection Management, Development: Overlap. More than 50%
median overlap with ARL institutions; higher for small liberal arts
colleges. Pricing model based on print holdings; requires print
holdings database. Also support expansion of legal uses, efforts in
de-duplication. Facilitate individual and collaborative collection
development and management operations. Print monographs
archiving
41. Collection Management, Development Discovery (OCLC)
Collections Committee
42. Comprehensive Picture Definitional Issues Identification,
Description, Rights Discovery and Use Finding Relating (APIs and
integration) Using (Reading, Computational activities) Collection
management, development Preservation infrastructure Digital and
Print Relationships
43. Work going forward Definitional elements Print archiving,
management Discovery and use Lawful uses Research Center Quality
Government documents Beyond books and journals Publishing
Transitioning to next phase of partnership
44. How to find out more. Web site About section:
http://www.hathitrust.org/about HathiTrust Research Center:
http://www.hathitrust.org/htrc Twitter:
http://twitter.com/hathitrust Monthly newsletter:
http://www.hathitrust.org/updates RSS:
http://www.hathitrust.org/updates_rss Contact us:
[email protected] Blogs:
http://www.hathitrust.org/blogs (Large-scale search; Perspectives
from HathiTrust)