the UPS protoproto project

herbert van de sompel, michael nelson, thomas krichel

UPS 1 Meeting Santa Fe - October 21st 1999

project description

demo the UPS protoproto

dex the data exchange framework

project why a protoproto?

•  UPS: enable cross-archive end-user services
•  protoproto:
   –  facilitate discussions
   –  identify issues involved in creating cross-archive services
   –  experiment with digital object concepts for archive material
   –  does not claim to be a solution
•  protoproto is multi-disciplinary
   –  a special instance of cross-archive
   –  there is a market
   –  promotional value

project who?

•  coordination: herbert van de sompel, michael nelson, thomas krichel
•  involvement of:
   –  Old Dominion U & NASA Langley
   –  U of Surrey
   –  U of Ghent
   –  Los Alamos National Laboratory - Library
   –  Russian Academy of Science - Siberian branch

project sponsors

•  Los Alamos National Laboratory - Research Library
•  JISC eLib WoPEc project

project datasets

–  metadata only
–  full text remains at archives
–  static dumps obtained ca. July 99

archive      objects   full-text   organizations
arXiv         85,223      85,223          17,983
CogPrints        742         659              14
NACA           3,036       3,036             100
NCSTRL        29,184       9,084              93
NDLTD          1,590         951               1
RePEc         73,367      13,582           2,453
Total        193,142     112,535

project metadata formats

archive      format
arXiv        internal
CogPrints    internal
NACA         Refer
NCSTRL       RFC1807
NDLTD        MARC
RePEc        ReDIF

project metadata extraction

•  Getting metadata out of archives
   –  not all archives support metadata extraction
      •  some archives have undocumented metadata extraction procedures
   –  not all archives support rich criteria for extraction
      •  single dump concept only
•  Intellectual property and use rights not always clear

project metadata quality

•  Metadata has problems with:
   –  record duplication
   –  crucial missing fields
   –  internal errors
   –  ambiguous references to people, places, and publications

project metadata conversion

•  data enhancements:
   •  creation of unique identifier
   •  addition of raw subject-classification
   •  normalization of publication types
•  all datasets converted to ReDIF:
   •  essential to have a single format for the creation of services
   •  supply by archives in a single format was not realistic
   •  no downgrading of data
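The conversion step can be sketched roughly as follows, assuming each archive's records have already been parsed into Python dictionaries. This is an illustration, not the project's conversion scripts: the input field names, the `ups:` identifier prefix, the publication-type table, and the use of `Keywords` to carry the raw subject classification are all assumptions. Only the field labels `Template-Type`, `Author-Name`, `Title`, `Keywords` and `Handle` are taken from ReDIF itself.

```python
# Hypothetical table normalizing archive-specific publication types
# to ReDIF template types.
PUBLICATION_TYPES = {
    "preprint": "ReDIF-Paper 1.0",
    "technical report": "ReDIF-Paper 1.0",
    "thesis": "ReDIF-Paper 1.0",
    "journal article": "ReDIF-Article 1.0",
}


def to_redif(archive: str, local_id: str, record: dict) -> str:
    """Render one parsed record as a ReDIF-style template (illustrative only)."""
    raw_type = record.get("type", "").strip().lower()
    # Unknown types fall back to the paper template (an assumption).
    template = PUBLICATION_TYPES.get(raw_type, "ReDIF-Paper 1.0")
    lines = [f"Template-Type: {template}"]
    for author in record.get("authors", []):
        lines.append(f"Author-Name: {author}")
    lines.append(f"Title: {record['title']}")
    # The raw subject classification is carried over unchanged.
    if record.get("subject"):
        lines.append(f"Keywords: {record['subject']}")
    # Assumed identifier scheme: archive name plus the archive's own local id.
    lines.append(f"Handle: ups:{archive}:{local_id}")
    return "\n".join(lines)
```

Keeping the archive's own identifier inside the new one means the original record can still be found in its native archive.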

project re-creation of archives

•  creation of archives for ReDIF-ed metadata
•  using intelligent digital objects: “buckets”

[figure: archives labelled arXiv, RePEc, NCSTRL]

project buckets

•  Buckets were chosen to study the implications of using rich, intelligent objects in UPS
•  Buckets are:
   –  DL protocol / system independent
   –  self-contained and mobile
   –  handle their own display, enforcement of terms and conditions, and dissemination of their contents
   –  designed for bundling multiple data representations and data instance types
•  The aggregative nature of buckets is well suited for adding value-added services at the object level

project creation of end-user service

•  NCSTRL+ digital library service
•  indexing buckets in archives by requesting their metadata
•  enhanced user-interface
•  NCSTRL+ search results point at buckets
•  buckets auto-display
•  buckets provide link to full-text in native archive

project scaling problems

•  UPS contains 193K objects
   –  using buckets consumed inodes (~60 inodes per bucket)
      •  filesystem reformatted with a more generous number of inodes
   –  Solaris and Dienst conflict
      •  Dienst wants each object in a publishing authority to be in a single directory
      •  Solaris has a hard limit of 32K objects in a directory
      •  resolution: use many (100+) authorities for UPS
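The figures above can be checked with some back-of-the-envelope arithmetic. The sketch below assumes "32K" means 32,000 entries and uses the object total from the datasets table. The `authority_for` helper and its `upsNNN` naming are hypothetical; how objects were actually assigned to authorities is not described here.

```python
import math
import zlib

OBJECTS = 193_142        # total objects in the UPS protoproto
INODES_PER_BUCKET = 60   # approximate figure quoted above
DIR_LIMIT = 32_000       # "32K" read as 32,000 (assumption)


def inodes_needed(objects: int = OBJECTS) -> int:
    """Rough inode demand when every object is stored as a bucket."""
    return objects * INODES_PER_BUCKET


def min_authorities(objects: int = OBJECTS, limit: int = DIR_LIMIT) -> int:
    """Fewest authorities that keep every directory under the limit."""
    return math.ceil(objects / limit)


def authority_for(object_id: str, n_authorities: int = 128) -> str:
    """Spread objects over authorities with a stable hash (hypothetical naming)."""
    bucket = zlib.crc32(object_id.encode("utf-8")) % n_authorities
    return f"ups{bucket:03d}"


print(inodes_needed())    # 11588520
print(min_authorities())  # 7
```

Seven authorities would be the bare minimum under these assumptions. Using 100+ leaves each directory far below the limit and leaves room for growth.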

project addition of linking service

•  integrate the archives with the traditional communication mechanism
•  context-sensitive linking to deliver extended services via SFX technology

project SFX linking service

[diagram: system A, system B; metadata; evaluate metadata; extended services]

project SFX linking database

•  buckets for arXiv, NCSTRL and RePEc are SFX-aware
•  CogPrints, NACA, NDLTD not SFX-aware
•  SLAC/SPIRES is SFX-aware
•  linking services for preprint metadata + for published version

project addition of linking service

demo the UPS protoproto

http://ups.cs.odu.edu:8000/

•  will be available starting beginning of November
•  UPS list will be notified
•  disclaimer “not a production system”

http://ups.cs.odu.edu

dex some issues (I)

•  data exchange framework
•  data provision vs. data implementation
•  central searching, distributed archives

•  need for a framework by which archives can describe themselves:
   •  content
   •  terms and conditions
   •  protocols, criteria supported to extract (meta)data
   •  metadata scheme, subject classification scheme, material-type scheme, ...

•  need for an identifier scheme for archives and archive objects
   •  (cf. ISSN, ISBN, DOI)
•  metadata quality obstructs the creation of services
•  desirable to extend metadata with citation information
•  smart objects
   •  archived objects that are active, not passive

dex some issues (II)

dex providing vs. implementing data

•  Providing data:
   –  publishing into an archive
   –  providing methods for metadata “harvesting”
      •  provide non-technical context for sharing information also
•  Implementing data:
   –  harvest metadata from providers
   –  implement user interface to data
•  Even if provided by the same DL, these are distinct functions

[diagram: provider and implementor interfaces]

Provider: Input interface, Native end-user interface, Native harvesting interface
Implementor: Native end-user interface

Annotations:
–  No machine based way to extract metadata…
–  Machine and user interfaces for extracting metadata…
–  Input and harvesting interfaces optional
–  Native end-user interface optional (e.g., RePEc)


dex self-describing archives

•  Much of the learning about the constituent UPS archives occurred out of band…
•  Given an unknown archive, we should be able to algorithmically determine the archive’s metadata...
•  Native harvesting interface: where possible, the harvesting interface should provide the same criteria as the end-user interface

dex self-describing archives

•  Recommended criteria for metadata extraction:
   –  subject classification
   –  accession date
   –  publication date
•  Criteria for archive description
   –  metadata formats employed
   –  contact information for archive
   –  publication type scheme
   –  identifier scheme
   –  subject classification scheme
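A minimal sketch of what the two lists above could look like in practice: a machine-readable archive description, plus a harvesting call that filters on the three recommended criteria. All class, field and function names are hypothetical; nothing here is an existing protocol or the project's implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ArchiveDescription:
    """What an archive publishes about itself (fields mirror the list above)."""
    contact: str
    metadata_formats: List[str]
    publication_type_scheme: str
    identifier_scheme: str
    subject_scheme: str


@dataclass
class Record:
    identifier: str
    subject: str
    accession_date: date
    publication_date: date


def harvest(records: List[Record],
            subject: Optional[str] = None,
            accessioned_since: Optional[date] = None,
            published_since: Optional[date] = None) -> List[Record]:
    """Return the records that satisfy every criterion that was supplied."""
    selected = []
    for record in records:
        if subject is not None and record.subject != subject:
            continue
        if accessioned_since is not None and record.accession_date < accessioned_since:
            continue
        if published_since is not None and record.publication_date < published_since:
            continue
        selected.append(record)
    return selected
```

Filtering on accession date is what would let a service pick up only new material instead of relying on a single static dump.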

dex identifiers

•  Useful in:
   –  reference linking
   –  can be used in citations
   –  resolving duplications
      •  UPS duplications were removed by hand
   –  tracking publication lifecycle
•  Need the ability for an object to have multiple unique identifiers
   –  organization, discipline, etc.
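To illustrate how multiple identifiers per object could help resolve duplicates automatically, the sketch below groups objects that share at least one identifier. It is not what was done in UPS, where duplicates were removed by hand. The function name and the identifier strings in the usage example are made up.

```python
def group_duplicates(objects: dict) -> list:
    """Group object keys that share at least one identifier.

    `objects` maps an object key to the set of identifiers it carries.
    Sharing is transitive: if a and b share one identifier and b and c
    share another, all three end up in the same group.
    """
    parent = {key: key for key in objects}

    def find(key):
        # Follow parent links to the group representative, compressing the path.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_holder = {}
    for key, identifiers in objects.items():
        for identifier in identifiers:
            if identifier in first_holder:
                # Identifier seen before: merge the two groups.
                parent[find(key)] = find(first_holder[identifier])
            else:
                first_holder[identifier] = key

    groups = {}
    for key in objects:
        groups.setdefault(find(key), set()).add(key)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])
```

For example, `group_duplicates({"a": {"arXiv:1", "report:X"}, "b": {"report:X"}, "c": {"RePEc:z"}})` puts `a` and `b` in one group because both carry `report:X`, and leaves `c` on its own.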

dex smart objects

•  Premise: Objects are more important than the archives that hold them
•  SODA: Smart Objects, Dumb Archives
•  Objects should be the canonical authority for
   •  metadata
   •  contents
   •  use
•  Objects should be able to grow and change
   •  correct metadata
   •  add new formats
   •  add new services
   •  reflect the lifecycle of the object

dex smart objects

•  It would be beneficial if the archived objects could be heterogeneous:
   •  with their own “look-and-feel”
   •  unique functionality / services
      –  e.g., the data archiving needs of an atmospheric scientist can differ from those of a computer scientist, engineer or medical researcher
•  yet maintain a standard API for:
   •  extracting metadata
   •  content retrieval
   •  resource discovery on the object
   •  terms and conditions
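One way to picture "heterogeneous objects behind a standard API" is an abstract interface that every object type implements in its own way. The sketch below is hypothetical: the class and method names are invented for illustration and are not the bucket API.

```python
from abc import ABC, abstractmethod


class SmartObject(ABC):
    """Hypothetical minimal API shared by all archived objects."""

    @abstractmethod
    def metadata(self, fmt: str) -> str:
        """Extract metadata in the requested format."""

    @abstractmethod
    def content(self, representation: str) -> bytes:
        """Retrieve one representation of the content."""

    @abstractmethod
    def terms_and_conditions(self) -> str:
        """State the terms under which the object may be used."""

    def methods(self) -> list:
        """Resource discovery on the object: list its public operations."""
        return sorted(
            name for name in dir(self)
            if not name.startswith("_") and callable(getattr(self, name))
        )


class PreprintObject(SmartObject):
    """One concrete object type; others could add their own services."""

    def __init__(self, title: str, pdf: bytes):
        self.title = title
        self.pdf = pdf

    def metadata(self, fmt: str) -> str:
        if fmt != "redif":
            raise ValueError(f"unsupported metadata format: {fmt}")
        return f"Template-Type: ReDIF-Paper 1.0\nTitle: {self.title}"

    def content(self, representation: str) -> bytes:
        if representation != "pdf":
            raise ValueError(f"unsupported representation: {representation}")
        return self.pdf

    def terms_and_conditions(self) -> str:
        return "metadata freely harvestable; full text remains at native archive"
```

A service only needs the four shared operations. Anything extra that a particular object offers shows up through `methods()`.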

dex lessons learned

•  A strong distinction between the provision of data, and the implementation of data
   –  also, a socio-legal context for sharing metadata
•  Open, “self-describing” archives
•  A universal, unique identifier name space
•  Archived objects with more intelligence and flexibility

Technology
