What do we want to identify? Ketil Albertsen, Paradigma project National Library of Norway

What do we want to identify?

Ketil Albertsen, Paradigma project

National Library of Norway

Existing ID schemes and FRBR

• Work:

DOI (?)

• Expression:

ISTC, ISWC, DOI (?)


• Manifestation:

ISBN, ISMN, ISRN, ISAN, ISRC, ISSN, SICI, DOI (?)


• Item:

Often: non-standardized holding IDs, shelf IDs etc.

Internet: URL (actually, URLs are location IDs, but are often used as object IDs).

The museum world, rare and used book shops/collections etc. may have their own ID schemes.

Several Item level schemes are really location IDs or a mixup of location and object identification.


• Other entity groups

No widespread standards exist

Identifier issues

Decisions made in Paradigma for ID assignments to objects with no assigned ID, or where no assigned ID is found appropriate/satisfactory

Analysis may aid the understanding of properties and limitations of existing ID schemes

For each identified issue, the survey gives: a “Note:” explaining the heading in more detail, where required; arguments in favor of the decision (pro); arguments against the decision (con); our confidence in our own decision on a scale from 1 to 10; and additional remarks, where applicable.

ID value should carry no information about the identified object (i.e. ID should be grey, opaque, unintelligent)

Pro:

• Object retains ID even if its attributes change (without becoming inconsistent).

• All views are equal with respect to ID; different views won’t have conflicting wishes.

• No ambiguity with respect to how to form the ID value.

• No restrictions on any object attribute, e.g. regarding single-value or uniqueness.

Con:

• Readability is lower for ID values which appear meaningless to the user.

• Any form of ordering requires other attributes to be retrieved.

• Any lookup on object attributes requires use of an index mechanism.

Confidence: 10

Remarks:

• Values may have a non-opaque prefix, identifying the ID scheme or responsible authority. We accept this as long as no object properties are implied by the prefix.

ID values should use a restricted character/symbol set

Pro:

• No problems caused by limited internationalization features.

• No transcription problems.

• May avoid ambiguities such as UPPER/lower case distinctions or numeric base.

• Easier detection of typing/reading errors.

Con:

• In cross-cultural contexts, users may have to work in unfamiliar symbol sets.

• Identifier length increases.

Confidence: 7

Remarks:

• Opaque (grey) IDs usually satisfy this requirement.

• Special attention should be given to separator characters for increasing readability; use of different separators in a given ID is usually undesirable.

Check digits are included in IDs to be typed manually or read from print

Pro:

• Errors are detected immediately, allowing retyping/rereading.

Con:

• Identifier length is increased.

Confidence: 9
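As an illustration of immediate error detection, the sketch below verifies the check digit of an ISBN-10, using the ISBN that appears on the URI slide of this presentation. The function name is an assumption made for the example; the check-digit algorithm used by Paradigma itself is not described here.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Return True if the ISBN-10 check digit matches (weights 10..1, modulo 11)."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), chars):
        if ch == "X" and weight == 1:
            value = 10                    # 'X' stands for 10, last position only
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0


print(isbn10_is_valid("0-596-00420-6"))   # typed correctly
print(isbn10_is_valid("0-596-00240-6"))   # two adjacent digits transposed
```

The transposed variant fails the test, so the input routine can ask for retyping at once.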

Check digits are not stored internally as part of the ID

Pro:

• Software (except input routines) need not relate to IDs known to be invalid.

Con:

• There is no way to store/handle invalid IDs that cannot be retyped/reread.

Confidence: 7
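One way to picture this decision: the input routine validates and strips the check digit, and the output routine recomputes it. The sketch below uses the Luhn algorithm purely as a stand-in, since the presentation does not name an algorithm; the function names are hypothetical.

```python
def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:          # rightmost payload digit, then every second one
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def add_check_digit(internal_id: str) -> str:
    """Internal form -> external form (check digit appended for display/print)."""
    return internal_id + luhn_check_digit(internal_id)


def strip_check_digit(external_id: str) -> str:
    """External form -> internal form; rejects IDs whose check digit is wrong."""
    if len(external_id) < 2 or any(c not in "0123456789" for c in external_id):
        raise ValueError("ID must consist of at least two decimal digits")
    payload, check = external_id[:-1], external_id[-1]
    if luhn_check_digit(payload) != check:
        raise ValueError("check digit mismatch; please retype")
    return payload


print(add_check_digit("7992739871"))
print(strip_check_digit("79927398713"))
```

Only `strip_check_digit` ever sees a possibly invalid ID; everything behind it works on validated internal values. The cost is the "Con" above: an ID that fails the check cannot be stored at all.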

IDs have fixed length

Pro:

• Elementary error detection may be done very early in the data entry process.

• Internal handling may be simplified.

• Implementers are forced to prepare software for the entire range of ID values.

Con:

• Limits total size of ID space; one may run out of IDs.

• A large ID space gives longer IDs than “required” from a practical viewpoint.

Confidence: 9

Remarks:

• In numeric IDs, leading zeros should be avoided (for several technical reasons), i.e. serial numbers should start at value 100…0
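A small sketch of the fixed-length rule, assuming a 9-digit numeric ID (the length is chosen only for illustration). It also shows one of the technical reasons for avoiding leading zeros: they are lost as soon as the ID passes through a numeric type.

```python
ID_LENGTH = 9                              # assumed length, for illustration only
FIRST_SERIAL = 10 ** (ID_LENGTH - 1)       # smallest value with ID_LENGTH digits
LAST_SERIAL = 10 ** ID_LENGTH - 1          # largest value with ID_LENGTH digits


def looks_like_id(text: str) -> bool:
    """Early, purely syntactic test: right length, digits only, no leading zero."""
    return (
        len(text) == ID_LENGTH
        and all(c in "0123456789" for c in text)
        and text[0] != "0"
    )


print(FIRST_SERIAL, LAST_SERIAL)
print(looks_like_id("100000042"))
print(looks_like_id("12345"))              # wrong length, rejected immediately
print(str(int("000000042")))               # leading zeros vanish in a round trip
```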

An external, “readable” format is rigidly defined

Pro:

• Readability may be improved, e.g. by defined use of digit grouping characters.

• IDs can be directly compared for equality with no preprocessing/canonizing.

Con:

• Users may feel that rules are too rigid.

Confidence: 9
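To make the idea concrete, here is a sketch of a rigid external format: twelve digits shown as three groups of four, separated by a single hyphen. The format itself and the function names are assumptions for the example, not Paradigma's actual layout.

```python
import re

_INTERNAL = re.compile(r"[0-9]{12}")
_EXTERNAL = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}")


def format_id(internal: str) -> str:
    """Render the one and only external form of an ID."""
    if not _INTERNAL.fullmatch(internal):
        raise ValueError("internal ID must be exactly 12 digits")
    return "-".join(internal[i:i + 4] for i in range(0, 12, 4))


def parse_id(external: str) -> str:
    """Accept only the defined external form; no alternative separators."""
    if not _EXTERNAL.fullmatch(external):
        raise ValueError("expected the form NNNN-NNNN-NNNN")
    return external.replace("-", "")


print(format_id("123456789012"))
print(parse_id("1234-5678-9012"))
```

Because exactly one spelling is legal, two external IDs denote the same object if and only if the strings are equal.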

External format optionally identifies primary resolution service

Pro:

• Provides necessary info to realize clickable links.

Con:

• Bound to one specific identification method for resolution service.

• May be unsuitable for long-term archival (if service is identified by URL).

• Users generally won’t distinguish between object ID part and resolution ID part.

Confidence: 7

Binary IDs are displayed as decimal digits

Pro:

• Improved readability; users relate better to digits than to arbitrary character strings.

• May be entered from a purely numeric keyboard.

Con:

• Identifier length is higher than with arbitrary (or e.g. hexadecimal) display.

Confidence: 9

Remarks:

• Binary IDs are usually only/primarily intended for internal, technical use
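The length trade-off can be seen in a short sketch. It assumes a 64-bit binary ID interpreted as an unsigned big-endian integer; both the width and the byte order are choices made for the example.

```python
ID_BYTES = 8  # assumed width of the binary ID (64 bits)


def binary_to_decimal(binary_id: bytes) -> str:
    """Display a binary ID as a string of decimal digits."""
    return str(int.from_bytes(binary_id, "big"))


def decimal_to_binary(decimal_id: str) -> bytes:
    """Convert the decimal display form back to the binary ID."""
    return int(decimal_id).to_bytes(ID_BYTES, "big")


largest = bytes.fromhex("ff" * ID_BYTES)
print(binary_to_decimal(largest), len(binary_to_decimal(largest)))   # decimal form
print(largest.hex(), len(largest.hex()))                             # hexadecimal form
```

For the largest 64-bit value the decimal form needs 20 characters against 16 in hexadecimal, which is the price paid for digits-only entry.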

One ID scheme covers all object classes / information types

Pro:

• Prepared to handle the proliferation of information types on the Internet.

• Simplifies internal workings of indexing mechanisms.

• May avoid ambiguities with respect to which ID scheme is referenced.

Con:

• Object/metadata must be inspected to determine semantics.

• Requires coordinated ID allocation.

Confidence:

• Paradigma: 9. In non-digital contexts: 3-6

An ID may be assigned to objects without digital value representation

Note:

• Applies to e.g. physical objects and abstract concepts, and also to digital “objects” which cannot be, or are not, available as a storable object, e.g. a network or web site.

Pro:

• One ID scheme is used for all different purposes.

• An ID mechanism is provided for objects which might otherwise be “unidentifiable”.

Con:

• Retrieval/presentation functions must be prepared for value being unavailable

Confidence: Paradigma: 9. In non-digital contexts: 6

Remarks:

• In Paradigma, displayable “agent objects” are defined for un-digitizable objects.

One ID scheme handles static as well as dynamic resources: incrementally issued, integrating and streaming

Pro:

• The only way that web documents may be handled properly by automatic mechanisms.

• Resources saved in ID management.

Con:

• The semantics of the ID is limited to what is common to both static and dynamic objects.

Confidence: 9

The contents of an object are either 100% specified, or the object is explicitly defined as an aggregate, e.g. a dynamic document

Pro:

• Any ambiguity is deliberate and well known both to cataloguer and user.

• No need for qualified judgment regarding assignment of new IDs.

• Document contents never become inconsistent with ID; modified, specific contents should be assigned a new ID if needed.

Con:

• Insignificant revisions cannot be ignored from an identifier point of view

Confidence: 9

An ID may identify a rule to be interpreted to determine the object components

Pro:

• The only way that web pages with continuously varying contents may be identified

Con:

• The contents of an object identified by a rule cannot, by definition, be authenticated.

Confidence: In a web-based, dynamic document context: 8. In static contexts: 3-5

Remarks:

• Even though the component set may vary from one moment to the next, any evaluation of the rule must result in a finite set of components.

An ID identifies a unique object in a given interpretation

An HTML file as backup object is different from interpreting the HTML code as the top level of a composite web page; these should have different IDs. A periodical as a publication forum is distinct from the complete set of printed issues.

Pro:

• Ambiguities in interpretation of a given ID are avoided.

Con:

• Required size of ID space increases.

• An object may have multiple IDs; ID is not unique for the object.

Confidence: In digital document contexts: 9, in other contexts 3-7

Remarks:

• Distinct IDs are essential when the set of components depends on the interpretation, or when the interpretations represent different abstraction levels.

An ID scheme should not assume that objects are atomic (non-composite)

Pro:

• New object classes may be identified without conflicting with ID scheme philosophy

Con:

• Always being prepared for composite objects makes software more complex.

Confidence: Context dependent. For referencing: 9, for storage: 1-3.

Remarks:

• In Paradigma, “physical” IDs assume that object is a static, atomic bit sequence. “Logical” IDs, provided to external users, assume that objects are composite.

Each distinct component of a composite must be identifiable

Pro:

• Allows references to a specific part of composite objects.

• Allows different views of a document, defining different extents.

• Allows a component shared among several composites to have one ID.

Con:

• Requires a larger ID space.

Confidence: 9 if components stored as independent files/records, otherwise: 7

Remarks:

• Components are not necessarily assigned an ID, but it must be possible to do so when the need for identifying the component arises.

An object ID is totally independent of the location of the object

Pro:

• Object can be moved around without losing its identity.

• Allows multiple copies of the same object to have the same ID.

Con:

• Obtaining an object requires an explicit mapping from object ID to location.

Confidence: 10
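The explicit mapping named in the "Con" can be sketched as a small lookup table. The class name, the object ID and the example.org locations are all hypothetical.

```python
from typing import Dict, List


class LocationMap:
    """Explicit mapping from location-independent object IDs to current locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def add_copy(self, object_id: str, location: str) -> None:
        # Several copies of the same object share one object ID.
        self._locations.setdefault(object_id, []).append(location)

    def relocate(self, object_id: str, old: str, new: str) -> None:
        # Moving a copy changes the mapping only; the object ID is untouched.
        copies = self._locations[object_id]
        copies[copies.index(old)] = new

    def resolve(self, object_id: str) -> List[str]:
        return list(self._locations.get(object_id, []))


store = LocationMap()
store.add_copy("100000042", "https://archive.example.org/a/42")
store.add_copy("100000042", "https://mirror.example.org/x/42")
store.relocate("100000042", "https://archive.example.org/a/42",
               "https://archive.example.org/b/42")
print(store.resolve("100000042"))
```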

A location ID is totally independent of the identity of the object stored

Pro:

• Object store may be reorganized without affecting object IDs.

Con:

• Location IDs cannot be saved for later re-retrieval if store is subject to reorganizing.

Confidence: 10

Making references to an object must not require contents interpretation

Pro:

• The referencer need not know format details of documents he wants to reference.

• The document format may be changed, and references to it remain valid.

• References can be made at a more abstract (format independent) level.

Con:

• References may have lower precision compared to format dependent references.

• Dereferencing may be more complex and resource consuming.

Confidence: 7

Remarks:

• Relevant primarily to digital/Internet documents.

• In Paradigma, references are indirect, made through a format independent “reference object” containing one or more format specific direct references.

Assigning an ID from one scheme does not prohibit allocation of IDs from other schemes

Pro:

• In a given context, a single scheme may be employed for all objects, even those with existing IDs.

Con:

• There will not be a single, unique way to reference a given object.

Confidence: 9

Index mechanisms provide a fallback for arbitrary URI format ID schemes

Pro:

• A large number of current and future ID schemes are handled with a single mechanism.

Con:

• The general mechanism cannot handle e.g. allowed variations in syntax for the same ID.

Confidence: 9

Remarks:

• For pragmatic reasons (people use URLs as if they were URNs!), Paradigma decided to allow both URLs and URNs to be entered in the fallback index.

• For known schemes with known, allowed syntax variations, scheme dependent preprocessing (“canonizing”) will have to be done, e.g. to hide case differences.
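A minimal sketch of such a fallback index with scheme dependent canonizing follows. The canonizing rule for ISBN (drop hyphens and spaces, upper-case the value) and the internal ID "100000042" are assumptions for the example; unknown schemes are stored verbatim apart from a lower-cased scheme name.

```python
from typing import Callable, Dict


def canonize_isbn(value: str) -> str:
    """Hide allowed syntax variations: hyphens, spaces and letter case."""
    return value.replace("-", "").replace(" ", "").upper()


# Only schemes with known, allowed variations get a canonizer.
CANONIZERS: Dict[str, Callable[[str], str]] = {"isbn": canonize_isbn}


def canonical_key(uri: str) -> str:
    scheme, sep, rest = uri.partition(":")
    if not sep or not scheme:
        raise ValueError("not in URI format: " + uri)
    scheme = scheme.lower()
    canonize = CANONIZERS.get(scheme, lambda v: v)   # fallback: keep as entered
    return scheme + ":" + canonize(rest)


index: Dict[str, str] = {}
index[canonical_key("isbn:0-596-00420-6")] = "100000042"
index[canonical_key("http://www.nb.no/")] = "100000043"

print(index.get(canonical_key("ISBN:0596004206")))
print(index.get(canonical_key("http://www.nb.no/")))
```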

Framework for ID schemes: The URI world

uri:

urn:

url:

http://www.nb.no/

gopher://gopher.uminn.edu/pub/sched

ftp://ftp.funet.fi/pub/at/8200.exe

…:

isbn:0-596-00420-6

issn:1560-1560

doi:10.185/4dd2-00032

nbn:fi-34a3aea1707839494511fcdc14773f4

…:

One ID refers to several object aspects: value, metadata, converted versions…

Pro:

• The same ID is used to reference all information about an object.

• Information can be added to (or about) an object without requiring a new ID.

Con:

• A resolution request must identify the relevant aspect; the ID alone is not sufficient.

• Managing value and metadata independently increases complexity of resolution.

Confidence: 5. Context dependent
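The first "Con" can be illustrated with a resolution function that takes the aspect as a second argument. The aspect names, the repository contents and the object ID are invented for the example.

```python
from enum import Enum
from typing import Dict, Tuple


class Aspect(Enum):
    VALUE = "value"
    METADATA = "metadata"


# Hypothetical repository: one object ID, several aspects stored independently.
REPOSITORY: Dict[Tuple[str, Aspect], bytes] = {
    ("100000042", Aspect.VALUE): b"<html>...</html>",
    ("100000042", Aspect.METADATA): b'{"title": "Example"}',
}


def resolve(object_id: str, aspect: Aspect) -> bytes:
    """The ID alone is not enough; the caller must name the aspect wanted."""
    try:
        return REPOSITORY[(object_id, aspect)]
    except KeyError:
        raise LookupError(aspect.value + " not available for " + object_id) from None


print(resolve("100000042", Aspect.METADATA))
```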

An ID scheme may prohibit multiple IDs for a given object (in that scheme)

Pro:

• An unambiguous ID simplifies internal handling, especially with respect to storage.

• Object references can be directly compared for equality.

• ID value may be used to determine storage address.

Con:

• May prohibit “correct” ID, e.g. for component used in multiple contexts

Confidence: 5. Context dependent.

Remarks:

• Paradigma: “Physical” IDs, identifying document elements at lowest level, have a single unique ID, while “logical” IDs do not have this restriction

IDs are assigned one by one from a single central office

Pro:

• Procedures ensure that metadata is always available.

• The definition of the identified object is always known.

• No need to structure ID value, which may be opaque except for common prefix.

Con:

• Channel to assignment authority may become a bottleneck.

• (Final) ID assignment cannot be done until the object definition is available.

Confidence: In Paradigma: 9. Highly context dependent - in general contexts: 2-5

Remarks:

• In Paradigma, an ID can be reserved for a limited period of time prior to final assignment, to allow the ID to be inserted into the document text.
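The reservation step might look like the sketch below. It is not Paradigma's implementation: the class name, the serial number range and the explicit `now` parameter (used instead of a real clock so the behavior is easy to follow) are assumptions.

```python
import itertools
from typing import Dict


class IdOffice:
    """Central office that hands out IDs one by one, with time-limited reservations."""

    def __init__(self, reservation_seconds: float) -> None:
        self._serials = itertools.count(100_000_000)
        self._ttl = reservation_seconds
        self._reserved: Dict[str, float] = {}    # ID -> expiry time
        self._assigned: Dict[str, str] = {}      # ID -> object definition

    def reserve(self, now: float) -> str:
        """Reserve an ID so it can be inserted into the document text."""
        new_id = str(next(self._serials))
        self._reserved[new_id] = now + self._ttl
        return new_id

    def finalize(self, reserved_id: str, definition: str, now: float) -> None:
        """Final assignment; only possible while the reservation is still valid."""
        expiry = self._reserved.pop(reserved_id, None)
        if expiry is None or now > expiry:
            raise ValueError("no valid reservation for " + reserved_id)
        self._assigned[reserved_id] = definition

    def is_assigned(self, object_id: str) -> bool:
        return object_id in self._assigned


office = IdOffice(reservation_seconds=3600)
doc_id = office.reserve(now=0)
office.finalize(doc_id, "final document, containing its own ID", now=600)
print(doc_id, office.is_assigned(doc_id))
```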

An automated ID assignment service is provided

Pro:

• Saves manual labor.

• IDs may be assigned 24/7.

• Rapid response to assignment requests.

Con:

• Service must be protected against malicious use.

Confidence: In Paradigma: 9. In general contexts: 5-7

ID assignment may be requested by any user with no particular authorization

Pro:

• Allows user to create reference to object where the publisher has provided no ID.

Con:

• An object may be assigned an arbitrary number of IDs.

• An automated assignment service must be protected against malicious use.

Confidence: For referencing purposes (point/fragment IDs): 9.

Remarks:

• The resolution service may treat IDs assigned by unauthorized users differently from IDs assigned by recognized and authorized users such as publishers.

A generally available, online ID resolution service is provided

Pro:

• Satisfies user expectations for clickable links.

Con:

• A complex infrastructure may be required to provide a high quality service.

Confidence: 9

Remarks:

• The service may provide the object itself, metadata or other classes of information

A generally available authentication service is provided


Pro:

• Can be an essential aid in legal conflicts.

• Can be used to force storing of a snapshot of a dynamically changing resource.

Con:

• Must be provided by an authority recognized by all relevant parties.

• Implementation requires significant resources; all documents must be held by provider.

• Ignoring pure syntax differences of no semantic importance is very difficult.

Confidence: 6. For non-digital documents: Not applicable.

Remarks:

• For various reasons, object may be unavailable for retrieval, only for authentication.

• The service may provide information about degree of discrepancy.

• Paradigma: Generally available metadata allows a user to authenticate document locally (for strict equality only).
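Local authentication for strict equality could work along the lines of the sketch below. It assumes that the public metadata carries a SHA-256 digest of the object under a field named "sha256"; the presentation does not say which mechanism or field Paradigma uses.

```python
import hashlib


def authenticate(content: bytes, metadata: dict) -> bool:
    """Strict equality only: any byte-level difference makes the check fail."""
    return hashlib.sha256(content).hexdigest() == metadata["sha256"]


original = b"<p>Hello</p>"
metadata = {"sha256": hashlib.sha256(original).hexdigest()}   # assumed metadata field

print(authenticate(b"<p>Hello</p>", metadata))
print(authenticate(b"<p>Hello</p>\n", metadata))   # same meaning, one extra newline
```

The second call fails although the two documents differ only in a trailing newline, which is why ignoring pure syntax differences is listed as very difficult.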

Resolution/authentication infrastructure is document format independent, and requires no modifications of object contents

Pro:

• All current and future document formats can be handled.

• Management functions need not know a large number of document syntaxes.

Con:

• The functions cannot be based on content attributes, but must be based on independently managed structures.

Confidence: 10

The object must be available to the assignment authority at assignment time

Pro:

• Users are given a higher level of service: All assigned IDs have a valid definition.

Con:

• Delegation of ID assignment has to be restricted to those satisfying requirements.

• A complex infrastructure is required to make available and maintain definitions.

Confidence: In Paradigma: 9, otherwise: 5

Remarks:

• In an archival context, such as Paradigma, objects never disappear, so all IDs will be valid “forever”. This is not necessarily the case in other contexts.

A minimum set of metadata must be specified for ID allocation

Pro:

• Guarantees that the resolution service can provide some information about object.

Con:

• Requires that ID allocation is managed by an actor enforcing this requirement.

• Supplied metadata may be misleading or without value.

Confidence: 9

Point/fragment references

• Offered to users who need to reference other documents

Realized as a stored “reference object”

Identity of a document + starting position and length information

May reference abstract document (expression) or aggregate – starting position and length may be specified for each instance / aggregate component.

ID like a document – Paradigma: Norwegian branch of urn:nbn: space

Resolution service interprets ID as retrieval + positioning, if possible
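A stored reference object of this kind can be sketched as a small record plus a dereferencing function. The field names, the use of character offsets in plain text, and the identifier "urn:nbn:no-example-1" are hypothetical; real documents would need format specific positioning.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReferenceObject:
    """Identity of a document plus starting position and length information."""
    document_id: str
    start: int
    length: int


def dereference(ref: ReferenceObject, documents: Dict[str, str]) -> str:
    """Retrieval followed by positioning, here as a plain substring."""
    text = documents[ref.document_id]
    return text[ref.start:ref.start + ref.length]


documents = {"urn:nbn:no-example-1": "Identifiers should be opaque."}
ref = ReferenceObject("urn:nbn:no-example-1", start=22, length=6)
print(dereference(ref, documents))
```

In the scheme described above the reference object would itself receive an ID, so the fragment can be cited like any other document.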

