A METS Application Profile for Historical Newspapers Morgan Cundiff Network Development and MARC Standards Office Library of Congress

A METS Application Profile for Historical Newspapers

Morgan Cundiff

Network Development and MARC Standards Office

Library of Congress

Outline

• XML and Standards

• Definition of METS

• Definition of METS Profiles

• Use of MODS relatedItem element

• Draft METS Profile for Historical Newspapers

• Parting Thoughts

XML

“XML has become the de-facto standard for representing metadata descriptions of resources on the Internet.”

Jane Hunter, Working towards MetaUtopia - A Survey of Current Metadata Research

The Importance of Standards

“In moving from dispersed digital collections to interoperable digital libraries, the most important activity we need to focus on is standards… most important is the wide variety of metadata standards [including] descriptive metadata… administrative metadata…, structural metadata, and terms and conditions metadata…”

Howard Besser, The Next Stage: Moving from Isolated Digital Collections to Interoperable Digital Libraries

What is METS?

METS is an XML Schema designed for the purpose of creating XML document instances that express the hierarchical structure of digital library objects, the names and locations of the files that comprise those objects, and the associated metadata. METS can, therefore, be used as a tool for modeling real world objects, such as particular document types.

What are the 7 Sections of a METS Document?

<mets>
  <metsHdr/>
  <dmdSec/>
  <amdSec/>
  <fileSec/>
  <structMap/>
  <structLink/>
  <behaviorSec/>
</mets>
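As a rough illustration of the skeleton above, the following sketch checks that the children of a mets element are known sections and appear in the order shown on the slide. The helper names (local_name, sections_in_order) are hypothetical, and the check is deliberately simpler than validation against the METS schema.

```python
import xml.etree.ElementTree as ET

# Top-level METS sections in the order used on the slide above.
SECTION_ORDER = [
    "metsHdr", "dmdSec", "amdSec", "fileSec",
    "structMap", "structLink", "behaviorSec",
]

def local_name(element):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return element.tag.rsplit("}", 1)[-1]

def sections_in_order(root):
    """True if the children of <mets> are known sections in slide order."""
    names = [local_name(child) for child in root]
    if any(name not in SECTION_ORDER for name in names):
        return False
    ranks = [SECTION_ORDER.index(name) for name in names]
    return ranks == sorted(ranks)

skeleton = ET.fromstring(
    "<mets><metsHdr/><dmdSec/><amdSec/><fileSec/>"
    "<structMap/><structLink/><behaviorSec/></mets>"
)
shuffled = ET.fromstring("<mets><structMap/><dmdSec/></mets>")

print(sections_in_order(skeleton))  # True
print(sections_in_order(shuffled))  # False
```

Repeated sections (for example several dmdSec elements in a row) pass this check, because equal ranks keep the list sorted.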

The Descriptive Metadata Section with mdWrap

<mets>
  <dmdSec>
    <mdWrap>
      <xmlData>
        <!-- insert data from different namespace here -->
      </xmlData>
    </mdWrap>
  </dmdSec>
  <fileSec></fileSec>
  <structMap></structMap>
</mets>

The Descriptive Metadata Section with MODS as extension schema

<mets:mets>
  <mets:dmdSec>
    <mets:mdWrap>
      <mets:xmlData>
        <mods:mods></mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec></mets:fileSec>
  <mets:structMap></mets:structMap>
</mets:mets>

The Descriptive Metadata Section with MODS and relatedItem elements

<mets:mets>
  <mets:dmdSec>
    <mets:mdWrap>
      <mets:xmlData>
        <mods:mods>
          <mods:relatedItem type="constituent">
            <mods:relatedItem type="constituent"></mods:relatedItem>
          </mods:relatedItem>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec></mets:fileSec>
  <mets:structMap></mets:structMap>
</mets:mets>
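To show how a program might reach the wrapped record, here is a minimal sketch that parses a small METS document and pulls out the MODS record embedded in xmlData. The sample document, the ID value "dmd1", and the function name wrapped_mods are illustrative assumptions; the namespace URIs and the MDTYPE attribute on mdWrap come from the METS and MODS schemas, although the slides omit them for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of METS and MODS 3.x; the prefixes are arbitrary.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "mods": "http://www.loc.gov/mods/v3",
}

# Abbreviated sample document (assumption: not schema-validated).
METS_DOC = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="dmd1">
    <mets:mdWrap MDTYPE="MODS">
      <mets:xmlData>
        <mods:mods>
          <mods:titleInfo><mods:title>Baltimore Sun</mods:title></mods:titleInfo>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec/>
  <mets:structMap><mets:div/></mets:structMap>
</mets:mets>
"""

def wrapped_mods(root):
    """Return every MODS record embedded via dmdSec/mdWrap/xmlData."""
    return root.findall("mets:dmdSec/mets:mdWrap/mets:xmlData/mods:mods", NS)

root = ET.fromstring(METS_DOC)
records = wrapped_mods(root)
titles = [r.findtext("mods:titleInfo/mods:title", namespaces=NS) for r in records]
print(titles)  # ['Baltimore Sun']
```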

MODS relatedItem element

1. The relatedItem element is a child element of mods.

2. The relatedItem element has the same content model as mods (titleInfo, name, subject, physicalDescription, note, etc.).

3. The relatedItem element makes it possible to create very rich analytic descriptions for contained works within a MODS record.

4. The relatedItem element is repeatable and can be nested recursively (thus making it possible to build a hierarchical tree structure).

5. The relatedItem element makes it possible to associate descriptive data with any structural element.

Use of MODS relatedItem element to express logical structure

<mods:mods>
  <mods:titleInfo>
    <mods:title>Baltimore Sun</mods:title>
  </mods:titleInfo>
  <mods:relatedItem type="constituent">
    <mods:titleInfo>
      <mods:title>Sports</mods:title>
    </mods:titleInfo>
    <mods:relatedItem type="constituent">
      <mods:titleInfo>
        <mods:title>O’s Split Beantown Twi-niter</mods:title>
      </mods:titleInfo>
    </mods:relatedItem>
    <mods:relatedItem type="constituent">
      <mods:titleInfo>
        <mods:title>Chisox Nip Tribe</mods:title>
      </mods:titleInfo>
    </mods:relatedItem>
  </mods:relatedItem>
</mods:mods>
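Because relatedItem nests recursively, the logical tree can be recovered with a short recursive walk. The sketch below is one possible way to do it, using the titles from the example above; the function name outline and the "(untitled)" fallback are illustrative choices, not part of MODS.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

MODS_DOC = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title>Baltimore Sun</mods:title></mods:titleInfo>
  <mods:relatedItem type="constituent">
    <mods:titleInfo><mods:title>Sports</mods:title></mods:titleInfo>
    <mods:relatedItem type="constituent">
      <mods:titleInfo><mods:title>O's Split Beantown Twi-niter</mods:title></mods:titleInfo>
    </mods:relatedItem>
    <mods:relatedItem type="constituent">
      <mods:titleInfo><mods:title>Chisox Nip Tribe</mods:title></mods:titleInfo>
    </mods:relatedItem>
  </mods:relatedItem>
</mods:mods>
"""

def outline(item, depth=0):
    """Return (depth, title) pairs for an item and its nested constituents."""
    title = item.findtext(
        "mods:titleInfo/mods:title", default="(untitled)", namespaces=NS
    )
    rows = [(depth, title)]
    # Only direct children are matched here; recursion handles deeper levels.
    for child in item.findall("mods:relatedItem[@type='constituent']", NS):
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(ET.fromstring(MODS_DOC))
for depth, title in rows:
    print("  " * depth + title)
```

The printed outline indents each title by its depth, so the issue, its section, and the two articles appear as a three-level tree.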

METS document with two hierarchies (logical and physical)

<mets:mets>
  <mets:dmdSec>
    <mets:mdWrap>
      <mets:xmlData>
        <mods:mods>
          <mods:relatedItem>
            <mods:relatedItem></mods:relatedItem>
          </mods:relatedItem>
        </mods:mods>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>
  <mets:fileSec></mets:fileSec>
  <mets:structMap>
    <mets:div>
      <mets:div></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>

Linking in METS Documents (XML ID/IDREF links)

[Diagram, repeated on six consecutive slides. Labels: DescMD (mods, relatedItem); AdminMD (techMD (mix), sourceMD, digiprovMD, rightsMD); fileGrp (file, file); StructMap (div, div, fptr, div, fptr).]
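The ID/IDREF mechanism can be sketched in a few lines: every element that carries an ID is a possible link target, and the METS attributes DMDID, ADMID, and FILEID hold references to those targets. The sample document below is heavily abbreviated and not schema-valid, its ID values are invented, and the function name dangling_links is hypothetical; pointing a div's DMDID at a relatedItem ID is shown here as an assumption about how a profile of this kind might link physical and logical entities.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample (assumption: invented IDs, not schema-valid).
DOC = """\
<mets:mets xmlns:mets="http://www.loc.gov/METS/"
           xmlns:mods="http://www.loc.gov/mods/v3">
  <mets:dmdSec ID="dmd1">
    <mets:mdWrap MDTYPE="MODS"><mets:xmlData>
      <mods:mods>
        <mods:relatedItem ID="art1" type="constituent"/>
      </mods:mods>
    </mets:xmlData></mets:mdWrap>
  </mets:dmdSec>
  <mets:amdSec>
    <mets:techMD ID="tech1"/>
  </mets:amdSec>
  <mets:fileSec>
    <mets:fileGrp>
      <mets:file ID="img1" ADMID="tech1"/>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap>
    <mets:div DMDID="dmd1">
      <mets:div DMDID="art1"><mets:fptr FILEID="img1"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>
"""

# METS attributes that hold references to IDs elsewhere in the document.
IDREF_ATTRS = ("DMDID", "ADMID", "FILEID")

def dangling_links(root):
    """List (attribute, value) pairs that point at no ID in the document."""
    ids = {el.get("ID") for el in root.iter() if el.get("ID")}
    missing = []
    for el in root.iter():
        for attr in IDREF_ATTRS:
            # DMDID and ADMID may hold several space-separated references.
            for ref in el.get(attr, "").split():
                if ref not in ids:
                    missing.append((attr, ref))
    return missing

root = ET.fromstring(DOC)
print(dangling_links(root))  # []
```

A validating XML parser would perform the same check from the schema's ID and IDREF types; the explicit version only makes the linking visible.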

What is a METS Application Profile?

“METS Profiles are intended to describe a class of METS documents in sufficient detail to provide both document authors and programmers the guidance they require to create and process METS documents conforming with a particular profile.”

A profile is expressed as an XML document. There is a schema for this purpose. The profile expresses the requirements that a METS document must satisfy.

A sufficiently explicit METS Profile may be considered a “data standard”.

Note: A METS Profile is a human-readable prose document and is not intended to be “machine actionable”.

METS Profile for Historical Newspapers [draft]

The METS Profile for Historical Newspapers specifies how METS documents representing digitized historical newspapers should be encoded. Note that the profile is to be used to represent a single issue of a newspaper. The profile uses MODS to express the logical structure of a newspaper issue, and uses the METS structMap to express the physical structure of the newspaper issue. [draft abstract]

URL to find Profile and related documents:

http://www.loc.gov/standards/mets/test/ndnp/profile_notes.html

[email protected]

METS Profile (features)

Represents one issue of a newspaper.

METS Profile (features)

The Profile presumes the use of ALTO files (or some equivalent), in which the zones on the corresponding digital image (expressed as coordinates) are correlated with the corresponding logical entity (e.g. article or paragraph) and with the corresponding OCR text.

METS Profile (features)

The Profile maintains a strict separation between logical entities and physical entities.

METS Profile (features)

The primary logical entities are issue, issue section, article, article section, illustration, and advertisement.

The top-level MODS record describes the issue. The other primary logical entities (issue section, article, article section, illustration, and advertisement) are described in a hierarchy of MODS relatedItem elements.

METS Profile (features)

Logical structure is represented using MODS in the METS dmdSec. MODS version 3.2, the latest version at the time of this draft, is required.

Hierarchy of Logical Entities

• issue
  • issue section
    • article (or article-like entity)
      • paragraph
      • illustration (photograph, drawing, map, table)
      • article section
        • paragraph
        • illustration
    • illustration
    • advertisements
  • article
    • paragraph
    • illustration
    • article section
      • paragraph
      • illustration
  • illustration
  • advertisements

METS Profile (features)

The primary logical entities are expressed as values of the MODS genre element.

Use of MODS relatedItem element to express logical structure

<mods:mods>
  <mods:titleInfo>
    <mods:title>Baltimore Sun</mods:title>
  </mods:titleInfo>
  <mods:genre>newspaper</mods:genre>
  <mods:relatedItem type="constituent">
    <mods:titleInfo>
      <mods:title>Sports</mods:title>
    </mods:titleInfo>
    <mods:genre>section</mods:genre>
    <mods:relatedItem type="constituent">
      <mods:titleInfo>
        <mods:title>O’s Split Beantown Twi-niter</mods:title>
      </mods:titleInfo>
      <mods:genre>article</mods:genre>
      <mods:relatedItem type="constituent">
        <mods:titleInfo>
          <mods:title>Aparicio puts tag on Jensen to end 7th</mods:title>
        </mods:titleInfo>
        <mods:genre>photograph</mods:genre>
      </mods:relatedItem>
    </mods:relatedItem>
  </mods:relatedItem>
</mods:mods>

METS Profile (features)

The allowable genre values (for Profile compliance) are listed in Newspaper Genre Terms [draft].
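A compliance check for genre values could look roughly like the sketch below. The set ALLOWED_GENRES is a hypothetical stand-in for the draft term list and contains only the four values used in the slide example; the function name unknown_genres and the sample record are likewise illustrative.

```python
import xml.etree.ElementTree as ET

MODS = "{http://www.loc.gov/mods/v3}"

# Hypothetical stand-in for the draft term list; only the four values
# that appear in the slide example are included here.
ALLOWED_GENRES = {"newspaper", "section", "article", "photograph"}

MODS_DOC = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:genre>newspaper</mods:genre>
  <mods:relatedItem type="constituent">
    <mods:genre>section</mods:genre>
    <mods:relatedItem type="constituent">
      <mods:genre>article</mods:genre>
      <mods:relatedItem type="constituent">
        <mods:genre>cartoon</mods:genre>
      </mods:relatedItem>
    </mods:relatedItem>
  </mods:relatedItem>
</mods:mods>
"""

def unknown_genres(root, allowed):
    """Return genre values, in document order, that are not in `allowed`."""
    found = [(g.text or "").strip() for g in root.iter(MODS + "genre")]
    return [value for value in found if value not in allowed]

root = ET.fromstring(MODS_DOC)
print(unknown_genres(root, ALLOWED_GENRES))  # ['cartoon']
```

Here "cartoon" is flagged only because it is absent from the stand-in set; whether it is a permitted term depends on the actual draft list.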

METS Profile (features)

It is also possible to tag subparts of the primary logical entities. The typical example of this is tagging the paragraph. This is accomplished using the MODS part element.

METS Profile (features)

There are only three physical entities: issue, page, and pageRegion.

METS Profile (features)

The physical entities are represented in the structMap section of the METS document as typed div elements (for example, div TYPE="news:page"). There is only one structMap.

METS Profile (features)

Page regions are correlated to the corresponding logical entity by means of an IDREF link.

Note that one or more page regions may correspond to a single logical entity. This makes it possible to make the necessary associations when the logical entity is split into more than one physical entity, e.g. when a paragraph is continued on the next column or an article is continued on a different page.
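The many-to-one relationship described above can be illustrated by grouping page regions under the logical entity they reference. This is only a sketch under stated assumptions: the TYPE values "news:issue" and "news:pageRegion" are guessed by analogy with "news:page", the IDs are invented, DMDID is assumed to be the linking attribute, and regions_by_entity is a hypothetical helper. Note that METS spells the div attribute TYPE in upper case.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

METS = "{http://www.loc.gov/METS/}"

# Article "art1" starts on page 1 and continues on page 2 (invented IDs).
STRUCT_MAP = """\
<mets:structMap xmlns:mets="http://www.loc.gov/METS/">
  <mets:div TYPE="news:issue">
    <mets:div TYPE="news:page" ORDER="1">
      <mets:div TYPE="news:pageRegion" ID="p1r1" DMDID="art1"/>
      <mets:div TYPE="news:pageRegion" ID="p1r2" DMDID="art2"/>
    </mets:div>
    <mets:div TYPE="news:page" ORDER="2">
      <mets:div TYPE="news:pageRegion" ID="p2r1" DMDID="art1"/>
    </mets:div>
  </mets:div>
</mets:structMap>
"""

def regions_by_entity(root):
    """Map each referenced logical ID to its (page order, region ID) pairs."""
    groups = defaultdict(list)
    for page in root.iter(METS + "div"):
        if page.get("TYPE") != "news:page":
            continue
        for region in page.findall(METS + "div"):
            if region.get("TYPE") != "news:pageRegion":
                continue
            for ref in region.get("DMDID", "").split():
                groups[ref].append((page.get("ORDER"), region.get("ID")))
    return dict(groups)

regions = regions_by_entity(ET.fromstring(STRUCT_MAP))
print(regions["art1"])  # [('1', 'p1r1'), ('2', 'p2r1')]
```

An article continued on another page simply shows up as one key with regions from more than one page.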

METS Profile (features)

Example document

http://memory.loc.gov/cocoon/diglib/loc.news.sr.1002/default.html

Parting Thoughts

Agreement on a data standard (such as a METS profile) will facilitate interoperability. Interoperability can be between any two agents (digital library applications, preservation repositories, search and retrieval systems, etc.).

The newspaper community faces a “quality vs. quantity” dilemma. The large volume of material to be digitized necessitates automatic processing, but automatic processing produces dirty data and less satisfying results. High-quality processing (requiring more human intervention) is more expensive but produces far better results and pays dividends far into the future (the data will be used over and over without additional cost).
