Upload
colleen-craig
View
220
Download
0
Tags:
Embed Size (px)
Citation preview
Cornell CS 502
Metadata for the WebIssues and Simple Answers
CS 502 – 20020221Carl Lagoze – Cornell University
Cornell CS 502
“Metadata is data about data”
Cornell CS 502
Some untested hypotheses
• Metadata is useful for…– People– Machines
• More metadata is better• (semi) automated digital libraries and simple
metadata
Cornell CS 502
Some known facts
• Number and variety of metadata vocabularies will continue to increase
• The Tower of Babel is a franchise– There is not one common view of reality
• “The one thing I know about metadata is that it is expensive”
Cornell CS 502
Are metadata and data distinguishable?
• Objectivity?• Intellectual property?• Structure?• Aboutness?
Cornell CS 502
The fiction of classification
…there is no classification of the universe that is not fictional and
conjectural.
Jorge Luis Borges
Cornell CS 502
Lenses and Views
• All classification does and should provide a biased lens or view of reality
• Each view emphasizes certain characteristics and hides others
GeospatialRights
Museum
Cornell CS 502
Reality is Complex
Created by:George Castaldo
Created on:1994
Created by:Leonardo da Vinci
Created on:1506
Relationship?
Cornell CS 502
Objects are Related
IFLA Entity Model
Cornell CS 502
Entities, Events, and Agents
Photographer
Camera type Software
Computer artist
Cornell CS 502
Haven’t we done metadata already?
Cornell CS 502
What’s wrong with this model?
• Expensive– Complex (even for its original goal?) – Professional intervention (assumes single community
of expertise)
• Monolithic– One size fits all approach– Reflects its centralized system origins
• Bias towards physical artifacts– Fixed resources– Incomplete handling of resource evolution and other
resource relationships
• Anglo-centric
Cornell CS 502
Web Challenge to Traditional Cataloging
• Scale
• Permanence
• Authenticity
• Organizational Context
• Custodial Control
• Variety
Cornell CS 502
Internet Commons includes Multiple Communities
ScientificData
HomePages Geo
InternetCommons
Library
Museums
Commerce
Whatever...
Cornell CS 502
Realities of Web search and discovery
• Search systems are motivated by advertising• Index coverage is unpredictable and limited• Too much recall, too little precision• Index spam abounds• Resources (and their names) are volatile
Cornell CS 502
Metadata: Part of a Solution
• Structured data about data– helps to impose order on chaos– enables automated discovery/manipulation
• Variety across various dimensions:– specialization– decentralization– democratization
Cornell CS 502
Web Metadata Issues:Description vs. Discovery
• Library cataloging motivated by describing resources
• Fuzzy search buckets– Separate books about Sigmund Freud versus books
by Sigmund Freud into different buckets– But, different types of data appropriate for different
buckets: URLs, date strings, word strings, names
• But general, fuzzy categories may not be sufficient for describing resources
Cornell CS 502
Web Metadata Models:Drill-Down Searching Paradigm
• Moving along a specificity spectrum• Inter-domain vs. intra-domain terms, models,
query mechanisms• One size doesn't fit all
– Cognitive models of searching and browsing
Cornell CS 502
Metadata Takes Many Forms
resourcediscovery
documentadministration
rightsmanagement
contentrating
security andauthentication
archivalstatus
products andservices
databaseschemas
process controlor description
Cornell CS 502
Metadata:Part of the problem
cost
functionality
AACR2/MARC
googleDublin Core
Cornell CS 502
Metadata Challenges
• Accommodate multiple varieties of metadata– community-specific functionality, creation,
administration, access
• Tensions– functionality and simplicity – extensibility and interoperability– human and machine creation and use
Cornell CS 502
Interoperability has many facets
• Semantics– Meaning/classification/ontology
• Models/Structure– Entities and relationships
• Syntax– grammars to convey semantics and structure
Cornell CS 502
Warwick Framework: Containing Chaos
• Conceptual Architecture for metadata from the Warwick Metadata Workshop (DC-2)
• Conceptual architecture to support the specification, collection, encoding, and exchange of modular metadata
• Provide context for metadata efforts (including Dublin Core)– avoids the “black-hole” of comprehensive element
sets– focuses interoperability issues at package level
Cornell CS 502
Metadata Container
Container
Package
Dublin Core
Package
MARC record
Package
Indirect Reference
Package
Terms and Conditions
URI
Cornell CS 502
Modularization Allows Distributed Management
• Communities of expertise (not software vendors) are responsible for:– Semantics– Registration– Administration– Access management– Authority of data– Sharing and Distribution
Cornell CS 502
Modularization Implementation Issues
• Data encoding • Semantic interaction of overlapping sets
– between semantically-related packages– between semantically distinct packages
• Type registry
Cornell CS 502
Dublin Core Metadata Initiative
• A simple set of properties to support resource discovery on the web (fuzzy search buckets)?
• A cross-domain switchboard for interoperable metadata?
• An extensible ontology for resource desciption?
http://dublincore.org
Cornell CS 502
The fifteen Dublin Core Elements
Creator Title Subject
Contributor Date Description
Publisher Type Format
Coverage Rights Relation
Source Language I dentifi er
http://www.dublincore.org/documents/1999/07/02/dces/
Cornell CS 502
A Pidgin for Digital Tourists
• Metadata is language• Dublin Core is a small and simple language -- a
pidgin -- for finding resources across domains.• Speakers of different languages naturally
"pidginize" to communicate– E.g., tourists using simple phrases to order beer
("zwei Bier bitte" "dva pivo" "biru o san bai"...)
• We are all "tourists" on the global Internet.
Cornell CS 502
A Grammar of Dublin Core
• http://www.dlib.org/dlib/october00/baker/10baker.html
• By design not as subtle as mother tongues, but easy to learn and extremely useful in practice
• Pidgins: small vocabularies (Dublin Core: fifteen special nouns and lots of optional adjectives)
• Simple grammars: sentences (statements) follow a simple fixed pattern...
Cornell CS 502
Example Dublin Core statements
• Resource has Title 'Grammar of Dublin Core'.• Resource has Creator 'Tom Baker'.• Resource has Subject 'Metadata'.• Resource has Relation http://foo.org/file.htm.
Cornell CS 502
Resource has property
DC:CreatorDC:TitleDC:SubjectDC:Date...
X
implied subject
impliedverb
one of 15properties
property value(an appropriateliteral)