Mixed content, mixed metadata:
Information discovery in the NSDL
- 2 -
Experience from American Memory and NSDL
Caroline R. Arms and William Y. Arms
Mixed content, mixed metadata: information discovery in a messy world
In Metadata in Practice, Editors: Diane Hillmann and Elaine Westbrooks, ALA Editions (forthcoming)
- 3 -
The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).
The National Science Digital Library
http://nsdl.org/
- 4 -
Mixed Content
Examples: NSDL-funded collections at Cornell
Atlas. Data sets of earthquakes, volcanoes, etc.
Reuleaux. Digitized kinematics models from the nineteenth century
Laboratory of Ornithology. Sound recordings, images, and videos of birds and other animals.
Nuprl. Logic-based tools to support programming and to implement formal computational mathematics.
- 5 -
Effective Information Discovery Before Digital Information
Searching
(a) Resources separated into categories of related materials. Each category organized, indexed and searched separately.
(b) Catalogs and indexes built on tightly controlled metadata standards, e.g., MARC, MeSH headings, etc.
(c) Search engines used Boolean operators and fielded searching.
(d) Query languages and search interfaces assumed a trained user.
(e) Resources were physical items.
- 6 -
Effective Information Discovery With Homogeneous Digital Information
Comprehensive metadata with Boolean retrieval
Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).
Full text indexing with ranked retrieval
Can be excellent, but methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).
- 7 -
Mixed Metadata: the Chimera of Standardization
Technical reasons
(a) Characteristics of formats and genres
(b) Differing user needs
Social and cultural reasons
(a) Economic factors
(b) Installed base
- 8 -
Cross-Domain Metadata
Dublin Core
"... indexes [such as Lycos] are most useful in small collections within a given domain. As the scope of their coverage expands, indexes succumb to problems of large retrieval sets and problems of cross-disciplinary semantic drift. Richer records, created by content experts, are necessary to improve search and retrieval."
[Weibel 1995]
- 9 -
Information Discovery in a Messy World
Web search engines have adapted to a very large scale. Other techniques, such as cross-domain metadata and federated searching, have failed to scale up.
• What new concepts and techniques have enabled this adaptation?
• What can we learn that is applicable to other information discovery tasks?
• How is NSDL making use of this understanding?
- 10 -
Information Discovery in a Messy World
Building blocks
Brute force computation
The expertise of users -- human in the loop
Methods
(a) Better understanding of how and why users seek information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information
- 11 -
Understanding How and Why Users Seek Information
Homogeneous content
All documents are assumed equal
Criterion is relevance (binary measure)
Goal is to find all relevant documents (high recall)
Hits ranked in order of similarity to query
Mixed content
Some documents are more important than others
Goal is to find most useful documents on a topic and then browse
Hits ranked in order that combines importance and similarity to query
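The contrast between the two ranking approaches can be illustrated with a minimal sketch. It assumes a simple term-count cosine similarity, an importance score between 0 and 1 supplied from elsewhere, and a linear blend controlled by a hypothetical `weight` parameter; none of these choices comes from the slides, and the sample documents are invented.

```python
import math
from collections import Counter


def cosine_similarity(query, text):
    """Cosine similarity between raw term-count vectors of query and text."""
    q = Counter(query.lower().split())
    d = Counter(text.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def rank(query, documents, weight=0.5):
    """Rank (doc_id, text, importance) tuples for a query.

    weight=0.0 ranks by similarity alone (the homogeneous-content model);
    larger weights let importance influence the order (an assumed blend).
    """
    scored = []
    for doc_id, text, importance in documents:
        similarity = cosine_similarity(query, text)
        if similarity == 0.0:
            continue  # documents sharing no query term are not candidates
        score = weight * importance + (1 - weight) * similarity
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]


# Invented sample data: (identifier, text, importance in [0, 1]).
docs = [
    ("a", "earthquake data sets", 0.1),
    ("b", "earthquake and volcano data sets for teaching", 0.9),
    ("c", "bird sound recordings", 0.8),
]

by_similarity = rank("earthquake data", docs, weight=0.0)
by_combination = rank("earthquake data", docs, weight=0.5)
```

With similarity alone, the shorter document "a" ranks first; once importance is blended in, the more important document "b" moves ahead of it.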
- 12 -
Relationship and Contextual Information
Methods for capturing context
Analysis of citations and links (e.g., PageRank)
Mining usage logs (e.g., customers who buy the same product)
Reviews (e.g., reputation management)
Structural relationships (e.g., domain names)
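As an illustration of link analysis, the following is a minimal sketch of the PageRank idea using power iteration. The damping factor of 0.85, the fixed iteration count, the handling of pages without outgoing links, and the tiny link graph are all assumptions made for the example; it is not the algorithm as deployed by any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: two collections link to a portal,
# and the portal links back to one of them.
links = {"atlas": ["nsdl"], "reuleaux": ["nsdl"], "nsdl": ["atlas"]}
ranks = pagerank(links)
```

In this graph the page that receives the most links ("nsdl") ends up with the highest score, and the page that nothing links to ("reuleaux") with the lowest.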
- 13 -
Multi-Modal Information Discovery
With mixed content and mixed metadata, the amount of information about the various resources varies greatly
but clues from many different sources can be combined.
"The fundamental premise of the research was that the integration of these technologies, all of which are imperfect and incomplete, would overcome the limitations of each, and improve the overall performance in the information retrieval task."
[Wactlar, 2000]
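One way to picture combining imperfect and incomplete clues is a weighted average taken over whichever sources of evidence exist for a resource. This is a sketch under stated assumptions: the source names, the weights, and the choice to renormalize over available evidence are hypothetical and are not taken from the work quoted above.

```python
def combine_evidence(scores, weights):
    """Weighted average of the evidence scores that are available.

    scores maps a source name to a score in [0, 1], or to None when that
    kind of evidence does not exist for the resource.
    """
    available = {source: value for source, value in scores.items()
                 if value is not None}
    if not available:
        return 0.0
    total_weight = sum(weights[source] for source in available)
    return sum(weights[source] * value
               for source, value in available.items()) / total_weight


# Hypothetical weights for three kinds of evidence.
weights = {"metadata": 2.0, "full_text": 1.0, "links": 1.0}

# A resource described only by a metadata record.
metadata_only = combine_evidence(
    {"metadata": 0.8, "full_text": None, "links": None}, weights)

# A resource with metadata and full text, but no link information.
text_and_metadata = combine_evidence(
    {"metadata": 0.5, "full_text": 1.0, "links": None}, weights)
```

Renormalizing over the available sources keeps a resource with sparse information from being penalized simply because some kinds of evidence are missing.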
- 14 -
User Interfaces for Exploring Information
Search index
Return hits
Browse content
Return objects
- 15 -
NSDL: The Spectrum of Interoperability
Level        Agreements                                   Example
Federation   Strict use of standards (syntax,             AACR, MARC, Z 39.50
             semantic, and business)
Harvesting   Digital libraries expose metadata;           Open Archives metadata harvesting
             simple protocol and registry
Gathering    Digital libraries do not cooperate;          Web crawlers and search engines
             services must seek out information
- 16 -
The NSDL Repository
[Diagram labels: Users, Services, NSDL Repository, Collections]
The repository is a resource for service providers.
It holds information about every collection and item known to the NSDL, including contextual information.
- 17 -
NSDL Search Service: First Phase
[Diagram labels: Portal (three); Search and Discovery Service; Collections; SDLIP; harvest; crawl; NSDL Repository; Inquery -> Lucene]
- 18 -
NSDL Search Service: First Phase
Approach
(a) Collections map metadata to Dublin Core, provide via Open Archives protocol.
(b) Search service augments Dublin Core metadata with indexing of full-text where available.
(c) User interface returns snippets derived from the metadata, links to full content and to metadata.
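Step (a) can be sketched as follows. The example assumes a collection whose native records are simple dictionaries, a hypothetical field mapping, and a placeholder base URL; the `ListRecords` verb, the `oai_dc` metadata prefix, and the two XML namespaces are those defined for unqualified Dublin Core in the Open Archives protocol. No network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url):
    """Build the request a harvester would send for Dublin Core records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})
    return base_url + "?" + query


def to_dublin_core(native, field_map):
    """Map a native record (a dict) to an oai_dc XML fragment."""
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for native_field, dc_element in field_map.items():
        value = native.get(native_field)
        if value:  # native fields with no value are simply omitted
            ET.SubElement(root, f"{{{DC}}}{dc_element}").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical native record and mapping, for illustration only.
native_record = {"name": "Earthquake catalog", "kind": "Dataset", "notes": ""}
field_map = {"name": "title", "kind": "type", "notes": "description"}

url = list_records_url("http://example.org/oai")
record_xml = to_dublin_core(native_record, field_map)
```

Only fields with a Dublin Core counterpart survive the mapping, so the resulting records carry less information than the native ones.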
- 19 -
NSDL Search Service: First Phase
Weaknesses
(a) Ranking by similarity to query not sufficient.
(b) Snippets do not indicate why item was returned (e.g., terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) Browsing environment limited.
(e) Most users begin their search with a Web search engine (e.g., Google)
- 20 -
NSDL Search Service: Second Phase Developments
Metadata
(a) Accept any metadata that is available in a range of formats
(b) System for reviews and annotations, with reputation management
Search system
(a) Multimodal retrieval and ranking
(b) Dynamic generation of snippets by search engine
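Dynamic snippet generation can be sketched as extracting a window of text around the first query term found in the indexed content, so that the snippet shows why the item matched. The window width, the whole-word matching rule, and the fallback to the opening text are assumptions for this example and do not describe the NSDL search engine.

```python
import re


def make_snippet(text, query, width=30):
    """Return text surrounding the first occurrence of any query term.

    Falls back to the opening of the text when no term is found.
    """
    terms = [re.escape(term) for term in query.split()]
    match = None
    if terms:
        pattern = r"\b(" + "|".join(terms) + r")\b"
        match = re.search(pattern, text, flags=re.IGNORECASE)
    if match is None:
        return text[:2 * width]
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


# Invented sample text for illustration.
sample_text = ("This collection holds digitized kinematics models from the "
               "nineteenth century, with photographs and teaching notes.")

snippet = make_snippet(sample_text, "kinematics")
fallback = make_snippet(sample_text, "volcano")
```

Because the snippet is built from the text that was actually searched, it can show a matching term even when that term appears in the full text but not in the metadata record.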
- 21 -
NSDL Search Service: Second Phase Developments (cont.)
Usability and human factors
(a) Wider range of browsing tools (e.g., collection visualization)
(b) Filters by education level and education quality, where known
Web compatibility
(a) Expose records for Web crawlers to index
(b) Browser bookmarklet to add NSDL information to Web pages