7/28/2019 i r Lecture 150513
12/11/98 SIMS Affiliates Meeting
Organizing Information:
Metadata and Controlled Vocabularies
Ray R. Larson
University of California, Berkeley
School of Information Management and Systems
Overview: Metadata and Controlled Vocabularies
Definitions
Origins and Uses of Controlled Vocabularies for Information Retrieval
Metadata
Types of Indexing Languages, Thesauri and Classification Systems
Process of Design and Development of Thesauri
Information Organization and Retrieval
To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
Knowledge is knowing, familiarity gained by experience; person's range of information; a theoretical or practical understanding of; the sum of what is known.
To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley
Information Properties
Information can be communicated electronically
  Broadcasting
  Networking
Information can be easily duplicated and shared
  Problems of Ownership
  Problems of Control
Adapted from Silicon Dreams by Robert W. Lucky
Information Hierarchy
Data
  The raw material of information
Information
  Data organized and presented by someone
Knowledge
  Information read, heard or seen and understood
Wisdom
  Distilled and integrated knowledge and understanding
Information Hierarchy
Wisdom
Knowledge
Information
Data
Information Life Cycle
[Figure: cycle diagram. Stages: Creation; Utilization; Searching; Active; Semi-Active; Inactive; Retention/Mining; Disposition; Discard. Activities: Authoring/Modifying; Organizing/Indexing; Storing/Retrieval; Distribution/Networking; Accessing/Filtering; Using/Creating.]
Information Life Cycle
Authoring/Modifying
Organizing/Indexing
Storing/Retrieving
Distribution/Networking
Accessing/Filtering
Using/Creating
Origins
Very early history of content representation
Sumerian tokens and envelopes
Alexandria - pinakes
Indices
Origins
Biblical Indexes and Concordances (Hugo
de St. Caro & 500 monks, 1247 -- KWIC)
Journal Indexes
Information Explosion following WWII
Cranfield Studies of indexing languages and
information retrieval
Development of bibliographic databases
Index Medicus -- production and Medlars searching
Origins
Communication theory revisited
Problems with transmission of meaning
[Figure: Source -> Encoding -> Message -> Channel (subject to Noise) -> Message -> Decoding -> Destination]
[Figure: Source -> Encoding (writing/indexing) -> Message -> Storage -> Message -> Decoding (Retrieval/Reading) -> Destination]
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19.
Search line: Interest profiles & Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store 1: Profiles/Search requests.
Storage line: Documents & data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store 2: Document representations.
Store 1 and Store 2 feed Comparison/Matching, which yields Potentially Relevant Documents.
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]
Metadata
Data about data
Information about Information
Description of information structure and
contents for individual information items, or
entire collections of information
Types of Metadata
Element names.
Element description.
Element representation.
Element coding.
Element semantics.
Element classification.
Metadata Systems
AACRII/MARC
Dublin Core
RDF (Resource Description Framework)
SGML/XML
DBMS Metadata
Controlled vocabularies
Goals of Descriptive Cataloging (AACRII/MARC)
1. To enable a person to find a document of which
  the author, or
  the title, or
  the subject is known
2. To show what a library has
  by a given author
  on a given subject (and related subjects)
  in a given kind (or form) of literature.
3. To assist in the choice of a document
  as to its edition (bibliographically)
  as to its character (literary or topical)
Charles A. Cutter, 1876
Dublin Core Elements
Title
Creator
Subject
Description
Publisher
Other Contributors
Date
Resource Type
Format
Resource Identifier
Source
Language
Relation
Coverage
Rights Management
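As an illustration only, a Dublin Core description can be held as a simple key/value record. The sketch below assumes Python; the element names follow the slide, while the sample values and the helper functions `missing_elements` and `unknown_elements` are hypothetical.

```python
# The fifteen Dublin Core elements as listed on the slide.
DUBLIN_CORE_ELEMENTS = [
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Other Contributors", "Date", "Resource Type", "Format",
    "Resource Identifier", "Source", "Language", "Relation",
    "Coverage", "Rights Management",
]

# Hypothetical record describing this presentation; values are illustrative.
record = {
    "Title": "Organizing Information: Metadata and Controlled Vocabularies",
    "Creator": "Ray R. Larson",
    "Language": "en",
}


def missing_elements(rec):
    """Return the element names that the record leaves empty."""
    return [name for name in DUBLIN_CORE_ELEMENTS if not rec.get(name)]


def unknown_elements(rec):
    """Return keys in the record that are not Dublin Core elements."""
    return [name for name in rec if name not in DUBLIN_CORE_ELEMENTS]
```

A flat dictionary is enough here because the sketch treats each element as optional and single-valued; repeating elements would need lists as values.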
RDF (W3C)
A model for representing named properties
and property values
Resources (the things described)
Properties (aspects, attributes, characteristics of
resources)
Statements (Resource + Property + Value of Property for the Resource)
Expressed in XML
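A minimal sketch of the statement model, assuming Python rather than the XML serialization: each statement is a (resource, property, value) triple. The resource identifier, the property names, and the `properties_of` helper are hypothetical.

```python
from collections import namedtuple

# One statement: a resource, a named property, and that property's value.
Statement = namedtuple("Statement", ["resource", "prop", "value"])

# Hypothetical resource identifier and property names, for illustration only.
DOC = "http://example.org/lecture"
statements = [
    Statement(DOC, "Title", "Organizing Information"),
    Statement(DOC, "Creator", "Ray R. Larson"),
    Statement(DOC, "Language", "en"),
]


def properties_of(resource, stmts):
    """Collect all property/value pairs stated about one resource."""
    return {s.prop: s.value for s in stmts if s.resource == resource}
```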
SGML & XML
What is SGML/XML?
Document Type Definitions
Document Markup
Sources and Resources
Databases & Metadata
Particularly in the Relational Model, metadata is part of the database, providing information about the structure and contents of the database
  Which relations (tables) are in the DB
  Relation (table) attributes (domains)
  Attribute representation and storage
  Other information (indexes, etc.)
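One way to see database metadata in practice is to ask a relational system to describe itself. The sketch below uses Python's standard `sqlite3` module with SQLite's own catalog (`sqlite_master`) and `PRAGMA table_info`; the `documents` table and `idx_author` index are hypothetical, and other DBMSs expose their catalogs differently.

```python
import sqlite3

# In-memory database with one hypothetical table, for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
)
conn.execute("CREATE INDEX idx_author ON documents (author)")

# Which relations (tables) and indexes are in the database?
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()

# Which attributes does a relation have, and how are they represented?
# PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk).
columns = [
    (row[1], row[2]) for row in conn.execute("PRAGMA table_info(documents)")
]

print(catalog)
print(columns)
```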
Controlled Vocabularies
Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.
Controlled Vocabularies
Names and name authorities
Design of controlled vocabularies for
subject access -- Thesaurus design
Names
Cutter's (1876) objectives of bibliographic description:
  To enable a person to find a document of which the author is known.
  To show what the library has by a given author.
First serves access. Second serves collocation.
Problems with Names
How many names should be associated with
a document?
Which of these should be the main entry?
What form should each of the names take?
What references should be made from other possible forms of names that haven't been used?
The problem
Proliferation of the forms of names
  Different names for the same person
  Different people with the same names
Examples
  from Books in Print (semi-controlled but not consistent)
  ERIC author index (not controlled)
Rules for description
AACR II and other sets of descriptive
cataloging rules provide guidelines for:
Determining the number of name entries
Choosing a main entry
Deciding on the form of name to be used
Deciding when to make references
Authority control
Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as "established") based on some set of rules.
If you have rules, why do you need to keep track of all of the headings?
Conditions of Authorship?
Single person or single corporate entity
Unknown or anonymous authors
Shared responsibility
Collections or editorially assembled works
Works of mixed responsibility (e.g. translations)
Related Works
Added Entries
Personal names
  Collaborators
  Editors, compilers, writers
  Translators (in some cases)
  Illustrators (in some cases)
  Other persons associated with the work (such as the honoree in a Festschrift).
Corporate Names
  Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.
Choice of Name
AACR II says that the predominant form of the name used in a particular author's writings should be chosen as the form of name.
References should be made from the other forms of the name.
Form of the Name
When names appear in multiple forms, one form needs to be chosen. Criteria for choice are
  Fullness (e.g. full names vs. initials only)
  Language of the name.
  Spelling (choose predominant form)
Entry element:
  John Smith or Smith, John?
  Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
Name Authority Files
ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973
Different names for the same person
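The record above can be read as a lookup table: every 4XX "see from" form leads to the single 1XX established heading. The Python sketch below assumes the simplified display format shown on the slide (tag, indicators, then data with `$` subfield codes), not full MARC parsing; `parse_field` and `build_lookup` are hypothetical helpers, and the 5XX "see also" field is ignored.

```python
# A few fields copied from the authority record above (tag, indicators, data).
RECORD_LINES = [
    "100 10 Creasey, John",
    "400 10 Cooke, M. E.",
    "400 10 Gill, Patrick,$d1908-1973",
    "400 10 Marsden, James",
    "400 20 St. John, Henry,$d1908-1973",
]


def parse_field(line):
    """Split a display line into its tag and the name before any subfield."""
    tag, _indicators, data = line.split(" ", 2)
    name = data.split("$")[0].rstrip(",")
    return tag, name


def build_lookup(lines):
    """Map every name form (1XX heading and 4XX references) to the heading."""
    fields = [parse_field(line) for line in lines]
    heading = next(name for tag, name in fields if tag == "100")
    return {name: heading for _tag, name in fields}


lookup = build_lookup(RECORD_LINES)
print(lookup["Gill, Patrick"])  # Creasey, John
```

Keeping the lookup as data is one answer to why headings must be tracked even when rules exist: rules say how to form a heading, but not which variant forms have already been linked to it.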
Name Authority Files
ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)
Name authority files
ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91 Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)
Different people writing with the same name
Controlled Vocabularies for Information Access
"The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all." (W.H. Auden)
Similarly, there are too many ways of expressing or explaining the topic of a document.
Controlled vocabularies are sets of rules for topic identification and indexing, and a THESAURUS, which consists of lead-in vocabulary and a limited and selective Indexing Language, sometimes with special coding or structures.
Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19.
Search line: Interest profiles & Queries -> Formulating query in terms of descriptors -> Storage of profiles -> Store 1: Profiles/Search requests.
Storage line: Documents & data -> Indexing (Descriptive and Subject) -> Storage of Documents -> Store 2: Document representations.
Store 1 and Store 2 feed Comparison/Matching, which yields Potentially Relevant Documents.
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language).]
Uses of Controlled Vocabularies
Library Subject Headings, Classification and Authority Files.
Commercial Journal Indexing Services and databases
Yahoo, and other Web classification schemes
Online and Manual Systems within organizations
  SunSolve
  MacArthur
Types of Indexing Languages
Uncontrolled Keyword Indexing
Indexing Languages
  Controlled, but not structured
Thesauri
  Controlled and Structured
Classification Systems
  Controlled, Structured, and Coded
Faceted Classification Systems
Indexing Languages
An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.
An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.
Indexing Languages
Library of Congress Subject Headings
Yellow Pages Topics
Wilson Indexes (Readers' Guide)
Thesauri
A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
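The link types named in this definition can be sketched as a small data structure. The Python below is an assumption-laden illustration: the `Term` class, the sample entries, and the `ancestors` helper are all hypothetical and are not taken from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term (descriptor) and its links to other terms."""
    name: str
    used_for: list = field(default_factory=list)   # synonymous/equivalent terms
    broader: list = field(default_factory=list)    # broader terms (BT)
    narrower: list = field(default_factory=list)   # narrower terms (NT)
    related: list = field(default_factory=list)    # related terms (RT)


# Hypothetical entries, invented for illustration.
thesaurus = {
    "Dogs": Term("Dogs", used_for=["Canines"], broader=["Mammals"],
                 narrower=["Terriers"], related=["Pets"]),
    "Mammals": Term("Mammals", narrower=["Dogs"]),
    "Terriers": Term("Terriers", broader=["Dogs"]),
}


def ancestors(name, terms):
    """Follow broader-term links upward and return every ancestor found."""
    found = []
    for parent in terms[name].broader:
        found.append(parent)
        if parent in terms:
            found.extend(ancestors(parent, terms))
    return found
```

Walking the broader-term links is what lets a search on a general term also reach documents indexed under its narrower terms.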
Thesauri (cont.)
National and International Standards for Thesauri
  ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
  ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval
  ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
  ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
Thesauri (cont.)
Examples:
  The ERIC Thesaurus of Descriptors
  The Art and Architecture Thesaurus
  The Medical Subject Headings (MeSH) of the National Library of Medicine
Why develop a thesaurus?
To provide a conceptual structure or "space" for a body of information
To make it possible to adequately describe the topical contents of informational objects at an appropriate level of generality or specificity
To provide enhanced search capabilities and to improve the effectiveness of searching (i.e., to retrieve most of the relevant material without too much irrelevant material).
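The two halves of that last goal are commonly quantified as recall (how much of the relevant material was retrieved) and precision (how much of what was retrieved is relevant). The slide does not name these measures, so the Python sketch below is an interpretation; the document identifiers are hypothetical.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical search result: 3 of 4 relevant documents found, plus 2 misses.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d8", "d9"}

print(recall(retrieved, relevant))     # 0.75
print(precision(retrieved, relevant))  # 0.6
```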
Why develop a thesaurus?
To provide vocabulary (or terminological) control.
  When there are several possible terms designating a single concept, the thesaurus should lead the indexer or searcher to the appropriate concept, regardless of the terms they start with.
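A minimal sketch of this kind of control, assuming Python: a lead-in vocabulary maps each entry term to one preferred term, so indexer and searcher converge on the same descriptor. The vocabulary entries and the `preferred_term` helper are hypothetical.

```python
# Hypothetical lead-in vocabulary: entry term -> preferred term (descriptor).
LEAD_IN = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "motor cars": "Automobiles",
    "automobiles": "Automobiles",
}


def preferred_term(entry, lead_in=LEAD_IN):
    """Return the descriptor for an entry term, or None if it is unknown."""
    return lead_in.get(entry.strip().lower())


print(preferred_term("Autos"))        # Automobiles
print(preferred_term(" Motor cars"))  # Automobiles
print(preferred_term("Trucks"))       # None
```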
Preliminary considerations
What is used now?
  Continue using an existing thesaurus?
  Ad hoc modification of existing thesaurus?
  Develop a new well-structured thesaurus?
What is the scope and complexity of the subject field?
What kind of retrieval objects or data will be dealt with?
How exhaustive and specific is the desired description of objects?
Preliminary Considerations
The scope and complexity of the field will provide some indication of the scope and complexity of the thesaurus.
It is better to plan for a larger and more comprehensive system than a smaller system that will rapidly become inadequate as the database grows.
Development of a good thesaurus requires a major intellectual effort as well as clerical operations like data entry and production of sorted lists.
Development of a Thesaurus
Term Selection.
Merging and Development of Concept Classes.
Definition of Broad Subject Fields and Subfields.
Development of Classificatory structure
Review, Testing, Application, Revision.
Flow of Work in Thesaurus Construction
[Figure: flowchart, based on Soergel, pp. 327-333. Boxes in reading order; the connecting arrows are not recoverable from the text:]
Select Sources
Assign codes
Select Terms
Record Selected Terms
Sort Terms
Merge identical Terms
Define Broad Subject Fields
Merge Terms in Same Concept class
Sort Terms into Broad Subject Fields
Define Subfields within one Subject Field
Work out detailed structure of the Subject Field
Select Preferred Terms
Decision: All Subfields of Broad Subject finished? (Yes/No)
Decision: All Broad Subjects finished? (Yes/No)
Improve Class Structure
Print Classified Index and review
Discuss with Experts and Users
Select descriptors and checklist items
Produce Full Thesaurus and Check references
Assign Notation
Review and Test
Decision: Many Modifications? (Yes/No)
Revise as needed
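The early clerical steps in this flow (recording selected terms with their source codes, sorting them, and merging identical terms) can be sketched in a few lines of Python. The term/source pairs and the `merge_identical` helper are hypothetical, and treating case and whitespace variants as "identical" is an assumption of this sketch.

```python
from collections import defaultdict

# Hypothetical (term, source code) pairs recorded during term selection.
selected = [
    ("Thesauri", "S1"),
    ("indexing", "S2"),
    ("thesauri", "S2"),
    ("Indexing ", "S1"),
    ("Classification", "S3"),
]


def merge_identical(pairs):
    """Sort terms and merge identical ones, keeping the codes of all sources."""
    merged = defaultdict(set)
    for term, source in pairs:
        merged[term.strip().lower()].add(source)
    return {term: sorted(sources) for term, sources in sorted(merged.items())}


terms = merge_identical(selected)
```

The later steps (defining subject fields, choosing preferred terms, working out structure) are the intellectual effort and do not reduce to a mechanical rule like this one.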
The Indexing Process
Concept identification
term selection (via thesaurus)
term assignment
Application: The Indexing Process (Manual)
[Figure: flowchart, adapted from ISO 5963, p. 5. Boxes and decisions listed in approximate order; each decision has YES and NO branches, and the connecting arrows are not recoverable from the text:]
Start
Examine Document and Identify Significant Concepts
Consider First Concept
Does Thesaurus contain term for Concept?
Establish Term Denoting Concept
Preferred Term?
Select Preferred Term
Consider Preferred Term
Is Term suitable?
Select Alternative term to represent Concept
Consider any associated terms in Thesaurus (NT, BT)
Would Concept be better represented by one of these terms?
Prefer Alternative Term(s)
Can Concept be expressed combining terms?
Consider Each of These Terms
Admit New Term Into Thesaurus
Is There Another Concept?
Assign Terms to Document
End
Adapted from ISO 5963, p.5
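A much-reduced sketch of the manual indexing loop, assuming Python: for each concept identified in a document, use the descriptor if the thesaurus has one, follow the lead-in vocabulary if the concept was expressed with a non-preferred term, and otherwise set the concept aside as a candidate new term. The vocabulary and the `index_document` helper are hypothetical, and the steps for considering associated terms (NT, BT) and for combining terms are deliberately left out.

```python
# Hypothetical thesaurus: descriptors plus lead-in terms pointing to them.
DESCRIPTORS = {"Automobiles", "Engines"}
LEAD_IN = {"cars": "Automobiles", "motors": "Engines"}


def index_document(concepts):
    """Assign descriptors to a document; collect concepts the thesaurus lacks."""
    assigned, candidates = [], []
    for concept in concepts:
        if concept in DESCRIPTORS:            # already a preferred term
            term = concept
        elif concept.lower() in LEAD_IN:      # non-preferred: follow the lead-in
            term = LEAD_IN[concept.lower()]
        else:                                 # no term: candidate for admission
            candidates.append(concept)
            continue
        if term not in assigned:              # assign each descriptor once
            assigned.append(term)
    return assigned, candidates


assigned, candidates = index_document(["Cars", "Engines", "motors", "Hovercraft"])
```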
Classification Systems
A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.
Classification Systems (cont.)
Examples:
  The Library of Congress Classification System
  The Dewey Decimal Classification System
  The ACM Computing Reviews Categories
  The American Mathematical Society Classification System
Automatic Indexing and Classification
Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.
More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.
Automatic classification attempts to automatically group similar documents using either:
  A fully automatic clustering method.
  An established classification scheme and set of documents already indexed by that scheme.
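The simple form of automatic indexing described first, deriving keywords and providing access through them, can be sketched as an inverted index. The Python below is an illustration under stated assumptions: the tokenizer, the short stop list, and the sample documents are all hypothetical.

```python
import re
from collections import defaultdict

# Small hypothetical stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "for", "of", "the", "to"}


def keywords(text):
    """Derive keywords: lowercase word tokens minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_index(docs):
    """Build an inverted index mapping each keyword to the documents using it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index


docs = {
    1: "Guidelines for the construction of thesauri",
    2: "The construction of indexes",
}
index = build_index(docs)
print(sorted(index["construction"]))  # [1, 2]
print(sorted(index["thesauri"]))      # [1]
```

Nothing here controls vocabulary: "thesauri" and "thesaurus" would be indexed as unrelated words, which is the gap the more complex systems try to close.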
Automatic Class Assignment
[Figure: a set of documents (Doc) submitted to a Search Engine]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered clusters are order-independent, usually based on an intellectually derived scheme
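The four steps can be sketched as follows, assuming Python and a plain term-frequency cosine similarity in place of a real search engine's ranking. The class pseudo-documents, the threshold value, and the `assign` helper are hypothetical.

```python
import math
import re
from collections import Counter


def vector(text):
    """Bag-of-words term-frequency vector for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Step 1: hypothetical pseudo-documents, one per intellectually derived class.
CLASSES = {
    "Medicine": "disease patient clinical treatment drug",
    "Computing": "software algorithm computer data program",
}


def assign(document, threshold=0.2, top_only=False):
    """Steps 2-4: rank classes against the document, then pick categories."""
    doc_vec = vector(document)
    ranked = sorted(
        ((cosine(doc_vec, vector(text)), name) for name, text in CLASSES.items()),
        reverse=True,
    )
    if top_only:
        return [ranked[0][1]]          # exclusive: top-ranked category only
    return [name for score, name in ranked if score >= threshold]  # overlapping
```

The `top_only` switch corresponds to the exclusive case and the threshold to the overlapping case; lowering the threshold lets one document fall into several classes.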
References
Soergel, D. Indexing Languages and Thesauri: Construction and Maintenance. Los Angeles: Melville Publishing Co., 1974.
Foskett, A.C. The Subject Approach to Information. London: Clive Bingley, 1982.
Standards:
  ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri
  ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri
  ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri