10/21/98 Organization of Information in Collections Subject Access to Collections: Introduction University of California, Berkeley School of Information Management and Systems SIMS 245: Organization of Information In Collections

10/21/98 Organization of Information in Collections

Subject Access to Collections: Introduction

University of California, Berkeley

School of Information Management and Systems

SIMS 245: Organization of Information In Collections

Review

• Review of Description

• Goal of IR is to retrieve all and only the “relevant” documents in a collection for a particular user with a particular need for information
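
The “all and only” goal is commonly made measurable with two ratios: recall (were all relevant documents retrieved?) and precision (were only relevant documents retrieved?). Those measures are not named on the slide; the sketch below is a minimal illustration with hypothetical document IDs.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 4 relevant documents exist, 5 were retrieved, 3 overlap.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d1", "d2", "d3", "d7", "d9"}
precision, recall = precision_recall(retrieved, relevant)
```

Here precision is 3/5 and recall is 3/4; a perfect "all and only" result would score 1.0 on both.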

Indexing Languages and Thesauri

• Origins and Uses of Controlled Vocabularies for Information Retrieval

• Types of Indexing Languages, Thesauri and Classification Systems

Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information.

What is a “Controlled Vocabulary”?

• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)

• Similarly, there are too many ways of expressing or explaining the topic of a document.

• A controlled vocabulary consists of a set of rules for topic identification and indexing, together with a THESAURUS, which consists of a “lead-in vocabulary” and a limited and selective “Indexing Language”, sometimes with special coding or structures.

Uses of Controlled Vocabularies

• Library Subject Headings, Classification and Name Authority Files.

• Commercial Journal Indexing Services and databases

• Yahoo, and other Web classification schemes

• Online and Manual Systems within organizations

– SunSolve

– MacArthur

Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-21-91
Other Versions: earlier
040    DLC$cDLC$dDLC$dOCoLC
053    PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person

Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 08-19-91
040    OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663    Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670    OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670    LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670    Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)

Name Authority Files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD:
VST:d 06-06-91
Other Versions: earlier
040    DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670    His The durable desperadoes, 1973.
670    His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670    His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name
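
The two problems these records illustrate (many names for one person, one name for several people) can be sketched as a lookup from any name form to the established heading(s). The dictionary below is a simplified, hand-transcribed subset of the 100, 400 and 500 fields in the records above; the `see_from`/`see_also` keys and the date-stripping rule are illustrative assumptions, not MARC processing rules.

```python
import re
from collections import defaultdict

# Established heading (100) -> variant names (400) and related headings (500).
AUTHORITY = {
    "Creasey, John": {
        "see_from": ["Cooke, M. E.", "Fecamps, Elise", "Marsden, James"],
        "see_also": ["Ashe, Gordon, 1908-1973"],
    },
    "Marric, J. J., 1908-1973": {
        "see_from": [],
        "see_also": ["Creasey, John"],
    },
    "Butler, William Vivian, 1927-": {
        "see_from": ["Butler, W. V. (William Vivian), 1927-", "Marric, J. J., 1927-"],
        "see_also": [],
    },
}

def normalize(name):
    """Drop a trailing date qualifier such as ', 1908-1973' or ', 1927-'."""
    return re.sub(r",\s*\d{4}-(\d{4})?$", "", name).strip().lower()

def build_index(authority):
    """Map every normalized name form to the headings it can lead to."""
    index = defaultdict(set)
    for heading, record in authority.items():
        index[normalize(heading)].add(heading)
        for variant in record["see_from"]:
            index[normalize(variant)].add(heading)
    return index

INDEX = build_index(AUTHORITY)

def lookup(name):
    """Return all established headings a searcher's name form may refer to."""
    return sorted(INDEX.get(normalize(name), set()))
```

In this sketch `lookup("Fecamps, Elise")` leads to the single heading for Creasey, while `lookup("Marric, J. J.")` returns two headings, because both Creasey (1908-1973) and Butler (1927-) wrote under that name.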

Indexing Languages

• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents.

• An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms.

Types of Indexing Languages

• Uncontrolled Keyword Indexing

• Indexing Languages

– Controlled, but not structured

• Thesauri

– Controlled and Structured

• Classification Systems

– Controlled, Structured, and Coded

• Faceted Classification Systems

Indexing Languages

• Library of Congress Subject Headings

• Yellow Pages Topics

• Wilson Indexes (“Readers’ Guide”)

Controlled Vocabulary

• Start with the text of the document

• Attempt to “control” or regularize:

– The concepts expressed within

  • mutually exclusive

  • exhaustive

– The language used to express those concepts

  • limit the normal linguistic variations

  • regulate word order and structure of phrases

  • reduce the number of synonyms or near-synonyms

• Also, provide cross-references between concepts and their expression.

See Bates, 1988

Subject Headings vs. Descriptors

Subject Headings:

• Describe the contents of an entire document

• Designed to be looked up in an alphabetical index

– Look up document under its heading

• Few (1-5) headings per document

Descriptors:

• Describe one concept within a document

• Designed to be used in Boolean searching

– Combine to describe the desired document

• Many (5-25) descriptors per document

Subject Heading vs. Descriptor Example

• WILSONLINE

– Athletes

– Athletes--Health & Hygiene

– Athletes--Nutrition

– Athletes--Physical Exams

– …

– Athletics

– Athletics -- Administration

– Athletics -- Equipment -- Catalogs

– …

– Sports -- Accidents and injuries

– Sports -- Accidents and injuries -- prevention

• ERIC

– Athletes

– Athletic Coaches

– Athletic Equipment

– Athletic Fields

– Athletics

– …

– Sports psychology

– Sportsmanship

Assigning Headings vs. Descriptors

• Subject headings -- assign one (or a few) complex heading(s) to the document

• Descriptors -- mix and match

– How would we describe recipes using each technique?
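
One possible way to picture the recipe question: with subject headings, each recipe files under one pre-combined heading that is looked up alphabetically; with descriptors, each recipe carries several single-concept terms that are combined at search time. All recipe IDs, headings and descriptors below are invented for illustration.

```python
# Hypothetical recipes indexed with pre-coordinated subject headings.
HEADINGS = {
    "r1": ["Cookery, Italian -- Pasta"],
    "r2": ["Cookery, Italian -- Pasta"],
    "r3": ["Cookery, Thai -- Noodles"],
}

# The same recipes indexed with single-concept descriptors.
DESCRIPTORS = {
    "r1": {"Italian", "Pasta", "Vegetarian"},
    "r2": {"Italian", "Pasta", "Beef"},
    "r3": {"Thai", "Noodles", "Vegetarian"},
}

def heading_index(headings):
    """Alphabetical index: heading -> recipes filed under it."""
    index = {}
    for recipe, assigned in headings.items():
        for heading in assigned:
            index.setdefault(heading, []).append(recipe)
    return dict(sorted(index.items()))

def boolean_and(descriptors, *terms):
    """Recipes whose descriptor set contains every requested term."""
    wanted = set(terms)
    return sorted(r for r, assigned in descriptors.items() if wanted <= assigned)
```

`heading_index(HEADINGS)` supports browsing by the full heading only, while `boolean_and(DESCRIPTORS, "Pasta", "Vegetarian")` can express a combination (here matching only `r1`) that no single heading in this toy scheme anticipated.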

Thesauri

• A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among Synonymous, Equivalent, Broader, Narrower and other Related Terms
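
A thesaurus of this kind can be sketched as a small data structure: preferred terms carrying broader (BT), narrower (NT), related (RT) and used-for (UF) links, plus a lead-in table that redirects non-preferred words. The class and method names are hypothetical, and the sample relationships in the usage note are invented rather than taken from any published thesaurus.

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    name: str
    used_for: set = field(default_factory=set)   # UF: lead-in synonyms
    broader: set = field(default_factory=set)    # BT
    narrower: set = field(default_factory=set)   # NT
    related: set = field(default_factory=set)    # RT

class Thesaurus:
    def __init__(self):
        self.terms = {}   # preferred term -> Term
        self.use = {}     # lead-in term -> preferred term (USE reference)

    def add(self, name):
        return self.terms.setdefault(name, Term(name))

    def add_synonym(self, lead_in, preferred):
        self.add(preferred).used_for.add(lead_in)
        self.use[lead_in] = preferred

    def add_broader(self, narrow, broad):
        # BT and NT are reciprocal, so both directions are recorded.
        self.add(narrow).broader.add(broad)
        self.add(broad).narrower.add(narrow)

    def add_related(self, a, b):
        self.add(a).related.add(b)
        self.add(b).related.add(a)

    def preferred(self, word):
        """Return the preferred term for a word, or None if unknown."""
        if word in self.terms:
            return word
        return self.use.get(word)

    def expand(self, word):
        """Preferred term plus every transitively narrower term."""
        start = self.preferred(word)
        if start is None:
            return set()
        seen, stack = set(), [start]
        while stack:
            term = stack.pop()
            if term not in seen:
                seen.add(term)
                stack.extend(self.terms[term].narrower)
        return seen
```

For example, after `t = Thesaurus()`, `t.add_synonym("Sports", "Athletics")`, `t.add_broader("Team Sports", "Athletics")` and `t.add_broader("Basketball", "Team Sports")`, a search on the lead-in word "Sports" can be expanded with `t.expand("Sports")` to the preferred term and both narrower terms.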

Thesauri (cont.)

• National and International Standards for Thesauri

– ANSI/NISO Z39.19-1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri

– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval

– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri

– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri

Thesauri (cont.)

• Examples:

– The ERIC Thesaurus of Descriptors

– The Art and Architecture Thesaurus

– The Medical Subject Headings (MeSH) of the National Library of Medicine

Development of a Thesaurus

• Term Selection.

• Merging and Development of Concept Classes.

• Definition of Broad Subject Fields and Subfields.

• Development of Classificatory structure

• Review, Testing, Application, Revision.

Categorization Summary

• Processes of categorization underlie many of the issues having to do with information organization

• Categorization is messier than our computer systems would like

• Human categories have graded membership, consisting of family resemblances.

• Family resemblance is expressed in part by which subset of features are shared

• It is also determined by underlying understandings of the world that do not get represented in most systems

Classification Systems

• A classification system is an indexing language often based on a broad ordering of topical areas. Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics. Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms.

Classification Systems (cont.)

• Examples:

– The Library of Congress Classification System

– The Dewey Decimal Classification System

– The ACM Computing Reviews Categories

– The American Mathematical Society Classification System

Classification Schemes

• Classify possible concepts.

• Goals:

– Completely distinct conceptual categories (mutually exclusive)

– Complete coverage of conceptual categories (exhaustive)

Hierarchical Classification

• Traditional “family-tree”

– Each category is successively broken down into smaller and smaller subdivisions

– Each level divided out by a “character of division”. Also known as a feature.

• Example: distinguish Literature based on:

– Language

– Genre

– Time Period

Hierarchical Classification

[Figure: tree diagram. Literature divides into English, French, Spanish, ...; each language divides into Prose, Poetry, Drama, ...; each genre divides into 16th, 17th, 18th, 19th, ...]

Labeled Categories for Hierarchical Classification

• LITERATURE

– 100 English Literature

  • 110 English Prose

    – English Prose 16th Century

    – English Prose 17th Century

    – English Prose 18th Century

    – ...

  • 111 English Poetry

    – 121 English Poetry 16th Century

    – 122 English Poetry 17th Century

    – ...

  • 112 English Drama

    – 130 English Drama 16th Century

    – …

– 200 French Literature

Faceted Classification

• Create a separate, free-standing list for each characteristic of division (feature).

• Combine features to create a classification.

Faceted Classification and Labeled Categories

• A Language

– a English

– b French

– c Spanish

• B Genre

– a Prose

– b Poetry

– c Drama

• C Period

– a 16th Century

– b 17th Century

– c 18th Century

– d 19th Century

• Aa English Literature

• AaBa English Prose

• AaBaCa English Prose 16th Century

• AbBbCd French Poetry 19th Century

• BcCd Drama 19th Century
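
The facet lists on this slide can be combined mechanically. The sketch below assumes that a notation is a sequence of two-character pairs (facet letter, then value letter), which matches the examples shown; the function names `decode` and `encode` are hypothetical.

```python
# Facet tables transcribed from the slide: facet letter -> (name, values).
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}

def decode(code):
    """Turn a notation such as 'AaBaCa' into a readable label."""
    if len(code) % 2:
        raise ValueError("notation must be facet/value letter pairs")
    labels = []
    for i in range(0, len(code), 2):
        facet, value = code[i], code[i + 1]
        labels.append(FACETS[facet][1][value])
    return " ".join(labels)

def encode(**chosen):
    """Build a notation from facet names, e.g. encode(Genre='Poetry')."""
    code = ""
    for letter, (name, values) in FACETS.items():
        if name in chosen:
            by_label = {label: key for key, label in values.items()}
            code += letter + by_label[chosen[name]]
    return code
```

Because each facet is a free-standing list, any subset can be combined: `decode("AaBaCa")` gives "English Prose 16th Century", and `encode(Language="French", Genre="Poetry", Period="19th Century")` gives "AbBbCd". Three short lists cover the 3 x 3 x 4 combinations that a hierarchical scheme would have to enumerate one by one.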

How to use such classification structures?

• How to look through them?

• How to use them in search?

Automatic Indexing and Classification

• Automatic indexing is typically the simple deriving of keywords from a document and providing access to all of those words.

• More complex Automatic Indexing Systems attempt to select controlled vocabulary terms based on terms in the document.

• Automatic classification attempts to automatically group similar documents using either:

– A fully automatic clustering method.

– An established classification scheme and set of documents already indexed by that scheme.
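
The simplest form of automatic indexing described above (derive keywords, give access through all of them) amounts to building an inverted index. The sketch below is a minimal illustration; the tokenizer, the tiny stopword list and the sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def keywords(text):
    """Derive index keywords: lowercase words minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def build_index(docs):
    """Inverted index: keyword -> set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in keywords(text):
            index[word].add(doc_id)
    return index

# Hypothetical three-document collection.
DOCS = {
    1: "The organization of information in collections",
    2: "Subject access to collections",
    3: "Indexing and classification of documents",
}
index = build_index(DOCS)
```

Looking up `index["collections"]` returns documents 1 and 2. Nothing here controls vocabulary: "classification" and "organization" stay unrelated unless a thesaurus or classification scheme links them.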

Agglomerative Clustering

[Figure: items A, B, C, D, E, F, G, H, I merged step by step into larger clusters]

Hierarchical Methods

Single Link Dissimilarity Matrix

2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
     1    2    3    4

Hierarchical methods: Polythetic, Usually Exclusive, Ordered. Clusters are order-independent.

[Equation: dissimilarity between A and B, of the form 1 − (ratio of set sizes involving A and B); set operators not recoverable]

Threshold = .1

Single Link Dissimilarity Matrix

2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
     1    2    3    4

2    0
3    0    0
4    0    0    0
5    1    0    0    1
     1    2    3    4

[Figure: graph on documents 1-5 with links 1-5 and 4-5]

Threshold = .2

2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
     1    2    3    4

2    0
3    0    1
4    0    0    0
5    1    0    0    1
     1    2    3    4

[Figure: graph on documents 1-5 with links 1-5, 4-5 and 2-3]

Threshold = .3

2   .4
3   .4   .2
4   .3   .3   .3
5   .1   .4   .4   .1
     1    2    3    4

2    0
3    0    1
4    1    1    1
5    1    0    0    1
     1    2    3    4

[Figure: graph on documents 1-5 with links 1-5, 4-5, 2-3, 1-4, 2-4 and 3-4]
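
The three threshold slides can be reproduced with a short sketch. It assumes the matrix is read as a lower triangle (rows 2-5, columns 1-4), that a pair is linked when its dissimilarity is at or below the threshold, and that single-link clusters are the connected components of the resulting graph. The union-find helper is an implementation choice, not something shown on the slides.

```python
# Dissimilarities keyed by (row, column), as read from the slide matrix.
D = {
    (2, 1): .4,
    (3, 1): .4, (3, 2): .2,
    (4, 1): .3, (4, 2): .3, (4, 3): .3,
    (5, 1): .1, (5, 2): .4, (5, 3): .4, (5, 4): .1,
}

def clusters_at(threshold, docs=(1, 2, 3, 4, 5)):
    """Single-link clusters: connected components of pairs with d <= threshold."""
    parent = {d: d for d in docs}

    def find(x):
        # Follow parent pointers to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in D.items():
        if dist <= threshold:
            parent[find(a)] = find(b)   # link the two clusters

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())
```

Under these assumptions `clusters_at(.1)` yields {1, 4, 5} with 2 and 3 on their own, `clusters_at(.2)` adds the pair {2, 3}, and `clusters_at(.3)` merges everything into one cluster, giving the nested hierarchy that single-link clustering produces.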

Clustering

Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered. Clusters are order-dependent.

[Figure: a set of documents (Doc)]

1. Select initial centers (i.e. seed the space)

2. Assign docs to highest matching centers and compute centroids

3. Reassign all documents to centroid(s)

Rocchio’s method

Automatic Class Assignment

[Figure: documents (Doc) and a Search Engine]

1. Create pseudo-documents representing intellectually derived classes.

2. Search using document contents

3. Obtain ranked list

4. Assign document to N categories ranked over threshold, OR assign to top-ranked category

Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered. Clusters are order-independent, usually based on an intellectually derived scheme
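
A minimal sketch of these four steps follows. It assumes a bag-of-words representation and cosine similarity as the "search engine" ranking, neither of which is specified on the slide; the class names and pseudo-document texts are invented.

```python
import math
import re
from collections import Counter

# Step 1: hypothetical pseudo-documents, one per intellectually derived class.
CLASSES = {
    "Astronomy": "star galaxy planet telescope orbit constellation",
    "Film": "star actor film movie director television",
    "Sports": "team player game score athlete star",
}

def vector(text):
    """Bag-of-words term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rank_classes(text):
    """Steps 2-3: use the document as a query and rank the classes."""
    doc = vector(text)
    scored = [(cosine(doc, vector(pseudo)), name) for name, pseudo in CLASSES.items()]
    return sorted(scored, reverse=True)

def assign(text, n=1, threshold=None):
    """Step 4: top-ranked class, or up to n classes scoring above threshold."""
    ranked = rank_classes(text)
    if threshold is None:
        return [ranked[0][1]]
    return [name for score, name in ranked[:n] if score > threshold]
```

With `threshold=None` the assignment is exclusive (one class per document); with a threshold and `n > 1` it becomes overlapping, which corresponds to the "Exclusive or Overlapping" property noted above.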

K-Means Clustering

• 1 Create a pair-wise similarity measure

• 2 Find K centers using agglomerative clustering

– take a small sample

– group bottom up until K groups found

• 3 Assign each document to nearest center, forming new clusters

• 4 Repeat 3 as necessary
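
The four steps can be sketched as follows. Several simplifications are assumptions made for the example: documents are stood in for by 2-D points, Euclidean distance plays the role of the pair-wise measure, the "small sample" is simply the first few points (a real system would sample randomly), and groups are merged by closest centroids.

```python
import math

def centroid(points):
    """Mean point of a non-empty group."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def seed_centers(sample, k):
    """Step 2: merge the two closest groups bottom-up until k groups remain."""
    groups = [[p] for p in sample]
    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups)))
        i, j = min(pairs, key=lambda ij: math.dist(centroid(groups[ij[0]]),
                                                   centroid(groups[ij[1]])))
        merged = groups.pop(j)          # j > i, so index i stays valid
        groups[i].extend(merged)
    return [centroid(g) for g in groups]

def k_means(points, k, sample_size=4, max_rounds=10):
    centers = seed_centers(points[:sample_size], k)
    for _ in range(max_rounds):
        # Step 3: assign each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Step 4: recompute centers and repeat until they stop moving.
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return clusters, centers

# Hypothetical data: two well-separated groups of three points each.
POINTS = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
clusters, centers = k_means(POINTS, k=2)
```

On this toy data the seeding already separates the two groups, so the loop settles after a second pass. `math.dist` requires Python 3.8 or later.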

Scatter/Gather

Cutting, Pedersen, Tukey & Karger 92, 93

Hearst & Pedersen 95

• Cluster sets of documents into general “themes”, like a table of contents

• Display the contents of the clusters by showing topical terms and typical titles

• User chooses subsets of the clusters and re-clusters the documents within

• Resulting new groups have different “themes”

S/G Example: query on “star”

Encyclopedia text

14 sports

8 symbols

47 film, tv

68 film, tv (p)

7 music

97 astrophysics

67 astronomy (p)

12 stellar phenomena

10 flora/fauna

49 galaxies, stars

29 constellations

7 miscellaneous

Clustering and re-clustering is entirely automated
