Transcript

www.fit.qut.edu.au

Queensland University of Technology

Faculty of Information Technology

Michael Middleton

1 CRICOS No. 00213J

Controlled vocabularies: Thesauri and information retrieval

Michael Middleton

QUT School of Information Systems, Brisbane, [email protected]

for STIMULATE 5

Vrije Universiteit Brussel

Brussels, Belgium July, 2005


Introduction

• Context ….. History

• Vocabulary principles

• Thesaurus software

• Thesaurus building …. application

• Thesaurus evaluation

• The future


Context: Information life cycle

Organise to maintain

create

distribute

use

maintain

recall

reuse

store

dispose


Context: Information management

Domains

• Operational

• Analytical

• Strategic


Context: indexing

• Producing representations of records or documents that constitute a finding aid to the records in a database or to part of a document

– Assigned indexing

– Derived indexing
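The difference between the two approaches can be sketched in a few lines. This is only an illustration: the stopword list and the vocabulary mapping (`VOCAB`) are hypothetical, and real assigned indexing is an intellectual task rather than a lookup.

```python
import re

STOPWORDS = {"of", "the", "in", "a", "and"}  # hypothetical, minimal list

# Hypothetical mapping from words in the text to controlled terms.
VOCAB = {
    "graduates": "COLLEGE GRADUATES",
    "library": "LIBRARIES",
    "skills": "JOB SKILLS",
}

def derived_index(text):
    """Derived indexing: index terms are taken from the text itself."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({w for w in words if w not in STOPWORDS})

def assigned_index(text):
    """Assigned indexing (mimicked): terms come from a controlled vocabulary."""
    return sorted({VOCAB[w] for w in derived_index(text) if w in VOCAB})

title = "Skills expectations of library graduates"
derived = derived_index(title)
assigned = assigned_index(title)
print(derived)
print(assigned)
```

Derived terms stay in the author's words; assigned terms may use words that never appear in the document.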


Indexer qualities

• The ‘Art’ of assigned indexing:

– Empathy

– Meticulousness

– Consistency

– General knowledge

– Patience


Indexing guidelines

• Conceptual analysis and assigning

• Aboutness

• Elements of the document to consider

• Exhaustivity

• Specificity

• Index what is in the item

• Co-ordination


Assigned index representations

• Alphabetical Subject

• Classified

– Alphabetical

– Notation

• Chain


Indexing exercise

How consistent is database indexing?

Example: the same paper in multiple databases:

Middleton, M Skills expectations of library graduates http://eprints.qut.edu.au/archive/00000094/

1. Index it yourself

2. Compare your indexing with others

3. Compare the indexing in ERIC and INSPEC
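One way to put a number on the comparison is Hooper's consistency measure: the terms two indexers share, divided by all distinct terms either of them used. The sketch below assumes simple case-insensitive matching, and the term sets are hypothetical, not the actual ERIC or INSPEC records.

```python
def hooper_consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms used by either indexer."""
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    if not a and not b:
        return 1.0  # two empty indexings are trivially identical
    return len(a & b) / len(a | b)

# Hypothetical term sets for the same paper.
indexer_1 = ["Librarians", "Job skills", "Graduates", "Employment"]
indexer_2 = ["Job skills", "Library education", "Graduates"]

score = hooper_consistency(indexer_1, indexer_2)
print(f"{score:.2f}")
```

A score of 1.0 means identical indexing; 0.0 means no terms in common.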


Context: metadata

• Agent

– Document description

– Responsibility

– Administrative

– Provenance

– Connections

– Conditions of use


Context: metadata

• Content

– Topic (application of vocabulary control)

– Coverage

– Role


Controlled vocabulary

• Thesaurus

– A controlled vocabulary of terms in natural language that are designed for post-coordination

• Classification scheme

– A scheme for organisation by categories in a systematic manner; this may involve grouping by subject, function or other criteria, or determining document naming conventions

– Often involves notation
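Post-coordination means that terms are assigned to documents separately and only combined at search time. A minimal sketch, assuming a simple in-memory postings file; the record identifiers and term assignments are hypothetical.

```python
from collections import defaultdict

def build_postings(records):
    """Map each index term to the set of documents it was assigned to."""
    postings = defaultdict(set)
    for doc_id, terms in records.items():
        for term in terms:
            postings[term].add(doc_id)
    return postings

def search_all(postings, *terms):
    """Post-coordinate: AND together the postings of separately assigned terms."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

# Hypothetical records indexed with single-concept terms.
records = {
    "d1": {"CAMERAS", "DIVING"},
    "d2": {"CAMERAS", "CINEMA"},
    "d3": {"DIVING"},
}

postings = build_postings(records)
hits = search_all(postings, "CAMERAS", "DIVING")
print(hits)
```

The searcher, not the indexer, decides which concepts to combine.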


Purpose

• Indexing by translating diverse natural language to consistent terminology

• Establishing relationships among terms

• Information retrieval improving precision and recall


History

• Bibliographic databases

– Many applications; a list of associated online thesauri and classification schemes is at http://sky.fit.qut.edu.au/~middletm/cont_voc.html

• Standards

– ISO 2788; ISO 5964

– ANSI Z39.19


Thesaurus principles

• Term relationships

• Continuing evolution

• Internally consistent hierarchies to support database searching
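One basic consistency check is that following broader-term links from any term always reaches a top term and never loops. The sketch below assumes each term has at most one broader term (a simplification; thesauri may allow several) and borrows terms from the camera example used later in these slides.

```python
def broader_chain(bt, term):
    """Follow BT links upward from term; raise if the hierarchy loops."""
    chain = [term]
    while chain[-1] in bt:
        parent = bt[chain[-1]]
        if parent in chain:
            path = " > ".join(chain + [parent])
            raise ValueError(f"cycle in hierarchy: {path}")
        chain.append(parent)
    return chain

# term -> its broader term (single BT per term is an assumption here)
bt = {
    "35 mm CAMERAS": "MINIATURE CAMERAS",
    "MINIATURE CAMERAS": "STILL CAMERAS",
    "STILL CAMERAS": "CAMERAS",
    "CAMERAS": "OPTICAL EQUIPMENT",
}

chain = broader_chain(bt, "35 mm CAMERAS")
print(" > ".join(chain))
```

The last element of the chain is the top term for the starting term.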


The Thesaurus

• The vocabulary of a controlled indexing language formally organised so that the a priori relationships between concepts are made explicit.

• A thesaurus is an example of metadata


Thesaurus extract (ISO sample)

35 mm CAMERAS
  BT MINIATURE CAMERAS

CAMERAS
  BT OPTICAL EQUIPMENT
  NT MOVING PICTURE CAMERAS
     STEREO CAMERAS
     STILL CAMERAS
     UNDERWATER CAMERAS
  RT PHOTOGRAPHY

CINE CAMERAS
  BT MOVING PICTURE CAMERAS
  NT UNDERWATER CINE CAMERAS
  RT CINEMA

CINEMA
  RT CINE CAMERAS

DIVING
  RT UNDERWATER CAMERAS

INSTANT PICTURE CAMERAS
  SN Cameras which produce a finished print directly
  BT STILL CAMERAS

Land cameras USE VIEW CAMERAS

MICROSCOPES
  BT OPTICAL EQUIPMENT

MINIATURE CAMERAS
  BT STILL CAMERAS
  NT 35 mm CAMERAS

MOVING PICTURE CAMERAS
  BT CAMERAS
  NT CINE CAMERAS
     TELEVISION CAMERAS

OPTICAL EQUIPMENT
  NT CAMERAS
     MICROSCOPES

PHOTOGRAPHY
  RT CAMERAS
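An extract like this maps naturally onto a dictionary of entries, which makes it easy to check that every link has its reciprocal (BT against NT, RT against RT). The sketch below holds only a subset of the entries above, and the dictionary layout is an assumption for illustration, not a standard exchange format.

```python
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}

# A subset of the extract; each entry lists its outgoing links by type.
thesaurus = {
    "OPTICAL EQUIPMENT": {"NT": ["CAMERAS", "MICROSCOPES"]},
    "CAMERAS": {"BT": ["OPTICAL EQUIPMENT"], "RT": ["PHOTOGRAPHY"]},
    "MICROSCOPES": {"BT": ["OPTICAL EQUIPMENT"]},
    "PHOTOGRAPHY": {"RT": ["CAMERAS"]},
    "DIVING": {"RT": ["UNDERWATER CAMERAS"]},
}

def missing_reciprocals(thesaurus):
    """Return (term, relation, target) links that should exist but do not."""
    problems = []
    for term, rels in thesaurus.items():
        for rel, targets in rels.items():
            for target in targets:
                back = thesaurus.get(target, {}).get(INVERSE[rel], [])
                if term not in back:
                    problems.append((target, INVERSE[rel], term))
    return sorted(problems)

problems = missing_reciprocals(thesaurus)
for target, rel, term in problems:
    print(f"{target} is missing {rel} {term}")
```

Here the only gap reported is for UNDERWATER CAMERAS, which has no entry of its own in this subset.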


Standardising the Vocabulary

• Types of entities & forms of terms

• Singular vs plural

• Homonyms

• Choice of terms

• Scope notes and history notes


Compound terms

• Terms should be factored into simpler elements to improve user’s understanding.

• Semantic factoring

• Syntactic factoring


Semantic Relationships

• Equivalence

– Establishing relationships between preferred (postable) and non-preferred (non-postable) terms

• Hierarchical

– Establishing relationships between subordinate and superordinate terms. These may be distinguished as generic, whole-part or instance relationships

• Associative

– Establishing relationships between terms that are mentally associated, but not equivalent or hierarchical
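At search time, equivalence and hierarchical relationships are typically used in two steps: map the searcher's word to the preferred term, then optionally add its narrower terms. A minimal sketch using links from the ISO sample extract; the function names are hypothetical.

```python
# Equivalence: non-preferred entry term -> preferred term
USE = {"land cameras": "VIEW CAMERAS"}

# Hierarchical: preferred term -> narrower terms
NT = {
    "MOVING PICTURE CAMERAS": ["CINE CAMERAS", "TELEVISION CAMERAS"],
    "CINE CAMERAS": ["UNDERWATER CINE CAMERAS"],
}

def preferred(term):
    """Replace a non-preferred term by its preferred equivalent."""
    return USE.get(term.lower(), term.upper())

def expand(term):
    """Return the preferred term followed by all its narrower terms."""
    seen, stack = [], [preferred(term)]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            # reversed() keeps narrower terms in their listed order
            stack.extend(reversed(NT.get(current, [])))
    return seen

expanded = expand("moving picture cameras")
print(preferred("Land cameras"))
print(expanded)
```

Associative (RT) links are left out here because they usually serve as suggestions to the searcher rather than automatic expansions.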


… but, the Functions thesaurus

Whereas

• agenda papers might have

– broader term documents

In a functions thesaurus

• agenda papers might have

– broader term meetings


Applying a functional thesaurus

Top Term

• PERSONNEL

Scope Notes

The function of managing all employees ……

Related Terms

• COMPENSATION

• ESTABLISHMENT

• INDUSTRIAL RELATIONS

etc, etc

Narrower Terms

• ALLOWANCES

• APPEALS (Decisions)

• APPOINTMENT

• ARRANGEMENTS

• AUTHORISATION

• COMMITTEES

• COMPLIANCE

etc, etc

Use For Terms

• Employees

• Public Servants

• Staff


Thesaurus Display

• Alphabetical hierarchies

– One level above and below entry term

– Complete hierarchy for each term or separate TT display

• Permuted term lists

• Combination with classification notation

• Graphic Displays
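A permuted term list gives every word of a multi-word term its own access point, so that a term can be found under any of its words. A minimal sketch, assuming whitespace-separated words and terms taken from the ISO sample extract:

```python
def permuted_index(terms):
    """Return sorted (access word, full term) pairs, one per word of each term."""
    entries = []
    for term in terms:
        for word in term.split():
            entries.append((word.upper(), term))
    return sorted(entries)

terms = ["UNDERWATER CINE CAMERAS", "CINE CAMERAS", "CINEMA"]

index = permuted_index(terms)
for word, term in index:
    print(f"{word:<12}{term}")
```

UNDERWATER CINE CAMERAS is then listed under CAMERAS, CINE and UNDERWATER.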


Applying a thesaurus

Download Term Tree (free trial) from http://www.termtree.com.au


Thesaurus software

• Assigned

• Integrated database

• Deriving terminology


Thesaurus software - assigned

Terms are assigned by vocabulary specialists in an independent database

• a.k.a.™ – Synercon Management Consulting

• MultiTes

• OpenCyc

• SuperTHES

– from THESmain/THESshow for mono-/multilingual thesauri

• Term Tree 2000

• WebChoir

• Wordmap


Thesaurus software – integrated database

Terms are assigned by specialists; the thesaurus works like an active data dictionary to control the database

• BASIS

• InMagic Bibliotech PRO

• BRS/Search

• STAR


Thesaurus software for deriving terminology

Terms are created automatically from text

• Entrieva – SemioTagger™, SemioMap™ and SemioSkyline™ for viewing

• Intology – taxonomy builder

• Verity – Thematic Mapping

• Autonomy – taxonomy generation & categorization


Thesaurus Building - 1

• Users

– Define

– Identify needs

– Define Thesaurus range & depth

• Raw vocabulary building

– Identify sources

– Collect and record terms


Thesaurus Building - 2

• Vocabulary organisation

– Cluster terms

– Establish relationships using symbols

• Maintenance


Business application

• Not long term collaborative efforts of classification specialists

– Instead, adapt to business changes

• Not just descriptions of present business processes

– Instead, reflect strategic planning, competitors

• Not necessarily a single taxonomy

– Instead, multiple overlapping taxonomies


Content management

• Describe content as it’s being created rather than classify after creation

• User-needs orientation


Integrating taxonomies

• Accurate reporting

• Exchange of data

• Assist resource discovery

– Information retrieval


Thesaurus evaluation

• Qualities

• Information retrieval evaluation


Thesaurus Qualities

• Scope and features description

• Display forms

• Correctness of hierarchies

• Use of scope, history and qualification

• Adherence to standards

• Syndetic measures

– Connectedness

– Accessibility
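The two syndetic measures can be computed from the cross-reference structure alone. The sketch below assumes one common reading: connectedness is the share of terms that take part in at least one link, and accessibility is the average number of other terms that lead to a given term. The definitions used in a particular evaluation may differ, and the sample links are a hypothetical subset.

```python
def syndetic_measures(links, vocabulary):
    """links maps a term to the set of terms it points to (BT, NT or RT)."""
    incoming = {term: 0 for term in vocabulary}
    connected = set()
    for source, targets in links.items():
        for target in targets:
            if target in incoming and target != source:
                incoming[target] += 1
                connected.update((source, target))
    connectedness = len(connected) / len(vocabulary)
    accessibility = sum(incoming.values()) / len(vocabulary)
    return connectedness, accessibility

vocabulary = ["CAMERAS", "PHOTOGRAPHY", "OPTICAL EQUIPMENT", "DIVING"]
links = {
    "CAMERAS": {"OPTICAL EQUIPMENT", "PHOTOGRAPHY"},
    "PHOTOGRAPHY": {"CAMERAS"},
    "OPTICAL EQUIPMENT": {"CAMERAS"},
}

connectedness, accessibility = syndetic_measures(links, vocabulary)
print(connectedness, accessibility)
```

In this sample DIVING has no links in either direction, which lowers the connectedness figure.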


Thesauri & Retrieval evaluation

• Cranfield experiments & since

• Recall and precision

• Influence on indexing

– Conceptual analysis

– Translation failure

– Omissions

– Exhaustivity/Specificity

– Syntax and ‘false drops’

• Maintenance costs
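Recall and precision are both ratios over the set of relevant documents that a search retrieves. A minimal sketch with hypothetical document identifiers and relevance judgements:

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search result and relevance judgements.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The retrieved items that are not relevant (d2 and d4 here) are the 'false drops' that lower precision.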


Post-controlled vocabularies

• Use of a ‘Hedge’ of terms to represent a broad concept, e.g.:

– ‘psychological aspects of..........’

– ‘........in Australia’

– ‘....review items on.....’
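In practice a hedge is a stored list of search terms that are ORed together and applied to uncontrolled text at search time. A minimal sketch; the contents of the hedge are hypothetical and far shorter than a real one would be.

```python
import re

# Hypothetical hedge: any of these words signals the broad concept.
HEDGES = {
    "in Australia": ["australia", "australian", "queensland", "brisbane", "sydney"],
}

def matches_hedge(text, hedge_name):
    """True if any term in the named hedge occurs as a word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return any(term in words for term in HEDGES[hedge_name])

hit = matches_hedge("Skills expectations of library graduates in Brisbane", "in Australia")
miss = matches_hedge("Thesaurus use in Brussels", "in Australia")
print(hit, miss)
```

The vocabulary control is applied when searching rather than when indexing, which is why the approach is called post-controlled.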


Still to come ……

Research areas

• Metathesauri

– Super – interlinked vocabularies (e.g. NLM)

• Semantic Web

– Enhancing word association with usage statistics like links (e.g. THESUS)


Review

• Controlled vocabulary types

• Software support

• Business processes

• Website – http://sky.fit.qut.edu.au/~middletm/cont_voc.html

– (about to move to database driven site – redirection will be applied)


Questions?

