www.fit.qut.edu.au Queensland University of Technology Faculty of Information Technology Michael Middleton 1 CRICOS No. 00213J Controlled vocabularies: Thesauri and information retrieval Michael Middleton QUT School of Information Systems, Brisbane, Australia [email protected] for STIMULATE 5 Vrije Universiteit Brussel Brussels, Belgium July, 2005

www.fit.qut.edu.au

Queensland University of Technology

Faculty of Information Technology

Michael Middleton

1 CRICOS No. 00213J

Controlled vocabularies: Thesauri and information retrieval

Michael Middleton

QUT School of Information Systems, Brisbane, Australia [email protected]

for STIMULATE 5

Vrije Universiteit Brussel

Brussels, Belgium July, 2005

Introduction

• Context ….. History

• Vocabulary principles

• Thesaurus software

• Thesaurus building …. application

• Thesaurus evaluation

• The future

Context: Information life cycle

Organise to maintain

create

distribute

use

maintain

recall

reuse

store

dispose

Context: Information management

Domains

• Operational

• Analytical

• Strategic

Context: indexing

• Producing representations of records or documents that constitute a finding aid to the records in a database or to part of a document

– Assigned indexing

– Derived indexing

Indexer qualities

• The ‘Art’ of assigned indexing:

– Empathy

– Meticulousness

– Consistency

– General knowledge

– Patience

Indexing guidelines

• Conceptual analysis and assigning

• Aboutness

• Elements of the document to consider

• Exhaustivity

• Specificity

• Index what is in the item

• Co-ordination

Assigned index representations

• Alphabetical Subject

• Classified

– Alphabetical

– Notation

• Chain

Indexing exercise

How consistent is database indexing?

Example: the same paper in multiple databases:

Middleton, M Skills expectations of library graduates http://eprints.qut.edu.au/archive/00000094/

1. Index it yourself

2. Compare your indexing with others

3. Compare the indexing in ERIC and INSPEC
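
To put a number on step 2 or 3, one option is to count how many index terms two indexers share relative to all terms either of them used. The sketch below does that with set overlap (intersection over union). The function name and the two term lists are hypothetical illustrations, not the actual ERIC or INSPEC records for the paper.

```python
def indexing_consistency(terms_a, terms_b):
    """Share of index terms two indexers agree on (intersection over union)."""
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return len(a & b) / len(a | b)


# Hypothetical term sets for the same paper, for illustration only.
indexer_1 = ["Librarians", "Job skills", "Graduates", "Employment"]
indexer_2 = ["Librarians", "Graduates", "Library education"]

print(round(indexing_consistency(indexer_1, indexer_2), 2))
```

A value near 1 means the two sets of terms largely coincide; a value near 0 means the indexers described the same item in mostly different terms.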

Context: metadata

• Agent

– Document description

– Responsibility

– Administrative

– Provenance

– Connections

– Conditions of use

Context: metadata

• Content

– Topic (application of vocabulary control)

– Coverage

– Role

Controlled vocabulary

• Thesaurus

– A controlled vocabulary of terms in natural language that are designed for post-coordination

• Classification scheme

– A scheme for organisation by categories in a systematic manner; this may involve grouping by subject, function or other criteria, or determining document naming conventions

– Often involves notation

Purpose

• Indexing by translating diverse natural language to consistent terminology

• Establishing relationships among terms

• Information retrieval: improving precision and recall

History

• Bibliographic databases

– Many applications, list of online associated thesauri and classification schemes at http://sky.fit.qut.edu.au/~middletm/cont_voc.html

• Standards

– ISO 2788; ISO 5964

– ANSI Z39.19

Thesaurus principles

• Term relationships

• Continuing evolution

• Internally consistent hierarchies to support database searching

The Thesaurus

• The vocabulary of a controlled indexing language formally organised so that the a priori relationships between concepts are made explicit.

• A thesaurus is an example of metadata

Thesaurus extract (ISO sample)

35 mm CAMERAS
    BT MINIATURE CAMERAS

CAMERAS
    BT OPTICAL EQUIPMENT
    NT MOVING PICTURE CAMERAS
       STEREO CAMERAS
       STILL CAMERAS
       UNDERWATER CAMERAS
    RT PHOTOGRAPHY

CINE CAMERAS
    BT MOVING PICTURE CAMERAS
    NT UNDERWATER CINE CAMERAS
    RT CINEMA

CINEMA
    RT CINE CAMERAS

DIVING
    RT UNDERWATER CAMERAS

INSTANT PICTURE CAMERAS
    SN Cameras which produce a finished print directly
    BT STILL CAMERAS

Land cameras USE VIEW CAMERAS

MICROSCOPES
    BT OPTICAL EQUIPMENT

MINIATURE CAMERAS
    BT STILL CAMERAS
    NT 35 mm CAMERAS

MOVING PICTURE CAMERAS
    BT CAMERAS
    NT CINE CAMERAS
       TELEVISION CAMERAS

OPTICAL EQUIPMENT
    NT CAMERAS
       MICROSCOPES

PHOTOGRAPHY
    RT CAMERAS
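
The extract shows that relationships come in reciprocal pairs: a BT on one entry matches an NT on the other, and an RT points both ways. A minimal sketch of checking that property is given below. It holds a few of the entries above in a plain dictionary; the dictionary layout and the function name `missing_reciprocals` are assumptions made for illustration, not part of any thesaurus standard or package.

```python
# Part of the ISO sample extract, keyed by preferred term.
# BT = broader term, NT = narrower term, RT = related term.
THESAURUS = {
    "OPTICAL EQUIPMENT": {"NT": ["CAMERAS", "MICROSCOPES"]},
    "MICROSCOPES": {"BT": ["OPTICAL EQUIPMENT"]},
    "CAMERAS": {
        "BT": ["OPTICAL EQUIPMENT"],
        "NT": ["MOVING PICTURE CAMERAS", "STEREO CAMERAS",
               "STILL CAMERAS", "UNDERWATER CAMERAS"],
        "RT": ["PHOTOGRAPHY"],
    },
    "PHOTOGRAPHY": {"RT": ["CAMERAS"]},
    "MOVING PICTURE CAMERAS": {
        "BT": ["CAMERAS"],
        "NT": ["CINE CAMERAS", "TELEVISION CAMERAS"],
    },
}

# Each relationship tag and the tag expected on the entry it points to.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


def missing_reciprocals(thesaurus):
    """Return (term, tag, target) triples that the thesaurus lacks."""
    missing = []
    for term, relations in thesaurus.items():
        for tag, targets in relations.items():
            for target in targets:
                back_tag = RECIPROCAL[tag]
                if term not in thesaurus.get(target, {}).get(back_tag, []):
                    missing.append((target, back_tag, term))
    return sorted(missing)


for entry in missing_reciprocals(THESAURUS):
    print(entry)
```

The triples reported here are entries that this abbreviated dictionary leaves out (for example, STEREO CAMERAS has no entry of its own), which is the kind of gap thesaurus software is expected to flag or fill automatically.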

Standardising the Vocabulary

• Types of entities & forms of terms

• Singular vs plural

• Homonyms

• Choice of terms

• Scope notes and history notes

Compound terms

• Terms should be factored into simpler elements to improve user’s understanding.

• Semantic factoring

• Syntactic factoring

Semantic Relationships

• Equivalence

– Establishing relationships between preferred (postable) and non-preferred (non-postable) terms

• Hierarchical

– Establishing relationships between subordinate and superordinate terms. These may be distinguished as:

• Generic

• Whole-part

• Instance

• Associative

– Establishing relationships between terms that are mentally associated, but not equivalent or hierarchical
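
How the equivalence and hierarchical relationships might be used at search time can be sketched as follows: a non-preferred entry term is swapped for its preferred term, and a search on a broad term is widened to its narrower terms. The terms come from the ISO sample extract; the names `USE`, `NT`, `preferred` and `expand` are hypothetical and chosen only for this illustration.

```python
# Equivalence: non-preferred (non-postable) term -> preferred (postable) term.
USE = {"land cameras": "VIEW CAMERAS"}

# Hierarchical: preferred term -> its narrower terms.
NT = {
    "OPTICAL EQUIPMENT": ["CAMERAS", "MICROSCOPES"],
    "CAMERAS": ["MOVING PICTURE CAMERAS", "STEREO CAMERAS",
                "STILL CAMERAS", "UNDERWATER CAMERAS"],
    "MOVING PICTURE CAMERAS": ["CINE CAMERAS", "TELEVISION CAMERAS"],
}


def preferred(term):
    """Map an entry term to its preferred form."""
    cleaned = term.strip()
    return USE.get(cleaned.lower(), cleaned.upper())


def expand(term):
    """Return the preferred term followed by all of its narrower terms."""
    seen, stack = [], [preferred(term)]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            # Reverse so narrower terms are visited in their listed order.
            stack.extend(reversed(NT.get(current, [])))
    return seen


print(preferred("Land cameras"))
print(expand("moving picture cameras"))
```

Associative (RT) links are left out of the expansion on purpose: following them automatically tends to broaden a search well beyond the original concept, so they are usually offered to the searcher as suggestions instead.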

… but, the Functions thesaurus

Whereas

• agenda papers might have

– broader term documents

In a functions thesaurus

• agenda papers might have

– broader term meetings

Applying a functional thesaurus

Top Term

• PERSONNEL

Scope Notes

The function of managing all employees ……

Related Terms

• COMPENSATION

• ESTABLISHMENT

• INDUSTRIAL RELATIONS etc, etc

Narrower Terms

• ALLOWANCES

• APPEALS (Decisions)

• APPOINTMENT

• ARRANGEMENTS

• AUTHORISATION

• COMMITTEES

• COMPLIANCE etc, etc

Use For Terms

• Employees

• Public Servants

• Staff

Thesaurus Display

• Alphabetical hierarchies

– One level above and below entry term

– Complete hierarchy for each term or separate TT display

• Permuted term lists

• Combination with classification notation

• Graphic Displays
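
Of the display forms listed, the permuted term list is the easiest to generate mechanically: every word of a multi-word term becomes an access point, so a user looking under CAMERAS finds all the camera terms. The sketch below is one simple way to build such a list; the function name `permuted_list` is hypothetical and the terms are taken from the ISO sample extract.

```python
def permuted_list(terms):
    """Pair every word of each term with the full term, sorted by word."""
    entries = []
    for term in terms:
        for word in term.split():
            entries.append((word.upper(), term))
    return sorted(entries)


terms = ["MOVING PICTURE CAMERAS", "STILL CAMERAS", "CINE CAMERAS"]

for word, term in permuted_list(terms):
    # Left column is the access point, right column the full term.
    print(f"{word:<10} {term}")
```

A production display would normally skip stop words and align the terms around the access word, which this sketch does not attempt.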

Applying a thesaurus

Download Term Tree from http://www.termtree.com.au

Free trial download from

Thesaurus software

• Assigned

• Integrated database

• Deriving terminology

Thesaurus software - assigned

Terms are assigned by vocabulary specialists in an independent database

• a.k.a.™ – Synercon Management Consulting

• MultiTes

• OpenCyc

• SuperTHES

– from THESmain/THESshow for mono-/multilingual thesauri

• Term Tree 2000

• WebChoir

• Wordmap

Thesaurus software – integrated database

Terms are assigned by specialists; the thesaurus works like an active data dictionary to control the database

• BASIS

• InMagic Bibliotech PRO

• BRS/Search

• STAR

Thesaurus software for deriving terminology

Terms are created automatically from text

• Entrieva – SemioTagger™, SemioMap™ and SemioSkyline™ for viewing

• Intology – taxonomy builder

• Verity – Thematic Mapping

• Autonomy – taxonomy generation & categorization

Thesaurus Building - 1

• Users

– Define

– Identify needs

– Define Thesaurus range & depth

• Raw vocabulary building

– Identify sources

– Collect and record terms

Thesaurus Building - 2

• Vocabulary organisation

– Cluster terms

– Establish relationships using symbols

• Maintenance

Business application

• Not long-term collaborative efforts of classification specialists

– Instead, adapt to business changes

• Not just descriptions of present business processes

– Instead, reflect strategic planning, competitors

• Not necessarily a single taxonomy

– Instead, multiple overlapping taxonomies

Content management

• Describe content as it’s being created rather than classify after creation

• User-needs orientation

Integrating taxonomies

• Accurate reporting

• Exchange of data

• Assist resource discovery

– Information retrieval

Thesaurus evaluation

• Qualities

• Information retrieval evaluation

Thesaurus Qualities

• Scope and features description

• Display forms

• Correctness of hierarchies

• Use of scope, history and qualification

• Adherence to standards

• Syndetic measures

– Connectedness

– Accessibility
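
The slide does not define the two syndetic measures, so the sketch below rests on an assumed reading: connectedness as the share of terms that carry at least one cross-reference, and accessibility as the average number of other terms that lead to a given term. The function names, the link table and the orphan entry are hypothetical; only the camera terms come from the ISO sample extract.

```python
def connectedness(links):
    """Share of terms with at least one outgoing cross-reference (assumed definition)."""
    linked = sum(1 for targets in links.values() if targets)
    return linked / len(links)


def accessibility(links):
    """Average number of terms leading to each term (assumed definition)."""
    incoming = {term: 0 for term in links}
    for targets in links.values():
        for target in targets:
            if target in incoming:
                incoming[target] += 1
    return sum(incoming.values()) / len(links)


# Cross-references (BT, NT and RT pooled) for a few sample terms.
links = {
    "CAMERAS": {"OPTICAL EQUIPMENT", "PHOTOGRAPHY"},
    "OPTICAL EQUIPMENT": {"CAMERAS", "MICROSCOPES"},
    "MICROSCOPES": {"OPTICAL EQUIPMENT"},
    "PHOTOGRAPHY": {"CAMERAS"},
    "ORPHAN TERM": set(),  # hypothetical term with no relationships
}

print(connectedness(links))
print(accessibility(links))
```

Under this reading, a thesaurus with many orphan terms scores low on connectedness, and one whose terms are rarely pointed to from elsewhere scores low on accessibility.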

Thesauri & Retrieval evaluation

• Cranfield experiments & since

• Recall and precision

• Influence on indexing

– Conceptual analysis

– Translation failure

– Omissions

– Exhaustivity/Specificity

– Syntax and ‘false drops’

• Maintenance costs
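
Recall and precision, the two measures named above, can be computed directly from the set of documents a search retrieves and the set judged relevant. The sketch below uses the standard definitions (precision is relevant retrieved over all retrieved; recall is relevant retrieved over all relevant). The document identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search result and relevance judgements.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example the retrieved documents d1 and d3 are 'false drops' that lower precision, while the unretrieved d5, d6 and d7 are omissions that lower recall.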

Post-controlled vocabularies

• Use of a ‘Hedge’ of terms to represent a broad concept, eg:

– ‘psychological aspects of..........’

– ‘........in Australia’

– ‘....review items on.....’

Still to come ……

Research areas

• Metathesauri

– Super – interlinked vocabularies (e.g. NLM)

• Semantic Web

– Enhancing word association with usage statistics like links (e.g. THESUS)

Review

• Controlled vocabulary types

• Software support

• Business processes

• Website – http://sky.fit.qut.edu.au/~middletm/cont_voc.html

– (about to move to database driven site – redirection will be applied)

Questions?