www.fit.qut.edu.au
Queensland University of Technology
Faculty of Information Technology
Michael Middleton
CRICOS No. 00213J
Controlled vocabularies: Thesauri and information retrieval
Michael Middleton
QUT School of Information Systems, Brisbane
for STIMULATE 5
Vrije Universiteit Brussel
Brussels, Belgium, July 2005
Introduction
• Context … History
• Vocabulary principles
• Thesaurus software
• Thesaurus building … application
• Thesaurus evaluation
• The future
Context: Information life cycle
[Diagram: life cycle stages labelled create, distribute, use, maintain, recall, reuse, store, dispose; annotation: Organise to maintain]
Context: Information management
Domains
• Operational
• Analytical
• Strategic
Context: indexing
• Producing representations of records or documents that constitute a finding aid to the records in a database or to part of a document
– Assigned indexing
– Derived indexing
Indexer qualities
• The ‘Art’ of assigned indexing:
– Empathy
– Meticulousness
– Consistency
– General knowledge
– Patience
Indexing guidelines
• Conceptual analysis and assigning
• Aboutness
• Elements of the document to consider
• Exhaustivity
• Specificity
• Index what is in the item
• Co-ordination
Assigned index representations
• Alphabetical Subject
• Classified
– Alphabetical
– Notation
• Chain
Indexing exercise
How consistent is database indexing?
Example: the same paper in multiple databases:
Middleton, M. Skills expectations of library graduates. http://eprints.qut.edu.au/archive/00000094/
1. Index it yourself
2. Compare your indexing with others
3. Compare the indexing in ERIC and INSPEC
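One way to put a number on step 2 of the exercise is a simple overlap ratio between two indexers' term sets: terms in common divided by all distinct terms either indexer used (a measure often attributed to Hooper). The sketch below is only an illustration; the term lists are hypothetical and are not the actual ERIC or INSPEC indexing of the paper.

```python
def indexing_consistency(terms_a, terms_b):
    """Shared terms divided by all distinct terms used by either indexer."""
    a = {t.strip().lower() for t in terms_a}
    b = {t.strip().lower() for t in terms_b}
    if not a and not b:
        return 1.0  # two empty term sets are treated as fully consistent
    return len(a & b) / len(a | b)

# Hypothetical term sets, invented for illustration only
indexer_one = ["Librarians", "Job skills", "Graduates", "Employment qualifications"]
indexer_two = ["Job skills", "Graduates", "Library education"]

score = indexing_consistency(indexer_one, indexer_two)
print(round(score, 2))
```

A score of 1.0 means both indexers chose exactly the same terms; a score near 0 means they barely overlapped.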
Context: metadata
• Agent
– Document description
– Responsibility
– Administrative
– Provenance
– Connections
– Conditions of use
Context: metadata
• Content
– Topic (application of vocabulary control)
– Coverage
– Role
Controlled vocabulary
• Thesaurus
– A controlled vocabulary of terms in natural language that are designed for post-coordination
• Classification scheme
– A scheme for organisation by categories in a systematic manner; this may involve grouping by subject, function or other criteria, or determining document naming conventions
– Often involves notation
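Post-coordination means that single-concept terms are combined at search time rather than at indexing time. A minimal sketch, assuming a toy index that maps each thesaurus term to the set of document ids it was assigned to (the ids and the `search_all` helper are hypothetical):

```python
# Hypothetical postings: thesaurus term -> ids of documents indexed with it
postings = {
    "CAMERAS": {1, 2, 3, 5},
    "DIVING": {2, 4, 5},
    "PHOTOGRAPHY": {1, 5, 6},
}

def search_all(index, *terms):
    """Return ids of documents indexed with every one of the given terms."""
    sets = [index.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()

result = search_all(postings, "CAMERAS", "DIVING")
print(sorted(result))
```

The searcher, not the indexer, decides that CAMERAS and DIVING belong together; a pre-coordinated scheme would instead need a ready-made heading for that combination.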
Purpose
• Indexing by translating diverse natural language to consistent terminology
• Establishing relationships among terms
• Information retrieval: improving precision and recall
History
• Bibliographic databases
– Many applications; a list of associated online thesauri and classification schemes is at http://sky.fit.qut.edu.au/~middletm/cont_voc.html
• Standards
– ISO 2788; ISO 5964
– ANSI Z39.19
Thesaurus principles
• Term relationships
• Continuing evolution
• Internally consistent hierarchies to support database searching
The Thesaurus
• The vocabulary of a controlled indexing language formally organised so that the a priori relationships between concepts are made explicit.
• A thesaurus is an example of metadata
Thesaurus extract (ISO sample)
35 mm CAMERAS
    BT MINIATURE CAMERAS
CAMERAS
    BT OPTICAL EQUIPMENT
    NT MOVING PICTURE CAMERAS
       STEREO CAMERAS
       STILL CAMERAS
       UNDERWATER CAMERAS
    RT PHOTOGRAPHY
CINE CAMERAS
    BT MOVING PICTURE CAMERAS
    NT UNDERWATER CINE CAMERAS
    RT CINEMA
CINEMA
    RT CINE CAMERAS
DIVING
    RT UNDERWATER CAMERAS
INSTANT PICTURE CAMERAS
    SN Cameras which produce a finished print directly
    BT STILL CAMERAS
Land cameras USE VIEW CAMERAS
MICROSCOPES
    BT OPTICAL EQUIPMENT
MINIATURE CAMERAS
    BT STILL CAMERAS
    NT 35 mm CAMERAS
MOVING PICTURE CAMERAS
    BT CAMERAS
    NT CINE CAMERAS
       TELEVISION CAMERAS
OPTICAL EQUIPMENT
    NT CAMERAS
       MICROSCOPES
PHOTOGRAPHY
    RT CAMERAS
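The BT/NT/RT entries in the extract above can be held as simple (term, relation, target) triples. The sketch below transcribes only a few of them and checks one property a thesaurus editor would normally want: every BT should be mirrored by an NT in the other direction, and every RT should be stated both ways. The function names and the triple layout are assumptions made for illustration, not part of any standard or product.

```python
# A few (term, relation, target) triples transcribed from the extract
RELATIONS = [
    ("35 mm CAMERAS", "BT", "MINIATURE CAMERAS"),
    ("MINIATURE CAMERAS", "NT", "35 mm CAMERAS"),
    ("MINIATURE CAMERAS", "BT", "STILL CAMERAS"),
    ("CAMERAS", "BT", "OPTICAL EQUIPMENT"),
    ("CAMERAS", "NT", "STILL CAMERAS"),
    ("CAMERAS", "RT", "PHOTOGRAPHY"),
    ("OPTICAL EQUIPMENT", "NT", "CAMERAS"),
    ("PHOTOGRAPHY", "RT", "CAMERAS"),
]

# Each relation and the relation expected in the opposite direction
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}

def build_thesaurus(relations):
    """Group triples as term -> relation -> set of target terms."""
    thesaurus = {}
    for term, rel, target in relations:
        thesaurus.setdefault(term, {}).setdefault(rel, set()).add(target)
    return thesaurus

def missing_reciprocals(relations):
    """List the reverse relations that are implied but not stated."""
    stated = set(relations)
    return sorted(
        (target, INVERSE[rel], term)
        for term, rel, target in stated
        if (target, INVERSE[rel], term) not in stated
    )

thesaurus = build_thesaurus(RELATIONS)
missing = missing_reciprocals(RELATIONS)
for gap in missing:
    print(gap)
```

With this subset the check reports two gaps, both for STILL CAMERAS, which appears as a target in the transcribed triples but has no entry of its own among them.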
Standardising the Vocabulary
• Types of entities & forms of terms
• Singular vs plural
• Homonyms
• Choice of terms
• Scope notes and history notes
Compound terms
• Terms should be factored into simpler elements to improve users’ understanding.
• Semantic factoring
• Syntactic factoring
Semantic Relationships
• Equivalence
– Establishing relationships between preferred (postable) and non-preferred (non-postable) terms
• Hierarchical
– Establishing relationships between subordinate and superordinate terms. These may be distinguished as:
  • Generic
  • Whole-part
  • Instance
• Associative
– Establishing relationships between terms that are mentally associated, but not equivalent or hierarchical
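The equivalence relationship is what lets an indexer or searcher arrive with a non-preferred term and leave with the postable one. A minimal sketch, assuming a lookup table built from USE references that appear in the examples in these slides (Land cameras USE VIEW CAMERAS; Employees, Public Servants and Staff as Use For terms of PERSONNEL); the `to_preferred` helper is hypothetical:

```python
# Non-preferred entry term (lower-cased) -> preferred term
USE = {
    "land cameras": "VIEW CAMERAS",
    "employees": "PERSONNEL",
    "public servants": "PERSONNEL",
    "staff": "PERSONNEL",
}
PREFERRED = {"VIEW CAMERAS", "PERSONNEL", "CAMERAS"}

def to_preferred(entry):
    """Map an entry term to its preferred term, or None if it is not in the vocabulary."""
    key = " ".join(entry.split()).lower()  # normalise spacing and case
    if key in USE:
        return USE[key]
    candidate = key.upper()
    return candidate if candidate in PREFERRED else None

print(to_preferred("Staff"))
print(to_preferred("tripods"))
```

Returning None for an unknown term, rather than passing it through, keeps uncontrolled words out of the index and flags them as candidate terms for the vocabulary maintainer.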
… but, the Functions thesaurus
Whereas
• agenda papers might have
– broader term documents
In a functions thesaurus
• agenda papers might have
– broader term meetings
Applying a functional thesaurus
Top Term
• PERSONNEL
Scope Notes
The function of managing all employees ……
Related Terms
• COMPENSATION
• ESTABLISHMENT
• INDUSTRIAL RELATIONS etc, etc
Narrower Terms
• ALLOWANCES
• APPEALS (Decisions)
• APPOINTMENT
• ARRANGEMENTS
• AUTHORISATION
• COMMITTEES
• COMPLIANCE etc, etc
Use For Terms
• Employees
• Public Servants
• Staff
Thesaurus Display
• Alphabetical hierarchies
– One level above and below entry term
– Complete hierarchy for each term or separate TT display
• Permuted term lists
• Combination with classification notation
• Graphic Displays
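A permuted term list gives every word of a multi-word term its own access point, so that UNDERWATER CINE CAMERAS can be found under CINE and CAMERAS as well as UNDERWATER. A minimal sketch of how such a list might be generated, using terms from the ISO sample; the function name and output layout are assumptions:

```python
def permuted_index(terms):
    """Return (keyword, full term) pairs, one per word of each term, sorted by keyword."""
    entries = []
    for term in terms:
        for word in term.split():
            entries.append((word.upper(), term))
    return sorted(entries)

terms = ["STILL CAMERAS", "UNDERWATER CAMERAS", "UNDERWATER CINE CAMERAS"]
index = permuted_index(terms)
for keyword, term in index:
    print(f"{keyword:<12}{term}")
```

A production display would usually skip stop words and might rotate the term around the keyword (KWIC style) instead of printing it in full.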
Applying a thesaurus
Free trial download of Term Tree from http://www.termtree.com.au
Thesaurus software
• Assigned
• Integrated database
• Deriving terminology
Thesaurus software - assigned
Terms are assigned by vocabulary specialists in an independent database
• a.k.a.™ – Synercon Management Consulting
• MultiTes
• OpenCyc
• SuperTHES
– from THESmain/THESshow for mono-/multilingual thesauri
• Term Tree 2000
• WebChoir
• Wordmap
Thesaurus software – integrated database
Terms are assigned by specialists; the thesaurus works like an active data dictionary to control the database
• BASIS
• InMagic Bibliotech PRO
• BRS/Search
• STAR
Thesaurus software for deriving terminology
Terms are created automatically from text
• Entrieva – SemioTagger™, SemioMap™ and SemioSkyline™ for viewing
• Intology – taxonomy builder
• Verity – Thematic Mapping
• Autonomy – taxonomy generation & categorization
Thesaurus Building - 1
• Users
– Define
– Identify needs
– Define Thesaurus range & depth
• Raw vocabulary building
– Identify sources
– Collect and record terms
Thesaurus Building - 2
• Vocabulary organisation
– Cluster terms
– Establish relationships using symbols
• Maintenance
Business application
• Not long term collaborative efforts of classification specialists
– Instead, adapt to business changes
• Not just descriptions of present business processes
– Instead, reflect strategic planning, competitors
• Not necessarily a single taxonomy
– Instead, multiple overlapping taxonomies
Content management
• Describe content as it’s being created rather than classify after creation
• User-needs orientation
Integrating taxonomies
• Accurate reporting
• Exchange of data
• Assist resource discovery
– Information retrieval
Thesaurus evaluation
• Qualities
• Information retrieval evaluation
Thesaurus Qualities
• Scope and features description
• Display forms
• Correctness of hierarchies
• Use of scope, history and qualification
• Adherence to standards
• Syndetic measures
– Connectedness
– Accessibility
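The slides name the two syndetic measures without defining them. One common reading, assumed here rather than taken from the slides, is that connectedness is the proportion of terms linked to at least one other term, and accessibility is the average number of terms that lead to a given term. A minimal sketch over a small, partly hypothetical link table:

```python
# term -> set of terms it refers to (BT, NT or RT); an illustrative subset only
links = {
    "CAMERAS": {"OPTICAL EQUIPMENT", "STILL CAMERAS", "PHOTOGRAPHY"},
    "OPTICAL EQUIPMENT": {"CAMERAS", "MICROSCOPES"},
    "MICROSCOPES": {"OPTICAL EQUIPMENT"},
    "PHOTOGRAPHY": {"CAMERAS"},
    "STILL CAMERAS": set(),
    "DIVING": set(),
}

def connectedness(links):
    """Proportion of terms that have at least one incoming or outgoing link."""
    if not links:
        return 0.0
    incoming = set().union(*links.values())
    linked = [t for t, out in links.items() if out or t in incoming]
    return len(linked) / len(links)

def accessibility(links):
    """Average number of incoming links per term."""
    if not links:
        return 0.0
    total_incoming = sum(len(out) for out in links.values())
    return total_incoming / len(links)

print(round(connectedness(links), 3), round(accessibility(links), 3))
```

In this table DIVING is an orphan, so it lowers the connectedness figure; a term like that is a prompt for the editor to add a relationship or reconsider the term.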
Thesauri & Retrieval evaluation
• Cranfield experiments & since
• Recall and precision
• Influence on indexing
– Conceptual analysis
– Translation failure
– Omissions
– Exhaustivity/Specificity
– Syntax and ‘false drops’
• Maintenance costs
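Recall and precision, the two measures listed above, are both ratios over the same overlap: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that were retrieved. A minimal sketch with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search, given sets of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search: 4 documents retrieved, 5 documents actually relevant
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 6, 8, 10}

precision, recall = precision_recall(retrieved, relevant)
false_drops = retrieved - relevant  # retrieved but not relevant
print(precision, recall, sorted(false_drops))
```

The retrieved-but-irrelevant items are the ‘false drops’ mentioned above; indexing omissions show up on the other side, as relevant items that were never retrieved.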
Post-controlled vocabularies
• Use of a ‘Hedge’ of terms to represent a broad concept, e.g.:
– ‘psychological aspects of..........’
– ‘........in Australia’
– ‘....review items on.....’
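In practice a hedge is a stored list of search words that is OR-ed together and then AND-ed with the searcher's topic. A minimal sketch, assuming a generic Boolean query syntax; the hedge contents and the `expand` helper are hypothetical and not drawn from any particular database:

```python
# Hypothetical hedge: broad concept -> words searched in its place
HEDGES = {
    "psychological aspects": ["psychology", "behaviour", "attitudes", "motivation"],
    "in Australia": ["Australia", "Queensland", "New South Wales", "Victoria"],
}

def expand(concept, topic):
    """Build a Boolean query: (hedge words OR-ed together) AND topic."""
    words = HEDGES[concept]
    # Quote multi-word entries so they are searched as phrases
    ored = " OR ".join(f'"{w}"' if " " in w else w for w in words)
    return f"({ored}) AND {topic}"

query = expand("psychological aspects", "indexing")
print(query)
```

Because the control is applied at search time, the hedge can be revised without re-indexing any documents.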
Still to come ……
Research areas
• Metathesauri
– Super – interlinked vocabularies (e.g. NLM)
• Semantic Web
– Enhancing word association with usage statistics like links (e.g. THESUS)
Review
• Controlled vocabulary types
• Software support
• Business processes
• Website – http://sky.fit.qut.edu.au/~middletm/cont_voc.html
– (about to move to database driven site – redirection will be applied)
Questions?