3 Thesaurus
Prof. Dr. Knut Hinkelmann, Information Retrieval and Knowledge Organisation - 3 Thesaurus
Dealing with word meanings in information retrieval
Problem: The same meaning can be expressed using different terms
- synonyms
- homonyms
- related terms
How can it be achieved that identical terms are used in the index and the query for the same meaning?
Thesaurus
A thesaurus is a sorted composition of terms and their descriptors that can be used for indexing, storing and retrieval of information in a field of documentation.
A thesaurus contains
- terms
- relationships between terms
Thesaurus - Definition
A thesaurus [...] is an ordered compilation of concepts and their (predominantly natural-language) designations that serves for indexing, storing and retrieval in a field of documentation.
It is characterised by the following features:
- Concepts and designations are related to each other unambiguously (terminological control) in that
  - synonyms are recorded as completely as possible,
  - homonyms and polysemes are specially marked,
  - for each concept one designation (preferred term, concept number or notation) is fixed that represents the concept unambiguously.
- Relationships between concepts (represented by their designations) are shown.
Source: DIN 1463 – Erstellung und Weiterentwicklung von Thesauri
Types of Thesauri
Two kinds of thesauri can be distinguished:
- Thesauri with preferred terms
  Of the terms with the same or nearly the same meaning, only one is allowed for indexing. Preferred terms are also called descriptors.
- Thesauri without preferred terms
  Terms with similar meaning are collected in equivalence classes (sometimes called synonym sets or synsets). All terms can be used for indexing.
preferred term = Vorzugsbezeichnung
Thesauri in the Web
Web Thesaurus Compendium:
http://www.ipsi.fraunhofer.de/~lutes/thesoecd.html
Examples:
- Thesauri with preferred terms
  - UNESCO Thesaurus
    http://www.ulcc.ac.uk/unesco/
  - Standard Thesaurus Wirtschaft
    http://www.gbi.de/thesaurus/
- Thesauri without preferred terms
  - WordNet (a lexical database for the English language)
    http://wordnet.princeton.edu/
  - OpenThesaurus
    http://www.openthesaurus.de
3.1 Thesaurus with preferred terms
Terms are represented as descriptors and non-descriptors
Descriptor
- A descriptor, also called preferred term, is the term to be used to represent a concept when indexing documents and formulating queries
- A descriptor contains relationships to other descriptors/terms
Non-descriptor
- A non-descriptor, also called forbidden term, is a term designating a concept very close to that represented by a descriptor. It contains a reference to the corresponding descriptor as its only relationship
[Figure: a descriptor and a non-descriptor. Example: UNESCO Thesaurus]
Relationships between terms
Descriptors contain relationships to other descriptors
- Hierarchical relationships, which link terms to other terms expressing more general and more specific concepts, i.e. broader terms (BT) and narrower terms (NT).
- Associative relationships, which link terms to similar terms (related terms) where the relationship between the terms is non-hierarchical. Related terms are indicated by the prefix RT.
- Equivalence relationships, which link "non-preferred" terms to synonyms or quasi-synonyms which act as "preferred" terms. Non-preferred terms are indicated by the prefix UF.
A descriptor can contain additional information
- Explanations of the intended use of the descriptor
- Group (microthesaurus) the descriptor belongs to
- Linguistic equivalence, which designates the same concept in different languages for multilingual thesauri
Relations: German and English

German                                                   English
Abbr.  Denomination                                      Abbr.  Denomination

Hierarchy Relations
TT     Top Term (allgemeinster Begriff)                  TT     Top term
OB     übergeordneter Begriff (Oberbegriff)              BT     Broader term
UB     untergeordneter Begriff (Unterbegriff)            NT     Narrower term

Hierarchy Relations distinguishing between Abstraction and Aggregation
OA     Oberbegriff Abstraktionsrelation                  BTG    Broader term generic
UA     Unterbegriff Abstraktionsrelation                 NTG    Narrower term generic
SP     Verbandbegriff                                    BTP    Broader term partitive
TP     Teilbegriff                                       NTP    Narrower term partitive

Equivalence Relations and Associations
BS     Benutztes Synonym oder Quasi-Synonym              USE    Use
BF     Benutzt für Synonym oder Quasi-Synonym            UF     Used for
VB     verwandter Begriff                                RT     Related term
BK     Benutzte Kombination von Einfachdeskriptoren      USE    Use
KB     Benutzt in Kombination von Einfachdeskriptoren    UFC    Used for combination

Source: DIN 1463 – Erstellung und Weiterentwicklung von Thesauri
Equivalence Relation - Synonyms
Semantic equivalence is a relation between terms with (nearly) the same meaning. It is expressed by two symbols:
- USE is used in non-descriptors and refers to the corresponding descriptor
  Example:
    Cars
      USE Motor vehicles
- UF (= Used For) is used in descriptors and refers to synonymous non-descriptors
  Example:
    Motor vehicles
      UF Cars
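USE and UF are two views of the same link, so one can be derived from the other. The following is a minimal sketch of that idea, not a description of how any particular thesaurus software works. The names `USE`, `used_for` and `preferred` are hypothetical, and the sample data extends the Cars example with two further non-descriptors.

```python
from collections import defaultdict

# Sample USE references (non-descriptor -> descriptor); illustrative data only.
USE = {
    "Cars": "Motor vehicles",
    "Automobiles": "Motor vehicles",
    "Trucks": "Motor vehicles",
}


def used_for(use_refs):
    """Invert USE references into UF lists (descriptor -> sorted non-descriptors)."""
    uf = defaultdict(list)
    for non_descriptor, descriptor in use_refs.items():
        uf[descriptor].append(non_descriptor)
    return {descriptor: sorted(terms) for descriptor, terms in uf.items()}


def preferred(term, use_refs):
    """Return the descriptor for a term; a term without a USE reference maps to itself."""
    return use_refs.get(term, term)


UF = used_for(USE)
```

Keeping only the USE direction as stored data and deriving UF avoids the two directions drifting apart.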
Descriptors and Non-Descriptors
Descriptors
- may have zero, one or more corresponding non-descriptors
- have relations to other descriptors
Non-descriptors
- must refer to exactly one descriptor (relation USE)
- do not have any other relation

Example from the UNESCO thesaurus:

Descriptor:
  Motor vehicles
    MT 6.60 Equipment and facilities
    UF Automobiles
    UF Cars
    UF Trucks
    BT Vehicles
    RT Road Engineering
    RT Road Transport

Non-descriptors:
  Automobiles
    USE Motor vehicles
  Cars
    USE Motor vehicles
  Trucks
    USE Motor vehicles
Hierarchy
In general, a hierarchy is represented by two relations
- BT (= Broader Term) relates a descriptor to a more generic descriptor
  Example:
    Banks
      BT Financial institutions
      BT2 Finance
- NT (= Narrower Term) relates a descriptor to a more specific descriptor
  Example:
    Financial institutions
      NT Banks
      BT Finance
In the UNESCO thesaurus, a digit to the right of the symbols BT or NT indicates the number of hierarchical levels separating the descriptors
[Figure: hierarchy Finance > Financial institutions > Banks]
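The level digits can be computed by walking up the BT links. The sketch below assumes each descriptor has at most one direct broader term, which a real thesaurus need not guarantee. `BROADER`, `broader_terms` and `format_entry` are hypothetical names. The sketch writes BT1 for the first level, whereas the slide example writes plain BT.

```python
# Assumed single-parent hierarchy: descriptor -> its direct broader term.
BROADER = {
    "Banks": "Financial institutions",
    "Financial institutions": "Finance",
}


def broader_terms(term, broader):
    """Return [(level, broader term), ...] by walking upwards from term."""
    result = []
    seen = {term}          # guards against accidental cycles in the data
    level = 1
    current = broader.get(term)
    while current is not None and current not in seen:
        result.append((level, current))
        seen.add(current)
        current = broader.get(current)
        level += 1
    return result


def format_entry(term, broader):
    """Render a term with numbered BT lines, one line per hierarchy level."""
    lines = [term]
    for level, bt in broader_terms(term, broader):
        lines.append(f"  BT{level} {bt}")
    return "\n".join(lines)
```

Calling `format_entry("Banks", BROADER)` yields the term followed by one BT line per level up to the top term.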
Specific Hierarchies
There are thesauri that distinguish between different types of hierarchies
- generic relation (specific vs. generic terms): the narrower term is more specific than the broader term
  Example:
    Vehicles
      NTG Motor vehicles
      NTG Bicycles
    Motor vehicles
      BTG Vehicles
    Bicycles
      BTG Vehicles
- partitive relation: the narrower term is part of the broader term
  Example:
    Motor vehicles
      NTP Engines
    Engines
      BTP Motor vehicles
[Figure: generic hierarchy Vehicles > Bicycles, Motor vehicles; partitive hierarchy Motor vehicles > Engines]
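Each typed narrower-term link has a typed broader-term counterpart, so the inverse entries can be generated rather than typed by hand. This is a minimal sketch under that assumption. `INVERSE`, `RELATIONS`, `with_inverses` and `entry` are hypothetical names, and the data is the example above.

```python
# Each relation type and its inverse.
INVERSE = {"NTG": "BTG", "BTG": "NTG", "NTP": "BTP", "BTP": "NTP"}

# Relations as (source, type, target), recorded in one direction only.
RELATIONS = [
    ("Vehicles", "NTG", "Motor vehicles"),
    ("Vehicles", "NTG", "Bicycles"),
    ("Motor vehicles", "NTP", "Engines"),
]


def with_inverses(relations):
    """Return the relations plus the inverse of every typed link."""
    closed = set(relations)
    for source, rel, target in relations:
        closed.add((target, INVERSE[rel], source))
    return closed


def entry(term, relations):
    """List (type, target) pairs for one term, sorted for stable output."""
    return sorted((rel, target) for source, rel, target in relations if source == term)


ALL_RELATIONS = with_inverses(RELATIONS)
```

With the inverses added, "Motor vehicles" carries both its generic broader term and its partitive narrower term, which is exactly the distinction the typed symbols preserve.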
Association RT
RT (= Related Term) is a relation between two descriptors that is neither hierarchical nor an equivalence relation.
There are different kinds of relations that can be expressed as an association relation, e.g.
- Descriptors that are at the same level in a hierarchy
    Diesel engine RT Otto engine
    Apple RT Pear
- Descriptors that are part of a common thing
    Solothurn RT Aargau
- Antonym (opposite)
    Heat RT Cold
- Successor relation
    Father RT Son
- Functional or causal relation
    Book RT Reading
Structure of the Thesaurus
The UNESCO thesaurus is organised into subject fields and microthesauri
Field names
- A field is a grouping of microthesauri
- A field name is preceded by a one-digit serial number
Microthesaurus names
- A microthesaurus is a grouping of descriptors and non-descriptors
- A microthesaurus name is preceded by a three-digit serial number; the first digit is the number of the subject field to which the microthesaurus belongs
[Figure: example of a subject field and its microthesauri]
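Because the first digit of a microthesaurus number is the subject field, the field can be read directly from the code. The sketch below assumes the dotted form "6.60" that appears in the Motor vehicles entry. The function name `parse_microthesaurus_code` is hypothetical.

```python
def parse_microthesaurus_code(code):
    """Split a code such as "6.60" into (subject field number, microthesaurus code)."""
    digits = code.replace(".", "")
    # Expect exactly three ASCII digits once the dot is removed.
    if len(digits) != 3 or not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"expected a three-digit code, got {code!r}")
    return int(digits[0]), code
```

For the microthesaurus "6.60 Equipment and facilities", this returns subject field 6.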
Other Descriptor Information
Descriptors in the UNESCO thesaurus also contain:
- Explanation
  - Explains the use for which a descriptor is intended
  - Explanations in the UNESCO thesaurus are called Scope Notes (SN)
- Inclusion
  - Reference between a descriptor and the microthesaurus to which it belongs
  - Shown by the symbol MT
- Linguistic equivalence
  - Relation between descriptors designating the same concept in different languages
  - Shown by the symbol of the language indicators
Standard Thesaurus Wirtschaft
[Figure: Subthesauri]
[Figure: Structure of the subthesaurus "Betriebswirtschaft"]
Searching for "Geldinstitut" finds the descriptor term "Bank"
3.2 Thesauri without preferred terms
Terms with similar meaning are represented as equivalence classes.
Example: WordNet
- In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets); each synset expresses a distinct concept
- Synsets are interlinked by means of conceptual-semantic and lexical relations.
http://wordnet.princeton.edu/
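The equivalence-class idea can be sketched with plain sets. The synsets below are hypothetical toy data and are not taken from WordNet. `SYNSETS`, `build_index` and `synonyms` are illustrative names. A word with several meanings simply appears in several synsets.

```python
# Toy synsets: each set stands for one distinct concept.
SYNSETS = [
    frozenset({"car", "auto", "automobile", "motorcar"}),
    frozenset({"bank", "depository financial institution"}),
    frozenset({"bank", "riverbank"}),
]


def build_index(synsets):
    """Map each term to the list of synsets it belongs to."""
    index = {}
    for synset in synsets:
        for term in synset:
            index.setdefault(term, []).append(synset)
    return index


def synonyms(term, index):
    """Return all terms sharing at least one synset with term, excluding term itself."""
    result = set()
    for synset in index.get(term, []):
        result |= synset
    result.discard(term)
    return result


INDEX = build_index(SYNSETS)
```

No term is privileged here: every member of a synset can be used for indexing, and looking up "bank" returns both of its concepts.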
Relations in WordNet
For each synset there are a number of relations to other synsets, e.g.
- hyponym: more specific concepts (corresponds to narrower term NT)
- hypernym: more general concepts (opposite of hyponym; corresponds to broader term BT)
- part meronym: constituent parts of the concept (corresponds to narrower term partitive NTP)
- holonym: opposite of meronym
- domain category: classes the concept belongs to
WordNet: Displaying the value of relations
Example: OpenThesaurus
OpenThesaurus is an open source thesaurus for the German language
[Figure: word group with synonyms for "Auto"]
3.3 Possible uses of a Thesaurus
Index with Controlled Vocabulary
- Use thesaurus for indexing
  - providing a controlled vocabulary for manual indexing
  - storing only preferred terms (descriptors) in the index, e.g. in the attribute "keyword"
- Use thesaurus for retrieval
  - the user can use the thesaurus to formulate a query: find preferred terms, or find broader or narrower terms if the query is not successful

Fulltext search
- Use thesaurus for indexing
  - automatically store all synonyms as index terms
  - the thesaurus may still be helpful at retrieval, e.g. to find broader terms, narrower terms, related terms
- Use thesaurus for retrieval
  - the index contains only terms occurring in the documents
  - the user can use the thesaurus to refine a query: find synonyms, broader terms, narrower terms or related terms if the query is not successful
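The two uses differ in where the thesaurus is applied. The sketch below illustrates both under simplifying assumptions: exact, case-sensitive term matching and a tiny hand-made thesaurus. `USE`, `NARROWER`, `to_descriptor`, `index_keywords` and `expand_query` are hypothetical names.

```python
# Illustrative thesaurus fragments.
USE = {
    "Cars": "Motor vehicles",
    "Automobiles": "Motor vehicles",
    "Trucks": "Motor vehicles",
}
NARROWER = {"Vehicles": ["Motor vehicles"]}  # descriptor -> narrower descriptors


def to_descriptor(term):
    """Map a non-descriptor to its descriptor; other terms are returned unchanged."""
    return USE.get(term, term)


def index_keywords(keywords):
    """Controlled vocabulary: store only descriptors in the index."""
    return {to_descriptor(keyword) for keyword in keywords}


def expand_query(term):
    """Fulltext search: search for the descriptor, its non-descriptors and narrower terms."""
    descriptor = to_descriptor(term)
    expanded = {descriptor}
    expanded.update(nd for nd, d in USE.items() if d == descriptor)
    expanded.update(NARROWER.get(descriptor, []))
    return expanded
```

With a controlled vocabulary the mapping happens once at indexing time and again on the query term. With fulltext search the documents are left as they are and the query is widened instead.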
Use of a Thesaurus (2)
The thesaurus can be used by humans or automatically
- Human
  - use the thesaurus as a reference book
  - electronically or conventionally (book)
- Retrieval system
  - the system can suggest synonyms, broader terms or narrower terms automatically
- Indexing system
  - automatically find synonyms and preferred terms
Example from the UNESCO Thesaurus
The figure shows a descriptor in the UNESCO thesaurus. For the term "Motor vehicles" there are various synonyms, broader terms and related terms.
- Use for indexing:
  The index must not contain the non-preferred terms "Automobiles", "Cars", "Trucks" but only "Motor vehicles".
- Use for keyword search:
  Searching for "Cars" does not provide a result. Looking up the thesaurus, we find that "Motor vehicles" is the corresponding descriptor term, which is used as index term.
- Use for fulltext search:
  If searching for "Motor vehicles" provides too many results, we can use the thesaurus to find alternative search terms.
Example from the Standard Thesaurus Wirtschaft
The figure shows a descriptor in the thesaurus "Wirtschaft". For the term "Bank" there are various synonyms, narrower terms and related terms.
- Use for indexing:
  The index only contains the descriptor term "Bank", but not the corresponding non-descriptors.
- Use for keyword search:
  Searching for "Kreditinstitut" does not provide a result. Looking up the thesaurus, we find that "Bank" is the corresponding descriptor term, which is used as keyword.
- Use for fulltext search:
  If searching for "Bank" provides too many results, we can use the thesaurus to find alternative search terms.
Maintenance of a Thesaurus
Building and maintaining a thesaurus requires expertise and is time-consuming
- What terms are descriptors?
- Are all synonyms included?
- What is the correct relation between terms?
- Avoiding inconsistencies
Thesauri are often constructed and maintained by trustworthy organisations.
Many thesauri cover a specific field of interest and contain general terms but no enterprise-specific terms (product names, projects etc.). Adding them requires maintenance effort.
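Some inconsistencies can be found mechanically. The sketch below checks three rules taken from this chapter: a non-descriptor refers to an existing descriptor, the descriptor lists it under UF, and BT/NT links are reciprocal. The data layout and the names `check_thesaurus`, `DESCRIPTORS` and `USE_REFS` are assumptions made for illustration, not the format of any real thesaurus tool.

```python
def check_thesaurus(descriptors, use_refs):
    """Return a list of problems found.

    descriptors: term -> {"BT": [...], "NT": [...], "UF": [...]} (keys optional)
    use_refs:    non-descriptor -> descriptor
    """
    problems = []
    for nd, target in use_refs.items():
        if nd in descriptors:
            problems.append(f"{nd!r} is both a descriptor and a non-descriptor")
        if target not in descriptors:
            problems.append(f"{nd!r} USE {target!r}: target is not a descriptor")
        elif nd not in descriptors[target].get("UF", []):
            problems.append(f"{target!r} lacks UF {nd!r}")
    for term, rels in descriptors.items():
        for bt in rels.get("BT", []):
            if term not in descriptors.get(bt, {}).get("NT", []):
                problems.append(f"{term!r} BT {bt!r} has no reciprocal NT")
        for nt in rels.get("NT", []):
            if term not in descriptors.get(nt, {}).get("BT", []):
                problems.append(f"{term!r} NT {nt!r} has no reciprocal BT")
    return problems


# Deliberately inconsistent sample data.
DESCRIPTORS = {
    "Banks": {"BT": ["Financial institutions"]},
    "Financial institutions": {},
}
USE_REFS = {"Cars": "Motor vehicles"}

PROBLEMS = check_thesaurus(DESCRIPTORS, USE_REFS)
```

Checks like these cover only formal consistency. Whether a term should be a descriptor, or which relation is correct, still needs expert judgement.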