
3 Thesaurus

Prof. Dr. Knut Hinkelmann, Information Retrieval and Knowledge Organisation - 3 Thesaurus

Dealing with word meanings in information retrieval

Problem: The same meaning can be expressed using different terms

  • synonyms
  • homonyms
  • related terms

How can it be achieved that, for the same meaning, identical terms are used in the index and the query?


Thesaurus

A thesaurus is an ordered compilation of terms and their descriptors that can be used for indexing, storing and retrieving information in a field of documentation.

A thesaurus contains

  • terms
  • relationships between terms


Thesaurus - Definition

A thesaurus [...] is an ordered compilation of concepts and their (predominantly natural-language) designations that is used in a field of documentation for indexing, storing and retrieval.

It is characterised by the following features:

  • Concepts and designations are related to each other unambiguously (terminological control), in that
    • synonyms are recorded as completely as possible,
    • homonyms and polysemes are specially marked,
    • for each concept one designation (preferred term, concept number or notation) is fixed that represents the concept unambiguously.
  • Relationships between concepts (represented by their designations) are shown.

Source: DIN 1463 – Erstellung und Weiterentwicklung von Thesauri


Types of Thesauri

Two kinds of thesauri can be distinguished:

  • Thesauri with preferred terms
    Of the terms with the same or nearly the same meaning, only one is allowed for indexing. Preferred terms are also called descriptors.

  • Thesauri without preferred terms
    Terms with similar meaning are collected in equivalence classes (sometimes called synonym sets or synsets). All terms can be used for indexing.

preferred term = Vorzugsbezeichnung


Thesauri on the Web

Web Thesaurus Compendium: http://www.ipsi.fraunhofer.de/~lutes/thesoecd.html

Examples:

  • Thesauri with preferred terms
    • UNESCO Thesaurus: http://www.ulcc.ac.uk/unesco/
    • Standard Thesaurus Wirtschaft: http://www.gbi.de/thesaurus/

  • Thesauri without preferred terms
    • WordNet (a lexical database for the English language): http://wordnet.princeton.edu/
    • OpenThesaurus: http://www.openthesaurus.de


3.1 Thesaurus with preferred terms

Terms are represented as descriptors and non-descriptors.

  • Descriptor
    A descriptor, also called preferred term, is the term to be used to represent a concept when indexing documents and formulating queries. A descriptor contains relationships to other descriptors/terms.

  • Non-descriptor
    A non-descriptor, also called forbidden term, is a term designating a concept very close to that represented by a descriptor. Its only relationship is a reference to the corresponding descriptor.

[Figure: a descriptor and a non-descriptor. Example: UNESCO Thesaurus]


Relationships between terms

Descriptors contain relationships to other descriptors:

  • Hierarchical relationships, which link terms to other terms expressing more general and more specific concepts, i.e. broader terms (BT) and narrower terms (NT).
  • Associative relationships, which link terms to similar terms (related terms) where the relationship between the terms is non-hierarchical. Related terms are indicated by the prefix RT.
  • Equivalence relationships, which link "non-preferred" terms to synonyms or quasi-synonyms which act as "preferred" terms. Non-preferred terms are indicated by the prefix UF.

A descriptor can contain additional information:

  • Explanations of the intended use of the descriptor
  • Group (microthesaurus) the descriptor belongs to
  • Linguistic equivalence, which designates the same concept in different languages for multilingual thesauri


Relations: German and English

German                                                   English
Abbr.  Denomination                                      Abbr.  Denomination

Hierarchy relations
TT     Top Term (allgemeinster Begriff)                  TT     Top term
OB     übergeordneter Begriff (Oberbegriff)              BT     Broader term
UB     untergeordneter Begriff (Unterbegriff)            NT     Narrower term

Hierarchy relations distinguishing between abstraction and aggregation
OA     Oberbegriff Abstraktionsrelation                  BTG    Broader term generic
UA     Unterbegriff Abstraktionsrelation                 NTG    Narrower term generic
SP     Verbandbegriff                                    BTP    Broader term partitive
TP     Teilbegriff                                       NTP    Narrower term partitive

Equivalence relations and associations
BS     Benutztes Synonym oder Quasi-Synonym              USE    Use
BF     Benutzt für Synonym oder Quasi-Synonym            UF     Used for
VB     verwandter Begriff                                RT     Related term
BK     Benutzte Kombination von Einfachdeskriptoren      USE    Use
KB     Benutzt in Kombination von Einfachdeskriptoren    UFC    Used for combination

Source: DIN 1463 – Erstellung und Weiterentwicklung von Thesauri


Equivalence Relation - Synonyms

Semantic equivalence is a relation between terms with (nearly) the same meaning. It is expressed by two symbols:

  • USE is used in non-descriptors and refers to the corresponding descriptor.

    Example:
      Cars
        USE Motor vehicles

  • UF (= Used For) is used in descriptors and refers to synonymous non-descriptors.

    Example:
      Motor vehicles
        UF Cars
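
USE and UF are two views of the same relation, so the UF entries of a descriptor can be derived from the USE entries of its non-descriptors. The following is a minimal sketch, assuming the thesaurus is held in plain Python dictionaries; the names `use`, `invert_use` and `used_for` are illustrative and not part of any standard.

```python
from collections import defaultdict

# USE entries: non-descriptor -> descriptor (terms from the UNESCO examples).
use = {
    "Cars": "Motor vehicles",
    "Automobiles": "Motor vehicles",
}


def invert_use(use_map):
    """Derive the UF (Used For) entries of each descriptor from the USE entries."""
    used_for = defaultdict(list)
    for non_descriptor, descriptor in use_map.items():
        used_for[descriptor].append(non_descriptor)
    # Sort for a stable, alphabetical listing under each descriptor.
    return {descriptor: sorted(terms) for descriptor, terms in used_for.items()}


used_for = invert_use(use)
```

Storing only one direction and deriving the other is one way to keep the two symbols from drifting apart during maintenance.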


Descriptors and Non-Descriptors

  • Descriptors
    • may have zero, one or more corresponding non-descriptors
    • have relations to other descriptors

  • Non-descriptors
    • must refer to exactly one descriptor (relation USE)
    • do not have any other relation

Example from the UNESCO thesaurus:

Descriptor:

  Motor vehicles
    MT 6.60 Equipment and facilities
    UF Automobiles
    UF Cars
    UF Trucks
    BT Vehicles
    RT Road Engineering
    RT Road Transport

Non-descriptors:

  Automobiles
    USE Motor vehicles

  Cars
    USE Motor vehicles

  Trucks
    USE Motor vehicles
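
The two rules for non-descriptors (exactly one USE reference, no other relation) can be checked mechanically. The sketch below assumes a simple `Entry` record keyed by relation symbol; the class, the function names and the deliberately invalid `broken` entry are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One thesaurus entry: a term plus its relations, keyed by relation symbol."""
    term: str
    relations: dict = field(default_factory=dict)  # e.g. {"UF": [...], "BT": [...]}


def is_non_descriptor(entry):
    """In this sketch, an entry is a non-descriptor if it carries a USE relation."""
    return "USE" in entry.relations


def check_non_descriptor(entry):
    """Return a list of rule violations for a non-descriptor entry."""
    problems = []
    targets = entry.relations.get("USE", [])
    if len(targets) != 1:
        problems.append(f"{entry.term}: must refer to exactly one descriptor")
    extra = sorted(set(entry.relations) - {"USE"})
    if extra:
        problems.append(f"{entry.term}: unexpected relations {extra}")
    return problems


motor_vehicles = Entry("Motor vehicles", {
    "MT": ["6.60 Equipment and facilities"],
    "UF": ["Automobiles", "Cars", "Trucks"],
    "BT": ["Vehicles"],
    "RT": ["Road Engineering", "Road Transport"],
})
cars = Entry("Cars", {"USE": ["Motor vehicles"]})
# Hypothetical invalid entry: a non-descriptor must not carry an RT relation.
broken = Entry("Trucks", {"USE": ["Motor vehicles"], "RT": ["Road Transport"]})
```

Calling `check_non_descriptor(cars)` returns an empty list, while `check_non_descriptor(broken)` reports the extra RT relation.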


Hierarchy

In general, a hierarchy is represented by two relations:

  • BT (= Broader Term) relates a descriptor to a more generic descriptor.

    Example:
      Banks
        BT Financial institutions
        BT2 Finance

  • NT (= Narrower Term) relates a descriptor to a more specific descriptor.

    Example:
      Financial institutions
        NT Banks
        BT Finance

In the UNESCO thesaurus, a digit to the right of the symbols BT or NT indicates the number of hierarchical levels separating the descriptors.

[Figure: hierarchy Finance > Financial institutions > Banks]
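
The level digit can be computed by walking up the chain of direct broader terms. This is a minimal sketch that assumes each descriptor has at most one direct broader term (real thesauri may allow several); the dictionary `broader` and the function name are illustrative. The first level is labelled plain "BT" to match the display in the Banks example.

```python
# Direct broader-term links, as in the Banks example.
broader = {
    "Banks": "Financial institutions",
    "Financial institutions": "Finance",
}


def broader_terms_with_level(term, broader_map):
    """Walk up the hierarchy and label each ancestor with its distance from `term`."""
    result = []
    level = 1
    seen = {term}
    current = broader_map.get(term)
    while current is not None and current not in seen:  # guard against cycles
        label = "BT" if level == 1 else f"BT{level}"
        result.append((label, current))
        seen.add(current)
        current = broader_map.get(current)
        level += 1
    return result


ancestors = broader_terms_with_level("Banks", broader)
```

For "Banks" this yields the two lines of the example entry: BT Financial institutions and BT2 Finance.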


Specific Hierarchies

There are thesauri that distinguish between different types of hierarchies:

  • generic relation (specific vs. generic terms): the narrower term is more specific than the broader term.

    Example:
      Vehicles
        NTG Motor vehicles
        NTG Bicycles

      Motor vehicles
        BTG Vehicles

      Bicycles
        BTG Vehicles

  • partitive relation: the narrower term is part of the broader term.

    Example:
      Motor vehicles
        NTP Engines

      Engines
        BTP Motor vehicles

[Figure: generic hierarchy Vehicles > Bicycles, Motor vehicles; partitive hierarchy Motor vehicles > Engines]


Association RT

RT (= Related Term) is a relation between two descriptors that is neither hierarchical nor an equivalence relation.

Different kinds of relations can be expressed as an association, e.g.

  • Descriptors that are at the same level in a hierarchy
      Diesel engine RT Otto engine
      Apple RT Pear
  • Descriptors that are part of a common thing
      Solothurn RT Aargau
  • Antonym (opposite)
      Heat RT Cold
  • Successor relation
      Father RT Son
  • Functional or causal relation
      Book RT Reading


Structure of the Thesaurus

The UNESCO thesaurus is organised into subject fields and microthesauri.

  • Field names
    • A field is a grouping of microthesauri.
    • A field name is preceded by a one-digit serial number.

  • Microthesaurus names
    • A microthesaurus is a grouping of descriptors and non-descriptors.
    • A microthesaurus name is preceded by a three-digit serial number; the first digit is the number of the subject field to which the microthesaurus belongs.

[Figure: example of a subject field and its microthesauri]
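
Because the first digit of the serial number names the subject field, the field can be read directly from a microthesaurus label. The sketch below assumes the dotted notation seen in the MT line of the Motor vehicles entry ("6.60 Equipment and facilities"); the function name is hypothetical.

```python
def split_microthesaurus(label):
    """Split a label such as '6.60 Equipment and facilities' into its parts.

    Returns (subject field number, serial number as written, microthesaurus name).
    """
    number, _, name = label.partition(" ")
    digits = number.replace(".", "")
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected a three-digit serial number, got {number!r}")
    field_number = int(digits[0])  # the first digit identifies the subject field
    return field_number, number, name


parts = split_microthesaurus("6.60 Equipment and facilities")
```

Here `parts` holds the subject field 6, the serial number "6.60" and the name "Equipment and facilities".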


Other Descriptor Information

Descriptors in the UNESCO thesaurus also contain:

  • Explanation
    • explains the use for which a descriptor is intended
    • explanations in the UNESCO thesaurus are called Scope Notes (SN)

  • Inclusion
    • reference between a descriptor and the microthesaurus to which it belongs
    • shown by the symbol MT

  • Linguistic equivalence
    • relation between descriptors designating the same concept in different languages
    • shown by the symbols of the language indicators


Standard Thesaurus Wirtschaft

[Figure: subthesauri; structure of the subthesaurus "Betriebswirtschaft"]

Searching for "Geldinstitut" finds the descriptor term "Bank".


3.2 Thesauri without preferred terms

Terms with similar meaning are represented as equivalence classes.

Example: WordNet

  • In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets); each synset expresses a distinct concept.
  • Synsets are interlinked by means of conceptual-semantic and lexical relations.

http://wordnet.princeton.edu/


Relations in WordNet

For each synset there are a number of relations to other synsets, e.g.

  • hyponym: more specific concepts (corresponds to narrower term NT)
  • hypernym: more general concepts (opposite of hyponym; corresponds to broader term BT)
  • part meronym: constituent parts of the concept (corresponds to narrower term partitive NTP)
  • holonym: opposite of meronym
  • domain category: classes the concept belongs to


WordNet: Displaying the value of relations


Example: OpenThesaurus

OpenThesaurus is an open source thesaurus for the German language.

[Figure: word group with synonyms for "Auto"]


3.3 Possible uses of a Thesaurus

Index with Controlled Vocabulary

  • Use thesaurus for indexing
    • providing a controlled vocabulary for manual indexing
    • storing only preferred terms (descriptors) in the index, e.g. in the attribute "keyword"

  • Use thesaurus for retrieval
    • the user can use the thesaurus to formulate a query: find preferred terms, and find broader or narrower terms if the query is not successful

Fulltext search

  • Use thesaurus for indexing
    • automatically store all synonyms as index terms
    • the thesaurus may still be helpful at retrieval, e.g. to find broader terms, narrower terms or related terms

  • Use thesaurus for retrieval
    • the index contains only terms occurring in the documents
    • the user can use the thesaurus to refine a query: find synonyms, broader terms, narrower terms or related terms if the query is not successful
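
The controlled-vocabulary variant can be sketched in a few lines: keywords are mapped to their descriptor before they are stored, and query terms are mapped the same way before lookup. This is only an illustration under simple assumptions (exact, case-sensitive term matching; an in-memory dictionary as index); the documents `d1` to `d3` and all function names are hypothetical.

```python
# Non-descriptor -> descriptor (USE relation); terms from the UNESCO example.
USE = {
    "Automobiles": "Motor vehicles",
    "Cars": "Motor vehicles",
    "Trucks": "Motor vehicles",
}


def to_descriptor(term):
    """Map a term to its preferred term; descriptors map to themselves."""
    return USE.get(term, term)


def build_index(documents):
    """Build keyword -> set of document ids, storing only preferred terms."""
    index = {}
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index.setdefault(to_descriptor(keyword), set()).add(doc_id)
    return index


def search(index, query_term):
    """Normalise the query term in the same way before looking it up."""
    return index.get(to_descriptor(query_term), set())


# Hypothetical documents with manually assigned keywords.
documents = {
    "d1": ["Cars"],
    "d2": ["Motor vehicles"],
    "d3": ["Trucks", "Road Transport"],
}
index = build_index(documents)
hits = search(index, "Cars")
```

The index then contains only "Motor vehicles" and "Road Transport" as keys, and a query for "Cars" finds all three documents because it is resolved to the descriptor first.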


Use of a Thesaurus (2)

The thesaurus can be used by humans or automatically:

  • Human
    • use the thesaurus as a reference book, electronically or conventionally (book)

  • Retrieval system
    • the system can suggest synonyms, broader terms or narrower terms automatically

  • Indexing system
    • automatically find synonyms and preferred terms
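
How a retrieval system might produce such suggestions can be sketched as a lookup in the thesaurus entry of the query term. The fragment below is an assumption-laden illustration: `THESAURUS` is a tiny hand-written fragment using terms from the slides (with the generic relation simplified to NT), and `suggest` is a hypothetical helper, not the interface of any real retrieval system.

```python
# A tiny hypothetical thesaurus fragment; terms are taken from the slides.
THESAURUS = {
    "Motor vehicles": {
        "UF": ["Automobiles", "Cars", "Trucks"],
        "BT": ["Vehicles"],
        "NT": [],
        "RT": ["Road Engineering", "Road Transport"],
    },
    "Vehicles": {
        "UF": [],
        "BT": [],
        "NT": ["Bicycles", "Motor vehicles"],
        "RT": [],
    },
}
# Derive the USE mapping (non-descriptor -> descriptor) from the UF entries.
USE = {nd: d for d, entry in THESAURUS.items() for nd in entry["UF"]}


def suggest(term):
    """Suggest alternative search terms, grouped by kind of relation."""
    descriptor = USE.get(term, term)
    entry = THESAURUS.get(descriptor)
    if entry is None:
        return {}  # term is unknown to the thesaurus
    # Synonyms: the descriptor and its non-descriptors, minus the query term itself.
    synonyms = [t for t in [descriptor] + entry["UF"] if t != term]
    return {
        "synonyms": synonyms,
        "broader": list(entry["BT"]),
        "narrower": list(entry["NT"]),
        "related": list(entry["RT"]),
    }


suggestions = suggest("Cars")
```

For the query term "Cars" the sketch proposes "Motor vehicles", "Automobiles" and "Trucks" as synonyms or quasi-synonyms and "Vehicles" as a broader term.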


Example from the UNESCO Thesaurus

The figure shows a descriptor in the UNESCO thesaurus. For the term "Motor vehicles" there are various synonyms, broader terms and related terms.

  • Use for indexing: The index must not contain the non-preferred terms "Automobiles", "Cars" and "Trucks", but only "Motor vehicles".

  • Use for keyword search: Searching for "Cars" does not provide a result. Looking up the thesaurus, we find that "Motor vehicles" is the corresponding descriptor term, which is used as index term.

  • Use for fulltext search: If searching for "Motor vehicles" provides too many results, we can use the thesaurus to find alternative search terms.


Example from the Standard Thesaurus Wirtschaft

The figure shows a descriptor in the Standard Thesaurus Wirtschaft. For the term "Bank" there are various synonyms, narrower terms and related terms.

  • Use for indexing: The index contains only the descriptor term "Bank", but not the corresponding non-descriptors.

  • Use for keyword search: Searching for "Kreditinstitut" does not provide a result. Looking up the thesaurus, we find that "Bank" is the corresponding descriptor term, which is used as keyword.

  • Use for fulltext search: If searching for "Bank" provides too many results, we can use the thesaurus to find alternative search terms.


Maintenance of a Thesaurus

Building and maintaining a thesaurus requires expertise and is time-consuming:

  • What terms are descriptors?
  • Are all synonyms included?
  • What is the correct relation between terms?
  • Avoiding inconsistencies

Thesauri are often constructed and maintained by trustworthy organisations.

Many thesauri cover a specific field of interest and contain general terms but no enterprise-specific terms (product names, projects etc.). Adding them requires maintenance effort.
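
One kind of inconsistency that can be detected automatically is a missing reciprocal entry, e.g. a BT without the matching NT. The sketch below assumes relations are stored as (term, relation, target) triples and that RT is treated as reciprocal, which is the usual thesaurus convention but is not stated on the slides; the table `INVERSE`, the function name and the deliberately incomplete sample data are illustrative.

```python
# Reciprocal relation pairs: if A --rel--> B exists, B --inverse--> A should too.
# Treating RT as its own inverse is an assumption of this sketch.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT", "USE": "UF", "UF": "USE"}


def find_missing_reciprocals(relations):
    """Return the reciprocal (term, relation, target) triples that are absent.

    `relations` is a set of (term, relation, target) triples.
    """
    missing = []
    for term, relation, target in sorted(relations):
        inverse = INVERSE.get(relation)
        if inverse is not None and (target, inverse, term) not in relations:
            missing.append((target, inverse, term))
    return missing


# Hypothetical, deliberately incomplete fragment.
relations = {
    ("Banks", "BT", "Financial institutions"),
    ("Financial institutions", "NT", "Banks"),
    ("Financial institutions", "BT", "Finance"),  # reciprocal NT is missing
    ("Cars", "USE", "Motor vehicles"),
    ("Motor vehicles", "UF", "Cars"),
}
missing = find_missing_reciprocals(relations)
```

On this fragment the check reports that the entry "Finance NT Financial institutions" is missing.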