Module 7b: Extracting/Controlling Terms and Semantic Relationships IMT530: Organization of Information Resources Winter 2007 Michael Crandall


Module 7b: Extracting/Controlling Terms and Semantic Relationships

IMT530: Organization of Information Resources

Winter 2007

Michael Crandall


Steps in Constructing CVs

• Define your domain
• Gather concepts
  – From user interviews, search logs, content analysis, preexisting vocabularies
• Select your approach
• Extract terminology
• Control your terms
• Organize your terms
• Maintain, maintain, maintain


Elements of Building CVs

• Select your approach
  – Pre- or post-coordinated (sixteenth century lute music or sixteenth century and lutes and music)
  – Open or closed (indexers can add terms or not)
  – Enumeration vs. synthesis (facets)
• Extract terms
  – Warrant (from users or domain or both)
• Control terms
  – Specificity (cats or Siamese cats?)
  – Control of homographs (qualifications)
  – Term consistency and word form (plurals, etc.)
  – Multiword/phrase sequence and form (inverted, normal form?)
  – Term definitions (scope notes)
  – Syntax (citation order)
  – Semantic factoring
• Organize terms
  – Semantic relationships


Extracting Terminology


Sources and Origins of Terminology

• Where do you get terms for a controlled vocabulary?

• Sources and origins of terminology may come from explicit statements of warrant

• Making a conscious decision about warrant demonstrates that as a CV designer you are aware of the different possibilities and have made considered choices


Warrant

• Warrant is “the authority that is used to justify decisions about what is included in a system” (Clare Beghtol)
• Types of warrant (Beghtol, 2002):
  – Literary warrant
  – User warrant
  – Scholarly warrant
  – Cultural warrant


Literary & User warrant

• Literary Warrant
  – terms or organization reflect or are taken directly from resources themselves; this includes dictionaries, encyclopedias, etc. on a topic
• User (aka Use, Enquiry) Warrant
  – terms or organization reflect use; user terminology may (or may not) be taken directly from logs of system use or from personal interactions with users


Scholarly & Cultural Warrant

• Scholarly Warrant
  – terms or organization reflect the opinions of a panel of human experts
• Cultural Warrant
  – terms or organization derived from cultural practice or understanding; for example, Dewey and LCSH reflect American/Western cultural bias; Colon Classification reflects Indian/Eastern cultural bias (this also can be partly a function of literary warrant…)


Term Control


Term control

– Specificity (cats or Siamese cats?)
– Control of homographs (qualifications)
– Term consistency and word form (plurals, etc.)
– Multiword/phrase sequence and form (inverted, normal form?)
– Term definitions (scope notes)
– Syntax (citation order)
– Semantic factoring


Specificity

• Depends on user needs and time available
• Should be consistent throughout CV to avoid user confusion
• May be influenced by choice of approach
  – If faceted, some facets may be more specific than others
  – If hierarchical, you should be consistent throughout


Homographs

• Sometimes a single word or phrase has multiple meanings: e.g., “power”, “drum”, “Java”, “Jupiter”
• Controlled vocabularies “disambiguate” these terms to make each term have a single meaning
  – In thesauri & subject heading lists, parenthetical qualifiers are added, e.g. these LCSH terms: “Power (Mechanics)”; “Power (Christian theology)”; “Power (Social Sciences)”; “Power (Philosophy)”
  – In taxonomies and classifications, the meaning of homographs is contextualized by placement in a particular hierarchy (following the example above, Power will appear in the Philosophy, Christianity, Social Sciences, and Mechanics hierarchies, and the terms themselves, by virtue of their location (thus, different notation), will be disambiguated)
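The qualifier approach can be sketched in a few lines of Python. This is only an illustration: the `QUALIFIED_TERMS` table and both function names are hypothetical, and the headings are the four LCSH examples quoted above.

```python
# Hypothetical lookup table: an ambiguous word maps to its qualified headings.
QUALIFIED_TERMS = {
    "power": [
        "Power (Mechanics)",
        "Power (Christian theology)",
        "Power (Social Sciences)",
        "Power (Philosophy)",
    ],
}

def disambiguate(word):
    """Return the qualified headings a searcher must choose among."""
    return QUALIFIED_TERMS.get(word.lower(), [word])

def qualifier(heading):
    """Extract the parenthetical qualifier, or None if the heading has none."""
    if heading.endswith(")") and " (" in heading:
        return heading[heading.rindex(" (") + 2:-1]
    return None
```

A word with no entry in the table is returned unchanged, on the assumption that it has only one meaning in the vocabulary.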


Word Form

• Single word form should be consistent
  – Choose verbs or nouns
  – Singular or plural
  – Standard form
• Phrases should be standard form
  – Either direct (Constitutional government)
  – Or inverted (government, constitutional)
    • Inverted form allows closer grouping of like terms in an alphabetic display, but is not used much anymore


Scope Notes

• Scope notes are term definitions in a thesaurus or controlled vocabulary

• Scope notes are useful for indexers to let them know what the precise meaning of the term is; and for users to help them know if they are searching on the correct term


Syntax

• Syntax describes how terms are built (especially, how multiple concepts may be combined), and citation order (order of facets)
  – Syntax is an issue when concepts are pre-coordinated in an indexing term (whether the syntax is consistent or not)
  – Syntax is an issue for CVs that use synthesis with facets in that rules for synthesis (also called citation order in classification schemes) determine term syntax
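As a rough sketch of how a citation order fixes term syntax, the Python below synthesizes a pre-coordinated term from facet values. The facet names (`period`, `instrument`, `form`) and the function `synthesize` are assumptions made for illustration; only the lute music example comes from the earlier slide.

```python
# Hypothetical facets, listed in the order they must be cited.
CITATION_ORDER = ["period", "instrument", "form"]

def synthesize(facets, order=CITATION_ORDER):
    """Build a pre-coordinated term by citing facet values in a fixed order."""
    unknown = set(facets) - set(order)
    if unknown:
        raise ValueError(f"facets not in citation order: {sorted(unknown)}")
    # The input order is irrelevant; the citation order alone decides syntax.
    return " ".join(facets[name] for name in order if name in facets)

term = synthesize(
    {"form": "music", "period": "sixteenth century", "instrument": "lute"}
)
```

Because every indexer applies the same order, the same combination of concepts always yields the same term string.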


Semantic Factoring

• “The process of analyzing some or all of the categories of an ontology into a collection of primitives” Sowa, J. F. (2003). Ontology. Glossary. http://www.jfsowa.com/ontology/gloss.htm

• Essentially, you are trying to decompose terms into their elemental concepts, to minimize duplication and maximize reuse
  – For example: ship = vehicle + water transport
  – Not always possible, especially with non-concrete concepts

• “Creating a thesaurus without doing semantic factoring is like trying to put together furniture from Ikea without following the instructions. You will get interesting configurations, but you will not save time.” Ezzo, J. (2005) Bella and Yakov and Tillie's Panties: What I Learned in “Construction and Maintenance of Indexing Languages and Thesauri” Bulletin of the American Society for Information Science and Technology 31(4) April/May 2005. http://www.asis.org/Bulletin/Apr-05/ezzo.html
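One way to picture semantic factoring is to store each term as a set of primitives and then ask which terms reuse a given primitive. In this sketch only the factoring of "ship" comes from the slide; the other two entries and the function names are hypothetical.

```python
FACTORS = {
    "ship": frozenset({"vehicle", "water transport"}),           # from the slide
    "truck": frozenset({"vehicle", "road transport"}),           # hypothetical
    "canal": frozenset({"infrastructure", "water transport"}),   # hypothetical
}

def primitives(factors):
    """All elemental concepts used by at least one term."""
    return set().union(*factors.values())

def terms_sharing(primitive, factors):
    """Terms whose factoring reuses the given primitive."""
    return sorted(t for t, parts in factors.items() if primitive in parts)
```

Here three terms are covered by four primitives, and each shared primitive ("vehicle", "water transport") is defined once and reused.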


Relationships in CVs


Relationships in Controlled Vocabularies

• There are three major types of relationships between subject concepts

  – Equivalence Relationships
  – Hierarchical Relationships
  – Associative Relationships


Equivalence Relationships

• In natural language one word or phrase can refer to one or more concepts; and multiple terms can refer to a single concept

• In other words, there is no one-to-one correspondence between words/phrases and concepts


Preferred Terms and Cross references (Synonyms)

• Controlled vocabularies create a one-to-one relationship between terms and concepts by controlling synonyms: multiple words or phrases that share a similar meaning

• To do this we:
  – Select a preferred term (descriptor, subject heading)
  – Create cross references from non-preferred terms (entry vocabulary, lead-in terms)


Example Equivalence Display

• Sample display for descriptor (preferred term) “Creativity” from the ERIC Thesaurus:

    Creativity
      UF Creative ability
         Originality

• If you searched on “Originality” or “Creative ability” in the ERIC database, you would see these references:
  – “Creative ability” see “Creativity” OR
  – “Originality” use “Creativity”
• In other words, you would be led from the unused (lead-in) terms to the used (preferred) term.
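The lead-in mechanism can be sketched as a lookup table. The Python below inverts the UF references from the Creativity display into USE references; the dictionary and function names are hypothetical, and case-insensitive matching is an assumption rather than a documented ERIC behavior.

```python
# UF references as they appear in the preferred term's display.
USED_FOR = {"Creativity": ["Creative ability", "Originality"]}

# Invert UF into USE: each lead-in term points to its preferred term.
USE = {
    lead_in.lower(): preferred
    for preferred, lead_ins in USED_FOR.items()
    for lead_in in lead_ins
}

def resolve(term):
    """Return the preferred term for a query, or None if the term is unknown."""
    key = term.lower()
    for preferred in USED_FOR:
        if preferred.lower() == key:
            return preferred
    return USE.get(key)
```

Searching on either lead-in term returns "Creativity", while a term outside the vocabulary returns `None` so the caller can decide how to handle it.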


Equivalence Relationships - Summary

• Exist between words or phrases that share the same (or similar) meaning

• Equivalent terms are considered synonymous (whether they actually are or are not)

• When controlling vocabulary, one equivalent term is selected as a preferred term (e.g., descriptor); the other equivalent terms are treated as “lead in” terms or cross references

• References used in the CV to show equivalence relationships include “UF” (used for), “Use”, “See”, and “Search under”


Hierarchical Relationships

• Hierarchical Relationships:
  – May be strictly defined as:
    • Genus-species (also called class inclusion or “is-a”) relationships
    • Whole-part relationships (sometimes these are treated as associative relationships)


Hierarchical Relationships

• Hierarchical Relationships:
  – May be illustrated by set notation: Set G (green) is a subset of Set B (blue)
  – All Gs are also Bs (in other words, a G is a B)
  – Using a real-world analogy, if Gs are gorillas, and Bs are animals, all gorillas are animals
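The set-notation view maps directly onto Python's built-in sets. The individual members below are hypothetical placeholders; the point is only that "a G is a B" holds in one direction and not the other.

```python
# Set G: gorillas (hypothetical individual members).
gorillas = {"gorilla 1", "gorilla 2"}

# Set B: animals, which contains every gorilla plus other animals.
animals = gorillas | {"robin 1", "salmon 1"}

def is_a(narrower, broader):
    """Genus-species holds when every member of one set is in the other."""
    return narrower <= broader

all_gorillas_are_animals = is_a(gorillas, animals)
all_animals_are_gorillas = is_a(animals, gorillas)
```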


Ideal CV Hierarchical Relationships

• Ideally, all hierarchical relationships indicated in a controlled vocabulary are also controlled and defined as genus-species (and sometimes also whole-part) relationships

• ALL other relationships between terms are associative relationships

• In real life CVs, this is not always the case!


References for Hierarchical Relationships

• Hierarchically related terms are shown by the BT (broader term), NT (narrower term), and sometimes See also/Search also references.
• Examples of two entries in the ERIC thesaurus:

    Creativity
      BT Psychological characteristics

    Psychological characteristics
      NT Creativity
         Intelligence
         Cognitive style
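BT and NT references mirror each other, so one can be derived from the other. The sketch below stores only BT links for the three terms in the display above and generates the NT side; it assumes each term has a single broader term, and the function names are hypothetical.

```python
from collections import defaultdict

# Each narrower term points to its one broader term (a simplifying assumption).
BT = {
    "Creativity": "Psychological characteristics",
    "Intelligence": "Psychological characteristics",
    "Cognitive style": "Psychological characteristics",
}

def narrower_terms(bt):
    """Derive the reciprocal NT references from the BT references."""
    nt = defaultdict(list)
    for narrow, broad in bt.items():
        nt[broad].append(narrow)
    return dict(nt)

def display(term, bt, nt):
    """Render one preferred-term entry with its BT and NT references."""
    lines = [term]
    if term in bt:
        lines.append(f"  BT {bt[term]}")
    for narrow in nt.get(term, []):
        lines.append(f"  NT {narrow}")
    return "\n".join(lines)
```

Deriving one direction from the other keeps the two displays from drifting out of step during maintenance.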


BTs & NTs

• In the previous slide, both Creativity and Psychological characteristics are preferred terms

• Each has its own display; the Creativity display (Creativity as a preferred term display) shows the reference to the broader, preferred term “Psychological characteristics”


Testing for Hierarchical Relationships

• To test for a hierarchical relationship between terms, use the ‘is-a’ test.

• The relationship between “robin” and “bird”? (A robin is a (type of) bird, so the relationship is hierarchical; Bird is the broader term, Robin is the narrower)

• The relationship between Water and Hydronomy? (Water is not a hydronomy or a type of hydronomy; Hydronomy is not a water or a type of water; so the relationship here is an associative relationship)


Examples of Hierarchical Relationships

• What is the relationship between these sets of terms?
  – books and library materials
  – water and floods
  – buildings and chimneys
  – painting and acrylic paints
  – water and groundwater


Answers

• Books and library materials (hierarchical)
• Water and floods (associative, because a flood is not the same type of thing as water; one way you can tell is that one is a count noun and the other is not. Hierarchical may be acceptable depending on context)
• Buildings and chimneys (hierarchical if you include whole-part relationships; associative if you don’t)
• Painting and acrylic paints (associative)
• Water and groundwater (hierarchical)


More on Hierarchical Relationships

• A characteristic of the relationship between terms that are strictly hierarchically related (genus-species only, not whole-part) is Hierarchical Force
• When a narrower term (NT) is hierarchically related to a broader term, it inherits all of the characteristics of the terms above it in the hierarchy
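Hierarchical force can be sketched as inheritance along the chain of broader terms. In the Python below, the hierarchy reuses the robin, bird, gorilla, and animal examples from the slides, but the characteristics attached to each term are hypothetical, as are the function names.

```python
# Genus-species links only: narrower term -> broader term.
BROADER = {"Robin": "Bird", "Bird": "Animal", "Gorilla": "Animal"}

# Hypothetical characteristics stated once, at the level where they apply.
CHARACTERISTICS = {
    "Animal": {"living organism"},
    "Bird": {"has feathers"},
    "Robin": {"red breast"},
}

def ancestors(term):
    """Broader terms, from nearest to most general."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        if term in chain:
            raise ValueError("cycle in hierarchy")
        chain.append(term)
    return chain

def inherited(term):
    """A term's own characteristics plus those of every term above it."""
    result = set(CHARACTERISTICS.get(term, set()))
    for broad in ancestors(term):
        result |= CHARACTERISTICS.get(broad, set())
    return result
```

Robin picks up "has feathers" from Bird and "living organism" from Animal without either being restated on Robin. This only works when every link passes the "is-a" test, which is why whole-part links are excluded.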


Associative Relationships

• Include all relationships not encompassed by equivalence and hierarchical relationships

• In Controlled Vocabularies, these relationships are shown by the following references:
  – Related Term (RT), see also (SA)
• Examples of types of associative relationships (there are many of these!):
  – Thing and property (rubber, elasticity)
  – Complementary activities (teaching, learning)
  – Agent and activity (artist, painting)


Associative Relationships

• Many of these are semantic relationships
• Some of these are syntactic relationships too:
  – Children see related term Games

• Problems – when to stop? How close in meaning or syntactic relation do two terms have to be to show them in a CV?

• Note: associative relationships are rarely shown in classifications & taxonomies


Example Associative Relationship Display

• From the ERIC thesaurus:

    Comprehension
      RT Concept formation
         Misconceptions
         Scientific literacy
         Thinking skills

• Again, remember that both Comprehension and all of the RTs are preferred terms; however, this is the display for the preferred term Comprehension
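A related-term table can be sketched as a symmetric mapping between preferred terms. The pairs below come from the Comprehension display; treating every RT reference as reciprocal is an assumption of this sketch (thesaurus standards generally expect it, but the slide does not show the reverse entries), and `build_rt` is a hypothetical name.

```python
from collections import defaultdict

# RT pairs taken from the Comprehension display.
RT_PAIRS = [
    ("Comprehension", "Concept formation"),
    ("Comprehension", "Misconceptions"),
    ("Comprehension", "Scientific literacy"),
    ("Comprehension", "Thinking skills"),
]

def build_rt(pairs):
    """Record each RT reference in both directions (assumed reciprocal)."""
    rt = defaultdict(set)
    for a, b in pairs:
        rt[a].add(b)
        rt[b].add(a)
    return rt

rt = build_rt(RT_PAIRS)
```

With this table, the display for "Thinking skills" would list "Comprehension" as an RT even though only one direction was entered.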


Some Guidelines

• Does the taxonomy cover the domain appropriately?

• Is it within scope?
• Do draft definitions for concepts express them clearly?
• Are duplicate concepts removed?
• Are basic-level concepts represented?
• Does extracted terminology express them?
• Is the structure useful and sensible?


Questions?

• If not, take a break!!!


Exercise 7b

• Take your term lists from last week, and use those in Exercise 7b to begin building a controlled vocabulary

• Turn in your initial controlled vocabularies before Tuesday via email