1

Synonyms & Taxonomies

Thesaurus Design for Information Architects

an ACIA Seminar

by Peter Morville & Samantha Bailey

2

Introductions

Peter Morville (morville@argus-inc.com)

• CEO, Argus Associates
• Co-author, Information Architecture for the World Wide Web
• Director, ACIA
• LIS background
• Fortune 500 consulting

3

Introductions

Samantha Bailey (bailey@argus-inc.com)

• VP of Operations, Argus Associates
• LIS background
• Fortune 500 consulting
• VC experience

4

Seminar Outline
I. Thesauri in Context
II. Value of Thesauri
III. Methodology
IV. Metadata
V. Vocabulary Control
VI. Structure & Relationships
VII. Thesaurus Management
VIII. Case Study
IX. Related Topics

Instructional Methods
Exercises, Quizzes, Discussions, Breaks

5

Our Approach

Assumptions
• Understanding of IA Basics
• Interest in Thesauri and the Web

Philosophy
• Reality is Important
• Technology has Limitations
• Success takes Time
• Tension can be Healthy

6

Thesauri in Context

What is IA?

The art and science of structuring and organizing information systems to help people achieve their goals.

7

Thesauri in Context

An Ecological Approach

Books: Information Ecologies by Bonnie Nardi and Information Ecology by Thomas Davenport

[Diagram: Content, Business Context, Users]

8

Thesauri in Context

IA From Top to Bottom

Top-Down        Bottom-Up
portal          sub-site
strategy        objects
hierarchy       metadata
primary path    multiple paths

[Diagram: portal; local subsites (HR, Engineering, R&D…); Object X with fields Name, Product Category, Topic, Stale Date, Author, Security]

9

Thesauri in Context

Where Does IA Fit?

The Elements of User Experience, Jesse James Garrett
http://www.jjg.net/ia/elements.pdf

10

Thesauri in Context

What is Vocabulary Control?

Controlled Vocabulary

A list of preferred and variant terms.

A subset of natural language.

Preferred     Variants                          Authority
AZ            Ariz, Arizona, 85XXX              US Postal Service
IBM           Intl Bus Machines, Big Blue       NY Stock Exchange
Nyctalopia    Night blindness, Moon blindness   National Library of Medicine
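The table above can be read as a lookup from variant to preferred term. The following is a minimal Python sketch of that idea, not a description of any particular product. The names `CONTROLLED_VOCABULARY`, `build_lookup`, and `to_preferred` are hypothetical, and the ZIP-code pattern (85XXX) is left out because it would need pattern matching rather than a plain lookup.

```python
# Hypothetical controlled vocabulary: preferred term -> list of variants.
# Terms are taken from the slide; the data structure is an assumption.
CONTROLLED_VOCABULARY = {
    "AZ": ["Ariz", "Arizona"],
    "IBM": ["Intl Bus Machines", "Big Blue"],
    "Nyctalopia": ["Night blindness", "Moon blindness"],
}


def build_lookup(vocabulary):
    """Map every lower-cased variant (and preferred term) to its preferred term."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


def to_preferred(term, lookup):
    """Return the preferred term, or None when the term is not in the vocabulary."""
    return lookup.get(term.strip().lower())


LOOKUP = build_lookup(CONTROLLED_VOCABULARY)
```

Calling `to_preferred("Big Blue", LOOKUP)` yields the preferred term for that variant, and a term outside the vocabulary yields `None`, which a real system might route to a review queue for candidate terms.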

11

Thesauri in Context

Why Control Vocabulary?

Language is Ambiguous
• Synonyms, homonyms, antonyms, contronyms, etc.

In the Oxford English Dictionary:
• “Round” takes 7 ½ pages or 15,000 words to define.
• “Set” has 58 uses as a noun, 126 as a verb, 10 as an adjective.

The Mother Tongue: English & How It Got That Way by Bill Bryson

12

Thesauri in Context

Why Control Vocabulary?

So Your Users Don’t Have To!

[Diagram: Users, Communication Chasm, Documents and Applications]

Example: Personal Digital Assistant
Synonyms: Handheld Computer
"Alternate" Spellings: Persenal Digitel Asistent
Abbreviations / Acronyms: PDA
Broader Terms: Wireless, Computers
Narrower Terms: PalmPilot, PocketPC
Related Terms: WindowsCE, Cell Phones

13

Thesauri in Context

Semantic Relationships

Types
1. Equivalence
2. Hierarchical
3. Associative

[Diagram: the preferred term Vermont and its relationships]
1. Equivalence: (Variant) Green Mountain State; (Variant) Vt
2. Hierarchical: (Broader) United States; (Narrower) Burlington
3. Associative: (Related) Skiing; (Related) Maple Syrup

14

Thesauri in Context

Levels of Control

From simple to complex:
(Vocabularies) Synonym Rings, Authority Files, Classification Schemes, Thesauri
(Relationships) Equivalence, Hierarchical, Associative

15

Thesauri in Context

What is a Thesaurus?

Traditional Use
• Dictionary of synonyms (Roget’s)
• From one word to many words

Information Retrieval Context
• A controlled vocabulary in which equivalence, hierarchical, and associative relationships are identified for purposes of improved retrieval
• Many words to one concept

16

Thesauri in Context

Terminology

Preferred Terms (UF subject headings, descriptors)
SN  Scope Notes
UF  Used For
BT  Broader Term
NT  Narrower Term
RT  Related Terms (“See Also”)

Variant Terms (UF non-preferred, entry terms)
USE (“See”)
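One way to picture this notation is as a record per preferred term. The sketch below is a hypothetical Python data structure (the `Term` class and its field names are assumptions, not part of any standard) populated with the Vermont example from the earlier Semantic Relationships slide.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term with its thesaurus relationships (hypothetical model)."""
    name: str
    scope_note: str = ""                          # SN
    used_for: list = field(default_factory=list)  # UF: variant terms
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT

    def display(self):
        """Render the entry in the conventional SN/UF/BT/NT/RT layout."""
        lines = [self.name]
        if self.scope_note:
            lines.append(f"  SN {self.scope_note}")
        for label, values in (
            ("UF", self.used_for),
            ("BT", self.broader),
            ("NT", self.narrower),
            ("RT", self.related),
        ):
            lines.extend(f"  {label} {value}" for value in values)
        return "\n".join(lines)


vermont = Term(
    name="Vermont",
    used_for=["Green Mountain State", "Vt"],
    broader=["United States"],
    narrower=["Burlington"],
    related=["Skiing", "Maple Syrup"],
)
```

`print(vermont.display())` lists the preferred term followed by one indented line per relationship, which is roughly how a printed thesaurus entry is laid out.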

17

Thesauri in Context

Types of Thesauri

                          Used in Indexing: No    Used in Indexing: Yes
Used in Searching: No     Natural Language        Indexing Thesaurus
Used in Searching: Yes    Searching Thesaurus     Classic Thesaurus

18

Thesauri in Context

Visibility

Classic Use
• Both indexers and searchers explicitly map natural language terms onto controlled vocabularies

Web Environment
• Able to choose level of visibility (implicit use, thesaural browsers)
• Opportunity to educate users (terminology, associative learning)

19

Thesauri in Context

Niche Applications (hypothetical example)

Product Catalog: multiple views enabled by thesaurus

Technical Support Database: entry vocabulary maps problems to solutions

Searching Thesaurus: implicit term explosion manages synonyms
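The "implicit term explosion" of a searching thesaurus can be sketched as query expansion over a synonym ring. The Python below is an illustrative assumption, not how any named product works: `SYNONYM_RING`, `expand_query`, and `search` are hypothetical names, and the ring reuses the Personal Digital Assistant example from the earlier slide.

```python
import re

# Hypothetical synonym ring: every member is treated as equivalent at search time.
SYNONYM_RING = {"personal digital assistant", "handheld computer", "pda"}


def expand_query(query, rings):
    """Return all equivalent terms for the query, or just the query itself."""
    normalized = query.strip().lower()
    for ring in rings:
        if normalized in ring:
            return sorted(ring)
    return [normalized]


def search(query, documents, rings):
    """Return documents containing any expanded term as a whole word or phrase."""
    terms = expand_query(query, rings)
    patterns = [re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE) for term in terms]
    return [doc for doc in documents if any(p.search(doc) for p in patterns)]
```

The user types one term and never sees the expansion. Word-boundary matching is used so that a short acronym such as "pda" does not match inside an unrelated word such as "update".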

20

Thesauri in Context

Thesaurus Standards

Mono-Lingual Thesauri
• ISO 2788 (1974, 1985, 1986, International)
• BS 5723 (1987, British)
• AFNOR NFZ 47-100 (1981, French)
• DIN 1463 (1987-1993, German)
• ANSI/NISO Z39.19 (1994, United States)

Multi-Lingual Thesauri
• ISO 5964 (1985, International)

21

Thesauri in Context

ANSI/NISO Standard Z39.19-1993

Guidelines for the Construction, Format, and Management of Monolingual Thesauri.
84 pp. ISBN: 1-880124-04-1 Price: $49.00
http://www.niso.org/stantech.html

Reasons to Follow Standard
• Significant thinking behind guidelines
• Technology integration
• Cross-database compatibility

22

Thesauri in Context

Oracle’s Perspective

“The phrase…thesaurus standard is somewhat misleading. The computing industry considers a ‘standard’ to be a specification of behavior or interface. These standards do not specify anything. If you are looking for a thesaurus function interface, or a standard thesaurus file format, you won't find it here. Instead, these are guidelines for thesaurus compilers -- compiler being an actual human, not a program.

What Oracle has done is taken the ideas in these guidelines and in ANSI Z39.19…and used them as the basis for a specification of our own creation…So, Oracle supports ISO-2788 relationships or ISO-2788 compliant thesauri.”

23

Thesauri in Context

A World in Transition

“The majority of basic problems of thesaurus construction had already been solved by 1967.” (Krooks and Lancaster, 1993)

Traditional Thesauri     Web Thesauri
Print                    Online
Academic / Library       Business
Expert / Repeat Users    Novice / Infrequent Users
Visible                  Invisible
Accepted Value           Unknown Value

24

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

25

Value of Thesauri

IA Metrics

• Cost of finding (time, clicks, frustration, precision).

• Cost of not finding (success, recall, frustration, alternatives).

• Cost of development (time, budget, staff, frustration).

• Value of learning (related products, services, projects, people).

26

Value of Thesauri

KM Metrics

• Revenue Generation (% revenues spent on KM, new revenue generation)

• Opportunity Cost (staff time, customers lost)

• Knowledge Efficiency (faster product development, # mistakes made twice)

• Data Quality (% knowledge on intranet, % email with attachments)

• Intranet Usage (# hits, # contributions)

• Individual Behavior (# citations)

• Technical Performance (uptime, search response time)

Working Council for Chief Information Officers

Basic Principles of Information Architecture

(http://www.cio.executiveboard.com)

27

Value of Thesauri

Web Site Statistics

Wasted expense: most sites will waste between $1.5M and $2.1M on redesigns next year.

Forfeited revenue: poorly architected retailing sites are underselling by as much as 50%.

Lost customers: the sites we tested are driving away up to 40% of repeat traffic.

Eroded brand: people who have a bad experience typically tell 10 others.

Forrester Research Why Most Web Sites Fail (Sept 98)

28

Value of Thesauri

Intranet Statistics

Employees spend 35% of productive time searching for information online.

Working Council for Chief Information Officers

Basic Principles of Information Architecture

(http://www.cio.executiveboard.com)

Managers spend 17% of their time (6 weeks a year) searching for information.

Information Ecology

Thomas Davenport and Lawrence Prusak

(http://argus-acia.com/content/review001.html)

29

Value of Thesauri

Intranet Statistics

Sun Microsystems’ usability experts calculated that 21,000 employees were wasting an average of six minutes per day due to inconsistent intranet navigation structures. When lost time was multiplied by staff salaries, the estimated productivity loss exceeded $10 million per year.

Jakob Nielsen

Web Design and Development

September 1997
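The Sun Microsystems figure can be sanity-checked with simple arithmetic. The Python sketch below uses the slide's numbers (21,000 employees, six minutes per day, a loss exceeding $10 million per year) and one assumption that is not in the source: roughly 250 working days per year. It shows how many hours are lost and what average hourly cost the $10 million figure would imply.

```python
EMPLOYEES = 21_000             # from the slide
MINUTES_LOST_PER_DAY = 6       # from the slide
ANNUAL_LOSS_USD = 10_000_000   # the slide says the loss "exceeded" this amount
WORKING_DAYS_PER_YEAR = 250    # assumption, not stated in the source

# Total staff hours lost per year across the whole company.
hours_lost_per_year = EMPLOYEES * MINUTES_LOST_PER_DAY / 60 * WORKING_DAYS_PER_YEAR

# Average loaded hourly cost that would make the loss equal $10 million.
implied_hourly_cost = ANNUAL_LOSS_USD / hours_lost_per_year

print(f"Hours lost per year: {hours_lost_per_year:,.0f}")
print(f"Implied hourly cost: ${implied_hourly_cost:.2f}")
```

Under that assumption the loss works out to 525,000 staff hours per year, so any average loaded cost above about $19 per hour puts the total over $10 million. The same template can be reused with an organization's own headcount and salary data.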

30

Value of Thesauri

Intranet Statistics

After spending two years and $3 million on development and usability testing, Bay Networks expects to see $10 million in productivity gains and a 10 percent cycle-time reduction for new product development as a result of its new information architecture.

Working Council for Chief Information Officers

Basic Principles of Information Architecture

(http://www.cio.executiveboard.com)

31

Value of Thesauri

Intranet Statistics

40% of corporate users can’t find the information they need on their intranet.

Prior to intranet reengineering in 1997, Ford conducted a survey of its 100,000+ user base. Employees stated they could only find 15% of the information they needed to do their jobs.

Under-investment in (unstructured) information. 80% spending on 20% (structured) data.

Working Council for Chief Information Officers

Basic Principles of Information Architecture

(http://www.cio.executiveboard.com)

32

Value of Thesauri

Searching Problems

“Most of the complaints we get are due to the way users search – they use the wrong keywords.”

- a manufacturing company

“We have problems with the way customers enter queries. Capitalizations and misspellings give us headaches.”

- a software company

Forrester Research

Must Search Stink? (June 2000)

33

Value of Thesauri

Searching Statistics

“Search will become the center piece of navigation.”

90% of firms rate search as very or extremely important.

52% don’t measure search effectiveness.

Forrester Research

Must Search Stink? (June 2000)

34

Value of Thesauri

CV Statistics

Researchers at Bell Labs found the probability that two people would choose the same word to describe an object to be less than 20%.

Furnas, Landauer, et al., Bell Labs (1987)

30% of corporations systematically utilize metadata to classify information, while only one to three percent of companies populate those metadata tags using controlled vocabularies.

71% don’t account for misspellings or synonyms.

Forrester Research
Building an Intranet Portal (Jan 1999)

35

Value of Thesauri

CV Statistics

Principle of unlimited aliasing: by leveraging synonyms, recall went from 20% to 80% (in a small collection).

The Trouble with Computers
Research study at Bellcore (Furnas et al. 1987)

“The findings indicate that a hypertext index with multiple access points for each concept…led to greater effectiveness and efficiency of retrieval on almost all measures.”

A Usability Assessment of Online Indexing Structures By Carol A. Hert, Elin K. Jacob, and Patrick Dawson

Journal of the American Society for Information Science (September 2000)

36

Value of Thesauri

Complementary Approaches

Basic
• Navigation Design (Browsing)
• Full Text Indexing (Searching)

Advanced
• Collaborative Filtering
• Lexical Databases
• Automated Hierarchy-Generation

37

Value of Thesauri

Navigation Design

Relationships
• Global & Local (hierarchical)
• Contextual (associative)

[Diagram: page layout with Global Navigation, Local Navigation, and a content area labeled "Content is here, with contextual navigation embedded or separate." The layout answers three questions: Where am I? What's nearby? What's related to what's here?]

38

Value of Thesauri

Full Text Indexing

Strengths
• Enables high precision (exact phrase)
• Enables high recall (word occurrence)

Weaknesses
• Often results in low precision (“aboutness”)
• Often results in low recall (synonyms)

Complementary Use
• Provide users with option (search CV, full text)
• Intelligent next step (no hits on CV > full text)
• Full text search within CV search zones

39

Value of Thesauri

Collaborative Filtering

SN. Approaches that leverage knowledge about preferences or behaviors of people or organizations to facilitate information retrieval.

Popularity / Importance
• Direct Hit (analysis of searcher behavior)
• Amazon (cross-title purchasing habits)
• Google (citation indexing)

Considerations
• Favors established materials
• Lacks benefits of vocabulary control
• User-centric (ignores content, context)

40

Value of Thesauri

Lexical Databases

Scope Notes
• Broad term banks or semantic networks that specify lexical variants and term relationships.
• General-interest, off-the-shelf thesauri.

Examples
• Roget’s Thesaurus
• WordNet
• Plumb Design Visual Thesaurus

41

Value of Thesauri

Lexical Databases

Number of Terms (General, Niche)
Importance of Context (Bug in Software, Espionage)

                             # of Terms    # of Meanings    Notes
WordNet                      50,000        70,000
Oxford English Dictionary    615,000       2.4M             > 20,000 New Terms Per Year
Named Insect Species         1.4M                           Drosophila UF Fruit Fly
Square D Products            300,000                        Electrical Distribution

42

Value of Thesauri

Hierarchy-Generation Software

An Intimidating Vocabulary
• Multivariate regression models, probabilistic Bayesian models, neural networks, symbolic rule learning, computational semiotics, and support vector machines

General Techniques
• Clustering (similarity, word co-occurrence)
• Vector Space (extract “meaning” from terms, teach by example)

43

Value of Thesauri

Hierarchy-Generation Software

Examples
• Autonomy (http://www.autonomy.com/)
• Semio (http://www.semio.com/)
• Cartia (http://www.cartia.com/)

Hyperbole

Autonomy claims their software eliminates "the need for any manual labor in the process."

44

Value of Thesauri

Hierarchy-Generation Software

Considerations
• No business context
• No consideration of users
• No planning for future
• Mixed category schemes
• Hidden costs (integration, rule design, training)

Trends
• Niche use (e.g., news, web search results)
• Integration with manual classification schemes

[Diagram: Content, Business Context, and Users, with two of the three marked X]

45

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

46

Methodology

Overview

[Diagram: the phases Strategy, Design, and Build set against Process, Deliverables, and Consulting; a marker indicates special emphasis during each phase]

47

Methodology

Strategy x Process

Information Architect’s Toolbox *

Business Context: strategy meetings; opinion leader interviews; technology assessment
Content & Applications: content inventory; content analysis; metadata evaluation
Users: log analysis; observation / usability testing; interviews / affinity modeling
Existing IA: heuristic evaluation; classification scheme analysis; benchmarking

* select right mix for project; this is a partial list of tools

48

Methodology

Design x Deliverables

Information Architect’s Toolbox *

Organization & Labeling: metadata specifications; controlled vocabularies; thesaurus
Navigation (Embedded): primary taxonomy; classification schemes; blueprints and wireframes
Navigation (Supplemental): search system; sitemap / indexes; personalization / customization
Synthesis: design / authoring guidelines; content management policies; functional specifications

* select right mix for project; this is a partial list of tools

49

Methodology

Consulting x Build

Information Architect’s Toolbox *

Metadata Application: object-level indexing guides; support indexers; support thesaurus managers
Point of Production: support designers / developers; usability testing input / analysis; fix problems
Post-Launch: metrics; evaluation; improvement

* select right mix for project; this is a partial list of tools

50

Methodology

Thesaurus Construction

Strategy
1. Define Thesaurus Strategy
2. Develop Project Plan

Design
3. Gather Candidate Terms / Variants
4. Select Preferred Terms
5. Develop Facet Hierarchies
6. Identify ‘See Also’ Links
7. Write Design / Functional Specifications
8. Build / Buy Software Applications

Build
9. Launch Indexing Operation
10. Refine Controlled Vocabularies

51

Methodology

Strategy Questions

• Does vocabulary control make sense?
• Where and for what purposes?
• How will it align with business goals?
• How will it support users’ goals?
• How will it impact content management?
• Will we buy, borrow, or build?

52

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

53

Metadata

Definition

Information about information

Purposes

1. Document surrogate (abstract)

2. Provides context (date, publisher)

3. Facilitates retrieval (subject)

54

Metadata

Ways to Leverage

User Interface
• Generate browsable indexes (site-wide, sub-site, specialized authority files)
• Enable field-specific searching (filters, zones, sorting)
• Support personalization (map profile to tags)

Behind the Scenes
• Enable efficient content management
• Support decentralized tagging

55

Metadata

Types of Indexing

Full Text
  Manual: x
  Automated: complete text minus stop words

Keyword (Natural Language)
  Manual: humans assign “relevant” words and phrases
  Automated: software assigns “relevant” words and phrases

Controlled Vocabulary
  Manual: humans map variants to preferred terms
  Automated: software maps variants to preferred terms

56

Metadata

Full Text Indexing

57

Metadata

Keyword Indexing

<HTML>
<HEAD>
<TITLE>STARTREK.COM:The Official Star Trek Web Site!</TITLE>
<META NAME='description' CONTENT='STARTREK.COM:The Official Star Trek Web Site! The starting point for all Star Trek information on the web.'>
<META NAME='keywords' CONTENT='star trek, enterprise, james kirk, mister spock, seven of nine, doctor mccoy, captain sulu, borg, klingon, romulan, ferengi, human, starfleet command, delta quadrant, alpha quadrant, gamma quadrant, excelsior, paramount, voyager, deep space nine, captain sisko, jean luc picard, kathryn janeway, starfleet academy, united federation of planets'>
<META NAME='author' CONTENT='Paramount Digital Entertainment'>

58

Metadata

CV Indexing

Partners/Competitors

UI / LRID    Accepted Term    Variant Terms
PC0004       Bell Atlantic    BellAtlantic; Bell Atlantic / North; NYNEX; Nynex
PC0091       NLG              National Leisure Group
PC0076       VH1              Video Hits 1; VH-1

59

Metadata

Indexing Guidelines

Considerations
• Specificity: rule of specific entry
• Exhaustivity: number of terms per document
• Aboutness: strive for consistent interpretation
• Consistency: can be more important than quality
• Quality: balance against speed and consistency

60

Metadata

Comparative Analysis

Full Text (extraction)
• High specificity enables precision (sometimes)
• Exhaustivity allows for high recall (sometimes)

Keyword (assignment or extraction)
• Relatively low level of investment
• Selection of more relevant words / phrases may increase recall and precision (sometimes)

Controlled Vocabulary (assignment)
• Synonym management increases recall
• Disambiguation increases precision (value increases with size, Medline > 6M documents)
• Enables hierarchical and “see also” browsing

61

Metadata

Cost Analysis

Searching Costs: # users, usage volume, user value, success value, size, complexity

Thesaurus Costs: complexity, vocabulary stability, technology

Indexing Costs: content volume, # fields, time per field, rate of growth / churn

62

Metadata

Automated Indexing

Primary Benefit
• Save money (cost of manually classifying 1 journal article = $1.70)

Approaches
• Term Extraction: extraction of “important” words and phrases (proximity, stemming)
• Latent Semantic Indexing: vector space approach (extracts meaning, training required)

Desired Features
• Assign terms from controlled vocabularies
• Integrate with thesauri, database tools, etc.
• Handle multi-lingual collections

63

Metadata

Automated Indexing

Software Categories & Labels

Search Engines, Data Mining, Text Extraction, Knowledge Management, Automatic Classification, Meta-Tagging

Leading Products

Metacode’s Metatagger (http://www.metacode.com/)

Mohomine (http://www.mohomine.com/)

Oingo (http://www.oingo.com/)

InXight Categorizer (http://www.inxight.com/)

Semio Taxonomy (http://www.semio.com/)

Inktomi / Ultraseek CCE (http://www.inktomi.com/)

64

Metadata

Selecting a Strategy

Factors to Consider                             Manual      Automated
Cost (per document)                             High        Low
Speed                                           Slow        Fast
Consistency                                     Variable    High
Quality                                         Variable    Variable
Multimedia-Capable                              Yes         No
Intelligent (understand text and guidelines)    Yes         No

65

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

66

Vocabulary Control

Getting Started

Types
1. Equivalence
2. Hierarchical
3. Associative

[Diagram: the preferred term Vermont and its relationships]
1. Equivalence: (Variant) Green Mountain State; (Variant) Vt
2. Hierarchical: (Broader) United States; (Narrower) Burlington
3. Associative: (Related) Skiing; (Related) Maple Syrup

67

Vocabulary Control

Identify Terms

Published Reference Materials: Thesauri, classification schemes, encyclopedias, dictionaries, glossaries, indexes

Content: Representative sample of web site / intranet

Users: Search log analysis, surveys, interviews

Experts: Authors, subject experts

68

Vocabulary Control

Organize Terms

1. Define preferred terms

2. Link synonyms and variants

3. Group preferred terms by subject

4. Identify broader and narrower terms

5. Identify related terms

Note: steps 3-5 are tentative designations and part of iterative process.

69

Vocabulary Control

Form of Preferred Terms

Grammatical Form (noun, adjective, verb)

Spelling (defined authority, house style)

Singular & Plural Form (count nouns)

Abbreviations & Acronyms (popular use)

Considerations
• Stemming helps (but not for mouse/mice)
• Global guidelines / term-specific decisions
• Rules simplify decision-making
• Consistency enhances usability

70

Vocabulary Control

Selection of Preferred Terms

ANSI/NISO Z39.19-1993

3.0 “Literary warrant (occurrence of terms in documents) is the guiding principle for selection of the preferred (term).”

5.2.2 “Preferred terms should be selected to serve the needs of the majority of users.”

71

Vocabulary Control

Definition of Terms

The meaning of the term must be deliberately restricted.

Qualifiers (manage homographs)

Cells (biology) / Cells (electric)

Scope Notes (restrict meaning)

Hamburger. SN: includes burgers made with beef. Otherwise use “Turkey Burger” or “Veggie Burger”

Definition (clarify and educate)

Trend towards integration of glossaries

72

Vocabulary Control

Variant Terms

Variant terms provide the users with entry points into the vocabulary.

Synonyms (same meaning)

cats USE felines, helicopters USE whirlybirds

Lexical Variants (different word forms)

paediatrics USE pediatrics, BK USE Burger King

Quasi-Synonyms (treated as equivalent)

generic posting: beagle USE dog

antonyms/continuum: wetness USE dryness

73

Vocabulary Control

Recall and Precision

Costs: Time to Find, Failure to Find, Development

Precision Devices: Specificity, Coordination (AND), Compound Terms, Term Definition, Proximity

Recall Devices: Word Stemming, Variants (OR), Generic Posting, Relationships
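Precision and recall have standard definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The Python sketch below illustrates how a recall device such as adding variants (OR) can change both numbers. The function name and the document IDs are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: five documents are actually relevant to the query.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}

# A narrow query (single preferred term) finds one relevant document.
narrow_result = {"d1"}

# The same query expanded with variants (OR) finds more, plus one false drop.
expanded_result = {"d1", "d2", "d3", "d4", "d9"}

narrow_scores = precision_recall(narrow_result, relevant_docs)
expanded_scores = precision_recall(expanded_result, relevant_docs)
```

In this made-up case the narrow query scores perfect precision but low recall, while the expanded query trades a little precision for much higher recall, which is the tension the precision and recall devices on this slide are meant to manage.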

74

Vocabulary Control

Term Specificity

Assuming a good entry vocabulary, increased term specificity allows for improved precision without hurting recall (but costs grow fast).

Vocabulary A     Vocabulary B
United States    United States
                 California
                 San Diego

75

Vocabulary Control

Compound Terms

ANSI/NISO Z39.19.

“Each descriptor…should represent a single concept.”

ISO 2788.

“It is a general rule that…compound terms should be factored (split) into simple elements.”

76

Vocabulary Control

Compound Terms

Article: “Software for Information Architecture”

One Term (High Precision):   Information Architecture Software
Two Terms:                   Information Architecture; Software
Three Terms (High Recall):   Architecture; Information; Software

77

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

78

Structure & Relationships

Types
• Bottom-up (semantic, term to term)
• Top-down (shape, classification)

Semantic Relationships (reciprocity)
• Equivalence
• Hierarchical
• Associative

79

Structure & Relationships

Semantic Relationships

[Diagram: the preferred term Settlements and its relationships]
(Preferred) Settlements
(Variant) Human Settlements
(Synonym) Inhabited Places
(Broader) Cultural Landscapes
(Narrower) Ghost Towns
(Related) Housing
(Related) Dwellings

80

Structure & Relationships

Semantic Relationships

Equivalence
• Use/Used For (USE/UF)
• Leads from variants to preferred
  e.g., prams: USE baby carriages

A = B

81

Structure & Relationships

Semantic Relationships

Hierarchical
• Broader Term/Narrower Term (BT/NT)

Types
• Generic (class/species, inheritance)
  Vertebrata NT Amphibia
• Whole-Part (associative unless exclusive)
  Ear NT Vestibular Apparatus
• Instance (proper name)
  Seas NT Mediterranean Sea

[Diagram: A above B]

82

Structure & Relationships

Semantic Relationships

Associative
• Related Term (RT, See Also)
• Non-hierarchical and non-equivalent
• Relation should be “strongly implied”
  e.g., hammers RT nails

[Diagram: A beside B]

83

Structure & Relationships

Associative Relationships

Examples

Field of Study and Object of Study
• Forestry RT Forests

Process and its Agent
• Temperature Control RT Thermostat

Concepts and their Properties
• Poisons RT Toxicity

Action and Product of Action
• Weaving RT Cloth

Concepts Linked by Causal Dependence
• Bereavement RT Death

84

Structure & Relationships

Classification Schemes

SN Hierarchical arrangement of terms. In navigation context, use Hierarchy.

UF Categorization

Taxonomy

Ontology

RT Hierarchy

85

Structure & Relationships

Pre- & Post-Coordination

Enumerative Classification Schemes
• Pre-coordinate (more compound terms)
• All terms are enumerated (listed) in their entirety in the scheme.
• Example: Library of Congress Classification Scheme

Synthetic Classification Schemes
• Post-coordinate (more uni-terms)
• New terms can be created by combining terms during a search (AND).
• Example: Art & Architecture Thesaurus

86

Structure & Relationships

Pre- & Post-Coordination

• In the highly enumerative LC Classification, “Groundwater -- Pollution” and “Soil pollution” are dispersed at indexing (high precision, low recall).

• Keyword searching improves recall, hurts precision (a synthetic band-aid, potential false drop on “soil purification standards”).

87

Structure & Relationships

Polyhierarchy

Strict Hierarchies
• Each term appears in only one place in the hierarchy.
• Essential for placement of physical objects.

Polyhierarchies
• Terms cross-listed in multiple categories.
• Accepts complex nature of reality.

88

Structure & Relationships

Polyhierarchy

Medical Subject Headings (MeSH)
• Compound terms needed to manage 6 million documents in Medline.
• High level of pre-coordination forces polyhierarchy.
• Terms may have more than one BT.

[Diagram: Diseases is the broader term of both Virus Diseases and Respiratory Tract Diseases; Viral Pneumonia appears under both]

89

Structure & Relationships

Faceted Classification

Overview
• Invented by S.R. Ranganathan (1930s)
• Handle complex subjects (reality)
• One principle of division at a time
• Multiple “pure” taxonomies
• UF analytico-synthetic scheme, fielded database

Facets
• Fundamental facets: personality, matter, energy, space, time
• Common facets: subject (about), geography (in), author (by whom)

Art & Architecture Thesaurus, ASIS Thesaurus

90

Structure & Relationships

Facets, Coordination, Specificity

Facets
Entities: Apples, Pears, Peaches
Processes: Canning, Freezing, Drying
Forms: Canned, Frozen, Fresh

Partial List of Potential Combinations
Apples, Pears, Peaches
Canning, Freezing, Drying
Canned, Frozen, Fresh
Canning of Apples, Canning of Pears, Canning of Peaches
Freezing of Apples, Freezing of Pears, Freezing of Peaches
Drying of Apples, Drying of Pears, Drying of Peaches
Canned Apples, Canned Pears, Canned Peaches
Frozen Apples, Frozen Pears, Frozen Peaches
Fresh Apples, Fresh Pears, Fresh Peaches
Freezing of Canned Apples, Canning of Dried Pears, Drying of Fresh Peaches
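The growth in combinations can be made concrete by generating them. The Python sketch below is illustrative only: the heading patterns ("X of Y", "Form Entity") are assumptions read off the slide's examples, and the function names are hypothetical. It contrasts the nine facet terms a synthetic scheme has to maintain with the pre-coordinated headings an enumerative scheme would have to list.

```python
from itertools import product

# The three facets from the slide.
ENTITIES = ["Apples", "Pears", "Peaches"]
PROCESSES = ["Canning", "Freezing", "Drying"]
FORMS = ["Canned", "Frozen", "Fresh"]


def process_headings():
    """Process + entity, e.g. 'Drying of Apples'."""
    return [f"{p} of {e}" for p, e in product(PROCESSES, ENTITIES)]


def form_headings():
    """Form + entity, e.g. 'Canned Apples'."""
    return [f"{f} {e}" for f, e in product(FORMS, ENTITIES)]


def process_form_headings():
    """Process + form + entity, e.g. 'Freezing of Canned Apples'."""
    return [f"{p} of {f} {e}" for p, f, e in product(PROCESSES, FORMS, ENTITIES)]


facet_term_count = len(ENTITIES) + len(PROCESSES) + len(FORMS)
enumerated_count = (
    len(process_headings()) + len(form_headings()) + len(process_form_headings())
)
```

With three terms in each of three facets, the faceted vocabulary holds 9 terms, while listing every two-way and three-way heading of these patterns produces 45. Adding one more entity adds a single facet term but many more pre-coordinated headings.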

91

Structure & Relationships

Yahoo

Characteristics
• Single Facet (a topical hierarchy)
• Fairly Enumerative (search on “Boston” finds 45 categories including: Boston Celtics, Boston Tea Party, Anonymous Account of the Boston Massacre)
• Polyhierarchical (Computer Science@ listed under Computers & Internet and Science)

Observations
• Huge number of categories and levels (unwieldy)
• Fits user expectations (where do I find this?)

92

Structure & Relationships

ASIS Thesaurus

Characteristics
• Faceted (16 facets including document types, fields and disciplines, organizations, qualities)
• Fairly Synthetic (large percentage of one or two word single-concept descriptors)
• Polyhierarchical (machine aided indexing BT computer applications, BT indexing)

Observations
• Faceted approach allows small number of terms to be combined in large number of unexpected ways (e.g., ambiguity and informatics)
• Presentation is not accessible to typical user

93

Structure & Relationships

A Unification Theory

Hypothesis: This hybrid information architecture will become a common model for web sites and intranets over the next several years.

Taxonomy
• single facet, enumerative
• fits user expectations (where did they put this?)
• use for top few levels (familiar gateway to site)
• early user tests (best primary hierarchy)
• application of human expertise

Thesaurus
• faceted, synthetic
• fits content complexity (how can I describe this?)
• populate the hierarchy (combinations, see also)
• ongoing user tests (leverage power, flexibility)
• human-software hybrid (facet-specific solutions)

94

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

95

Thesaurus Management

What’s Involved?
• Software, workflow, quality control
• Vocabularies evolve over time
• Impacts authors, indexers, users

Vocabulary Maintenance Tasks
• Add, delete, enhance, normalize terms
• Overall evaluation

96

Thesaurus Management

Software: What to Look For

• Traditional database functionality

• Compliant with standards (ANSI, ISO)

• Relationship control (reciprocity, validation, orphan identification)

• Term status (proposed, provisional, accepted)

• Flexible output (alphabetical, hierarchical)

• Integration with related tools and tasks (indexing, searching, browsing)

Willpower’s List of Thesaurus Software

http://www.willpower.demon.co.uk/thessoft.htm
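"Relationship control" is easier to evaluate with a concrete picture of what the software should check. The Python sketch below is a hypothetical illustration, not the behavior of any listed product: it assumes a thesaurus stored as a dictionary of terms with BT, NT, and RT lists, and reports relationships that lack their reciprocal as well as orphan terms. The sample terms come from earlier slides; the deliberately missing links are invented for the demonstration.

```python
# Reciprocal of each relationship type: BT pairs with NT, RT is symmetric.
INVERSE = {"BT": "NT", "NT": "BT", "RT": "RT"}


def find_missing_reciprocals(thesaurus):
    """Return (term, relationship, target) triples whose reciprocal link is absent."""
    problems = []
    for term, relationships in thesaurus.items():
        for relationship, targets in relationships.items():
            for target in targets:
                back_links = thesaurus.get(target, {}).get(INVERSE[relationship], [])
                if term not in back_links:
                    problems.append((term, relationship, target))
    return problems


def find_orphans(thesaurus):
    """Return terms that neither point to nor are pointed at by any other term."""
    referenced = {
        target
        for relationships in thesaurus.values()
        for targets in relationships.values()
        for target in targets
    }
    return sorted(
        term
        for term, relationships in thesaurus.items()
        if not any(relationships.values()) and term not in referenced
    )


# Hypothetical sample data with one missing reciprocal and one orphan.
SAMPLE_THESAURUS = {
    "Vertebrata": {"NT": ["Amphibia"]},
    "Amphibia": {"BT": ["Vertebrata"]},
    "hammers": {"RT": ["nails"]},
    "nails": {},      # should carry RT hammers
    "Forestry": {},   # no relationships at all
}
```

Running both checks on the sample flags the one-way "hammers RT nails" link and the unconnected term, the kind of report a thesaurus manager would review before accepting proposed terms.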

97

Thesaurus Management

Software: What You’ll Find

Thesaurus Management Software
• Standards-compliant, sophisticated
• Poor integration (library-centric)
• Examples: Lexico, MultiTes

Database Management Software
• Strong integration
• Less thesaurus-specific functionality
• Examples: Oracle (interMedia), Sybase (English Wizard)

98

Thesaurus Management Software

What You’ll Find

Search Engines
• Watch for casual use of “thesaurus”
• Look for integration with browsing.

Ultraseek
Thesaurus Expansion for Queries: Administrators may put sets of synonyms in the thesaurus.txt file…When a query matches one of the terms in that file, the synonyms will automatically appear, so the user has the option to add it to the query.

Verity
Verity's core search products include the following advanced knowledge retrieval capabilities: advanced query expansion and disambiguation tools, including linguistic stemming and thesaurus expansion.

99

Section Break

I. Thesauri in Context; II. Value of Thesauri; III. Methodology; IV. Metadata; V. Vocabulary Control; VI. Structure & Relationships; VII. Thesaurus Management; VIII. Case Study; IX. Related Topics

100

Case Study

Call Center Intranet

Introduction
• KM application
• 6,000 users (customer care associates)
• 8,000 documents (hierarchy, search)
• 6 month project (10/97 to 4/98)
• $500K of $10M redesign

Goals
• Reduce training time / time to find
• Increase use / customer satisfaction

101

Case Study: Call Center Intranet

Process Overview

Strategy
• Background, vocabulary, meetings, observation
• 4 weeks x 2.5 PM + 1 IA

Design
• Bottom-up focus (doc types, fields, templates)
• 4 weeks x 2 PM + 2 IA
• 4 weeks x 1 IA (during implementation)

Implementation
• Indexing / develop controlled vocabularies
• Specifications (authors, indexers, developers)
• 16 weeks x 4 indexers + 1 IA + 2 PM + 1 subject expert

102

Case Study: Call Center Intranet

Controlled Vocabularies

Primary Vocabularies
• Partners/Competitors (122)
• Plans/Promotions (173)
• Products/Services (151 / 184 variants)
• Geographic Codes (51)

Secondary Vocabularies
• Adjustment Codes (36)
• Corporate Terminology (70)
• Time Codes (12)

103

Case Study: Call Center Intranet

Primary Vocabularies

Partners/Competitors

UI Accepted Term

LRID   Variant Terms

PC0004 Bell Atlantic

    BellAtlantic; Bell Atlantic / North; NYNEX; Nynex

PC0091 NLG

    National Leisure Group

PC0076 VH1

    Video Hits 1; VH-1

104

Case Study: Call Center Intranet

Primary Vocabularies

Products/Services

UI Accepted Term

LRID   Variant Terms

PS0135 Access Dialing

    10-288; 10-322; dial around

PS0006 Air Miles

    AirMiles

PS0151 XYZ Direct

    USADirect; XYZ USA Direct; XYZDirect card
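One way to put a table like this to work is a lookup that maps any variant to its accepted term. The sketch below copies the three entries from the Products/Services slide; the function names and the case-insensitive matching rule are assumptions, not part of the project described here.

```python
# Entries copied from the Products/Services slide: id -> (accepted term, variants).
PRODUCTS_SERVICES = {
    "PS0135": ("Access Dialing", ["10-288", "10-322", "dial around"]),
    "PS0006": ("Air Miles", ["AirMiles"]),
    "PS0151": ("XYZ Direct", ["USADirect", "XYZ USA Direct", "XYZDirect card"]),
}


def build_lookup(vocabulary):
    """Map every accepted or variant label (case-folded) to (id, accepted term)."""
    lookup = {}
    for term_id, (accepted, variants) in vocabulary.items():
        for label in [accepted, *variants]:
            lookup[label.casefold()] = (term_id, accepted)
    return lookup


def normalize(label, lookup):
    """Return (id, accepted term) for a label, or None if it is not in the vocabulary."""
    return lookup.get(label.strip().casefold())


lookup = build_lookup(PRODUCTS_SERVICES)
print(normalize("airmiles", lookup))
print(normalize("Dial Around", lookup))
```

An indexer or search front end could call `normalize` on whatever an author or user typed and store or query the accepted term instead.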

105

Case Study: Call Center Intranet

Primary Vocabularies

Geographic Codes

CT Connecticut

DE Delaware

DC District of Columbia; Dist. of Columbia; Dist. Columbia

  Note: Continental U.S. is equivalent to the lower 48 states.

106

Case Study: Call Center Intranet

Secondary Vocabularies

Adjustment Codes

DAK Denies All Knowledge

    -

MOS Monthly Service Charge

    Mnthly. Service Charge; Mnthly. Svc. Charge; Monthly Svc. Charge

WNO Wrong Number

    -

WTN Working Telephone Number

    Working Tele. Number

107

Case Study: Call Center Intranet

Secondary Vocabularies

Corporate Terminology

Billed Telephone Number (BTN)

    Billed Tele. Number

Cross Boundary Account

    Foreign Account

Fraud

    -

Multi Level Marketing

    Multi-Level Marketing; MultiLevel Marketing; MLM

World Wide Web

    WWW; WorldWideWeb

108

Case Study: Call Center Intranet

Blueprints

[Diagram labels: Customer Care; Browse by Plans & Promotions; Browse by Products & Services; Browse by Topics; Browse by Partners & Competitors; Browse by Geography; Browse by What's New; Advanced Search; Search Interface; Express Links (Top 10); Express Links]

109

Case Study: Call Center Intranet

Wireframes: Content

110

Case Study: Call Center Intranet

Wireframes: Browsable Index

Provides ability to view all documents tagged with same preferred term. Ability to combine fields for powerful search/browse.
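The browsable index can be thought of as an inverted index keyed by field and preferred term, with combined fields handled as set intersection. The sketch below is illustrative only: the documents, field names, and functions are hypothetical and do not describe how the intranet was actually built.

```python
from collections import defaultdict

# Hypothetical documents tagged with preferred terms from two vocabularies.
DOCUMENTS = [
    {"id": "doc1", "product": "Air Miles", "geography": "CT"},
    {"id": "doc2", "product": "Air Miles", "geography": "DE"},
    {"id": "doc3", "product": "Access Dialing", "geography": "CT"},
]


def build_index(documents, fields):
    """Map each (field, preferred term) pair to the ids of documents tagged with it."""
    index = defaultdict(set)
    for doc in documents:
        for name in fields:
            if name in doc:
                index[(name, doc[name])].add(doc["id"])
    return index


def browse(index, **criteria):
    """Return ids of documents that match every field=term pair given."""
    matches = [index.get((name, term), set()) for name, term in criteria.items()]
    return sorted(set.intersection(*matches)) if matches else []


index = build_index(DOCUMENTS, ["product", "geography"])
print(browse(index, product="Air Miles"))
print(browse(index, product="Air Miles", geography="CT"))
```

Because every document is tagged with the preferred term only, one index entry collects all documents on a subject regardless of which variant the author used.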

111

Case Study: Call Center Intranet

Deliverables Overview
• Blueprints and Wireframes
• Controlled Vocabularies
• Authoring & Indexing Guidelines
• Indexed Documents (4,000)
• Functional Specifications
• Documentation & Training

112

Section Break

I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics

113

Related Topics

Multi-Lingual Thesauri

Concepts
• Source / Target Language
• Degrees of Equivalence
• Localization, not Globalization

Facts (from The Mother Tongue by Bill Bryson)
• There are now more students of English in China than there are people in the United States
• The French can’t distinguish house and home
• Finnish has 15 case forms (noun variants)
• The Eskimos have 50 words for types of snow but no word that just means snow
• A blizzard in England is a flurry in Nebraska

114

Related Topics

The List Goes On…

Thesauri AND
• Business Strategy
• Content Management
• Markup Languages
• Notation
• XML

115

Seminar Review

I. Thesauri in Context II. Value of Thesauri III. Methodology IV. Metadata V. Vocabulary Control VI. Structure & Relationships VII. Thesaurus Management VIII. Case Study IX. Related Topics

116

How To Learn More

Argus Center for Information Architecture

Web Site

http://argus-acia.com

Email Newsletter

Strange Connections, Events, Interviews

Thesaurus Resources & Examples

http://argus-acia.com/seminars/

user name and password both = “lajolla”

117

Contact Us

Argus Associates, Inc.
912 North Main Street
Ann Arbor, Michigan 48104
(734) 913-0010

Sales sales@argus-inc.com

Employment http://argus-inc.com/recruiting/

Web Sites http://argus-inc.com/

http://argus-acia.com/
