Presentation given by Dave Clarke, CEO, Synaptica, LLC and Paula McCoy, ProQuest, on Machine vs. Human Indexing at Taxonomy Boot Camp in San Jose, 2009.
Taxonomies: Tools or People?
TBC; Taxonomies: Tools or People? By Dave Clarke & Paula McCoy
Copyright © Synaptica, LLC, 2009 www.synapticasoftware.com
11/25/09 Slide 1
When would one favor human indexing over machine indexing? An example of the human indexing effort is presented along with tools that can help with the process. An example of autocategorization is illustrated with a discussion of the reciprocal flow of information between the taxonomy management tool and the autocategorization tool. Speakers then discuss how structured vocabularies help refine categorizers and how feedback from the categorizer tool to the human editorial team contributes to the continual improvement of the vocabularies.
by Dave Clarke & Paula McCoy
HUMAN VS. MACHINE &
THE HUMAN OPTION
Dave Clarke
CEO, Synaptica, LLC
Humans will invent almost anything to save time
Human or machine indexing – depends on the data and the user
subtle & abstract concepts
non-textual, e.g. images, sounds
highly structured
very high volume
homogeneous topics
mission-critical precision & recall
noisy or incomplete results tolerable
very quick turnaround
Human indexing – the process
Human indexing – a wish list of time-saving tools
Human indexing – a wish list of time-saving tools (continued)
Human indexing – Synaptica’s “IMS” Toolbox
Human indexing – IMS Workflow Detail
Human indexing – profile set up screen shot
Human indexing – examples
1. A national library could use IMS to human index digital images and multimedia assets against a set of authority files.
2. A professional services corporation could use IMS to human index mission-critical legal documents against a taxonomy of compliance terminology.
3. A multinational electronics company could use IMS to human index product data according to product lines and families, hardware assets and other product based keyword groups.
Human indexing – conclusions
1. Like everything else in life, if we can possibly pass the task on to machines, we’d like to.
2. There are some situations where machines are the only solution and there are others where human indexing is required (non-machine-readable data sets, subtle/abstract concepts, mission-critical precision-recall requirements, etc.).
3. If human indexing is required, there are tools that can help speed up the process and help attain indexing consistency.
4. The Synaptica “wish list” represents those time-saving tools requested by our user base over the past ten years.
Copyright © Proquest, Inc., 2009 www.proquest.com
AUTOCATEGORIZATION: A CASE STUDY USING SYNAPTICA
Paula McCoy
Manager, Taxonomy Development, ProQuest
• Information aggregator & database producer, with content ranging from newspapers to academic/scholarly publications, in topics spanning business and management, STM (scientific, technical, medical), humanities, social science, general reference
• Abstracts/indexes more than 6,000 periodicals and newspapers
• Daily ingest of more than 60,000 new newspaper and newswire articles
• Customer base: Public and academic libraries
• End users: Academic and student researchers
The Mandate: To promote discovery of all content relevant to the user’s search query
The Solution: Index and abstract as much content as possible in order to maximize the number of “entry points” to an article.
– Indexing provided for different parts of an article:
• SUBJECTS
• COMPANIES
• PEOPLE
• LOCATIONS
– Abstracts provided for all articles of minimum length
ProQuest Search Interface
A Growing Challenge:
How to abstract and index (A&I) hundreds of thousands of new articles every day?
The Only Answer: Autocategorization, or auto-indexing: Machine-based application of index terms to a document or other object
ProQuest Search Interface
The Autocategorization Solution
Basic Tenets of Autocategorization:
1. Must have a controlled vocabulary in place
2. Must have other controlled lists if you want to index companies, people, locations, etc.
3. Must have a way to manage your vocabularies
4. Must have a way to manage the results of the autocat, since no automated indexing method is perfect
Autocat success rests upon the existence of a strong controlled vocabulary with a history of usage from which the automation software can learn.
The ProQuest Approach
1) Implement Synaptica thesaurus management solution to manage 11,300+-term subject thesaurus and authority files for companies, people, and locations
2) Purchase Nstein Technologies’ Text Mining Engine (TME) solution to automate abstracting and indexing of subject and other terms
3) Train the TME to understand the usage of ProQuest thesaurus terms (3-month collaborative process)
4) Implement Nstein’s Knowledge Base Manager (TME Manager) to manage subject terms rules base
Synaptica Taxonomy Manager Nstein
Thesaurus and Autocat Management
Synaptica Thesaurus Management:
• New terms added, hierarchies revised, Scope Notes added/revised
• Use For (non-preferred) terms added frequently to reflect variant usages in the indexed literature and provide additional cross-references
Nstein Autocat Management:
• Nstein TME Manager tool used to manage indexing rules base for all thesaurus terms
• Autocat rules supplement and complement the underlying concept training
• Autocat rules can be added, deleted, revised
• Autocat rules enable autocat indexing to keep up with changes in term usages so that new variants can be added and rules created based on current topics in the literature or in the news
Synaptica-TME Interaction
Thesaurus management informs 2 levels of indexing: manual and automated
The thesaurus as represented in Synaptica must display all cross-references (mainly Use refs) required by manual indexers
The thesaurus as represented in Nstein must contain rules reflecting those Use references
Term updates made in Synaptica are duplicated in Nstein via indexing rules
Use references in Synaptica point human indexers to the right term
Use references in Nstein rules base point the automated indexer to the right term
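The idea that one set of Use references serves both the human indexer and the automated indexer can be sketched in a few lines of Python. This is an illustration only: the `Term` class, the function names, and the sample terms are hypothetical and do not reflect the Synaptica data model or the Nstein rules format, neither of which is described in the slides.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred thesaurus term with its Use For (non-preferred) variants."""
    preferred: str
    use_for: list = field(default_factory=list)


def build_use_index(terms):
    """Map each variant, and the preferred form itself, to the preferred term."""
    index = {}
    for term in terms:
        index[term.preferred.lower()] = term.preferred
        for variant in term.use_for:
            index[variant.lower()] = term.preferred
    return index


def resolve(index, candidate):
    """Follow a Use reference, as a human indexer would; None if the term is unknown."""
    return index.get(candidate.lower())


def auto_index(index, text):
    """Toy automated indexer: assign a term when any of its variants occurs as a whole phrase."""
    lowered = text.lower()
    hits = set()
    for variant, preferred in index.items():
        if re.search(r"\b" + re.escape(variant) + r"\b", lowered):
            hits.add(preferred)
    return sorted(hits)


# Hypothetical sample thesaurus, invented for this sketch.
thesaurus = [
    Term("Climate change", use_for=["Global warming", "Greenhouse effect"]),
    Term("Electric vehicles", use_for=["Electric cars"]),
]
use_index = build_use_index(thesaurus)

print(resolve(use_index, "global warming"))
print(auto_index(use_index, "Sales of electric cars rose as global warming concerns grew."))
```

Because both lookups read the same index, adding a new Use For variant in one place updates manual and automated indexing together, which is the duplication step the slide describes being carried out via indexing rules.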
Synaptica & Autocat: Benefits
• A semantic-based autocat solution indexes as well as it’s been trained; that training is most successful if based on years of manual indexing using a controlled subject vocabulary; combined with a rules base, autocat can produce intelligent and informed indexing.
• Reviewing the results of good autocat leads to comparison with ongoing manual indexing; questions about term usages rise to the surface; human indexing can improve by becoming more flexible and adaptable to changes in terminology; revised term usages are reflected in Synaptica.
• Human indexers raise issues of new term variants and the need for new terms; Synaptica is updated; the rules base is updated to allow autocat to capture terms better.
Benefits for Synaptica Thesaurus Control
• Day-to-day review of automated indexing highlights correct and incorrect term usages, leading to greater discipline in Synaptica thesaurus management to ensure human indexers remain aware of terms and their proper usage.
• The need for precision in subject terms means terms must be exact and descriptive: automated indexing will not work with vague, ambiguous terms or one-word terms with multiple meanings, like “Apologies,” “Affect,” “Articulation.” The result is a more robust and controlled subject vocabulary.
• Automated indexing will use terms in the thesaurus that human indexers may have forgotten about, leading again to revised hierarchies in Synaptica, new Scope Notes, and instant feedback to indexers.
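A review of this kind could be supported by a simple comparison of term usage between human and automated indexing. The sketch below is an assumption about how such a report might look, not a description of any ProQuest, Synaptica, or Nstein tool: the function names and the sample records are hypothetical, and flagging every one-word term is only a crude heuristic for finding potentially ambiguous terms.

```python
from collections import Counter


def term_usage(records):
    """Count how often each index term was assigned across a set of indexed records."""
    return Counter(term for terms in records for term in terms)


def review_candidates(thesaurus_terms, human_records, autocat_records):
    """List terms the editorial team may want to review."""
    human = term_usage(human_records)
    auto = term_usage(autocat_records)
    return {
        # Terms the autocat applied that human indexers never used (possibly forgotten terms).
        "autocat_only": sorted(t for t in thesaurus_terms if auto[t] and not human[t]),
        # Terms humans applied that the autocat never used (possibly missing rules).
        "human_only": sorted(t for t in thesaurus_terms if human[t] and not auto[t]),
        # One-word terms, flagged as a rough proxy for possible ambiguity.
        "single_word": sorted(t for t in thesaurus_terms if len(t.split()) == 1),
    }


# Hypothetical sample data, invented for this sketch.
thesaurus_terms = ["Affect", "Apologies", "Corporate governance", "Supply chains"]
human_records = [["Corporate governance"], ["Corporate governance", "Affect"]]
autocat_records = [["Supply chains"], ["Corporate governance"]]

report = review_candidates(thesaurus_terms, human_records, autocat_records)
for category, terms in report.items():
    print(category, terms)
```

Each list is a prompt for editorial judgment rather than an automatic fix: an autocat-only term may call for a Scope Note or feedback to indexers, and a one-word term may need a more descriptive replacement.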