Synaptica Proquest Talk Taxonomy Boot Camp 2009

DESCRIPTION

Presentation given by Dave Clarke, CEO, Synaptica, LLC and Paula McCoy, ProQuest, on Machine vs. Human Indexing at Taxonomy Boot Camp in San Jose, 2009.

Page 1: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Taxonomies: Tools or People?

TBC; Taxonomies: Tools or People? By Dave Clarke & Paula McCoy

Copyright © Synaptica, LLC, 2009 www.synapticasoftware.com

11/25/09 Slide 1

When would one favor human indexing over machine indexing? An example of the human indexing effort is presented along with tools that can help with the process. An example of autocategorization is illustrated with a discussion of the reciprocal flow of information between the taxonomy management tool and the autocategorization tool. Speakers then discuss how structured vocabularies help refine categorizers and how feedback from the categorizer tool to the human editorial team contributes to the continual improvement of the vocabularies.

by Dave Clarke & Paula McCoy

Page 2: Synaptica Proquest Talk Taxonomy Boot Camp 2009

HUMAN VS. MACHINE & THE HUMAN OPTION

Dave Clarke

CEO, Synaptica, LLC

[email protected]  

Page 3: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Humans will invent almost anything to save time

Page 4: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human or machine indexing – depends on the data and the user

subtle & abstract concepts

non-textual, e.g. images, sounds

highly structured

very high volume

homogeneous topics

mission-critical precision & recall

noisy or incomplete results tolerable

very quick turnaround
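
The factors above can be read as a rough checklist. The sketch below is a hypothetical illustration, not something shown in the talk. It assumes that subtle or abstract concepts, non-textual content, and mission-critical precision and recall favor human indexing (these three match the cases named on the conclusions slide), and that the remaining factors favor machine indexing, which is an inference rather than something the slide text states. All class, field, and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical description of a content set and its users' needs."""
    subtle_or_abstract: bool = False
    non_textual: bool = False        # e.g. images, sounds
    mission_critical: bool = False   # precision & recall must be high
    highly_structured: bool = False  # assumed to favor machines
    very_high_volume: bool = False   # assumed to favor machines
    homogeneous_topics: bool = False # assumed to favor machines
    noise_tolerable: bool = False    # assumed to favor machines
    quick_turnaround: bool = False   # assumed to favor machines


def suggest_indexing(c: Collection) -> str:
    """Return 'human', 'machine', or 'mixed' (both sides, or no signal)."""
    human = sum([c.subtle_or_abstract, c.non_textual, c.mission_critical])
    machine = sum([c.highly_structured, c.very_high_volume,
                   c.homogeneous_topics, c.noise_tolerable,
                   c.quick_turnaround])
    if human and not machine:
        return "human"
    if machine and not human:
        return "machine"
    return "mixed"
```

A "mixed" result is where the rest of the talk applies: automated indexing backed by human review and a managed vocabulary.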

Page 5: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – the process

Page 6: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – a wish list of time-saving tools

Page 7: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – a wish list of time-saving tools

Page 8: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – Synaptica’s “IMS” Toolbox

Page 9: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – IMS Workflow Detail

Page 10: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – profile setup screenshot

Page 11: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – examples

1. A national library could use IMS to human index digital images and multimedia assets against a set of authority files.

2. A professional services corporation could use IMS to human index mission-critical legal documents against a taxonomy of compliance terminology.

3. A multinational electronics company could use IMS to human index product data according to product lines and families, hardware assets, and other product-based keyword groups.

Page 12: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Human indexing – conclusions

1. Like everything else in life, if we can possibly pass the task on to machines, we’d like to.

2. There are some situations where machines are the only solution and there are others where human indexing is required (non-machine-readable data sets, subtle/abstract concepts, mission-critical precision-recall requirements, etc.)

3. If human indexing is required there are tools that can help speed up the process and help attain indexing consistency

4. The Synaptica “wish list” represents those time-saving tools requested by our user base over the past ten years

Page 13: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Copyright © Proquest, Inc., 2009 www.proquest.com

AUTOCATEGORIZATION: A CASE STUDY USING SYNAPTICA

Paula McCoy

Manager, Taxonomy Development, ProQuest

[email protected]  

Page 14: Synaptica Proquest Talk Taxonomy Boot Camp 2009

• Information aggregator & database producer, with content ranging from newspapers to academic/scholarly publications, in topics spanning business and management, STM (scientific, technical, medical), humanities, social science, general reference

• Abstracts/indexes more than 6,000 periodicals and newspapers

• Daily ingest of more than 60,000 new newspaper and newswire articles

• Customer base: Public and academic libraries

• End users: Academic and student researchers

Page 15: Synaptica Proquest Talk Taxonomy Boot Camp 2009

The Mandate: To promote discovery of all content relevant to the user’s search query

The Solution: Index and abstract as much content as possible in order to maximize the number of “entry points” to an article.

– Indexing provided for different parts of an article:

• SUBJECTS

• COMPANIES

• PEOPLE

• LOCATIONS

– Abstracts provided for all articles of minimum length

ProQuest Search Interface

Page 16: Synaptica Proquest Talk Taxonomy Boot Camp 2009

A Growing Challenge:

How to abstract and index (A&I) hundreds of thousands of new articles every day?

The Only Answer: Autocategorization, or auto-indexing: machine-based application of index terms to a document or other object

ProQuest Search Interface

Page 17: Synaptica Proquest Talk Taxonomy Boot Camp 2009

The Autocategorization Solution

Basic Tenets of Autocategorization:

1. Must have a controlled vocabulary in place

2. Must have other controlled lists if you want to index companies, people, locations, etc.

3. Must have a way to manage your vocabularies

4. Must have a way to manage the results of the autocat; no automated indexing method is perfect

Autocat success rests upon the existence of a strong controlled vocabulary with a history of usage from which the automation software can learn.

Page 18: Synaptica Proquest Talk Taxonomy Boot Camp 2009

The ProQuest Approach

1) Implement Synaptica thesaurus management solution to manage 11,300+-term subject thesaurus and authority files for companies, people, and locations

2) Purchase Nstein Technologies’ Text Mining Engine (TME) solution to automate abstracting and indexing of subject and other terms

3) Train the TME to understand the usage of ProQuest thesaurus terms (3-month collaborative process)

4) Implement Nstein’s Knowledge Base Manager (TME Manager) to manage subject terms rules base

Figure labels: Synaptica, Taxonomy Manager, Nstein

Page 19: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Thesaurus and Autocat Management

Synaptica Thesaurus Management:

• New terms added, hierarchies revised, Scope Notes added/revised

• Use For (non-preferred) terms added frequently to reflect variant usages in the indexed literature and provide additional cross-references

Nstein Autocat Management:

• Nstein TME Manager tool used to manage indexing rules base for all thesaurus terms

• Autocat rules supplement and complement the underlying concept training

• Autocat rules can be added, deleted, revised

• Autocat rules enable autocat indexing to keep up with changes in term usages so that new variants can be added and rules created based on current topics in the literature or in the news
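
To make the idea of a rules base concrete, here is a minimal sketch in plain Python. It is not Nstein's TME Manager or its rule syntax, neither of which is shown in the talk. It assumes a rule is simply a trigger phrase attached to a thesaurus term, and the class name, method names, and sample term are all invented for illustration.

```python
import re


class RulesBase:
    """Toy rules base: each thesaurus term has a set of trigger phrases."""

    def __init__(self):
        self.rules = {}  # preferred term -> set of lowercase trigger phrases

    def add_rule(self, term, phrase):
        """Attach a trigger phrase (e.g. a new variant) to a term."""
        self.rules.setdefault(term, set()).add(phrase.lower())

    def delete_rule(self, term, phrase):
        """Remove a trigger phrase; silently ignore unknown terms or phrases."""
        self.rules.get(term, set()).discard(phrase.lower())

    def index(self, text):
        """Return the sorted terms whose trigger phrases occur in the text."""
        lowered = text.lower()
        hits = set()
        for term, phrases in self.rules.items():
            for phrase in phrases:
                # Whole-word match, so "cat" does not fire on "category".
                if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                    hits.add(term)
                    break
        return sorted(hits)


rules = RulesBase()
rules.add_rule("Influenza", "swine flu")  # variant taken from current news
rules.add_rule("Influenza", "H1N1")
print(rules.index("New H1N1 cases reported"))
```

In this toy version, adding a variant that appears in the news is a one-line change, which is the property the last bullet describes. A real engine would combine such rules with its trained concept model.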

Page 20: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Synaptica-TME Interaction

Thesaurus management informs 2 levels of indexing: manual and automated

• The thesaurus as represented in Synaptica must display all cross-references (mainly Use refs) required by manual indexers

• The thesaurus as represented in Nstein must contain rules reflecting those Use references

• Term updates made in Synaptica are duplicated in Nstein via indexing rules

• Use references in Synaptica point human indexers to the right term

• Use references in Nstein rules base point the automated indexer to the right term
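
The duplication step can be pictured as a consistency check between the two systems. The sketch below assumes, purely for illustration, that the thesaurus can be exported as a mapping from each preferred term to its Use For variants, and that the rules base can be exported as a mapping from lowercase phrase to preferred term. Neither export format comes from the talk, and the function names and sample terms are hypothetical.

```python
def use_reference_map(thesaurus):
    """Map every variant (lowercased), and each preferred term, to its preferred term."""
    lookup = {}
    for preferred, use_for in thesaurus.items():
        lookup[preferred.lower()] = preferred
        for variant in use_for:
            lookup[variant.lower()] = preferred
    return lookup


def missing_rules(thesaurus, rules_base):
    """List (phrase, preferred term) pairs present in the thesaurus
    but absent from, or pointing elsewhere in, the rules base."""
    wanted = use_reference_map(thesaurus)
    return sorted((phrase, preferred)
                  for phrase, preferred in wanted.items()
                  if rules_base.get(phrase) != preferred)


thesaurus = {"Motion pictures": ["Movies", "Films"]}
rules_base = {"motion pictures": "Motion pictures",
              "movies": "Motion pictures"}
print(missing_rules(thesaurus, rules_base))  # the "films" variant still needs a rule
```

Running a check like this after each thesaurus update would show which Use references have not yet been mirrored as indexing rules.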

Page 21: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Synaptica & Autocat: Benefits

• A semantic-based autocat solution indexes as well as it’s been trained → that training is most successful if based on years of manual indexing using a controlled subject vocabulary → combined with a rules base, autocat can produce intelligent and informed indexing

• Reviewing the results of good autocat leads to comparison with ongoing manual indexing → questions about term usages rise to the surface → human indexing can improve by becoming more flexible and adaptable to changes in terminology → revised term usages are reflected in Synaptica

• Human indexers raise issues of new term variants and need for new terms → Synaptica is updated → the rules base is updated to allow autocat to capture terms better
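
The comparison between autocat output and manual indexing can be sketched as a per-term disagreement count. This is an illustrative assumption about how such a review might be tallied, not a description of ProQuest's process. The function name, article ids, and sample terms are invented.

```python
from collections import Counter


def disagreement_report(manual, automated):
    """Count, per term, how often only one indexing method applied it.

    Both arguments map an article id to the set of terms assigned to it.
    Only articles indexed by both methods are compared.
    """
    only_manual, only_auto = Counter(), Counter()
    for article_id in manual.keys() & automated.keys():
        only_manual.update(manual[article_id] - automated[article_id])
        only_auto.update(automated[article_id] - manual[article_id])
    return only_manual, only_auto


manual = {"a1": {"Taxes", "Budgets"}, "a2": {"Taxes"}}
automated = {"a1": {"Taxes"}, "a2": {"Taxes", "Tax reform"}}
only_manual, only_auto = disagreement_report(manual, automated)
print(only_manual.most_common(), only_auto.most_common())
```

Terms that show up often on one side only are the ones whose usage questions "rise to the surface" for the editorial team.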

Page 22: Synaptica Proquest Talk Taxonomy Boot Camp 2009

Benefits for Synaptica Thesaurus Control

• Day-to-day review of automated indexing highlights correct and incorrect term usages, leading to greater discipline in Synaptica thesaurus management to ensure human indexers remain aware of terms and their proper usage.

• The need for precision in subject terms means terms must be exact and descriptive: automated indexing will not work with vague, ambiguous terms or one-word terms with multiple meanings, like “Apologies,” “Affect,” “Articulation.” The result is a more robust and controlled subject vocabulary.

• Automated indexing will use terms in the thesaurus that human indexers may have forgotten about, leading again to revised hierarchies in Synaptica, new Scope Notes, and instant feedback to indexers.
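
One simple way to find candidates for this kind of cleanup is to list one-word terms that have no Scope Note. The sketch below is a hypothetical helper, not a Synaptica feature. It assumes the thesaurus can be exported as a mapping from term to Scope Note text, and the sample Scope Note is invented. It cannot detect ambiguity itself, so it only shortlists terms for human review.

```python
def review_candidates(terms):
    """Return one-word terms that have no Scope Note, sorted alphabetically.

    `terms` maps each preferred term to its Scope Note ('' if none).
    """
    return sorted(term for term, note in terms.items()
                  if len(term.split()) == 1 and not note.strip())


terms = {
    "Affect": "",
    "Apologies": "",
    "Articulation": "Use for speech production only",  # invented note
    "Tax reform": "",
}
print(review_candidates(terms))
```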

Page 23: Synaptica Proquest Talk Taxonomy Boot Camp 2009

[email protected]     [email protected]