Taxonomies for Text Analytics and Auto-indexing

  • Published on
    20-Jan-2015

  • View
    1.952

  • Download
    1

Embed Size (px)

DESCRIPTION

Presentation given at Text Analytics World Conference, Boston, MA, October 2012

Transcript

<ul><li> 1. Taxonomies for Text Analyticsand Auto-IndexingHeather HeddenHedden Information ManagementText Analytics World, Boston, MAOctober 4, 2012</li></ul> <p> 2. Introduction: Text Analytics and TaxonomiesText analytics can be used to index content without theuse of taxonomies/controlled vocabularies.Text analytics can be used to index content withtaxonomies/controlled vocabularies for better results.Text analytics can generate terms from text to be used:1. As a source to manually build taxonomies2. To auto-categorize/classify content against existing taxonomies2 2012 Hedden Information Management 3. OutlineTaxonomy Introduction: Definitions &amp; TypesTaxonomy Introduction: Purposes &amp; BenefitsSynonyms for TermsAuto-Indexing and Auto-CategorizationTaxonomies for Auto-CategorizationTaxonomy Resources3 2012 Hedden Information Management 4. OutlineTaxonomy Introduction: Definitions &amp; TypesTaxonomy Introduction: Purposes &amp; BenefitsSynonyms for TermsAuto-Indexing and Auto-CategorizationTaxonomies for Auto-CategorizationTaxonomy Resources4 2012 Hedden Information Management 5. Definitions &amp; TypesBroad designationsSpecific types(essentially the same meaning (different meanings): used interchangeably):Controlled Vocabularies (CV) Term Lists/Pick listsKnowledge Organization Systems Synonym RingsTaxonomies Authority Files TaxonomiesHierarchicalFaceted Thesauri Ontologies5 2012 Hedden Information Management 6. Definitions &amp; TypesBroad Designations:Controlled vocabulary, knowledge organization system, taxonomyAn authoritative, restricted list of terms (words or phrases)Each term for a single unambiguous concept(synonyms/nonpreferred terms, as cross-references, may beincluded)Policies (control) for who, when, and how new terms can be addedMay or may not have structured relationships between termsTo support indexing/tagging/metadata management of content tofacilitate content management and retrieval6 2012 Hedden Information Management 7. Definitions &amp; TypesSpecific typesTerm Lists/Pick listsSynonym RingsAuthority FilesTaxonomiesThesauriOntologies7 2012 Hedden Information Management 8. Definitions &amp; Types: Specific TypesTerm ListA simple list of termsLacking synonyms, usually shortenough for browsingOften displayed in drop-down scrollboxes8 2012 Hedden Information Management 9. Definitions &amp; Types: Specific TypesSynonym RingA controlled vocabulary withsynonyms or near-synonyms foreach conceptNo designated preferred term:All terms are equal and point toeach other, as in a ring.Table for terms does not displayto the user9 2012 Hedden Information Management 10. Definitions &amp; Types: Specific TypesTaxonomyA controlled vocabulary with internal structure.Terms are grouped or have hierarchical relationships.Emphasizes categories and classification for end-userdisplay.May or may not have synonyms. Hierarchical all terms have broader/narrower relationships to each other to form one big hierarchy Faceted terms are grouped by attribute/aspect and are used in combination for indexing and search10 2012 Hedden Information Management 11. Definitions &amp; Types: Specific Types Leisure and culture. Arts and entertainment venues. . Museums and galleriesHierarchicalTaxonomy. Childrens activities. Culture and creativityexample(UKsIPSV): . . Architecture. . Crafts Top Level Headings . . Heritage. . Literature Business and industry. . Music Economics and finance. . Performing arts Education and skills . . Visual arts Employment, jobs and careers . Entertainment and events Environment. Gambling and lotteries Government, politics and public. Hobbies and interests administration . Parks and gardens Health, well-being and care. Sports and recreation Housing. . Team sports Information and communication. . . Cricket International affairs and defence. . . Football Leisure and culture. . . Rugby Life in the community. . Water sports People and organisations . . Winter sports Public order, justice and rights . Sports and recreation facilities Science, technology and innovation . Tourism Transport and infrastructure . . Passports and visas. Young peoples activities11 2012 Hedden Information Management 12. Definitions &amp; Types: Specific TypesFacetedTaxonomyexamples 2012 Hedden Information Management 13. Outline Taxonomy Introduction: Definitions &amp; Types Taxonomy Introduction: Purposes &amp; Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources13 2012 Hedden Information Management 14. Purposes &amp; Benefits1. Controlled vocabulary aspect: Brings together different wordings (synonyms) for the same concept and disambiguates termsHelps people search for information by differentnamesHelps people retrieve matching concepts, not justwords2. Taxonomy or thesaurus structure aspect: Organizes information into a logical structureHelps people browse or navigate for information 2012 Hedden Information Management 15. Purposes &amp; BenefitsHelps people search for information by different namesThere are multiple ways to describe the same thing.A controlled vocabulary gathers synonyms, acronyms,variant spellings, etc.Without a controlled vocabulary keyword searches wouldmiss some relevant documents, due to: Use of different words (e.g. Attorneys, instead of Lawyers) Use of different phrases (e.g. Deceptive acts or practices instead of Unfair practices) User does not knowing the spelling of unusual names (e.g. Condoleezza Rice) 2012 Hedden Information Management 16. Purposes &amp; Benefits 2012 Hedden Information Management 17. Purposes &amp; BenefitsHelps people retrieve matching concepts, not just wordsA single term may have multiple meanings.Controlled vocabulary terms can be clarified/ disambiguated.Without a controlled vocabulary, too many irrelevantdocuments would be retrieved.A search restricted on the controlled vocabulary retrievesconcepts not just words. Excludes document with mere text-string matches (e.g. monitors for computers, not the verb observes) 2012 Hedden Information Management 18. Outline Taxonomy Introduction: Definitions &amp; Types Taxonomy Introduction: Purposes &amp; Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources18 2012 Hedden Information Management 19. Synonyms for TermsSupports search in most controlled vocabulary types:synonym rings, authority files, thesauri, (some taxonomies)Anticipating both: varied user search string entries varied forms in the text for the same contentFor both manual and automated indexingA concept may have any number of synonyms, but a synonym can point to only one preferred termVaried synonym sources: Search analytics records Interviews and use cases Legacy print indexes Obvious patterns (acronyms, phrase inversions, etc.) 2012 Hedden Information Management 20. Synonyms for TermsNot all are synonyms.Types include:synonyms: Cars USE Automobilesnear-synonyms: Junior high USE Middle schoolvariant spellings: Defence USE Defenselexical variants: Hair loss USE Baldnessforeign language proper nouns: Luftwaffe USE German Air Forceacronyms/spelled out forms: UN USE United Nationsscientific/technical names: Neoplasms USE Cancerphrase variations (in print): Buses, school USE School busesantonyms: Misbehavior USE Behaviornarrower terms: Alcoholism USE Substance abuseAlso called variant terms, equivalence terms, non-preferred 2012 Hedden Information Management terms 21. Outline Taxonomy Introduction: Definitions &amp; Types Taxonomy Introduction: Purposes &amp; Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources21 2012 Hedden Information Management 22. Auto-Indexing and Auto-CategorizationChoosing human vs. automated indexing:Human indexing Automated indexing Manageable number of docs Very large number of docs Higher accuracy in indexing Greater speed in indexing May include non-text files Text files only Invest in people Invest in technology Low-tech: can build your own High-tech: must purchaseindexing UIauto-indexing software Internal control Software vendor relationship22 2012 Hedden Information Management 23. Auto-Indexing and Auto-CategorizationAutomated Indexing Technologies Entity extraction Text analytics and text mining, based on NLP Auto-categorization23 2012 Hedden Information Management 24. Auto-Indexing and Auto-CategorizationChoosing auto-indexing methods:Information extraction/text analytics Auto-categorization For varied and undifferentiated For consistent doc types/formatsdocument types For structured or pre-tagged For unstructured contentcontent For varied subject areas For limited/focused subject Terms may or may not be displayed Displays categories to user Not necessarily with taxonomy Leverages a taxonomyCombine both text analytics and auto-categorization:1. Text analytics to extract concepts from unstructured varied content2. Auto-categorization to apply benefits of a taxonomy/controlled vocabulary24 2012 Hedden Information Management 25. Auto-Indexing and Auto-Categorizationauto-categorization = auto-classification = automated subject indexingAuto-categorization makes use of the controlled vocabulary matched with extracted terms.Primary auto-categorization technologies: 1. Machine-learning and training documents 2. Rules-based categorization 2012 Hedden Information Management 26. Auto-Indexing and Auto-CategorizationMachine-learning based auto-categorization:Automatically indexes based on previous examplesComplex mathematical algorithms are createdTaxonomist must then provide multiple representative sampledocuments for each CV term to train the system.Best if pre-indexed records exist (i.e. converting from human toautomated indexing), then hundreds of varied documents can beused for each term. 26 2012 Hedden Information Management 27. Auto-Indexing and Auto-CategorizationRules-based auto-categorization Taxonomist must write rules for each CV term Like advanced Boolean searching or regular expressionsExample:BushIF (INITIAL CAPS AND (MENTIONS "president*" OR WITH administration*" ORAROUND "white house" OR NEAR "george"))USEU.S. PresidentELSEUSE ShrubsENDIF Data Harmony27 2012 Hedden Information Management 28. Outline Taxonomy Introduction: Definitions &amp; Types Taxonomy Introduction: Purposes &amp; Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources28 2012 Hedden Information Management 29. Taxonomies for Auto-CategorizationNo matter which method of auto-indexing, auto-indexing impactscontrolled vocabulary creation: Continual update work is needed (new training documents or new rules) for each new term created. Feeding training documents is easier for non-information professionals, than is writing rules29 2012 Hedden Information Management 30. Taxonomies for Auto-CategorizationTaxonomies designed for auto-categorization:Need more, varied synonym/variant termsNeed variant terms of different parts of speechCannot have subtle differences between preferred termsAvoid creating many action-termsTaxonomy needs to be more content-tailored, content-based30 2012 Hedden Information Management 31. Taxonomies for Auto-CategorizationSynonym/variant term differences:For human-indexingFor auto-categorizationPresidential candidates Presidential candidateCandidates, presidentialPresidential candidacyCandidate for presidentCandidacy for presidentPresidential hopefulRunning for presidentCampaigning for presidentPresidential nominee31 2012 Hedden Information Management 32. Outline Taxonomy Introduction: Definitions &amp; Types Taxonomy Introduction: Purposes &amp; Benefits Synonyms for Terms Auto-Indexing and Auto-Categorization Taxonomies for Auto-Categorization Taxonomy Resources32 2012 Hedden Information Management 33. Taxonomy Resources ANSI/NISO Z39.19 (2005) Guidelines for Construction, Format, and Management of Monolingual Controlled Vocabularies. Bethesda, MD: NISO Press. www.niso.org Hedden, Heather. (2010) The Accidental Taxonomist. Medford, NJ: Information Today Inc. www.accidental-taxonomist.com American Society for Indexing: Taxonomies and Controlled Vocabularies Special Interest Group www.taxonomies-sig.org Special Libraries Association (SLA): Taxonomy Division http://wiki.sla.org/display/SLATAX Taxonomy Community of Practice discussion group http://finance.groups.yahoo.com/group/TaxoCoP "Taxonomies and Controlled Vocabularies Simmons College Graduate School of Library and Information Science Continuing Education Program, 5 weeks. $250. November 2012, January 2013. http://alanis.simmons.edu/ceweb/byinstructor.php#93333 2012 Hedden Information Management 34. Questions/ContactHeather HeddenHedden Information ManagementCarlisle, MAheather@hedden.net978-467-5195www.hedden-information.comwww.linkedin.com/in/heddentwitter.com/hheddenaccidental-taxonomist.blogspot.com34 2012 Hedden Information Management </p>