Text Analytics in Action: Using Text Analytics as a Toolset
TBC 4:15 p.m. - 5:00 p.m.
Marjorie Hlava
Semantic enrichment / Semantic Fingerprinting
Abstract
• Big data inferences are increasingly used to mine huge heaps of data.
• The applications are endless.
• However, those inferences do not work well when many lines go to a single bubble. The lines and relationships must be drawn between concepts, not simply between words.
• Text analytics is a powerful tool, but it is a means to an end, not the end itself.
• The important work is in the interpretation of the data.
• This session outlines a highly accurate and efficient approach and provides a case study of the application.
Outline of the talk
• Using text analytics in term extraction
– 3 examples
– Pattern recognition
– String tagging
– Taxonomy control
• Achieving Synonymy
• Now what do I do with it?
Term clouds
• Good place to start
• Show concept landscape
• Basis =
– Levenshtein distances
– N-grams
• Redundant concepts, separately shown
• No disambiguation
• Not direct XML tagging
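The two measures named as the basis of a term cloud can be sketched as follows. This is a minimal illustration, not the tool shown on the slides: the function names `levenshtein` and `word_ngrams` are hypothetical, and the n-grams here are word-level, which is an assumption about how candidate phrases are formed.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def word_ngrams(text: str, n: int) -> list[str]:
    """All runs of n consecutive words, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Small distances flag near-duplicate strings; n-grams give candidate phrases.
distance = levenshtein("Colombia", "Columbia")
phrases = word_ngrams("thin film sputtering", 2)
```

Both measures work on strings only, which is why a term cloud built on them shows redundant concepts separately and cannot disambiguate.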
Sample article
Normal text extraction
Near conceptual synonyms
Nonsensical suggestions
Small Taxonomy
Near synonym, conceptual duplicate
Refined presentation
Dependent concepts
Ontological dependencies
Achieving Synonymy
• Find like concepts
• Merge the terms
• Choose a preferred form
• Build term record
– Hierarchy
– Equivalence
– Associative
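The steps above can be sketched as a small data structure. This is an illustrative assumption, not the record format of any particular thesaurus tool: `TermRecord`, `merge_terms`, and `build_lookup` are hypothetical names, and the choice of "FARC" as the preferred form is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    preferred: str                                       # chosen preferred form
    equivalents: set[str] = field(default_factory=set)   # equivalence: synonyms
    broader: set[str] = field(default_factory=set)       # hierarchy: parents
    narrower: set[str] = field(default_factory=set)      # hierarchy: children
    related: set[str] = field(default_factory=set)       # associative links


def merge_terms(preferred: str, variants: list[str]) -> TermRecord:
    """Merge like concepts into one record under a preferred form."""
    record = TermRecord(preferred=preferred)
    record.equivalents.update(v for v in variants if v != preferred)
    return record


def build_lookup(records: list[TermRecord]) -> dict[str, str]:
    """Map every label, preferred or not, to its preferred form."""
    lookup = {}
    for record in records:
        for label in {record.preferred, *record.equivalents}:
            lookup[label.lower()] = record.preferred
    return lookup


farc = merge_terms("FARC", ["People's Armed Forces of Colombia", "FARC"])
lookup = build_lookup([farc])
```

With such a lookup, a document mentioning either label receives the same tag, which is the synonymy that plain string tagging lacks.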
Overview, Upload 7K documents, search for text string, add a tag, “Columbia”
“Colombian” – no stemming
Same document – different terms
Colombiana – record overlap
“FARC” – No Synonymy
“People’s Armed Forces of Colombia”, i.e., FARC, lacks synonymy, some doc overlap
Tag suite, no hierarchy, no equivalence, no combining tags for synonymy
Disambiguation
Bridge Structure
Bridge Dentistry
Bridge Game
Bridge Concept
Now what do I do with it?
• Tag documents
– Consistently
– Even depth of treatment
– Full breadth of conceptual area
• Insert concepts in full text or as linked data
• Implement in search
• Use for internal statistics and analysis
• Track industry trends
• Create semantic fingerprints
The AIP Thesaurus
Hierarchy
Term Record
The AIP Thesaurus: Rulebase
This article is about (among other things) degenerate stars.
The text string “degenerate stars” occurs zero times in the text of the article.
But since the rulebase is tuned to recognize certain other words appearing near the text “star” or “stars”, the article was correctly indexed.
The AIP Thesaurus: Rulebase
If the word “star” or “stars” appears in the same sentence as “degenerate” or “compact”, MAI applies the term “Degenerate stars” instead of just using “Stars”.
The AIP Thesaurus: Applications
Listing of the AIP Thesaurus terms in JATS. Includes the term, keyword-ID, weight, code.
Inline tagged terms (denoted by the highlighting). The keyword ID (kwd1.4) corresponds with the name in the previous screenshot.
HTML Header
Copyright © 2013 Access Innovations, Inc.
7. Content Recommender
More Articles on the same topic
Selected Article Search “thin film sputtering”
Grants available
Upcoming conferences on this topic
Authors working in this space
Taxonomy Driven Search Presentation
Navigate the full taxonomy “tree”
BROWSE
Auto-completion using the taxonomy
Guide the user
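Auto-completion from a taxonomy can be sketched as a prefix search over the sorted term labels. The labels below are illustrative, and `build_index` and `autocomplete` are hypothetical names, not part of any product shown on the slides.

```python
from bisect import bisect_left


def build_index(labels: list[str]) -> list[tuple[str, str]]:
    """Sort labels case-insensitively, keeping the display form."""
    return sorted((label.lower(), label) for label in labels)


def autocomplete(index: list[tuple[str, str]], prefix: str, limit: int = 5) -> list[str]:
    """Return up to `limit` taxonomy labels that start with the prefix."""
    key = prefix.lower()
    start = bisect_left(index, (key,))   # first entry >= the prefix
    matches = []
    for lowered, label in index[start:]:
        if not lowered.startswith(key):
            break                        # sorted order: no later matches
        matches.append(label)
        if len(matches) == limit:
            break
    return matches


index = build_index(["Stars", "Degenerate stars", "Thin films", "Thin film sputtering"])
suggestions = autocomplete(index, "thin")
```

Because suggestions come only from the controlled vocabulary, the user is steered toward terms that are actually used to tag the content.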
Copyright © 2005 - Access Innovations, Inc.
Taxonomy view
Thesaurus Term Record view
Suggested taxonomy descriptors
Visualization Strategies
Matrix Visualization
Software
Pattern Analysis: Domain Associations
Pattern Analysis: Gap Analyses
Summary
• Taxonomy tool box
• Text extraction / mining for terms
• Gather synonyms
• Disambiguate terms
• Look for gaps and over coverage
• Map all conceptual groupings
– Hierarchical, Associative, Equivalence
• Apply to content
• Leverage knowledge of the collection
Thank you
Marjorie M.K. Hlava, President
Access Innovations
The Semantic Enrichment Company
SMART CONTENT
About Access Innovations
Access Innovations are experts in content creation, enrichment, and conversion services. We provide services to semantically enrich and tag raw text into highly structured data. We deliver clean, well-formed, metadata-enriched content so our clients can reuse, repurpose, store, and find their knowledge assets. We go beyond the standards to build taxonomies and other data control structures as a solid foundation for your information. Our services and software allow organizations to use and present their information to both internal and external constituents by leveraging search, presentation, and e-commerce. We change search to found!
Quick Facts
• Founded in 1978
• Headquartered in Albuquerque, NM
• Privately held
• Delivered more than 2000 engagements