Text Analytics in Action: Using Text Analytics as a Toolset TBC 4:15 p.m. - 5:00 p.m. Marjorie Hlava Semantic enrichment / Semantic Fingerprinting

Abstract

• Big data inferences are increasingly used to mine huge heaps of data.
• The applications are endless.
• However, those inferences do not work well when many lines go to a single bubble. The lines and relationships must be drawn between concepts, not simply between words.
• Text analytics is a powerful tool, but it is a means to an end, not the end itself.
• The important work is in the interpretation of the data.
• This session outlines a highly accurate and efficient approach and provides a case study of the application.

Outline of the talk

• Using text analytics in term extraction
  – 3 examples
  – Pattern recognition
  – String tagging
  – Taxonomy control
• Achieving synonymy
• Now what do I do with it?

Term clouds

• Good place to start
• Show concept landscape
• Basis =
  – Levenshtein distances
  – N-grams
• Redundant concepts, separately shown
• No disambiguation
• Not direct XML tagging
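
The slide names Levenshtein distances and n-grams as the basis for term clouds. A minimal Python sketch of both follows; the function names `levenshtein` and `word_ngrams` are illustrative and are not taken from any tool shown in the talk. It also hints at why the slide lists "No disambiguation": a small edit distance says nothing about whether two strings name the same concept.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """All runs of n consecutive lowercased words in the text."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# One substitution apart, yet two different places.
distance = levenshtein("Colombia", "Columbia")

# Candidate two-word phrases for a term cloud.
bigrams = word_ngrams("thin film sputtering", 2)
```

Here `distance` is 1 and `bigrams` holds the two phrases "thin film" and "film sputtering". Neither measure knows which phrases are real concepts, which is the gap the later slides address.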

Sample article

Normal text extraction

Near conceptual synonyms

Nonsensical suggestions

Small Taxonomy

Near synonym, conceptual duplicate

Refined presentation

Dependent concepts

Ontological dependencies

Achieving Synonymy

• Find like concepts
• Merge the terms
• Choose a preferred form
• Build term record
  – Hierarchy
  – Equivalence
  – Associative
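
The steps above can be sketched as a small data structure. This is a hypothetical term record, not the record format of any product shown in the talk: the class name `TermRecord`, its field names, and the broader term used in the example are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    preferred: str                                        # the chosen preferred form
    broader: set[str] = field(default_factory=set)        # hierarchy: broader terms
    narrower: set[str] = field(default_factory=set)       # hierarchy: narrower terms
    non_preferred: set[str] = field(default_factory=set)  # equivalence: synonyms, variants
    related: set[str] = field(default_factory=set)        # associative: related terms


def merge_terms(preferred: str, variants: list[str]) -> TermRecord:
    """Merge like terms into one record under the preferred form."""
    record = TermRecord(preferred=preferred)
    record.non_preferred.update(v for v in variants if v != preferred)
    return record


record = merge_terms("FARC", ["FARC", "People's Armed Forces of Colombia"])
record.broader.add("Guerrilla organizations")  # hypothetical broader term
record.related.add("Colombia")
```

Keeping the variants in the equivalence field, rather than as separate tags, is what lets a later search on any variant resolve to the same concept.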

Overview, Upload 7K documents, search for text string, add a tag, “Columbia”

“Colombian” – no stemming

Same document – different terms

Colombiana – record overlap

“FARC” – No Synonymy

“People’s Armed Forces of Colombia”, i.e., FARC, lacks synonymy, some doc overlap

Tag suite: no hierarchy, no equivalence, no combining tags for synonymy
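
The string-tagging problem in the preceding slides can be reproduced in a few lines. This is a sketch under stated assumptions: the three documents are invented, and `string_tag` and `concept_tag` are hypothetical helpers, not functions of the tool in the screenshots.

```python
import re

# Invented sample documents.
docs = {
    "d1": "Talks between the Colombian government and FARC resumed.",
    "d2": "Colombia exports coffee.",
    "d3": "The People's Armed Forces of Colombia issued a statement.",
}


def string_tag(term: str, docs: dict[str, str]) -> set[str]:
    """Tag documents containing the exact string as whole words (no stemming)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return {doc_id for doc_id, text in docs.items() if pattern.search(text)}


def concept_tag(variants: list[str], docs: dict[str, str]) -> set[str]:
    """Tag documents matching any variant of one concept (an equivalence set)."""
    tagged: set[str] = set()
    for variant in variants:
        tagged |= string_tag(variant, docs)
    return tagged


# One tag per string: each variant yields its own, different document set.
separate = {term: string_tag(term, docs)
            for term in ["Colombia", "Colombian", "FARC"]}

# One tag per concept: the variants are combined before tagging.
farc_docs = concept_tag(["FARC", "People's Armed Forces of Colombia"], docs)
```

With separate string tags, "Colombia" and "Colombian" land on disjoint documents, and "FARC" misses the document that spells the name out. Combining variants under one concept recovers both.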

Disambiguation

Bridge Structure

Bridge Dentistry

Bridge Game

Bridge Concept
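
One simple way to separate the four senses of "bridge" is to look at the words around it. The sketch below assumes a hand-built list of cue words per sense; the cue lists and the `disambiguate` function are invented for illustration and are not the method shown in the talk.

```python
import re
from typing import Optional

# Hypothetical cue words for each sense of "bridge".
SENSE_CUES = {
    "Bridge (structure)": {"span", "river", "suspension", "traffic", "steel"},
    "Bridge (dentistry)": {"tooth", "teeth", "crown", "dentist", "implant"},
    "Bridge (game)": {"cards", "bid", "trick", "trump", "partner"},
    "Bridge (concept)": {"gap", "divide", "cultures", "generations"},
}


def disambiguate(sentence: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue word present there is nothing to decide on.
    return best if scores[best] > 0 else None


dental = disambiguate("The dentist fitted a bridge over the missing tooth.")
game = disambiguate("Her partner led a trump and took the trick at the bridge table.")
unknown = disambiguate("A bridge.")
```

The first sentence resolves to the dentistry sense, the second to the card game, and the third stays undecided because no cue word appears.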

Now what do I do with it?

• Tag documents
  – Consistently
  – Even depth of treatment
  – Full breadth of conceptual area
• Insert concepts in full text or as linked data
• Implement in search
• Use for internal statistics and analysis
• Track industry trends
• Create semantic fingerprints
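
The slides do not define a semantic fingerprint. The sketch below assumes one plausible reading: a fingerprint is the set of taxonomy terms applied to a document, each with a weight, and two fingerprints are compared by cosine similarity. The term names and weights are invented.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two weighted-term fingerprints."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty fingerprint matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical fingerprints: taxonomy term -> weight.
article = {"Thin films": 0.9, "Sputtering": 0.8}
candidate_1 = {"Thin films": 0.7, "Sputter deposition": 0.6}
candidate_2 = {"Degenerate stars": 1.0}

# Rank candidates by similarity to the selected article.
ranked = sorted(
    [("candidate_1", cosine(article, candidate_1)),
     ("candidate_2", cosine(article, candidate_2))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because the comparison runs on controlled terms rather than raw words, two documents score as similar only when they share indexed concepts, which is the kind of signal a "more articles on the same topic" feature needs.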

The AIP Thesaurus

Hierarchy, Term Record

The AIP Thesaurus: Rulebase

This article is about (among other things) degenerate stars.

The text string “degenerate stars” occurs zero times in the text of the article.

But because the rulebase is tuned to recognize certain other words appearing near the text “star” or “stars”, the article was correctly indexed.

The AIP Thesaurus: Rulebase

If the word “star” or “stars” appears in the same sentence as “degenerate” or “compact”, MAI applies the term “Degenerate stars” instead of just using “Stars”.
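
The rule described above can be sketched in Python. This is not MAI's rule syntax or implementation; `index_stars` is a hypothetical function, and the sentence splitter is a deliberately naive assumption.

```python
import re


def index_stars(text: str) -> set[str]:
    """Apply the same-sentence rule for 'star' / 'stars' described on the slide."""
    terms: set[str] = set()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & {"star", "stars"}:
            if words & {"degenerate", "compact"}:
                terms.add("Degenerate stars")  # the more specific term
            else:
                terms.add("Stars")             # fall back to the general term
    return terms


same_sentence = index_stars(
    "White dwarfs are compact stars supported by electron pressure."
)
different_sentences = index_stars(
    "The star is bright. Matter in its core is degenerate."
)
```

The first text is indexed with "Degenerate stars" even though that string never occurs in it. In the second, the cue word sits in a different sentence, so only "Stars" is applied.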

The AIP Thesaurus: Applications

Listing of the AIP Thesaurus terms in JATS. Includes the term, keyword-ID, weight, code.

Inline tagged terms (denoted by the highlighting). The keyword ID (kwd1.4) corresponds with the name in the previous screenshot.

HTML Header

Copyright © 2013 Access Innovations, Inc.

7. Content Recommender

More Articles on the same topic

Selected Article Search “thin film sputtering”

Grants available

Upcoming conferences on this topic

Authors working in this space

Taxonomy Driven Search Presentation

Navigate the full taxonomy “tree”

BROWSE

Auto-completion using the taxonomy

Guide the user

Copyright © 2005 - Access Innovations, Inc.

Taxonomy view

Thesaurus Term Record view

Suggested taxonomy descriptors

Visualization Strategies

Matrix Visualization

Software

Pattern Analysis: Domain Associations

Pattern Analysis: Gap Analyses

Summary

• Taxonomy tool box
• Text extraction / mining for terms
• Gather synonyms
• Disambiguate terms
• Look for gaps and over coverage
• Map all conceptual groupings
  – Hierarchical, Associative, Equivalence
• Apply to content
• Leverage knowledge of the collection
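
The "gaps and over coverage" step can be approximated by counting how often each taxonomy term is applied across a tagged collection. The sketch below is one assumed interpretation: a gap is a term applied to no document, and over coverage is a term applied to more than a chosen share of documents. The function name, the threshold, and the sample data are all invented.

```python
from collections import Counter


def coverage_report(taxonomy: list[str],
                    tagged_docs: dict[str, list[str]],
                    over_share: float = 0.5) -> tuple[list[str], list[str]]:
    """Return (gaps, over_covered) terms for a tagged collection."""
    # Count each term at most once per document.
    counts = Counter(term for tags in tagged_docs.values() for term in set(tags))
    total = len(tagged_docs)
    gaps = sorted(t for t in taxonomy if counts[t] == 0)
    over = sorted(t for t in taxonomy if total and counts[t] / total > over_share)
    return gaps, over


taxonomy = ["Stars", "Degenerate stars", "Thin films", "Sputtering"]
tagged_docs = {
    "a": ["Stars"],
    "b": ["Stars", "Degenerate stars"],
    "c": ["Stars", "Thin films"],
}
gaps, over = coverage_report(taxonomy, tagged_docs)
```

In this sample, "Sputtering" is never applied (a gap), while "Stars" is applied to every document, which suggests the term is too broad and may need narrower terms beneath it.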

Thank you

Marjorie M.K. Hlava, President, Access Innovations

[email protected]

The Semantic Enrichment Company

SMART CONTENT

About Access Innovations

Access Innovations are experts in content creation, enrichment, and conversion services. We provide services to semantically enrich and tag raw text into highly structured data. We deliver clean, well-formed, metadata-enriched content so our clients can reuse, repurpose, store, and find their knowledge assets. We go beyond the standards to build taxonomies and other data control structures as a solid foundation for your information. Our services and software allow organizations to use and present their information to both internal and external constituents by leveraging search, presentation, and e-commerce. We change search to found!

Quick Facts

• Founded in 1978
• Headquartered in Albuquerque, NM
• Privately held
• Delivered more than 2000 engagements