Enabling Exploration Through Text Analytics

Daniel Tunkelang, Chief Scientist, Endeca

overview

information seeking tools need to support exploration

text analytics can help

you can do this here and now

real-world information seeking examples

• looking for health information

• looking for work-related information

reminder: search and text analytics are a means, not an end

example 1: looking for health information

six months into my wife’s pregnancy, we discovered that she had gestational diabetes

how to learn more?

google: the default option for most

in government we trust: fda.gov

maybe the private sector knows best: webmd

success – and a sticky site

example 2: looking for work-related information

need to ramp up summer interns on text mining

how to find a good book?

let’s try google again

google: the gateway to wikipedia?

the library of congress (loc.gov)

triangle research libraries: next-gen catalog

faceted search enables query refinement

take-away #1

exploratory search support: a must-have for many information needs

text analytics

• categorization
• named entity detection
• term extraction
• sentiment analysis

vague term, lots of see-alsos: text mining, information extraction, content enrichment

newssift: text analytics enabling exploration

categorization

named entity detection

term extraction

sentiment analysis

exploring the news about facebook

facebook: the good

Social Utility

Iphone Application

facebook: the bad

Criminal Behavior

Litigation And Settlement

take-away #2

text analytics enable exploratory search

text analytics is here and now

lots of off-the-shelf options

and more!

caveats

• rule-based techniques are domain-specific

• statistical techniques rely on trained models

• plan for errors, inconsistency

• document vs. corpus analysis

Person                   | Location           | Organization
ABDUL-KARIM KHALAF (1)   | ALTOONA, PA (1)    | ABC News Inc. (1)
ABDULRAHMAN ABDULLAH (1) | Afghanistan (7)    | Air Force (1)
AL GORE (1)              | Africa (5)         | Amazon.com Inc. (1)
ALEX TREBEK (1)          | Akihabara (1)      | American Airlines Inc. (1)
ALI HASSAN AL (1)        | Alaska (3)         | Apple (1)
AMANDA MARCOTTE (1)      | Allegheny (1)      | Arctic National Wildlife Refuge (1)
AMY WINEHOUSE (1)        | Americas (17)      | Arianna Huffington (1)
ANDERS ERICSSON (1)      | Appalachia (1)     | Australian Liberal Party (1)
ANDREW LLOYD WEBBER (1)  | Argentina (1)      | Bad News Bears (1)
ANTHONY MWANGI (1)       | Arizona (11)       | Bear Stearns (2)
ANTONIN SCALIA (1)       | Arkansas (7)       | Big Apple Companies (1)
ARYE BARAK (1)           | Arlington, Va. (2) | BioDiversity Research Institute (1)
Aaron Sorkin (1)         | Arrest (1)         | Bloomberg LP (3)
Abbie Hoffman (1)        | Asia (1)           | Bob Dole (1)
Abe Lincoln (1)          | Atlanta (2)        | Bocuse d’Or World Cuisine Contest (1)
Abe Weiss (1)            | Austin (1)         | Boston Globe (1)
Abraham Lincoln (1)      | Austin, Texas (1)  | Boston Tea Party (1)
Adlai Stephenson (1)     | Australia (1)      | Budweiser (1)

problems with entity extraction

• moderate precision, but low recall
• not just noisy, but inconsistent
• corpus analysis can help!

Arrest (1)

Asia (1)

ALTOONA, PA (1)

Abe Lincoln (1)

Bob Dole (1)

Boston Tea Party (1)

Abraham Lincoln (1)
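one way corpus analysis might help with inconsistency, sketched under assumptions: pool the per-document extractions, fold case variants together, and give each name the entity type it receives most often across the corpus. the sample mentions and the majority-vote rule are hypothetical illustrations, not the method behind these slides.

```python
from collections import Counter, defaultdict

def consolidate(mentions):
    """Map each case-folded name to (most frequent entity type, total count).

    mentions: iterable of (surface_form, entity_type) pairs pooled over the corpus.
    """
    votes = defaultdict(Counter)
    for surface, etype in mentions:
        # fold case and collapse whitespace so "BOB DOLE" and "Bob Dole" merge
        key = " ".join(surface.lower().split())
        votes[key][etype] += 1
    return {
        name: (types.most_common(1)[0][0], sum(types.values()))
        for name, types in votes.items()
    }

# hypothetical per-document extractions (not data from the talk)
mentions = [
    ("Bob Dole", "Organization"),
    ("Bob Dole", "Person"),
    ("BOB DOLE", "Person"),
    ("Altoona, PA", "Location"),
    ("ALTOONA, PA", "Location"),
]
resolved = consolidate(mentions)
print(resolved["bob dole"])  # ('Person', 3)
```

case folding alone would not merge "Abe Lincoln" with "Abraham Lincoln"; that would need an alias table or similar, which this sketch leaves out.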

look for ways to cheat!

(chart: precision vs. recall)

division of labor

people supply vocabulary

machine annotates documents

http://www.precolumbianwomen.com/images/inca-labor.10.gif

example: ACM digital library

• opportunity
  – repository of (sometimes) author-tagged documents
  – high-precision tags: very few false positives

• challenge
  – poor reuse of vocabulary: most tags unique
  – low-recall tags: 90% false negatives

as is, tags were not useful for exploration
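to make the two numbers concrete: precision is the share of assigned tags that are correct, and recall is the share of correct tags that were actually assigned. a minimal sketch with made-up document ids; reading "90% false negatives" as recall of about 0.1 is an assumption.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for one tag, given sets of document ids."""
    true_positives = len(assigned & relevant)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# hypothetical: 10 documents are about "text mining", but only one is tagged
relevant = set(range(10))
assigned = {0}
precision, recall = precision_recall(assigned, relevant)
print(precision, recall)  # 1.0 0.1
```

a tag like this is trustworthy where it appears, but it misses nine of the ten documents a user exploring the topic would want to see.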

solution

• bootstrap on author-supplied tags

• prune 600K+ tags to 10K by
  – imposing frequency threshold
  – normalizing by case and singular/plural
  – eliminating infrequent subphrases

• mine documents using resulting vocabulary

• manually validate most frequently assigned tags
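the pruning steps above could look roughly like this. it is a sketch, not the pipeline used for the ACM digital library: the singularization rule is deliberately crude, the threshold and sample tags are hypothetical, and the subphrase step is omitted because the slides do not say how it was done.

```python
from collections import Counter

def normalize(tag):
    """Lowercase, collapse whitespace, and crudely singularize the last word."""
    words = tag.lower().split()
    if not words:
        return ""
    last = words[-1]
    if last.endswith("ies") and len(last) > 4:
        last = last[:-3] + "y"      # queries -> query
    elif (last.endswith("s") and len(last) > 3
          and not last.endswith(("ss", "us", "is"))):
        last = last[:-1]            # engines -> engine
    words[-1] = last
    return " ".join(words)

def prune(raw_tags, min_count=2):
    """Keep normalized tags that occur at least min_count times."""
    counts = Counter(normalize(t) for t in raw_tags)
    counts.pop("", None)            # drop tags that were empty or whitespace
    return {tag: n for tag, n in counts.items() if n >= min_count}

# hypothetical author-supplied tags
raw_tags = ["Search Engines", "search engine", "Faceted Search",
            "faceted search", "FACETED SEARCH", "my one-off tag"]
vocabulary = prune(raw_tags, min_count=2)
print(vocabulary)  # {'search engine': 2, 'faceted search': 3}
```

normalizing before counting matters: variants that would each fall below the threshold on their own survive once they are merged.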

example: a search for boeing

it’s a HITS!

if you prefer sports to computer science

• no author-supplied tags

• use search logs instead

• supplement with authority files
  – team names
  – player names

• mine documents using resulting vocabulary
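a sketch of how a vocabulary might be assembled from search logs plus authority files and then used to annotate a document. the log, the authority entries, the threshold, and the whole-phrase matching rule are all hypothetical; a production system would likely handle aliases and nicknames as well.

```python
import re
from collections import Counter

def build_vocabulary(queries, authority_terms, min_count=2):
    """Union of frequent log queries and authority-file entries, lowercased."""
    counts = Counter(" ".join(q.lower().split()) for q in queries)
    frequent = {q for q, n in counts.items() if n >= min_count}
    return frequent | {t.lower() for t in authority_terms}

def annotate(text, vocabulary):
    """Return the vocabulary terms that occur in text as whole words or phrases."""
    lowered = text.lower()
    return sorted(
        term for term in vocabulary
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    )

# hypothetical search log and authority file
queries = ["roger clemens", "Roger Clemens", "steroids", "steroids", "tickets"]
authority = ["Boston Red Sox", "New York Yankees"]
vocabulary = build_vocabulary(queries, authority)

doc = "Roger Clemens pitched for the Boston Red Sox before joining the Yankees."
print(annotate(doc, vocabulary))  # ['boston red sox', 'roger clemens']
```

note that "the Yankees" does not match the authority entry "New York Yankees"; exact phrase matching keeps precision high at the cost of recall.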

roger clemens, then and now

pivoting to a different view

take-away #3

this is not vaporware; text analytics to enable exploration is available here and now

looking forward

• better tags are the beginning, not the end

• improve with manual and automatic processing

• give users control over precision / recall trade-off

• help users and content creators help you

in closing

exploratory search = must-have, not nice-to-have

text analytics are a key enabler

the technology is real, here, and now

thank you…and come to SIGIR!

communication 1.0

email: dt@endeca.com

communication 2.0

blog: http://thenoisychannel.com

twitter: http://twitter.com/dtunkelang

SIGIR: July 19-23 in Boston; Industry Track on July 22nd!
