
Page 1: Text Analytics on 2 Million Documents: A Case Study

Text Analytics on 2 Million Documents: A Case Study

Plus, An Introduction into Keyword Extraction

Alyona Medelyan

Text Analytics World, Boston, October 3-4, 2012

Page 2: Text Analytics on 2 Million Documents: A Case Study

“Because he could” by D. Morris, E. McGann

“Still stripping after 25 years” by E. Burns

What are these books about?

“Glut” by A. Wright

Only metadata will tell…

Page 3: Text Analytics on 2 Million Documents: A Case Study

What this talk will cover:

• Who I am & my relation to the topic
• What types of keyword extraction are out there
• How keyword extraction works
• How accurate keywords can be
• How to analyze 2 million documents efficiently

Page 4: Text Analytics on 2 Million Documents: A Case Study

nzdl.org/kea/ maui-indexer.googlecode.com

Maui: multi-purpose automatic topic indexing

My Background

SemEval-2

2005-2009: PhD thesis on keyword extraction, “Human-competitive automatic topic indexing”

2010: co-organized the keyword extraction competition SemEval-2, Track 5, “Automatic keyphrase extraction from scientific articles”

2010-2012: leading the R&D of Pingar’s text analytics API. Pingar API features: keyword & named entity extraction, summarization, etc.

@zelandiya · medelyan.com

Page 5: Text Analytics on 2 Million Documents: A Case Study

Document

Easy to extract: title, file type & location, creation & modification date, authors, publisher

Difficult to extract: keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned

Metadata

Text Analytics

Findability is ensured with the help of metadata

Page 6: Text Analytics on 2 Million Documents: A Case Study

What can text analytics determine from text?

[Figure: a page of running text annotated with what text analytics can determine: sentiment; keywords & tags (the focus of this presentation); genre; categories & taxonomy terms; entities, such as names, patterns and biochemical entities.]

Page 7: Text Analytics on 2 Million Documents: A Case Study

Types of keyword extraction (or topic indexing)

• Subject headings in libraries
   • general, with Library of Congress Subject Headings
   • domain-specific, in PubMed with MeSH

• Keyphrases in academic publications

• Tags in folksonomies
   • by authors on Technorati
   • by users on Del.icio.us

controlled indexing (categories, taxonomy terms) ↔ free indexing (keywords, tags)

Page 8: Text Analytics on 2 Million Documents: A Case Study

Free indexing (e.g. keywords, tags): inconsistent, no control, no semantics, ad hoc

Controlled indexing (e.g. LCSH, ACM, MeSH): restricted, centrally controlled, inflexible, not always available

Page 9: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Keywords

How keyword extraction works

NEJM usually has the highest impact factor of the journals of clinical medicine.

highest, highest impact, highest impact factor

1. Extract phrases using the sliding window approach

ignore stopwords

NEJM

Alternative approach: a) assign part-of-speech tags, b) extract valid noun phrases (NPs)

impact, impact factor…
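To make step 1 concrete, here is a minimal sketch of sliding-window candidate extraction in Python. The stopword list, the window size and the boundary rule (candidates may not start or end with a stopword) are illustrative assumptions, not the exact rules used by Kea or Maui.

```python
import re

# Illustrative stopword list; a real extractor would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "has", "usually"}

def extract_candidates(text, max_len=3):
    """Slide a window of 1..max_len words over the text and keep phrases
    that do not start or end with a stopword."""
    words = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    candidates = set()
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break
            if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
                continue
            candidates.add(" ".join(phrase))
    return candidates

print(extract_candidates(
    "NEJM usually has the highest impact factor of the journals of clinical medicine."))
# Yields e.g. 'NEJM', 'highest', 'highest impact factor', 'impact factor',
# 'journals of clinical', 'clinical medicine', 'medicine', ...
```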

Page 10: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Keywords

How keyword extraction works

NEJM usually has the highest impact factor of the journals of clinical medicine.

2. Normalize phrases (case folding, stemming, etc.)

Candidate               Free indexing (normalized)   Controlled indexing
NEJM                    nejm                         New England J of Med
highest                 high                         -
highest impact factor   high impact factor           -
impact                  impact                       -
impact factor           impact factor                Impact Factor
journals                journal                      Journal
journals of clinical    journal of clinic            -
clinical                clinic                       Clinic
clinical medicine       clinic medic                 Medicine
medicine                medic                        Medicine
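A minimal sketch of step 2 for the free-indexing route, using case folding plus NLTK's Porter stemmer (an assumption; the slide's example output comes from a different stemmer, so the exact stems differ). Mapping to a controlled vocabulary would additionally need a thesaurus lookup, which is not shown here.

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def normalize(phrase):
    """Case-fold and stem every word so that surface variants such as
    'journals' and 'Journal' collapse onto the same candidate key."""
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

for c in ["NEJM", "impact factor", "journals", "clinical medicine", "medicine"]:
    print(f"{c:20} -> {normalize(c)}")
# 'clinical medicine' and 'medicine' now end in the same stem, so their
# occurrences can be counted together as one candidate.
```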

Page 11: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Properties → Keywords

How keyword extraction works

1. Frequency: number of occurrences (incl. synonyms)

2. Position: beginning/end of a document, title, headers

3. Phrase length: longer means more specific

4. Similarity: semantic relatedness to other candidates

5. Corpus statistics: how prominent the phrase is in this particular text compared to the corpus as a whole (e.g. TF×IDF)

6. Popularity: how often people select this candidate

7. Part of speech pattern: some patterns are more common
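A minimal sketch computing three of these properties (frequency, position of first occurrence, phrase length) per candidate; it reuses the hypothetical `extract_candidates` helper from the earlier sketch and counts occurrences by naive substring matching for brevity.

```python
def candidate_features(text, candidates):
    """Return simple per-candidate features: number of occurrences,
    relative position of the first occurrence (0.0 = very start of the
    document), and phrase length in words."""
    lowered = text.lower()
    features = {}
    for cand in candidates:
        key = cand.lower()
        features[cand] = {
            "frequency": lowered.count(key),                            # 1. frequency
            "first_occurrence": lowered.find(key) / max(len(text), 1),  # 2. position
            "length": len(cand.split()),                                # 3. phrase length
        }
    return features

# Example: features = candidate_features(doc_text, extract_candidates(doc_text))
```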

Page 12: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Properties → Scoring → Keywords

How keyword extraction works

Heuristics

A formula that combines the most powerful features

• requires accurate crafting
• performs equally well, or slightly worse, across various domains

Supervised machine learning

Train a model from manually indexed documents

• requires training data
• performs really well on docs that are similar to the training data, but poorly on dissimilar ones
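The sketch below contrasts the two scoring routes: a hand-crafted formula over candidate features on one side, and a supervised model trained on manually indexed examples on the other. The feature weights, the toy training data and the choice of a scikit-learn Naive Bayes learner are illustrative assumptions, not the actual Kea/Maui models.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # illustrative choice of learner

# --- Heuristic scoring: craft a formula over the candidate properties ---
def heuristic_score(f):
    # Made-up weights; a real formula needs careful tuning per use case.
    return 1.5 * f["frequency"] + 2.0 * (1 - f["first_occurrence"]) + 0.5 * f["length"]

# --- Supervised scoring: learn the weighting from manually indexed docs ---
# Each row: [frequency, first_occurrence, length]; y = 1 if a human picked it.
X_train = np.array([[3, 0.05, 2], [1, 0.80, 1], [5, 0.10, 2], [1, 0.95, 3]])
y_train = np.array([1, 0, 1, 0])

model = GaussianNB().fit(X_train, y_train)

X_new = np.array([[4, 0.02, 2]])          # an unseen candidate's features
print(model.predict_proba(X_new)[0][1])   # probability it is a keyword
```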

Page 13: Text Analytics on 2 Million Documents: A Case Study

How accurate is keyword extraction?

• It’s subjective…
• But: the higher the indexing consistency, the better the search effectiveness (findability)

A – set of keyphrases 1
B – set of keyphrases 2
C – set of keyphrases in common

Consistency (Rolling) = 2C / (A + B)
Consistency (Hopper) = C / (A + B - C)

The two measures are related: Hopper = Rolling / (2 - Rolling)
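Both measures are easy to compute; a small sketch with made-up keyword sets, which also checks the Rolling/Hopper relation:

```python
def rolling_consistency(a, b):
    """Rolling: 2C / (A + B), with C = number of shared keyphrases."""
    c = len(a & b)
    return 2 * c / (len(a) + len(b))

def hopper_consistency(a, b):
    """Hopper: C / (A + B - C), i.e. the Jaccard overlap of the two sets."""
    c = len(a & b)
    return c / (len(a) + len(b) - c)

indexer1 = {"overweight", "nutrition policies", "taxes", "diet"}
indexer2 = {"overweight", "food policies", "taxes", "public health", "diet"}

r = rolling_consistency(indexer1, indexer2)
h = hopper_consistency(indexer1, indexer2)
print(round(r, 2), round(h, 2), round(r / (2 - r), 2))  # 0.67 0.5 0.5
```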
Page 14: Text Analytics on 2 Million Documents: A Case Study

Professional indexers’ keywords*

[Figure: cloud of Agrovoc terms selected by the indexers: overweight, overeating, feeding habits, nutritional requirements, human nutrition, nutrition policies, nutrition surveillance, nutritional physiology, nutrition status, food policies, foods, food intake, food consumption, diet, nutrient excesses, dietary guidelines, meal patterns, nutrition programs, public health, disease control, nutritional disorders, price policies, fiscal policies, prices, taxes, developing countries, globalization, weight reduction, urbanization, developed countries, price formation, energy value, direct taxation, regulations]

* 6 professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled “The global obesity problem”

Page 15: Text Analytics on 2 Million Documents: A Case Study

Comparison of 2 indexers

[Figure: the same Agrovoc term cloud, with the terms selected by Indexer 1 and by Indexer 2 marked and the Agrovoc relations between terms shown.]

Page 16: Text Analytics on 2 Million Documents: A Case Study

Comparison of 6 indexers & Kea

[Figure: the same Agrovoc term cloud, now marked with the selections of indexers 1-6 and of the Kea algorithm, with Agrovoc relations shown. Kea additionally suggests terms such as body weight, saturated fat, price fixing, policies and controlled prices.]

Page 17: Text Analytics on 2 Million Documents: A Case Study

Comparison of CS students* & Maui

* 15 teams of 2 students each assigned keywords to the same document, entitled “A safe, efficient regression test selection technique”

Page 18: Text Analytics on 2 Million Documents: A Case Study

Human vs. algorithm consistency

15 teams of 2 CS students vs. Maui, on 20 CS documents & Wikipedia as the vocabulary:

Method      Min   Avg   Max
Students     21    31    37
Maui         24    32    36

6 professional indexers vs. Kea, on 30 agricultural documents & the Agrovoc thesaurus:

Method          Min   Avg   Max
Professionals    26    39    47
Kea              24    32    38

CiteULike taggers vs. Maui, free indexing (each tagger had ≥ 2 co-taggers):

                          With other taggers   With Maui
330 taggers & 180 docs                    19          24
35 taggers & 140 docs                     38          35

Page 19: Text Analytics on 2 Million Documents: A Case Study

Text Analytics on 2 Million Documents: A Case Study

A collaboration with Gene Golovchinsky, fxpal.com/?p=gene

Page 20: Text Analytics on 2 Million Documents: A Case Study

The dataset

• CiteSeer: 1.7 million scientific publications, 110 GB
• Twitter: 490 million tweets per week, 84 GB
• Wikipedia: 3.6 million articles, 13 GB
• Britannica: 0.65 million articles, 0.3 GB
• ICWSM 2011 (news, blogs, forums, etc.): 2.1 TB (compressed!)

Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers, en.wikipedia.org/wiki/Wikipedia:Size_comparisons

Page 21: Text Analytics on 2 Million Documents: A Case Study

The task goal:
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement

For Gene’s collaborative search system Querium

Page 22: Text Analytics on 2 Million Documents: A Case Study

Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run it on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs

Our example:

• Standard laptop (4 cores, 8 GB RAM): 30 days
• Similar Rackspace VM: 46 days
• Threading reduces the time: 24 days
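The extrapolation itself is simple arithmetic; a back-of-the-envelope sketch with assumed sample timings:

```python
# Assumed numbers: time a 100-document sample, then scale up linearly.
sample_docs = 100
sample_minutes = 2.5            # measured wall-clock time for the sample (assumed)
total_docs = 1_700_000

total_minutes = sample_minutes / sample_docs * total_docs
print(f"Estimated: {total_minutes / 60 / 24:.0f} days")  # ~30 days at this rate
```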

Page 23: Text Analytics on 2 Million Documents: A Case Study

Step 2: Look into your data
Understand the nature of your data: look at samples, compute statistics. Speed up processing by removing anomalies & targeting the text analytics.

Our example:

30% of the docs exceed 50 KB (some ≈600 KB)

The most important phrases appear in the title, abstract, introduction and conclusions.

Only process the top 30% and the last 20% of each document; this reduces the time by 57%!
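A minimal sketch of that cropping rule; cutting by character count (rather than by word or section) is a simplifying assumption.

```python
def crop(text, head=0.30, tail=0.20):
    """Keep the start of a document (title, abstract, introduction) and its
    end (conclusions), and drop the middle."""
    n = len(text)
    head_chars, tail_chars = int(n * head), int(n * tail)
    if tail_chars == 0 or head_chars + tail_chars >= n:
        return text                      # very short documents are kept whole
    return text[:head_chars] + "\n...\n" + text[-tail_chars:]
```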

Page 24: Text Analytics on 2 Million Documents: A Case Study

Validate: Can we crop our documents?

Top N keywords in original doc   How many were found in the cropped doc
10                               91%
50                               80%
100                              75%
All                              64%

Top 20 keywords from the original document*: ontology, knowledge base, knowledge representation, Semantic Web, WordNet, knowledge engineering, predicate logic, artificial intelligence, semantic networks, natural language, first-order logic, ontology engineering, lexicon, conceptual graphs, higher-order logic, natural language processing, design rationale, block diagram

Top 20 keywords from the cropped document*: ontology, knowledge base, knowledge engineering, knowledge representation, WordNet, predicate logic, artificial intelligence, ontology engineering, semantic networks, Semantic Web, first-order logic, block diagram, dynamic systems, higher-order logic, conceptual graphs, modeling & simulation, universe of discourse, bond graph, lexicon

* Toward principles for the design of ontologies used for knowledge sharing

T. R. Gruber (1993)
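The validation amounts to extracting keywords from both versions and measuring how many of the original's top-N survive the crop. A sketch, assuming some `extract_keywords(text, n)` function that returns a ranked list (any of the extractors discussed earlier would do):

```python
def retained_fraction(original_text, cropped_text, n, extract_keywords):
    """Fraction of the top-n keywords of the full document that also appear
    among the keywords extracted from the cropped document."""
    top_original = extract_keywords(original_text, n)
    cropped = set(extract_keywords(cropped_text, n))
    return sum(1 for kw in top_original if kw in cropped) / n

# e.g. retained_fraction(doc, crop(doc), 10, extract_keywords)
# the study above measured 91% retention for the top 10 keywords
```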

Page 25: Text Analytics on 2 Million Documents: A Case Study

Step 3: Go cloud

Don’t be afraid to bring out the big guns

• Large Elastic Compute instance: 1000 docs × 4 threads = 30 min

• High-CPU Extra Large instance (8 virtual cores): 1000 docs × 24 threads = 6 min

Also: increase the number of machines

• 4 machines = 4 times faster, i.e. 50 instead of 200 hours (or 1 weekend!)
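A minimal sketch of the "1000 docs × N threads" pattern using a worker pool; `process_document` is a placeholder for reading a file and running the extraction on it, reusing the earlier hypothetical helpers. Treat it purely as an illustration of the pattern; for CPU-bound pure-Python extraction, a process pool would be the better choice.

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(path):
    # Placeholder: read one file and run keyword extraction on it.
    with open(path, encoding="utf-8", errors="ignore") as f:
        return extract_candidates(crop(f.read()))   # reuse the earlier sketches

def process_batch(paths, workers=24):
    """Run the per-document work concurrently; set `workers` to roughly the
    number of (virtual) cores on the instance. Swap in ProcessPoolExecutor
    for CPU-bound pure-Python work."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_document, paths))
```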

Page 26: Text Analytics on 2 Million Documents: A Case Study

How long would a human need to extract keywords from 1.7M docs?

Min per doc   Min          Hours    Days*    Years**
1             1,700,000    28,333    3,542       14
2             3,400,000    56,666    7,083       28
3             5,100,000    85,000   10,625       42

* Taking into account 8 h per working day
** Assuming 250 working days per year (no holidays, no sick days)

http://www.flickr.com/photos/mararie/2663711551/

Page 27: Text Analytics on 2 Million Documents: A Case Study

CiteSeer: 1.7 million scientific publications, 110 GB

Can be done in a weekend

Don’t do it manually!

1. Get time estimates
2. Look into your data
3. Go cloud

Document → Candidates → Properties → Scoring → Keywords

To estimate quality, take a sample and compute inter-indexer consistency between several people.

Keyword extraction: medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com