
Page 1: Text Analytics on 2 Million Documents: A Case Study

Text Analytics on 2 Million Documents: A Case Study

Plus, An Introduction into Keyword Extraction

Alyona Medelyan

Text Analytics World, Boston, October 3-4, 2012

Page 2: Text Analytics on 2 Million Documents: A Case Study

“Because he could” by D. Morris, E. McGann

“Still stripping after 25 years” by E. Burns

What are these books about?

“Glut” by A. Wright

Only metadata will tell…

Page 3: Text Analytics on 2 Million Documents: A Case Study

What this talk will cover:

• Who I am & my relation to the topic
• What types of keyword extraction are out there
• How keyword extraction works
• How accurate keywords can be
• How to analyze 2 million documents efficiently

Page 4: Text Analytics on 2 Million Documents: A Case Study

nzdl.org/kea/ maui-indexer.googlecode.com

Maui: multi-purpose automatic topic indexing

My Background

SemEval-2

2005-2009: PhD thesis on keyword extraction, “Human-competitive automatic topic indexing”

2010: co-organized the keyword extraction competition SemEval-2, Track 5, “Automatic keyphrase extraction from scientific articles”

2010-2012: leading the R&D of Pingar’s text analytics API. Pingar API features: keyword & named entity extraction, summarization, etc.

@zelandiya · medelyan.com

Page 5: Text Analytics on 2 Million Documents: A Case Study

Document

Easy to extract: title, file type & location, creation & modification date, authors, publisher

Difficult to extract: keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned

Metadata

Text Analytics

Findability is ensured with the help of metadata

Page 6: Text Analytics on 2 Million Documents: A Case Study

What can text analytics determine from text?

[Figure: a page of running text annotated with what text analytics can determine: sentiment; keywords & tags (the focus of this presentation); genre; categories & taxonomy terms; entities, such as names, patterns and biochemical entities.]

Page 7: Text Analytics on 2 Million Documents: A Case Study

Types of keyword extraction (or topic indexing)

• Subject headings in libraries
   • general, with Library of Congress Subject Headings
   • domain-specific, in PubMed with MeSH

• Keyphrases in academic publications

• Tags in folksonomies
   • by authors on Technorati
   • by users on Del.icio.us

controlled indexing (categories, taxonomy terms) ↔ free indexing (keywords, tags)

Page 8: Text Analytics on 2 Million Documents: A Case Study

Free indexing (e.g. keywords, tags): inconsistent, no control, no semantics, ad hoc

Controlled indexing (e.g. LCSH, ACM, MeSH): restricted, centrally controlled, inflexible, not always available

Page 9: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Keywords

How keyword extraction works

NEJM usually has the highest impact factor of the journals of clinical medicine.

highest, highest impact, highest impact factor

1. Extract phrases using the sliding window approach

ignore stopwords

NEJM

Alternative approach: a) assign part-of-speech tags, b) extract valid noun phrases (NPs)

impact, impact factor…
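To make step 1 concrete, here is a minimal sketch of sliding-window candidate extraction in Python. The stopword list, the window size and the boundary rule (candidates may not start or end with a stopword) are illustrative assumptions, not the exact rules used by Kea or Maui.

```python
import re

# Illustrative stopword list; a real extractor would use a much larger one.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "has", "usually"}

def extract_candidates(text, max_len=3):
    """Slide a window of 1..max_len words over the text and keep phrases
    that do not start or end with a stopword."""
    words = re.findall(r"[A-Za-z][A-Za-z-]*", text)
    candidates = set()
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break
            if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
                continue
            candidates.add(" ".join(phrase))
    return candidates

print(extract_candidates(
    "NEJM usually has the highest impact factor of the journals of clinical medicine."))
# Yields e.g. 'NEJM', 'highest', 'highest impact factor', 'impact factor',
# 'journals of clinical', 'clinical medicine', 'medicine', ...
```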

Page 10: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Keywords

How keyword extraction works

NEJM usually has the highest impact factor of the journals of clinical medicine.

2. Normalize phrases (case folding, stemming, etc.)

Candidate               Free indexing (normalized)   Controlled indexing
NEJM                    nejm                         New England J of Med
highest                 high                         -
highest impact factor   high impact factor           -
impact                  impact                       -
impact factor           impact factor                Impact Factor
journals                journal                      Journal
journals of clinical    journal of clinic            -
clinical                clinic                       Clinic
clinical medicine       clinic medic                 Medicine
medicine                medic                        Medicine
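A minimal sketch of step 2 for the free-indexing route, using case folding plus NLTK's Porter stemmer (an assumption; the slide's example output comes from a different stemmer, so the exact stems differ). Mapping to a controlled vocabulary would additionally need a thesaurus lookup, which is not shown here.

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def normalize(phrase):
    """Case-fold and stem every word so that surface variants such as
    'journals' and 'Journal' collapse onto the same candidate key."""
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

for c in ["NEJM", "impact factor", "journals", "clinical medicine", "medicine"]:
    print(f"{c:20} -> {normalize(c)}")
# 'clinical medicine' and 'medicine' now end in the same stem, so their
# occurrences can be counted together as one candidate.
```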

Page 11: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Properties → Keywords

How keyword extraction works

1. Frequency: number of occurrences (incl. synonyms)

2. Position: beginning/end of a document, title, headers

3. Phrase length: longer means more specific

4. Similarity: semantic relatedness to other candidates

5. Corpus statistics: how prominent the phrase is in this particular text compared to the corpus as a whole (e.g. TF×IDF)

6. Popularity: how often people select this candidate

7. Part of speech pattern: some patterns are more common
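A minimal sketch computing three of these properties (frequency, position of first occurrence, phrase length) per candidate; it reuses the hypothetical `extract_candidates` helper from the earlier sketch and counts occurrences by naive substring matching for brevity.

```python
def candidate_features(text, candidates):
    """Return simple per-candidate features: number of occurrences,
    relative position of the first occurrence (0.0 = very start of the
    document), and phrase length in words."""
    lowered = text.lower()
    features = {}
    for cand in candidates:
        key = cand.lower()
        features[cand] = {
            "frequency": lowered.count(key),                            # 1. frequency
            "first_occurrence": lowered.find(key) / max(len(text), 1),  # 2. position
            "length": len(cand.split()),                                # 3. phrase length
        }
    return features

# Example: features = candidate_features(doc_text, extract_candidates(doc_text))
```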

Page 12: Text Analytics on 2 Million Documents: A Case Study

Document → Candidates → Properties → Scoring → Keywords

How keyword extraction works

Heuristics

A formula that combines the most powerful features

• requires accurate crafting
• performs equally well, or slightly worse, across various domains

Supervised machine learning

Train a model from manually indexed documents

• requires training data
• performs really well on docs that are similar to the training data, but poorly on dissimilar ones
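The sketch below contrasts the two scoring routes: a hand-crafted formula over candidate features on one side, and a supervised model trained on manually indexed examples on the other. The feature weights, the toy training data and the choice of a scikit-learn Naive Bayes learner are illustrative assumptions, not the actual Kea/Maui models.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB  # illustrative choice of learner

# --- Heuristic scoring: craft a formula over the candidate properties ---
def heuristic_score(f):
    # Made-up weights; a real formula needs careful tuning per use case.
    return 1.5 * f["frequency"] + 2.0 * (1 - f["first_occurrence"]) + 0.5 * f["length"]

# --- Supervised scoring: learn the weighting from manually indexed docs ---
# Each row: [frequency, first_occurrence, length]; y = 1 if a human picked it.
X_train = np.array([[3, 0.05, 2], [1, 0.80, 1], [5, 0.10, 2], [1, 0.95, 3]])
y_train = np.array([1, 0, 1, 0])

model = GaussianNB().fit(X_train, y_train)

X_new = np.array([[4, 0.02, 2]])          # an unseen candidate's features
print(model.predict_proba(X_new)[0][1])   # probability it is a keyword
```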

Page 13: Text Analytics on 2 Million Documents: A Case Study

How accurate is keyword extraction?

• It’s subjective…
• But: the higher the indexing consistency, the better the search effectiveness (findability)

A – set of keyphrases 1
B – set of keyphrases 2
C – set of keyphrases in common

Consistency (Rolling) = 2C / (A + B)
Consistency (Hopper) = C / (A + B - C)

The two measures are related: Hopper = Rolling / (2 - Rolling)
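Both measures are easy to compute; a small sketch with made-up keyword sets, which also checks the Rolling/Hopper relation:

```python
def rolling_consistency(a, b):
    """Rolling: 2C / (A + B), with C = number of shared keyphrases."""
    c = len(a & b)
    return 2 * c / (len(a) + len(b))

def hopper_consistency(a, b):
    """Hopper: C / (A + B - C), i.e. the Jaccard overlap of the two sets."""
    c = len(a & b)
    return c / (len(a) + len(b) - c)

indexer1 = {"overweight", "nutrition policies", "taxes", "diet"}
indexer2 = {"overweight", "food policies", "taxes", "public health", "diet"}

r = rolling_consistency(indexer1, indexer2)
h = hopper_consistency(indexer1, indexer2)
print(round(r, 2), round(h, 2), round(r / (2 - r), 2))  # 0.67 0.5 0.5
```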
Page 14: Text Analytics on 2 Million Documents: A Case Study

Professional indexers’ keywords*

[Figure: cloud of Agrovoc terms selected by the indexers: overweight, overeating, feeding habits, nutritional requirements, human nutrition, nutrition policies, nutrition surveillance, nutritional physiology, nutrition status, food policies, foods, food intake, food consumption, diet, nutrient excesses, dietary guidelines, meal patterns, nutrition programs, public health, disease control, nutritional disorders, price policies, fiscal policies, prices, taxes, developing countries, globalization, weight reduction, urbanization, developed countries, price formation, energy value, direct taxation, regulations]

* 6 professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled “The global obesity problem”

Page 15: Text Analytics on 2 Million Documents: A Case Study

Comparison of 2 indexers

[Figure: the same Agrovoc term cloud, with the terms selected by Indexer 1 and by Indexer 2 marked and the Agrovoc relations between terms shown.]

Page 16: Text Analytics on 2 Million Documents: A Case Study

Comparison of 6 indexers & Kea

[Figure: the same Agrovoc term cloud, now marked with the selections of indexers 1-6 and of the Kea algorithm, with Agrovoc relations shown. Kea additionally suggests terms such as body weight, saturated fat, price fixing, policies and controlled prices.]

Page 17: Text Analytics on 2 Million Documents: A Case Study

Comparison of CS students* & Maui

* 15 teams of 2 students each assigned keywords to the same document, entitled “A safe, efficient regression test selection technique”

Page 18: Text Analytics on 2 Million Documents: A Case Study

Human vs. algorithm consistency

15 teams of 2 CS students vs. Maui, on 20 CS documents & Wikipedia as the vocabulary:

Method      Min   Avg   Max
Students     21    31    37
Maui         24    32    36

6 professional indexers vs. Kea, on 30 agricultural documents & the Agrovoc thesaurus:

Method          Min   Avg   Max
Professionals    26    39    47
Kea              24    32    38

CiteULike taggers vs. Maui, free indexing (each tagger had ≥ 2 co-taggers):

                          With other taggers   With Maui
330 taggers & 180 docs                    19          24
35 taggers & 140 docs                     38          35

Page 19: Text Analytics on 2 Million Documents: A Case Study

Text Analytics on 2 Million Documents: A Case Study

A collaboration with Gene Golovchinsky, fxpal.com/?p=gene

Page 20: Text Analytics on 2 Million Documents: A Case Study

The dataset

• CiteSeer: 1.7 million scientific publications, 110 GB
• Twitter: 490 million tweets per week, 84 GB
• Wikipedia: 3.6 million articles, 13 GB
• Britannica: 0.65 million articles, 0.3 GB
• ICWSM 2011 (news, blogs, forums, etc.): 2.1 TB (compressed!)

Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers, en.wikipedia.org/wiki/Wikipedia:Size_comparisons

Page 21: Text Analytics on 2 Million Documents: A Case Study

The task goal:
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement

For Gene’s collaborative search system Querium

Page 22: Text Analytics on 2 Million Documents: A Case Study

Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run it on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs

Our example:

• Standard laptop (4 cores, 8 GB RAM): 30 days
• Similar Rackspace VM: 46 days
• Threading reduces the time: 24 days
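The extrapolation itself is simple arithmetic; a back-of-the-envelope sketch with assumed sample timings:

```python
# Assumed numbers: time a 100-document sample, then scale up linearly.
sample_docs = 100
sample_minutes = 2.5            # measured wall-clock time for the sample (assumed)
total_docs = 1_700_000

total_minutes = sample_minutes / sample_docs * total_docs
print(f"Estimated: {total_minutes / 60 / 24:.0f} days")  # ~30 days at this rate
```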

Page 23: Text Analytics on 2 Million Documents: A Case Study

Step 2: Look into your data
Understand the nature of your data: look at samples, compute statistics. Speed up processing by removing anomalies & targeting the text analytics.

Our example:

30% of the docs exceed 50 KB (some ≈600 KB)

The most important phrases appear in the title, abstract, introduction and conclusions.

Only process the top 30% and the last 20% of each document; this reduces the time by 57%!
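A minimal sketch of that cropping rule; cutting by character count (rather than by word or section) is a simplifying assumption.

```python
def crop(text, head=0.30, tail=0.20):
    """Keep the start of a document (title, abstract, introduction) and its
    end (conclusions), and drop the middle."""
    n = len(text)
    head_chars, tail_chars = int(n * head), int(n * tail)
    if tail_chars == 0 or head_chars + tail_chars >= n:
        return text                      # very short documents are kept whole
    return text[:head_chars] + "\n...\n" + text[-tail_chars:]
```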

Page 24: Text Analytics on 2 Million Documents: A Case Study

Validate: Can we crop our documents?

Top N keywords in original doc   How many were found in the cropped doc
10                               91%
50                               80%
100                              75%
All                              64%

Top 20 keywords from the original document*: ontology, knowledge base, knowledge representation, Semantic Web, WordNet, knowledge engineering, predicate logic, artificial intelligence, semantic networks, natural language, first-order logic, ontology engineering, lexicon, conceptual graphs, higher-order logic, natural language processing, design rationale, block diagram

Top 20 keywords from the cropped document*: ontology, knowledge base, knowledge engineering, knowledge representation, WordNet, predicate logic, artificial intelligence, ontology engineering, semantic networks, Semantic Web, first-order logic, block diagram, dynamic systems, higher-order logic, conceptual graphs, modeling & simulation, universe of discourse, bond graph, lexicon

* Toward principles for the design of ontologies used for knowledge sharing

T. R. Gruber (1993)
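The validation amounts to extracting keywords from both versions and measuring how many of the original's top-N survive the crop. A sketch, assuming some `extract_keywords(text, n)` function that returns a ranked list (any of the extractors discussed earlier would do):

```python
def retained_fraction(original_text, cropped_text, n, extract_keywords):
    """Fraction of the top-n keywords of the full document that also appear
    among the keywords extracted from the cropped document."""
    top_original = extract_keywords(original_text, n)
    cropped = set(extract_keywords(cropped_text, n))
    return sum(1 for kw in top_original if kw in cropped) / n

# e.g. retained_fraction(doc, crop(doc), 10, extract_keywords)
# the study above measured 91% retention for the top 10 keywords
```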

Page 25: Text Analytics on 2 Million Documents: A Case Study

Step 3: Go cloud

Don’t be afraid to bring out the big guns

• Large Elastic Compute instance: 1000 docs × 4 threads = 30 min

• High-CPU Extra Large instance (8 virtual cores): 1000 docs × 24 threads = 6 min

Also: increase the number of machines

• 4 machines = 4 times faster, i.e. 50 instead of 200 hours (or 1 weekend!)
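A minimal sketch of the "1000 docs × N threads" pattern using a worker pool; `process_document` is a placeholder for reading a file and running the extraction on it, reusing the earlier hypothetical helpers. Treat it purely as an illustration of the pattern; for CPU-bound pure-Python extraction, a process pool would be the better choice.

```python
from concurrent.futures import ThreadPoolExecutor

def process_document(path):
    # Placeholder: read one file and run keyword extraction on it.
    with open(path, encoding="utf-8", errors="ignore") as f:
        return extract_candidates(crop(f.read()))   # reuse the earlier sketches

def process_batch(paths, workers=24):
    """Run the per-document work concurrently; set `workers` to roughly the
    number of (virtual) cores on the instance. Swap in ProcessPoolExecutor
    for CPU-bound pure-Python work."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_document, paths))
```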

Page 26: Text Analytics on 2 Million Documents: A Case Study

How long would a human need to extract keywords from 1.7M docs?

Min per doc   Min          Hours    Days*    Years**
1             1,700,000    28,333    3,542       14
2             3,400,000    56,666    7,083       28
3             5,100,000    85,000   10,625       42

* Taking into account 8 h per working day
** Assuming 250 working days per year (no holidays, no sick days)

http://www.flickr.com/photos/mararie/2663711551/

Page 27: Text Analytics on 2 Million Documents: A Case Study

CiteSeer: 1.7 million scientific publications, 110 GB

Can be done in a weekend

Don’t do it manually!

1. Get time estimates
2. Look into your data
3. Go cloud

Document → Candidates → Properties → Scoring → Keywords

To estimate quality, take a sample and compute inter-indexer consistency between several people.

Keyword extraction: medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com