Automatic Taxonomy Generation for a News Group: A Case Study. Anna Divoli, Pingar Research (@annadivoli). San Francisco, Apr 2013.

Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study


Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study. As presented at Text Analytics World in San Francisco (Apr 2013).


Page 1: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Automatic Taxonomy Generation for a News Group

Anna Divoli, Pingar Research

@annadivoli

San Francisco Apr 2013

A Case Study

Page 2: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Why Automatic Generation?
• Dynamic
• Fast
• Cheap
• Consistent
• RDF / Flexible
• …

Why from a Document Collection?
• Focused/specific
• Optimal for those documents
• …

Why?

Page 3: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

The Team

The Process

News Group Case Study

Evaluation

Other Use Cases

Summary

Talk Overview

San Francisco Apr 2013

Page 4: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Taxonomy Generation Research Team

Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten

Constructing a Focused Taxonomy from a Document Collection. To appear in Proceedings of the Extended Semantic Web Conference 2013 (ESWC), Montpellier, France.

Page 5: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

?

How Taxonomy Generation Works

Page 6: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Input: Documents stored somewhere

Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations

Grouping & Output: A taxonomy is created that groups resulting taxonomy terms hierarchically

Custom Taxonomy

Taxonomy Generation Overview

Page 7: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Taxonomy Generation - Detailed

Page 8: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Document Database: Solr

Concepts & Relations Database: Sesame

1. Import & convert to text
2. Extract concepts
3. Annotate with Linked Data
4. Disambiguate clashing concepts
5. Consolidate taxonomy

Input Docs

Preferred top-level terms

Focused SKOS Taxonomy

Taxonomy Generation in 5 Steps!

Page 9: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Input Documents

1. Convert to text

Document Database

Current input:
• Directory path read recursively

Other possible inputs:
• Docs in a database or a DMS
• Emails + attachments (Exchange)
• Website URL
• RSS feed

External tool to convert different file formats to text

Database to store document content

Step 1. Document input & conversion
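
The current input mode, a directory path read recursively, can be sketched as below. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the name `collect_documents`, the `.txt` filter, and the UTF-8 decoding are hypothetical choices, and format conversion (the external tool mentioned on the slide) is left out.

```python
from pathlib import Path


def collect_documents(root, suffixes=(".txt",)):
    """Walk `root` recursively and return {relative_path: text}.

    Only files that are already plain text are read here; converting other
    formats to text is assumed to happen in a separate tool.
    """
    docs = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            # errors="replace" keeps the walk going on badly encoded files.
            docs[path.relative_to(root).as_posix()] = path.read_text(
                encoding="utf-8", errors="replace"
            )
    return docs
```

The returned mapping is keyed by relative path, which is one simple way to keep a stable identifier for each document once its content is stored in a database.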

Page 10: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Documents Database

Concepts Database

2. Extract concepts

http://localhost/solr/select?q=path:mycollection\\document456.txt

Pingar API:
• Taxonomy Terms: Climate and Weather Leaders Agreements
• People: Yvo de Boer, Maite Nkoana-Mashabane
• Organizations: Associated Press, South African Council of Churches
• Locations: South Africa

Wikify:
• Wikipedia Terms: South Africa Yvo de Boer U.N. Climate agreements Associated Press

Specific terminology: green policies; climate diplomacy

Step 2. Extracting concepts
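
The doubled backslash in the Solr URL on this slide is consistent with Lucene query syntax, where a backslash is an escape character and has to be escaped itself to match literally. A hedged sketch of building such a request follows; `solr_select_url` is a hypothetical helper, and the `path` field is taken from the slide's example rather than from any documented schema.

```python
import re
from urllib.parse import urlencode

# Characters that carry special meaning in the Lucene/Solr query syntax.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def solr_select_url(base, field, value):
    """Build a select URL that matches `field` against a literal `value`."""
    escaped = _SPECIAL.sub(r"\\\1", value)  # prefix each special char with "\"
    return f"{base}/select?" + urlencode({"q": f"{field}:{escaped}"})


# The document path contains a single backslash; the query needs two.
url = solr_select_url("http://localhost/solr", "path", r"mycollection\document456.txt")
print(url)
```

The printed URL is percent-encoded, so the colon and backslashes appear as escape sequences rather than in the readable form shown on the slide.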

Page 11: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Annotations Database

3. Annotate with Linked Data

mycollection/document456.txt

Pingar API:
• People: Yvo de Boer, Maite Nkoana-Mashabane
• Organizations: Associated Press, South African Council of Churches
• Locations: South Africa

Later, this additional info will help create e-Discovery & semantic search solutions

Concepts Database

Step 3. Annotation with meaning
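
One way to picture an annotated concept is as a SKOS concept that carries links to external Linked Data resources. The sketch below writes such a record as Turtle text. It is illustrative only: the slides do not say which property is used for the links, so `skos:exactMatch` is an assumption, and the `example.org` URI and the helper name `concept_to_turtle` are hypothetical.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def concept_to_turtle(concept_uri, pref_label, linked_uris):
    """Serialize one concept and its Linked Data links as Turtle."""
    # Escape backslashes and quotes so the label stays a valid Turtle string.
    label = pref_label.replace("\\", "\\\\").replace('"', '\\"')
    statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
    # Assumption: external resources are attached with skos:exactMatch.
    statements += [f"skos:exactMatch <{uri}>" for uri in linked_uris]
    body = " ;\n    ".join(statements)
    return f"{SKOS_PREFIX}\n\n<{concept_uri}>\n    {body} .\n"


turtle = concept_to_turtle(
    "http://example.org/concept/south-africa",  # hypothetical local URI
    "South Africa",
    ["http://dbpedia.org/resource/South_Africa"],
)
print(turtle)
```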

Page 12: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Final Concepts Database

4. Disambiguate clashing concepts

"Over the past three years, Apple has acquired three mapping companies"
• wikipedia.org/wiki/Apple_Corps
• freebase.com/view/en/apple_inc
Two concepts were extracted that are dissimilar: discard the incorrect one.

"For millions of years, the oceans have been filled with sounds from natural sources."
• wikipedia.org/wiki/Ocean
• www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
Two concepts were extracted that are similar: accept both as correct.

Concepts Database

Step 4. Discarding irrelevant meanings
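
The rule on this slide (similar candidates are both kept, dissimilar ones are reduced to the correct one) can be sketched with a simple word-overlap measure. Everything below is an assumption made for illustration: the slides do not state which similarity measure or threshold is used, Jaccard overlap is swapped in as a stand-in, and the descriptor words are invented toy data.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def resolve_clash(context, cand_a, cand_b, threshold=0.3):
    """Return the URIs to keep for one ambiguous mention.

    Each candidate is (uri, descriptor_words); `context` holds the words
    around the mention. The threshold is an arbitrary illustrative value.
    """
    (uri_a, words_a), (uri_b, words_b) = cand_a, cand_b
    if jaccard(words_a, words_b) >= threshold:
        return [uri_a, uri_b]  # similar meanings: accept both
    # Dissimilar meanings: keep the one closer to the document context.
    score_a, score_b = jaccard(context, words_a), jaccard(context, words_b)
    return [uri_a] if score_a >= score_b else [uri_b]


# Toy descriptor words, invented for this example only.
apple_corps = ("wikipedia.org/wiki/Apple_Corps", {"beatles", "music", "records", "label"})
apple_inc = ("freebase.com/view/en/apple_inc",
             {"technology", "iphone", "software", "mapping", "companies"})
kept_apple = resolve_clash({"apple", "acquired", "three", "mapping", "companies"},
                           apple_corps, apple_inc)

ocean = ("wikipedia.org/wiki/Ocean", {"ocean", "sea", "water", "marine"})
marine_areas = ("www.fao.org/aos/agrovoc#c_4607", {"marine", "sea", "ocean", "areas"})
kept_ocean = resolve_clash({"oceans", "sounds", "natural", "sources"}, ocean, marine_areas)
```

With these toy inputs, the Apple mention keeps only the company reading, while both ocean concepts survive.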

Page 13: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

5a. Add relations

Concepts & Relations Database

felines tiger bird

horse family

zebra donkey pigeon horse lizard

Category:Carnivorous animals Category:Animals

animals

Building the taxonomy bottom up

Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals

Focused SKOS Taxonomy

Step 5a. Group taxonomy
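
Building the taxonomy bottom up amounts to recording broader links from each extracted concept through its chain of ancestors, so that separate concepts meet at shared parents. A minimal sketch, assuming a plain dictionary of child-to-parents links; the lizard chain is the one on the slide, while the zebra and donkey chains are guessed from the slide's labels and are illustrative only.

```python
from collections import defaultdict


def add_broader_chain(broader, leaf, chain):
    """Record broader links from `leaf` up through `chain`.

    `chain` lists ancestors from the nearest to the most general.
    """
    child = leaf
    for parent in chain:
        broader[child].add(parent)
        child = parent


def top_concepts(broader):
    """Concepts that appear as a parent but have no broader concept."""
    parents = set().union(*broader.values())
    return sorted(p for p in parents if p not in broader)


broader = defaultdict(set)
add_broader_chain(broader, "lizard",
                  ["Squamata", "Reptiles", "Tetrapods", "Vertebrates", "Chordates", "Animals"])
# Hypothetical chains, inferred from the labels shown on the slide.
add_broader_chain(broader, "zebra", ["horse family", "Animals"])
add_broader_chain(broader, "donkey", ["horse family", "Animals"])

roots = top_concepts(broader)
```

Here all three leaves end up under the single top concept "Animals", and zebra and donkey share the intermediate parent "horse family".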

Page 14: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Films and film making Film stars Mila Kunis Daniel Radcliffe Sally Hawkins Julianna Margulies

Association football clubs Former Football League clubs Manchester United F.C. Manchester United F.C. Manchester City F.C.

Finance Economics and finance Personal finance Commercial finance Tax

Capital gains tax Tax Capital gains tax

5b. Prune relations

Concepts & Relations Database

Focused SKOS Taxonomy

Step 5b. Consolidating taxonomy
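
One plausible reading of the Finance example on this slide is that long chains of single-child parents are collapsed, leaving only the parent nearest the leaf (Tax, then Capital gains tax). The sketch below implements that reading only; it is an assumption about the pruning rule, not a description of the actual heuristics, and it expects a tree without cycles. The `ROOT` node and the function name are hypothetical.

```python
def prune_single_child_chains(children, root):
    """Return a new {parent: [children]} tree with single-child chains collapsed."""
    pruned = {}

    def resolve(node):
        # Skip past nodes whose only child is itself a parent.
        while len(children.get(node, [])) == 1 and children.get(children[node][0]):
            node = children[node][0]
        return node

    def visit(node):
        kids = [resolve(kid) for kid in children.get(node, [])]
        if kids:
            pruned[node] = kids
        for kid in kids:
            visit(kid)

    visit(root)
    return pruned


tree = {
    "ROOT": ["Finance"],  # hypothetical artificial root
    "Finance": ["Economics and finance"],
    "Economics and finance": ["Personal finance"],
    "Personal finance": ["Commercial finance"],
    "Commercial finance": ["Tax"],
    "Tax": ["Capital gains tax"],
}
pruned_tree = prune_single_child_chains(tree, "ROOT")
```

On this input the four intermediate finance categories are dropped and `Tax` becomes the direct child of the root, with `Capital gains tax` beneath it.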

Page 15: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations

Custom Taxonomy

Taxonomy Generation Process

Input: Documents stored somewhere

Output: A taxonomy is created that groups resulting taxonomy terms hierarchically

* Pingar API for People, Organizations, Locations & Taxonomy Terms from related taxonomies; Wikification for related Wikipedia articles and category relations; Linked Data analysis for creating links to Freebase & DBpedia

• File-share
• SharePoint
• Exchange
• Etc.

Page 16: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

?

What Does It Look Like?

Page 17: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Fairfax NZ
This taxonomy was created from 2000 news articles by Fairfax New Zealand around Christmas 2011.

Taxonomy Statistics
• Concept Count: 10158
• Edges Count: 12668
• Intermediate Count: 1383
• Leaves Count: 8748
• Labels Count: 11545

Nesting Counts
0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202, 6: 745, 7: 354, 8: 179, 9: 41, 10: 10

Average Depth: 2.65

Case Study: A News Group
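
The statistics for the Christmas 2011 taxonomy can be cross-checked with a few lines of arithmetic. The sketch assumes that the nesting counts are depth: count pairs and that the average depth is their count-weighted mean; neither is stated on the slide. Under that reading the mean comes out at about 2.64, close to the reported 2.65, and the concept count equals top-level plus intermediate plus leaf concepts.

```python
# Nesting counts from the slide, read as {depth: number of occurrences}.
nesting_counts = {0: 27, 1: 6102, 2: 2903, 3: 2891, 4: 2057, 5: 1202,
                  6: 745, 7: 354, 8: 179, 9: 41, 10: 10}


def average_depth(counts):
    """Count-weighted mean depth (assumed definition of 'Average Depth')."""
    total = sum(counts.values())
    return sum(depth * n for depth, n in counts.items()) / total


avg = average_depth(nesting_counts)

# Top-level concepts (depth 0) + intermediate + leaves should give all concepts.
top_level, intermediate, leaves = nesting_counts[0], 1383, 8748
concept_total = top_level + intermediate + leaves

print(f"average depth: {avg:.2f}, concepts: {concept_total}")
```

The nesting counts sum to more than the concept count, which would be expected if a concept can sit under more than one parent and so be counted at several depths; that explanation is an inference, not something the slide states.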

Page 18: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 19: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 20: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 21: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 22: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 23: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 24: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Labels & Relations

Page 25: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Page 26: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

Fairfax - 4 Days from Sep 2001

Excerpt of the taxonomy generated from Fairfax articles taken from:
• Sep 9th & 10th (1242 articles)
• Sep 13th & 14th (1667 articles) NZT!

Colors of terms:
• proposed to group other terms
• found in both document collections
• in 9-10 Sep 2001 docs
• in 13-14 Sep 2001 docs
• search match

Taxonomy Statistics:
• Concept Count: 12699
• Edges Count: 13755
• Intermediate Count: 709
• Leaves Count: 11985
• Labels Count: 12741

Page 27: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

• proposed to group other terms
• in both document collections
• in 9-10 Sep 2001 docs
• in 13-14 Sep 2001 docs

Page 28: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Case Study: A News Group

• proposed to group other terms
• in both document collections
• in 9-10 Sep 2001 docs
• in 13-14 Sep 2001 docs

Page 29: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Fairfax NZ - 4 Days from Sep 2001

Excerpt of the taxonomy generated from Fairfax articles taken from:
• Sep 9th & 10th (1242 articles)
• Sep 13th & 14th (1667 articles) NZT!

Colors of terms:
• proposed to group other terms
• found in both document collections
• in 9-10 Sep 2001 docs
• in 13-14 Sep 2001 docs
• search match

Taxonomy Statistics:
• Concept Count: 12699
• Edges Count: 13755
• Intermediate Count: 709
• Leaves Count: 11985
• Labels Count: 12741
• Average Depth: 1.85
• Nesting Counts: 0: 5, 1: 4082, 2: 8980, 3: 755, 4: 333, 5: 132, 6: 31, 7: 6, 8: 1

Including NZPSV, Taxonomy Statistics:
• Concept Count: 13970
• Edges Count: 15020
• Intermediate Count: 1277
• Leaves Count: 12677
• Labels Count: 15407
• Average Depth: 3
• Nesting Counts: 0: 16, 1: 10153, 2: 1888, 3: 1400, 4: 1203, 5: 1053, 6: 756, 7: 427, 8: 252, 9: 267, 10: 341, 11: 315, 12: 330, 13: 149, 14: 134, 15: 87, 16: 10

Case Study: A News Group

Page 30: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

September 2001

 

Christmas 2011

Case Study: A News Group

• proposed to group other terms
• in both document collections
• in 9-10 Sep 2001 docs
• in 13-14 Sep 2001 docs

Page 31: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Evaluation

Sources of error in concept identification

Type                       Number  Errors  Rate
People                       1145      37   3.2%
Organizations                 496      51  10.3%
Locations                     988     114  11.5%
Wikipedia named entities      832      71   8.5%
Wikipedia other entities       99      16  16.4%
Taxonomy                      868     229  26.4%
DBPedia                       868      81   8.1%
Freebase                      135      12   8.9%
Overall                      3447     393  11.4%

Recall: 75% (compared with a manually generated taxonomy for the same domain)

Precision: 89% for concepts, 90% for relations (evaluation by 15 human judges)

Page 32: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Other Use Cases

How to refine search by metadata?

What’s in these files / emails?

What to include in our corporate taxonomy?

How to find all docs on a given topic?

Content Audit

Information Architecture

Better search with facets

Better browsing

Page 33: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

• proposed to group other concepts
• in two or more document collections
• in the bipolar document collection
• in the breast cancer document collection
• in the neither cancer nor bipolar doc. collection

Other Use Cases: Discovery

Page 34: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

Summary

Entity Extraction

Linked Data

Disambiguation

Consolidation

News Group Case Study

Other Use Cases

Page 35: Anna Divoli (Pingar Research): Automatic Taxonomy Generation for a News Group - A Case Study

More?

bit.ly/f-step

pingar.com @PingarHQ

[email protected]

@annadivoli

Focused SKOS Taxonomy Extraction Process (F-STEP) wiki