Extracting and Mapping SharePoint Content to Create a Custom Taxonomy Anna Divoli Pingar Research @annadivoli


Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to Create a Custom Taxonomy Pingar presentation at ShareFEST in Philadelphia (Apr 2013).


Page 1: Anna Divoli (Pingar Research): Extracting and Mapping SharePoint Content to Create a Custom Taxonomy

Extracting and Mapping SharePoint Content to Create a Custom Taxonomy

Anna Divoli, Pingar Research

@annadivoli

Page 2

Why?

Why Automatic Generation?
  • Dynamic
  • Fast
  • Cheap
  • Consistent
  • RDF / Flexible…

Why from a Document Collection?
  • Focused/specific
  • Optimal for those documents…

Why Taxonomies?
  • Organize knowledge
  • Domain representation
  • Enable automatic tasks…

Why in SharePoint?
  • All you need is there!
  • Can be used straight away!

Page 3

Talk Overview

  • The Team
  • The Process
  • Evaluation
  • Use Cases
      – Withdrawn drug
      – Cancer treatments
      – Re-purposed drug
  • Summary

Page 4

Taxonomy Generation Research Team

Olena Medelyan, Steve Manion, Jeen Broekstra, Anna Divoli, Anna Lan Huang and Ian Witten. Constructing a Focused Taxonomy from a Document Collection. ESWC 2013, Montpellier, France.

Page 5

Taxonomy Generation Process

Input: Documents stored somewhere

Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations

Grouping & Output: A taxonomy is created that groups the resulting taxonomy terms hierarchically

Custom Taxonomy

Page 6

How Taxonomy Generation works

Page 7

In 5 Steps!

1. Import & convert to text
2. Extract concepts
3. Annotate with Linked Data
4. Disambiguate clashing concepts
5. Consolidate taxonomy

Diagram labels:
  • Input Docs
  • Document Database (Solr)
  • Concepts & Relations Database (Sesame)
  • Preferred top-level terms
  • Focused SKOS Taxonomy

Page 8

Step 1. Document input & conversion

Input Documents → 1. Convert to text → Document Database

Current input:
  • Directory path read recursively

Other possible inputs:
  • Docs in a database or a DMS
  • Emails + attachments (Exchange)
  • Website URL
  • RSS feed

External tool to convert different file formats to text

Database to store document content
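The "directory path read recursively" input can be sketched in a few lines of Python. This is only an illustration: the function name `collect_documents` and the restriction to `.txt` files are assumptions, and the slides do not say which external conversion tool or storage call is actually used.

```python
from pathlib import Path


def collect_documents(root, suffixes=(".txt",)):
    """Walk `root` recursively and return {relative_path: text}.

    Only files whose suffix is in `suffixes` are read; other formats
    would first need a converter (not shown, tool unknown).
    """
    root = Path(root)
    docs = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            key = path.relative_to(root).as_posix()
            docs[key] = path.read_text(encoding="utf-8", errors="replace")
    return docs
```

The returned dictionary is keyed by path relative to the collection root, which is roughly the shape a document store indexed by path would need.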

Page 9

Step 2. Extracting concepts

Documents Database → 2. Extract concepts → Concepts Database

http://localhost/solr/select?q=path:mycollection\\document456.txt

Pingar API:
  • Taxonomy Terms: Climate and Weather Leaders Agreements
  • People: Yvo de Boer, Maite Nkoana-Mashabane
  • Organizations: Associated Press, South African Council of Churches
  • Locations: South Africa

Wikify:
  • Wikipedia Terms: South Africa, Yvo de Boer, U.N. Climate agreements, Associated Press

Specific terminology: green policies; climate diplomacy
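The Solr lookup shown on this slide selects a document by its `path` field. A minimal sketch of building such a URL in Python follows; the helper name `solr_select_url` is hypothetical, and it only escapes backslashes (other Solr query special characters are not handled here).

```python
from urllib.parse import urlencode


def solr_select_url(base, field, value):
    """Build a Solr /select URL for an exact `field:value` lookup.

    A literal backslash must be doubled in Solr query syntax, which is
    why the slide's URL shows `\\` inside the path.
    """
    escaped = value.replace("\\", "\\\\")
    query = urlencode({"q": f"{field}:{escaped}"})
    return f"{base}/select?{query}"


url = solr_select_url("http://localhost/solr", "path",
                      "mycollection\\document456.txt")
```

`urlencode` percent-encodes the colon and backslashes, so the result is safe to pass to an HTTP client even though it looks different from the hand-written URL on the slide.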

Page 10

Step 3. Annotation with meaning

Concepts Database → 3. Annotate with Linked Data → Annotations Database

mycollection/document456.txt

Pingar API:
  • People: Yvo de Boer, Maite Nkoana-Mashabane
  • Organizations: Associated Press, South African Council of Churches
  • Locations: South Africa

Later this additional info will help create e-Discovery & semantic search solutions

Page 11

Step 4. Discarding irrelevant meanings

Concepts Database → 4. Disambiguate clashing concepts → Final Concepts Database

"Over the past three years, Apple has acquired three mapping companies"
  • wikipedia.org/wiki/Apple_Corps
  • freebase.com/view/en/apple_inc
Two concepts were extracted that are dissimilar: discard the incorrect one.

"For millions of years, the oceans have been filled with sounds from natural sources."
  • wikipedia.org/wiki/Ocean
  • www.fao.org/aos/agrovoc#c_4607 (Agrovoc term: Marine areas)
Two concepts were extracted that are similar: accept both as correct.
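The accept-both / discard-one decision above can be illustrated with a toy rule. The slides do not state which similarity measure is used, so everything here is an assumption: each candidate concept is described by a small bag of words, similarity is Jaccard overlap, and the `threshold` value is arbitrary.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def resolve_clash(cand_a, cand_b, context, threshold=0.5):
    """Each candidate is (uri, descriptor_words); context is a word set.

    Similar candidates are both accepted; otherwise the one closer to
    the surrounding document context is kept.
    """
    (uri_a, words_a), (uri_b, words_b) = cand_a, cand_b
    if jaccard(words_a, words_b) >= threshold:
        return [uri_a, uri_b]
    score_a = jaccard(words_a, context)
    score_b = jaccard(words_b, context)
    return [uri_a] if score_a >= score_b else [uri_b]
```

With hand-made descriptor words, the two "ocean" candidates would overlap heavily and both survive, while the two "Apple" candidates would not, so only the one matching the mapping-companies context is kept.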

Page 12

Step 5. Group taxonomy (a)

5a. Add relations
Concepts & Relations Database

Building the taxonomy bottom up

Diagram terms: animals; felines; horse family; bird; tiger; zebra; donkey; horse; pigeon; lizard

Category:Carnivorous animals
Category:Animals

Broader: Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals

Focused SKOS Taxonomy
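Building bottom up from a chain of broader terms, like the lizard example on this slide, can be sketched as follows. The data structure (a dictionary from each concept to its set of broader concepts) and the function names are assumptions made for illustration, not the actual F-STEP representation.

```python
def add_broader_chain(broader, concept, chain):
    """Record concept -> chain[0] -> chain[1] ... as child -> parent links."""
    child = concept
    for parent in chain:
        broader.setdefault(child, set()).add(parent)
        child = parent
    return broader


def top_terms(broader):
    """Concepts that appear as a parent but never as a child."""
    parents = set().union(*broader.values()) if broader else set()
    return parents - set(broader)


broader = {}
chain = "Squamata/Reptiles/Tetrapods/Vertebrates/Chordates/Animals".split("/")
add_broader_chain(broader, "lizard", chain)
```

Adding more chains to the same dictionary merges them wherever they share an ancestor, which is how a single hierarchy can emerge from many leaf concepts.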

Page 13

Step 5. Consolidating taxonomy (b)

5b. Prune relations
Concepts & Relations Database

[Diagram: example hierarchies, terms as they appear on the slide]
  • Films and film making; Film stars; Mila Kunis; Daniel Radcliffe; Sally Hawkins; Julianna Margulies
  • Association football clubs; Former Football League clubs; Manchester United F.C.; Manchester United F.C.; Manchester City F.C.
  • Finance; Economics and finance; Personal finance; Commercial finance; Tax
  • Capital gains tax; Tax; Capital gains tax

Focused SKOS Taxonomy
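The slide does not spell out the pruning rules. One plausible rule, shown here purely as an assumption, is to drop a direct broader link when a longer path already implies it (a transitive reduction). The example hierarchy in the usage note is hypothetical.

```python
def prune_redundant(broader):
    """Drop child -> parent links implied by a longer path.

    `broader` maps each concept to its set of direct parents and is
    assumed to be acyclic.
    """
    def ancestors(node, seen=None):
        seen = set() if seen is None else seen
        for parent in broader.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                ancestors(parent, seen)
        return seen

    pruned = {}
    for child, parents in broader.items():
        keep = set(parents)
        for p in parents:
            # p is redundant if another direct parent already leads to it
            if any(p in ancestors(q) for q in parents if q != p):
                keep.discard(p)
        pruned[child] = keep
    return pruned
```

For instance, if "Capital gains tax" were linked to both "Tax" and "Finance", and "Tax" to "Finance", this rule would keep only the link to "Tax".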

Page 14

Evaluation

Recall: 75% (compared with a manually generated taxonomy for the same domain)

Precision: 89% for concepts, 90% for relations (evaluation based on 15 human judges)
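As a reminder of what the two measures mean, here is a minimal sketch. The slides do not describe how the judges' votes were aggregated, so `precision` below simply takes one accept/reject decision per item, and the data in the usage note is a toy example rather than the evaluation data.

```python
def recall(generated, reference):
    """Share of the reference (manual) taxonomy's items that were generated."""
    reference = set(reference)
    return len(set(generated) & reference) / len(reference)


def precision(judgements):
    """Share of generated items accepted; `judgements` is a list of booleans."""
    return sum(judgements) / len(judgements)
```

With a four-item reference taxonomy of which three items are found, `recall` returns 0.75; with nine of ten generated items accepted, `precision` returns 0.9.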

Page 15

SharePoint Taxonomy Generation Process

Analysis: Using a variety of tools* and datasets, extract concepts, entities, relations

Custom Taxonomy

Page 16

Triazolam
[A benzodiazepine drug used for short-term treatment of acute insomnia. Withdrawn in 1991 in the UK because of risk of psychiatric adverse drug reactions. It continues to be available in the U.S.]

Excerpt of the taxonomy generated from:
- 131 PubMed abstracts of clinical trials on triazolam before 1991
- 180 PubMed abstracts of clinical trials on triazolam since 1991

Colors of terms:
- proposed to group other terms
- found in both document collections
- in before withdrawal docs
- in since withdrawal docs

Taxonomy Statistics

Concept Count: 305
Edges Count: 437
Intermediate Count: 97
Leaves Count: 183
Labels Count: 353

Nesting Counts

0: 25
1: 51
2: 124
3: 160
4: 176
5: 153
6: 54
7: 4
Average Depth: 3.6
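Counts like the ones on this slide can be derived from a list of child-to-parent edges. The sketch below is an assumption about how such statistics might be computed, not the tool's actual code; it ignores labels and average depth, and it does not count concepts that have no edges at all.

```python
def taxonomy_stats(edges):
    """edges: iterable of (child, parent) pairs."""
    edges = set(edges)
    children = {c for c, _ in edges}
    parents = {p for _, p in edges}
    return {
        "concepts": len(children | parents),
        "edges": len(edges),
        "top_level": len(parents - children),      # never a child
        "intermediate": len(parents & children),   # both parent and child
        "leaves": len(children - parents),         # never a parent
    }


stats = taxonomy_stats([
    ("tiger", "felines"), ("felines", "animals"),
    ("zebra", "horse family"), ("horse family", "animals"),
])
```

The slide's figures fit this split: reading the level-0 nesting count as 25 top-level terms, 25 + 97 intermediate + 183 leaves gives the reported 305 concepts.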

Pages 17-19

[Figures: excerpts of the triazolam taxonomy. Colors of terms: proposed to group other terms; in both document collections; in before withdrawal docs; in since withdrawal docs]

Page 20

Cancer Treatments

Excerpt of the taxonomy generated from:
- 200 PubMed abstracts on breast cancer treatments
- 149 (all) PubMed abstracts on lung cancer treatments
- 47 (all) PubMed abstracts on gastric cancer treatments

Colors of terms:
- proposed to group other terms
- found in two or more document collections
- in the breast treatment docs
- in the stomach treatment docs
- in the lung treatment docs

Taxonomy Statistics

Concept Count: 308
Edges Count: 387
Intermediate Count: 90
Leaves Count: 195
Labels Count: 371

Nesting Counts

0: 23
1: 52
2: 99
3: 138
4: 137
5: 159
6: 60
7: 36
8: 6
Average Depth: 3.88

Pages 21-26

[Figures: excerpts of the cancer treatments taxonomy. Colors of terms: proposed to group other terms; in two or more document collections; in the breast treatment docs; in the stomach treatment docs; in the lung treatment docs]

Page 27

Tamoxifen
Tamoxifen is a drug commonly used to treat breast cancer but with a subsequent indication for treating bipolar disorder.

Excerpt of the taxonomy generated from:
- papers discussing tamoxifen and bipolar disorder: 8 PubMed abstracts AND 2 PDFs of full papers (17641532, 18316672)
- papers discussing tamoxifen and breast cancer: 50 PubMed abstracts AND 2 PDFs of full papers (21635709, 12618491)
- papers discussing tamoxifen with no mention of either breast cancer or bipolar disorder: 50 PubMed abstracts AND 2 PDFs of full papers (16275887, 19458291)

Colors of terms:
- proposed to group other concepts
- in two or more document collections
- in the bipolar document collection
- in the breast cancer document collection
- in the neither cancer nor bipolar document collection

Taxonomy Statistics

Concept Count: 587
Edges Count: 751
Intermediate Count: 188
Leaves Count: 365
Labels Count: 718

Nesting Counts

0: 34
1: 73
2: 133
3: 284
4: 225
5: 157
6: 89
7: 30
8: 2
Average Depth: 3.66

Pages 28-33

[Figures: excerpts of the tamoxifen taxonomy. Colors of terms: proposed to group other concepts; in two or more document collections; in the bipolar document collection; in the breast cancer document collection; in the neither cancer nor bipolar document collection]

Page 34

Summary

Entity Extraction

Linked Data

Disambiguation

Consolidation

Case Studies

Page 35

More?

bit.ly/f-step

pingar.com

@PingarHQ

[email protected]

@annadivoli

Focused SKOS Taxonomy Extraction Process (F-STEP) wiki