Using Wikipedia as a reference for extracting semantic information from a text. Andrea Prato & Marco Ronchetti, Università di Trento, Italy

In this paper we present an algorithm that, using Wikipedia as a reference, extracts semantic information from an arbitrary text. Our algorithm refines a procedure proposed by others, which mines the full text of Wikipedia. Our refinement, based on a clustering approach, exploits the semantic information contained in certain types of Wikipedia hyperlinks, and also introduces an analysis based on multiwords. Our algorithm outperforms current methods in that its output contains far fewer false positives. We were also able to determine which (structural) part of the texts provides most of the semantic information extracted by the algorithm.

Page 1: Using Wikipedia as a reference for extracting semantic information

Using Wikipedia as a reference for extracting semantic information from a text

Andrea Prato & Marco Ronchetti

Università di Trento, Italy

Page 2: Using Wikipedia as a reference for extracting semantic information

Explicit Semantic Analysis

Gabrilovich & Markovitch, 2007

Page 3: Using Wikipedia as a reference for extracting semantic information

Throw away:

- Stopwords
- Fragment pages (<100 words)
- Suffixes (stemming)
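The preprocessing steps above can be sketched together with a toy version of the ESA ranking. This is a minimal illustration, not the implementation used by the authors: it assumes a small in-memory dictionary of pages in place of Wikipedia, and the stopword list, the crude `stem` function, and the names `build_index` and `rank_concepts` are all hypothetical. Terms are weighted by TF-IDF, and a text is scored against each page by summing the weights of the terms they share.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stopword list (an assumption, not the list used in the talk).
STOPWORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "for"}

def stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer.
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    # Lowercase, keep alphabetic words, drop stopwords, strip suffixes.
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]

def build_index(pages, min_words=100):
    """Map each term to {page title: TF-IDF weight}, skipping fragment pages."""
    kept = {title: tokenize(body) for title, body in pages.items()
            if len(body.split()) >= min_words}
    doc_freq = Counter(term for toks in kept.values() for term in set(toks))
    index = defaultdict(dict)
    for title, toks in kept.items():
        for term, tf in Counter(toks).items():
            index[term][title] = tf * math.log(len(kept) / doc_freq[term])
    return index

def rank_concepts(text, index):
    """Return page titles ordered by decreasing score for the given text."""
    scores = Counter()
    for term, tf in Counter(tokenize(text)).items():
        for title, weight in index.get(term, {}).items():
            scores[title] += tf * weight
    return [title for title, _ in scores.most_common()]
```

With the default `min_words=100`, any page shorter than 100 words is ignored, which mirrors the "fragment pages" rule on the slide.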

Page 4: Using Wikipedia as a reference for extracting semantic information

A sample (ESA)

The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.

- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms; Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis

Page 5: Using Wikipedia as a reference for extracting semantic information

A sample (ESA)

Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).

Lonely Planet Tourist Guide

1 - Glossary_of_cue_sports_terms
2 - Swimming
3 - Ian_Thorpe
4 - NCAA_football_bowl_games,_2005-06
5 - Swimming_machine
6 - American_football_strategy
7 - Contract_bridge_glossary
8 - Olympic_Games
9 - Pingu_episodes_series_6
10 - Venice
…
15 - Corruption_in_Ghana
…
27 - Legislative_system_of_the_People's_Republic_of_China

Page 6: Using Wikipedia as a reference for extracting semantic information

Clustering

Page 7: Using Wikipedia as a reference for extracting semantic information

Wikipedia is hyperlinked

Page 8: Using Wikipedia as a reference for extracting semantic information

Swimming is clustered with Olympic Games

Page 9: Using Wikipedia as a reference for extracting semantic information

A sample (ESA)

Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).

Lonely Planet Tourist Guide

1 - Glossary_of_cue_sports_terms
2 - Swimming
3 - Ian_Thorpe
4 - NCAA_football_bowl_games,_2005-06
5 - Swimming_machine
6 - American_football_strategy
7 - Contract_bridge_glossary
8 - Olympic_Games
9 - Pingu_episodes_series_6
10 - Venice
…
15 - Corruption_in_Ghana
…
27 - Legislative_system_of_the_People's_Republic_of_China

Page 10: Using Wikipedia as a reference for extracting semantic information

Throw away:

- Large aggregators
- Category links
- Numbers
- Pages with more than N links (N=100)
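The link-pruning rules above could be expressed as a simple filter. This is a sketch under stated assumptions: the function name `filter_links`, the `outdegree` lookup table, and the `Category:` prefix test are illustrative choices, and "large aggregators" are approximated here by the same link-count threshold as the last rule.

```python
import re

def filter_links(links, outdegree, max_links=100):
    """Drop link targets that carry little topical signal.

    links: titles linked from a page.
    outdegree: hypothetical lookup, title -> number of links on that page.
    """
    kept = []
    for target in links:
        if target.startswith("Category:"):
            continue  # category links
        if re.fullmatch(r"\d+", target):
            continue  # purely numeric titles, such as year pages
        if outdegree.get(target, 0) > max_links:
            continue  # pages with more than max_links links (aggregators)
        kept.append(target)
    return kept
```

Targets missing from `outdegree` are kept, which is a design choice of this sketch rather than something stated on the slide.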

Page 11: Using Wikipedia as a reference for extracting semantic information

After clustering:

Only 3 clusters have cardinality larger than 1. The first cluster, with cardinality 21, was automatically named Swimming. The second and the third both have cardinality equal to 2, and they are named Training and Venice-bucentaur.
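The slides do not spell out the clustering procedure, so the following is only one plausible reading: candidate concepts are grouped into connected components of the (already filtered) hyperlink graph, and singleton components can then be discarded as likely false positives. The function name `cluster_by_links` and the input format are assumptions, and the automatic naming of clusters is not covered.

```python
from collections import defaultdict

def cluster_by_links(concepts, links):
    """Group concepts into connected components of the hyperlink graph.

    concepts: candidate titles returned for a text.
    links: hypothetical dict, title -> set of titles that page links to.
    """
    concept_set = set(concepts)
    neighbours = defaultdict(set)
    for src in concepts:
        for dst in links.get(src, ()):
            if dst in concept_set and dst != src:
                neighbours[src].add(dst)
                neighbours[dst].add(src)  # links are treated as undirected
    seen, clusters = set(), []
    for start in concepts:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return sorted(clusters, key=len, reverse=True)
```

Keeping only the clusters with more than one member reproduces the kind of pruning described above, where unrelated hits such as Corruption_in_Ghana end up isolated.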

Page 12: Using Wikipedia as a reference for extracting semantic information

Validation: Turing test

[Figure: a text shown together with two classifications.]

Which one is machine-generated?

Page 13: Using Wikipedia as a reference for extracting semantic information

Outcome

20 texts of length ranging between 60 and 200 words. Texts were collected from various sources such as newspaper articles, textbooks, random web pages, and MSN Encarta.

Page 14: Using Wikipedia as a reference for extracting semantic information

Further improvements

Page 15: Using Wikipedia as a reference for extracting semantic information

Using only nouns

Use a POS tagger to identify syntactic roles in the document to be classified

Keep only nouns (throw away the rest)

No degradation in the results!
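The noun-only filter can be sketched as follows, assuming the text has already been tagged by some POS tagger that emits Penn Treebank tags (the slides do not say which tagger was used). The function name `keep_nouns` is hypothetical.

```python
def keep_nouns(tagged_tokens):
    """Keep tokens whose Penn Treebank tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]
```

For a tagged fragment such as `[("Venice", "NNP"), ("is", "VBZ"), ("a", "DT"), ("desert", "NN")]`, only "Venice" and "desert" are kept.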

Page 16: Using Wikipedia as a reference for extracting semantic information

Define Multiwords

Lexical multiword identification approach: The following generative pattern is considered

((Adj | Noun)+ | ((Adj | Noun)* (Noun-Prep)?) (Adj | Noun)*) Noun

+: one or more; *: zero or more; ?: zero or one; |: or

Validation: A candidate multiword is valid if there is a Wikipedia entry related to it.
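One way to apply the pattern is to encode each POS tag as a single letter and run an ordinary regular expression over the encoded sequence. The sketch below assumes Penn Treebank tags, reads "Noun-Prep" as a noun followed by a preposition, and stands in for the Wikipedia lookup with a plain set of titles; `tag_class`, `multiword_candidates`, `valid_multiwords`, and `wikipedia_titles` are hypothetical names.

```python
import re

# A = adjective, N = noun, P = preposition; mirrors the pattern on the slide.
PATTERN = re.compile(r"(?:[AN]+|[AN]*(?:NP)?[AN]*)N")

def tag_class(tag):
    if tag.startswith("JJ"):
        return "A"
    if tag.startswith("NN"):
        return "N"
    if tag == "IN":
        return "P"  # Penn tag IN also covers subordinating conjunctions
    return "x"      # any other tag breaks a candidate

def multiword_candidates(tagged_tokens):
    """Return word sequences of length >= 2 whose tags match the pattern."""
    encoded = "".join(tag_class(tag) for _, tag in tagged_tokens)
    words = [word for word, _ in tagged_tokens]
    return [" ".join(words[m.start():m.end()])
            for m in PATTERN.finditer(encoded)
            if m.end() - m.start() >= 2]

def valid_multiwords(tagged_tokens, wikipedia_titles):
    """Keep candidates that have an entry in the given set of titles."""
    return [c for c in multiword_candidates(tagged_tokens)
            if c.lower().replace(" ", "_") in wikipedia_titles]
```

Python's alternation takes the first branch that succeeds, so the prepositional branch is used only when a plain adjective/noun run does not match at that position. Only maximal matches are checked against the titles here; testing sub-spans as well would be a natural extension.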

Page 17: Using Wikipedia as a reference for extracting semantic information

Text with multiwords:

- Keep all nouns
- Keep all adjectives that are part of a multiword

Page 18: Using Wikipedia as a reference for extracting semantic information

Evaluation (human inspection of results)

- 100 samples (50 technical, 50 generic)
- Multiwords improved results significantly: 7 (5 technical)
- Improved marginally: 13
- Worsened marginally: 6

Overall improvement: 10% on technical text

Page 19: Using Wikipedia as a reference for extracting semantic information

Work in progress

Page 20: Using Wikipedia as a reference for extracting semantic information

Concept-mediated mapping among documents

How similar are two docs?

[Figure: Doc 1 and Doc 3 mapped to Concepts 1 to 4, with Concept 2 and Concept 3 appearing twice.]

Jaccard Index
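The Jaccard index of two concept sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name and the convention for two empty sets are assumptions of this example.

```python
def jaccard_index(concepts_a, concepts_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)
```

If, as one reading of the diagram, Doc 1 maps to Concepts 1, 2, 3 and Doc 3 maps to Concepts 2, 3, 4, the two documents share 2 of 4 distinct concepts, giving an index of 0.5.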

Page 21: Using Wikipedia as a reference for extracting semantic information

Syllabi comparison

Page 22: Using Wikipedia as a reference for extracting semantic information

Interlinks

Page 23: Using Wikipedia as a reference for extracting semantic information

Mapping documents in different languages

Deploying Wikipedia Interlinks

[Figure: Doc 1 and Doc 3 mapped to Concepts 1 to 4, with Concept 2 and Concept 3 appearing twice; the two sides are connected through INTERLINKS.]

Jaccard Index
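The cross-language comparison could be sketched by first translating one document's concepts through a table of interlanguage links and then computing the Jaccard index in the target language. The `interlinks` dictionary, the function names, and the decision to drop concepts without a counterpart are assumptions of this example, not details given in the slides.

```python
def to_pivot_language(concepts, interlinks):
    """Translate titles via an interlink table; drop titles with no counterpart."""
    return {interlinks[c] for c in concepts if c in interlinks}

def cross_language_similarity(concepts_src, concepts_dst, interlinks):
    """Jaccard index between translated source concepts and target concepts."""
    mapped = to_pivot_language(concepts_src, interlinks)
    target = set(concepts_dst)
    union = mapped | target
    return len(mapped & target) / len(union) if union else 0.0
```

Dropping untranslatable concepts keeps the measure defined, at the cost of ignoring topics that exist in only one language edition.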