Using Wikipedia as a reference for extracting semantic information

In this paper we present an algorithm that, using Wikipedia as a reference, extracts semantic information from an arbitrary text. Our algorithm refines a procedure proposed by others, which mines the full text of Wikipedia. Our refinement, based on a clustering approach, exploits the semantic information contained in certain types of Wikipedia hyperlinks and also introduces an analysis based on multiwords. Our algorithm outperforms current methods in that its output contains far fewer false positives. We were also able to determine which (structural) parts of the texts provide most of the semantic information extracted by the algorithm.

Using Wikipedia as a reference for extracting semantic information from a text

Andrea Prato & Marco Ronchetti
Università di Trento, Italy

Explicit Semantic Analysis

Gabrilovich & Markovitch, 2007

Throw away:

Stopwords
Fragment pages (< 100 words)
Suffixes (stemming)
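ESA represents a text as a weighted vector of Wikipedia concepts: every article is a concept, and a term's affinity to a concept is its TF-IDF weight in that article. The following is a minimal sketch of this idea, not the authors' implementation; the toy article texts stand in for the real, preprocessed Wikipedia dump.

```python
# Minimal ESA sketch: rank Wikipedia concepts for an input text by the
# summed TF-IDF weight its terms carry in each article (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "Swimming": "swimming pool stroke freestyle water sport race",
    "Olympic_Games": "olympic games athletes medal sport swimming",
    "Venice": "venice canals lagoon italy bridges gondola lido",
}

titles = list(articles)
vectorizer = TfidfVectorizer(stop_words="english")
concept_term = vectorizer.fit_transform(articles[t] for t in titles)  # concepts x terms

def top_concepts(text, k=5):
    """Rank concepts by the summed TF-IDF weight of the text's terms."""
    query = vectorizer.transform([text])                   # 1 x terms
    scores = (concept_term @ query.T).toarray().ravel()    # one score per concept
    return sorted(zip(titles, scores), key=lambda p: -p[1])[:k]

top_concepts("a spot of swimming off the Lido in Venice")
# -> Venice ranks highest; Swimming and Olympic_Games get smaller scores
```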

A sample (ESA)

The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.

- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms
- ICD-10 Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis

A sample (ESA)

Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).

Lonely Planet Tourist Guide

1. Glossary_of_cue_sports_terms
2. Swimming
3. Ian_Thorpe
4. NCAA_football_bowl_games,_2005-06
5. Swimming_machine
6. American_football_strategy
7. Contract_bridge_glossary
8. Olympic_Games
9. Pingu_episodes_series_6
10. Venice
…
15. Corruption_in_Ghana
…
27. Legislative_system_of_the_People's_Republic_of_China

Clustering

Wikipedia is hyperlinked

Swimming is clustered with Olympic Games
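The slides only state that the clustering exploits Wikipedia's hyperlink structure, so that, for example, Swimming ends up in the same cluster as Olympic Games. A minimal sketch under that assumption: candidate concepts are grouped into connected components of the hyperlink graph. The link data below is illustrative, not taken from the real dump.

```python
# Sketch: cluster candidate concepts via Wikipedia hyperlinks (toy data).
import networkx as nx

candidates = ["Swimming", "Olympic_Games", "Ian_Thorpe", "Venice",
              "Contract_bridge_glossary"]
links = {                      # page -> pages it links to (toy data)
    "Swimming": {"Olympic_Games", "Ian_Thorpe"},
    "Olympic_Games": {"Swimming"},
    "Ian_Thorpe": {"Swimming", "Olympic_Games"},
    "Venice": set(),
    "Contract_bridge_glossary": set(),
}

g = nx.Graph()
g.add_nodes_from(candidates)
for page, targets in links.items():
    for target in targets & set(candidates):
        g.add_edge(page, target)   # only links between candidate concepts matter

clusters = list(nx.connected_components(g))
# -> [{'Swimming', 'Olympic_Games', 'Ian_Thorpe'}, {'Venice'}, {'Contract_bridge_glossary'}]
```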

Throw away:

Large aggregators
Category links
Numbers
Pages with more than N = 100 links
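A small sketch of the pre-clustering filter implied by the list above; the N = 100 threshold comes from the slide, while the predicates and the function name are illustrative.

```python
# Sketch of the link/page filter described above (names are illustrative).
N_MAX_LINKS = 100   # threshold from the slide

def keep_page(title, outgoing_links, is_category=False, is_aggregator=False):
    """Return True if a linked page survives the pre-clustering filter:
    no category pages, no large aggregators, no purely numeric titles,
    and no pages with more than N_MAX_LINKS outgoing links."""
    if is_category or is_aggregator:
        return False
    if title.replace("_", "").isdigit():
        return False                      # e.g. year pages such as "2005"
    if len(outgoing_links) > N_MAX_LINKS:
        return False
    return True
```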

After clustering:

Only 3 clusters with cardinality larger than 1 remain. The first cluster, with cardinality 21, was automatically named Swimming. The second and the third both have cardinality equal to 2, and are named Training and Venice-bucentaur.

Validation: Turing test

Text classification: which one is machine-generated?

Outcome: 20 texts of length ranging between 60 and 200 words. Texts were collected from various sources like newspaper articles, textbooks, random web pages, and MSN Encarta.

Further improvements

Using only nouns

Use a POS tagger to identify syntactic roles in the document to be classified

Keep only nouns (throw away the rest)

No degradation in the results!
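The slides do not say which POS tagger is used; the sketch below uses NLTK's default tagger as one possible way to keep only the nouns of a document (Penn Treebank NN* tags).

```python
# Sketch: keep only the nouns of the document to be classified.
import nltk
# Requires the NLTK tokenizer and POS-tagger models (see nltk.download()).

def nouns_only(text):
    """Keep only the nouns (Penn Treebank tags NN, NNS, NNP, NNPS)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith("NN")]

nouns_only("Venice is a bit of a desert for swimmers.")
# -> e.g. ['Venice', 'bit', 'desert', 'swimmers']
```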

Define Multiwords

Lexical multiword identification approach: The following generative pattern is considered

((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun

+ : one or more
* : zero or more
? : zero or one
| : or

Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
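The generative pattern above can be applied directly to a POS-tagged sentence. The sketch below maps Penn Treebank tags to Adj/Noun/Prep symbols and matches the pattern as a regular expression; the Wikipedia-entry validation step is only stubbed out.

```python
# Sketch: extract multiword candidates matching
#   ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun
import re
import nltk

def simplify(tag):
    """Map Penn Treebank tags to the symbols used in the pattern."""
    if tag.startswith("JJ"):
        return "A"   # adjective
    if tag.startswith("NN"):
        return "N"   # noun
    if tag == "IN":
        return "P"   # preposition
    return "x"       # anything else breaks a candidate

PATTERN = re.compile(r"([AN]+|[AN]*(?:NP)?[AN]*)N")

def is_valid_multiword(candidate):
    """Stub: in the paper a candidate is kept only if a related Wikipedia
    entry exists; a real check would query the Wikipedia title index."""
    return True

def multiword_candidates(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tags = "".join(simplify(t) for _, t in tagged)
    candidates = []
    for m in PATTERN.finditer(tags):
        if m.end() - m.start() > 1:        # keep only multi-word spans
            cand = " ".join(w for w, _ in tagged[m.start():m.end()])
            if is_valid_multiword(cand):
                candidates.append(cand)
    return candidates

multiword_candidates("gene therapy trials for severe combined immune deficiency")
# -> e.g. ['gene therapy trials', ...] (exact spans depend on the tagger)
```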

Text with multiwords:

Keep all nouns
Keep all adjectives that are part of a multiword

Evaluation (human inspection of results): 100 samples (50 technical, 50 generic)

Multiwords improved significantly: 7 (5 technical)
Improved marginally: 13
Worsened marginally: 6

Overall improvement: 10% on technical text

Work in progress

Concept-mediated mapping among documents: how similar are two docs?

Diagram: Doc 1 maps to Concepts 1, 2, 3; Doc 3 maps to Concepts 2, 3, 4. Concepts 2 and 3 are shared.

Jaccard Index
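With each document reduced to a set of Wikipedia concepts, document similarity is the Jaccard index of the two concept sets; with the sets in the diagram above it evaluates to 2/4 = 0.5.

```python
# Jaccard index over the concept sets of two documents.
def jaccard(concepts_a, concepts_b):
    """Jaccard index of two concept sets: |A intersect B| / |A union B|."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

doc1 = {"Concept 1", "Concept 2", "Concept 3"}
doc3 = {"Concept 2", "Concept 3", "Concept 4"}
jaccard(doc1, doc3)   # -> 0.5
```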

Syllabi comparison

Interlinks

Mapping documents in different languages, deploying Wikipedia interlinks

Diagram: as above, Doc 1 and Doc 3 are mapped to concepts and compared with the Jaccard index, but here the two documents are in different languages and their concepts are matched through Wikipedia interlanguage links (INTERLINKS).
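For documents in different languages, the extracted concepts can first be mapped into a single language through Wikipedia's interlanguage links and then compared with the same Jaccard index. A sketch with a toy interlink table (it reuses jaccard() from the previous sketch):

```python
# Sketch: cross-language comparison via Wikipedia interlanguage links (toy data).
interlinks_it_to_en = {            # Italian article title -> English article title
    "Venezia": "Venice",
    "Nuoto": "Swimming",
    "Giochi_olimpici": "Olympic_Games",
}

def to_english(concepts_it):
    """Map an Italian concept set to English titles via interlinks."""
    return {interlinks_it_to_en[c] for c in concepts_it if c in interlinks_it_to_en}

doc_en = {"Venice", "Swimming", "Lido"}           # concepts of an English document
doc_it = {"Venezia", "Nuoto", "Giochi_olimpici"}  # concepts of an Italian document
jaccard(doc_en, to_english(doc_it))               # -> 0.5, reusing jaccard() above
```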
