Using Wikipedia as a reference for extracting semantic information

In this paper we present an algorithm that, using Wikipedia as a reference, extracts semantic information from an arbitrary text. Our algorithm refines a procedure proposed by others, which mines the full text of Wikipedia. Our refinement, based on a clustering approach, exploits the semantic information contained in certain types of Wikipedia hyperlinks and also introduces an analysis based on multiwords. Our algorithm outperforms current methods in that its output contains far fewer false positives. We were also able to determine which (structural) parts of the texts provide most of the semantic information extracted by the algorithm.

Using Wikipedia as a reference for extracting semantic information from a text

Andrea Prato & Marco Ronchetti
Università di Trento, Italy

Explicit Semantic Analysis

Gabrilovich & Markovitch, 2007

Throw away:

Stopwords
Fragment pages (< 100 words)
Suffixes (stemming)
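ESA represents a text as a weighted vector of Wikipedia concepts: every article is a concept, and a term's affinity to a concept is its TF-IDF weight in that article. The following is a minimal sketch of this idea, not the authors' implementation; the toy article texts stand in for the real, preprocessed Wikipedia dump.

```python
# Minimal ESA sketch: rank Wikipedia concepts for an input text by the
# summed TF-IDF weight its terms carry in each article (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer

articles = {
    "Swimming": "swimming pool stroke freestyle water sport race",
    "Olympic_Games": "olympic games athletes medal sport swimming",
    "Venice": "venice canals lagoon italy bridges gondola lido",
}

titles = list(articles)
vectorizer = TfidfVectorizer(stop_words="english")
concept_term = vectorizer.fit_transform(articles[t] for t in titles)  # concepts x terms

def top_concepts(text, k=5):
    """Rank concepts by the summed TF-IDF weight of the text's terms."""
    query = vectorizer.transform([text])                   # 1 x terms
    scores = (concept_term @ query.T).toarray().ravel()    # one score per concept
    return sorted(zip(titles, scores), key=lambda p: -p[1])[:k]

top_concepts("a spot of swimming off the Lido in Venice")
# -> Venice ranks highest; Swimming and Olympic_Games get smaller scores
```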

A sample (ESA)

The development of T-cell leukaemia following the otherwise successful treatment of three patients with X-linked severe combined immune deficiency (X-SCID) in gene-therapy trials using haematopoietic stem cells has led to a re-evaluation of this approach. Using a mouse model for gene therapy of X-SCID, we find that the corrective therapeutic gene IL2RG itself can act as a contributor to the genesis of T-cell lymphomas, with one-third of animals being affected. Gene-therapy trials for X-SCID, which have been based on the assumption that IL2RG is minimally oncogenic, may therefore pose some risk to patients.

- Leukemia
- Severe combined immunodeficiency
- Cancer
- Non-Hodgkin lymphoma
- AIDS
- ICD-10 Chapter II: Neoplasms
- ICD-10 Chapter III: Diseases of the blood and blood-forming organs, and certain disorders involving the immune mechanism
- Bone marrow transplant
- Immunosuppressive drug
- Acute lymphoblastic leukemia
- Multiple sclerosis

A sample (ESA)

Being so tightly packed, Venice doesn't make an ideal place to come to practise your favourite sport, although you'll get a decent workout just walking around and up and down bridges! If you've got any energy left for some extra exercise, try a spot of swimming (although pools are rare) or even a jog. Venice is a bit of a desert for swimmers. You can go in off the Lido (if you're game) or at one of Venice's two public swimming pools (handily, they close in summer).

Lonely Planet Tourist Guide

1. Glossary_of_cue_sports_terms
2. Swimming
3. Ian_Thorpe
4. NCAA_football_bowl_games,_2005-06
5. Swimming_machine
6. American_football_strategy
7. Contract_bridge_glossary
8. Olympic_Games
9. Pingu_episodes_series_6
10. Venice
…
15. Corruption_in_Ghana
…
27. Legislative_system_of_the_People's_Republic_of_China

Clustering

Wikipedia is hyperlinked

Swimming is clustered with Olympic Games
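The slides only state that the clustering exploits Wikipedia's hyperlink structure, so that, for example, Swimming ends up in the same cluster as Olympic Games. A minimal sketch under that assumption: candidate concepts are grouped into connected components of the hyperlink graph. The link data below is illustrative, not taken from the real dump.

```python
# Sketch: cluster candidate concepts via Wikipedia hyperlinks (toy data).
import networkx as nx

candidates = ["Swimming", "Olympic_Games", "Ian_Thorpe", "Venice",
              "Contract_bridge_glossary"]
links = {                      # page -> pages it links to (toy data)
    "Swimming": {"Olympic_Games", "Ian_Thorpe"},
    "Olympic_Games": {"Swimming"},
    "Ian_Thorpe": {"Swimming", "Olympic_Games"},
    "Venice": set(),
    "Contract_bridge_glossary": set(),
}

g = nx.Graph()
g.add_nodes_from(candidates)
for page, targets in links.items():
    for target in targets & set(candidates):
        g.add_edge(page, target)   # only links between candidate concepts matter

clusters = list(nx.connected_components(g))
# -> [{'Swimming', 'Olympic_Games', 'Ian_Thorpe'}, {'Venice'}, {'Contract_bridge_glossary'}]
```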

Throw away:

Large aggregators
Category links
Numbers
Pages with more than N = 100 links
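A small sketch of the pre-clustering filter implied by the list above; the N = 100 threshold comes from the slide, while the predicates and the function name are illustrative.

```python
# Sketch of the link/page filter described above (names are illustrative).
N_MAX_LINKS = 100   # threshold from the slide

def keep_page(title, outgoing_links, is_category=False, is_aggregator=False):
    """Return True if a linked page survives the pre-clustering filter:
    no category pages, no large aggregators, no purely numeric titles,
    and no pages with more than N_MAX_LINKS outgoing links."""
    if is_category or is_aggregator:
        return False
    if title.replace("_", "").isdigit():
        return False                      # e.g. year pages such as "2005"
    if len(outgoing_links) > N_MAX_LINKS:
        return False
    return True
```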

After clustering:

Only 3 clusters with cardinality larger than 1 remain. The first cluster, with cardinality 21, was automatically named Swimming. The second and the third both have cardinality equal to 2, and are named Training and Venice-bucentaur.

Validation: Turing test

Text classification: which one is machine-generated?

Outcome: 20 texts of length ranging between 60 and 200 words. Texts were collected from various sources like newspaper articles, textbooks, random web pages, and MSN Encarta.

Further improvements

Using only nouns

Use a POS tagger to identify syntactic roles in the document to be classified

Keep only nouns (throw away the rest)

No degradation in the results!
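The slides do not say which POS tagger is used; the sketch below uses NLTK's default tagger as one possible way to keep only the nouns of a document (Penn Treebank NN* tags).

```python
# Sketch: keep only the nouns of the document to be classified.
import nltk
# Requires the NLTK tokenizer and POS-tagger models (see nltk.download()).

def nouns_only(text):
    """Keep only the nouns (Penn Treebank tags NN, NNS, NNP, NNPS)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [word for word, tag in tagged if tag.startswith("NN")]

nouns_only("Venice is a bit of a desert for swimmers.")
# -> e.g. ['Venice', 'bit', 'desert', 'swimmers']
```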

Define Multiwords

Lexical multiword identification approach: The following generative pattern is considered

((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun

+ : one or more
* : zero or more
? : zero or one
| : or

Validation: a candidate multiword is valid if there is a Wikipedia entry related to it.
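The generative pattern above can be applied directly to a POS-tagged sentence. The sketch below maps Penn Treebank tags to Adj/Noun/Prep symbols and matches the pattern as a regular expression; the Wikipedia-entry validation step is only stubbed out.

```python
# Sketch: extract multiword candidates matching
#   ((Adj | Noun)+ | ((Adj | Noun)* (Noun Prep)?) (Adj | Noun)*) Noun
import re
import nltk

def simplify(tag):
    """Map Penn Treebank tags to the symbols used in the pattern."""
    if tag.startswith("JJ"):
        return "A"   # adjective
    if tag.startswith("NN"):
        return "N"   # noun
    if tag == "IN":
        return "P"   # preposition
    return "x"       # anything else breaks a candidate

PATTERN = re.compile(r"([AN]+|[AN]*(?:NP)?[AN]*)N")

def is_valid_multiword(candidate):
    """Stub: in the paper a candidate is kept only if a related Wikipedia
    entry exists; a real check would query the Wikipedia title index."""
    return True

def multiword_candidates(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tags = "".join(simplify(t) for _, t in tagged)
    candidates = []
    for m in PATTERN.finditer(tags):
        if m.end() - m.start() > 1:        # keep only multi-word spans
            cand = " ".join(w for w, _ in tagged[m.start():m.end()])
            if is_valid_multiword(cand):
                candidates.append(cand)
    return candidates

multiword_candidates("gene therapy trials for severe combined immune deficiency")
# -> e.g. ['gene therapy trials', ...] (exact spans depend on the tagger)
```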

Text with multiwords:

Keep all nouns
Keep all adjectives that are part of a multiword

Evaluation (human inspection of results): 100 samples (50 technical, 50 generic)

Multiwords improved significantly: 7 (5 technical)
Improved marginally: 13
Worsened marginally: 6

Overall improvement: 10% on technical text

Work in progress

Concept-mediated mapping among documents: how similar are two docs?

Diagram: Doc 1 maps to Concepts 1, 2, 3; Doc 3 maps to Concepts 2, 3, 4. Concepts 2 and 3 are shared.

Jaccard Index
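With each document reduced to a set of Wikipedia concepts, document similarity is the Jaccard index of the two concept sets; with the sets in the diagram above it evaluates to 2/4 = 0.5.

```python
# Jaccard index over the concept sets of two documents.
def jaccard(concepts_a, concepts_b):
    """Jaccard index of two concept sets: |A intersect B| / |A union B|."""
    a, b = set(concepts_a), set(concepts_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

doc1 = {"Concept 1", "Concept 2", "Concept 3"}
doc3 = {"Concept 2", "Concept 3", "Concept 4"}
jaccard(doc1, doc3)   # -> 0.5
```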

Syllabi comparison

Interlinks

Mapping documents in different languages, deploying Wikipedia interlinks

Diagram: as above, Doc 1 and Doc 3 are mapped to concepts and compared with the Jaccard index, but here the two documents are in different languages and their concepts are matched through Wikipedia interlanguage links (INTERLINKS).
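For documents in different languages, the extracted concepts can first be mapped into a single language through Wikipedia's interlanguage links and then compared with the same Jaccard index. A sketch with a toy interlink table (it reuses jaccard() from the previous sketch):

```python
# Sketch: cross-language comparison via Wikipedia interlanguage links (toy data).
interlinks_it_to_en = {            # Italian article title -> English article title
    "Venezia": "Venice",
    "Nuoto": "Swimming",
    "Giochi_olimpici": "Olympic_Games",
}

def to_english(concepts_it):
    """Map an Italian concept set to English titles via interlinks."""
    return {interlinks_it_to_en[c] for c in concepts_it if c in interlinks_it_to_en}

doc_en = {"Venice", "Swimming", "Lido"}           # concepts of an English document
doc_it = {"Venezia", "Nuoto", "Giochi_olimpici"}  # concepts of an Italian document
jaccard(doc_en, to_english(doc_it))               # -> 0.5, reusing jaccard() above
```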
