T. Flati, D. Vannella, T. Pasini, R. Navigli 2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project ERC Starting Grant MultiJEDI No. 259234

2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project


Page 1: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

T. Flati, D. Vannella, T. Pasini, R. Navigli

2 Is Bigger (and Better) Than 1: the Wikipedia Bitaxonomy Project

ERC Starting Grant MultiJEDI No. 259234

Page 2: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Wikipedia structure

Article pages: ~4M

Category pages: ~700K

Two noisy graphs with no explicit hypernym relation.

Page 3: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Wikipedia structure: an example

[Figure: graph of pages and categories. Node labels: Mickey Mouse, Funny Animal, Superman, Cartoon, Donald Duck, Disney comics characters, Disney comics, Disney character, Fictional characters by medium, Comics by genre, Fictional characters, The Walt Disney Company]

Page 4: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Our goal

To automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion.

[Figure labels: pages, categories]

Page 5: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Our goal

To automatically create a Wikipedia Bitaxonomy for Wikipedia pages and categories in a simultaneous fashion.

KEY IDEA: the page and category levels are mutually beneficial for inducing a wide-coverage and fine-grained integrated taxonomy.

Page 6: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Key idea

[Figure: pages and categories connected by “is a” edges. Node labels: Disney comics characters, Disney comics, Disney character, The Walt Disney Company, Fictional characters by medium, Comics by genre, Fictional characters, Mickey Mouse, Funny Animal, Superman, Cartoon, Donald Duck]

Page 7: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

A 3-phase method

pages categories

Starting from two noisy graphs

Page 8: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

A 3-phase method

1. Build the page taxonomy

pages

Page 9: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

A 3-phase method

1. Build the page taxonomy
2. Bitaxonomy Algorithm

pages categories

Page 10: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

A 3-phase method

pages categories

1. Build the page taxonomy
2. Bitaxonomy Algorithm

Page 11: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

A 3-phase method

1. Build the page taxonomy
2. Bitaxonomy Algorithm
3. Refine the category taxonomy

[Figure labels: pages, categories; annotation: +50% categories]

Page 12: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Contributions

1. Self-contained approach

2. Page taxonomy and category taxonomy built simultaneously

3. State-of-the-art results when compared to all other available taxonomies

Page 13: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

1. The WiBi Page taxonomy

Page 14: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Assumptions

• The first sentence of a page is a good definition (also called gloss)

Page 15: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The WiBi Page taxonomy

1. [Syntactic step] Extract the hypernym lemma from a page definition using a syntactic parser;
2. [Semantic step] Apply a set of linking heuristics to disambiguate the extracted lemma.

Example definition: Scrooge McDuck is a character […]

Syntactic step: dependency parse of the definition (relations shown: nn, nsubj, cop). Hypernym lemma: character

Semantic step: link the lemma “character” to a Wikipedia page.
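The syntactic step can be sketched on input that has already been parsed. The snippet below is only an illustration, not the authors' implementation: it assumes the dependencies arrive as (relation, head, dependent) triples, and the `det` relation and the function name `hypernym_lemma` are introduced here for the example. It picks the token that governs both an `nsubj` and a `cop` relation, which is the predicate noun of a copular definition.

```python
def hypernym_lemma(dependencies):
    """Return the token governing both an nsubj and a cop relation, or None.

    `dependencies` is a list of (relation, head, dependent) triples, as a
    syntactic parser might produce them (assumed input format).
    """
    heads_with_subject = {head for rel, head, _ in dependencies if rel == "nsubj"}
    heads_with_copula = {head for rel, head, _ in dependencies if rel == "cop"}
    candidates = sorted(heads_with_subject & heads_with_copula)
    return candidates[0] if candidates else None


# Hand-written parse of "Scrooge McDuck is a character"; nn, nsubj and cop
# appear on the slide, det is added here to complete the example.
parse = [
    ("nn", "McDuck", "Scrooge"),
    ("nsubj", "character", "McDuck"),
    ("cop", "character", "is"),
    ("det", "character", "a"),
]

lemma = hypernym_lemma(parse)
print(lemma)
```

A real pipeline would obtain the triples from a parser and lemmatize the selected token; both steps are left out of this sketch.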

Page 16: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The semantic step

5 cascading linking heuristics

Ambiguous hypernym (‘player’) + Target page (Cristiano Ronaldo) → Linking heuristic → Disambiguated hypernym (Football player)

1. Crowdsourced
2. Category
3. Multiword
4. Monosemous
5. Distributional
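A cascade of this kind can be sketched as an ordered list of functions, where the first heuristic that returns a target wins. The two toy heuristics below are hypothetical stand-ins, not the actual WiBi heuristics, and which heuristic resolves the Cristiano Ronaldo example is chosen arbitrarily for the illustration.

```python
from typing import Callable, List, Optional, Tuple

# A heuristic maps (page title, ambiguous lemma) to a target page or None.
Heuristic = Callable[[str, str], Optional[str]]


def link_hypernym(
    page: str, lemma: str, heuristics: List[Tuple[str, Heuristic]]
) -> Optional[Tuple[str, str]]:
    """Try each heuristic in order; return (heuristic name, target page)."""
    for name, heuristic in heuristics:
        target = heuristic(page, lemma)
        if target is not None:
            return name, target
    return None  # no heuristic could disambiguate the lemma


def toy_crowdsourced(page: str, lemma: str) -> Optional[str]:
    # Stand-in: pretend no link was found for this lemma.
    return None


def toy_category(page: str, lemma: str) -> Optional[str]:
    # Stand-in lookup table, hard-coded for the slide's example.
    table = {("Cristiano Ronaldo", "player"): "Football player"}
    return table.get((page, lemma))


cascade = [("crowdsourced", toy_crowdsourced), ("category", toy_category)]
result = link_hypernym("Cristiano Ronaldo", "player", cascade)
print(result)
```

Ordering the list puts the heuristics tried first ahead of the fallbacks, so a later heuristic only runs when all earlier ones abstain.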

Page 17: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

1. Crowdsourced heuristic

Mickey Mouse is a funny animal cartoon character and the official mascot of The Walt Disney Company.

Use the links from the crowd!

Page 18: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Ambiguous hypernym: Character

Categories: Characters in Disney package films, Disney comics characters

Pages: Donald Duck, Pluto, Hook, Mickey Mouse, José Carioca, Goofy

Page 19: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Ambiguous hypernym: Character

Categories: Characters in Disney package films, Disney comics characters

Pages: Donald Duck, Pluto, Hook, Mickey Mouse, José Carioca, Goofy

Definitions of pages in these categories:

• Goofy is a funny animal cartoon character […]
• José Carioca is a Disney cartoon character […]
• Captain James Hook is a fictional character […]
• Mickey Mouse is a funny animal cartoon character […]
• Pluto, also called Pluto the Pup, is a cartoon character […]
• Mickey Mouse is a funny animal cartoon character […]

Page 20: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Page: Donald Duck. Ambiguous hypernym: Character

Categories: Characters in Disney package films, Disney comics characters

Definitions of pages in these categories:

• Goofy is a funny animal cartoon character […]
• José Carioca is a Disney cartoon character […]
• Captain James Hook is a fictional character […]
• Mickey Mouse is a funny animal cartoon character […]
• Pluto, also called Pluto the Pup, is a cartoon character […]
• Mickey Mouse is a funny animal cartoon character […]

Sense counts, one line per category:

• Character (arts) 5, Funny animal 1
• Character (arts) 3, Funny animal 1, Cartoon 1

Total: Character (arts) 8, Funny animal 2, Cartoon 1

Page 21: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

2. Category heuristic

Given a page and its ambiguous hypernym, exploit its categories to build a distribution of the hypernym’s senses.

Page: Donald Duck. Ambiguous hypernym: Character

Distribution: Character (arts) 8, Funny animal 2, Cartoon 1

Selected sense: Character (arts)
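The counting behind the category heuristic can be reproduced with the numbers shown on the slides. This is a minimal sketch: the function name `sense_distribution` is made up here, and the slides do not say which of the two categories produced which count, so the per-category counters are kept anonymous.

```python
from collections import Counter


def sense_distribution(per_category_counts):
    """Sum the per-category sense counts into one distribution."""
    total = Counter()
    for counts in per_category_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


# Sense counts for the lemma "character", one Counter per category of the
# target page (values copied from the slides).
per_category = [
    Counter({"Character (arts)": 5, "Funny animal": 1}),
    Counter({"Character (arts)": 3, "Funny animal": 1, "Cartoon": 1}),
]

distribution = sense_distribution(per_category)
best_sense = distribution.most_common(1)[0][0]
print(distribution, best_sense)
```

The most frequent sense, Character (arts), is the one selected for Donald Duck on the slide.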

Page 22: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy linking heuristics

[Chart: number of pages linked by each heuristic]

1. Crowdsourced (1.338M)
2. Category (1.603M)
3. Multiword (65K)
4. Monosemous (161K)
5. Distributional (561K)

Page 23: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy evaluation

Page 24: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The story so far

1. Noisy page graph → Page taxonomy

Page 25: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

2. The Bitaxonomy algorithm

Page 26: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Bitaxonomy algorithm

● The information available in the two taxonomies is mutually beneficial;
● At each step exploit one taxonomy to update the other and vice versa;
● Repeat until convergence.

Page 27: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Bitaxonomy algorithm

Starting from the page taxonomy

[Figure: pages and categories with “is a” edges. Node labels: Real Madrid F.C., Atlético Madrid, Football team, Football teams, Football clubs in Madrid, Football clubs]

Page 28: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Bitaxonomy algorithm

Exploit the cross links to infer hypernym relations in the category taxonomy

[Figure: pages and categories with “is a” edges. Node labels: Real Madrid F.C., Atlético Madrid, Football team, Football teams, Football clubs in Madrid, Football clubs]

Page 29: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Bitaxonomy algorithm

Take advantage of cross links to infer back is-a relations in the page taxonomy

[Figure: pages and categories with “is a” edges. Node labels: Real Madrid F.C., Atlético Madrid, Football team, Football teams, Football clubs in Madrid, Football clubs]

Page 30: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Bitaxonomy algorithm

Use the relations found in the previous step to infer new hypernym edges

[Figure: pages and categories with “is a” edges. Node labels: Real Madrid F.C., Atlético Madrid, Football team, Football teams, Football clubs in Madrid, Football clubs]

Page 31: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The Bitaxonomy algorithm

Mutual enrichment of both taxonomies until convergence

[Figure: pages and categories with “is a” edges. Node labels: Atlético Madrid, Real Madrid F.C., Football team, Football teams, Football clubs in Madrid, Football clubs]
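The alternation walked through on these slides can be sketched as a small fixed-point loop. This is a simplified illustration under stated assumptions, not the published algorithm: hypernym candidates are chosen by a plain majority vote through the cross links, the function names `invert`, `infer` and `bitaxonomy` are made up here, and the cross links in the toy data are assumed for the football example.

```python
from collections import Counter


def invert(links):
    """Turn page -> [categories] cross links into category -> [pages]."""
    inverted = {}
    for source, targets in links.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def infer(nodes, own_hyper, cross, other_hyper, back):
    """Propose hypernyms for nodes on one side that lack one.

    Follow cross links to the other side, read the hypernyms found there,
    and project them back through the cross links as votes.
    """
    new_edges = {}
    for node in nodes:
        if node in own_hyper:
            continue
        votes = Counter()
        for linked in cross.get(node, []):
            hyper = other_hyper.get(linked)
            for candidate in back.get(hyper, []):
                if candidate != node:  # never propose a self-loop
                    votes[candidate] += 1
        if votes:
            new_edges[node] = votes.most_common(1)[0][0]
    return new_edges


def bitaxonomy(pages, categories, page_hyper, page_to_cats):
    """Alternate category and page updates until nothing new is inferred."""
    cat_to_pages = invert(page_to_cats)
    page_hyper, cat_hyper = dict(page_hyper), {}
    while True:
        new_cat = infer(categories, cat_hyper, cat_to_pages, page_hyper, page_to_cats)
        cat_hyper.update(new_cat)
        new_page = infer(pages, page_hyper, page_to_cats, cat_hyper, cat_to_pages)
        page_hyper.update(new_page)
        if not new_cat and not new_page:  # convergence
            return page_hyper, cat_hyper


# Toy data; the cross links below are assumptions for the illustration.
pages = ["Real Madrid F.C.", "Atlético Madrid", "Football team"]
categories = ["Football clubs in Madrid", "Football teams"]
initial_page_taxonomy = {"Real Madrid F.C.": "Football team"}
cross_links = {
    "Real Madrid F.C.": ["Football clubs in Madrid"],
    "Atlético Madrid": ["Football clubs in Madrid"],
    "Football team": ["Football teams"],
}

page_taxonomy, category_taxonomy = bitaxonomy(
    pages, categories, initial_page_taxonomy, cross_links
)
print(page_taxonomy)
print(category_taxonomy)
```

Each pass only adds edges for nodes that still lack a hypernym, so the loop terminates once a pass adds nothing. The sketch does not check for cycles, which a real implementation would need to handle.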

Page 32: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy evaluation (cont’d)

Appreciable 3% increase in recall and coverage, with unchanged precision

Page 33: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy evaluation

Page 34: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

The story so far

2

Page 35: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

3. The WiBi category taxonomy refinement

Page 36: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy refinement

Some categories are affected by structural problems.

No pages associated!

[Figure: pages and categories. Category labels: Comics characters by protagonist, Comics characters, Garfield characters]

Page 37: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy refinement

● 3 refinement procedures to obtain broader coverage for categories
  o Single super category
  o Sub-categories
  o Super-categories

Page 38: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Single super category

This category has only 1 outgoing edge, so we promote its only super category to hypernym.

[Figure: category graph. Node labels: Comics characters by protagonist, Comics characters, Garfield characters, Animated television characters by series, Animated characters, Fictional characters by medium, Animation]
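The single super category rule is simple enough to sketch directly. The function name `promote_single_super` and the edges in the toy graph are assumptions made for this illustration; the slide does not spell out which category in the figure has the single outgoing edge.

```python
def promote_single_super(super_categories, hypernym_of):
    """Give uncovered categories with exactly one super category a hypernym.

    super_categories: category -> list of its super categories in the
    noisy category graph. hypernym_of: hypernyms already in the taxonomy.
    """
    refined = dict(hypernym_of)
    for category, supers in super_categories.items():
        if category not in refined and len(supers) == 1:
            refined[category] = supers[0]  # promote the only super category
    return refined


# Hypothetical edges, loosely based on the node labels in the figure.
noisy_graph = {
    "Comics characters by protagonist": ["Comics characters"],
    "Animated television characters by series": [
        "Animated characters",
        "Fictional characters by medium",
    ],
}

refined = promote_single_super(noisy_graph, hypernym_of={})
print(refined)
```

Categories with two or more super categories are left untouched by this rule and are handled by the other refinement procedures.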

Page 39: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Sub-categories

Focus on subcategories which have already been covered!

[Figure: category graph. Node labels: Comics characters by company, Disney comics, Comics by company, Comics characters, DC Comics characters, Marvel Comics characters, Comics titles by company]

Page 40: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Sub-categories

Focus on subcategories which have already been covered!

Only 1 path ending in u; 2 paths ending in v

[Figure: category graph. Node labels: Comics characters by company, Disney comics, Comics by company, Comics characters, DC Comics characters, Marvel Comics characters, Comics titles by company]

Page 41: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy evaluation: coverage

+50% categories covered!

[Plot: axis labels 1SUP, SUB, SUPER]

Page 42: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy evaluation: P & R

[Plot: axis labels Iterations, 1SUP, SUB, SUPER; annotations: +35% recall, 86%]

Page 43: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Experimental setup

● We created 2 datasets:
  o 1000 randomly sampled pages;
  o 1000 randomly sampled categories.
● Each item was annotated with the most suitable generalization (lemma+page or category).

Page 44: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Competitors

WikiNet

MENTA

WikiTaxonomy

pages categories

Page 45: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Measures

● We calculated typical measures to assess the quality of all the possible taxonomies:
  o Precision
  o Recall
  o Coverage
  o Specificity
  o Granularity
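The slide names the measures without defining them. As a rough illustration only, the sketch below assumes common conventions: precision is the share of returned hypernyms that are correct, recall is the share of items with at least one correct hypernym, and coverage is the share of items with at least one answer of any kind. The function name `evaluate`, the gold annotations and the predictions are all hypothetical toy values.

```python
def evaluate(gold, predicted):
    """Compute precision, recall and coverage under the assumed definitions.

    gold: item -> set of acceptable hypernyms.
    predicted: item -> list of hypernyms returned by a system.
    """
    if not gold:
        raise ValueError("gold standard must not be empty")
    returned = correct = answered = hits = 0
    for item, acceptable in gold.items():
        answers = set(predicted.get(item, []))
        returned += len(answers)
        correct += len(answers & acceptable)
        answered += 1 if answers else 0
        hits += 1 if answers & acceptable else 0
    return {
        "precision": correct / returned if returned else 0.0,
        "recall": hits / len(gold),
        "coverage": answered / len(gold),
    }


# Hypothetical annotations and system output.
gold = {
    "Frank Sinatra": {"Singer", "Swing singer"},
    "Mickey Mouse": {"Character (arts)"},
    "Real Madrid F.C.": {"Football team"},
    "Donald Duck": {"Character (arts)"},
}
predicted = {
    "Frank Sinatra": ["Swing singer"],
    "Mickey Mouse": ["Cartoon"],
    "Real Madrid F.C.": ["Football team"],
}

scores = evaluate(gold, predicted)
print(scores)
```

Here three of four items receive an answer and two of the three answers are correct, so coverage is 0.75, recall is 0.5 and precision is two thirds.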

Page 46: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy comparison

Page 47: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy comparison

Page 48: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy comparison

Page 49: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy comparison

Page 50: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy comparison

Specificity measure

Page 51: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Measuring specificity

A system is more specific than another when the hypernym(s) provided by the former are more specific/informative than those provided by the latter.

Example: “Frank Sinatra is a”

System 1: “Singer”

System 2: “Swing singer”

System 1 < System 2 (less specific than)

Page 52: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy specificity

Ratio of the times in which WiBi provided a more specific answer than the other system

Page 53: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Page taxonomy specificity

Ratio of the times in which WiBi provided a less specific answer than the other system
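The two ratios are simple proportions over pairwise comparisons. The sketch below assumes each comparison has already been labelled "more", "less" or "equal" from WiBi's point of view; the labels, the function name `specificity_ratios` and the toy judgments are hypothetical.

```python
from collections import Counter


def specificity_ratios(judgments):
    """Return (ratio more specific, ratio less specific) over all judgments."""
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    total = len(judgments)
    return counts["more"] / total, counts["less"] / total


# Hypothetical outcomes of five WiBi-vs-other-system comparisons.
judgments = ["more", "more", "equal", "less", "more"]
more_ratio, less_ratio = specificity_ratios(judgments)
print(more_ratio, less_ratio)
```

With these toy labels WiBi is more specific in three of five comparisons and less specific in one of five.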

Page 54: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Category taxonomy specificity

Page 55: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Measuring granularity

pages categories

Page 56: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Conclusions

● Unified, 3-phase approach to the construction of a bitaxonomy for the English Wikipedia;
● Self-contained, no additional resources or supervision required;
● Nearly full coverage of Wikipedia pages and categories;
● State-of-the-art performance both on pages and categories.

wibitaxonomy.org

Page 57: 2 Is Bigger (and  Better) Than  1:  the Wikipedia Bitaxonomy  Project

Tiziano Flati, Daniele Vannella, Tommaso Pasini, Roberto Navigli

Linguistic Computing Laboratory

lcl.uniroma1.it