35
Wei Shen , Jianyong Wang , Ping Luo , Min Wang Tsinghua University, Beijing, China HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao Zhou July 17, 2012 10/25/22 1

Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Embed Size (px)

Citation preview

Page 1: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Wei Shen†, Jianyong Wang†, Ping Luo‡, Min Wang‡

†Tsinghua University, Beijing, China‡HP Labs China, Beijing, China

WWW 2012

Presented by Tom Chao ZhouJuly 17, 2012

04/18/23 1

Page 2: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

2/3404/18/23

Page 3: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

3/3404/18/23

Page 4: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

MotivationMany large scale knowledge bases have emerged

DBpedia, YAGO, Freebase, and etc.

4/3404/18/23

www.freebase.com

Page 5: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

MotivationMany large scale knowledge bases have emerged

DBpedia, YAGO, Freebase, and etc. As world evolves

New facts come into existence Digitally expressed on the Web

Maintaining and growing the existing knowledge basesIntegrating the extracted facts with knowledge

baseChallenge

Name variations “National Basketball Association” “NBA” “New York City” “Big Apple”Entity ambiguity

“Michael Jordan” … …

NBA player

Berkeley professor

5/3404/18/23

Page 6: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

6/3404/18/23

Page 7: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Problem DefinitionEntity linking task

Input: A textual named entity mention m, already

recognized in the unstructured textOutput:

The corresponding real world entity e in the knowledge base

If the matching entity e for entity mention m does not exist in the knowledge base, we should return NIL for m

7/3404/18/23

Page 8: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Entity linking task

Source: From Information to Knowledge:Harvesting Entities and Relationships from Web Sources. PODS’10.

German Chancellor Angela Merkel and her husband Joachim Sauer went to Ulm, Germany.

NIL

Figure 1: An example of YAGO

8/3404/18/23

Page 9: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

9/3404/18/23

Page 10: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Previous MethodsEssential step of entity linking

Define a similarity measure between the text around the entity mention and the document associated with the entity

Bag of words modelRepresent the context as a term vectorMeasure the co-occurrence statistics of termsCannot capture the semantic knowledge

Example:Text: Michael Jordan wins NBA champion.

The bag of words model cannot work well!

10/3404/18/23

Page 11: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

11/3404/18/23

Page 12: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

LINDEN FrameworkCandidate Entity Generation

For each named entity mention m Retrieve the set of candidate entities Em

Named Entity DisambiguationFor each candidate entity e∈Em

Define a scoring measureGive a rank to Em

Unlinkable Mention PredictionFor each etop which has the highest score in Em

Validate whether the entity etop is the target entity for mention m

12/3404/18/23

Page 13: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Candidate Entity GenerationIntuitively, the candidates in Em should have

the name of the surface form of m.We build a dictionary that contains vast amount

of information about the surface forms of entitiesLike name variations, abbreviations, confusable

names, spelling variations, nicknames, etc.Leverage the four structures of Wikipedia

Entity pages Redirect pages Disambiguation pages Hyperlinks in Wikipedia articles

13/3404/18/23

Page 14: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Candidate Entity Generation (Cont’)

For each mention mSearch it in the field of surface formsIf a hit is found, we add all target entities of

that surface form m to the set of candidate entities Em

Table 1: An example of the dictionary

14/3404/18/23

Page 15: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Named Entity DisambiguationGoal:

Give a rank to candidate entities according to their scores

Define four featuresFeature 1: Link probability

Based on the count information in the dictionarySemantic network based features

Feature 2: Semantic associativity Based on the Wikipedia hyperlink structure

Feature 3: Semantic similarity Derived from the taxonomy of YAGO

Feature 4: Global coherence Global document-level topical coherence among

entities 15/3404/18/23

Page 16: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Link Probability

Feature 1: link probability LP(e|m) for candidate entity e

where countm(e) is the number of links which point to entity e and have the surface form m

Table 1: An example of the dictionary

0.81

0.05

LP

16/3404/18/23

Page 17: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Semantic Network Construction Recognize all the Wikipedia concepts Γd in the document d

The open source toolkit Wikipedia-Miner1

Example: The Chicago Bulls’ player Michael Jordan won his first NBA championship in

1991. Set of entity mentions: {Michael Jordan, NBA} Candidate entities:

Michael Jordan {Michael J. Jordan, Michael I. Jordan} NBA {National Basketball Association, Nepal Basketball Association}

Γd : {NBA All-Star Game, David Joel Stern, Charlotte Bobcats, Chicago Bulls} Hyperlink structure of Wikipedia articles Taxonomy of concepts in YAGO

1http://wikipedia-miner.sourceforge.net/index.htm

Figure 2: An example of the constructed semantic network

17/3404/18/23

Page 18: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Semantic AssociativityFeature 2: semantic associativity SA(e) for

each candidate entity e

Figure 2: An example of the constructed semantic network

18/3404/18/23

Page 19: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Semantic Associativity (Cont’)Given two Wikipedia concepts e1 and e2

Wikipedia Link-based Measure (WLM) [1]Semantic associativity between them

where E1 and E2 are the sets of Wikipedia concepts that hyperlink to e1 and e2 respectively, and W is the set of all concepts in Wikipedia

[1] D. Milne and I. H. Witten. An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proceedings of WIKIAI, 2008.

19/3404/18/23

Page 20: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Semantic SimilarityFeature 3: semantic similarity SS(e) for each

candidate entity e

where Θk is the set of k context concepts in Γd which have the highest semantic similarity with entity e

Figure 2: An example of the constructed semantic network

k=2

20/3404/18/23

Page 21: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Semantic Similarity (Cont’)Given two Wikipedia concepts e1 and e2

Assume the sets of their super classes are Φe1 and Φe2 For each class C1 in the set Φe1

Assign a target class ε(C1) in another set Φe2 as

Where sim(C1, C2) is the semantic similarity between two classes C1 and C2

To compute sim(C1, C2) Adopt the information-theoretic approach introduced in

[2]

Where C0 is the lowest common ancestor node for class nodes C1 and C2 in the hierarchy, P(C) is the probability that a randomly selected object belongs to the subtree with the root of C in the taxonomy.

[2] D. Lin. An information-theoretic definition of similarity. In Proceedings of ICML, pages 296–304, 1998. 21/3404/18/23

Page 22: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Semantic Similarity (Cont’)Calculate the semantic similarity from one

set of classes Φe1 to another set of classes Φe2

Define the semantic similarity between Wikipedia concepts e1 and e2

22/3404/18/23

Page 23: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Global CoherenceFeature 4: global coherence GC(e) for each

candidate entity eMeasured as the average semantic associativity of

candidate entity e to the mapping entities of the other mentions

where em’ is the mapping entity of mention m’Substitute the most likely assigned entity for the

mapping entity in Formula 9

The most likely assigned entity e’m’ for mention m is defined as the candidate entity which has the maximum link probability in Em

23/3404/18/23

Page 24: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Global Coherence (Cont’)

Figure 2: An example of the constructed semantic network

24/3404/18/23

Page 25: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Candidates RankingTo generate a feature vector Fm(e) for each e ∈ Em

To calculate Scorem(e) for each candidate e

where is the weight vector which gives different weights for each feature element in Fm(e)

Rank the candidates and pick the top candidate as the predicted mapping entity for mention m

To learn , we use a max-margin technique based on the training data setAssume Scorem(e∗) is larger than any other Scorem(e)

with a margin

We minimize over ξm ≥ 0 and the objective25/3404/18/23

Page 26: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Unlinkable Mention PredictionPredict mention m as an unlinkable mention

If the size of Em generated in the Candidate Entities Generation module is equal to zero

If Scorem(etop) is smaller than the learned threshold τ

26/3404/18/23

Page 27: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

27/3404/18/23

Page 28: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Experiment SetupData sets

CZ data set: newswire data used by Cucerzan [3]

TAC-KBP2009 data set: used in the track of Knowledge Base Population (KBP) at the Text Analysis Conference (TAC) 2009

Parameters learning:10-fold cross validation

[3] S. Cucerzan. Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In Proceedings of EMNLP-CoNLL, pages 708–716, 2007.

28/3404/18/23

Page 29: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Results over the CZ data set

29/3404/18/23

Page 30: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Results over the CZ data set

30/3404/18/23

Page 31: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Results on the TAC-KBP2009 data set

31/3404/18/23

Page 32: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Results on the TAC-KBP2009 data set

32/3404/18/23

Page 33: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

OutlineMotivationProblem DefinitionPrevious MethodsLINDEN FrameworkExperimentsConclusion

33/3404/18/23

Page 34: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

ConclusionLINDEN

A novel framework to link named entities in text with YAGO

Leveraging the rich semantic knowledge derived from the Wikipedia and the taxonomy of YAGO

Significantly outperforms the state-of-the-art methods in terms of accuracy

34/3404/18/23

Page 35: Wei Shen †, Jianyong Wang †, Ping Luo ‡, Min Wang ‡ † Tsinghua University, Beijing, China ‡ HP Labs China, Beijing, China WWW 2012 Presented by Tom Chao

Thanks!Q&A

35/3404/18/23