Learning to Link with Wikipedia
by Milne and Witten
Exploits Wikipedia, the largest known knowledge base, to enrich unstructured text with links to Wikipedia articles.

Implications:
• provides structured knowledge
• tasks like indexing, clustering, and retrieval can use this technique instead of bag-of-words
Wikipedia
• largest, most visited encyclopaedia
• densely structured
• millions of articles with hundreds of millions of links
• serendipitous encounters
• "small world" - any article is just 4.5 links away from any other

How can this be extended to ALL DOCUMENTS?
Enter Wikification
Related work
Wikify - Mihalcea and Csomai
• Detection - based on link probabilities
• Disambiguation - extracts features from the phrase and its surrounding words, and compares them against training samples from Wikipedia
• Topic detection
Machine learning approach to disambiguation
• uses the links in Wikipedia itself for training
• 500 training articles, 50K links, 1.8M instances
• Commonness (prior probability of a sense given the anchor text)
• Relatedness (link-based measure)
  o use unambiguous links as context to disambiguate ambiguous terms
  o choose the sense article that has the most in common with all the context articles
  o R(a,b) = (log(max(|A|,|B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|,|B|))), where A and B are the sets of articles that link to a and b, and W is the set of all Wikipedia articles
  o R(candidate) is the weighted average of its relatedness to each context article
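The relatedness formula above can be sketched directly in Python. This is a minimal illustration, not the paper's implementation; the inputs are assumed to be precomputed sets of inlink article ids. Note the measure is a distance: identical inlink sets score 0, and no shared inlinks gives an unbounded value.

```python
import math

def relatedness(links_to_a, links_to_b, total_articles):
    """Milne-Witten link-based relatedness (distance form).

    links_to_a / links_to_b: sets of article ids linking to a and b.
    total_articles: |W|, the total number of Wikipedia articles.
    0 means identical inlink sets; larger means less related.
    """
    a, b = len(links_to_a), len(links_to_b)
    shared = len(links_to_a & links_to_b)
    if shared == 0:
        return float("inf")  # no shared inlinks: maximally unrelated
    return (math.log(max(a, b)) - math.log(shared)) / \
           (math.log(total_articles) - math.log(min(a, b)))
```

In practice the value is clipped or inverted when a similarity in [0, 1] is needed.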
Weighing the context terms
1. Not all context terms are equally useful - use link probability to weigh the context terms.
2. Many context terms are outliers that do not relate to the central theme of the document - use the relatedness measure to compute the average semantic relatedness of each term to all other context terms.

These two values are averaged to give a weight for each context term, which is then used in computing R(candidate).
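The two-signal weighting above can be sketched as follows. The function and argument names are illustrative, not from the paper; `rel` stands in for any relatedness function returning a similarity in [0, 1].

```python
def weight_context_terms(terms, link_prob, rel):
    """Weight each context term by averaging two signals:
    1. the link probability of its anchor text, and
    2. its mean relatedness to every other context term.

    terms: list of context article ids
    link_prob: dict mapping id -> link probability
    rel: function (a, b) -> relatedness in [0, 1]
    """
    weights = {}
    for t in terms:
        others = [u for u in terms if u != t]
        avg_rel = sum(rel(t, u) for u in others) / len(others) if others else 0.0
        weights[t] = (link_prob[t] + avg_rel) / 2
    return weights
```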
Combining the features
Two features so far - the commonness of each sense and its relatedness to the surrounding context.

How good is the context? A third feature - context quality - is the sum of the weights assigned to the context terms.

These three features are used to train a classifier that distinguishes valid senses from invalid ones.
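Putting the pieces together, the three features for a candidate sense might be computed like this. This is a hedged sketch: the names and exact bookkeeping are assumptions, but the quantities match the description above (commonness, weighted-average relatedness, and context quality as the sum of weights).

```python
def sense_features(candidate, commonness, context, weights, rel):
    """Feature vector for one candidate sense.

    candidate: candidate sense article id
    commonness: prior probability of this sense for the anchor text
    context: list of context article ids
    weights: dict mapping context id -> weight (see weighting step)
    rel: function (a, b) -> relatedness in [0, 1]
    """
    total_w = sum(weights[c] for c in context)
    # relatedness of the candidate: weighted average over context articles
    rel_score = sum(weights[c] * rel(candidate, c) for c in context) / total_w
    return {
        "commonness": commonness,
        "relatedness": rel_score,
        "context_quality": total_w,  # sum of context-term weights
    }
```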
Configuring the classifier
• the probability of senses follows a power law, so unlikely senses can be safely ignored by choosing a threshold parameter
• this improves performance and precision, but decreases recall
• 2% is used as the probability threshold
• C4.5 was chosen after trying several classifiers
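The pruning step is a simple filter on the prior. A minimal sketch, assuming senses come as a mapping from sense article to commonness:

```python
def prune_senses(senses, threshold=0.02):
    """Discard senses whose commonness (prior probability) falls below
    the threshold; the paper's 2% default is used here."""
    return {s: p for s, p in senses.items() if p >= threshold}
```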
Learning to detect links
Wikify uses link probabilities with no consideration of context, which is error prone. Instead:
• gather all n-grams and retain those whose link probability exceeds a low threshold (defined later)
• the remaining phrases are disambiguated
• the automatically identified Wikipedia articles provide training instances for a classifier
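The n-gram gathering step can be sketched as below. The function name and maximum n-gram length are assumptions; the 6.5% default matches the minimum link probability mentioned in the training section.

```python
def candidate_phrases(tokens, link_prob, max_n=5, threshold=0.065):
    """Return (start, end, phrase) for every n-gram whose link
    probability exceeds the threshold.

    tokens: tokenised document text
    link_prob: dict mapping phrase -> fraction of its Wikipedia
               occurrences that are links
    """
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if link_prob.get(phrase, 0.0) >= threshold:
                candidates.append((i, i + n, phrase))
    return candidates
```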
Features used by the links classifier
• Link probability - each training instance involves several candidate link locations, giving multiple link probabilities; these are combined into an average and a maximum (more indicative of links)
• Relatedness - topics that relate to the central thread of the document are more likely to be linked
• Disambiguation confidence - average and maximum disambiguation probability
• Generality - links to specific topics are more useful than links to general ones
• Location and spread - frequency, first occurrence, last occurrence
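A sketch of how the per-topic statistics above might be assembled into a feature vector. The field names and the normalisation of positions by document length are illustrative assumptions, not the paper's exact definitions.

```python
def link_features(occurrences, doc_len):
    """Aggregate features for one candidate topic.

    occurrences: list of (position, link_prob, disambig_conf),
                 one entry per mention of the topic in the document
    doc_len: document length in tokens, for normalising positions
    """
    positions = [pos for pos, _, _ in occurrences]
    probs = [p for _, p, _ in occurrences]
    confs = [c for _, _, c in occurrences]
    return {
        "avg_link_prob": sum(probs) / len(probs),
        "max_link_prob": max(probs),
        "avg_disambig_conf": sum(confs) / len(confs),
        "max_disambig_conf": max(confs),
        "frequency": len(occurrences),
        "first_occurrence": min(positions) / doc_len,
        "last_occurrence": max(positions) / doc_len,
        "spread": (max(positions) - min(positions)) / doc_len,
    }
```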
Training the links classifier
• same 500 articles used to train disambiguation classifier are used for training
• Threshold to discard nonsense phrases and stop words is set to 6.5% (min link probability)
• the disambiguation classifier performs poorly here because it was trained on links but applied to raw text - resolved by modifying the trainer to account for these other unambiguous terms
• Several classifiers were evaluated and bagged C4.5 was chosen
Evaluation
• trained in 37 mins and tested in 8 mins