Learning to Link with Wikipedia
by Milne and Witten
Exploits Wikipedia, the largest known knowledge base, to enrich unstructured text with links to Wikipedia articles.

Implications:
• provides structured knowledge
• tasks like indexing, clustering, and retrieval can use this technique instead of bag-of-words
Wikipedia
• largest, most visited encyclopaedia
• densely structured
• millions of articles with hundreds of millions of links
• serendipitous encounters
• "small world" - any article is just 4.5 links away from any other

How can this be extended to ALL DOCUMENTS?
Enter Wikification
Related work
Wikify - Mihalcea and Csomai
• Detection - based on link probabilities
• Disambiguation - extracts features from the phrase and its surrounding words, and compares them against training samples from Wikipedia
• Topic detection
Machine learning approach to disambiguation
• uses the links in Wikipedia itself for training
• 500 training articles, 50K links, 1.8M instances
• Commonness (prior probability of a sense given the anchor text)
• Relatedness (link-based measure)
  o use unambiguous links as context to disambiguate ambiguous terms
  o choose the sense article that has the most in common with all the context articles
  o R(a,b) = (log(max(|A|,|B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|,|B|))), where A and B are the sets of articles that link to a and b, and W is the set of all Wikipedia articles
  o R(candidate) is the weighted average of its relatedness to each context article
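The relatedness formula above can be sketched directly in Python. This is a minimal illustration, not the paper's implementation; the inputs are assumed to be precomputed sets of inlink article ids. Note the measure is a distance: identical inlink sets score 0, and no shared inlinks gives an unbounded value.

```python
import math

def relatedness(links_to_a, links_to_b, total_articles):
    """Milne-Witten link-based relatedness (distance form).

    links_to_a / links_to_b: sets of article ids linking to a and b.
    total_articles: |W|, the total number of Wikipedia articles.
    0 means identical inlink sets; larger means less related.
    """
    a, b = len(links_to_a), len(links_to_b)
    shared = len(links_to_a & links_to_b)
    if shared == 0:
        return float("inf")  # no shared inlinks: maximally unrelated
    return (math.log(max(a, b)) - math.log(shared)) / \
           (math.log(total_articles) - math.log(min(a, b)))
```

In practice the value is clipped or inverted when a similarity in [0, 1] is needed.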
Weighing the context terms
1. Not all context terms are equally useful - use link probability to weigh the context terms.
2. Many context terms are outliers that do not relate to the central theme of the document - use the relatedness measure to compute the average semantic relatedness of each term to all other context terms.

These two values are averaged to give a weight for each context term, which is then used in computing R(candidate).
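The two-signal weighting above can be sketched as follows. The function and argument names are illustrative, not from the paper; `rel` stands in for any relatedness function returning a similarity in [0, 1].

```python
def weight_context_terms(terms, link_prob, rel):
    """Weight each context term by averaging two signals:
    1. the link probability of its anchor text, and
    2. its mean relatedness to every other context term.

    terms: list of context article ids
    link_prob: dict mapping id -> link probability
    rel: function (a, b) -> relatedness in [0, 1]
    """
    weights = {}
    for t in terms:
        others = [u for u in terms if u != t]
        avg_rel = sum(rel(t, u) for u in others) / len(others) if others else 0.0
        weights[t] = (link_prob[t] + avg_rel) / 2
    return weights
```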
Combining the features
Two features so far - the commonness of each sense and its relatedness to the surrounding context.

How good is the context? A third feature - context quality - is the sum of the weights assigned to the context terms.

These three features are used to train a classifier that distinguishes valid senses from invalid ones.
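Putting the pieces together, the three features for a candidate sense might be computed like this. This is a hedged sketch: the names and exact bookkeeping are assumptions, but the quantities match the description above (commonness, weighted-average relatedness, and context quality as the sum of weights).

```python
def sense_features(candidate, commonness, context, weights, rel):
    """Feature vector for one candidate sense.

    candidate: candidate sense article id
    commonness: prior probability of this sense for the anchor text
    context: list of context article ids
    weights: dict mapping context id -> weight (see weighting step)
    rel: function (a, b) -> relatedness in [0, 1]
    """
    total_w = sum(weights[c] for c in context)
    # relatedness of the candidate: weighted average over context articles
    rel_score = sum(weights[c] * rel(candidate, c) for c in context) / total_w
    return {
        "commonness": commonness,
        "relatedness": rel_score,
        "context_quality": total_w,  # sum of context-term weights
    }
```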
Configuring the classifier
• the probability of senses follows a power law, so unlikely senses can be safely ignored by choosing a threshold parameter
• this improves performance and precision, but decreases recall
• 2% is used as the probability threshold
• C4.5 was chosen after trying several classifiers
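The pruning step is a simple filter on the prior. A minimal sketch, assuming senses come as a mapping from sense article to commonness:

```python
def prune_senses(senses, threshold=0.02):
    """Discard senses whose commonness (prior probability) falls below
    the threshold; the paper's 2% default is used here."""
    return {s: p for s, p in senses.items() if p >= threshold}
```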
Learning to detect links
Wikify uses link probabilities with no consideration of context, which is error prone. Instead:
• gather all n-grams and retain those whose link probability exceeds a low threshold (defined later)
• the remaining phrases are disambiguated
• the automatically identified Wikipedia articles provide training instances for a classifier
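The n-gram gathering step can be sketched as below. The function name and maximum n-gram length are assumptions; the 6.5% default matches the minimum link probability mentioned in the training section.

```python
def candidate_phrases(tokens, link_prob, max_n=5, threshold=0.065):
    """Return (start, end, phrase) for every n-gram whose link
    probability exceeds the threshold.

    tokens: tokenised document text
    link_prob: dict mapping phrase -> fraction of its Wikipedia
               occurrences that are links
    """
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if link_prob.get(phrase, 0.0) >= threshold:
                candidates.append((i, i + n, phrase))
    return candidates
```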
Features used by the links classifier
• Link probability - each training instance involves several candidate link locations, giving multiple link probabilities; these are combined into an average and a maximum (more indicative of links)
• Relatedness - topics that relate to the central thread of the document are more likely to be linked
• Disambiguation confidence - average and maximum disambiguation probability
• Generality - links to specific topics are more useful than links to general ones
• Location and spread - frequency, first occurrence, last occurrence
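A sketch of how the per-topic statistics above might be assembled into a feature vector. The field names and the normalisation of positions by document length are illustrative assumptions, not the paper's exact definitions.

```python
def link_features(occurrences, doc_len):
    """Aggregate features for one candidate topic.

    occurrences: list of (position, link_prob, disambig_conf),
                 one entry per mention of the topic in the document
    doc_len: document length in tokens, for normalising positions
    """
    positions = [pos for pos, _, _ in occurrences]
    probs = [p for _, p, _ in occurrences]
    confs = [c for _, _, c in occurrences]
    return {
        "avg_link_prob": sum(probs) / len(probs),
        "max_link_prob": max(probs),
        "avg_disambig_conf": sum(confs) / len(confs),
        "max_disambig_conf": max(confs),
        "frequency": len(occurrences),
        "first_occurrence": min(positions) / doc_len,
        "last_occurrence": max(positions) / doc_len,
        "spread": (max(positions) - min(positions)) / doc_len,
    }
```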
Training the links classifier
• same 500 articles used to train disambiguation classifier are used for training
• Threshold to discard nonsense phrases and stop words is set to 6.5% (min link probability)
• the disambiguation classifier performs poorly here because it was trained on links but applied to raw text - resolved by modifying the trainer to account for these other unambiguous terms
• Several classifiers were evaluated and bagged C4.5 was chosen
Evaluation
• trained in 37 mins and tested in 8 mins