Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina Santamaria Julio Gonzalo Javier Artiles nlp.uned.es UNED,c/Juan del Rosal, 16, 28040 Madrid, Spain [email protected] [email protected] [email protected] ACL 2010


Wikipedia as Sense Inventory to Improve Diversity in Web Search Results

Celina Santamaria, Julio Gonzalo, Javier Artiles
nlp.uned.es

UNED, c/ Juan del Rosal, 16, 28040 Madrid, Spain
[email protected] [email protected] [email protected]

ACL 2010


Introduction

• Word Sense Disambiguation (WSD)
  – Promoting diversity in the search results
  – Presenting the results as a set of clusters
  – Complementing search results with search suggestions
• Two lexical resources
  – Wikipedia
  – WordNet 3.0


Introduction

• Problems
  – Coverage
  – Estimating search result diversity using our senses
  – Sense frequencies
  – Classification


Test Set

• Words that are likely to be used as one-word queries
• Words that denote one or more named entities
• 40 nouns
  – 15 nouns from the Senseval-3 lexical sample dataset
  – 25 nouns that satisfy two conditions
    1. They are ambiguous
    2. They are all names of music bands in one of their senses


Test Set

• Average of 22 senses per noun in Wikipedia
• Average of 4.5 senses per noun in WordNet
• Wikipedia has larger coverage
• Retrieve 150 documents for each noun (Google)
• Annotate each document with the senses of each of the two dictionaries


Coverage of Web Search Results

• If we focus on the top ten results, in the band subset Wikipedia covers 68% of the top ten documents
• Among the top ten results that are not covered by Wikipedia
  – most of the missing senses are names of companies (45%) and products or services (26%)
  – the other frequent type (12%) of non-annotated document is disambiguation pages


Coverage of Web Search Results

• Wikipedia seems to extend the coverage of WordNet rather than providing complementary sense information

• If we want to extend the coverage of Wikipedia, the best strategy seems to be to consider lists of companies, products and services


Diversity in Google Search Results

• Use Wikipedia senses to test how well search results respect diversity in terms of this subset of senses

• 63% of the pages in search results belong to the most frequent sense of the query word

• Diversity may not play a major role in the current Google ranking algorithm


Sense Frequency Estimators for Wikipedia

• Frequency information is crucial in a lexicon
• But Wikipedia does not provide the relative importance of senses for a given word
• Attempt to use two estimators of the expected sense distribution
  – Incoming links for the sense page
  – The number of visits to the sense page (May, June and July 2009, http://stats.grok.se/)
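Either estimator can be turned into an expected sense distribution by normalizing the raw counts per sense. The following is a minimal sketch of that idea, not the authors' exact procedure; the function name, the sense labels and the counts are all hypothetical.

```python
def estimate_sense_distribution(counts):
    """Normalize raw per-sense counts (inlinks or visits) into probabilities."""
    total = sum(counts.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution (assumption).
        return {sense: 1.0 / len(counts) for sense in counts}
    return {sense: value / total for sense, value in counts.items()}


# Hypothetical counts for three Wikipedia senses of one noun.
inlinks = {"sense_a": 600, "sense_b": 300, "sense_c": 100}
visits = {"sense_a": 5000, "sense_b": 4000, "sense_c": 1000}

by_inlinks = estimate_sense_distribution(inlinks)
by_visits = estimate_sense_distribution(visits)
```

The two estimators can disagree on how dominant a sense is (here 0.6 versus 0.5 for `sense_a`), which is why the slides treat them as separate estimators.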


Association of Wikipedia Senses to Web Pages

• Test whether the information can be used to classify search results accurately
• Approaches that require manually created training data are not considered
• Given a web page p and the set of senses w1, …, wn listed in Wikipedia
• Approaches
  1. Vector Space Model (VSM)
  2. Word Sense Disambiguation (WSD) system
  3. Random
  4. Assign the most frequent sense to all documents


VSM

• Represent the page in a vector space model (tf*idf weights)

• VSM: compute idf over the collection of retrieved documents

• VSM-GT: use the statistics provided by the Google Terabyte collection

• VSM-mix: combine statistics from the collection and from the Google Terabyte collection

• VSM-GT+freq
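
The VSM idea above can be sketched as follows: the page and the text of each candidate sense are turned into tf*idf vectors, and the page is assigned the sense with the highest cosine similarity. This is an illustrative sketch only. The tokenization, the smoothed idf formula, the toy word lists and all function names are assumptions, and the idf here is computed from the toy texts themselves rather than from the retrieved collection or the Google Terabyte statistics.

```python
import math
from collections import Counter


def compute_idf(token_lists):
    """Smoothed idf over a small collection (the smoothing is an assumption)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Sparse tf*idf vector as a dict; unknown terms get weight 0."""
    return {t: count * idf.get(t, 0.0) for t, count in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_page(page_tokens, sense_tokens, idf):
    """Return the sense whose text is most similar to the page, with its score."""
    page_vec = tfidf_vector(page_tokens, idf)
    scores = {
        sense: cosine(page_vec, tfidf_vector(tokens, idf))
        for sense, tokens in sense_tokens.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical texts for two senses of one ambiguous noun.
sense_tokens = {
    "band": ["band", "music", "album", "rock"],
    "desert": ["desert", "water", "palm", "sand"],
}
page_tokens = ["rock", "band", "album", "tour"]

idf = compute_idf(list(sense_tokens.values()) + [page_tokens])
best_sense, best_score = classify_page(page_tokens, sense_tokens, idf)
```

Swapping the source of the idf statistics is the only change needed to move between the VSM, VSM-GT and VSM-mix variants in a sketch like this one.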


WSD system

• Extract learning examples from Wikipedia automatically

• Disambiguate all occurrences of word w in the page p

• TiMBL-core : use only the examples found in the Wikipedia page

• TiMBL-inlinks : use the examples found in Wikipedia pages pointing to the page

• TiMBL-all : use both sources of examples

• TiMBL-core+freq


Classification Results

• VSM is a simpler and more efficient approach
• The results may indicate that using frequency estimations is only helpful up to a certain precision ceiling


Precision/Coverage Trade-off

• All systems assign a sense for every document in the test collection

• It is possible to enhance search results diversity without annotating every document

• Set a threshold (from 0.00 to 0.90)
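
The trade-off can be illustrated with a small sketch: documents whose classifier score falls below the threshold are left unannotated, which lowers coverage and may raise precision. The slide does not say what the threshold is applied to, so treating it as a cut-off on the classifier's similarity score is an assumption, as are the function name and the toy data.

```python
def precision_and_coverage(predictions, gold, threshold):
    """predictions: list of (doc_id, sense, score); gold: doc_id -> correct sense.

    Returns (precision over annotated documents, fraction of documents annotated).
    """
    if not predictions:
        return 0.0, 0.0
    kept = [(doc_id, sense) for doc_id, sense, score in predictions if score >= threshold]
    coverage = len(kept) / len(predictions)
    if not kept:
        # Precision is undefined when nothing is annotated; report 0.0 here.
        return 0.0, coverage
    correct = sum(1 for doc_id, sense in kept if gold.get(doc_id) == sense)
    return correct / len(kept), coverage


# Hypothetical classifier output and gold annotations.
predictions = [("d1", "a", 0.9), ("d2", "b", 0.5), ("d3", "a", 0.2), ("d4", "b", 0.1)]
gold = {"d1": "a", "d2": "b", "d3": "b", "d4": "b"}

for threshold in (0.0, 0.4):
    precision, coverage = precision_and_coverage(predictions, gold, threshold)
    print(f"threshold={threshold:.2f} precision={precision:.2f} coverage={coverage:.2f}")
```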


Using Classification to Promote Diversity

• Use our best classifier (VSM-GT+freq)
• Make a list of the top ten documents
  – Maximize the number of senses
  – Maximize the similarity scores of the documents to their assigned senses
• Algorithm
  1. Fill each position in the rank with the highest-similarity document whose sense is not yet represented in the rank
  2. Once all senses are represented, start choosing a second representative for each sense
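
The two steps above can be sketched as a round-based greedy selection: each round takes the best remaining document of every sense, ordered by score, until the list is full. This is one reading of the slide rather than the authors' code; the function name, the tie-breaking and the toy input are assumptions.

```python
def diversify(classified, k=10):
    """classified: list of (doc_id, sense, score). Returns up to k doc_ids."""
    # Group documents by assigned sense, best score first.
    by_sense = {}
    for doc_id, sense, score in classified:
        by_sense.setdefault(sense, []).append((score, doc_id))
    for docs in by_sense.values():
        docs.sort(reverse=True)

    ranking = []
    while len(ranking) < k and any(by_sense.values()):
        # One round: the best remaining document of each sense,
        # visiting senses in decreasing order of that document's score.
        heads = sorted(
            ((docs[0][0], sense) for sense, docs in by_sense.items() if docs),
            reverse=True,
        )
        for _, sense in heads:
            if len(ranking) == k:
                break
            ranking.append(by_sense[sense].pop(0)[1])
    return ranking


# Hypothetical classifier output: sense "a" dominates the original results.
classified = [
    ("d1", "a", 0.9), ("d2", "a", 0.8), ("d3", "b", 0.7),
    ("d4", "c", 0.3), ("d5", "a", 0.6),
]
top4 = diversify(classified, k=4)
```

In the toy input, the first round places one document per sense (`d1`, `d3`, `d4`) before a second representative of sense `a` (`d2`) is admitted.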


Using Classification to Promote Diversity

• Other approaches
  – Clustering (centroids)
  – Clustering (top ranked)
  – Random
  – Upper bound


Using Classification to Promote Diversity

• coverage = (number of senses in the top ten results) / (number of senses in all search results)
• Using Wikipedia to enhance diversity seems to work much better than clustering
• Note: our evaluation is biased towards Wikipedia, because only Wikipedia senses are considered when estimating diversity
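
The coverage measure defined above is simple to compute from a ranked list of sense labels. A minimal sketch, assuming one sense label per search result in rank order and `None` for results that received no sense (both the input format and the function name are assumptions):

```python
def sense_coverage(ranked_senses, k=10):
    """Senses present in the top k results divided by senses present in all results."""
    all_senses = {s for s in ranked_senses if s is not None}
    if not all_senses:
        return 0.0
    top_senses = {s for s in ranked_senses[:k] if s is not None}
    return len(top_senses) / len(all_senses)


# Hypothetical ranking: 12 results, 5 distinct senses, 3 of them in the top ten.
ranked = ["a", "a", "b", "a", "c", "a", None, "b", "a", "a", "d", "e"]
coverage = sense_coverage(ranked, k=10)
```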


Conclusion

• Wikipedia has much better coverage than WordNet
• The distribution of senses can be estimated
• Search result diversity for one-word queries can be improved with a simple and efficient algorithm
• Our results do not imply that the Wikipedia-modified rank is better than the original Google rank