Stopwords in Search

face of the stopwords

The February 2014 Monthly

Tomek Sobczak

what are stop words?

common wisdom

• they are everywhere and bloat the index

• remove them to increase performance (smaller index and query) and relevance of search results

• … but sometimes stop words add a little meaning to a sentence

• … and sometimes you need them - To be or not to be

• the best of both worlds? multiple mappings of the data: one with stop words removed and one with stop words kept, but this doubles the data by indexing it in two different ways!

• Common Terms Query analyzes the query and identifies which words are “important” based on the document frequency of each term

• Common Terms Query leverages the power of stop word removal (faster searches) without eliminating stop words (they can sometimes contribute to the score)

• Common Terms Query adapts to your domain: words with high frequency will automatically be recognized as stop words
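
The idea behind the three Common Terms Query bullets above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Elasticsearch's implementation: the function names, the whitespace tokenizer, the `cutoff` value, the scoring (a simple count of matching query terms) and the sample titles in `DOCS` are all made up for the example. Low-frequency terms decide which documents match; high-frequency terms only add to the score; if the query has only high-frequency terms, the sketch requires all of them.

```python
from collections import Counter

# Hypothetical sample titles, used only for illustration.
DOCS = [
    "the pearl",
    "the who live at leeds",
    "the history of the pearl trade",
    "the art of the fugue",
]


def document_frequencies(docs):
    """Fraction of documents containing each term (naive whitespace tokenizer)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}


def split_query(query, doc_freq, cutoff):
    """Split query terms into (low_freq, high_freq) lists using the cutoff."""
    low, high = [], []
    for term in query.lower().split():
        if doc_freq.get(term, 0.0) > cutoff:
            high.append(term)  # behaves like a stop word in this collection
        else:
            low.append(term)   # "important" term
    return low, high


def common_terms_search(query, docs, cutoff=0.75):
    """Return (doc_index, score) pairs, best score first."""
    low, high = split_query(query, document_frequencies(docs), cutoff)
    query_terms = set(low) | set(high)
    results = []
    for i, doc in enumerate(docs):
        words = set(doc.lower().split())
        if low:
            # Important terms select the candidate documents.
            matched = any(t in words for t in low)
        else:
            # Only common terms in the query: require all of them.
            matched = all(t in words for t in high)
        if matched:
            # Common terms are not dropped; they still count toward the score.
            results.append((i, len(query_terms & words)))
    return sorted(results, key=lambda r: (-r[1], r[0]))
```

With these sample titles, "the" appears in every document and is treated as common, while "pearl" is treated as important, so `common_terms_search("the pearl", DOCS)` only considers the two titles that contain "pearl". No stop word list is configured anywhere; the split comes from the document frequencies alone.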

restoring stop words

possibility of improving

• searches composed only of stopwords (improved recall)
  • to be or not to be
  • The Who

• short searches that include stopwords (improved precision)
  • pearl vs. the pearl
  • the one
  • a zukofsky (author Zukofsky, title "a")

• distinguish "in" from "and" in some cases
  • archaeology in literature != archaeology and literature

possibility of degrading

• long queries (over 6 terms) with a lot of stopwords have reduced precision
  • Lectures on the Calculus of Variations and Optimal Control Theory
  • BUT: the words occurring as a phrase float to the top
  • AND: you can modify the minimum match (mm) param
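
To make the minimum match point concrete, here is a minimal Python sketch of how a "minimum number of terms must match" rule filters results for the long query above. It is a simplification, not Solr's implementation: it only handles a positive integer or a percentage string rounded down, the function names are hypothetical, and `DOC_B` is a made-up title that shares only stopwords and one ordinary word with the query.

```python
QUERY = "lectures on the calculus of variations and optimal control theory"
DOC_A = "lectures on the calculus of variations and optimal control theory"
DOC_B = "the theory of supply and demand"  # hypothetical, mostly stopword overlap


def required_matches(num_terms, mm):
    """Number of query terms a document must contain.

    mm is a positive integer (e.g. 3) or a percentage string (e.g. "75%").
    Percentages are rounded down; the result is clamped to 1..num_terms.
    """
    if isinstance(mm, str) and mm.endswith("%"):
        needed = num_terms * int(mm[:-1]) // 100
    else:
        needed = int(mm)
    return max(1, min(needed, num_terms))


def matches(query, doc, mm):
    """True if the document contains at least the required number of query terms."""
    terms = query.lower().split()
    words = set(doc.lower().split())
    hits = sum(1 for t in terms if t in words)
    return hits >= required_matches(len(terms), mm)
```

In this sketch `DOC_B` shares four of the ten query terms ("the", "of", "and", "theory"), so it matches with `mm="30%"` but is filtered out with `mm="75%"`, while `DOC_A` matches either way. A stricter minimum match is one way to recover precision once stopwords are back in the query.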

how to decide?

• take a look at your business knowledge domain

• count the percentage of searches with stop words

• count the number of terms in user queries
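
The two counting steps above could be run over a query log along these lines. This is a rough sketch: the stop word list is a tiny illustrative one (use the list your analyzer actually applies), the tokenizer is a plain whitespace split, and the function name and the sample queries are assumptions for the example.

```python
# Small illustrative stop word list; not any analyzer's real default.
STOP_WORDS = frozenset(
    {"a", "an", "and", "be", "in", "not", "of", "or", "the", "to", "who"}
)


def query_stats(queries, stop_words=STOP_WORDS):
    """Summarize how often stop words occur in a list of user queries."""
    if not queries:
        raise ValueError("need at least one query")
    with_stop = only_stop = term_count = 0
    for query in queries:
        terms = query.lower().split()
        term_count += len(terms)
        flags = [t in stop_words for t in terms]
        if any(flags):
            with_stop += 1   # query contains at least one stop word
        if terms and all(flags):
            only_stop += 1   # query consists of stop words only
    total = len(queries)
    return {
        "pct_with_stop_words": 100 * with_stop / total,
        "pct_only_stop_words": 100 * only_stop / total,
        "avg_terms_per_query": term_count / total,
    }


SAMPLE_QUERIES = [
    "to be or not to be",
    "the who",
    "pearl",
    "archaeology in literature",
]
stats = query_stats(SAMPLE_QUERIES)
```

A high share of stopword-only or short stopword-bearing queries suggests that restoring stop words may help, while a high average query length points toward the long-query precision risk.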

Thank you!
