Algorithms for Search Engine

Boolean retrieval: In this technique, query terms are connected by the logical operators AND, OR, and NOT, and each document is treated as a collection of words. A query such as "Brutus and Caesar but not Calpurnia" is evaluated as Brutus ^ Caesar ^ (!Calpurnia). The model assumes a fixed collection of documents. Answering the query by scanning every document takes Θ(sum of the lengths of all the documents). To avoid the scan, a term-document matrix is built: the rows are the terms in alphabetical order, the columns are the documents, and the entry [term, document] is 1 if the term occurs in that document and 0 otherwise. The drawback is that this matrix is very sparse, so most of the space it consumes holds zeros.
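The term-document matrix can be illustrated with a minimal Python sketch. The four-document collection and the function name boolean_query are made up for illustration; only the query Brutus ^ Caesar ^ (!Calpurnia) comes from the text.

```python
# Toy collection: the document names and contents are hypothetical.
docs = {
    "doc1": "brutus caesar calpurnia",
    "doc2": "brutus caesar",
    "doc3": "caesar calpurnia",
    "doc4": "brutus caesar antony",
}
doc_ids = sorted(docs)
terms = sorted({word for text in docs.values() for word in text.split()})

# matrix[term][j] is 1 if the term occurs in doc_ids[j], else 0.
matrix = {t: [1 if t in docs[d].split() else 0 for d in doc_ids] for t in terms}

def boolean_query(include, exclude):
    """Return the documents that contain every term in `include` (AND)
    and none of the terms in `exclude` (NOT)."""
    hits = []
    for j, doc_id in enumerate(doc_ids):
        has_all = all(matrix[t][j] for t in include)
        has_excluded = any(matrix[t][j] for t in exclude)
        if has_all and not has_excluded:
            hits.append(doc_id)
    return hits

# Brutus ^ Caesar ^ (!Calpurnia)
print(boolean_query(["brutus", "caesar"], ["calpurnia"]))  # ['doc2', 'doc4']
```

Even in this tiny example most matrix entries for "antony" and "calpurnia" are 0, which is the sparsity problem described above.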

Inverted Index: A better way to store the matrix. The index is constructed before the system is ready to accept queries. First, set up the dictionary: the set of all distinct words in the document collection, arranged in increasing order, each stored with the frequency of that term in the collection. For each word, build a linked list of the IDs of the documents that contain it (its posting list), also in increasing order. The dictionary is kept in main memory so that a term can be found quickly. To build the index for a given collection, generate a (term, docid) pair for each term occurrence and sort the pairs, remove the duplicates (the same term in the same document), and then group the pairs by term to form the posting lists in increasing docid order. Sorting the N (term, docid) pairs takes Θ(N lg N), removing the duplicates takes Θ(N), and setting up the posting lists takes Θ(N), so the total time required is Θ(N lg N).
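The sort, deduplicate, and group steps can be sketched in Python as follows. This is a minimal sketch under assumptions: Python lists stand in for linked lists, the stored frequency is taken to be the number of documents containing the term, and the function names and the three sample documents are hypothetical.

```python
from itertools import groupby

def build_inverted_index(docs):
    """docs maps an integer doc ID to its text.
    Returns term -> (document frequency, posting list in increasing order)."""
    # Step 1: one (term, docid) pair per occurrence, sorted: Theta(N lg N).
    pairs = sorted(
        (term, doc_id)
        for doc_id, text in docs.items()
        for term in text.lower().split()
    )
    # Step 2: drop duplicate pairs (same term in the same document): Theta(N).
    unique_pairs = [pair for pair, _ in groupby(pairs)]
    # Step 3: group by term to form the posting lists: Theta(N).
    index = {}
    for term, group in groupby(unique_pairs, key=lambda pair: pair[0]):
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)
    return index

def intersect(p1, p2):
    """AND of two sorted posting lists by a linear merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

docs = {
    1: "Brutus killed Caesar",
    2: "Caesar praised Brutus Brutus",
    3: "Calpurnia warned Caesar",
}
index = build_inverted_index(docs)
print(index["brutus"])                                   # (2, [1, 2])
print(intersect(index["brutus"][1], index["caesar"][1])) # [1, 2]
```

Keeping the posting lists sorted is what lets the AND operation run as a single linear merge.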

Algorithms that deal with phrase queries

Biword Index: This index answers two-word phrase queries and can also be used for longer queries. For example, when the user runs the query “Stanford university palo alto”, the query is split into “Stanford university” ^ “university palo” ^ “palo alto”, and all the documents that contain every one of these biwords are retrieved. The problem is false positives: a document can contain each biword without containing the full phrase, so it is returned as a result although it does not really match the query. To reduce the false positives, stop words are neglected: in the query “renegotiation of the constitution”, the stop words “of the” are dropped before the query is run. The phrase index extends the biword index to queries of more than two words by storing the longer phrases, which occur very rarely, but it consumes a large amount of dictionary space.
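A minimal Python sketch of a biword index, including a false positive, is given here. The two sample documents and the function names are hypothetical; stop-word handling is left out.

```python
def build_biword_index(docs):
    """Map each pair of adjacent words to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        for biword in zip(words, words[1:]):
            index.setdefault(biword, set()).add(doc_id)
    return index

def biword_query(index, phrase):
    """AND together the posting sets of every biword in the phrase."""
    words = phrase.lower().split()
    result = None
    for biword in zip(words, words[1:]):
        postings = index.get(biword, set())
        result = set(postings) if result is None else result & postings
    return result if result is not None else set()

docs = {
    1: "stanford university palo alto",
    # Contains all three biwords, but never the full four-word phrase.
    2: "palo alto is near stanford university and university palo verde",
}
index = build_biword_index(docs)
print(biword_query(index, "Stanford university palo alto"))  # {1, 2}
```

Document 2 is returned even though the phrase never occurs in it, which is the false-positive problem described above.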

Positional Index: This index is the best way of answering phrase queries. Every individual term is stored in the dictionary together with its frequency and its posting list of documents, and for each document the positions at which the term occurs are recorded. To answer a query, each query term is looked up in the dictionary and the positions are compared: a document matches if the terms occur at adjacent positions in the order given by the query. The query returns all the documents that match in this way. A positional index can answer phrase queries of any length, but the index is very large and query processing is computationally more expensive. The running time is Θ(T), where T is the sum of the sizes of all the documents. With compression, the positional index can be reduced to about one third of the size of the document collection.
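The position check can be sketched in Python as below. This is an illustrative sketch only: the nested-dictionary layout, the function names, and the sample documents are assumptions, and the position test uses simple list membership rather than an optimized merge.

```python
def build_positional_index(docs):
    """Return term -> {doc_id: [positions of the term in that document]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

def phrase_query(index, phrase):
    """Return the documents in which the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Documents that contain every query term.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase starts at s if term number k is found at position s + k.
        for s in index[terms[0]][doc_id]:
            if all(s + k in index[t][doc_id] for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits

docs = {
    1: "stanford university palo alto",
    2: "palo alto is near stanford university and university palo verde",
}
index = build_positional_index(docs)
print(phrase_query(index, "Stanford university palo alto"))  # [1]
```

On the same two documents a biword index would also return document 2; checking positions removes that false positive.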

Proximity queries are queries in which the matching words must occur within k words of each other, on either side. A positional index answers them efficiently, whereas a biword index or a phrase index does not. Most search engines use a combination of a positional index and a biword index.
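A proximity query over a positional index might look like the following Python sketch. The index layout, the function names, and the sample documents are hypothetical, and only the two-term case is shown.

```python
def build_positional_index(docs):
    """Return term -> {doc_id: [positions of the term in that document]}."""
    index = {}
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(position)
    return index

def proximity_query(index, term1, term2, k):
    """Return the documents where term1 and term2 occur within k words
    of each other, in either order."""
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    hits = []
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        if any(abs(p1 - p2) <= k
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.append(doc_id)
    return hits

docs = {
    1: "the quick brown fox",
    2: "the fox was not quick at all today",
}
index = build_positional_index(docs)
print(proximity_query(index, "quick", "fox", 2))  # [1]
print(proximity_query(index, "quick", "fox", 3))  # [1, 2]
```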


Algorithms on how the dictionary is stored and retrieved

Hash Tables: In this technique, each term in the universe of keys is hashed, the term is added to the dictionary at its hash value, and a posting list is set up for it there. This reduces memory consumption, so the dictionary can easily be stored in main memory. The problem is collisions: different terms may have the same hash value. When a collision occurs, the colliding terms are stored under the same hash value, each with its own posting list in increasing order. At query time, each query term is hashed, the hash value is looked up in the hash table, and the corresponding posting lists are retrieved and displayed to the user as results. The hash table size is Θ(|S|), where S is the set of terms stored in the hash table, and the total running time required to set it up is Θ(|S|). Reorganizing the table is expensive. Query words that are very close to each other may also have hash values that are far apart, so similar terms cannot be found together.
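Collision handling by chaining can be sketched in Python as follows. The class name HashDictionary, the bucket count, and the deliberately weak hash function (a sum of character codes, chosen so that a collision is easy to show) are all assumptions made for illustration.

```python
class HashDictionary:
    """Term dictionary stored as a hash table with chaining."""

    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, term):
        # Toy hash: anagrams such as "cat" and "act" always collide.
        return self.buckets[sum(map(ord, term)) % len(self.buckets)]

    def add(self, term, doc_id):
        bucket = self._bucket(term)
        for entry_term, postings in bucket:
            if entry_term == term:
                if doc_id not in postings:
                    postings.append(doc_id)
                    postings.sort()  # keep the posting list in increasing order
                return
        # New term (possibly colliding): it gets its own posting list.
        bucket.append((term, [doc_id]))

    def lookup(self, term):
        for entry_term, postings in self._bucket(term):
            if entry_term == term:
                return postings
        return []

dictionary = HashDictionary()
dictionary.add("cat", 2)
dictionary.add("act", 1)
dictionary.add("cat", 1)
print(dictionary.lookup("cat"))  # [1, 2]
print(dictionary.lookup("act"))  # [1]
```

"cat" and "act" land in the same bucket, yet each keeps a separate posting list, as described above.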

Binary search tree: Each node has a parent and two children, and the tree is subdivided further down to the leaves. The leaves hold the terms, connected to each other in increasing order, and each leaf holds the corresponding posting list of documents. When a query is performed, the search moves left or right at each node according to how the query word compares, until it finds the term and its posting list, and then retrieves all the documents corresponding to the query. The running time is O(lg n) as long as the tree is balanced. But as the number of terms increases the tree becomes unbalanced, and it is very difficult to rebalance. On average a lookup has to go through about 20 nodes to find a term and its posting list, which is about lg n when n is on the order of a million terms.
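The effect of an unbalanced tree can be seen in a short Python sketch. As a simplification, this sketch stores a term and its posting list in every node rather than only in the leaves; the names and sample entries are hypothetical.

```python
class Node:
    def __init__(self, term, postings):
        self.term = term
        self.postings = postings
        self.left = None
        self.right = None

def insert(root, term, postings):
    """Insert a term without any rebalancing."""
    if root is None:
        return Node(term, postings)
    if term < root.term:
        root.left = insert(root.left, term, postings)
    elif term > root.term:
        root.right = insert(root.right, term, postings)
    else:
        root.postings = postings
    return root

def lookup(root, term):
    """Return (posting list or None, number of nodes visited)."""
    visited = 0
    while root is not None:
        visited += 1
        if term == root.term:
            return root.postings, visited
        root = root.left if term < root.term else root.right
    return None, visited

def build(entries):
    root = None
    for term, postings in entries:
        root = insert(root, term, postings)
    return root

entries = [("caesar", [1, 2, 4]), ("brutus", [1, 2]),
           ("calpurnia", [2]), ("antony", [1])]
mixed = build(entries)          # insertion order gives a bushy tree
chain = build(sorted(entries))  # sorted insertion degenerates into a chain
print(lookup(mixed, "calpurnia"))  # ([2], 2)
print(lookup(chain, "calpurnia"))  # ([2], 4)
```

Inserting the terms in sorted order produces a chain, so the lookup visits every node instead of about lg n of them.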

B-Trees: In this technique the tree is rebalanced automatically as it grows. Each node consists of keys and pointers: the keys hold the values, and the pointers point to lower nodes, toward the terms. The root has at least 2 pointers, internal nodes have at least (n+1)/2 pointers, and leaves have at least n/2 pointers to posting lists, where n here is the number of keys a node can hold. The rightmost pointer of each leaf points to the next leaf, and all the terms are arranged from left to right in increasing order. The running time of the B-tree is O(lg n) in the number of terms. The number of node levels is very small.
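Why the number of levels stays small can be estimated with a few lines of Python. This is a rough estimate, not a B-tree implementation: it assumes every node is full and has the same number of pointers, and the figures of one million terms and a fanout of 100 are hypothetical.

```python
def levels_needed(num_terms, fanout):
    """Smallest number of levels h such that fanout**h >= num_terms,
    assuming every node has exactly `fanout` pointers."""
    levels = 1
    capacity = fanout
    while capacity < num_terms:
        capacity *= fanout
        levels += 1
    return levels

print(levels_needed(1_000_000, 2))    # 20 levels for a balanced binary tree
print(levels_needed(1_000_000, 100))  # 3 levels with 100 pointers per node
```

With many pointers per node, a lookup touches only a handful of nodes, compared with about 20 for a balanced binary tree over the same terms.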

Permuterm Index: This is the best technique for wildcard queries, which are useful when the user is not sure of the spelling of a word. A term such as “how” is first marked as how$ and then rotated into ow$h, w$ho, and $how. All these rotations appear in alphabetical order in the leaves of a B-tree, and every rotation points to the same original term, whose posting list of documents is then retrieved for the user. The drawback is that the dictionary becomes very large.
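The rotations and a wildcard lookup can be sketched in Python as below. A sorted list searched with the standard bisect module stands in for the B-tree, the query is assumed to contain exactly one * wildcard, and the function names and sample terms are hypothetical.

```python
import bisect

def rotations(term):
    """All rotations of term + '$', e.g. how$ -> ow$h -> w$ho -> $how."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in terms for rot in rotations(term))

def wildcard_lookup(permuterm, query):
    """Answer a query with exactly one '*', such as 'h*w'.
    The query is rotated so that the wildcard falls at the end,
    which turns it into a prefix search over the rotations."""
    head, tail = query.split("*")
    prefix = tail + "$" + head
    start = bisect.bisect_left(permuterm, (prefix,))
    matches = set()
    for rot, term in permuterm[start:]:
        if not rot.startswith(prefix):
            break
        matches.add(term)
    return sorted(matches)

print(rotations("how"))  # ['how$', 'ow$h', 'w$ho', '$how']
permuterm = build_permuterm(["how", "hew", "who", "show"])
print(wildcard_lookup(permuterm, "h*w"))  # ['hew', 'how']
```

Every term of length m contributes m + 1 entries, which is why the dictionary grows so much.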

K-gram index: This technique is mainly used for wildcard queries. Each term is divided into k-grams according to the value of k. For example, for “hello” a $ is appended to both ends of the term to give $hello$, and with k = 3 the k-grams are $he, hel, ell, llo, lo$. At query time, the posting lists of the query's k-grams are found and merged with an “AND” operation, and the result is returned to the user.
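A minimal Python sketch of the k-gram split and the AND step follows. The function names and the sample vocabulary are hypothetical. The final check of each candidate against the wildcard pattern is an addition that the text does not mention; it is included because the AND of k-grams alone can return terms that do not match.

```python
from fnmatch import fnmatchcase

def kgrams(term, k=3):
    """k-grams of the term after adding '$' at both ends."""
    padded = "$" + term + "$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

def build_kgram_index(terms, k=3):
    """Map each k-gram to the set of terms that contain it."""
    index = {}
    for term in terms:
        for gram in kgrams(term, k):
            index.setdefault(gram, set()).add(term)
    return index

def wildcard_candidates(index, query, k=3):
    """AND together the posting sets of the k-grams of a wildcard query."""
    padded = "$" + query + "$"
    grams = [piece[i:i + k]
             for piece in padded.split("*")
             for i in range(len(piece) - k + 1)]
    result = None
    for gram in grams:
        postings = index.get(gram, set())
        result = set(postings) if result is None else result & postings
    return sorted(result or [])

print(kgrams("hello"))  # ['$he', 'hel', 'ell', 'llo', 'lo$']
index = build_kgram_index(["hello", "help", "shell", "retired", "red"])
print(wildcard_candidates(index, "hel*"))  # ['hello', 'help']

# "retired" contains both $re and red, so it survives the AND for "red*".
candidates = wildcard_candidates(index, "red*")
print(candidates)                                          # ['red', 'retired']
print([t for t in candidates if fnmatchcase(t, "red*")])   # ['red']
```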

Spelling correction: This is needed when users make mistakes while typing a query. The correction algorithm can be designed in several ways:

1. Choosing the nearest correct spelling.
2. Finding the 2 or more closest spellings.
3. Preferring the terms that are more frequent in other users' queries.

We apply the Jaccard coefficient to find the terms that most closely match the query term, and then calculate the edit distance between them. For example, the edit distance of (carot, carrot) is 1, because adding the single letter ‘r’ turns “carot” into “carrot”.
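Both measures can be sketched in a few lines of Python. The choice of character bigrams as the sets compared by the Jaccard coefficient is an assumption, since the text does not say which sets are used; the function names are hypothetical. The edit distance shown is the standard Levenshtein distance (insertions, deletions, substitutions).

```python
def bigrams(word):
    """Set of adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard(a, b):
    """|A intersect B| / |A union B| over the bigram sets of the two words."""
    x, y = bigrams(a), bigrams(b)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def edit_distance(a, b):
    """Minimum number of single-letter insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

print(jaccard("carot", "carrot"))        # 0.8
print(edit_distance("carot", "carrot"))  # 1
```

Here "carot" and "carrot" share 4 of their 5 distinct bigrams, and one inserted 'r' separates them.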
