
Edge 1

Group Five
Dr. Tag
380
November 21, 2017

Word Search Solvers
Kelsey Edge, Mitch Mosley, and Josh

Abstract

The aim of this project is to analyze the efficiency of three different word search solver algorithms. The optimization of word search algorithms is required to design a fully functioning and efficient system (Volkov, Ramanauskaite 1). Word search solvers are expensive in terms of computational complexity and space complexity. Research to minimize these expenses in a word search game format has been minimal but diverse. This project compares three word search solver algorithms using three criteria: execution time, the number of times one character is compared to another in order to find a match (comparisons), and the amount of working storage an algorithm needs to solve the problem. It endeavors to identify areas in which the time, number of comparisons, and storage space can be minimized and to propose a more efficient solution. Three boards of sizes 10 by 10, 100 by 100, and 1000 by 1000 are used to gather data, with one word list of size five used in each puzzle. This project shows that two of the three algorithms are of reasonable cost in terms of space, time, and computational complexity, while the third proves to be inefficient.

Introduction

A word search is a simple game that many enjoy. It helps to improve a person’s vocabulary, expand erudition, train memory and intelligence, and develop logic and associative thinking (Volkov, Ramanauskaite 1). A word search is a puzzle consisting of letters arranged in a grid containing several hidden words (or a specific arrangement of characters). For the purpose of this project, these hidden words can be found on the puzzle board reading from left to right horizontally, diagonally, or vertically. The purpose of this project is to find the start and end location of each word in a list of words on a given puzzle board.

Formal Problem Statement

A word search board B is of size N x N (figure 1.1). Each bi,j represents an arbitrary letter in the ith row and the jth column. The list W = [w1, w2, … wz] contains Z words, each of which is a list of letters. The words in W may be found in B by searching in arbitrary locations reading left to right horizontally, vertically, and diagonally. The goal is to find the location in B of each word in W, showing that a sequence of adjacent letters on B exists matching the sequence of letters in the list representing the word.

Fig. 1.1 Example of a board of size N x N

Context

There have been many attempts to create an optimized word search solver. Oleksij Volkov and Simona Ramanauskaite of Šiauliai University created four relational database designs to solve a word search puzzle in order to design an optimized word search system. Volkov and Ramanauskaite implemented different database schemas and


compared the execution time, space complexity, and comparisons made. They concluded that the execution time depended on the amount of data in the database, not the structure. They also concluded that creating queries that did not take much time to perform was crucial to finding a balanced result between execution time and space complexity. They proposed that setting a limitation on the size of the words in the word list would greatly improve performance. This research guided our group to analyze execution time and space complexity, to use three greatly increasing board sizes, and to use one word list with a limitation on the allowed word length.

Sanket Jain and Manish Pandey of the Computer Science and Engineering Department at the Maulana Azad National Institute of Technology in Bhopal, India proposed a hash function with an adopted heuristic to solve a word search puzzle. This was concluded to have “reduced search time proportionally to constant to find a single pattern” (Jain, Pandey 1). Their implementation was based on an algorithm by P.A. Larson known as “Dynamic Hashing” (Jain, Pandey 1). The concept is to preprocess the puzzle into a hash table so that the starting positions of words are stored based on the SDBM hash function, named after the open-source project in which it was created. In the searching phase, “a word ‘W’ is inserted and its hash value ‘h’ is computed using the heuristic SDBM hash function” (Jain, Pandey 2). The conclusion of Jain and Pandey’s work was that preprocessing took O(n) time and searching for a word took constant time. A replication of the hashing method was not used in our project, due to a lack of resources and the flawed language translations of the research, but it inspired a new direction for this project, including measuring the number of comparisons it takes to find a match for a word in the word list.

Algorithm One: The Recursive Algorithm

Algorithm one employs a partially iterative, partially recursive approach to the word search problem. The algorithm begins by iterating through the 2-D array of characters that represents the word search puzzle. During this initial iteration, the algorithm attempts to find a letter that is the first letter in one of the five search words. Initially, the idea was that the algorithm would look for a node containing any letter in any of the words and use that as a starting point to pass to the recursive function. This approach was quickly deemed ineffective. While it did help reduce the depth of the search by increasing the probability that the algorithm would find and search around a letter of a word placed on the bottom row, thereby finding the word before having to search the bottom row, this probable reduction in search depth was not worth the cost of looking at nearly every single letter on the board and then consequently looking in every direction around each of those letters. By having the algorithm search only for the first letter of each word, we can eliminate a large number of letters from the pool of candidates. Furthermore, we can eliminate recursive searching in the horizontal-left, vertical-up, diagonal up-left, and diagonal down-left directions, because from the chosen starting letter the remaining letters of the word can only be found left to right, straight down, diagonally up to the right, or diagonally down to the right.

Upon finding a letter that is the first letter in a search word, the algorithm then designates that letter as the start to a candidate word and passes that letter, along with the search word it is potentially the


starting point for, and an unspecified direction (dir = 0) into a recursive function. The recursive function checks in all of the valid directions to see if it can match the second letter of the search word in question. If it can find the second letter, then the direction of the search is set to the direction in which the second letter was found, relative to the first letter. After establishing this direction, the function calls itself, passing in the direction (dir = 1, 2, or 3 rather than 0) along with the second letter and the word it is searching for. This process is repeated, now taking the search direction into account, effectively cutting down the number of comparisons made for the remainder of the search by up to 75%. By tracking the search direction, the program also eliminates the possibility of the algorithm finding an unordered sequence of letters that makes up a search word but does not meet the condition of connectedness.
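The recursive scheme described above can be sketched as follows. This is a minimal illustration in the spirit of algorithm one, not the group's actual code; the class and method names are our own.

```java
public class RecursiveSolver {
    // {rowStep, colStep} for the four valid directions: right, down,
    // down-right diagonal, up-right diagonal.
    private static final int[][] DIRS = {{0, 1}, {1, 0}, {1, 1}, {-1, 1}};

    /** Iterative outer loop: scan the board for the word's first letter. */
    public static boolean find(char[][] board, String word) {
        for (int r = 0; r < board.length; r++) {
            for (int c = 0; c < board[0].length; c++) {
                if (searchFrom(board, word, r, c)) {
                    return true;
                }
            }
        }
        return false;
    }

    // Try each valid direction from a cell whose letter matches the
    // first letter of the word (the direction is unspecified here).
    private static boolean searchFrom(char[][] board, String word, int r, int c) {
        if (board[r][c] != word.charAt(0)) {
            return false;
        }
        for (int[] dir : DIRS) {
            if (match(board, word, 1, r + dir[0], c + dir[1], dir)) {
                return true;
            }
        }
        return false;
    }

    // Recursive step: once a direction is chosen it never changes, which
    // enforces connectedness and prunes comparisons in other directions.
    private static boolean match(char[][] board, String word, int i,
                                 int r, int c, int[] dir) {
        if (i == word.length()) {
            return true;   // every letter matched
        }
        if (r < 0 || r >= board.length || c < 0 || c >= board[0].length) {
            return false;  // ran off the board
        }
        if (board[r][c] != word.charAt(i)) {
            return false;  // letter mismatch along this direction
        }
        return match(board, word, i + 1, r + dir[0], c + dir[1], dir);
    }
}
```

Because the direction array contains only the four left-to-right/downward/up-right directions, the sketch reflects the pruning argued for above.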

Algorithm one, despite its simplicity, performed well when solving the word search problem. The fact that the recursive function tracks the search direction of each candidate word allows the algorithm to substantially cut down on the number of potential comparisons, but unfortunately this is not taken into consideration in the calculation of the assumed worst case. The worst-case runtime for this algorithm is O(n^2 * m), where n represents the width of a row and the length of a column, and m represents the number of search words in the word list. This represents the worst-case runtime because the iterative portion of the algorithm could have to look at each element of the search board, and the recursive portion could have to look around each letter for each word in the list of search words.

Algorithm Two: The Reductive Algorithm

Algorithm two aims to reduce the number of comparisons required to find all words in the given word list. This is done by determining the size of each word and deciding whether or not to continue checking for the word in a given direction based upon the row and column position. Let m represent the length of a given word, r the row position, c the column position, and n the length of each row in the puzzle. Let S represent a Boolean that determines whether to continue checking for a match in the given direction: if S is true, the algorithm stops searching in the current direction; if S is false, it continues. When a higher row or column position is mentioned, this is on a scale where zero is the highest position and n is the lowest. Algorithm two begins by comparing b0,0

with the first letter of the first word in the given word list. If it is a match, it will check each of five methods which check horizontally, vertically up, vertically down, diagonally up, and diagonally down to determine if the first word is found. If the first word is found, the algorithm will stop looking for the first word and start again at b0,0 with the second word and continue the pattern until there are no more words in the word list. If there is not a match at b0,0, the algorithm will continue searching row by row.

When determining whether to check horizontally for a match, the column position determines whether to continue searching. If the column position plus the length of the word is greater than the length of the puzzle's row, the remaining characters in the row do not have to be checked for a match while searching horizontally. Otherwise, the algorithm must continue searching for a match in that direction. This can be represented by S = (c + m) > n. Please see figure 2.1 for a visual diagram representing this property. Checking


vertically downwards refers to checking whether a given word is found in one column where the first letter of the word exists at a higher row position than the last letter of the same word. When determining whether to check vertically downwards for a match, the row position determines whether to continue searching. If the row position plus the length of the word is greater than the length of the puzzle's row, the remaining characters in the column do not have to be checked for a match. This can be represented by S = (r + m) > n. Please see figure 2.2 for a visual diagram representing this property. Checking vertically upwards refers to checking whether a given word is found in one column where the first letter of the word exists at a lower row position than the last letter of the same word. If the row position incremented by one, minus the length of the word, is less than zero, the remaining characters in the column do not have to be checked for a match. This can be represented by S = (r + 1) - m < 0. Please see figure 2.3 for a visual diagram representing this property. Checking diagonally down refers to checking whether a given word is found diagonally in a puzzle where the first letter of the word has a higher row position and lower column position than the last letter. If the row position plus the length of the word is greater than the length of the puzzle's row, or the column position plus the length of the word is greater than the length of the puzzle's row, the remaining characters do not have to be checked for a match diagonally downwards in the corresponding rows and columns. This can be represented by S = (r + m) > n or (c + m) > n. Please see figure 2.4 for a visual diagram representing this property. Checking diagonally up refers to checking whether a given word is found diagonally in a puzzle where the first letter of the word has a lower row position and higher column position than the last letter. If the row position incremented by one, minus the length of the word, is less than zero, or the column position plus the length of the word is greater than the length of the puzzle's row, the remaining characters do not have to be checked for a match diagonally upwards in the corresponding rows and columns. This can be represented by S = (r + 1) - m < 0 or (c + m) > n. Please see figure 2.5 for a visual diagram representing this property.
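The five stopping conditions can be collected into simple predicates. This is a hedged sketch with our own names, not the group's code; each method returns S, where true means a word of length m cannot fit in that direction from position (r, c) on an n x n board.

```java
public class ReductiveChecks {
    // S for the horizontal (left-to-right) direction.
    public static boolean stopHorizontal(int r, int c, int m, int n) {
        return (c + m) > n;                     // word would run off the right edge
    }

    // S for the vertically-downwards direction.
    public static boolean stopVerticalDown(int r, int c, int m, int n) {
        return (r + m) > n;                     // word would run off the bottom edge
    }

    // S for the vertically-upwards direction.
    public static boolean stopVerticalUp(int r, int c, int m, int n) {
        return (r + 1) - m < 0;                 // word would run off the top edge
    }

    // S for the diagonally-down direction.
    public static boolean stopDiagonalDown(int r, int c, int m, int n) {
        return (r + m) > n || (c + m) > n;      // off the bottom or the right
    }

    // S for the diagonally-up direction.
    public static boolean stopDiagonalUp(int r, int c, int m, int n) {
        return (r + 1) - m < 0 || (c + m) > n;  // off the top or the right
    }
}
```

For example, on a 10-wide board a 5-letter word starting at column 7 cannot fit horizontally, so the horizontal check is skipped there.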

Algorithm two repeats this process for each word in the word list. The advantage of algorithm two is that it drastically reduces the number of comparisons required for each word in the word list based upon the length of the word. When analyzed against algorithms one and three, it proves to use the least amount of memory and to have the shortest execution time. The disadvantage is that the worst-case scenario requires n^2 comparisons for each word in the word list to confirm a match. That results in O(n^2 * w), where w represents the number of words in the word list.

Algorithm Three: The Trie Algorithm

Algorithm three employs a data structure that is preprocessed before the word search board is considered. It uses a trie, a structure typically used in mass text processing that allows for prefix analysis (and, in our case, reversed-word analysis) of certain keywords that need to be identified. All words are looked for simultaneously instead of on a word-by-word basis when searching for a starting character; however, once a starting character is found, it typically takes fewer comparisons to find the word. A deletion method is employed on the trie that removes a word from the structure once it is found, eliminating extra comparisons later in the array.


To start, the list of words being looked for is placed into the structure, with each starting character linked to a root node that has a value of null (figure 3.1). To increase the efficiency of algorithm three, the preprocessing phase can be extended slightly by also placing all the keywords into the trie backward, so that the ending letter of each word is looked for and the words can be seen in reverse. Otherwise, a method within the trie structure itself could be written to process from the end of each word toward the root.

Once the trie is built and the preprocessing phase has completed, the searching phase of the trie algorithm begins. It takes the word search board, creates a starting position at b0,0, and begins iterating across the first row in search of the first or last character of each word. At each letter in the 2-D array, the character is searched for in the trie as one of the starting or ending children of a word attached to the root node. Once a child of the root node (the null node) is found, the algorithm begins to process each letter. The first character is confirmed by the data structure, and a directionality check begins to see if the word continues in one of the predetermined directions. For this project, the only directions considered were horizontal, vertical, and diagonal; because reversed words are stored in the trie, words reading from right to left could also be detected. This, however, only applies if the beginning character is the first letter in the standard form of the word. If the letter at the end of the standard form, the last letter, is found instead, a different set of directions needs to be checked (down-left and down). The iteration for the directionality check was based on a clockwise movement around the starting letter: first up, then up-right, and so on (shown in figure 3.2). Each of these was a new direction object stored within an array of directions. If the next letter of the word is found during the directionality check, the algorithm proceeds in a linear fashion in that direction, checking each of the next characters against the word in the structure. If the characters continue to match, it proceeds to the end, returns a found identifier, and the word is removed from the structure along with its reverse. However, if the word is not found along that line, the algorithm returns a not-found identifier and continues its search along each row of characters until all the words are accounted for.
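A minimal trie supporting the forward-and-reversed insertion described above might look like the following sketch. This is our own illustration under the paper's assumptions, not the group's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Each word is inserted twice: once forward and once reversed, so a board
// scan can match on either the first or the last letter of a word.
public class WordTrie {
    private final Map<Character, WordTrie> children = new HashMap<>();
    private boolean endOfWord;

    /** Insert both the standard and reversed forms of a word. */
    public void add(String word) {
        insert(word);
        insert(new StringBuilder(word).reverse().toString());
    }

    private void insert(String s) {
        WordTrie node = this;
        for (char ch : s.toCharArray()) {
            // Walk down, creating child nodes as needed.
            node = node.children.computeIfAbsent(ch, k -> new WordTrie());
        }
        node.endOfWord = true;  // mark the final letter of a stored word
    }

    /** True if ch is a child of the root, i.e. starts or ends some word. */
    public boolean isStartOrEnd(char ch) {
        return children.containsKey(ch);
    }

    /** True if s is stored as a complete word (forward or reversed). */
    public boolean contains(String s) {
        WordTrie node = this;
        for (char ch : s.toCharArray()) {
            node = node.children.get(ch);
            if (node == null) {
                return false;   // prefix not in the trie
            }
        }
        return node.endOfWord;
    }
}
```

A deletion method, as described above, would additionally unmark (or prune) both the forward and reversed entries once a word is found on the board.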

This setup functioned well to complete its tasks and considered the possibility of duplicates within the array through deletion, which reduced the number of comparisons later in the array when words were found early. However, if words were placed farther into the array, more comparisons needed to be done, and this algorithm could have a large runtime to account for all of the prefixes possible within a very large 2-D array. Its ability to look for all words at the same time could also lead to problems due to the way the algorithm operates: if a character is the starting or ending character of any key in the structure, it must take the time to check the directionality, which for a starting character is up to five different directions and for an ending character is two.

Analysis and Conclusions

Before beginning development on our algorithms, we collectively agreed that we expected the reductive algorithm to perform the best on most if not all metrics, and expected the recursive algorithm to perform the worst on most metrics. We considered the Trie algorithm to be somewhat of a wild card given that, in our research, we did not find much if any reference to a trie-based algorithm being used to conquer the word search problem.


After development and analysis of our algorithms, however, the data told a slightly different story.

The first metric we used to compare our algorithms was the total amount of time it took each algorithm to run on the 10x10, 100x100, and 1000x1000 word search boards. This measurement was taken using the Java library method java.lang.System.nanoTime. The reductive algorithm was very efficient in terms of time because it works to disqualify sections of the two-dimensional array that do not have the potential to house the search words, because there are not enough remaining characters before the edge. The recursive algorithm performed second best in terms of time complexity. The recursive algorithm was likely outperformed by the reductive algorithm because it is initially iterative and takes a substantial amount of time to compare each letter in the puzzle against the first letter of each search word, unless all the words are found before that is necessary. The Trie algorithm performed the worst in terms of time. This is likely because each letter is checked by passing it into the trie structure before checking the possible directionality of the word. This large difference in performance is a direct result of the number of objects being generated and added to or removed from the data structure.

The second metric we used to compare our algorithms was the total number of comparisons each algorithm made of a letter or candidate word to a search word or a letter contained in a search word. In this metric, the recursive algorithm performed the best on all board sizes. This algorithm had an advantage in that it establishes a search direction and only makes comparisons in that direction, immediately exiting the recursive function if a match is not found. The reductive algorithm was significantly less efficient than the recursive algorithm in terms of the number of comparisons, despite taking less total time to run. The Trie algorithm performed the worst in terms of total comparisons because, as the array got larger, the possibility of letters being a prefix in the trie increased, which increased the number of comparisons that needed to be done.
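The timing measurement for the first metric can be sketched as follows; the Runnable stands in for any of the three solvers, and the helper name is ours.

```java
public class TimingExample {
    /** Measure elapsed wall-clock time for one solver run, in nanoseconds. */
    public static long timeNanos(Runnable solver) {
        long start = System.nanoTime();   // high-resolution timestamp before the run
        solver.run();                     // run the algorithm under test
        return System.nanoTime() - start; // elapsed nanoseconds
    }
}
```

Note that System.nanoTime is a monotonic elapsed-time source, so differences between two readings are meaningful even though the absolute values are not.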

The third and final metric we used to compare our algorithms was space complexity, which was recorded using a Java library for tracking memory use. The reductive algorithm and the recursive algorithm performed almost identically in terms of space complexity when applied to the 10x10 and 100x100 board sizes, the reductive algorithm doing negligibly better on the 100x100 but being outperformed slightly on the 10x10. As the problem scaled, however, the recursive algorithm suffered severe scalability issues and was outperformed by both of the other algorithms. This is because the board size increased while the search words remained the same, so many more letters were being considered as candidates per word actually found; in the recursive algorithm, each time a letter is passed into the recursive function and searched around, a candidate word variable is instantiated, taking up a chunk of memory. While the Trie algorithm was more successful in minimizing

To rank the algorithms quantitatively relative to one another, we calculated a score for each algorithm. This score was the sum of the rank of the algorithm on each of the three metrics for each of the three board sizes, a score of 1 being the most optimal algorithm for minimization of the metric and a 3 being the least. Using this scoring system, the reductive algorithm was the most optimal with a score of 13, the recursive algorithm came in second place with a score of 14 and the Trie algorithm was least optimal with a score of 26. It is important to note that these scores are not


weighted based on the magnitude of the difference in performance in each category. Each algorithm had its merits, aside from the Trie algorithm, which did not outperform the other algorithms on any metric, but the reductive algorithm proved most effective, as we stated in our original hypothesis.
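The scoring scheme above amounts to summing nine ranks per algorithm (three metrics times three board sizes, each ranked 1 to 3, lower being better). A trivial sketch, with placeholder rank values rather than the group's recorded per-cell ranks:

```java
public class ScoreExample {
    /** Sum the nine per-metric, per-board-size ranks for one algorithm. */
    public static int score(int[] ranks) {
        int total = 0;
        for (int r : ranks) {
            total += r;   // each rank is 1 (best) to 3 (worst)
        }
        return total;
    }
}
```

Under this scheme the best possible score is 9 (ranked first everywhere) and the worst is 27, which frames the reported totals of 13, 14, and 26.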

Possible procedure holes include limiting the length of the word list, limiting the size of the words in the word list, and starting from position b0,0 for all three algorithms. The length of the word list was limited for this project based upon Volkov and Ramanauskaite's research. We could use different-sized word lists in the future to help draw more concrete conclusions. This project limited the size of all words in the word list based on the proposal from Volkov and Ramanauskaite, which predicted that limiting the size of the words would produce more balanced results and a more efficient algorithm. Lastly, the recursive, reductive, and trie algorithms all started at b0,0 and iterated left to right, row by row and column by column. It is possible that restraining all three algorithms to this particular iterative pattern deprived us of seeing more diverse algorithms which could produce better results. If we were to restart this project, these possible procedure holes would be tested and analyzed.

Future Work

If we were tasked with approaching this same problem again, or with working further with it in the future, there are a few things we would like to consider and attempt to implement to vary our project further and approach it from a new angle.

To begin with, we would vary the list of words and increase the number of puzzle sizes considered. In the beginning of our project we implemented varying lists and a smaller range of sizes, but to limit the amount of variance we chose to stick with one word set and three puzzle sizes (10x10, 100x100, 1000x1000). By reintroducing varied word lists and puzzle sizes, we could explore different options, including how the three algorithms would compare when run on unique puzzles simply given to them.

The second new approach we would take would be to develop an algorithm that is not constrained to starting at b0,0, possibly starting somewhere near the center of the search array and working outwards. This could open up much faster solutions even if we reimplemented the algorithms we already have. With our Trie algorithm, starting from the center could return much faster results due to its ability to look for both the beginning and ending letters of words. Our reductive algorithm could benefit in that it could rule out areas of the board sooner than before, giving itself a smaller search area to work with. This could also pose a problem, though, if a word were placed in one of the corners, as the algorithms would have to run longer to find its starting letter.

Lastly, we would like to look further into the research of Sanket Jain and Manish Pandey, who implemented an algorithm that used a hash table to attempt to decrease the time and overhead involved in solving a word search. They break the processing of the word search into two parts: preprocessing and searching. They begin by reading in the whole array and filling the table with the respective hashed indices, after which they enter the word, compute its hash value, and check whether the index is empty. If it is empty, the word is not found; if it is not empty, the algorithm extracts the next starting position and begins a loop that compares the word using the starting position and continues until a


match is found, at which point it gives a word-found report, or it finishes and the word is never found.

Questions

1. Is a word search a constraint satisfaction problem or an optimization problem?

Answer: Constraint satisfaction.

2. Why is memory usage in the Trie data structure typically larger than in other data structures?

Answer: Tries have a much larger O: O(alphabet size * key length * n), where n is the number of words.

3. What was the computational complexity of the reductive and recursive algorithms (Algorithms 1 and 2)?

Answer: O(n^2 * m), where n = the number of rows and columns and m = the number of words in the list.

4. Out of the three algorithms, the Reductive algorithm (Algorithm 2) ran the quickest because it did what?

Answer: It ruled out areas of the puzzle where certain words could not exist.

5. Is finding the location of a list of words on a word search board a P or NP problem?

Answer: P.

Figures Referenced for Algorithm Two


Fig. 2.1 checking horizontally

Fig. 2.2 checking vertically down

Fig. 2.3 checking vertically up

Fig. 2.4 checking diagonally down

Fig. 2.5 checking diagonally up

Fig 2.6


[Chart] Total Time 10x10 (ns): Algorithm 1 = 1666881.1, Algorithm 2 = 630810.5, Algorithm 3 = 4330160.1295

[Chart] Total Time 100x100 (ns): Algorithm 1 = 5793202.2, Algorithm 2 = 3455004.9, Algorithm 3 = 9298294.5


Works Cited

Volkov, Oleksij, and Simona Ramanauskaite. “Research of Word Search Algorithms Based on Relational Database.” Su.It, Jaunuju Mokslininku Darbai, 2 Nov. 2013.

Jain, Sanket, and Manish Pandey. “Hash Table Based Word Searching Algorithm.” International Journal of Computer Science and Information Technologies, vol. 3, no. 3, 2012, pp. 4385–4388. SemanticScholar.org.