Approximate phrase searching: Movie scripts and song lyrics

Kathryn Patterson, Carolyn Watters
Faculty of Computer Science, Dalhousie University
Halifax, Nova Scotia, Canada B3H 3L7

Abstract

Search engines provide an effective means of retrieving a document in which a piece of text occurs when the query contains infrequently occurring terms or the query is known to be an exact phrase. However, phrase queries usually contain common terms, including determiners, and users may not remember phrases exactly. Search engines discard common terms or assign them little importance, which may lead to poor retrieval results. In this paper, we examine the use of proximity-based phrase searching to search for quotes from song lyrics and movie scripts and compare the results against Google.ca, Yahoo.com and Ask.com. An improvement of over 25% on search engine results shows that an additional search method to complement the common search engine methods would be beneficial for this task.

Introduction

Search engines currently include a variety of features that allow users to customize their query for their task. For the task of finding web pages on a topic, simple keyword searching is available in two forms: “any of these terms” and “all of these terms”. For the task of phrase source searching, for example in quotations or lyrics, phrase matching is available in the form of “exact phrase”.

“Exact phrase” matching requires that all terms be present in the order provided. “All of these terms” searches require that all of the terms provided be found in the retrieved document, but order does not matter. “Any of these terms” searching requires that at least one of the terms provided be found in the retrieved document and order, again, does not matter. Other features allow users to specify languages, domains, etc. in order to narrow their results. All of these features make present search engines very powerful. However, there is still room for improvement.

In general, the complexity of queries used in search engines has been decreasing and the voluntary usage of query operators remains quite low (Jansen, 2006). In addition, 81% of users do not look at results past the first page (Jansen, 2006). Whether this is a result of search engine performance or user behavior, maintaining the level of simplicity which users are familiar with is important when providing them with a new method of performing certain types of queries. Increasing the complexity of queries also runs the risk of increasing the likelihood of errors (Jansen & Pooch, 2000).

Current search engine query methods do not meet all needs. For example, the task of searching for the source of a phrase is very difficult when the user does not remember the phrase correctly. When the user does recall the precise phrase, exact phrase matching should be sufficient for finding the source of the phrase. We developed a proximity-based method in a previous study (Patterson et al., 2008) which showed promising results when compared to the vector space model for use in a dataset of movie scripts.

Our objective in this paper is to examine how well the proximity-based method performs on other datasets and how well it performs in comparison to search engines, which provide phrase search methods.

In this study, we used queries related to song lyric quotes. The rank and the accuracy of the proximity-based method are compared to keyword searching with the Vector Space Model and to search engine methods using Google.ca, Yahoo.com and Ask.com. In addition, queries related to movie quotes are examined using the same search engines, and the results of this study are compared to the results of our previous study on movie quotes.

The results of these studies show an improvement of approximately 25% in accuracy over the search engines tested. The vector space model performed quite poorly. These results indicate that our proximity-based method would make a promising additional search engine method for phrase-searching tasks.

Related Work

Phrase searching enables users to find documents containing the full text for a remembered phrase (Jansen, 2000) (Eastman & Jansen, 2003). Phrase searching accounted for approximately 10% of queries in a study on the usage of search engines on WebCrawler and Magellan (Jansen & Pooch, 2000). Keyword query methods have been shown to be far less likely to achieve the desired result than performing an exact phrase search (Jansen, 2000) (Eastman & Jansen, 2003).

Search engines, such as Google, allow users to provide exact phrases in search queries and return only those documents containing the phrase as it appears, with the exception of the addition or omission of punctuation. The more uncommon the phrase used in the query, the more likely it is that the desired document will be ranked correctly. Exact phrase matching is very efficient, but users have trouble remembering the exact text to use in the query (Salton & McGill, 1983).

The vector space model is the most well-known and the most commonly used model in information retrieval (Jansen, 2000). Among the many reasons for its wide use are its simplicity, usability in broad types of text searching tasks and high performance in keyword searching (Jansen, 2000). For the task of phrase source searching, the method is appealing as it allows us to perform potentially successful queries when one or more of the words provided are incorrect.

In a previous study (Patterson et al., 2008), however, we showed that the vector space model performs poorly on phrase source searching. We developed a method based on proximity operators, motivated by the use of a k-word proximity search by Sadakane et al. (Sadakane & Imai, 1999), testing done by Keen on four proximity search methods (Keen, 1992) and a combination of a proximity measurement and the Okapi probabilistic system by Rasolofo et al. (Rasolofo & Savoy, 2003), each of which showed promising results (Salton & McGill, 1983).

Proximity searching is the application of a maximum distance and, optionally, an ordering (Rasolofo & Savoy, 2003). Proximity queries are written in expressions with two terms as operands and a proximity operator. For example, trailer NEAR3 boys means that there may be no more than three words between trailer and boys and that they can be in any order. In order to specify that trailer must come before boys, the operator would then become WITHIN3.
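The two operators can be illustrated with a minimal Python sketch. The function names near and within, and the use of a pre-tokenised list of words, are assumptions made for this illustration and are not taken from the systems cited above.

```python
def near(tokens, a, b, k):
    """True if a and b occur with at most k words between them, in any order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    # k words in between means the positions differ by at most k + 1.
    return any(0 < abs(i - j) <= k + 1 for i in pos_a for j in pos_b)


def within(tokens, a, b, k):
    """True if a occurs before b with at most k words between them."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(0 < j - i <= k + 1 for i in pos_a for j in pos_b)


# trailer NEAR3 boys matches in either order; trailer WITHIN3 boys does not.
text = "boys in the trailer".split()
print(near(text, "trailer", "boys", 3), within(text, "trailer", "boys", 3))
```

The sketch scans all position pairs, which is adequate for short texts; an indexed implementation would normally work from sorted position lists instead.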

In our method, we required that each pair of terms be within a certain maximum distance of one another. The maximum distance was calculated based on the query provided. This essentially created a maximum window in which all terms considered must appear (Patterson et al., 2008).

It was hypothesized that a greater emphasis on matching as many terms as possible within close proximity had a higher relevance in approximate phrase matching than term-frequency factors employed by keyword search methods. The study showed that the proximity-based methods developed significantly outperformed the vector space model (Patterson et al., 2008).

The methods developed in our previous study on phrase source searching accounted for the number of terms found, the collocation of terms, length of terms, and the frequency of terms. It was found that, on this dataset, the methods which accounted for term length and frequency showed little or no improvement over the simpler method (Patterson et al., 2008).

Our interest is in the use of a proximity-based phrase search method and testing the retrieval accuracy for retrieving quotes from text.

Methodology

The methodology used in this study consisted of building a lyrics corpus by selecting all web pages containing lyrics for songs from LyricsDownload.com, obtaining manually generated user queries, cleaning this data, building term-document matrices, applying the vector space model to all documents for each query, building a corpus-term matrix from the lyrics, building term-term proximity matrices from query terms, applying a proximity-based scoring system to all documents for each query, supplying the query terms in a site-specific query to the search engines, and evaluating the resulting document rankings.

Dataset

For the first experiment, a corpus of 712 complete movie scripts was obtained from the Internet Movie Script Database (IMSDb.com, 2007). The movie scripts had an average length of 23,766 words. A collection of 91 queries was obtained from six university students.

Figure 1. LyricsDownload.com lyrics

For the second experiment, a corpus of 685,079 song lyric documents was obtained from LyricsDownload.com (LyricsDownload.com, 2007). An example of the documents is found in Figure 1. This dataset was chosen as it contained roughly three orders of magnitude more documents than the movie script collection and the documents were about two orders of magnitude shorter than the movie scripts. The lyrics had an average length of 216 words. A collection of 86 queries was obtained from 8 university students.

The proximity search approach used in this study was based solely on words and relative locations of the words within the scripts. All HTML tags and images, etc., were removed from the web pages. All words were converted to lowercase and punctuation characters were removed.
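The cleaning steps just described can be sketched in Python as follows. The tooling actually used is not stated in this paper, so the function name clean_page and the regular expressions are assumptions chosen for illustration only.

```python
import html
import re


def clean_page(raw_html):
    """Return the lowercase word list of a web page (illustrative sketch)."""
    # Drop script and style blocks entirely, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", raw_html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; and convert to lowercase.
    text = html.unescape(text).lower()
    # Remove punctuation characters, keeping word characters and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()


print(clean_page("<p>Here's <b>looking</b> at you, kid!</p>"))
```

Under this sketch an apostrophe is removed like any other punctuation character, so "Here's" becomes "heres"; the same normalization would have to be applied to query terms for them to match.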

For each collection of queries, the students were asked to write down quotes in the form of a complete or incomplete sentence, indicating “small” and “large” gaps between sentence segments. The students were not told what the interpreted maximum size of the gaps would be. The students were instructed to write quotes from any movie or song that came to mind. They were not told what movies or songs to choose from, so a few of the quotes provided came from documents which did not exist in the document sets. These quotes were removed from the query sets.

Proximity-based phrase searching

Queries are entered in the form t_1 o_1 t_2 o_2 … o_{n-1} t_n, where t_i is a term or word and o_i is an operator between the words t_i and t_{i+1}. Each term is a word and each operator is whitespace, an ellipsis or a double-ellipsis. Whitespace operators imply that the adjacent terms ought to be consecutive; however, our constraints will be more relaxed. A single ellipsis implies that there is a small gap between the adjacent terms and a double-ellipsis implies that there is a large gap between the adjacent terms. Two sample queries are as follows:

• here's looking at you kid
• My name is ... you kill my father, prepare to die

Each operator is converted into an allowable distance between the adjacent terms. The following conversions were used:

• Whitespace → maximum of 2 terms apart
• Single ellipsis → maximum of 10 terms apart
• Double ellipsis → maximum of 20 terms apart

The distances were chosen by experimentation and made larger than is likely needed to ensure that the conditions were not too strict. Using these distances, a maximum phrase window size, M, is determined in terms of the maximum number of words in the window by adding the total allowable distances and the number of words. For example, the window sizes for the example queries above are calculated in Figures 2 and 3, resulting in 13 and 36, respectively.
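The conversion from a query to its maximum window size M can be sketched as below. The names parse_query and max_window are hypothetical, and the way a double ellipsis is written (two "..." in a row, with or without a space between them) is an assumption, since the exact input syntax is not specified here.

```python
import re


def parse_query(query):
    """Split a query into terms and the allowed distance between neighbours."""
    query = query.replace("\u2026", "...").lower()
    tokens = re.findall(r"\.{3,}|[^\s.]+", query)
    terms, distances = [], []
    pending = 0  # ellipses seen since the previous term
    for tok in tokens:
        if tok.startswith("..."):
            pending += len(tok) // 3
            continue
        word = re.sub(r"[^\w]", "", tok)  # strip punctuation from the term
        if not word:
            continue
        if terms:
            # whitespace -> 2, single ellipsis -> 10, double ellipsis -> 20
            distances.append(2 if pending == 0 else 10 if pending == 1 else 20)
        terms.append(word)
        pending = 0
    return terms, distances


def max_window(query):
    """M = total allowable distance + number of query terms."""
    terms, distances = parse_query(query)
    return sum(distances) + len(terms)


print(max_window("here's looking at you kid"))
print(max_window("My name is ... you kill my father, prepare to die"))
```

For the first query, four whitespace operators contribute 4 × 2 = 8 and the five terms contribute 5, giving 13. For the second, eight whitespace operators and one single ellipsis contribute 16 + 10 = 26 and the ten terms contribute 10, giving 36.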

Figure 2. Maximum window size calculation for here’s looking at you kid

For each unique maximum window of size M, a binary vector, f, of size |t| is created which indicates whether each term, t_i, was found in the window. If t_i is found in the window, f_i = 1, otherwise f_i = 0. Each term t_i from the phrase may only be counted once, but since terms are not necessarily unique, some words may be counted more than once.

The positions of the first and last words that match phrase terms are the first and last words of the actual window, W_j. Given the length of the document, length, there are length - M + 1 windows in that document and each window is denoted by W_j, where j indicates that the first term in the original window of size M is the jth word in the document.
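A minimal sketch of the window enumeration and the binary vector f is given below. It assumes that f_i is set whenever the word t_i appears anywhere in the window, so a word that occurs twice in the query is counted twice from a single occurrence in the text, and that a document shorter than M is treated as a single window. Both choices, and the function name, are assumptions of this illustration.

```python
def window_found_vectors(doc_tokens, query_terms, max_window):
    """Return one binary vector f per window W_j of at most max_window words."""
    # length - M + 1 windows; a shorter document is treated as one window.
    n_windows = max(len(doc_tokens) - max_window + 1, 1)
    vectors = []
    for j in range(n_windows):
        present = set(doc_tokens[j:j + max_window])
        vectors.append([1 if term in present else 0 for term in query_terms])
    return vectors


doc = "well here's looking at you kid okay".split()
query = ["here's", "looking", "at", "you", "kid"]
for j, f in enumerate(window_found_vectors(doc, query, 5)):
    print(j, f)
```

Rebuilding the set for every window keeps the sketch short; on a large corpus one would more likely maintain running counts as the window slides.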

For each window, the number of occurrences of adjacent query terms within the window is calculated. The values are stored in a vector, a.

A proximity-based scoring system was developed that ranked those documents the highest which maximized the number of query terms found within a minimized window. Each document would have a score given to each window and the highest score found would become the score for that document.

Figure 3. Maximum window size calculation for My name is … you kill my father, prepare to die

The number of query terms which were found is squared, since we placed higher priority on maximizing the occurrence of query terms than any other factor. The score was increased by the number of occurrences of adjacent query terms. Finally, the percentage of terms in the window which were not query terms is subtracted. This scoring system is expressed in the equation for PBP-reg:

(1)

where f_j is a binary array representing which query terms were found in the jth window with a 1 or a 0, a_j represents how many co-occurring query terms were present, and W represents the window length in the number of terms it contains.
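The following Python sketch is one reading of the prose description of PBP-reg, not a transcription of the typeset equation. It assumes that the adjacency count is supplied as a single number, that the "percentage" is expressed as a fraction between 0 and 1, and that the number of non-query words in the window is W minus the number of query terms found. The function name pbp_reg is hypothetical.

```python
def pbp_reg(found, adjacent, window_len):
    """Score one window: squared matches + adjacency - share of non-query words.

    found      -- binary list f_j, one entry per query term
    adjacent   -- number of occurrences of adjacent query terms (from a_j)
    window_len -- W, the number of words in the actual window
    """
    matched = sum(found)
    if window_len <= 0:
        return 0.0
    non_query_share = max(window_len - matched, 0) / window_len  # assumption
    return matched ** 2 + adjacent - non_query_share


# All five terms found consecutively in a five-word window.
print(pbp_reg([1, 1, 1, 1, 1], adjacent=4, window_len=5))
# Four of five terms found, spread across an eight-word window.
print(pbp_reg([1, 1, 0, 1, 1], adjacent=2, window_len=8))
```

Because the subtracted share never exceeds 1, finding one additional query term always outweighs any change in window tightness, which matches the stated priority on maximizing the number of query terms found. The score of a document would then be the highest window score, as described above.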

Two modified methods were developed to incorporate other factors, while keeping the highest importance on the factors already incorporated in PBP-reg. The first modified method incorporates the lengths of the query terms. This was designed to determine whether assigning importance to documents which matched longer terms from the query would improve results:

(2)

where f_j is a binary array representing which query terms were found in the jth window with a 1 or a 0, a_j represents how many co-occurring query terms were present, W represents the window length in the number of terms it contains, and length(t_k) represents the number of characters in t_k, where k is the index of a term in the query.

Finally, an attempt at further improvement upon PBP-reg was made by assigning importance to documents which matched the query terms that were less frequent, relative to the entire corpus:

(3)

where f_j is a binary array representing which query terms were found in the jth window with a 1 or a 0, a_j represents how many co-occurring query terms were present, W represents the window length in the number of terms it contains, and totalTF(t_k) is the number of times t_k occurs within the entire corpus, where k is the index of a term in the query.

Experimental procedure

All documents were converted into term-frequency vectors. All queries were converted into term vectors. The similarity between the query term vectors and the documents was calculated and documents were ranked using the VSM. The top 100 ranked documents were recorded. This experiment is referred to as VSM in the results.
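A small sketch of this baseline follows. The weighting scheme and similarity measure are not specified in this paper; raw term frequencies and cosine similarity are assumed here because they are the most common choice for the vector space model, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(q, d):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(q[t] * d[t] for t in q)  # Counter returns 0 for missing terms
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def rank_vsm(query_terms, docs, top_k=100):
    """Return the ids of the top_k documents by cosine similarity."""
    q = Counter(query_terms)
    scored = [(cosine(q, Counter(tokens)), doc_id) for doc_id, tokens in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by id
    return [doc_id for _, doc_id in scored[:top_k]]


docs = {
    "casablanca": "heres looking at you kid".split(),
    "other": "trailer park boys".split(),
}
print(rank_vsm(["looking", "kid"], docs))
```

Note that this measure ignores word order and distance entirely, which is the property the proximity-based methods are designed to address.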

The three PBP systems were tested on all queries over the whole corpus. The top 100 ranked documents for each query were recorded.

Three search engines, Google.ca, Yahoo.com and Ask.com, were used to perform site-specific searches. For example, in order to search for all of the terms “a”, “b” and “c” on lyricsdownload.com, we provided the query “a b c site:lyricsdownload.com” to Google.ca. We performed 4 different searches over all queries: an “any of these terms” search with Google.ca, an “all of these terms” search with Google.ca, an “all of these terms” search with Yahoo.com and an “all of these terms” search with Ask.com. These methods will be referred to as Google-any, Google-all, Yahoo-all and Ask-all, respectively. The “any of these terms” methods were not used with Yahoo.com or Ask.com, as the Google.ca searches yielded very poor results and both search engines began to exhibit the same behavior. For each query, the rank of the correct document among the first 100 results was recorded for each search engine method.
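The site-specific query strings can be assembled as in the sketch below. The helper name site_query is hypothetical, and joining terms with OR for the “any of these terms” case is an assumption about how that option maps onto a query string; only the “all of these terms” form is described in the text.

```python
def site_query(terms, domain, mode="all"):
    """Build a site-restricted query string from a list of terms."""
    if mode == "all":
        body = " ".join(terms)
    elif mode == "any":
        body = " OR ".join(terms)  # assumed syntax for "any of these terms"
    else:
        raise ValueError("mode must be 'all' or 'any'")
    return f"{body} site:{domain}"


print(site_query(["a", "b", "c"], "lyricsdownload.com"))
```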

The vector space model and proximity-based methods had already been tested on the movie script database in a previous study (Patterson et al., 2008), so we only performed the search engine methods on the movie script collection in this study.

The number of top ranked documents to examine was chosen conservatively in the previous study (Patterson et al., 2008). We chose 100 again in order to allow us to compare the results of both collections.

Results and discussion

The results of all eight scoring systems for both experiments are shown in Tables 1 and 2. Each entry gives the number of queries for which that scoring system ranked the correct document within the given top range of results.

A comparison of the accuracy between search methods in achieving the correct document within the top ranked documents is shown in Figures 4 and 5 and a comparison of the correct document occurring within the top ten ranked documents in both collections is shown in Figure 6 in terms of percentage of valid queries. All three of the PBP methods showed a great improvement in accuracy over the VSM method and the search engine methods. Averaged over the two collections, top-ten accuracy rose from approximately 1.6% for VSM and 55.3% for the best search engine method to 80.5%, 81.1% and 82.1% for PBP-reg, PBP-len and PBP-tf, respectively. The minimum improvement of PBP over the other tested methods was 25.2%, where improvement is measured as the difference, in percentage points, in the share of queries whose correct document was ranked within the top ten results. The amount of improvement over search engines in the experiment on song lyric queries was more than 10% higher than the improvement in the experiment on movie script queries. This may be explained by the fact that movie scripts tend to contain more commonly spoken language than song lyrics. Song lyrics very often contain phrases which would rarely, if ever, be said in conversation.
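The averaged top-ten figures can be checked from the table counts with a few lines of Python. The sketch assumes that the denominators are the 91 movie queries and the 86 lyric queries collected, an assumption that reproduces the reported averages to within about 0.1 percentage points. Only three methods are included, for brevity.

```python
# Top-10 counts as (movie scripts, lyrics), taken from Tables 1 and 2.
TOP10 = {
    "PBP-reg": (66, 76),
    "PBP-tf": (69, 76),
    "Google-all": (52, 46),
}
N_MOVIE, N_LYRICS = 91, 86  # assumed denominators


def average_accuracy(movie_hits, lyric_hits):
    """Mean of the two per-collection top-10 accuracies."""
    return (movie_hits / N_MOVIE + lyric_hits / N_LYRICS) / 2


for method, hits in TOP10.items():
    print(f"{method}: {100 * average_accuracy(*hits):.1f}%")

gap = average_accuracy(*TOP10["PBP-reg"]) - average_accuracy(*TOP10["Google-all"])
print(f"PBP-reg over Google-all: {100 * gap:.1f} percentage points")
```

Averaging the two per-collection accuracies gives each collection equal weight regardless of how many queries it contains.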

Table 1. Number of queries for which the correct document was ranked within the top N results, movie script collection

Movie Scripts (91 queries)     Top 1   Top 5   Top 10   Top 20   Top 50   Top 100
VSM                                1       2        3        8       17        23
PBP-reg                           55      63       66       68       72        79
PBP-length                        56      63       65       73       74        77
PBP-tf                            59      67       69       72       73        78
Google-any                         4      17       21       26       37        46
Google-all                        47      51       52       54       59        59
Yahoo-all                         18      31       32       39       42        46
Ask-all                           36      42       45       47       47        47

Table 2. Number of queries for which the correct document was ranked within the top N results, lyrics collection

Lyrics (85 queries)            Top 1   Top 5   Top 10   Top 20   Top 50   Top 100
VSM                                0       0        0        0        0         0
PBP-reg                           68      73       76       78       80        81
PBP-length                        69      76       78       80       82        83
PBP-tf                            68      73       76       77       78        80
Google-any                         0       1        1        2        2         4
Google-all                        42      46       46       47       48        49
Yahoo-all                         39      44       46       49       51        55
Ask-all                           42      45       47       49       50        52

Figure 4. Accuracy on movie script collection

Figure 5. Accuracy on song lyric collection

Figure 6. Comparison of the “top 10” accuracy in each collection

Given the large performance improvement from the proximity-based methods over the search engine methods, it is quite probable that such a method would prove to be useful. It may be the case that a proximity-based method such as this is not easily implemented in current search engine algorithms. The proximity-based method can still be employed for web searching by applying one or more “all” searches on subsets of the query to quickly gather a reasonably sized set of documents which an independent proximity-based algorithm can then be applied to.
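The candidate-gathering step suggested above can be sketched as follows. The in-memory inverted index stands in for a search engine's “all of these terms” search, and the function names and the choice of fixed-size subsets are assumptions made for illustration.

```python
from itertools import combinations


def build_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            index.setdefault(term, set()).add(doc_id)
    return index


def all_search(index, terms):
    """Ids of documents containing every one of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def candidates(index, query_terms, subset_size):
    """Union of 'all' searches over every subset of the given size."""
    found = set()
    for subset in combinations(query_terms, subset_size):
        found |= all_search(index, subset)
    return found


docs = {1: ["a", "b", "c"], 2: ["a", "b", "x"], 3: ["x", "y", "z"]}
index = build_index(docs)
print(candidates(index, ["a", "b", "c"], subset_size=2))
```

The number of subsets grows quickly with query length, so in practice only a handful of subsets would be issued before the proximity-based scoring is applied to the gathered documents.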

Our previous study had shown very little difference in accuracy among the three PBP methods. The length and term-frequency factors were improved through trial and error and were shown to improve the results of some queries. However, the number of queries affected is negligible compared to the number of queries issued. In addition to this, the term-frequency factor is very costly on dynamic corpora as it requires that the term-frequency over the whole corpus be known. The length factor does not decrease efficiency noticeably, but does not prove generally useful either. Both of the modified PBP methods will be discarded in future studies.

Many query results were improved by the proximity-based method, and several queries failed in all methods except for the proximity-based methods. One example is the query “tammy baker, boy did she loose some weight!” which all three PBP methods matched against the text: “… Kathy Straker, boy could she lose some weight …” and correctly ranked this document first. Despite the document containing the seemingly uncommon terms “tammy” and “baker”, the search engine methods and the vector space model were still unable to locate the correct document within the top 100 results.

Some queries were less successful to varying degrees. One query ranked the correct document highly, but since the query was poorly remembered, the result was not the first: “I wasn't even supposed to come in today”. The correct document is Clerks. The top four ranked documents from the PBP-tf method matched the following text:

• Ed-TV: even talks to me and then today I come
• Pet-Sematary: wasn't even supposed to be a sprain today, my friend--that's what I
• 25th-Hour: supposed to be at work in a couple hours. Christ, I can't even imagine working today
• Clerks: even supposed to be here today!

Some other examples of lower performing queries are “na ... na ... not going to work her any more” which was intended to match “Naga ... Naga-worker here anyway!” and “as soon as you stop looking at the reason you become a numerologist” which was intended to match “You become a numerologist. What you need to do is take a break from your research”.

An example of one query which failed in all methods and appears to be difficult to improve is "thank you for calling initrode, please hold. Thank you for calling initrode, please hold" which was intended to match “Corporate Counsels Payroll, Nina speaking. Just a moment”. Some of the terms are found in the desired document and one term, “initrode”, is unique to it, but a method which is designed to make use of uncommon terms, such as VSM, also failed.

There were many instances where the correct document was found within the top 100, but tied with so many other documents in its score that the resulting rank was unsatisfactory. This typically occurred with queries made up mainly of common terms. These common terms often appeared near one another. The correct document could have its rank improved by awarding value to text more closely resembling the structure of the original query.

Table 3. Reasons for and numbers of failed queries

Reason for failed query       Movie Scripts   Song Lyrics
Not enough correct words                  4             2
Inaccurate document                       3             0
Text does not exist                       2             0
Window too small                          1             0
Spelling error                            1             2
Too common                                1             2
Too many tied scores                      3             1
Mistaken phrase common                    4             2
Smaller window found                      2             1
TOTAL FAILED                             21            10

Table 3 shows the results of an examination of failed queries, that is, queries for which the correct document was not ranked within the top ten results. Interestingly, queries containing only a few correct words resulted in only common words being matched and in documents being returned that actually contained terms which the desired document did not have. Another notable reason was that the user’s remembered phrase did not exist in the documents. Another common reason for failed queries was beyond the user’s control: the user remembered the quote correctly or fairly accurately, but the document did not contain the text. Many of these conditions are unlikely to be improved, resulting in a minimum expected continued failure rate of about 10%.

It is unsurprising that search engines performed poorly compared to the proximity-based methods. Approximate phrase matching involves a high probability that many of the terms provided will be present with a low probability that the phrase provided will be exactly correct. The “any” and “all” searches are far too relaxed and too strict, respectively, to adequately handle queries of this nature.

Summary and future research

Keyword searching, as performed by the vector space model and the search engines tested in this study, has rather poor performance for the purpose of retrieving the document containing a phrase which may not be remembered correctly. The highest performance seen in the keyword methods tested was approximately 55%. Our proximity-based methods presented here improved on the keyword methods by at least 25%, on average. In addition to this, the increased level of complexity in formulating our queries over the complexity of forming keyword queries is very small.

The impressive accuracy rate and the simplicity of our proximity-based method indicate that our method would make a promising additional search engine method for phrase-searching tasks.

The small percentage of queries which failed can be attributed to significant user error, document inaccuracy and the need to further develop our method. Based on the number of queries which failed for reasons beyond our control, we predict that our method will be unable to surpass an accuracy of approximately 90%, so we set this as our goal for future experiments.

The datasets used in our experiments varied greatly in the number of documents and document length. A high accuracy achieved in both experiments indicates probable success on the datasets chosen for our future experiments. The common property of the datasets tested here is that the words are likely to be memorized, due to the repetitive nature of songs, some lines in movies being memorable due to humor or shock, or simply from repeated exposure in both instances. It is far less likely for a person to remember a large portion of a line from an academic paper, for example. Documents which discuss a certain topic are more likely to be remembered by concept and not by content. However, it would be interesting to see what sorts of queries users might have when trying to locate documents relevant to a topic or specific documents on a topic and how often, if at all, a proximity-based method may aid in their search. An analysis of queries and how often query terms appear near one another relative to how often they appear in the same document may provide some insight into the potential usefulness of the proximity-based method on significantly different corpora.

The most likely improvement to be made is that of distinguishing between documents which currently have the same score. The length-based and frequency-based score values did not achieve this effect. We will be performing an experiment intended to achieve this, using a new scoring system which awards value to pairs of terms correctly ordered relative to one another and to pairs of terms which meet proximity constraints. We hypothesize that this will increase the number of top-ranked documents and the number of documents within the top ten. An example of a query which appears as though it would benefit from the above-mentioned factors is “yeah, I'm gonna have to get you to come in on sunday too” intended to match “Oh, oh, yea…I forgot. I'm gonna also need you to come in Sunday too.” Most of the terms in the query are very frequent terms that appear often near one another in a wide variety of structures.

Our goal is to improve the robustness of the proximity-based method to achieve accuracy as close to 90% as possible and ensure users are able to perform phrase searches on larger corpora and potentially the web without further increasing the complexity of query formulation.

References

Eastman, C. M., Jansen, B. J. Coverage, Relevance, and Ranking: The Impact of Query Operators on Web Search Engine Results. ACM Transactions on Information Systems (TOIS), 2003.

Jansen, B. J. How are we searching the World Wide Web? A comparison of nine search engine transaction logs. Information Processing & Management, Volume 42, Issue 1, 248-263, January 2006.

Jansen, B. J. The effect of query complexity on Web searching results. Information Research, Volume 6, No. 1, October 2000.

Jansen, B., Pooch, U. (In Press). Web user studies: a review of and framework for future research. Journal of the American Society for Information Science, 2000 (draft).

Keen, E. M. Some aspects of proximity searching in text retrieval systems. Journal of Information Science 18, pp. 89-98, 1992.

Lee, D. L., Chuang, H., Seamons, K. Documents Ranking and the Vector Space Model. Software, IEEE, Vol. 14, Issue 2, Mar. 1997.

Lyrics Download.com site: accessed 2007. http://www.lyricsdownload.com/

Patterson, K., Watters, C., Shepherd, M. Document Retrieval using Proximity-based Phrase Searching. Proceeding of the 41st Hawaii International Conference on System Sciences, Hawaii, 2008.

Rasolofo, Y., Savoy, J. Term Proximity Scoring for Keyword-Based Retrieval Systems. Advances in Information Retrieval: 25th European Conference on IR Research, ECIR 2003, Pisa, Italy, April 14-16, 2003. Proceedings.

Sadakane, K., Imai, H. Text Retrieval by using k-word Proximity Search. Database Applications in Non-Traditional Environments, 1999.

Salton, G., McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983.

The Internet Movie Script Database (IMSDb) site: accessed 2007. http://www.imsdb.com/