
www.elsevier.com/locate/worpatin

World Patent Information 30 (2008) 238–243

Advanced document retrieval techniques for patent research

James F. Ryley a,*, Jeff Saffer b, Andy Gibbs c

a SumoBrain, 8605 Chapel View Drive, Ellicott City, MD 21043, USA
b SciWit, 2765 14th Street, Boulder, CO 80304, USA

c PatentCafe, 441 Colusa Avenue, Yuba City, CA 95991, USA

Abstract

Latent semantic indexing (LSI) can be used in patent searching to overcome drawbacks of Boolean searching and to give more accurate retrieval. LSI combines the vector space model (VSM) of document retrieval with singular value decomposition (SVD), using linear algebra techniques to uncover word relationships in the text. Results can be enhanced by using text clustering, by tailoring SVD parameters to the specific corpus (in this case, patents), and by employing techniques to address ambiguities in language.
© 2008 Elsevier Ltd. All rights reserved.

Keywords: Latent semantic indexing; LSI; Vector space model; VSM; Singular value decomposition; SVD; Text mining; Clustering; Patents

1. The document retrieval challenge

Patent searchers have one objective as they prepare to plunge into a large patent data collection: to identify the documents that are most relevant to their research. Traditionally, this has been accomplished by crafting search queries composed of a string of keywords and Boolean operators that retrieve relevant documents. In the patent search world, "keywords" typically include topic-specific terms, patent classifications, and inventor or applicant names.

A fundamental drawback to document retrieval using keyword ("lexical") matching is the inability of the process to identify documents which, for a variety of reasons, simply do not contain those keywords. With a Boolean search engine, there is little that can be done about this problem other than to employ subject matter experts who attempt to create queries using every possible synonym that could be used to describe an invention. Needless to say, this is a daunting task.

As the stakes in the patent world have increased, so have the consequences of not finding the most relevant documents. New obviousness standards in recent decisions such as KSR vs. Teleflex, Leapfrog Enterprises, Inc. vs. Fisher-Price, Inc. (in which the court decided that applying modern electronics to an electro-mechanical learning device was obvious), and others suggest an increasingly demanding requirement: not only to retrieve art that literally matches the keywords, but to find relevant art that discloses the meaning of an invention, even if the invention or technology arose at a time when different jargon or means were used, or in industries that simply used different terminology to describe a similar function, process, or result. The transition from document retrieval based on lexical matching (that is, what is literally written in a document) to semantic retrieval (based on the meaning of the words contained in a document) requires sophisticated computational methods. Latent semantic indexing (LSI) is one such method; it facilitates document retrieval in response to text queries that describe the concept, or meaning, of interest rather than just the literal words.

LSI promises more accurate retrieval of information by incorporating statistical information on term meaning and frequency while retrieving documents in response to a search. LSI's precision and accuracy have been proven many times on test corpora, but the world's patent literature poses a significant challenge to effectively implementing an LSI search engine, due to the size and heterogeneity of the patent corpus. Some of the factors which must be addressed to realize the goal of a more accurate patent search engine are discussed herein.

2. Introduction to LSI

LSI is an advanced document retrieval technique created by combining the vector space model (VSM) of document retrieval and singular value decomposition (SVD) [1]. In the vector space model, each document is plotted as a point in a multi-dimensional space. The location of that point is determined by the terms contained in the document and the frequency with which those terms appear. Each individual term represents a dimension in this space. Much as x, y, and z may be taken to represent the dimensions of traditional three-dimensional space, "cell", "phone", "stem", etc. could be dimensions in some higher-order space used to describe documents containing these terms, as illustrated in Fig. 1.

Fig. 1. Vector space model.

Effectively, then, the documents are represented by vectors in a potentially very high-dimensional space. Such vectors can be used for many purposes, such as comparing documents and querying databases. In the case of queries, for example, a query submitted to the system is assigned a position in the same high-dimensional space used for documents. Relevancy between the query and the documents may then be determined on the basis of the distance or angle between the query vector and the document vectors.
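
The following is a minimal sketch of these mechanics in Python, assuming raw term counts and cosine similarity between vectors; the vocabulary and documents are invented for illustration, and a production engine would use sparse matrices and a weighting scheme such as tf-idf.

# Minimal vector space model: documents and queries become term-count
# vectors, and relevancy is the cosine of the angle between them.
import numpy as np

def term_vector(text, vocabulary):
    # Map a document to a raw term-frequency vector over a fixed vocabulary.
    tokens = text.lower().split()
    return np.array([tokens.count(term) for term in vocabulary], dtype=float)

def cosine(a, b):
    # 1.0 = identical direction, 0.0 = orthogonal (no shared terms).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

vocabulary = ["cell", "phone", "stem", "battery"]
docs = ["cell phone battery life", "stem cell research"]
query = term_vector("cell phone", vocabulary)
for doc in docs:
    print(doc, "->", round(cosine(query, term_vector(doc, vocabulary)), 3))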

VSM-based document retrieval excels at ranking the relevancy of the documents found. However, like Boolean search engines, the weakness of VSM is its reliance upon literal vocabulary for word matching. For example, if a user searches for the term "car", documents that do not contain the word car will not be retrieved, even if they do contain words such as "auto", "automobile", or "vehicle". This is a serious drawback given that there are typically many ways to describe the same concept (often referred to as the problem of synonymy). In patent literature in particular, the legalese, perhaps purposely obfuscated, magnifies the problem.

In LSI, singular value decomposition is used to enhance VSM by using algebraic techniques to automatically ascertain word relationships that may have a bearing on document relevancy. For example, due to co-occurrence, SVD can uncover that "car", "auto", "automobile", and "vehicle" are somehow related.

This relationship is not necessarily one of synonymy (though it may be), but rather a correlation that allows the system to tell the user, in effect, "If you are interested in documents containing 'car' you may also want to consider documents containing the related words 'auto', 'automobile' or 'vehicle'". The end result is a search engine that can find documents on relevant topics even if the document contains none of the search terms. We refer to the ability to find related words or concepts that were not specified in the query as "conceptual search".
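
A small numerical sketch of this effect (the toy matrix, terms, and rank are invented for illustration): after truncating the SVD to the two strongest concepts, "car" and "auto" end up close together in the reduced term space because they share contexts such as "engine" and "wheel", even though they never appear in the same document.

# Toy term/document matrix; rows = terms, columns = documents.
import numpy as np

terms = ["car", "auto", "engine", "wheel", "flower", "garden"]
#              d1 d2 d3 d4 d5 d6
A = np.array([[1, 0, 1, 0, 0, 0],   # car
              [0, 1, 0, 1, 0, 0],   # auto
              [1, 1, 1, 0, 0, 0],   # engine
              [1, 1, 0, 1, 0, 0],   # wheel
              [0, 0, 0, 0, 1, 1],   # flower
              [0, 0, 0, 0, 1, 1]],  # garden
             dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                          # keep the two strongest concepts; the discarded
term_space = U[:, :k] * s[:k]  # components are what separate car from auto

def sim(t1, t2):
    # Cosine similarity between two terms in the reduced concept space.
    a, b = term_space[terms.index(t1)], term_space[terms.index(t2)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

print("car ~ auto:  ", round(sim("car", "auto"), 2))    # high: shared contexts
print("car ~ flower:", round(sim("car", "flower"), 2))  # near zero: unrelated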

3. Mathematical challenges in implementing LSI

The increase in accuracy of LSI over simple term matching is potentially substantial. In fact, on standard test corpora, LSI consistently outperforms VSM and other models by up to 30% [1], and has repeatedly set precision and recall benchmarks [2–5]. However, the design of an LSI-based search engine is far more complex than that of vector space model or Boolean-based search engines.

Many design choices, and the nature of the document collection, can affect the performance of an LSI-based search engine. For example, some groups have found that LSI does not perform well on large document collections [6–8]. This is of particular concern to the patent analyst, since the world patent literature is far, far larger than the traditional test databases used in academic research. Others have found that poorly chosen parameters, e.g. "k", the rank chosen for SVD calculations, can substantially reduce LSI recall and precision [9–11]. Decisions on how to deal with pre-processing of document terms, clustering, stemming, stop words, and other topics also have the potential to affect the precision and recall of an LSI-based search engine.

With respect to one example in the literature of using an LSI-based search on patent documents, Moldovan et al. found only minimal improvement over VSM when using LSI on subsets of the US patent corpus, and in fact found that in some cases LSI actually performed slightly worse than the vector space model [11]. However, while Moldovan tested multiple values for k, k seems to be the only parameter that was thoroughly investigated. Moldovan's pre-processing parameters were seemingly arbitrarily chosen: the Porter stemming algorithm was used exclusively (with no mention of performance tests done without stemming, or of other stemming algorithms such as Snowball), the SMART stop word list was used (which is almost certainly not optimal for patent documents), and no clustering was performed. Consequently, this work cannot be taken as a realistic assessment of the potential of LSI for patent documents.

Given the number of factors involved in optimizing an LSI-based search engine, it is unsurprising that some groups find that, without comprehensively addressing all of those factors, they cannot consistently replicate the performance advantages of LSI over VSM or Boolean engines. Nevertheless, LSI is a powerful tool for patent document analysis and retrieval, and we will address several factors that must be heeded when creating an LSI-based search engine for patents.

4. Clustering

Large document collections can pose a problem for LSI [6–8] if not treated appropriately. Lack of homogeneity between document topics exacerbates the problem. This has been attributed to the elevated noise level naturally occurring in very large, heterogeneous collections. The world's patent documents are clearly both very large and very heterogeneous.

The main problem with large, heterogeneous document collections is that singular value decomposition becomes less able to discriminate noise from information as the corpus size increases. One of the reasons for this is the problem of polysemy (one word having multiple meanings). Large document collections also pose computational problems, requiring a prohibitive amount of memory for SVD calculations.

However, clustering can address these problems [12]. Clustering breaks a document collection into sub-collections with improved homogeneity, which are then treated separately for the purposes of SVD. Numerous clustering algorithms are available [13], and they can roughly be divided into agglomerative and divisive techniques.

Agglomerative techniques start with a single document and then find its nearest neighbor (forming a two-document cluster). The next two closest documents are likewise joined, and the process (if carried to conclusion) is repeated until all documents are part of a single cluster. Agglomerative techniques are also called "hierarchical", because a tree structure is formed which can show the relatedness of the various documents and clusters by the iteration in which they were joined (the earlier in the process two documents fall into the same cluster, the more related they are). In practice, unless one wished to examine the hierarchical structure in its entirety, the process would not need to be carried out to the point where only a single cluster was left; in fact, that would defeat the purpose. The process would be terminated when the average cluster size, or other metrics, met user-defined parameters.

Divisive, or partitional, techniques work in the opposite manner. The starting point is the entire corpus. The collection is then split repeatedly, assigning documents to the clusters most similar to them (often as measured by traditional vector space model relevancy ranking). The process is terminated when clusters are obtained with the appropriate characteristics for processing with SVD. K-means, and its variant, bisecting k-means, are perhaps the most frequently used partitional methods due to their accuracy and mathematical tractability for large collections [14]. A sketch of both appears below.
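
As a sketch of the partitional approach (dense numpy arrays and Euclidean distance are used here for brevity; cosine distance is more common for text, and production systems would use a dedicated toolkit such as CLUTO, discussed below):

import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Cluster the rows of X into k groups; returns (labels, centroids).
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each document vector to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def bisecting_kmeans(X, n_clusters):
    # Repeatedly split the largest cluster in two until the desired number
    # of clusters is reached (assumes enough distinct documents to split).
    labels = np.zeros(len(X), dtype=int)
    while labels.max() + 1 < n_clusters:
        biggest = np.bincount(labels).argmax()
        idx = np.where(labels == biggest)[0]
        sub, _ = kmeans(X[idx], 2, seed=int(labels.max()))
        labels[idx[sub == 1]] = labels.max() + 1
    return labels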


Two very important characteristics of a cluster with respect to LSI and SVD are the number of documents and the internal cluster similarity (a measure reported by commonly available clustering software, such as CLUTO [15]). Reducing the number of documents in a cluster reduces the rank of the SVD matrices, which reduces the memory and time required for computation. High internal similarity in a cluster is indicative of topic homogeneity, which reduces the problem of polysemy. However, one must be careful not to over-cluster: since each cluster is treated separately during the SVD calculations, overly small clusters reduce SVD's ability to create the proper correlations between conceptually related terms.

5. Singular value decomposition parameters

Once the database is clustered, singular value decomposition must be performed. Singular value decomposition is a linear algebra technique that is applied to a document/term matrix (that is, a table containing an entry for each document in the database and a list of each term that occurs in a given document, with an indication of how many times that term occurs) to generate a second matrix of reduced rank.
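
In standard SVD notation (the symbols below are conventional, not taken from the article): the m × n term/document matrix A factors into matrices of term and document vectors and a diagonal matrix of singular values, and keeping only the k largest singular values yields the reduced-rank approximation.

% A: m x n term/document matrix; U, V have orthonormal columns;
% \Sigma holds the singular values in decreasing order.
A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T} \approx A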

"Rank", as used loosely with respect to SVD here, refers to the shorter dimension of the matrix (formally, the rank of the matrix can be at most this value). For example, a matrix containing information on 10,000 documents, with a total of 50,000 terms, would be said to have rank 10,000. "k" is used to refer to the rank retained in the output, and the choice of k for the SVD operation is left to the implementer. In practice, values between approximately 100 and 300 have often been found to give the best performance.

The retained rank can be interpreted as the number of concepts present in the documents. If too low a k is chosen, multiple concepts may be inappropriately merged into a single concept. If k is too high, then noise is being modeled, and the appropriate relationships between vocabulary and concepts may not be determined. In fact, if the output rank were chosen to equal the input rank, no conceptual merging would occur and the results would be equivalent to those of VSM.

The choice of k has been shown to have a substantial effect on accuracy [9–11], and no satisfactory method exists for determining the optimum k from theory; empirical testing must be done. This is challenging in a corpus as large and diverse as patents, but cannot be ignored if the best accuracy is to be achieved.
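
A sketch of what such empirical testing might look like (the query fold-in formula is the standard LSI projection; the corpus matrix A, the evaluation queries Q, the relevance judgments R, and the precision-at-1 metric are placeholders chosen for illustration):

import numpy as np

def truncate(A, k):
    # Rank-k truncation of the document/term matrix.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]

def doc_scores(Uk, Sk, Vtk, query_vec):
    # Fold the query into concept space, then score every document by cosine.
    q = np.linalg.pinv(Sk) @ Uk.T @ query_vec
    D = (Sk @ Vtk).T  # one row per document, in concept space
    return (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-12)

def mean_precision_at_1(A, queries, relevant, k):
    # Fraction of queries whose top-ranked document is a known relevant one.
    Uk, Sk, Vtk = truncate(A, k)
    hits = [doc_scores(Uk, Sk, Vtk, q).argmax() in rel
            for q, rel in zip(queries, relevant)]
    return float(np.mean(hits))

# best_k = max(candidate_ks, key=lambda k: mean_precision_at_1(A, Q, R, k))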

6. Computational challenges in implementing LSI

As seen from the discussion above, LSI is based on a mathematical decomposition of a document/term matrix. As such, the computational challenge is related not only to the number of documents, but also to the sheer number of terms; that is, the size of the overall vocabulary. For patents, where technical language abounds, this can be particularly problematic. For example, there are over 3.5 million unique terms in the collection of 1976–2006 US drug patents (US class 424). For comparison, consider that the second edition of the Oxford English Dictionary defines only 615,100 word forms [16].

A common approach in LSI, as well as in other text analysis methods, to reduce the effective size of the vocabulary is to apply a stopword list. Stopwords are terms that are likely to add little value to semantic analysis and can potentially be ignored in the application of the mathematical methods. Such a list of terms is usually predefined and contains common terms such as "the", "and", and so on. Obviously, the larger the stopword list, the smaller the operative vocabulary, and therefore the more tractable the necessary calculations become. However, there can be an informational cost to eliminating terms from any analysis. For example, "already" is often contained in stopword lists, yet the temporal implication of words like "already" and "before" can have great importance in some analyses. In most LSI and related text mining applications, the end user has the ability to modify the stopword list. In doing so, one must balance the computational gains against the information loss. This is often compounded by the fact that in many implementations of LSI, the stopwords are not only ignored in defining the high-dimensional document space, but are not even indexed for simple searching (which is true of many Boolean search engines as well).
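
A sketch of the kind of user-adjustable stopword handling described above (the default list and the "keep" override are invented for illustration):

# Vocabulary reduction with a user-adjustable stopword list.
DEFAULT_STOPWORDS = frozenset({"the", "and", "of", "a", "to", "already"})

def build_vocabulary(documents, stopwords=DEFAULT_STOPWORDS, keep=frozenset()):
    # Collect index terms, dropping stopwords unless explicitly kept.
    effective = set(stopwords) - set(keep)
    vocab = set()
    for doc in documents:
        vocab.update(t for t in doc.lower().split() if t not in effective)
    return sorted(vocab)

docs = ["the drug was already approved", "a method to deliver the drug"]
print(build_vocabulary(docs))                    # "already" is lost
print(build_vocabulary(docs, keep={"already"}))  # "already" is retained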

7. Linguistic challenges in implementing LSI

The mathematical methods of LSI are inherently unaware of the complications and nuances of language usage. Some of the issues, such as stemming, are dealt with directly through automatic algorithms or predefined lists.

Dealing with synonyms is more problematic. As discussed above, LSI can uncover conceptual relatedness, but it does not define actual equivalences. This can be a serious drawback when searching patents. For instance, the drug commonly known as Valium (whose name is a registered trademark of Roche) is also known as diazepam and by over 170 other names. A searcher would like to be able to find all documents referencing Valium, regardless of the synonym used. LSI cannot be guaranteed to find such synonyms, and it may be appropriate to give the user control over the thesauri used in LSI, allowing adjustment of the list of synonyms for a given search.

Polysemy, one word having multiple meanings, is also a substantial challenge. The human reader of a document can easily recognize whether the term "bear" refers to a large mammal or to an ability to hold up or support. For a computer, however, this is a difficult computational problem, and polysemy abounds in language. Beyond normal words, abbreviations are another source of polysemy: over 80% of the abbreviations in the Unified Medical Language System (UMLS) are ambiguous [17]. Thus, the application of term-based mathematical methods which make no distinction between the multiple meanings of a single term reduces retrieval precision. In general, when considering a collection of patents in a closely related area, polysemy is less prevalent; hence the importance of clustering.

LSI and other text mining methods can deal with polysemy, in part, by adding steps to the overall process. For example, natural language rules can be applied to determine the part of speech (such as noun or verb), which can resolve some polysemy issues (e.g. "bear" as a noun is likely to be an animal, while "bear" as a verb is something very different). Polysemy can also be partially addressed through attention to case usage. For instance, the term "he" as a pronoun might be included on a stopword list, whereas the term "He" as the symbol for the element helium could be critical to the topic of interest. Simply treating all text as case-sensitive, however, is not a solution, as capitalization occurs for other reasons, including the beginning of sentences. Thus, substantial logic rules are necessary to adequately deal with the various types of capitalization. The reward, though, is substantial. A simple illustration of the idea follows.
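
As an illustration only (the symbol list below is a stand-in, and a real system would need far more elaborate rules): lowercase sentence-initial tokens, but preserve mid-sentence capitals that match known case-significant symbols.

# Toy case-normalization heuristic; CASE_SIGNIFICANT is a hypothetical list.
CASE_SIGNIFICANT = {"He", "As", "In", "No"}  # element symbols etc.

def normalize_case(sentence):
    out = []
    for i, tok in enumerate(sentence.split()):
        if tok in CASE_SIGNIFICANT and i > 0:
            out.append(tok)          # mid-sentence capital: keep (e.g. helium)
        else:
            out.append(tok.lower())  # sentence-initial or ordinary token
    return out

print(normalize_case("He said the He isotope was enriched"))
# ['he', 'said', 'the', 'He', 'isotope', 'was', 'enriched']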

Multi-word terms, single conceptual entities represented by multiple words, are another linguistic challenge for LSI. "Acquired immunodeficiency syndrome" is an example of a multi-word term. Each component has its own meaning, but considered together the components define a very specific concept, in this case a disease. Thus, mathematical methods that do not recognize multi-word terms ignore potentially critical concepts. This can, however, be overcome when the end user can predefine such terms so that the software treats the phrase as a single entity, as sketched below. Automatic discovery methods such as TRUCKS [18] can also be used in conjunction with LSI.
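
A sketch of the predefined-phrase approach (the phrase list is user-supplied; the joined-token convention is an arbitrary choice for the example):

# Treat predefined multi-word terms as single tokens before indexing.
PHRASES = {("acquired", "immunodeficiency", "syndrome"):
           "acquired_immunodeficiency_syndrome"}

def merge_phrases(tokens, phrases=PHRASES):
    out, i = [], 0
    while i < len(tokens):
        for phrase, joined in phrases.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(joined)   # whole phrase becomes one index term
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_phrases("treatment of acquired immunodeficiency syndrome".split()))
# ['treatment', 'of', 'acquired_immunodeficiency_syndrome']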

8. Applying LSI in the real world

Although, for comparison purposes, we tend to talk about Boolean, VSM, and LSI as separate entities, this is not always true in the real world. In practice, combinations of the various technologies can provide better practical solutions than any of the technologies alone.

For example, by combining LSI with traditional keyword filters (Boolean), the result set from a Boolean search can be ordered using the more accurate LSI ranking methods. Additionally, by combining a Boolean search with the post-SVD LSI data (which includes word relationship information), a Boolean search can be performed while still retaining the ability to find documents that do not necessarily contain the search terms entered by the user.
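
One such combination as a sketch (lsi_doc_vectors and fold_in_query stand for the post-SVD data and query projection described above; both names are hypothetical):

import numpy as np

def boolean_then_lsi(docs_text, lsi_doc_vectors, fold_in_query,
                     query_text, required_terms):
    # Boolean stage: keep only documents containing every required term.
    candidates = [i for i, text in enumerate(docs_text)
                  if all(t in text.lower().split() for t in required_terms)]
    # LSI stage: rank the survivors by concept-space cosine similarity.
    q = fold_in_query(query_text)
    def score(i):
        d = lsi_doc_vectors[i]
        return float(d @ q / (np.linalg.norm(d) * np.linalg.norm(q) + 1e-12))
    return sorted(candidates, key=score, reverse=True)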

These combinations can take many forms, and can be thought of as a Boolean search taking advantage of LSI ranking and concept determination, or as an LSI search capable of using Boolean filters to limit the result set. In either case, the combination of the two technologies offers capabilities that neither can offer alone.

9. Looking beyond LSI

While LSI forms the technology basis of emerging, higher-performing search engines, it does have some inherent limitations. For example, when a new area of technology emerges, the number of documents covering the topic may initially not be enough to uncover term relations and therefore resolve problems of synonymy and polysemy. This limitation can reduce the reliability of search results in the challenging universe of patents.

Building on LSI, some latent semantic analysis (LSA) models now incorporate additional co-occurrence databases that allow mapping of synonymous concepts using data beyond that found in the corpus, thereby alleviating the problem of limited statistical information in the corpus itself. Improvements on LSI also include additional filtering capability based on user-selected keywords, and enhanced self-learning capabilities.

Additionally, variations on LSA exist, such as probabilistic latent semantic analysis (PLSA) [19], that may capture polysemy and synonymy better than LSI. However, while research continues, and models based on new theories promise greater distinction between the lexical and semantic levels of text recognition, they may also create even more challenging barriers to practical implementation, and they have yet to be proven on something as complex as the world's patent documents.

10. Conclusion

Creating an accurate LSI-based patent search engine involves many choices, some of which are still active areas of research with no known optimal solution. One of the main issues which must be addressed is that large collections must be clustered carefully, to reduce the problem of polysemy and to increase computational tractability, while not over-clustering. Additionally, an appropriate k must be chosen during SVD so as not to force false associations between terms with a low k, while at the same time not modeling noise with a high k, thereby losing the advantage that LSI has over the vector space model. Other topics to consider include the treatment of stop words, rare words, word stemming, term frequency normalization, and field weighting.

Hopefully, an awareness of some of the details involved in constructing an LSI-based search will help patent information professionals interpret the apparent discrepancies in the literature with regard to the usefulness of LSI for patent searching (which can be difficult given that not all papers report k, the internal similarity of clusters, the numbers of documents and terms, and other important parameters).

We are confident that LSI is a valuable tool for document retrieval, both in the patent industry and in other fields. But, for best performance, great care must be taken when implementing an LSI-based engine to tailor the specific LSI implementation to the document collection at hand. Additionally, the most versatile LSI-based (and other) search engines will allow users to create their own stopword and synonym lists when appropriate, and to combine Boolean and LSI-based tools into powerful workflows that help the user find documents relevant to their needs as quickly and reliably as possible.

Acknowledgement

This article incorporates content from several presentations at the PIUG 2007 Annual Conference: Boolean and Beyond – Effective Patent Searching and IP Management, 5–10 May 2007, Costa Mesa, CA.

References

[1] Deerwester S, Dumais S, Landauer T, Furnas G, Harshman R. Indexing by latent semantic analysis. J Am Soc Inform Sci 1990;41(6):391–407.
[2] Text retrieval conference (TREC).
[3] Dumais S. LSI meets TREC: a status report. The first text retrieval conference (TREC1). National Institute of Standards and Technology Special Publication; 1993. p. 137–52.
[4] Dumais S. Latent semantic indexing (LSI) and TREC-2. The second text retrieval conference (TREC2). National Institute of Standards and Technology Special Publication; 1994. p. 105–16.
[5] Dumais S. Latent semantic indexing (LSI): TREC-3 report. The third text retrieval conference (TREC-3). In: Harman D, editor. NIST Special Publication; 1995. p. 219–30.
[6] Chen CS, Stoffel N, Post M, Basu C, Bassu D, Behrens C. Telcordia LSI engine: implementation and scalability issues. In: Proceedings of the 11th international workshop on research issues in data engineering; 2001. p. 51–8.
[7] Bassu D, Behrens C. Distributed LSI: scalable concept-based information retrieval with high semantic resolution. In: Proceedings of the 3rd SIAM international conference on data mining (text mining workshop); 2003.
[8] Husbands P, Simon H, Ding C. Term norm distribution and its effects on latent semantic indexing. Inform Process Manage 2005;41(4):777–87.
[9] Ding C. A similarity-based probability model for latent semantic indexing. In: Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval; 1999. p. 58–65.
[10] Kontostathis A, Pottenger W. A framework for understanding LSI performance. In: Proceedings of the ACM SIGIR workshop on mathematical/formal methods in information retrieval; 2003.
[11] Moldovan A, Bot R, Wanka G. Latent semantic indexing for patent documents. Germany: Technische Universität Chemnitz, Fakultät für Mathematik; Preprint, 2004.
[12] Gao J, Zhang J. Clustered SVD strategies in latent semantic indexing. Inform Process Manage 2004;41:1051–63.
[13] Jain A, Murty M, Flynn P. Data clustering: a review. ACM Comput Surveys 1999;31(3):264–323.
[14] Steinbach M, Karypis G, Kumar V. A comparison of document clustering techniques. In: KDD workshop on text mining; 2000.
[15] Karypis G. CLUTO – a clustering toolkit. University of Minnesota, Computer Science and Engineering Technical Report Abstract; 2002.
[16] OED, dictionary facts, in OED online. Oxford University Press; 1989.
[17] Liu H, Johnson SB, Friedman C. Automatic resolution of ambiguous terms based on machine learning and conceptual relations in the UMLS. J Am Med Inform Assoc 2002;9:621–36.
[18] Maynard D, Ananiadou S. TRUCKS: a model for automatic multiword term recognition. Nat Language Process 2000;8(1):101–26.
[19] Hofmann T. Probabilistic latent semantic analysis. In: Proceedings of the 15th conference on uncertainty in AI; 1999.

James Ryley is the President of Patents Online, LLC, a patent data and analytics firm whose web sites include www.SumoBrain.com and www.FreePatentsOnline.com, both specializing in sophisticated searching and analysis of the world's intellectual property documents. Previously with Patent Complete, LLC, a patent searching firm specializing in technology-related patent searches, James holds a B.S. in Cell Biology and Genetics and a Ph.D. in Molecular Biology, and has extensive patent searching, programming and data analysis experience.

Jeff Saffer is President and co-founder of SciWit, a company focused on science-based intelligence from large volumes of data. Previously, Jeff was President and CTO of OmniViz. Before founding OmniViz, Dr. Saffer was Head of the Molecular Biosciences Department at the Pacific Northwest National Laboratory. After receiving his Ph.D. in Molecular Biophysics and Biochemistry from Yale, Jeff was a fellow at the National Cancer Institute and then an Associate Staff Scientist at The Jackson Laboratory. Jeff has been involved in informatics for more than two decades and has a special interest in helping people understand large volumes of data.

Andy Gibbs is the Chief Executive Officer of PatentCafe, a provider of Latent Semantic Analysis based solutions used for patent search, strategic portfolio management and qualitative patent analysis. Andy was appointed to two terms on the USPTO Public Patent Advisory Committee (PPAC), and served as Chairman of its E-government Subcommittee, advising on Patent Office IT infrastructure and software tools. He is a member of the Board of Directors of the Intellectual Property Owners Association, and a member of the Licensing Executives Society and the Patent Information Users Group. Mr. Gibbs authored the corporate patent strategy book Essentials of Patents (Wiley, English and Japanese), and over 100 articles related to patent quality and management.