Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics

World Patent Information 32 (2010) 30–38


journal homepage: www.elsevier.com/locate/worpatin

Review


Dario Bonino a, Alberto Ciaramella b, Fulvio Corno a,*

a Politecnico di Torino, Dipartimento di Automatica ed Informatica, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
b Intellisemantic s.r.l., Via Giaglione 7, 10126 Torino, Italy

Article info

Keywords: Information retrieval; Patent informatics; Semantic elaboration; Review

0172-2190/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.wpi.2009.05.008

* Corresponding author. E-mail addresses: [email protected] (D. Bonino), alberto.ciaramella@intellisemantic.com (A. Ciaramella), [email protected] (F. Corno).

Abstract

Information and meta-information related to national and international patents is a critical asset for every innovative company. The complexity of managing, searching, analyzing and relating such information to the needs of the company, in the different user tasks, is tackled by innovative knowledge management solutions that aim at supporting users in such daunting tasks.

This paper aims at presenting a comprehensive and updated overview of patent information and of innovative solutions in patent informatics, in particular concerning intelligent and semantic solutions proposed in recent years. The analysis starts from the actual requirements of different types of users of patent information, and the typical information management tasks they require. Innovations, covering all the layers from databases to algorithms to on-line services, are also critically presented and compared, and current research trends are outlined.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Over 50 million patents, scattered across national or regional databases, possibly integrated by accessory databases and up-to-date information and commentaries from the Web, form the corpus for patent search systems. The application context, the type of information, the available sources and the roles of the main actors are only a subset of the many aspects that need to be considered during the lifecycle of patent informatics solutions.

This paper provides an overview of the different issues related to patent information management and exploitation from the point of view of industrial stakeholders. Starting from the nature and structure of patent documents, their lifecycle and the related standardization efforts, the paper introduces patent databases and related information, patent users' roles and tasks, patent search tools and benchmarks, and state-of-the-art patent informatics applications based on semantics. For most aspects several research threads are outlined, both for developers and scientists, and critical challenges are highlighted.

Special emphasis is given to the increasing variety of users that can benefit from easy access to patent information. Current patent users are not only patent domain experts, but include new occasional actors such as managers, industrial researchers, academic faculty, and so on, each needing a different set of functionalities and a different degree of application complexity.

Requirements on patent informatics applications vary depending on the type of user and on the tasks that need to be accomplished, ranging from patent analysis to patent search and patent monitoring. Each of these activities is treated separately and the relevant needs and expectations are analyzed to define the set of issues that patent informatics solutions shall successfully address. Particular attention is devoted to patent search tools, and to the application of information retrieval methods, either traditional or semantic, to the patent domain.

Problems deriving from the historical lack of patent-specific benchmarks are analyzed and current initiatives aiming at filling this gap are presented.

Finally, in Section 9, an overview of the characteristics of the main patent informatics applications based on semantics is provided. Results from the academic literature, from research projects as well as from commercial providers are considered, analyzing solutions from several points of view: application functionality, characteristics, and, especially, adopted technologies.

The paper does not aim at being exhaustive but at providing a neutral overview of the domain complexity and of the many research challenges in this exciting, but relatively unexplored application field.


Table 1. Patent lifecycle.

| Phase | When | Information disclosed |
|---|---|---|
| Patent application | Filing date | Some metadata |
| Patent application – published | 18 months after the filing date | Full text and metadata |
| Granted patent | 2 years or more after the filing date | Amended full text and revised metadata |
| Expired or ceased patent | 20 years after the filing date (or before in case of unpaid fees) | Amended full text and revised metadata |

Table 2. Patent-related information sources.

| Information sources | Information provided | Degree of interest |
|---|---|---|
| Free portals of national or regional patent offices | Patent text and metadata | Always of interest, even in combination with complementary portals |
| Vendor-provided patent databases | Patent text and metadata | Business critical decisions |
| Vendor-provided extended content of patent information | Patent mutual citations, patent augmented abstracts | Useful |
| Patent legal databases | Patent legal status and events (e.g., litigations) | Always significant |
| Scientific literature portals | Scientific papers | For identifying the scientific background |
| Business databases | Company data and affiliations | For integrating the business background information |
| News | Business news | For integrating the business background information |
| Internet | Company portals, sector portals | For low cost identification of the scientific or business background |
| Web 2.0 solutions | Wiki and blogs for patents | Also for supporting the cooperative work |


2. Patents as information sources

To fully exploit information embedded in patent documents, all the available literature must be considered, including expired and unexpired patents, technical and scientific papers, and so on, requiring a great effort in information management. Furthermore, according to Table 1, the amount, the nature and the quality of available information (last column) change significantly during the patent lifecycle.

Patents play a strong role in information dissemination, allowing the distribution of up-to-date, validated, technical and scientific information. Some studies report that a significant part of the information available in patents is actually unique and cannot be found elsewhere; see for example chapter 4 in [6], or the USPTO "Eighth Technology Assessment and Forecast Report. Section II: The uniqueness of patents as a technological resource", which concludes (p. 37): "The results of this study show that about 8 out of 10 US patents contain technology not disclosed in the non-patent literature." Patent information is also typically more detailed and exhaustive than scientific papers. International standards, e.g. WIPO,¹ impose the sequence of sections and metadata to which a patent must conform. The standard patent structure encompasses a front page, the description, the claims, the drawings (optional), and the international search report, if available.

Patent documents show several peculiarities that must be taken into account when applying automated information processing techniques: they are usually longer than ordinary papers, with significant length variance; they adopt different writing styles in different parts; they often include multimedia data, typically drawings or mathematical or chemical formulas, which require specific analysis and classification algorithms.

Patents are provided with standard metadata: title, dates and document numbers for the application, publication and grant, list of applicants and inventors, patent knowledge domain (field and/or subfield, classified against a standard taxonomy). Metadata are part of the patent document or can be added by patent offices. Classification of the patent domain, for example, is accomplished by patent offices in a consistent way and is key information for searching patents across nations.

To better support worldwide patent searches, the World Intellectual Property Organization defines and continuously maintains and updates a standard taxonomy for patent classification, named International Patent Classification (IPC) [2]. Updates are provided every 3 months for the deepest levels of the taxonomy and every 3 years for the core classification. This evolution rate is motivated by the need to include and properly classify new, emerging technical fields, as recently happened with nanotechnology. The 8th IPC version has been available on-line since 2006 and is adopted as a lingua franca by more than 100 nations worldwide. Major patent offices including the European Patent Office² (EPO), the United States Patent and Trademark Office³ (USPTO) and the Japan Patent Office⁴ (JPO) also define and use specific metadata complementary to IPC, to include additional information and to support more advanced functionalities. They are named respectively ECLA (European Patent Classification System), USPC (US Patent Classification System) and FI (File Index) with its extension F-terms, both used by JPO.

1 The World Intellectual Property Organization (WIPO), http://www.wipo.int/.
2 http://www.epo.org/.

IPC patent classification opens several research possibilities and several applications, such as:

• the design of tools to help patent offices in classifying applications and grants according to the IPC;

• the design and development of solutions for enabling patent searchers to exploit the full potential of the IPC classification;

• the integration of textual information and IPC metadata for patent search and for patent clustering;

• the extraction and integration of information from multiple classification systems such as the EPO, the JPO and the USPTO internal classifications.
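Tools of the kinds listed above typically need to handle IPC symbols as structured values rather than opaque strings. The following is a minimal sketch, assuming symbols written in the common form "H04L 9/32" (section, class, subclass, main group, subgroup). The names `IpcSymbol`, `parse_ipc` and `shared_depth` are hypothetical, and treating the five fields as a flat sequence of levels is a simplification of the real IPC hierarchy.

```python
import re
from typing import NamedTuple


class IpcSymbol(NamedTuple):
    section: str     # one letter, A to H
    ipc_class: str   # two digits
    subclass: str    # one letter
    main_group: str  # digits before the slash
    subgroup: str    # digits after the slash


# Assumed surface form: "H04L 9/32" (the space is optional).
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IpcSymbol:
    """Split an IPC symbol into its five fields, or raise ValueError."""
    match = _IPC_RE.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a recognizable IPC symbol: {symbol!r}")
    return IpcSymbol(*match.groups())


def shared_depth(a: IpcSymbol, b: IpcSymbol) -> int:
    """Count how many leading fields two symbols have in common (0 to 5)."""
    depth = 0
    for left, right in zip(a, b):
        if left != right:
            break
        depth += 1
    return depth
```

A coarse notion of closeness between two classifications, such as `shared_depth`, is one possible ingredient for combining IPC metadata with text-based search or clustering.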

3. Patent databases and related information

As far as public access to patent information is concerned, patent documents are usually organized into databases that differ in content and available search methods [1]. Content is characterized by coverage in space and time and by the completeness of the provided documents (e.g., full text or abstract). Search methods can span from simple Boolean searches to advanced information retrieval approaches, each carrying a different degree of complexity and a different performance in the retrieval process. In addition to document completeness and search effectiveness, document formats must also be taken into account: older patents are usually available only as images (no OCR was performed during the electronic scan process), while more recent documents are available in an electronically readable form, often generated by automatic OCR processes that can introduce errors, thus affecting the quality of the single document and, as a consequence, of the whole database. Only for documents written in the last few years is some patent information available in XML [3].

3 http://www.uspto.gov/.
4 http://www.jpo.go.jp/.


Table 3. Patent search tasks.

| Task | When | Why | What | Focus |
|---|---|---|---|---|
| Patentability search | Writing a new patent application | To construct claims not affected by the prior art | A new patentable idea, with respect to the state-of-the-art | Specific |
| Validity (invalidity) search | To defend a patent application or to litigate a competitor's patent | To validate or invalidate a patent on the basis of its claims | A patent against the state-of-the-art at the application time | Specific |
| Infringement search (freedom to operate search) | Before launching a product on the market | To verify that a product can be commercialized in a given market | A product against patents still holding in the selected market | Specific |
| Technology survey | Business planning | For better focusing on present business or to analyze new business opportunities | Patents, scientific and technical publications in a given technology area | Broad |
| Portfolio survey | Business planning | To identify the technical portfolio of different players | Patents, scientific and technical publications in a given technology area | Broad |


Patent databases can be either free, as is the case for many national patent offices that provide free access to their collected databases, or subject to a fee, as with many commercial vendors,⁵ which use the background information provided by national offices and integrate it with other information, offering more accurate and customized search facilities. For example, many commercial providers integrate patent-related information such as mutual citations and abstracts selected, revised and integrated by a skilled editorial staff (e.g., The World Patent Index,⁶ recently merged with The Patent Citation Index).

Independently of their business specificity, free and commercial databases are subject to errors, including those originated by OCR recognition failures and by wrong or inconsistent metadata assignments. Typical examples are transcription errors for company and inventors' names, especially concerning far-east languages. All these errors [4,5] contribute to making the search process more difficult and shall be taken into account when providing suitable instruments to work on patent databases.

Core patent information is often not sufficient to support patent information users in their activities (see Table 2). Therefore it may be necessary to integrate patent information with information from other, complementary databases such as legal patent databases, containing the legal history of a patent, and research paper databases, for accessing referenced knowledge. Moreover, modern dynamic business scenarios require integration of more responsive information sources like forums and blogs dealing with patent issues (both general and specialist), or business portals, with reports and news, thus allowing patents to be contextualized in their specific knowledge domain. Web 2.0 mash-ups can ease this integration and make it more efficient.

A significant initiative, in this context, is the Peer-To-Patent portal,⁷ a community blog which gathers comments and reviews from volunteer patent experts in specific technology areas, mainly related to Information and Communication Technology (ICT). The initiative aims at relieving US patent examiners of the heavy backlog of waiting applications. Peer-To-Patent is both a warning signal of the increasing overload of patent offices and a revealing sign of forthcoming innovations in this field.

4. Patent information users and related tasks

Users of patent information are interested in both core and complementary information, with a balance that depends on their specific goals and roles. Companies and inventors wishing to file a new patent are interested in verifying that the invention is actually new, with reference to the current state-of-the-art. At the same time, they are interested in discovering infringements of their granted patents. Researchers are interested in finding patent information to avoid duplicating solutions already covered by patents and/or to freely reuse expired patents. Managers can exploit patent information to assess competitors, partners and suppliers, and to identify technology trends and new business opportunities. Finally, venture capitalists and investors can leverage patent-related information to select the targets of their financial operations, while third party resellers can benefit from patent information when selecting their suppliers.

5 A directory of vendors is available at http://www.piug.org/vendors.php.
6 http://www.thomsonreuters.com/products_services/scientific/DWPI.
7 http://www.peertopatent.org.

The tasks usually pursued by patent information users can be roughly subdivided into three main classes: patent search, analysis and monitoring. Table 3 summarizes the most common patent search tasks [1,6,7]. The first three rows report more standardized and specific operations, while the last two deal with technology survey and portfolio survey, which are broader and more variable in required depth and effort, depending on specific business requirements (e.g., allowed times and costs).

Patent analysis can be further subdivided into two broad categories related respectively to micro- and macro-analysis. Micro-analysis involves a single patent document, while macro-analysis is about a patent portfolio. Table 4 shows some of the most common analysis tasks [6,8].

Analysis tasks can also be categorized on the basis of their underlying reasons: some analyses are motivated by business needs, e.g., Intellectual Property (IP) evaluation, and others by technical reasons. IP evaluation exploits some available data, such as the patent family size and the number of citations, for estimating the patent value; research is still active in this field, as demonstrated by the related literature [9]. Patent maps, on the contrary, are more oriented to technical analysis, summarizing the specific problems addressed by the patent and the respective solutions in a single, two-dimensional matrix. For example, a patent in speech recognition can reduce the computation required for coding the voice (solved problem) by adopting a suitable pre-processing (approach). Patent maps can be effectively evaluated by leveraging the classification system defined by the JPO with F-terms [10].
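The problem/approach matrix underlying a patent map can be illustrated with a small data structure. This is only a sketch under stated assumptions: the (patent, problem, approach) annotations are taken as given (in practice they might come from manual analysis or from F-term metadata), and the function names and sample labels are hypothetical.

```python
from collections import defaultdict


def build_patent_map(annotations):
    """Group patent ids by (solved problem, approach) cell.

    `annotations` is an iterable of (patent_id, problem, approach) tuples,
    assumed to have been produced by an earlier analysis step.
    """
    grid = defaultdict(list)
    for patent_id, problem, approach in annotations:
        grid[(problem, approach)].append(patent_id)
    return dict(grid)


def count_matrix(grid):
    """Return sorted row labels, sorted column labels and per-cell counts."""
    problems = sorted({problem for problem, _ in grid})
    approaches = sorted({approach for _, approach in grid})
    matrix = [
        [len(grid.get((problem, approach), [])) for approach in approaches]
        for problem in problems
    ]
    return problems, approaches, matrix
```

Rows correspond to solved problems and columns to approaches, so each cell counts the patents that address a given problem with a given approach; empty cells may hint at unexplored combinations.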

Temporal information can be included in the analysis to identify technical trends and competitors' strategies, allowing the selection of proper countermeasures. Patent information can also be used as background for the derivation of new inventions, as done by the TRIZ⁸ method. In this technique, structured analysis of patent documents and of the problem to be solved allows new inventions to be inferred by analogy.⁹

8 TRIZ, initially devised by Altshuller, is a Russian acronym for "Theory of Inventor's Problem Solving". For further information please refer to the European portal http://etria.net/portal.
9 However, this method is not fully appropriate for deriving fundamental and/or disruptive innovations.


Table 4. Patent analysis tasks.

| Task | When | Why | What | Focus |
|---|---|---|---|---|
| Micro patent analysis – business value | Assessing internal or third party patent | IP evaluation | Some indicators such as the family size and the number of citations are related to the patent business value | Document specific |
| Micro patent analysis – technical detail | Assessing internal or third party patents | Patent map | Identify solved problems and approaches used by a patent | Document specific |
| Macro patent analysis – business value | Assessing internal or third party portfolio | IP evaluation | The business value of a patent portfolio can be inferred by the values of the single patents and by the patents' distribution | Broad |
| Macro patent analysis – technical detail | Assessing internal or third party portfolio | Patent map | Identify solved problems and approaches used by a patent portfolio | Broad |
| Macro patent analysis – technical trends | Business planning | Identify competitors' trends and actions | Time evolution of competitors' patent maps | Broad |
| Macro patent analysis – technical trends | Business planning | Identify technical trends | Time evolution of technology patent maps | Broad |
| Macro patent analysis – technical suggestions | Solving new research issues | Suggestions for new inventions (TRIZ method) | Patent mining is used in conjunction with the TRIZ method | Broad |

Table 5. Patent monitoring tasks.

| Task | When | Why | What |
|---|---|---|---|
| Early sign monitoring | Continuously | Intelligence | New patent applications as available, with incomplete data possibly integrated by other information |
| Single patent monitoring | Continuously | IP proactive management against specific third party patents | Patent extensions in new countries, changes in the legal status, new litigations, etc. |
| Technology monitoring | Continuously | Technology intelligence | New patents and applications in specific domains |
| Portfolio monitoring | Continuously | Competitor intelligence | New patents, and applications, of specific players |


Patent monitoring tasks (Table 5 reports some examples) are used to regularly update users about new patent information in the domains of interest they specify. These tasks are often implemented by means of software agents that act on behalf of the user, according to a properly defined interest profile and update frequency. Monitoring tasks can also be done manually; in this case operations are more flexible but definitely more costly. The 18-month protection period between the patent application date and the disclosure of the patent full text and of some additional metadata adds further latency to the process, which can be mitigated by monitoring incomplete application data (such as the applicant names or the application title) integrated by other information, for example something that can be inferred from the list of applicants.¹⁰
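The idea of an agent filtering new records against an interest profile can be sketched as follows. This is an illustrative assumption rather than a description of any existing tool: the record fields (`id`, `title`, `applicants`, `ipc`), the `InterestProfile` class and the "any criterion matches" policy are all hypothetical. Missing fields are tolerated, reflecting the incomplete data available for early applications.

```python
from dataclasses import dataclass


@dataclass
class InterestProfile:
    ipc_prefixes: tuple = ()  # e.g. ("G10L",)
    applicants: tuple = ()    # lowercase name fragments
    keywords: tuple = ()      # lowercase title keywords


def matches(profile: InterestProfile, record: dict) -> bool:
    """True if any profile criterion matches the (possibly partial) record."""
    title = record.get("title", "").lower()
    applicants = [name.lower() for name in record.get("applicants", [])]
    ipc_codes = record.get("ipc", [])
    return (
        any(code.startswith(p) for code in ipc_codes for p in profile.ipc_prefixes)
        or any(frag in name for name in applicants for frag in profile.applicants)
        or any(word in title for word in profile.keywords)
    )


def new_alerts(profile: InterestProfile, records: list, seen_ids: set) -> list:
    """Return ids of unseen records that match; remember every id examined."""
    alerts = []
    for record in records:
        if record["id"] in seen_ids:
            continue
        seen_ids.add(record["id"])
        if matches(profile, record):
            alerts.append(record["id"])
    return alerts
```

Calling `new_alerts` at the chosen update frequency with the same `seen_ids` set reports each matching record only once.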

5. Looking for patent information

Patent search and analysis are two business-critical, time-intensive tasks that can be addressed either manually or automatically. Professionals currently adopt the first solution, being supported by automatic tools in some steps of the whole evaluation procedure. Automatic and manual processes can be roughly subdivided into two phases: query and result evaluation (summarized in Table 6).

Queries can be expressed either through metadata (e.g., applicant, year, IPC) or through full text. Metadata queries on the applicant and year are straightforward from an algorithmic point of view, but they are made more complex by at least two factors: constantly changing company names, due to mergers and demergers, and unreliable translations of far-east names of companies and persons. In such cases additional databases with company history, or fuzzy name matching, may help.
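As an illustration of fuzzy name matching, the sketch below compares normalized applicant names with the standard library's `difflib.SequenceMatcher`. The list of legal suffixes, the 0.9 threshold and the function names are assumptions chosen for the example; a production system would likely need transliteration tables and a company-history database as well.

```python
from difflib import SequenceMatcher

# Assumed, deliberately short list of legal-form tokens to ignore.
LEGAL_SUFFIXES = {"inc", "ltd", "co", "corp", "corporation", "gmbh"}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, drop legal-form tokens."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    tokens = [tok for tok in cleaned.split() if tok not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized applicant names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_applicants(query: str, candidates: list, threshold: float = 0.9) -> list:
    """Return candidates whose similarity to the query reaches the threshold,
    best match first."""
    scored = [(name_similarity(query, cand), cand) for cand in candidates]
    return [cand for score, cand in sorted(scored, reverse=True) if score >= threshold]
```

With this approach a misspelled or differently punctuated applicant name can still be associated with its database entry, at the cost of possible false matches when the threshold is set too low.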

IPC metadata search is also quite complex and is usually accomplished by using dictionaries¹¹ (for finding suitable search keywords) or with query-by-example approaches, e.g., by searching for patents having IPC descriptions similar to the IPC classification of a prototype patent, whose membership in a given IPC class is known a priori.

10 Enrico Ramunni, IP manager at De Nora, private communication.
11 As in http://www.wipo.int/tacsy.
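One simple way to realize a query-by-example search over classification metadata is to rank patents by the overlap between their IPC code sets and those of the prototype. The sketch below uses the Jaccard coefficient for this purpose; this is a swapped-in illustrative measure, not the method used by the tools cited in the text, and the patent identifiers and function names are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b|, defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_ipc_similarity(prototype_ipcs, collection: dict) -> list:
    """Rank patent ids by IPC overlap with the prototype patent.

    `collection` maps patent id -> iterable of IPC codes. Patents sharing
    no code with the prototype are left out of the result.
    """
    prototype = set(prototype_ipcs)
    scored = [(jaccard(prototype, set(codes)), pid) for pid, codes in collection.items()]
    matching = [(score, pid) for score, pid in scored if score > 0]
    # Highest score first; ties are broken by patent id for determinism.
    return [pid for score, pid in sorted(matching, key=lambda item: (-item[0], item[1]))]
```

Exact code matching is strict; comparing only a prefix of each code (for example the subclass) would make the ranking more tolerant.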

Text queries, conversely, can be performed with different levels of complexity, from simple word occurrence and Boolean matching to full information retrieval techniques, depending on the patent database functions and on search requirements. Due to the heterogeneous structure of patent documents it may be useful to identify the sections to be searched, e.g., claims, as well as to navigate between referenced patents combining different search modalities.
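The simplest end of this spectrum, Boolean matching restricted to one section of the document, can be sketched as follows. The in-memory representation (a dict mapping patent id to a dict of section texts) and the function names are assumptions made for the example; real patent databases expose their own query languages.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_search(patents: dict, all_of=(), any_of=(), none_of=(), section="claims") -> list:
    """Return ids of patents whose chosen section satisfies the Boolean query.

    all_of  -> every term must occur (AND)
    any_of  -> at least one term must occur (OR), if any are given
    none_of -> no term may occur (NOT)
    """
    required = [term.lower() for term in all_of]
    optional = [term.lower() for term in any_of]
    excluded = [term.lower() for term in none_of]
    hits = []
    for patent_id, sections in patents.items():
        words = tokenize(sections.get(section, ""))
        if not all(term in words for term in required):
            continue
        if optional and not any(term in words for term in optional):
            continue
        if any(term in words for term in excluded):
            continue
        hits.append(patent_id)
    return hits
```

Running the same query against the claims and then against the description shows how section selection changes the result list.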

Most patent searches, e.g., patentability and validity, are highly business-sensitive and missing relevant documents would be unacceptable. High relevance of search results is also required to improve the overall efficiency of the search process. In order to achieve these two contrasting goals, patent professionals often adopt an iterative search paradigm that can be supported by automatic tools, e.g., tools for expanding synonyms in text searches, tools for refining search results, etc. In addition, some patent retrieval tasks can benefit from other functions (not yet integrated in search tools) such as speeding up the evaluation of single patents by identifying the focus of each sentence, translating patent documents, etc.

Patent analysis usually has different requirements from patent search, as its goal is to obtain relevant patents and to analyze them as an aggregate. Consequently, the query phase can use simple metadata-based criteria and the result evaluation phase can be effectively supported by automatic tools, integrating other types of information like the patent family size, or the distribution of patents in force for the given topic, or the year and the applicant, and so on.

6. Systems characterization and benchmarks

Patent search has traditionally been treated as part of the wider information retrieval (IR) [11] research area, but in IR the mainstream research efforts have been more concentrated on general purpose search systems able to deal with any kind of documents and queries, neglecting the specificity of patents. Conversely, patent retrieval has been much more investigated by the database research community than by information retrieval researchers. At the basis of this choice there is also the lack of test collections targeting patent information.

Table 6. Tasks and phases in patent retrieval.

| Phase | Task: patent search | Task: patent analysis |
|---|---|---|
| Query | Search with advanced criteria, including text search. Highly business-sensitive results. Recall is of paramount importance while precision is welcome for efficiency. Iterative search is used to balance precision and recall figures | Simpler data/metadata. Also noise in the result list can be of interest. Iteration may be motivated by exploratory search |
| Results evaluation | To rank and re-order the list of results, to support further iteration cycles or to speed up the analysis of a single patent | Important for automatically integrating results with additional data and to post-process the result list with different presentation criteria |

Although some patent documents were included in the TREC¹² test collections, until a few years ago no domain-specific collections were available. Starting in 2001, the NTCIR¹³ workshop introduced a specific track on patent search, called Patent Retrieval Collection, using a collection of patent documents and abstracts extracted from 2 years of unexamined Japanese patent applications published between 1998 and 1999.

Patent search processes differ significantly depending on the type and purpose of retrieval, such as technology survey or invalidity search. Each retrieval type deserves a customized approach, as stated in the NTCIR-3 task description [12]. For example, in invalidity search, professional users search at least 5 years' worth of patents for applications conflicting with the patent under examination. In technology survey, instead, patents are searched to derive a general overview of the current state-of-the-art [13]. NTCIR-3 focuses more on the latter search modality and uses, as base scenario, the case in which a manager wants to know the patent landscape related to a newspaper clip. This patent search is referred to as cross-genre or cross-database retrieval, as information coming from a newspaper is used to query a patent collection for relevant documents.

As the information retrieval tasks related to patent collections are so diverse and multi-faceted, it is worth investigating whether traditional IR models can be applied to patent retrieval and what kind of evaluation measures can be adopted. In fact, experiments done by the committee that developed the NTCIR-3 benchmark [14], on a technology survey task, have shown that typical IR algorithms can be effectively applied to patent document retrieval, with some particular cautions.

Classical IR quality figures include the Precision (1), which measures the proportion of retrieved documents that are relevant, and the Recall (2), which measures the proportion of relevant documents that have been retrieved.

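\[
\mathrm{Precision} = \frac{|\mathrm{RelevantDocuments} \cap \mathrm{RetrievedDocuments}|}{|\mathrm{RetrievedDocuments}|} \qquad (1)
\]

\[
\mathrm{Recall} = \frac{|\mathrm{RelevantDocuments} \cap \mathrm{RetrievedDocuments}|}{|\mathrm{RelevantDocuments}|} \qquad (2)
\]

12 http://trec.nist.gov/.
13 http://research.nii.ac.jp/ntcir/.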

Such metrics can be applied to patent information retrieval [14], too, although the absolute figures and their relative importance may vary depending on the retrieval task. For example, while Precision is generally more important than Recall for invalidity search, both Precision and Recall are important for technology surveys. Experiments show [14] that the retrieval models that perform better do not significantly differ depending on the document genre (patents vs. other types of documents, such as research papers or newspaper articles).
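Definitions (1) and (2) translate directly into set operations. The following minimal sketch computes both figures for a single query; the function name and the convention of returning 0.0 when a denominator is empty are assumptions of this example.

```python
def precision_recall(relevant, retrieved):
    """Compute (Precision, Recall) for one query from document identifiers.

    Precision = |relevant ∩ retrieved| / |retrieved|
    Recall    = |relevant ∩ retrieved| / |relevant|
    An empty denominator yields 0.0 by convention (an assumption made here).
    """
    relevant_set, retrieved_set = set(relevant), set(retrieved)
    hits = len(relevant_set & retrieved_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For instance, retrieving three documents of which two are among four relevant ones gives a Precision of 2/3 and a Recall of 1/2.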

In general, we may conclude that patent retrieval is a very specialized information retrieval task that benefits from state-of-the-art approaches in the IR field, but that requires proper adaptations of traditional IR models, depending on the target information need. Adaptations can include task specialization, e.g., IR models for patentability search that differ from models for technology survey due to the different Precision vs. Recall trade-off requirements, and search model adaptation, as the single-query assumption at the basis of traditional models is not realistic in the patent domain [15], where an interactive refinement method is preferable. The domain-specific nature of the patent search process is widely recognized in the patent retrieval community, as demonstrated by the composition of the NTCIR-3 successors (NTCIR-7,¹⁴ etc.), where different patent retrieval tasks such as patentability search, invalidity search and technology survey are supported with dedicated benchmarks, suggesting that different algorithms should be selected, and tuned, for each patent-related sub-task.

7. Patent software challenges and opportunities

Patent information software deals with two major aspects: conveyed information and users, each implying different challenges and opportunities, outlined in Table 7.

Users can be of different categories, with different skills and needs: professional patent searchers typically prefer more advanced functionalities, with a higher degree of control over tool capabilities and freedom in setting search parameters, while occasional users (such as managers) often require an easy-to-use interface and simpler commands, with complexity hidden under a simplified front-end.

Information, on the other hand, is perhaps the most challenging aspect due to the increasing diversity in patent languages, especially considering the unprecedented increase in patent applications originating from Asian countries such as Japan, China and South Korea. As a consequence, traditionally neglected issues, such as multilingualism, are becoming more and more important, together with related tasks like multilingual querying and automatic translation. In addition to new challenges, more traditional issues still remain, such as the quality of patent databases. Patent databases are affected by errors due to OCR processing of old paper documents, to wrong metadata associations, to inconsistencies between different metadata vocabularies, etc., and this can influence the results of a query.

In parallel with quality and multilingualism, integration is also becoming a very hot topic, since patents may only offer a partial view, due to the very nature and structure of patent documents. In fact, the dynamic nature of current business scenarios requires timely integration of heterogeneous information sources like news, blogs, forums and more traditional databases. Web 2.0 mash-up solutions can be of aid in this context.

14 http://www.nlp.its.hiroshima-cu.ac.jp/~nanba/ntcir-7/cfp-en.html.


Table 7. Patent software challenges.

| Challenge | Why it is important | Opportunity |
|---|---|---|
| Different kinds of users | Patent information is not any more for pure patent professionals, but for R&D people, business analysts and managers | Different users require different balance between direct control and ease of use |
| Different kinds of tasks | Patent information tasks are oriented to information retrieval, to patent analysis and to patent monitoring | To develop different kinds of patent tasks, tailored on users' needs |
| Task benchmarking | To assess different kinds of solutions or evolutions of the available approaches | To benefit as much as possible from patent processing byproducts such as rejected applications, translations, etc. |
| Different languages | The patent community is rapidly becoming truly multilingual | To benefit as much as possible from patent processing byproducts such as translations, etc. |
| Quality of databases | OCR processing can introduce errors, as can human elaboration of metadata | New software for patent application management can reduce the problem |
| Integration of different data sources | There is an emerging, increasing need of integrating heterogeneous information | Web 2.0 mash-ups can be crucial |
| Independence between software applications and data | There is an increasing demand of flexibility in adding analysis and monitoring procedures to simple patent data manipulation | Service Oriented Architectures can help |

Table 8
Current research trends.

| Functional evolution | Why it is important | Some possible solutions |
|---|---|---|
| Advanced search including search for similar documents and search for similar drawings | To obtain more consistent results from queries, even if formulated by less experienced users, and to reduce the search time, also for expert users | Natural language processing and text, and multimedia, semantics can help to solve this issue |
| Support for iterative and exploratory search | To obtain a better recall as well as a better precision in the search phase and also to improve the efficiency in technology and portfolio surveys | Different kind of approaches can be used: more structured, based on faceted search, or more holistic, based on patent clusters |
| Multi-language query and automatic translation support | Fostered by the increasing importance of East Asian languages | Natural language and semantic processing (e.g., by abstracting patent key topics) |
| More robustness against errors in patent text and metadata | To avoid missing important documents | Post-processing databases or considering more robust search methods, including semantic ones |
| Support for single patent analysis both in the technical as well in the legal part | To provide more efficient and more consistent technical and legal analyses, which have different objectives and can be oriented to different user communities | Use of natural language and semantics-based solutions |
| Automatic support for patent classification | To have a more consistent classification from patent offices with reduced efforts. To support advanced users in their internal classification | Natural language and semantic processing |
| Improved analysis and monitoring, including the identification of feeble signs and relationships | To provide more efficiency as well as better quality in patent related intelligence | Integration of complementary news and information through mash-up solutions, use of semantics in bottom up and top down modes |
| Identification of better proxies for evaluating patent value | To support an increasing community of interest about patent information | Patent economics basic research and integration of complementary economic data, e.g., through Web Services and mash-up solutions |


Finally, there is an increasing need for automatic elaboration facilities for patent-related information, especially for patent analysis and patent monitoring tasks; Service Oriented Architectures (SOA) are suitable to provide the technical infrastructure. As an example, the European Patent Office has offered, since 2003, Web Service access to its internal patent database, called esp@cenet®. This Open Patent Service (OPS) has been constantly evolving over the last few years. The current version increases the number of functions offered through the Web Service interface, and allows software applications to programmatically access all the information and all the search services that are available through the interactive web site (see footnote 15).

8. Most significant recent evolutions of patent informatics

In response to the above-mentioned challenges, patent informatics is showing significant functional evolutions, ranging from the adoption of natural language processing (NLP) and semantics for automatic processing of information to the design of innovative and efficient user interfaces, and from the integration of information coming from less traditional sources (such as the World Wide Web) to the exploitation of hidden information in available documents. Table 8 reports some of these evolutions, their motivations and some possible solutions currently being investigated.

15 Detailed information can be found at http://forum.espacenet.com together with the final OPS specification (http://ops.espacenet.com/).

Traditional search methods, based on patent text and metadata, rely heavily on the skills of patent searchers and on the quality of patent document databases. However, the increasing amount of patent-related information, and the ever-growing need for less experienced users to access patent information, require the development of new search tools and methodologies, suitable for casual users and effective for experienced ones, e.g., enabling them to shorten search times. These goals can be pursued by improving database quality and by building error-tolerant query services. Nevertheless, new methods deserve particular attention, as they can deliver innovative services such as search for similar documents and for similar pictures [16], e.g., by leveraging semantics (the meaning behind word forms or structured drawings), and they can more easily support multilingualism (as semantics may be defined in a language-independent way). In addition, patent searches often involve several iterations, which must be effectively supported by search tools. While traditional solutions poorly support this iterative nature of the search, new methodologies pay more attention to it, adopting either structured or holistic approaches. Faceted search, for example, is a well-structured solution that allows search results to be refined incrementally by exploiting automatically extracted tags. Patent clustering, conversely, is a more holistic approach where patent searches are refined by finding similarities between the retrieved documents.
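One way to picture an error-tolerant query service is approximate matching of the query term against the index vocabulary, so that a term corrupted by OCR still leads to its document. The sketch below is purely illustrative: the toy index, the document identifiers and the similarity cutoff of 0.75 are assumptions, and it uses Python's standard `difflib` module rather than any technique attributed to the systems reviewed here.

```python
import difflib

# Hypothetical inverted index: term -> identifiers of the documents containing it.
# "sernantic" mimics a typical OCR confusion ("m" read as "rn").
INDEX = {
    "semantic": {"EP-A"},
    "sernantic": {"EP-B"},
    "seismic": {"EP-C"},
    "retrieval": {"EP-D"},
}

def exact_lookup(term, index):
    """Classic lookup: only documents indexed under the exact term."""
    return set(index.get(term, set()))

def tolerant_lookup(term, index, cutoff=0.75):
    """Also return documents indexed under terms that closely resemble `term`."""
    close_terms = difflib.get_close_matches(
        term, list(index), n=len(index), cutoff=cutoff
    )
    hits = set()
    for candidate in close_terms:
        hits |= index[candidate]
    return hits

print(sorted(exact_lookup("semantic", INDEX)))
print(sorted(tolerant_lookup("semantic", INDEX)))
```

With this toy index the exact lookup misses the document whose text was damaged by OCR, while the tolerant lookup recovers it without also pulling in the unrelated term "seismic". The cutoff is the knob that trades robustness against false matches.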

Patent analysis tasks also require innovative solutions, tailored to their different goals. Analysis goals, in fact, differ between technical analysis, which is aimed at producing patent maps, and claims analysis, which is aimed at restructuring the legal parts of patent documents (i.e., the claims). For both, semantic solutions can be suitable. Similar approaches can also address multilingualism, in particular by identifying the main topics in patent documents written in different languages, which is a more tractable problem than full translation and still retains added value for patent professionals.

Analysis and monitoring of patent-related information require both the integration of complementary data, through Web Services and Web 2.0 solutions, and the discovery of hidden facts and relationships by means of natural language processing and the exploitation of semantics.

Finally, the evaluation of a patent, or of a patent portfolio, although currently addressed by several tools [17,18], requires further exploration to find better quality proxies and evaluation metrics [16].

16 http://www.w3.org/TR/owl-features/.

17 Semantic Web activities at W3C, http://www.w3.org/2001/sw.

9. Overview of semantic-based solutions

Semantic solutions are good candidates for supporting some of the above-cited evolutions, as they work on word meaning rather than on mere word occurrence or frequency counts. In general, a semantic-based solution is expected to improve the recall of search processes while keeping precision constant, or even increasing it. This can be achieved by properly handling synonyms and homonyms, and by providing cross-lingual services, which are easily supported by ontology-powered systems. In addition, the underlying semantic models that these systems need in order to work enable more advanced search and filtering functionalities in query post-processing, and can help identify hidden relationships between documents.
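The effect of synonym handling on recall can be illustrated with a toy experiment. The sketch below is only an illustration under stated assumptions: the four documents, the synonym set and the relevance judgements are all invented, and the search is a plain keyword match.

```python
# Hypothetical mini-collection: document id -> text.
DOCS = {
    "D1": "car with hybrid engine",
    "D2": "automobile braking system",
    "D3": "motor vehicle gearbox",
    "D4": "speech coding method",
}
# Hypothetical synonym sets, as a thesaurus or ontology might provide them.
SYNONYMS = {"car": {"car", "automobile", "vehicle"}}
# Assumed relevance judgements for the query "car".
RELEVANT = {"D1", "D2", "D3"}

def search(terms, docs):
    """Return the ids of documents containing at least one query term."""
    return {doc_id for doc_id, text in docs.items() if terms & set(text.split())}

def expand(term, synonyms):
    """Replace a term by its synonym set, if one is known."""
    return synonyms.get(term, {term})

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

plain = search({"car"}, DOCS)
expanded = search(expand("car", SYNONYMS), DOCS)
print("plain:   ", recall(plain, RELEVANT), precision(plain, RELEVANT))
print("expanded:", recall(expanded, RELEVANT), precision(expanded, RELEVANT))
```

In this constructed case the expanded query retrieves all three relevant documents instead of one, and precision does not drop because no irrelevant document mentions any of the synonyms. Real collections are less forgiving, which is why ambiguous terms also need to be handled.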

Semantic solutions are based on knowledge bases, i.e., on the union of a knowledge-domain model (ontology, thesaurus or taxonomy) and of domain-specific data. Ontologies, thesauri and taxonomies have several features in common:

• they are approaches to help structure, classify, model, and/or represent the concepts and relationships pertaining to some subject matter of interest to some community;

• they are intended to enable a community to come to agreement and to commit to using the same terms in the same way;

• the meaning of the terms is specified in some explicit way and to some degree;

• the explicit mapping to conceptual models eases interoperation across different languages, as long as the relevant taxonomies are translated.

However, they differ in how they formalize the meaning of terms, and they have different notations and different goals:

• a taxonomy describes a knowledge domain through a hierarchy of relevant entities. The hierarchy can be built on the basis of a suitable binary, transitive, asymmetric relationship such as parent-child (most common), part-of, instance-of, etc.;

• a thesaurus represents a knowledge domain by using a set of terms (without definitions) related to each other by means of hierarchical, equivalence and associative links. Whenever a term is ambiguous, a "scope note" may be added to allow disambiguation of the intended meaning;

• an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse [19,20]. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.
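The three kinds of model listed above can be contrasted with minimal data structures. The following sketch is an assumption-laden illustration: every term, class, property and individual in it is hypothetical, and it does not follow any standard format such as OWL.

```python
from dataclasses import dataclass, field

# Taxonomy: a hierarchy built on one transitive, asymmetric relation (child -> parent).
TAXONOMY = {"diesel engine": "engine", "engine": "machine"}

def is_a(entity, ancestor, taxonomy):
    """Follow parent links upwards; True if `ancestor` is reached."""
    while entity in taxonomy:
        entity = taxonomy[entity]
        if entity == ancestor:
            return True
    return False

# Thesaurus: undefined terms connected by hierarchical, equivalence and
# associative links, with an optional scope note for ambiguous terms.
@dataclass
class ThesaurusEntry:
    broader: set = field(default_factory=set)   # hierarchical links
    use_for: set = field(default_factory=set)   # equivalence links (synonyms)
    related: set = field(default_factory=set)   # associative links
    scope_note: str = ""                        # disambiguation hint

THESAURUS = {
    "engine": ThesaurusEntry(
        broader={"machine"},
        use_for={"motor"},
        related={"fuel"},
        scope_note="Mechanical device, not a software engine",
    ),
}

# Ontology: classes, properties and constraints on their consistent use.
@dataclass
class ObjectProperty:
    name: str
    domain_class: str  # class required for the subject
    range_class: str   # class required for the object

CLASS_OF = {"EP0000001": "Patent", "Jane Doe": "Inventor"}  # hypothetical individuals
INVENTED_BY = ObjectProperty("inventedBy", "Patent", "Inventor")

def is_consistent(subject, prop, obj, class_of):
    """Check a statement against the property's domain and range constraints."""
    return (class_of.get(subject) == prop.domain_class
            and class_of.get(obj) == prop.range_class)

print(is_a("diesel engine", "machine", TAXONOMY))
print(is_consistent("EP0000001", INVENTED_BY, "Jane Doe", CLASS_OF))
```

The taxonomy only answers hierarchy questions, the thesaurus adds synonym and association links without defining the terms, and the ontology can additionally reject statements that violate its constraints.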

Ontologies can be classified as upper-level, middle-level, domain, document and linguistic [21]. Upper-level ontologies aim at describing very general concepts that are the same across all knowledge domains [22]; middle-level ontologies are more domain-specific than upper-level ontologies and provide formal definitions for entities typical of a restricted field of knowledge. Domain ontologies are targeted at a restricted knowledge area, e.g., patent information, and define the finest granularity with which the meaning of a domain entity can be described. Document ontologies define the structure of the documents of a given knowledge domain, e.g., the fact that patent documents have a technical and a legal part.

Finally, linguistic ontologies (e.g., WordNet [23]) provide the information needed to bridge linguistic resources (i.e., the text of documents) with their conceptual counterparts (i.e., hierarchically arranged sets of concepts, with semantic relationships linking them). Linguistic ontologies also play a role in multi-language patent processing: if the relevant keywords are mapped to a shared ontology, then patent classification becomes language-neutral, as it is bound to a conceptual, rather than lexical, representation. Language crossing through semantic processing is quite successful for Western languages, while for Eastern languages satisfactory solutions have yet to be found [31].
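The idea of mapping keywords to a shared ontology can be sketched as follows. This is a minimal illustration, assuming a hand-made multilingual lexicon and invented concept identifiers; a real system would rely on a linguistic ontology and on proper tokenization and lemmatization.

```python
# Hypothetical multilingual lexicon: (language, keyword) -> shared concept identifier.
LEXICON = {
    ("en", "engine"): "C:Engine",
    ("it", "motore"): "C:Engine",
    ("de", "motor"): "C:Engine",
    ("en", "brake"): "C:Brake",
    ("it", "freno"): "C:Brake",
    ("de", "bremse"): "C:Brake",
}

# Hypothetical patents: id -> (language, text).
PATENTS = {
    "P1": ("it", "motore a combustione"),
    "P2": ("de", "Bremse mit Sensor"),
    "P3": ("en", "hybrid engine control"),
}

def to_concepts(text, language, lexicon):
    """Map the known keywords of a text to language-independent concepts."""
    return {
        lexicon[(language, word)]
        for word in text.lower().split()
        if (language, word) in lexicon
    }

def concept_search(query, query_language, patents, lexicon):
    """Return patents sharing at least one concept with the query, in any language."""
    wanted = to_concepts(query, query_language, lexicon)
    return sorted(
        patent_id
        for patent_id, (language, text) in patents.items()
        if wanted & to_concepts(text, language, lexicon)
    )

print(concept_search("engine", "en", PATENTS, LEXICON))
```

Because matching happens on concept identifiers, the English query also reaches the Italian patent, with no translation of the documents themselves.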

Ontologies can be integrated in applications according to several paradigms: e.g., they can be embedded without leaving the end user any chance of modification or, instead, they can be exposed to the end users, through proper interfaces, allowing them to edit, refine or drop concepts inside the knowledge model.

Semantics and related processes are not completely new in the patent informatics domain: for example, professional patent searchers already tend to use synonym dictionaries for expanding their queries, and the IPC classification itself is actually a domain-specific ontology (or, better, a thesaurus), although not expressed in a standard Semantic Web language such as OWL (see footnote 16). However, until a few years ago the landscape of patent-related semantic applications was still poorly populated. Things are now changing, thanks to the level of maturity reached by Semantic Web standards (W3C Recommendations; see footnote 17) and to the ever-increasing availability of tools and methodologies to exploit this kind of data and metadata.

Patent information providers are starting to recognize this new trend in Web application architectures, where data sharing is becoming easier thanks to technologies such as Web Services, interoperation between data providers is enabled by the Linked Open Data standards (based on the Semantic Web infrastructure), and end users expect data and information to be available according to an "everywhere, everyone" paradigm. These new technology enablers, coupled with the new user attitude (partially brought about by the Web 2.0 trend), shift the importance from the availability of the data to the ability to process it and to provide the best results out of it. Some players are already shifting from their role as "patent database owners" to that of "search service providers" over partially public data.


Table 9
Currently available semantic solutions.

| Functions | Solution | Architecture and technology |
|---|---|---|
| Semantic-based improvement of text search, related documents retrieval | Patent Cafe | Latent Semantic Analysis, without explicit ontology modeling |
| Semantic and multi-lingual search, search by similar documents and by similar drawings | PATExpert | Ontologies in W3C standard formats |
| Semantic and multilingual search | Goldfire Invention Machine | Embedded ontologies |
| Semantic indexing for faceted document navigation | IntelliPatent | User-editable ontologies |
| Semantic classification | PATExpert | Ontologies in W3C standard formats |
| Single patent functional analysis | Patent Analyzer | Embedded ontologies |
| Support for invention | Goldfire Invention Machine | Embedded ontologies |


In the patent knowledge domain, semantics is nowadays playing a role of increasing importance, contributing to the solution of several open issues, as reported in Table 9. Solutions are varied and tackle different problems, ranging from search improvement to analysis and post-processing improvement. The way ontologies are used is also extremely varied and encompasses systems using external ontologies, systems embedding ontologies in a read-only fashion and systems that allow the underlying knowledge model to be refined incrementally.

Patent Cafe [24] applies semantic technologies to improve recall in patent search: a semantic index of the Patent Cafe database is computed and embedded in the database itself, making it possible to directly exploit semantic correlations between the available patents. This solution is claimed to be more robust than semantic expansion of queries and subsequent mapping onto a traditional lexical index. Patent Cafe uses Latent Semantic Analysis (see footnote 18) for building the semantic index of the patent database [25]. Latent Semantic Analysis extracts the hidden relations between words occurring in patent documents by applying a truncated singular value decomposition (a technique closely related to principal component analysis) to the traditional term-document matrix, thus identifying the most relevant dimensions among the many features that can be taken into account in the patent search process.
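The general mechanism of Latent Semantic Analysis can be sketched on a toy corpus. The code below is not the Patent Cafe implementation, whose details are not given here. It is a minimal pure-Python illustration in which the six documents are invented and the dimensionality reduction is obtained by power iteration on the document Gram matrix (whose eigenvectors are the right singular vectors of the term-document matrix), standing in for a library SVD routine.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def top_eigenpairs(matrix, k, iterations=500):
    """Top-k eigenpairs of a symmetric positive semi-definite matrix,
    found by power iteration with deflation."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(k):
        v = [1.0 / math.sqrt(n)] * n
        for _step in range(iterations):
            w = [dot(row, v) for row in work]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in work])
        pairs.append((eigenvalue, v))
        # Deflation: remove the component just found before looking for the next.
        for i in range(n):
            for j in range(n):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return pairs

def lsa(docs, k):
    """Return raw term-count vectors and k-dimensional latent vectors per document."""
    vocabulary = sorted({term for doc in docs for term in doc.split()})
    raw = [[float(doc.split().count(term)) for term in vocabulary] for doc in docs]
    gram = [[dot(a, b) for b in raw] for a in raw]  # documents x documents
    pairs = top_eigenpairs(gram, k)
    # Coordinate of document j on latent dimension i: singular value * eigenvector entry.
    latent = [[math.sqrt(max(lam, 0.0)) * vec[j] for lam, vec in pairs]
              for j in range(len(docs))]
    return raw, latent

# Invented corpus: two topics; documents 0 and 1 share no word with each other.
DOCS = [
    "engine piston",
    "motor fuel",
    "engine piston motor fuel",
    "speech coding",
    "audio compression",
    "speech audio",
]

raw, latent = lsa(DOCS, k=2)
print("raw similarity    d0-d1:", cosine(raw[0], raw[1]))
print("latent similarity d0-d1:", cosine(latent[0], latent[1]))
print("latent similarity d0-d3:", cosine(latent[0], latent[3]))
```

Documents 0 and 1 have no term in common, so a purely lexical comparison sees them as unrelated. In the two-dimensional latent space they end up close together because a third document uses both vocabularies, while documents about the other topic stay apart.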

In summary, the Patent Cafe approach can be classified as lying between linguistics and semantics since, to extract document meaning, it uses implicit, non-formalized knowledge.

The EU-funded PATExpert project [7] uses explicit semantics, i.e., a set of intertwined ontologies, to accomplish different goals such as semantic search, automatic patent classification and clustering. PATExpert supports a semantic "search by similar" functionality for patent documents or single text passages. It describes both patent documents and queries in terms of ontology concepts and is able to integrate query and filtering functionalities supporting complex searches like "find documents similar to document X as far as feature Y is concerned". Semantics is also seen as a means of helping patent offices in the patent classification process. Classification is supported through the adoption of domain-specific vocabularies [26], which make it possible to correctly manage patent jargon and patent structure, and through the definition of ontologies tailored to the patent knowledge domain. In particular, the latter effort is providing several reference ontologies that can be classified into three different families:

• document ontologies, modeling the structure of patent documents, the associated metadata and the embedded drawings, when available;

• domain ontologies, including a patent classification ontology;

• linguistic ontologies, e.g., WordNet [23].

18 Tutorials and demos can be found at http://lsa.colorado.edu.

IntelliPatent [27] uses semantics to empower the navigation of patent search results: the user is first encouraged to perform a relatively coarse search, which is then refined in an iterative fashion. To support this interaction paradigm, IntelliPatent first indexes the result list with a semantic indexer, and then aggregates patents into categories and facets, making it easy to "cluster" patents about a common topic. As an example, if a general query about "speech coding" patents has returned a rather long list of results, IntelliPatent is able to restructure the list by filtering results on the basis of the specific approach used by the patents, by aggregating similar approaches, and so on. The adopted semantic indexer is domain-dependent (i.e., it can only work on a specific knowledge domain) and uses a set of ontologies that can be modified or improved at runtime by sufficiently skilled users. Unlike other approaches based on patent clustering, IntelliPatent leaves users in full control of the classification criteria, allowing them to finely tune the search behavior to the actual search needs.
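The coarse-search-then-refine interaction described above can be sketched with a small faceted result list. This is not IntelliPatent code: the result records, the facet names (`approach`, `application`) and their values are hypothetical, and the facet values are simply assumed to have been assigned beforehand by a semantic indexer.

```python
from collections import Counter

# Hypothetical result list for a coarse "speech coding" query.
RESULTS = [
    {"id": "P1", "approach": "linear prediction", "application": "telephony"},
    {"id": "P2", "approach": "linear prediction", "application": "storage"},
    {"id": "P3", "approach": "transform coding", "application": "telephony"},
    {"id": "P4", "approach": "vector quantization", "application": "telephony"},
]

def facet_counts(results, facet):
    """Count how many results fall under each value of a facet."""
    return Counter(result[facet] for result in results)

def refine(results, **selected):
    """Keep only the results matching every selected facet value."""
    return [
        result for result in results
        if all(result.get(facet) == value for facet, value in selected.items())
    ]

# Iterative refinement: the user picks one facet value at a time.
step1 = refine(RESULTS, application="telephony")
step2 = refine(step1, approach="linear prediction")

print(facet_counts(RESULTS, "approach"))
print([result["id"] for result in step1])
print([result["id"] for result in step2])
```

The facet counts give the user an overview of the result list before any choice is made, and each selection narrows the list while the user stays in control of the criteria.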

Patent Analyzer [28,29], which evolved from the EU-funded project WISPER (Worldwide Intelligent Semantic Patent Extraction and Retrieval; see footnote 19), performs functional analysis of specific patents, automatically describing the patent features at a relatively high level. The process is organized in three main steps: first, the components of the invention are identified and extracted; second, the extracted components are classified at a more abstract level; finally, the relations between them are made explicit. Two different and complementary approaches drive the entire process: natural language processing is adopted in the first stage, while an embedded semantic model is exploited in the second and third stages. The final goal is to support the user in exploring different design solutions and variants, and in applying the TRIZ method.

Invention Machine (see footnote 20) goes a step further in the application of the TRIZ method by directly using semantics to solve the invention problem. The required knowledge base is embedded in the Invention Machine software and is used to represent the problem to be solved in a lightweight, machine-understandable way. A semantic search engine is also provided, which captures the relationships underlying different analysis tasks such as Failure Modes and Effect Analysis (FMEA) [30], and supports multi-lingual queries, in particular in English, German, French and Japanese.

10. Conclusions

Patent information is crucial for defining strategies and making decisions in today's global and dynamic world of business. Databases and tools have long supported the elaboration and use of such information, with various functionalities targeted at different user categories. Far from being a completely solved research domain, patent intelligence is currently attracting more and more interest, both for addressing old issues, such as database quality, and for tackling new challenges, including:

• the increasing number of patent applications, in many different languages;

• the increased variety of users, who now include experts, occasional users, etc., with different backgrounds and interests ranging from pure science to business;

• the shift in focus from traditional defensive IP approaches to the more recent idea of leveraging patent information for mining new technical and business opportunities.

19 http://wisper.bmtproject.net.

20 http://www.invention-machine.com/.

A new stream of technical advancements and research initiatives is currently trying to address these issues by designing improvements that range from user-friendly and effective search modalities to better support for search result analysis, and from more efficient integration of complementary information to improved identification of hidden facts and relationships. In this new and exciting evolution, semantic technologies can play a relevant role by simplifying and improving the patent search and analysis processes.

References

[1] Adams SR. Information sources in patents: an overview of international patents, K.G. Saur, 2006, ISBN: 3598244436, 9783598244438.

[2] WIPO, Strasbourg agreement concerning the international patent classification, Legislative texts WO026EN, September 1979.

[3] WIPO, Recommendation for the processing of patent information using XML (eXtensible Markup Language), STANDARD ST.36, November 2007.

[4] Thielemann W. OCR errors in patent full text documents. In: Information retrieval facility, 2007.

[5] Wretblad L, Sayeler J. IPR system error!. In: Information retrieval facility, 2007.

[6] Hunt D, Nguyen L, Rodgers M, editors. Patent searching: tools and techniques. John Wiley and Sons; 2007.

[7] Wanner L, Baeza-Yates R, Brügmann S, Codina J, Diallo B, Escorsa E, et al. Towards content-oriented patent document processing. World Patent Inform 2008;30(1):21–33.

[8] Porter A, Cunningham S. Tech mining: exploiting new technologies for competitive advantage. John Wiley and Sons; 2005.

[9] Brugmann S, Wanner L. Overview of the status of art, deliverable D.8.1. Tech. Rep. PATExpert project, 2006.

[10] Iwayama M, Furujii A, Kando N. Overview of classification subtask at NTCIR-5 patent retrieval task. In: NTCIR-5 workshop meeting, 2005.

[11] Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill Book Company; 1984.

[12] Iwayama M, Fujii A, Kando N, Takano A. Report on the patent retrieval task at NTCIR workshop 3, Tech. Rep., vol. 38, No. 1, ACM SIGIR Forum, 2004.

[13] Fujita S. Technology survey and invalidity search: a comparative study of different tasks for Japanese patent document retrieval. Inf Process Manage 2007;43(5):1154–72. doi: <http://dx.doi.org/10.1016/j.ipm.2006.11.009>.

[14] Iwayama M, Fujii A, Kando N, Marukawa Y. Evaluating patent retrieval in the third NTCIR workshop. Inf Process Manage 2006;42:207–21.

[15] Jarvelin K. Issues and approaches on interactive patent retrieval: evaluation of sessions rather than queries. In: Information retrieval facility, 2007.

[16] Wanner L. Patexpert annual report, Tech. Rep., PatExpert, 2007.

[17] Hartung L. Management and evaluation of patents and product development projects. IP Score 2.0, Tech. Rep., IP4Inno consortium, 2006.

[18] Gibbs A. Application of multiple known determinants to evaluate legal, commercial and technical value of a patent. Tech. Rep., Patent cafe, 2005.

[19] Gruber T. Toward principles for the design of ontologies used for knowledge sharing. Int J Hum-Comput Stud 1995;43:907–28.

[20] Uschold M. Towards a methodology for building ontologies. In: Workshop on basic ontological issues in knowledge sharing, IJCAI-95, 1995.

[21] Gomez-Perez A, Fernandez-Lopez M, Corcho O. Ontological engineering, advanced information and knowledge processing. Springer; 2003.

[22] Niles I, Pease A. Towards a standard upper ontology. In: Proceedings of the second international conference on formal ontology in information systems (FOIS-2001), 2001.

[23] Fellbaum C. WordNet an electronic lexical database. MIT press; 1998.

[24] Gibbs A. Heuristic Boolean patent search: comparative patent search quality/cost evaluation super Boolean vs. legacy Boolean search engines. Tech. Rep., Patent cafe, 2006.

[25] Gibbs A. Semetric: conceptual search and discovery. Tech. Rep., Patent cafe, 2005.

[26] Giereth M, Koch S, Kompatsiaris Y, Papadopoulos S, Pianta E, Serafini L, Wanner L. A modular framework for ontology-based representation of patent information. In: JURIX 2007: The 20th anniversary international conference on legal knowledge and information systems, 2007.

[27] Ciaramella A. Intellipatent 4.3: a quick overview. Tech. Rep., Intellisemantic s.r.l., 2007.

[28] Cascini G. System and method for automatically performing functional analyses of technical texts. European Patent EP1351156, 2002.

[29] Cascini G, Fantechi A, Spinicci E. Natural language processing of patents and technical documentation. Doc Anal Syst VI 2004; 3163/2004: 508–20.

[30] Stamatis DH. Failure mode and effect analysis: FMEA from theory to execution. Am Soc Qual 1995.

[31] Jackson P, Moulinier I. Natural language processing for online applications: text retrieval, extraction and categorization. John Benjamins Publishing Company; 2007.

Dario Bonino (Ph.D) is a research assistant in the e-Lite research group at the Department of Computer Science and Automation of Politecnico di Torino. He received his M.S. and Ph.D. degrees in Electronics and Computer Science Engineering from Politecnico di Torino in 2002 and 2006, respectively. His interests include Semantic Web technologies, with a focus on architectures for semantic annotation, indexing and retrieval of Web resources, domotics and semantic-aware, home-related technologies. He is the manager of the H-DOSE semantic platform and collaborates on other open source projects including Ontosphere3D. He has published two papers in international journals and 27 at international conferences.

Alberto Ciaramella is the CEO and founder of IntelliSemantic, a high-tech company that started its activities in 2005 in the Incubator of the Politecnico di Torino and develops solutions based on semantic technologies. He received his M.S. degree in Electronics and Computer Science from the University of Rome "La Sapienza" in 1969 and was a researcher and a research supervisor at CSELT, the research branch of the Telecom Italia group, where he published over 40 papers in journals and at international conferences and was the author or coauthor of four patents. His present interests include the development of easy-to-use search solutions using semantic technologies for different kinds of verticals, including patents.

Fulvio Corno is an Associate Professor at the Department of Computer Science and Automation of Politecnico di Torino. He received his M.S. and Ph.D. degrees in Electronics and Computer Science Engineering from Politecnico di Torino in 1991 and 1995. He is involved in several research projects. His current research interests include the application of semantic technologies to Web systems, the design of Intelligent Domotic Environments, and interfaces for alternative access to computer systems. He has published more than 150 papers at international conferences and 19 in international journals.
