World Patent Information 32 (2010) 30–38
Review
Review of the state-of-the-art in patent information and forthcoming evolutions in intelligent patent informatics
Dario Bonino a, Alberto Ciaramella b, Fulvio Corno a,*
a Politecnico di Torino, Dipartimento di Automatica ed Informatica, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
b Intellisemantic s.r.l., Via Giaglione 7, 10126 Torino, Italy
* Corresponding author.
0172-2190/$ - see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.wpi.2009.05.008
Keywords: Information retrieval; Patent informatics; Semantic elaboration; Review
Abstract
Information and meta-information related to national and international patents is a critical asset for every innovative company. The complexity of managing, searching, analyzing and relating such information to the needs of the company, in the different user tasks, is tackled by innovative knowledge management solutions that aim at supporting the users in such daunting tasks.
This paper aims at presenting a comprehensive and updated overview of patent information and of innovative solutions in patent informatics, in particular concerning intelligent and semantic solutions proposed in recent years. The analysis starts from the actual requirements of different types of users of patent information, and the typical information management tasks they require. Innovations, covering all the layers from data bases to algorithms to on-line services, are also critically presented and compared, and current research trends are outlined.
© 2009 Elsevier Ltd. All rights reserved.
1. Introduction
Over 50 million patents, scattered across national or regional databases, possibly integrated by accessory databases and up-to-date information and commentaries from the Web, form the corpus for patent search systems. The application context, the type of information, the available sources and the roles of the main actors are only a subset of the many aspects that need to be considered during the lifecycle of patent informatics solutions.
This paper provides an overview of the different issues related to patent information management and exploitation from the point of view of industrial stakeholders. Starting from the nature and structure of patent documents, their lifecycle and the related standardization efforts, the paper introduces patent databases and related information, patent users' roles and tasks, patent search tools and benchmarks, and state-of-the-art patent informatics applications based on semantics. For most aspects several research threads are outlined, both for developers and scientists, and critical challenges are highlighted.
Special emphasis is given to the increasing variety of users that can benefit from easy access to patent information. Current patent users are not only patent domain experts, but include new occasional actors such as managers, industrial researchers, academic faculty, and so on, each needing a different set of functionalities and a different degree of application complexity.
Requirements on patent informatics applications vary depending on the type of user and on the tasks that need to be accomplished, ranging from patent analysis to patent search and patent monitoring. Each of these activities is treated separately and the relevant needs and expectations are analyzed to define the set of issues that patent informatics solutions must successfully address. Particular attention is devoted to patent search tools, and to the application of information retrieval methods, either traditional or semantic, to the patent domain.
Problems deriving from the historical lack of patent-specific benchmarks are analyzed and current initiatives aiming at filling this gap are presented.
Finally, in Section 9, an overview of the characteristics of the main patent informatics applications based on semantics is provided. Results from the academic literature, from research projects as well as from commercial providers are considered, analyzing solutions from several points of view: application functionality, characteristics, and, especially, adopted technologies.
The paper does not aim at being exhaustive but at providing a neutral overview of the domain complexity and of the many research challenges in this exciting, but relatively unexplored application field.
Table 1. Patent lifecycle.
Phase | When | Information disclosed
Patent application | Filing date | Some metadata
Patent application – published | 18 months after the filing date | Full text and metadata
Granted patent | 2 years or more after the filing date | Amended full text and revised metadata
Expired or ceased patent | 20 years after the filing date (or before in case of unpaid fees) | Amended full text and revised metadata
Table 2. Patent-related information sources.
Information sources | Information provided | Degree of interest
Free portals of national or regional patent offices | Patent text and metadata | Always of interest, even in combination with complementary portals
Vendor-provided patent databases | Patent text and metadata | Business critical decisions
Vendor-provided extended content of patent information | Patent mutual citations, patent augmented abstracts | Useful
Patent legal databases | Patent legal status and events (e.g., litigations) | Always significant
Scientific literature portals | Scientific papers | For identifying the scientific background
Business databases | Company data and affiliations | For integrating the business background information
News | Business news | For integrating the business background information
Internet | Company portals, sector portals | For low cost identification of the scientific or business background
Web 2.0 solutions | Wiki and blogs for patents | Also for supporting the cooperative work
2. Patents as information sources
To fully exploit the information embedded in patent documents, all the available literature must be considered, including expired and unexpired patents, technical and scientific papers, and so on, which requires a great effort in information management. Furthermore, as shown in Table 1, the amount, the nature and the quality of the available information (last column) change significantly during the patent lifecycle.
Patents play a strong role in information dissemination, allowing the distribution of up-to-date, validated, technical and scientific information. Some studies report that a significant part of the information available in patents is actually unique and cannot be found elsewhere; see for example chapter 4 in [6], or the USPTO "Eighth Technology Assessment and Forecast Report. Section II: The uniqueness of patents as a technological resource", which concludes (p. 37): "The results of this study show that about 8 out of 10 US patents contain technology not disclosed in the non-patent literature." Patent information is also typically more detailed and exhaustive than scientific papers. International standards, e.g. those of WIPO,¹ impose the sequence of sections and metadata to which a patent must conform. The standard patent structure encompasses a front page, the description, the claims, the drawings (optional), and the international search report, if available.
Patent documents show several peculiarities that must be taken into account when applying automated information processing techniques: they are usually longer than ordinary papers, with significant length variance; they adopt different writing styles in different parts; they often include multimedia data, typically drawings or mathematical or chemical formulas, which require specific analysis and classification algorithms.
Patents are provided with standard metadata: title, dates and document numbers for the application, publication and grant, list of applicants and inventors, patent knowledge domain (field and/or subfield, classified against a standard taxonomy). Metadata are part of the patent document or can be added by patent offices. Classification of the patent domain, for example, is accomplished by patent offices in a consistent way and is key information for searching patents across nations.
To better support worldwide patent searches, the World Intellectual Property Organization defines and continuously maintains and updates a standard taxonomy for patent classification, named International Patent Classification (IPC) [2]. Updates are provided every 3 months for the deepest levels of the taxonomy and every 3 years for the core classification. This evolution rate is motivated by the need to include and properly classify new, emerging technical fields, as recently happened with nanotechnology. The 8th IPC version has been available on-line since 2006 and is adopted as a lingua franca by more than 100 nations worldwide. Major patent offices including the European Patent Office² (EPO), the United States Patent and Trademark Office³ (USPTO) and the Japan Patent Office⁴ (JPO) also define and use specific metadata complementary to the IPC, to include additional information and to support more advanced functionalities. These are named respectively ECLA (European Patent Classification System), USPC (US Patent Classification System), and FI (File Index) with its extension F-terms, both used by the JPO.
¹ The World Intellectual Property Organization (WIPO), http://www.wipo.int/.
² http://www.epo.org/.
IPC patent classification opens several research possibilities and applications, such as:
• the design of tools to help patent offices in classifying applications and grants according to the IPC;
• the design and development of solutions for enabling patent searchers to exploit the full potential of the IPC classification;
• the integration of textual information and IPC metadata for patent search and for patent clustering;
• the extraction and integration of information from multiple classification systems such as the EPO, the JPO and the USPTO internal classifications.
3. Patent databases and related information
As far as public access to patent information is concerned, patent documents are usually organized into databases that differ in content and available search methods [1]. Content is characterized by coverage in space and time and by the completeness of the provided documents (e.g., full text or abstract). Search methods can span from simple Boolean searches to advanced information retrieval approaches, each carrying a different degree of complexity and a different performance in the retrieval process. In addition to document completeness and search effectiveness, document formats must also be taken into account: older patents are usually available only as images (no OCR done in the electronic scan process), while more recent documents are available in an electronically readable form, often generated by automatic OCR processes that can introduce errors, thus affecting the quality of the single document and, as a consequence, of the whole database. Only for documents written in the last few years is some patent information available in XML [3].
³ http://www.uspto.gov/.
⁴ http://www.jpo.go.jp/.
Table 3. Patent search tasks.
Task | When | Why | What | Focus
Patentability search | Writing a new patent application | To construct claims not affected by the prior art | A new patentable idea, with respect to the state-of-the-art | Specific
Validity (invalidity) search | To defend a patent application or to litigate a competitor's patent | To validate or invalidate a patent on the basis of its claims | A patent against the state-of-the-art at the application time | Specific
Infringement search (freedom to operate search) | Before launching a product on the market | To verify that a product can be commercialized in a given market | A product against patents still holding in the selected market | Specific
Technology survey | Business planning | For better focusing on present business or to analyze new business opportunities | Patents, scientific and technical publications in a given technology area | Broad
Portfolio survey | Business planning | To identify the technical portfolio of different players | Patents, scientific and technical publications in a given technology area | Broad
Patent databases can be either free, as is the case for many national patent offices that provide free access to their collected databases, or subject to a fee payment, as done by many commercial vendors,⁵ which use the background information provided by national offices and integrate it with other information, offering more accurate and customized search facilities. For example, many commercial providers integrate patent-related information such as mutual citations and abstracts selected, revised and integrated by a skilled editorial staff (e.g., The World Patent Index,⁶ recently merged with The Patent Citation Index).
Independently from their business specificity, free and commercial databases are subject to errors, including errors originated by OCR recognition failures and wrong metadata assignments or inconsistencies. Typical examples are transcription errors for company and inventors' names, especially concerning Far East languages. All these errors [4,5] contribute to making the search process more difficult and must be taken into account when providing suitable instruments to work on patent databases.
Core patent information is often not sufficient to support patent information users in their activities (see Table 2). Therefore it may be necessary to integrate patent information with information from other, complementary databases such as legal patent databases, containing the legal history of a patent, and research papers databases, for accessing referenced knowledge. Moreover, modern dynamic business scenarios require integration of more responsive information sources like forums and blogs dealing with patent issues (both general and specialist), or business portals, with reports and news, thus allowing patents to be contextualized in their specific knowledge domain. Web 2.0 mash-ups can ease the process and make the integration more efficient.
A significant initiative, in this context, is the Peer-To-Patent portal,⁷ a community blog which gathers comments and reviews from volunteer patent experts involving specific technology areas, mainly related to Information Communication Technology (ICT). The initiative aims at relieving US patent examiners from the heavy backlog of waiting applications. Peer-To-Patent is both a warning signal of the increasing patent offices' overload and a revealing sign of forthcoming innovations in this field.
4. Patent information users and related tasks
Users of patent information are interested in both core and complementary information, with a balance that varies according to their specific goals and roles. Companies and inventors wishing to file a new patent are interested in verifying that the invention is actually new, with reference to the current state-of-the-art. At the same time, they are interested in discovering infringements of their granted patents. Researchers are interested in finding patent information to avoid duplicating solutions already covered by patents and/or to freely reuse expired patents. Managers can exploit patent information to assess competitors, partners and suppliers, and to identify technology trends and new business opportunities. Finally, venture capitalists and investors can leverage patent-related information to select the targets of their financial operations, while third party resellers can benefit from patent information when selecting their suppliers.
⁵ A directory of vendors is available at http://www.piug.org/vendors.php.
⁶ http://www.thomsonreuters.com/products_services/scientific/DWPI.
⁷ http://www.peertopatent.org.
The tasks usually pursued by patent information users can be roughly subdivided into three main classes: patent search, analysis and monitoring. Table 3 summarizes the most common patent search tasks [1,6,7]. The first three rows report more standardized and specific operations while the last two deal with technology survey and portfolio survey, which are broader and more variable in required depth and effort, depending on specific business requirements (e.g., allowed times and costs).
Patent analysis can be further subdivided into two broad categories related respectively to micro- and macro-analysis. Micro-analysis involves a single patent document, while macro-analysis is about a patent portfolio. Table 4 shows some of the most common analysis tasks [6,8].
Analysis tasks can also be categorized on the basis of their underlying reasons: some analyses are motivated by business needs, e.g., Intellectual Property (IP) evaluation, and others by technical reasons. IP evaluation exploits some available data, such as the patent family size and the number of citations, for estimating the patent value; research is still active in this field as demonstrated by the related literature [9]. Patent maps, on the contrary, are more oriented to technical analysis, summarizing the specific problems addressed by the patent and the respective solutions in a single, two-dimensional matrix. For example, a patent in speech recognition can reduce the computation required for coding the voice (solved problem) by adopting a suitable pre-processing (approach). Patent maps can be effectively evaluated by leveraging the classification system defined by the JPO with F-terms [10].
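The two-dimensional patent map described above can be sketched as a small matrix indexed by solved problem and approach. The following Python fragment is only an illustrative sketch: the patent identifiers and the problem and approach labels are hypothetical, and in practice such labels would come from manual analysis or from classification data such as F-terms.

```python
from collections import defaultdict

# Hypothetical (patent id, solved problem, approach) triples, invented for illustration.
ANNOTATIONS = [
    ("P1", "reduce computation", "pre-processing"),
    ("P2", "reduce computation", "model compression"),
    ("P3", "improve noise robustness", "pre-processing"),
    ("P4", "reduce computation", "pre-processing"),
]


def build_patent_map(annotations):
    """Group patent ids into cells indexed by (problem, approach)."""
    cells = defaultdict(list)
    for patent_id, problem, approach in annotations:
        cells[(problem, approach)].append(patent_id)
    return dict(cells)


def render(patent_map):
    """Render the map as text: one row per problem, one column per approach."""
    problems = sorted({problem for problem, _ in patent_map})
    approaches = sorted({approach for _, approach in patent_map})
    lines = ["problem | " + " | ".join(approaches)]
    for problem in problems:
        counts = [str(len(patent_map.get((problem, a), []))) for a in approaches]
        lines.append(problem + " | " + " | ".join(counts))
    return "\n".join(lines)


patent_map = build_patent_map(ANNOTATIONS)
print(render(patent_map))
```

Each cell holds the patents that address a given problem with a given approach, so empty cells hint at combinations not yet covered by the analyzed portfolio.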
Temporal information can be included in the analysis to identify technical trends and competitors' strategies, allowing the proper countermeasures to be selected. Patent information can also be used as background for the derivation of new inventions, as done by the TRIZ⁸ method. In this technique, structured analysis of patent documents and of the problem to be solved allows new inventions to be inferred by analogy.⁹
⁸ TRIZ, initially devised by Altshuller, is a Russian acronym for "Theory of Inventor's Problem Solving". For further information please refer to the European portal http://etria.net/portal.
⁹ However, this method is not fully appropriate for deriving fundamental and/or disruptive innovations.
Table 4. Patent analysis tasks.
Task | When | Why | What | Focus
Micro patent analysis – business value | Assessing internal or third party patent | IP evaluation | Some indicators such as the family size and the number of citations are related to the patent business value | Document specific
Micro patent analysis – technical detail | Assessing internal or third party patents | Patent map | Identify solved problems and approaches used by a patent | Document specific
Macro patent analysis – business value | Assessing internal or third party portfolio | IP evaluation | The business value of a patent portfolio can be inferred by the values of the single patents and by the patents' distribution | Broad
Macro patent analysis – technical detail | Assessing internal or third party portfolio | Patent map | Identify solved problems and approaches used by a patent portfolio | Broad
Macro patent analysis – technical trends | Business planning | Identify competitors trends and actions | Time evolution of competitors' patent maps | Broad
Macro patent analysis – technical trends | Business planning | Identify technical trends | Time evolution of technology patent maps | Broad
Macro patent analysis – technical suggestions | Solving new research issues | Suggestions for new inventions (TRIZ method) | Patent mining is used in conjunction with the TRIZ method | Broad
Table 5. Patent monitoring tasks.
Task | When | Why | What
Early sign monitoring | Continuously | Intelligence | New patent applications as available, with incomplete data possibly integrated by other information
Single patent monitoring | Continuously | IP proactive management against specific third party patents | Patent extensions in new countries, changes in the legal status, new litigations, etc.
Technology monitoring | Continuously | Technology intelligence | New patents and applications in specific domains
Portfolio monitoring | Continuously | Competitor intelligence | New patents, and applications, of specific players
Patent monitoring tasks (Table 5 reports some examples) are used to regularly update users about new patent information in the domains of interest they specify. These tasks are often implemented by means of software agents that act on behalf of the user, according to a properly defined interest profile and update frequency. Monitoring tasks can also be done manually; in this case operations are more flexible but definitely more costly. The 18-month protection period between the patent application date and the disclosure of the patent full text and of some additional metadata adds further latencies to the process, which can be mitigated by monitoring incomplete application data (such as the applicant names or the application title) integrated by other information, for example something that can be inferred from the list of applicants.¹⁰
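A monitoring agent of the kind described above can be sketched as a filter that compares newly published, possibly incomplete application records against a user interest profile. This is a minimal sketch under stated assumptions: the `InterestProfile` structure, the record fields and all names and data are hypothetical, and a real agent would also handle scheduling, persistence and access to the patent database.

```python
from dataclasses import dataclass, field


@dataclass
class InterestProfile:
    """Hypothetical user profile: watched applicants and title keywords."""
    applicants: set = field(default_factory=set)
    keywords: set = field(default_factory=set)


def matches(profile, application):
    """Return True if the partial application data matches the profile.

    `application` is a dict holding the early, incomplete fields mentioned
    in the text: applicant names and a (possibly empty) title.
    """
    applicants = {name.lower() for name in application.get("applicants", [])}
    if applicants & {name.lower() for name in profile.applicants}:
        return True
    title_words = set(application.get("title", "").lower().split())
    return bool(title_words & {word.lower() for word in profile.keywords})


def monitor(profile, new_applications):
    """Return the applications the user should be alerted about."""
    return [app for app in new_applications if matches(profile, app)]


# Invented example data.
profile = InterestProfile(applicants={"Acme Electrodes"}, keywords={"electrolysis"})
batch = [
    {"id": "A1", "applicants": ["ACME Electrodes"], "title": ""},
    {"id": "A2", "applicants": ["Other Corp"], "title": "Improved electrolysis cell"},
    {"id": "A3", "applicants": ["Other Corp"], "title": "Bicycle frame"},
]
alerts = monitor(profile, batch)
print([app["id"] for app in alerts])
```

Matching on applicant names alone lets the agent raise an alert even when the title is not yet available, which corresponds to the early sign monitoring task in Table 5.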
5. Looking for patent information
Patent search and analysis are two business critical, time intensive tasks that can be addressed either manually or automatically. Professionals currently adopt the manual approach, supported by automatic tools in some steps of the whole evaluation procedure. Automatic and manual processes can be roughly subdivided into two phases: query and result evaluation (summarized in Table 6).
Queries can be expressed either through metadata (e.g., applicant, year, IPC) or full text. Metadata queries on the applicant and year are straightforward from an algorithmic point of view, but they are made more complex by at least two factors: constantly changing company names, due to mergers and demergers, and unreliable translations of Far East names of companies and persons. In such cases additional databases with company history, or fuzzy name matching, may help.
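Fuzzy name matching, mentioned above as a possible aid for applicant queries, can be illustrated with the `difflib.SequenceMatcher` class from the Python standard library. This is only a sketch: the company names are invented, and the list of legal suffixes and the 0.8 threshold are assumptions that would need tuning on real data.

```python
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"ltd", "inc", "corp", "co", "srl", "gmbh"}


def normalize(name):
    """Lower-case a name, drop punctuation and common legal suffixes."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(tok for tok in cleaned.split() if tok not in LEGAL_SUFFIXES)


def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_applicant(query, known_names, threshold=0.8):
    """Return known names similar enough to the query, best match first."""
    scored = [(similarity(query, name), name) for name in known_names]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


# Invented applicant names, including a variant transliteration.
known = [
    "Example Denki Kogyo Co., Ltd.",
    "Example Denki Kougyou",
    "Sample Heavy Industries Corp.",
]
results = find_applicant("Example Denki Kogyo", known)
print(results)
```

A character-based ratio tolerates small transliteration differences, but it cannot detect renamed or merged companies; for those cases a company history database, as suggested in the text, remains necessary.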
IPC metadata search is also quite complex and it is usually accomplished by using dictionaries¹¹ (for finding suitable search keywords) or with query-by-example approaches, e.g., by searching patents having IPC descriptions similar to the IPC classification of a prototype patent, whose membership to a given IPC is known "a priori".
¹⁰ Enrico Ramunni, IP manager at De Nora, private communication.
¹¹ As in http://www.wipo.int/tacsy.
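The query-by-example approach can be sketched by ranking the patents of a collection according to the overlap between their IPC symbols and those of a prototype patent. The sketch below uses the Jaccard index on IPC subclasses as similarity measure, which is an assumption made here for illustration and not a method prescribed by the cited sources; the patent identifiers and symbol assignments are invented.

```python
def subclass(ipc_symbol):
    """Return the subclass part of an IPC symbol, e.g. 'G10L' for 'G10L 15/02'."""
    return ipc_symbol.split()[0]


def ipc_similarity(symbols_a, symbols_b):
    """Jaccard similarity between the IPC subclass sets of two patents."""
    a = {subclass(s) for s in symbols_a}
    b = {subclass(s) for s in symbols_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def query_by_example(prototype_symbols, collection, top_k=2):
    """Rank patents in `collection` by IPC similarity to the prototype."""
    ranked = sorted(
        collection.items(),
        key=lambda item: ipc_similarity(prototype_symbols, item[1]),
        reverse=True,
    )
    return [patent_id for patent_id, _ in ranked[:top_k]]


# Invented collection: patent id -> assigned IPC symbols.
collection = {
    "P1": ["G10L 15/02", "G06F 17/30"],
    "P2": ["G10L 15/22"],
    "P3": ["A61K 31/00"],
}
prototype = ["G10L 15/02"]
ranking = query_by_example(prototype, collection)
print(ranking)
```

Comparing at subclass level is a deliberately coarse choice; a finer comparison could also weight matches at group and subgroup level.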
Text queries, conversely, can be done with different levels of complexity, from simple word occurrence and Boolean matching to full information retrieval techniques, depending on the patent database functions and on search requirements. Due to the heterogeneous structure of patent documents it may be useful to identify the sections to be searched, e.g., claims, as well as to navigate between referenced patents combining different search modalities.
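The simplest end of this spectrum, Boolean matching restricted to a chosen section, can be sketched as follows. The patent records and their section names are hypothetical; real systems would rely on an inverted index rather than scanning every document.

```python
import re


def words_in(text):
    """Return the set of lower-cased alphanumeric tokens in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_search(patents, all_of=(), none_of=(), section="claims"):
    """Ids of patents whose `section` contains every term in `all_of`
    and no term in `none_of` (AND / NOT semantics)."""
    hits = []
    for patent_id, sections in patents.items():
        words = words_in(sections.get(section, ""))
        if all(t in words for t in all_of) and not any(t in words for t in none_of):
            hits.append(patent_id)
    return hits


# Invented patent records split into sections.
patents = {
    "P1": {
        "description": "A method for speech coding.",
        "claims": "A speech coding method using pre-processing.",
    },
    "P2": {
        "description": "Speech recognition device.",
        "claims": "A device comprising a microphone.",
    },
}

in_claims = boolean_search(patents, all_of=("speech", "coding"))
in_description = boolean_search(patents, all_of=("speech",), section="description")
without_coding = boolean_search(
    patents, all_of=("speech",), none_of=("coding",), section="description"
)
print(in_claims, in_description, without_coding)
```

The same query gives different results depending on the section searched, which is why the choice of section matters for patent documents.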
Most patent searches, e.g., patentability and validity, are highly business-sensitive and missing relevant documents would be unacceptable. Obviously, high relevance of search results is also required to improve the overall efficiency of the search process. In order to achieve these two contrasting goals, patent professionals often adopt an iterative search paradigm that can be supported by automatic tools, e.g., tools for expanding synonyms in text searches, tools for refining search results, etc. In addition, some patent retrieval tasks can benefit from other functions (not yet integrated in search tools) such as speeding up the evaluation of single patents by identifying the focus of each sentence, translating patent documents, etc.
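One step of such an iterative search, synonym expansion, can be sketched as follows. The thesaurus and the documents are invented for illustration; a real tool would draw synonyms from a domain thesaurus or from the searcher's own refinements.

```python
# Hypothetical thesaurus, invented for illustration only.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "engine": {"motor"},
}


def expand(terms, thesaurus):
    """Turn each query term into a group holding the term and its synonyms."""
    return [{term} | thesaurus.get(term, set()) for term in terms]


def search(term_groups, documents):
    """Ids of documents containing at least one word from every group (AND of ORs)."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(group & words for group in term_groups):
            hits.append(doc_id)
    return hits


documents = {
    "D1": "car engine with turbocharger",
    "D2": "automobile motor with turbocharger",
    "D3": "bicycle frame",
}

query = ["car", "engine"]
first_pass = search(expand(query, {}), documents)         # literal terms only
second_pass = search(expand(query, SYNONYMS), documents)  # after synonym expansion
print(first_pass, second_pass)
```

The second pass retrieves a document that the literal query missed, improving recall; a searcher would then inspect the enlarged result list and refine the query again if precision drops.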
Patent analysis usually has different requirements from patent search, as its goal is to obtain relevant patents and to analyze them as an aggregate. Consequently, the query phase can use simple metadata-based criteria and the result evaluation phase can be effectively supported by automatic tools, integrating other types of information like the patent family size, or the distribution of patents in force for the given topic, or the year and the applicant, and so on.
6. Systems characterization and benchmarks
Patent search has traditionally been treated as part of the wider information retrieval (IR) [11] research area, but in IR the mainstream research efforts have been more concentrated on general purpose search systems able to deal with any kind of documents and queries, neglecting the specificity of patents. Conversely, patent retrieval has been much more investigated by the database research community than by information retrieval researchers. At the basis of this choice there is also the lack of test collections targeting patent information.
Table 6. Tasks and phases in patent retrieval.
Phase | Task: patent search | Task: patent analysis
Query | Search with advanced criteria, including text search. Highly business-sensitive results. Recall is of paramount importance while precision is welcome for efficiency. Iterative search is used to balance precision and recall figures | Simpler data/metadata. Also noise in the result list can be of interest. Iteration may be motivated by exploratory search
Results evaluation | To rank and re-order the list of results, to support further iteration cycles or to speed up the analysis of a single patent | Important for automatically integrating results with additional data and to post-process the result list with different presentation criteria
Although some patent documents were included in the TREC¹² test collections, until a few years ago no domain specific collections were available. From 2001, the NTCIR¹³ workshop introduced a specific track on patent search, called Patent Retrieval Collection, using a collection of patent documents and abstracts extracted from 2 years of unexamined Japanese patent applications published between 1998 and 1999.
Patent search processes differ significantly depending on the type and purpose of retrieval, such as technology survey or invalidity search. Each retrieval type deserves a customized approach, as stated in the NTCIR-3 task description [12]. For example, in invalidity search, professional users search at least 5 years' worth of patents for applications conflicting with the patent under examination. Instead, in technology survey, patents are searched for deriving a general overview of the current state-of-the-art [13]. NTCIR-3 focuses more on the latter search modality and uses, as base scenario, the case in which a manager wants to know the patent landscape related to a newspaper clip. This patent search is referred to as cross-genre or cross-database retrieval, as information coming from a newspaper is used to query a patent collection for documents relevant to the query.
As the information retrieval tasks related to patent collections are so diverse and multi-faceted, it is worth investigating whether traditional IR models can be applied to patent retrieval and what kind of evaluation measures can be adopted. In fact, experiments done by the committee that developed the NTCIR-3 benchmark [14], on a technology survey task, have shown that typical IR algorithms can be effectively applied to patent document retrieval, with some particular cautions.
Classical IR quality figures include the Precision (1), which measures the proportion of retrieved documents that are relevant, and the Recall (2), which measures the proportion of relevant documents that have been retrieved.
\[
\mathrm{Precision} = \frac{|\mathrm{RelevantDocuments} \cap \mathrm{RetrievedDocuments}|}{|\mathrm{RetrievedDocuments}|} \tag{1}
\]
\[
\mathrm{Recall} = \frac{|\mathrm{RelevantDocuments} \cap \mathrm{RetrievedDocuments}|}{|\mathrm{RelevantDocuments}|} \tag{2}
\]
¹² http://trec.nist.gov/.
¹³ http://research.nii.ac.jp/ntcir/.
Such metrics can be applied to patent information retrieval [14], too, although the absolute figures and their relative importance may vary depending on the retrieval task. For example, while Precision is generally more important than Recall for invalidity search, both Precision and Recall are important for technology surveys. Experiments show [14] that the retrieval models that perform better do not significantly differ depending on the document genre (patents vs. other types of documents, such as research papers or newspaper articles).
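As a worked illustration of Eqs. (1) and (2), the following Python sketch computes both figures for a small, invented result list; the document identifiers are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Compute Precision (Eq. 1) and Recall (Eq. 2) for two collections of ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 4 relevant patents, 3 retrieved, 2 of them relevant.
relevant = {"P1", "P2", "P3", "P4"}
retrieved = {"P1", "P2", "P5"}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three retrieved patents are relevant (Precision of about 0.67), while only two of the four relevant patents are found (Recall of 0.5), which shows how a short, clean result list can still miss half of the relevant documents.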
In general, we may conclude that patent retrieval is a very specialized information retrieval task that benefits from state-of-the-art approaches in the IR field, but that requires proper adaptations of traditional IR models, depending on the target information need. Such adaptations include task specialization (e.g., IR models for patentability search differ from models for technology survey because of the different Precision vs. Recall trade-off requirements) and search model adaptation, since the single-query assumption at the basis of traditional models is not realistic in the patent domain [15], where an interactive refinement method is preferable. The domain-specific nature of the patent search process is widely recognized in the patent retrieval community, as demonstrated by the composition of the NTCIR-3 successors (NTCIR-7,¹⁴ etc.), where different patent retrieval tasks such as patentability search, invalidity search and technology survey are supported with dedicated benchmarks, suggesting that different algorithms should be selected, and tuned, for each patent-related sub-task.
7. Patent software challenges and opportunities
Patent information software deals with two major aspects: conveyed information and users, each implying different challenges and opportunities, outlined in Table 7.
Users can be of different categories, with different skills and needs: professional patent searchers typically prefer more advanced functionalities, with a higher degree of control on tool capabilities and freedom in setting search parameters, while occasional users (such as managers) often require an easy to use interface and simpler commands, with complexity hidden under a simplified front-end.
Information, on the other side, is perhaps the most challenging aspect due to the increasing diversity in patent languages, especially considering the unprecedented increase in patent applications originating from Asian countries such as Japan, China and South Korea. As a consequence, traditionally neglected issues, such as multilingualism, are becoming more and more important, together with related tasks like multilingual querying and automatic translation. In addition to new challenges, more traditional issues still remain, for example the quality of patent databases. Patent databases are affected by errors due to OCR processing of old paper documents, to wrong metadata associations, to inconsistencies between different metadata vocabularies, etc., and this can influence the results of a query.
In parallel with quality and multilingualism, integration is also becoming a very hot topic, since patents may only offer a partial view, due to the very nature and structure of patent documents. In fact, the dynamic nature of current business scenarios requires timely integration of heterogeneous information sources like news, blogs, forums and more traditional databases. Web 2.0 mash-up solutions can be of aid in this context.
¹⁴ http://www.nlp.its.hiroshima-cu.ac.jp/~nanba/ntcir-7/cfp-en.html.
Table 7. Patent software challenges.
Challenge | Why it is important | Opportunity
Different kinds of users | Patent information is no longer only for pure patent professionals, but also for R&D people, business analysts and managers | Different users require a different balance between direct control and ease of use
Different kinds of tasks | Patent information tasks are oriented to information retrieval, to patent analysis and to patent monitoring | To develop different kinds of patent tasks, tailored to users' needs
Task benchmarking | To assess different kinds of solutions or evolutions of the available approaches | To benefit as much as possible from patent processing byproducts such as rejected applications, translations, etc.
Different languages | The patent community is rapidly becoming truly multilingual | To benefit as much as possible from patent processing byproducts such as translations, etc.
Quality of databases | OCR processing can introduce errors, as can human elaboration of metadata | New software for patent application management can reduce the problem
Integration of different data sources | There is an emerging, increasing need to integrate heterogeneous information | Web 2.0 mash-ups can be crucial
Independence between software applications and data | There is an increasing demand for flexibility in adding analysis and monitoring procedures to simple patent data manipulation | Service Oriented Architectures can help
Table 8. Current research trends.
| Functional evolution | Why it is important | Some possible solutions |
| --- | --- | --- |
| Advanced search, including search for similar documents and search for similar drawings | To obtain more consistent results from queries, even if formulated by less experienced users, and to reduce the search time, also for expert users | Natural language processing and text and multimedia semantics can help to solve this issue |
| Support for iterative and exploratory search | To obtain better recall as well as better precision in the search phase, and also to improve the efficiency of technology and portfolio surveys | Different kinds of approaches can be used: more structured, based on faceted search, or more holistic, based on patent clusters |
| Multi-language query and automatic translation support | Fostered by the increasing importance of East Asian languages | Natural language and semantic processing (e.g., by abstracting patent key topics) |
| More robustness against errors in patent text and metadata | To avoid missing important documents | Post-processing databases or considering more robust search methods, including semantic ones |
| Support for single patent analysis, both in the technical and in the legal part | To provide more efficient and more consistent technical and legal analyses, which have different objectives and can be oriented to different user communities | Use of natural language and semantics-based solutions |
| Automatic support for patent classification | To have a more consistent classification from patent offices with reduced effort. To support advanced users in their internal classification | Natural language and semantic processing |
| Improved analysis and monitoring, including the identification of feeble signs and relationships | To provide more efficiency as well as better quality in patent-related intelligence | Integration of complementary news and information through mash-up solutions, use of semantics in bottom-up and top-down modes |
| Identification of better proxies for evaluating patent value | To support an increasing community of interest about patent information | Patent economics basic research and integration of complementary economic data, e.g., through Web Services and mash-up solutions |
Finally, there is an increasing need for automatic elaboration facilities for patent-related information, especially for patent analysis and patent monitoring tasks; Service Oriented Architectures (SOA) are suitable to provide the technical infrastructure. As an example, the European Patent Office has offered, since 2003, Web Service access to its internal patent database called esp@cenet®. This Open Patent Service (OPS) has been constantly evolving during the last years. The current version increases the number of functions offered through the Web Service interface, and allows software applications to programmatically access all the information and all the search services that are available through the interactive web site.15
8. Most significant recent evolutions of patent informatics
In response to the above-mentioned challenges, patent informatics is showing significant functional evolutions, ranging from the adoption of natural language processing (NLP) and semantics for automatic processing of information to the design of innovative and efficient user interfaces, and from the integration of information coming from less traditional sources (such as the World Wide Web) to the exploitation of hidden information in available documents. Table 8 reports some of these evolutions, their motivations and some possible solutions currently being investigated.
15 Detailed information can be found at http://forum.espacenet.com together with the final OPS specification (http://ops.espacenet.com/).
Traditional search methods, based on patent text and metadata, rely heavily on the skills of patent searchers and on the quality of patent document databases. However, the increasing amount of patent-related information, and the ever-growing need for less experienced users to access patent information, require the development of new search tools and methodologies, suitable for casual users and effective for experienced ones, e.g., enabling them to shorten search times. These goals can be partly achieved by improving database quality and by building error-tolerant query services. Nevertheless, new methods deserve particular attention as they can deliver innovative services such as search for similar documents and for similar pictures [16], e.g., by leveraging semantics (the meaning behind word forms or structured drawings), and they can more easily support multilingualism (as semantics may be defined in a language-independent way). In addition, patent searches often involve several iterations, which must be effectively supported by search tools. While traditional solutions poorly support this iterative nature of the search, new methodologies address it through either structured or holistic approaches. Faceted search, for example, is a well-structured solution that allows search results to be refined incrementally by exploiting automatically extracted tags. Patent clustering, conversely, is a more holistic approach where patent searches are refined by finding similarities between retrieved documents.
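The incremental refinement offered by faceted search can be illustrated with a minimal sketch. It assumes that a semantic indexer has already attached facet tags to each retrieved patent; the patent identifiers, facet names and values below are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical result list: each patent carries facet tags that an
# automatic indexer is assumed to have extracted beforehand.
RESULTS = [
    {"id": "P1", "facets": {"technique": "linear prediction", "application": "telephony"}},
    {"id": "P2", "facets": {"technique": "transform coding", "application": "telephony"}},
    {"id": "P3", "facets": {"technique": "linear prediction", "application": "storage"}},
    {"id": "P4", "facets": {"technique": "transform coding", "application": "streaming"}},
]


def facet_counts(results, facet):
    """Count how many results fall under each value of one facet."""
    return Counter(r["facets"][facet] for r in results if facet in r["facets"])


def refine(results, facet, value):
    """Keep only the results whose facet carries the chosen value."""
    return [r for r in results if r["facets"].get(facet) == value]


# Two refinement iterations over the same coarse result list.
step1 = refine(RESULTS, "technique", "linear prediction")
step2 = refine(step1, "application", "telephony")

print(facet_counts(RESULTS, "technique"))
print([r["id"] for r in step1], [r["id"] for r in step2])
```

Each call to `refine` narrows the previous result set rather than issuing a new query, which is the iterative behavior the structured approach is meant to support.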
Patent analysis tasks also require innovative solutions tailored to their different goals. Analysis goals differ between technical analysis, which is aimed at producing patent maps, and claims analysis, which is aimed at restructuring the legal parts of patent documents (i.e., the claims). Semantic solutions can be suitable for both. Similar approaches can also address multilingualism, in particular by identifying the main topics in patent documents written in different languages, which is a more tractable problem than full translation and still retains added value for patent professionals.
Analysis and monitoring of patent-related information require both integration of complementary data, through Web Services and Web 2.0 solutions, and discovery of hidden facts and relationships by means of natural language processing and exploitation of semantics.
Finally, the evaluation of a patent, or of a patent portfolio, although currently addressed by several tools [17,18], requires further exploration to find better quality proxies and evaluation metrics [16].
16 http://www.w3.org/TR/owl-features/.
17 Semantic Web activities at W3C, http://www.w3.org/2001/sw.
9. Overview of semantic-based solutions
Semantic solutions are good candidates for supporting some of the above-cited evolutions, as they work on word meaning rather than on mere word occurrence or frequency counts. In general, a semantic-based solution is expected to improve the recall of the search process while keeping precision constant, or increasing it. This can be achieved by properly handling synonyms and homonyms, and by providing cross-lingual services, which are easily supported by ontology-powered systems. In addition, the underlying semantic models that these systems need in order to work enable more advanced search and filtering functionalities in query post-processing, and can support the identification of hidden relationships between documents.
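The expected effect on recall can be made concrete with a small sketch that compares plain keyword matching against a search expanded through a synonym set. The documents, the synonym set and the relevance judgments are invented for illustration and do not come from any real patent collection.

```python
# Hypothetical mini-thesaurus: concept name -> surface forms (synonyms).
SYNONYMS = {
    "vehicle": {"car", "automobile", "vehicle"},
}

# Hypothetical documents and relevance judgments for the query "car".
DOCS = {
    "D1": "a car with a hybrid engine",
    "D2": "automobile braking system",
    "D3": "bicycle braking system",
}
RELEVANT = {"D1", "D2"}


def keyword_search(query, docs):
    """Return documents containing the exact query word."""
    return {doc_id for doc_id, text in docs.items() if query in text.split()}


def expand(query):
    """Replace a query word by all the surface forms of its concept."""
    for forms in SYNONYMS.values():
        if query in forms:
            return forms
    return {query}


def semantic_search(query, docs):
    """Return documents containing any synonym of the query word."""
    terms = expand(query)
    return {doc_id for doc_id, text in docs.items() if terms & set(text.split())}


def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


lexical = keyword_search("car", DOCS)
semantic = semantic_search("car", DOCS)
print(recall(lexical, RELEVANT), precision(lexical, RELEVANT))
print(recall(semantic, RELEVANT), precision(semantic, RELEVANT))
```

In this toy setting the expanded query retrieves the document that uses "automobile" instead of "car", so recall rises from 0.5 to 1.0 while precision stays at 1.0. Handling homonyms would require the opposite operation, namely telling apart different concepts that share one surface form, which this sketch does not attempt.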
Semantic solutions are based on knowledge bases, i.e., on the union of a knowledge-domain model (ontology, thesaurus or taxonomy) and of domain-specific data. Ontologies, thesauri and taxonomies have several features in common:
• they are approaches to help structure, classify, model, and/or represent the concepts and relationships pertaining to some subject matter of interest to some community;
• they are intended to enable a community to come to agreement and to commit to use the same terms in the same way;
• the meaning of the terms is specified in some explicit way and to some degree;
• the explicit mapping to conceptual models eases interoperation across different languages, as long as the relevant taxonomies are translated.
However, they differ in how they formalize the meaning of terms, and they have different notations and different goals:
• a taxonomy describes a knowledge domain through a hierarchy of relevant entities. The hierarchy can be built on the basis of a suitable binary, transitive, asymmetric relationship such as parent-child (most common), part-of, instance-of, etc.;
• a thesaurus represents a knowledge domain by using a set of terms (without definitions) related to each other by means of hierarchical, equivalency and associative links. Whenever a term is ambiguous, a “scope note” may be added to allow disambiguation of the intended meaning;
• an ontology defines a set of representational primitives with which to model a domain of knowledge or discourse [19,20]. The representational primitives are typically classes (or sets), attributes (or properties), and relationships (or relations among class members). The definitions of the representational primitives include information about their meaning and constraints on their logically consistent application.
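The three kinds of models listed above can be contrasted with a minimal data-structure sketch. All terms, class names and relation names are hypothetical, and the "ontology" part is reduced to a single domain/range constraint; real ontology languages such as OWL are far richer.

```python
from dataclasses import dataclass, field

# Taxonomy: a hierarchy stored as child -> parent links (hypothetical terms).
TAXONOMY = {"diesel engine": "engine", "engine": "machine", "pump": "machine"}


def ancestors(term, taxonomy):
    """Follow parent links; transitivity makes every ancestor a broader class."""
    found = []
    while term in taxonomy:
        term = taxonomy[term]
        found.append(term)
    return found


# Thesaurus: terms without definitions, connected by three kinds of links.
@dataclass
class ThesaurusEntry:
    term: str
    broader: list = field(default_factory=list)   # hierarchical links
    use_for: list = field(default_factory=list)   # equivalency links
    related: list = field(default_factory=list)   # associative links
    scope_note: str = ""                          # disambiguates the term


# Ontology: classes plus relations whose use is constrained.
@dataclass(frozen=True)
class Relation:
    name: str
    domain: str      # class allowed as subject
    range_cls: str   # class allowed as object


RELATIONS = {"hasClaim": Relation("hasClaim", "Patent", "Claim")}


def is_consistent(subject_cls, relation, object_cls):
    """Check a statement against the relation's domain and range constraint."""
    rel = RELATIONS[relation]
    return rel.domain == subject_cls and rel.range_cls == object_cls


crane = ThesaurusEntry(
    term="crane",
    broader=["lifting equipment"],
    use_for=["derrick"],
    related=["hoist"],
    scope_note="The lifting machine, not the bird.",
)
print(ancestors("diesel engine", TAXONOMY))
print(is_consistent("Patent", "hasClaim", "Claim"))
```

The taxonomy only answers "what is broader than this term", the thesaurus adds equivalence and association plus a scope note, and the ontology can additionally reject statements that violate its constraints.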
Ontologies can be classified as upper-level, middle-level, domain, document and linguistic [21]. Upper-level ontologies aim at describing very general concepts that are the same across all knowledge domains [22]; middle-level ontologies are more domain-specific than upper-level ontologies and provide formal definitions for entities typical of a restricted field of knowledge. Domain ontologies are targeted at a restricted knowledge area, e.g., patent information, and define the finest granularity with which the meaning of a domain entity can be described. Document ontologies define the structure of documents of a given knowledge domain, e.g., the fact that patent documents have a technical and a legal part.
Finally, linguistic ontologies (e.g., WordNet [23]) provide the information needed to bridge linguistic resources (i.e., the text of documents) with their conceptual counterparts (i.e., hierarchically arranged sets of concepts, with semantic relationships linking them). Linguistic ontologies also play a role in multi-language patent processing: if the relevant keywords are mapped to a shared ontology, then the patent classification becomes language-neutral, as it is bound to a conceptual, rather than lexical, representation. Language crossing through semantic processing is quite successful for western languages, while for eastern languages satisfactory solutions have yet to be found [31].
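The language-neutral representation described above can be sketched as a lookup from language-specific keywords to shared concept identifiers. The lexicon, the concept identifiers and the patent snippets are hypothetical; a real system would rely on a linguistic ontology and proper tokenization rather than on whitespace splitting.

```python
# Hypothetical lexicon: (language, keyword) -> shared concept identifier.
LEXICON = {
    ("en", "engine"): "C:Engine",
    ("it", "motore"): "C:Engine",
    ("de", "motor"): "C:Engine",
    ("en", "brake"): "C:Brake",
    ("it", "freno"): "C:Brake",
}

# Hypothetical patent snippets in three western languages.
PATENTS = {
    "EN-1": ("en", "Engine with improved brake"),
    "IT-1": ("it", "Motore a combustione"),
    "DE-1": ("de", "Motor und Getriebe"),
}


def to_concepts(lang, text):
    """Map the known keywords of a text to language-independent concepts."""
    words = text.lower().split()
    return {LEXICON[(lang, w)] for w in words if (lang, w) in LEXICON}


def search(lang, query):
    """Match query and documents on concepts instead of on word forms."""
    wanted = to_concepts(lang, query)
    return sorted(
        pid
        for pid, (plang, text) in PATENTS.items()
        if wanted & to_concepts(plang, text)
    )


print(search("it", "motore"))
print(search("en", "brake"))
```

Because matching happens on concept identifiers, an Italian query retrieves English and German documents without any translation of the full text.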
Ontologies can be integrated in applications by adopting several paradigms, e.g., they can be embedded without leaving any possibility of modification by the end user or, instead, they can be exposed to end users through proper interfaces, allowing them to edit, refine or drop concepts inside the knowledge model.
Semantics and related processes are not completely new in the patent informatics domain: for example, professional patent searchers already tend to use synonym dictionaries for expanding their queries, and the IPC classification itself is actually a domain-specific ontology (or, better, a thesaurus), although not expressed in a standard Semantic Web language such as OWL.16 However, until a few years ago the landscape of patent-related semantic applications was still poorly populated. Nowadays, things are changing, thanks to the level of maturity reached by Semantic Web standards17 (W3C Recommendations) and to the ever-increasing availability of tools and methodologies to exploit this kind of data and metadata.
Patent information providers are starting to recognize this new trend in Web application architectures, where data sharing is becoming easier thanks to technologies such as Web Services, interoperation between data providers is enabled by the Linked Open Data standards (based on the Semantic Web infrastructure), and end users expect data and information to be available according to an “everywhere, everyone” paradigm. These new technology enablers, coupled with the new user attitude (partially brought by the Web 2.0 trend), shift the importance from the availability of the data to the ability to process it and provide the best results out of it. Some players are already shifting from their role as “patent database owners” to “search service providers” over partially public data.
Table 9. Currently available semantic solutions.
| Functions | Solution | Architecture and technology |
| --- | --- | --- |
| Semantic-based improvement of text search, related documents retrieval | Patent Cafe | Latent Semantic Analysis, without explicit ontology modeling |
| Semantic and multi-lingual search, search by similar documents and by similar drawings | PATExpert | Ontologies in W3C standard formats |
| Semantic and multilingual search | Goldfire Invention Machine | Embedded ontologies |
| Semantic indexing for faceted document navigation | IntelliPatent | User-editable ontologies |
| Semantic classification | PATExpert | Ontologies in W3C standard formats |
| Single patent functional analysis | Patent Analyzer | Embedded ontologies |
| Support for invention | Goldfire Invention Machine | Embedded ontologies |
In the patent knowledge domain, semantics is playing a role of increasing importance by contributing to solve several open issues, as reported in Table 9. Solutions are varied and tackle different problems, ranging from search improvement to analysis and post-processing improvement. The way ontologies are used is also extremely varied and encompasses systems using external ontologies, systems embedding ontologies in a read-only fashion and systems that allow the underlying knowledge model to be refined incrementally.
Patent Cafe [24] applies semantic technologies to improve recall in patent search: a semantic index of the Patent Cafe database is computed and embedded in the database itself, allowing semantic correlations between the available patents to be exploited directly. This solution is claimed to be more robust than semantic expansion of queries followed by mapping onto a traditional lexical index. Patent Cafe uses Latent Semantic Analysis18 for building the semantic index of the patent database [25]. Latent Semantic Analysis extracts the hidden relations between words occurring in patent documents by applying a dimensionality reduction akin to principal component analysis (typically a truncated singular value decomposition) to the traditional term-document matrix, thus identifying the most relevant dimensions among the many features that can be taken into account in the patent search process.
In summary, the Patent Cafe approach can be classified as lying between linguistic and semantic approaches since, for extracting document meaning, it uses implicit, non-formalized knowledge.
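The principle of Latent Semantic Analysis can be illustrated on a toy term-document matrix. The sketch below is not the Patent Cafe implementation, whose details are not given here; it assumes NumPy is available, uses a truncated singular value decomposition (the usual way of computing LSA), and works on four invented two-word "documents".

```python
import numpy as np

TERMS = ["car", "automobile", "engine", "flower", "petal"]
DOCS = [
    ["car", "engine"],         # d0
    ["automobile", "engine"],  # d1
    ["flower", "petal"],       # d2
    ["flower"],                # d3
]


def term_vector(words):
    """Binary vector over TERMS: 1.0 where the term occurs."""
    return np.array([1.0 if t in words else 0.0 for t in TERMS])


def cosine(a, b):
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([term_vector(d) for d in DOCS])

# Truncated SVD: keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

# Project documents and query onto the same latent axes.
doc_latent = A.T @ U_k
query = term_vector(["car"])
query_latent = query @ U_k

raw_sim = cosine(query, A[:, 1])                    # lexical match with d1
latent_sim = cosine(query_latent, doc_latent[1])    # latent match with d1
unrelated_sim = cosine(query_latent, doc_latent[2]) # latent match with d2

print(raw_sim, round(latent_sim, 3), round(abs(unrelated_sim), 3))
```

The query "car" shares no word with document d1 ("automobile engine"), so their lexical similarity is zero. In the two-dimensional latent space, however, both fall on the same axis because "car" and "automobile" co-occur with "engine", while the flower-related documents stay on the other axis.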
18 Tutorials and demos can be found at http://lsa.colorado.edu.
The EU-funded PATExpert project [7] uses explicit semantics, i.e., a set of intertwined ontologies, to accomplish different goals such as semantic search, automatic patent classification and clustering. PATExpert supports a semantic “search by similar” functionality for patent documents or single text passages. It describes both patent documents and queries in terms of ontology concepts and is able to integrate query and filtering functionalities, supporting complex searches like “find documents similar to document X as far as the feature Y is concerned”. Semantics is also seen as a means of helping patent offices in the patent classification process. Classification is supported through the adoption of domain-specific vocabularies [26], which allow patent jargon and patent structure to be managed correctly, and through the definition of ontologies tailored to the patent knowledge domain. In particular, the latter effort is providing several reference ontologies that can be classified into three different families:
• document ontologies, modeling the structure of patent documents, the associated metadata and the embedded drawings, when available;
• domain ontologies, including a patent classification ontology;
• linguistic ontologies, e.g., WordNet [23].
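A query such as “find documents similar to document X as far as the feature Y is concerned” can be sketched by comparing only the concepts attached to one feature. The sketch assumes that each patent has already been annotated with ontology concepts grouped per feature; the annotations, the concept identifiers and the use of the Jaccard index as similarity measure are illustrative assumptions, not the PATExpert method.

```python
# Hypothetical annotations: patent -> feature -> set of ontology concepts.
PATENTS = {
    "X": {"material": {"C:Steel", "C:Alloy"}, "process": {"C:Welding"}},
    "A": {"material": {"C:Steel", "C:Alloy"}, "process": {"C:Casting"}},
    "B": {"material": {"C:Polymer"}, "process": {"C:Welding"}},
}


def jaccard(a, b):
    """Share of concepts in common between two concept sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_to(ref_id, feature, patents):
    """Rank the other patents by similarity to ref_id on one feature only."""
    ref = patents[ref_id].get(feature, set())
    scores = {
        pid: jaccard(ref, desc.get(feature, set()))
        for pid, desc in patents.items()
        if pid != ref_id
    }
    # Highest score first; ties are broken by patent identifier.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(similar_to("X", "material", PATENTS))
print(similar_to("X", "process", PATENTS))
```

The ranking changes with the chosen feature: patent A is the closest to X with respect to the material, whereas patent B is the closest with respect to the process.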
IntelliPatent [27] uses semantics to empower the navigation of patent search results: the user is first encouraged to perform a relatively coarse search, which is then refined in an iterative fashion. In order to support such an interaction paradigm, IntelliPatent first indexes the results list by using a semantic indexer, and then aggregates patents in categories and facets, making it easy to “cluster” patents about a common topic. As an example, if a general query about “speech coding” patents has returned a rather long list of results, IntelliPatent is able to restructure the list by filtering results on the basis of the specific approach used by the patents, by aggregating similar approaches, and so on. The adopted semantic indexer is domain-dependent (i.e., it can only work on a specific knowledge domain) and uses a set of ontologies that can be modified or improved at runtime by sufficiently skilled users. Unlike other approaches based on patent clustering, IntelliPatent leaves users full control over the classification criteria, allowing them to finely tune the search behavior with respect to the actual search needs.
Patent Analyzer [28,29], which evolved from the EU-funded project WISPER (Worldwide Intelligent Semantic Patent Extraction and Retrieval19), performs functional analysis of specific patents, automatically describing the patent features at a relatively high level. The process is organized in three main steps: first, the components of the invention are identified and extracted; second, the extracted components are classified at a more abstract level; and finally, the relations between them are made explicit. Two different and complementary approaches drive the entire process: natural language processing is adopted in the first stage, while an embedded semantic model is exploited in the second and third stages. The final goal is to support the user in exploring different design solutions and variants, and in applying the TRIZ method.
Invention Machine20 goes a step further in the application of the TRIZ method by directly using semantics for solving the invention problem. The required knowledge base is embedded in the Invention Machine software and is used to represent the problem to be solved in a lightweight, machine-understandable way. A semantic search engine is also provided, which captures the relationships underlying different analysis tasks such as Failure Modes and Effect Analysis (FMEA) [30], and supports multi-lingual queries, in particular English, German, French and Japanese queries.
10. Conclusions
Patent information is crucial for defining strategies and decisions in today's global and dynamic world of business. Databases and tools have been supporting the elaboration and fruition of such information for a long time, with various functionalities targeted at different user categories. Far from being a completely solved research domain, patent intelligence is currently attracting more and more interest, both for addressing old issues, such as database quality, and for tackling new challenges including:
• the increasing number of patent applications, in many different languages;
• the greater variety of users, which now include experts, occasional users, etc., with different backgrounds and interests ranging from pure science to business;
• the shift in focus from traditional defensive IP approaches to the more recent idea of leveraging patent information for mining new technical and business opportunities.
19 http://wisper.bmtproject.net.
20 http://www.invention-machine.com/.
A new stream of technical advancements and research initiatives is currently trying to face these issues by designing improvements that range from user-friendly and effective search modalities to better support for search result analysis, and from more efficient integration of complementary information to improved identification of hidden facts and relationships. In this new and exciting evolution, semantic technologies can play a relevant role by simplifying and improving the patent search and analysis processes.
References
[1] Adams SR. Information sources in patents: an overview of international patents, K.G. Saur, 2006, ISBN: 3598244436, 9783598244438.
[2] WIPO, Strasbourg agreement concerning the international patent classification, Legislative texts WO026EN, September 1979.
[3] WIPO, Recommendation for the processing of patent information using XML (eXtensible Markup Language), STANDARD ST.36, November 2007.
[4] Thielemann W. OCR errors in patent full text documents. In: Information retrieval facility, 2007.
[5] Wretblad L, Sayeler J. IPR system error!. In: Information retrieval facility, 2007.
[6] Hunt D, Nguyen L, Rodgers M, editors. Patent searching: tools and techniques. John Wiley and Sons; 2007.
[7] Wanner L, Baeza-Yates R, Brügmann S, Codina J, Diallo B, Escorsa E, et al. Towards content-oriented patent document processing. World Patent Inform 2008;30(1):21–33.
[8] Porter A, Cunningham S. Tech mining: exploiting new technologies for competitive advantage. John Wiley and Sons; 2005.
[9] Brugmann S, Wanner L. Overview of the status of art, deliverable D.8.1. Tech. Rep. PATExpert project, 2006.
[10] Iwayama M, Furujii A, Kando N. Overview of classification subtask at NTCIR-5 patent retrieval task. In: NTCIR-5 workshop meeting, 2005.
[11] Salton G, McGill M. Introduction to modern information retrieval. McGraw-Hill Book Company; 1984.
[12] Iwayama M, Fujii A, Kando N, Takano A. Report on the patent retrieval task at NTCIR workshop 3, Tech. Rep., vol. 38, No. 1, ACM SIGIR Forum, 2004.
[13] Fujita S. Technology survey and invalidity search: a comparative study of different tasks for Japanese patent document retrieval. Inf Process Manage 2007;43(5):1154–72. doi: <http://dx.doi.org/10.1016/j.ipm.2006.11.009>.
[14] Iwayama M, Fujii A, Kando N, Marukawa Y. Evaluating patent retrieval in the third NTCIR workshop. Inf Process Manage 2006;42:207–21.
[15] Jarvelin K. Issues and approaches on interactive patent retrieval: evaluation of sessions rather than queries. In: Information retrieval facility, 2007.
[16] Wanner L. Patexpert annual report, Tech. Rep., PatExpert, 2007.
[17] Hartung L. Management and evaluation of patents and product development projects. IP Score 2.0, Tech. Rep., IP4Inno consortium, 2006.
[18] Gibbs A. Application of multiple known determinants to evaluate legal, commercial and technical value of a patent. Tech. Rep., Patent cafe, 2005.
[19] Gruber T. Toward principles for the design of ontologies used for knowledge sharing. Int J Hum-Comput Stud 1995;43:907–28.
[20] Uschold M. Towards a methodology for building ontologies. In: Workshop on basic ontological issues in knowledge sharing, IJCAI-95, 1995.
[21] Gomez-Perez A, Fernandez-Lopez M, Corcho O. Ontological engineering, advanced information and knowledge processing. Springer; 2003.
[22] Niles I, Pease A. Towards a standard upper ontology. In: Proceedings of the second international conference on formal ontology in information systems (FOIS-2001), 2001.
[23] Fellbaum C. WordNet an electronic lexical database. MIT press; 1998.
[24] Gibbs A. Heuristic Boolean patent search: comparative patent search quality/cost evaluation super Boolean vs. legacy Boolean search engines. Tech. Rep., Patent cafe, 2006.
[25] Gibbs A. Semetric: conceptual search and discovery. Tech. Rep., Patent cafe, 2005.
[26] Giereth M, Koch S, Kompatsiaris Y, Papadopoulos S, Pianta E, Serafini L, Wanner L. A modular framework for ontology-based representation of patent information. In: JURIX 2007: The 20th anniversary international conference on legal knowledge and information systems, 2007.
[27] Ciaramella A. Intellipatent 4.3: a quick overview. Tech. Rep., Intellisemantic s.r.l., 2007.
[28] Cascini G. System and method for automatically performing functional analyses of technical texts. European Patent EP1351156, 2002.
[29] Cascini G, Fantechi A, Spinicci E. Natural language processing of patents and technical documentation. Doc Anal Syst VI 2004; 3163/2004: 508–20.
[30] Stamatis DH. Failure mode and effect analysis: FMEA from theory to execution. Am Soc Qual 1995.
[31] Jackson P, Moulinier I. Natural language processing for online applications: text retrieval, extraction and categorization. John Benjamins Publishing Company; 2007.
Dario Bonino (Ph.D.) is a research assistant in the e-Lite research group at the Department of Computer Science and Automation of Politecnico di Torino. He received his M.S. and Ph.D. degrees in Electronics and Computer Science Engineering, respectively, from Politecnico di Torino in 2002 and 2006. His interests include Semantic Web technologies, with a focus on architectures for semantic annotation, indexing and retrieval of Web resources, domotics and semantic-aware, home-related technologies. He is the manager of the H-DOSE semantic platform and collaborates on other open source projects including Ontosphere3D. He has published two papers in international journals and 27 at international conferences.
Alberto Ciaramella is the CEO and founder of IntelliSemantic, a high-tech company that started its activities in 2005 in the Incubator of the Politecnico di Torino and develops solutions based on semantic technologies. He received his M.S. degree in Electronics and Computer Science from the University of Rome “La Sapienza” in 1969 and was a researcher and a research supervisor at CSELT, the research branch of the Telecom Italia group, where he published over 40 papers in journals and at international conferences and was the author or coauthor of four patents. His present interests include the development of easy-to-use search solutions using semantic technologies for different kinds of verticals, including patents.
Fulvio Corno is an Associate Professor at the Department of Computer Science and Automation of Politecnico di Torino. He received his M.S. and Ph.D. degrees in Electronics and Computer Science Engineering from Politecnico di Torino in 1991 and 1995. He is involved in several research projects. His current research interests include the application of semantic technologies to Web systems, the design of Intelligent Domotic Environments, and interfaces for alternative access to computer systems. He has published more than 150 papers at international conferences and 19 in international journals.