Multiple Retrieval Models and Regression Models for Prior Art Search


Participating institution: Humboldt Universität zu Berlin - IDSL

Patrice Lopez

also at EPO, Berlin, Germany

Laurent Romary

INRIA Gemo - Saclay, France
HUB-IDSL – Berlin, Germany

• Searching Scientific and Technical Documents
• Issues related to Prior Art Search
• Overview of PATATRAS
• Patent Document processing
• Combining metadata & text in four steps
• Results
• Future work

PATATRAS !!

• PATent and Article Tracking, Retrieval and AccesS addresses Scientific and Technical Publications in general:

• Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data

• How is this instantiated in patent publications?

Patent Publications

1. Metadata encode procedure-related data:
– Date, applicant, inventors, language(s)
– Classification: hierarchy of technical fields (IPC, ECLA (+ICO)), e.g. G06F17/30T2P2X (Information retrieval, Query expansion)
– Citations

EPO Citation Statistics

[Bar chart: EPO Search Reports produced in the last 5 years (total 775,000). Vertical axis: 100,000 to 600,000. Horizontal axis: US, EP, WO, DE, XP, JP, GB, FR, AT, IT, CA, CN, RU, KR.]

EC classified Pat: 95%

NPL: 24%

JP: 17%

Patent Publications


2. Patent Document Structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments)
– Strong interrelations between these structures
– Each of these structures serves different goals

Patent Publications

3. Textual content of patents:
• Attornish, multilinguality

4. Supporting content:
• tables, mathematical and chemical formulas, citations, technical drawings, etc.

5. Experimental data: absent

PATATRAS !!

• Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data

What are the known practices in prior art search?

Prior art search

[Diagram: patent application, prior art documents, search report]

CLEF-IP biases

• Topic patents are granted patent publications (richer documents than applications: ECLA classes, extra citations, final claims)
• Result set extended to all EPO documents introduced during the prosecution of the application
• All documents without an EPO counterpart (via patent family) are discarded (patent applications never filed at the EPO and non-patent literature are omitted)

The patent examiner's real life

Motivation and approach:
• Non-exhaustivity
• Recall-oriented search is a myth (titles and abstracts; lack of elaborate tools)
• Usage of classification for search (a priori restriction of the result set)
• Usage of metadata (thickets; patents are continuations of previous applications)

PATATRAS !!

• Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data

• We investigated only 1 and 3 in CLEF-IP 2009
• However...
• How to combine metadata-based and text-content retrieval?
• How to combine results in different languages?
• How to combine different retrieval approaches?

Overview of PATATRAS

[Architecture diagram. Inputs: patent collection, patent topic. Linguistic processing: tokenization, POS tagging, phrase extraction, concept tagging. Indexes: lemma, phrase, concept. Queries: lemma (en, fr, de), phrase (en), concept. Retrieval engine: Lemur 4.9 (KL divergence, Okapi BM25). Steps: init (initial working set), ranked results, merging (ranked merged results), post-ranking (final ranked results).]

Patent Document Processing: Text Indexing

• Sound linguistic processing as groundwork:
– No stemming: POS tagging & lemmatisation
– No stop words: only open grammatical categories are considered (N, V, Adj., Adv., numbers)

• A total of 5 indexes:
– One word form* (lemma) index per language (en, fr, de)
– English phrase indexing (Dice coefficient)
– Conceptual indexing

*ISO/DIS 24611, Language resource management — Morpho-syntactic annotation framework
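The slide names the Dice coefficient for English phrase indexing without giving details. As a minimal sketch, assuming phrases are scored as adjacent lemma pairs with Dice(x, y) = 2·f(x, y) / (f(x) + f(y)), the scoring could look as follows. The function name `dice_scores` and the toy sentences are illustrative assumptions, not part of PATATRAS.

```python
from collections import Counter

def dice_scores(sentences):
    """Score each adjacent lemma pair with the Dice coefficient.

    Dice(x, y) = 2 * f(x, y) / (f(x) + f(y)), where f counts occurrences
    over the whole input.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        pair: 2 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Toy lemmatised input (hypothetical, not taken from the patent collection).
sentences = [
    ["query", "expansion", "improve", "recall"],
    ["query", "expansion", "use", "thesaurus"],
    ["user", "submit", "query"],
]
scores = dice_scores(sentences)
```

Pairs whose score exceeds a chosen threshold would then be kept as index phrases; the threshold itself is a tuning choice not stated on the slides.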

Conceptual indexing

• Creation of a multilingual terminological database based on a conceptual model* covering scientific & technical fields

• Sources: MeSH, UMLS, Gene Ontology, SUMO, WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de

• Merging of concepts based on:
– Domain matches (manual mappings between sources)
– Term matches

• Represents terms, term variants, synonyms, acronyms and multilingual correspondences

• Term disambiguation based on IPC class

• 2.6 million terms for en, 190,000 for de, 140,000 for fr

• 1.4 million concepts (71,000 realized in de, 65,000 in fr)

*ISO 16642:2003, Computer applications in terminology — Terminological markup framework

Limitations of text-only retrieval

• Queries are based on all the textual content of the topic patent documents

Model   Index     Language   MAP (base)   MAP (with citation text)
KL      lemma     en         0.1068       0.1083
KL      lemma     fr         0.0611       0.0612
KL      lemma     de         0.0627       0.0634
KL      phrase    en         0.0717       0.0720
KL      concept   all        0.0671       0.0680
Okapi   lemma     en         0.0806       0.0813
Okapi   lemma     fr         0.0301       0.0303
Okapi   lemma     de         0.0598       0.0612
Okapi   phrase    en         0.0328       0.0330
Okapi   concept   all        0.0510       0.0516
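The two retrieval models compared above can be illustrated with textbook formulations. This is a sketch only: Lemur's own implementations and parameter defaults may differ, the BM25 idf variant with the +1 inside the logarithm is one common choice, and the KL model is written here as the rank-equivalent negative cross-entropy with Dirichlet smoothing. The parameter values (`k1`, `b`, `mu`) and the toy documents are assumptions.

```python
import math
from collections import Counter

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenised document for a tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in docs if term in d)
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score

def kl_score(query, doc, docs, mu=1000.0):
    """Negative cross-entropy between the query model and a Dirichlet-smoothed
    document model; ranks documents like negative KL divergence."""
    collection = Counter()
    for d in docs:
        collection.update(d)
    collection_len = sum(collection.values())
    tf = Counter(doc)
    score = 0.0
    for term, qf in Counter(query).items():
        if collection[term] == 0:
            continue  # term never seen in the collection: no evidence either way
        p_coll = collection[term] / collection_len
        p_doc = (tf[term] + mu * p_coll) / (len(doc) + mu)
        score += (qf / len(query)) * math.log(p_doc)
    return score

# Toy lemma index (hypothetical data).
docs = [
    ["query", "expansion", "retrieval"],
    ["engine", "piston", "valve"],
    ["query", "log", "analysis"],
]
query = ["query", "expansion"]
bm25_ranking = sorted(range(len(docs)), key=lambda i: bm25_score(query, docs[i], docs), reverse=True)
kl_ranking = sorted(range(len(docs)), key=lambda i: kl_score(query, docs[i], docs), reverse=True)
```

In PATATRAS the query is the full text of the topic patent, so queries are far longer than in this toy case.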

Patent Document Processing: Metadata

• Additional extraction of cited patents in the descriptions (regular expressions)
• 7960 additional cited EP documents found in the XL set

• Metadata representation: basic normalization (author, applicant)

• Storage in a MySQL database (2.48 GB in total for the collection)
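The regular expressions used for citation extraction are not shown on the slides. The sketch below is one hypothetical pattern for EP publication numbers written in running text; the pattern `EP_CITATION`, the function name and the normalisation to "EP" plus seven digits are all assumptions for illustration.

```python
import re

# Hypothetical pattern: "EP", an optional kind letter such as "A" or "B1",
# then 6 or 7 digits that may be separated by spaces, dots or commas.
EP_CITATION = re.compile(
    r"\bEP[\s-]*(?:[AB]\d?[\s-]+)?((?:\d[\s.,]?){6,7})(?!\d)"
)

def extract_cited_ep(description):
    """Return normalised EP numbers cited in a description, in order of appearance."""
    cited = []
    for match in EP_CITATION.finditer(description):
        digits = re.sub(r"\D", "", match.group(1))
        number = "EP" + digits.zfill(7)
        if number not in cited:
            cited.append(number)
    return cited

text = ("A similar device is disclosed in EP-A-0 123 456 and in "
        "EP 1234567 A1; see also US 5,123,456.")
cited = extract_cited_ep(text)
```

A production pattern would need to cover more citation styles (for example kind codes placed before the number with other separators), which is why a hit count such as the 7960 documents above depends heavily on the patterns used.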

Prior working sets

• Goal: For a given patent topic, create the smallest set of patents containing the relevant documents

• Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications → patent examiner's strategies

• Result: micro-recall of 0.7303, approx. 2600 doc. per patent topic (415 results per topic after final cutoff)

• Significant improvement of MAP results:

Model   Index    Language   with cit. text   with prior sets
KL      lemma    en         0.1083           0.1516 (+40%)
KL      lemma    de         0.0634           0.1145 (+81%)
KL      phrase   en         0.0720           0.1268 (+76%)
Okapi   lemma    en         0.0813           0.1365 (+68%)
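The iterative expansion and the micro-recall figure can be sketched as follows. This is an illustration under assumptions: all metadata relations (citations, family, common applicant, classification) are collapsed into a single hypothetical `related` mapping, and the number of expansion steps is a free parameter. Micro-recall is taken here as relevant documents found over all topics divided by relevant documents over all topics.

```python
def build_working_set(core, related, max_steps=2):
    """Expand a core list of documents by following metadata relations."""
    working = set(core)
    frontier = set(core)
    for _ in range(max_steps):
        # Documents reachable in one step that are not yet in the working set.
        frontier = {n for doc in frontier for n in related.get(doc, ())} - working
        if not frontier:
            break
        working |= frontier
    return working

def micro_recall(working_sets, relevant):
    """Relevant documents found over all topics / relevant documents over all topics."""
    found = sum(len(working_sets[topic] & relevant[topic]) for topic in relevant)
    total = sum(len(relevant[topic]) for topic in relevant)
    return found / total

# Toy relations (hypothetical identifiers).
related = {"A": {"C"}, "C": {"D"}}
working_sets = {
    "T1": build_working_set(["A", "B"], related, max_steps=1),
    "T2": build_working_set(["E"], related),
}
relevant = {"T1": {"C", "X"}, "T2": {"E"}}
recall = micro_recall(working_sets, relevant)
```

Each extra expansion step trades a larger working set for higher recall, which is the balance behind the reported figures of about 2600 documents per topic at a micro-recall of 0.7303.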

Merging of results

• Strong complementarities between the result sets
• So many examples! → fully supervised ML
• Regression model for estimating, for each patent topic, the pertinence of a result set
• Features: language, query size, init. working set size, max./min. & range of retrieval scores, IPC main & class, average phrase length
• Training set: 500 + addition of 4131 patents of the collection
• Linear combination of weights:

$s_{q,d} = \sum_{m \in M} w_{q,m} \, c_{m,d}$
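The linear combination of weighted result sets can be sketched as below. The per-model weights would come from the regression model; here they are hard-coded hypothetical values. The min-max normalisation of each model's scores is an added assumption (KL and Okapi scores live on different scales) and is not stated on the slides.

```python
from collections import defaultdict

def merge_results(result_sets, weights):
    """Merge per-model result sets into one ranking.

    result_sets: {model name: {document id: retrieval score}}
    weights:     {model name: estimated pertinence of that result set for this topic}
    """
    merged = defaultdict(float)
    for model, scores in result_sets.items():
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        for doc, score in scores.items():
            # Assumption: scores are min-max normalised per model before weighting.
            merged[doc] += weights[model] * (score - lo) / span
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Toy result sets and weights (hypothetical values).
result_sets = {
    "kl_lemma_en": {"D1": -5.0, "D2": -6.0, "D3": -7.0},
    "okapi_lemma_en": {"D2": 12.0, "D4": 8.0},
}
weights = {"kl_lemma_en": 0.6, "okapi_lemma_en": 0.4}
ranking = merge_results(result_sets, weights)
```

A document returned by several models accumulates weight from each, which is how complementary result sets reinforce one another.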

Merging of results

Feat.   LeastMedSq        MP                SMO               ν-SVM
f1      0.1681 (+5.8%)    0.1711 (+7.7%)    0.1706 (+7.4%)    0.1691 (+6.4%)
f1-6    0.1689 (+6.3%)    0.1797 (+13.1%)   0.1807 (+13.7%)   0.1976 (+24.3%)
all     0.1786 (+12.4%)   0.1898 (+19.4%)   0.2016 (+26.9%)   0.2281 (+43.5%)

f1 language

f2-6 related to the retrieval score

f7-8 IPC (domains)

f9 av. phrase length

Post-ranking

• Regression model for estimating the pertinence of a patent in the result set for a given patent topic

• Features: citation, # of common IPC & ECLA classes, prob. of citation, same applicant & inventors

• Training set: 500 + Addition of 4131 patents of the collection
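A sketch of how the post-ranking features could be assembled for one (topic, candidate) pair is given below. The `Patent` record, the feature order and the weights are hypothetical; the "prob. of citation" feature is omitted because the slides do not say how it is computed, and a plain linear score stands in for the trained regression model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Patent:
    number: str
    applicant: str
    inventors: frozenset
    ipc: frozenset
    ecla: frozenset
    cited: frozenset = frozenset()

def post_ranking_features(topic, candidate):
    """Feature vector for a candidate patent with respect to a topic patent."""
    return [
        1.0 if candidate.number in topic.cited else 0.0,          # cited by the topic
        float(len(topic.ipc & candidate.ipc)),                    # common IPC classes
        float(len(topic.ecla & candidate.ecla)),                  # common ECLA classes
        1.0 if topic.applicant == candidate.applicant else 0.0,   # same applicant
        float(len(topic.inventors & candidate.inventors)),        # common inventors
    ]

def linear_score(features, weights, bias=0.0):
    """Stand-in for the regression model: a weighted sum of the features."""
    return bias + sum(w * f for w, f in zip(weights, features))

# Toy records (hypothetical values).
topic = Patent("EP1000001", "ACME", frozenset({"Doe"}),
               frozenset({"G06F17/30", "G06F17/27"}), frozenset({"G06F17/30T"}),
               cited=frozenset({"EP0999999"}))
candidate = Patent("EP0999999", "ACME", frozenset({"Roe"}),
                   frozenset({"G06F17/30", "G06F17/27"}), frozenset({"G06F17/30T"}))
features = post_ranking_features(topic, candidate)
score = linear_score(features, weights=[2.0, 0.5, 0.5, 1.0, 1.0])
```

Candidates in the merged result set would be re-sorted by this score to give the final ranking.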

Final Results

Measures      S        M        XL       en-XL    fr-XL    de-XL
MAP           0.2714   0.2783   0.2802   0.2358   0.1787   0.2092
Prec. at 5    0.2780   0.2766   0.2768   0.2365   0.1855   0.2122
Prec. at 10   0.1768   0.1748   0.1776   0.1575   0.1338   0.1467

• On average, approx. 43 s per topic

• Final runs (10,000 patent topics) for all, en, fr, de took 5 days on 4 machines
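For reference, the reported measures (MAP, precision at 5 and at 10) follow the standard definitions sketched below. The function names and the toy run are illustrative; the official CLEF-IP evaluation tooling may differ in details such as how missing relevant documents or tied scores are handled.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant document found,
    divided by the total number of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant documents), one per topic."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Toy run for one topic (hypothetical identifiers).
ranked = ["A", "B", "C", "D", "E"]
relevant = {"A", "C", "X"}
ap = average_precision(ranked, relevant)
p5 = precision_at_k(ranked, relevant, 5)
```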

Conclusion

• We have proposed an architecture for retrieving Scientific and Technical Publications

• We have adapted the architecture to patent search practices

• Need to:
– improve terminological representations
– address document structures
– refine query representations

Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/
