Multiple Retrieval Models and Regression Models for Prior Art Search
Participating institution: Humboldt Universität zu Berlin - IDSL
Patrice Lopez
also at EPO
Berlin, Germany
Laurent Romary
INRIA Gemo - Saclay, France
HUB-IDSL – Berlin, Germany
Plan
• Searching Scientific and Technical Documents
• Issues related to Prior Art Search
• Overview of PATATRAS
• Patent Document processing
• Combining metadata & text in four steps
• Results
• Future work
PATATRAS !!
• PATent and Article Tracking, Retrieval and AccesS addresses Scientific and Technical Publications in general:
• Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
• How is this instantiated in patent publications?
Patent Publications
1. Metadata encode procedure-related data:
– Date, applicant, inventors, language(s)
– Classification: hierarchy of technical fields IPC, ECLA (+ICO), e.g. G06F17/30T2P2X (Information retrieval, Query expansion)
– Citations
EPO Citation Statistics
EPO Search Reports produced in the last 5 years (tot. 775,000)
[Bar chart: number of search reports (axis from 0 to 600,000) per category US, EP, WO, DE, XP, JP, GB, FR, AT, IT, CA, CN, RU, KR. Annotations: EC classified Pat: 95%; NPL: 24%; JP: 17%; 1%]
Patent Publications
2. Patent Document Structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments)
– Strong interrelations between these structures
– Each of these structures serves different goals
Patent Publications
3. Textual Content of Patent:
• Attornish, multilinguality
4. Supporting content:
• tables, mathematical and chemical formulas, citations, technical drawings, etc.
5. Experimental data: absent
PATATRAS !!
• Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
• What are the known practices in prior art search?
Prior art search
[Diagram labels: Patent application, Prior art, Search report]
CLEF-IP biases
• Topic patents are granted patent publications (richer documents than applications: ECLA classes, extra citations, final claims)
• Result set extended to all EPO documents introduced during the prosecution of the application
• All documents without an EPO counterpart (via patent family) are discarded (patent applications never filed at the EPO and non-patent literature are omitted)
The patent examiner's real life
Motivation and approach:
• Non-exhaustivity
• Recall-oriented search is a myth (titles and abstracts; lack of elaborate tools)
• Usage of classification for search (a priori restriction of the result set)
• Usage of metadata (thickets; patents are continuations of previous applications)
PATATRAS !!
• Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
• We investigated only 1 and 3 in CLEF-IP 2009
• However...
• How to combine metadata-based and text-content retrieval?
• How to combine results in different languages?
• How to combine different retrieval approaches?
Overview of PATATRAS
[Architecture diagram. Inputs: Patent Collection, Patent Topic. Processing: Tokenization, POS Tagging, Phrase Extraction, Concept Tagging. Indexes: Lemma (en, fr, de), Phrase (en), Concept. Queries: Lemma (en, fr, de), Phrase (en), Concept. Init: Initial Working Set. Retrieval: Lemur 4.9 (KL divergence, Okapi BM25), Ranked Results (10). Merging: Ranked Merged Results. Post-Ranking: Final Ranked Results.]
Patent Document Processing: Text Indexing
• Sound linguistic processing as groundwork:
– No stemming: POS tagging & lemmatisation
– No stop words: only open grammatical categories are considered (N, V, Adj., Adv., numbers)
• A total of 5 indexes:
– One word form* (lemma) index per language (en, fr, de)
– English phrase indexing (Dice coefficient)
– Conceptual indexing
*ISO/DIS 24611, Language resource management — Morpho-syntactic annotation framework
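The slide names the Dice coefficient for English phrase indexing without giving details. The following is a minimal sketch of how adjacent lemma pairs could be scored with it, assuming the common definition 2·f(xy) / (f(x) + f(y)); the function name `dice_scores` and the token stream are hypothetical and not taken from PATATRAS.

```python
from collections import Counter

def dice_scores(tokens):
    """Score each adjacent token pair with the Dice coefficient:
    2 * freq(x followed by y) / (freq(x) + freq(y))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        pair: 2.0 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
    }

# Hypothetical stream of lemmas (open-class words only), for illustration.
tokens = ["fuel", "cell", "stack", "fuel", "cell", "membrane", "fuel", "pump"]
scores = dice_scores(tokens)
best_pair = max(scores, key=scores.get)  # most strongly associated pair
```

In a real setting the scores would be computed over the whole collection and a cutoff (not stated in the slides) would decide which pairs are indexed as phrases.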
Conceptual indexing
• Creation of a multilingual terminological database based on a conceptual model* covering scientific & technical fields
• Sources: MeSH, UMLS, Gene Ontology, SUMO, WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de
• Merging of concepts based on:
– Domain matches (manual mappings between sources)
– Term matches
• Represents terms/term variants/synonyms/acronyms and multilingual correspondences
• Term disambiguation based on IPC class
• 2.6 million terms for en, 190,000 for de, 140,000 for fr
• 1.4 million concepts (71,000 realized in de, 65,000 in fr)
*ISO 16642:2003, Computer applications in terminology — Terminological markup framework
Limitations of text-only retrieval
• Queries are based on all the textual content of the topic patent documents
• MAP by retrieval model, index and language:

Model   Index     Language   base     with citation text
KL      lemma     en         0.1068   0.1083
KL      lemma     fr         0.0611   0.0612
KL      lemma     de         0.0627   0.0634
KL      phrase    en         0.0717   0.0720
KL      concept   all        0.0671   0.0680
Okapi   lemma     en         0.0806   0.0813
Okapi   lemma     fr         0.0301   0.0303
Okapi   lemma     de         0.0598   0.0612
Okapi   phrase    en         0.0328   0.0330
Okapi   concept   all        0.0510   0.0516
Patent Document Processing: Metadata
• Additional extraction of cited patents in the descriptions (regular expressions)
• 7960 additional cited EP documents found in the XL set
• Metadata representation: basic normalization (author, applicant)
• Storage in a MySQL database (total 2.48 GB for the collection)
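The slides say that cited patents were extracted from the descriptions with regular expressions but do not show the patterns. Below is a minimal sketch of one possible pattern for EP publication numbers; the regular expression, the name `extract_ep_citations` and the sample sentence are assumptions for illustration, not the expressions used in PATATRAS.

```python
import re

# Hypothetical pattern: "EP", an optional kind code such as "A" or "A1"
# followed by a separator, then 7 digits possibly split by spaces or dots.
EP_CITATION = re.compile(
    r"\bEP[\s-]*(?:[AB]\d?[\s-]+)?((?:\d[\s.]?){7})(?!\d)"
)

def extract_ep_citations(description):
    """Return normalized EP numbers cited in a description, in order, deduplicated."""
    found = []
    for match in EP_CITATION.finditer(description):
        doc_id = "EP" + re.sub(r"\D", "", match.group(1))
        if doc_id not in found:
            found.append(doc_id)
    return found

text = ("As disclosed in EP 0 123 456 A1 and EP-A-1234567, "
        "see also EP1234567 and US 5,000,000.")
cited = extract_ep_citations(text)
```

Normalizing every match to a single form such as `EP0123456` is what makes the extracted citations joinable with the metadata stored in the database.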
Prior working sets
• Goal: For a given patent topic, create the smallest set of patents containing the relevant documents
• Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications → patent examiner's strategies
• Result: micro-recall of 0.7303, approx. 2600 doc. per patent topic (415 results per topic after final cutoff)
• Significant improvement of MAP results:

Model   Index    Language   with cit. text   with prior sets
KL      lemma    en         0.1083           0.1516 (+40%)
KL      lemma    de         0.0634           0.1145 (+81%)
KL      phrase   en         0.0720           0.1268 (+76%)
Okapi   lemma    en         0.0813           0.1365 (+68%)
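The iterative expansion and the micro-recall figure above can be illustrated with a small sketch. It assumes that all metadata relations (citations, shared applicant, family, classification) have already been collapsed into one hypothetical `relations` mapping from a document to its linked documents; the number of expansion rounds and the toy data are likewise assumptions.

```python
def build_working_set(topic, relations, max_rounds=2):
    """Start from the documents directly linked to the topic, then
    repeatedly add documents linked to the current frontier."""
    working_set = set(relations.get(topic, ()))
    frontier = set(working_set)
    for _ in range(max_rounds):
        linked = {n for doc in frontier for n in relations.get(doc, ())}
        frontier = linked - working_set - {topic}
        if not frontier:
            break
        working_set |= frontier
    working_set.discard(topic)
    return working_set

def micro_recall(working_sets, relevant):
    """Relevant documents found over all topics / all relevant documents."""
    found = sum(len(working_sets[t] & relevant[t]) for t in relevant)
    total = sum(len(relevant[t]) for t in relevant)
    return found / total

# Toy metadata graph (hypothetical identifiers).
relations = {"T1": ["A", "B"], "A": ["C"], "B": ["A", "D"], "C": ["E"]}
one_round = build_working_set("T1", relations, max_rounds=1)
recall = micro_recall({"T1": one_round}, {"T1": {"A", "E"}})
```

More rounds raise recall but enlarge the set, which is the trade-off behind the stated goal of the smallest set that still contains the relevant documents.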
Merging of results
• Strong complementarities between the result sets
• So many examples! → fully supervised ML
• Regression model estimating, for each patent topic, the pertinence of a result set
• Features: language, query size, init. working set size, max./min. & range of retrieval scores, IPC main & class, average phrase length
• Training set: 500 + addition of 4131 patents of the collection
• Linear combination of weights: $s_{d,q} = \sum_{m \in M} w_{m,q} \, c_{m,d}$, with m ranging over the retrieval models M, q the patent topic and d a retrieved document
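The merging step can be sketched as a weighted sum of per-model scores. In the slides the weights come from a regression model evaluated for each patent topic; here they are fixed, hypothetical values, and the scores are assumed to be already normalized to a comparable range. The names `merge_results`, `result_sets` and `weights` are illustrative only.

```python
from collections import defaultdict

def merge_results(result_sets, weights):
    """Combine per-model scores into one ranking.

    result_sets: {model name: {document id: score}}
    weights:     {model name: weight of that model for the current topic}
    """
    merged = defaultdict(float)
    for model, scores in result_sets.items():
        for doc, score in scores.items():
            merged[doc] += weights[model] * score
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores from two of the retrieval models.
result_sets = {
    "kl_lemma_en": {"EP1": 0.9, "EP2": 0.4},
    "okapi_lemma_de": {"EP2": 0.8, "EP3": 0.5},
}
weights = {"kl_lemma_en": 0.6, "okapi_lemma_de": 0.4}
ranking = merge_results(result_sets, weights)
```

A document found by several models accumulates evidence from each of them, which is how complementary result sets can lift a document above one that a single model ranked first.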
Merging of results

Feat.   LeastMedSq        MP                SMO               ν-SVM
f1      0.1681 (+5.8%)    0.1711 (+7.7%)    0.1706 (+7.4%)    0.1691 (+6.4%)
f1-6    0.1689 (+6.3%)    0.1797 (+13.1%)   0.1807 (+13.7%)   0.1976 (+24.3%)
all     0.1786 (+12.4%)   0.1898 (+19.4%)   0.2016 (+26.9%)   0.2281 (+43.5%)

Features:
f1: language
f2-6: related to the retrieval score
f7-8: IPC (domains)
f9: av. phrase length
Post-ranking
• Regression model for estimating the pertinence of a patent in the result set for a given patent topic
• Features: citation, # of common IPC & ECLA classes, prob. of citation, same applicant & inventors
• Training set: 500 + Addition of 4131 patents of the collection
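As a rough illustration of the post-ranking features listed above, the sketch below builds a feature vector for a (topic, candidate) pair and re-ranks candidates with a linear scorer. The record fields, the weights and the applicant and inventor names are hypothetical; the slides do not specify the regression model, and the "prob. of citation" feature is left out because its definition is not given.

```python
def pair_features(topic, candidate):
    """Metadata features of a (topic, candidate) pair; field names are hypothetical."""
    return [
        1.0 if candidate["id"] in topic["cited"] else 0.0,           # citation
        float(len(set(topic["ipc"]) & set(candidate["ipc"]))),       # common IPC classes
        float(len(set(topic["ecla"]) & set(candidate["ecla"]))),     # common ECLA classes
        1.0 if topic["applicant"] == candidate["applicant"] else 0.0,
        1.0 if set(topic["inventors"]) & set(candidate["inventors"]) else 0.0,
    ]

def post_rank(topic, candidates, weights):
    """Order candidate ids by a linear score over their features."""
    def score(candidate):
        return sum(w * f for w, f in zip(weights, pair_features(topic, candidate)))
    return [c["id"] for c in sorted(candidates, key=score, reverse=True)]

topic = {"cited": {"EP1"}, "ipc": ["G06F17/30"], "ecla": ["G06F17/30T2P2X"],
         "applicant": "ACME", "inventors": ["A. Smith"]}
candidates = [
    {"id": "EP2", "ipc": ["H01M8/00"], "ecla": [], "applicant": "ACME",
     "inventors": ["A. Smith"]},
    {"id": "EP1", "ipc": ["G06F17/30"], "ecla": [], "applicant": "Other",
     "inventors": ["B. Jones"]},
]
weights = [2.0, 0.5, 0.5, 0.3, 0.3]  # illustrative values, not learned
order = post_rank(topic, candidates, weights)
```

In a trained setting the weights would be fitted on the training topics rather than set by hand.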
Final Results

Measures      S        M        XL       en-XL    fr-XL    de-XL
MAP           0.2714   0.2783   0.2802   0.2358   0.1787   0.2092
Prec. at 5    0.2780   0.2766   0.2768   0.2365   0.1855   0.2122
Prec. at 10   0.1768   0.1748   0.1776   0.1575   0.1338   0.1467

• On average approx. 43 s per topic
• Final runs (10,000 patent topics) for all, en, fr, de took 5 days on 4 machines
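For readers less familiar with the measures in the table, here is a minimal sketch of average precision, MAP and precision at k in their textbook form. The exact evaluation settings of CLEF-IP (cutoffs, handling of missing topics) are not stated in the slides, so this is an assumption-level illustration with hypothetical document identifiers.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, k):
    """Fraction of the first k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["EP1", "EP2", "EP3", "EP4"]
relevant = {"EP1", "EP3"}
ap = average_precision(ranked, relevant)   # (1/1 + 2/3) / 2
p_at_2 = precision_at(ranked, relevant, 2)
```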
Conclusion
• We have proposed an architecture for retrieving Scientific and Technical Publications
• We have adapted the architecture to patent search practices
• Needs:
– improve terminological representations
– address document structures
– refine query representations
Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/