Identification of Low/High Retrievable Patents using Content-Based Features
Shariq Bashir and Andreas Rauber
PaIR’09, Hong Kong, China, 6th November, 2009
Department of Software Technology and Interactive Systems, Vienna University of Technology, Austria
{bashir, rauber}@ifs.tuwien.ac.at

Page 1: Identification of Low/High Retrievable Patents using Content-Based Features

Identification of Low/High Retrievable Patents using Content-Based Features

Shariq Bashir, and Andreas Rauber

PaIR’09, Hong Kong, China, 6th November, 2009

Department of Software Technology and Interactive Systems

Vienna University of Technology, Austria

{bashir, rauber}@ifs.tuwien.ac.at

Page 2: Identification of Low/High Retrievable Patents using Content-Based Features

Patent Retrieval

Patent retrieval is a recall-oriented domain
– Retrieving all relevant patents is considered more important than viewing only the top-ranked patents for a query

Challenges in Patent Retrieval
– Complex content and technical structure
– Acronyms and new terminology
– Many vague terms are used to narrow the scope of the invention
– Writers use their own terminology so that patents pass the examination

These factors have a non-trivial effect on the findability of patents

Page 3: Identification of Low/High Retrievable Patents using Content-Based Features

Retrieval Systems Evaluation

Conventionally, retrieval systems are evaluated with metrics such as Average Precision, Q-measure, and Normalized Discounted Cumulative Gain

These metrics cannot answer:
– What can we find, and what can’t we find?
– Which patents are easy to find?
– Which patents are hard to find?
– Which retrieval system is better at placing patents in the top-ranked results of queries?

Here, retrieval systems are evaluated using the concept of retrievability

Retrievability analyzes how easily users can find documents in a given system

Page 4: Identification of Low/High Retrievable Patents using Content-Based Features

Retrievability Measurement

Measures how likely individual documents in the collection ($D$) are to be retrieved within the top $c$ results of queries

Defined, for each document $d \in D$, as $r(d) = \sum_{q \in Q} f(k_{dq}, c)$

– $c$ denotes the rank up to which the user is willing to proceed
– $k_{dq}$ is the rank of document $d \in D$ in the results of query $q \in Q$
– $f(k_{dq}, c)$ is the cost function; it returns 1 if $k_{dq} \le c$, and 0 otherwise
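As a rough illustration of the measure above, the following sketch counts, for each document, the number of queries in which it appears within the top c results. The `rankings` dictionary, the document ids, the function names, and the cutoff c = 2 are all made up for this example; the slides do not show how the scores were actually computed.

```python
from collections import defaultdict


def cost(rank, c):
    """Cost function f(k_dq, c): 1 if the document is ranked within the top c, else 0."""
    return 1 if rank <= c else 0


def retrievability(rankings, c):
    """Sum the cost function over all queries for every document that is returned.

    rankings: dict mapping a query to its ranked list of document ids (best first).
    """
    r = defaultdict(int)
    for ranked_docs in rankings.values():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            r[doc_id] += cost(rank, c)
    return dict(r)


# Hypothetical toy rankings for three queries.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d1", "d4"],
    "q3": ["d2", "d3", "d1"],
}
scores = retrievability(rankings, c=2)
print(scores)
```

A document that is returned but never reaches the top c, such as `d4` here, ends up with a score of 0, which corresponds to the "cannot be found" end of the distribution.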

Page 5: Identification of Low/High Retrievable Patents using Content-Based Features

Retrieval Systems Evaluation

Bias in Retrieval Systems
– Different retrieval systems show a large bias toward a subset of the collection
– A large number of patents have very low retrievability scores
– Some patents cannot be found via any query
– Patents have different retrievability scores in different systems
– Main factors behind low retrievability
  • Short queries
  • System bias
  • Term mismatch between document and query

Page 6: Identification of Low/High Retrievable Patents using Content-Based Features

Our Contribution

Motivation:
– Patents have different retrievability scores in different systems
– We need to understand which factors affect retrievability
– Based on these factors, can we identify high/low retrievable patents a priori?

Contribution:
– Retrievability is analyzed with content-based features. In addition to Patent Length, the following features are considered:
  • Rare Terms Ratio
  • Average Terms Frequencies
  • Frequent Terms Count
  • Average Terms Probabilities in Related Patents
  • Average Terms Probabilities in Whole Collection
– Patents are automatically classified as low/high retrievable based on text features

Page 7: Identification of Low/High Retrievable Patents using Content-Based Features

Experimental Setup

Dataset
– Patents downloaded from http://www.uspto.gov/ (US Patent and Trademark Office website)
– United States Patent Classification (USPC) classes 422 and 423 are used for the experiments
– Total patents = 54,353
– Average size = 3,317.41 words (without stop-word removal)

Retrieval Systems
– TFIDF
– Exact Match
– Okapi BM25
– Jelinek-Mercer (JM)
– Dirichlet (Bayesian) Smoothing (DirS)
– Absolute Discounting (AbsDis)
– Two-Stage Smoothing (Two-Stage)

Page 8: Identification of Low/High Retrievable Patents using Content-Based Features

[Figure: Findability Distribution of Patents with Different Retrieval Systems. x-axis: Patents Ordered by r(d) based on BM25 System (0 to 48000); y-axis: Retrievability Score (ticks from 1 to 100000 in powers of ten); series: TFIDF, BM25, JM, DirS, AbsDis, Two-Stage]

Page 9: Identification of Low/High Retrievable Patents using Content-Based Features

Experimental Analysis

Content-Based Feature Analysis
– 800 low and 800 high retrievable patents are picked at random from each retrieval system for analysis
– A patent is considered low retrievable if it has r(d) < 300, whereas patents with r(d) >= 700 are considered high retrievable
– Features
  • Rare Terms Ratio (RTR)
  • Average Terms Frequencies (ATF)
  • Average Terms Probabilities in Related Patents (ATPrd)
  • Average Terms Probabilities in Whole Collection (ATP)
  • Frequent Terms Count (FTC)
  • Patent Length (PL)
– Features RTF, ATPrd, and ATP are further computed with two-, three-, and four-term combinations
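The low/high split described on this slide can be sketched as a simple thresholding step. The thresholds 300 and 700 come from the slide; the function names and the toy scores are hypothetical, and the random sampling of 800 patents per class is left out.

```python
def label_retrievability(r_d, low_below=300, high_from=700):
    """Return "low" if r(d) < 300, "high" if r(d) >= 700, and None in between."""
    if r_d < low_below:
        return "low"
    if r_d >= high_from:
        return "high"
    return None  # patents between the two thresholds are not used


def split_low_high(scores):
    """Split a dict of patent id -> r(d) into lists of low and high retrievable ids."""
    low = [p for p, r in scores.items() if label_retrievability(r) == "low"]
    high = [p for p, r in scores.items() if label_retrievability(r) == "high"]
    return low, high


# Hypothetical retrievability scores for six patents.
scores = {"p1": 12, "p2": 299, "p3": 300, "p4": 699, "p5": 700, "p6": 4521}
low, high = split_low_high(scores)
print(low, high)
```

Patents with 300 <= r(d) < 700 fall into neither class, which keeps the two groups clearly separated.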

Page 10: Identification of Low/High Retrievable Patents using Content-Based Features

Rare Terms Ratio (RTR): Which systems are worst at finding patents with a large share of rare terms?
– A term is considered “rare” if its collection frequency is less than 200
– Patents with large RTR could indicate
  • A new invention
  • The presence of hidden information

TFIDF is worst at finding patents with large RTR [Reason: IDF]
In Two-Stage, some patents with large RTR are highly retrievable
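The slide does not give the exact formula for RTR, so the sketch below rests on two assumptions: "collection frequency" is taken to mean a term's total number of occurrences in the collection, and RTR is taken to be the share of a patent's distinct terms that are rare. The function names and the toy collection are hypothetical, and the toy threshold of 2 replaces the slide's 200 only so that the small example produces a non-trivial result.

```python
from collections import Counter


def collection_frequencies(docs):
    """Count how often each term occurs across the whole collection."""
    cf = Counter()
    for tokens in docs.values():
        cf.update(tokens)
    return cf


def rare_terms_ratio(tokens, cf, rare_below=200):
    """Share of a patent's distinct terms whose collection frequency is below the threshold."""
    distinct = set(tokens)
    if not distinct:
        return 0.0
    rare = [t for t in distinct if cf[t] < rare_below]
    return len(rare) / len(distinct)


# Hypothetical two-patent collection, already tokenized.
docs = {
    "p1": ["acid", "reactor", "acid"],
    "p2": ["reactor", "zeolite"],
}
cf = collection_frequencies(docs)
rtr = {p: rare_terms_ratio(tokens, cf, rare_below=2) for p, tokens in docs.items()}
print(rtr)
```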

Page 11: Identification of Low/High Retrievable Patents using Content-Based Features

Average Terms Frequencies (ATF)
– Helpful for understanding the effect of term frequencies on ranking
– Both rare terms and all terms are considered

BM25, JM, and Two-Stage make patents with smaller ATF more findable

DirS and AbsDis make patents with larger ATF more findable
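One plausible reading of ATF, which the slides do not define precisely, is the mean number of occurrences per distinct term in a patent. The sketch below assumes that reading; the function name, the optional `only_terms` filter (used here to restrict the average to rare terms), and the toy tokens are hypothetical.

```python
from collections import Counter


def average_term_frequency(tokens, only_terms=None):
    """Mean within-patent frequency per distinct term.

    If only_terms is given, the average is restricted to those terms
    (for example, the rare terms of the collection).
    """
    counts = Counter(tokens)
    if only_terms is not None:
        counts = {t: n for t, n in counts.items() if t in only_terms}
    if not counts:
        return 0.0
    return sum(counts.values()) / len(counts)


# Hypothetical tokenized patent.
tokens = ["acid", "acid", "acid", "reactor", "zeolite"]
atf_all = average_term_frequency(tokens)
atf_rare = average_term_frequency(tokens, only_terms={"zeolite"})
print(atf_all, atf_rare)
```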

Page 12: Identification of Low/High Retrievable Patents using Content-Based Features

Average Terms Probabilities in Related Patents (ATPrd)
– Helpful for understanding whether a system makes more findable those patents
  • whose terms also appear in their related patents (strong clusters),
  • or those in weaker clusters
– The top-35 most similar patents are considered

TFIDF, JM, AbsDis, and DirS all make stronger clusters more retrievable
In BM25 and Two-Stage, weaker clusters have higher findability
– Thus BM25 and Two-Stage are suitable for finding patents that frequently use alternative terms rather than the terms that appear in their related patents

Page 13: Identification of Low/High Retrievable Patents using Content-Based Features

Average Terms Probabilities in Whole Collection (ATP)
– In ATPrd, the 35 most similar patents were used
– In this feature, the whole collection is considered
– Useful for understanding the effect of Inverse Document Frequency (IDF)

TFIDF, DirS, AbsDis, and JM make patents with larger ATP more retrievable
BM25 and Two-Stage make patents with smaller ATP more findable
– Suitable for finding patents that frequently use new terminology

Page 14: Identification of Low/High Retrievable Patents using Content-Based Features

Patent Length (PL)
– Useful for understanding whether a system makes
  • longer patents more findable,
  • or shorter patents more findable
– In BM25 and Two-Stage
  • Only those longer patents that have smaller ATF values are retrievable (due to the effect of length normalization)
– In AbsDis and DirS, some shorter patents with higher ATF have higher findability

Page 15: Identification of Low/High Retrievable Patents using Content-Based Features

Experimental Analysis

Automatic Retrievability Classification
– For automatic identification of low/high retrievable patents, we build a classification model
– Content-based features are used for learning the model
– 1600 random (800 low and 800 high) retrievable patents are used for learning the classification model
– A further 1600 random (800 low and 800 high) retrievable patents are used for testing classification accuracy
– We used the J48 classifier implemented in WEKA
– For most systems, with both Controlled Query Generation (CQG) approaches, our classification achieves more than 80% accuracy

Page 16: Identification of Low/High Retrievable Patents using Content-Based Features


Experimental Analysis

Automatic Retrievability Classification

Page 17: Identification of Low/High Retrievable Patents using Content-Based Features

Conclusions and Future Directions

Patent retrieval is a recall-oriented retrieval domain
Patents have complex content and technical structure
– This affects the findability of patents

Retrieval systems are evaluated using the concept of findability measurement

High and low retrievable patents are analyzed using text-based features

Based on text features, high and low retrievable patents are identified a priori

Future Directions: Automatic Low/High Retrievable Patent Identification
– Useful for patent examiners analyzing the contents of very low or very high retrievable patents

Page 18: Identification of Low/High Retrievable Patents using Content-Based Features

Conclusions

Future Directions: Query Results Merging
– Separate low and high retrievable patents
– Then query both the low and the high retrievable patents
– Can increase retrievability

Retrieval Systems Fusion
– Query with all systems
– Merge only those systems in which patents are highly retrievable

Page 19: Identification of Low/High Retrievable Patents using Content-Based Features


Thank You

Page 20: Identification of Low/High Retrievable Patents using Content-Based Features

Queries Generation

Controlled Query Generation (CQG)
– Queries are generated based on prior-art search
– Two methods are used:
  • Without patent relatedness, and with patent relatedness
– Query Generation combining Frequent Terms (QG-FT)
  • Each patent is considered as a query patent for prior-art search
  • Single frequent terms are extracted with minimum document frequency > 2
  • Single frequent terms are combined to construct longer queries

Patent (A): use Patent (A) as a query for searching related documents.
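The QG-FT idea of combining single frequent terms into longer queries can be sketched with `itertools.combinations`. Several details here are assumptions rather than the authors' procedure: the frequency threshold is applied to term counts within the query patent, the query lengths of two to four terms are borrowed from the QG-DR description, and the function names and toy tokens are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_terms(tokens, min_freq=2):
    """Terms that occur more than min_freq times in the query patent (assumed reading)."""
    counts = Counter(tokens)
    return sorted(t for t, n in counts.items() if n > min_freq)


def generate_queries(terms, sizes=(2, 3, 4)):
    """Combine single terms into all two-, three-, and four-term queries."""
    queries = []
    for size in sizes:
        queries.extend(combinations(terms, size))
    return queries


# Hypothetical tokenized query patent.
tokens = ("catalyst reactor catalyst zeolite reactor "
          "catalyst reactor zeolite zeolite acid").split()
terms = frequent_terms(tokens)
queries = generate_queries(terms)
print(terms)
print(queries)
```

The number of combinations grows quickly with the number of frequent terms, so a real query set would likely need a cap on the terms per patent; the slides do not say how this was handled.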

Page 21: Identification of Low/High Retrievable Patents using Content-Based Features

Queries Generation

Controlled Query Generation (CQG)
– Query Generation using Document Relatedness (QG-DR)
  • In QG-FT, queries are generated from single patents
  • Term mismatch can affect results
  • In this approach, query terms are selected from related patents
– (Step 1): For each patent, group its related patents in a set (R)
– (Step 2): Then, using R and the whole collection, construct a language model for finding dominant terms
– Where Pjm(t|R) is the probability of term t in set R, and Pjm(t|corpus) is the probability of term t in the whole collection
– (Step 3): Combine single terms into two-, three-, and four-term combinations to construct longer queries

Page 22: Identification of Low/High Retrievable Patents using Content-Based Features

Queries Generation

Controlled Query Generation (CQG)
– Descriptions of Query Sets used for Retrievability Analysis