The PATENTSCOPE search system: CLIR

February 2013

Sandrine Ammann

Marketing & Communications Officer

Welcome to the PATENTSCOPE search system webinar

CLIR

Agenda

CLIR

Definition

History

Search with CLIR

Usefulness

Golden rules

Technicalities

Q & A session

CLIR

Cross-Lingual Information Retrieval

Finds synonyms in different domains

Translates the synonyms found, plus the original query, into different languages
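The two steps above (find synonyms, then translate them together with the original query) can be sketched as follows. This is a minimal illustration only: the `SYNONYMS` and `TRANSLATIONS` tables and the `expand_query` function are hypothetical stand-ins, not the actual CLIR resources or API, and the source language is assumed to be English.

```python
# Hypothetical toy resources; the real CLIR dictionaries are learned automatically.
SYNONYMS = {
    "container": ["receptacle", "vessel"],
}
TRANSLATIONS = {
    "container": {"fr": ["conteneur", "récipient"], "de": ["Behälter"]},
    "receptacle": {"fr": ["réceptacle"], "de": ["Gefäß"]},
    "vessel": {"fr": ["cuve"], "de": ["Gefäß"]},
}


def expand_query(term, languages):
    """Return one OR-query per language for the term and its synonyms."""
    # Step 1: the original term plus its synonyms (source language assumed English).
    source_terms = [term] + SYNONYMS.get(term, [])
    queries = {"en": " OR ".join(source_terms)}

    # Step 2: translate every source term into each requested language.
    for lang in languages:
        translated = []
        for word in source_terms:
            for candidate in TRANSLATIONS.get(word, {}).get(lang, []):
                if candidate not in translated:  # drop duplicate translations
                    translated.append(candidate)
        if translated:
            queries[lang] = " OR ".join(translated)
    return queries


expanded = expand_query("container", ["fr", "de"])
for lang, query in expanded.items():
    print(lang, ":", query)
```

Each language gets its own OR-group, which is one simple way to picture how a single keyword grows into a multilingual Boolean query.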

CLIR – 12 languages available

NON-ASIAN

Dutch

English

French

German

Italian

Portuguese

Russian

Spanish

Swedish

ASIAN

Chinese

Japanese

Korean

History

Lower language barriers in patent search

First language tool developed in-house

CLIR: the interface

CLIR: precision vs recall

Precision = the ability to retrieve only relevant results. Aiming to find only precisely relevant items (high precision) means missing important items that do not use quite the same vocabulary.

Recall = the ability to retrieve as many as possible of the documents that match or are related to a query. Aiming to find all the relevant items (high recall) often brings in a lot of junk.
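The trade-off can be made concrete with the standard information-retrieval definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The document identifiers below are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a set of relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 10 relevant documents exist in the collection.
relevant_docs = [f"d{i}" for i in range(1, 11)]

# A narrow query returns few results, most of them relevant.
narrow = precision_recall(["d1", "d2", "d3", "x1"], relevant_docs)

# A broad query finds almost everything relevant, plus a lot of junk.
broad_results = [f"d{i}" for i in range(1, 10)] + [f"x{i}" for i in range(1, 12)]
broad = precision_recall(broad_results, relevant_docs)

print("narrow:", narrow)
print("broad:", broad)
```

The narrow query scores high on precision and low on recall; the broad query does the opposite.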

Example: precision

Example: recall

Example: ARM

CHIP

CLIR: supervised mode

2 modes: automatic and supervised

Automatic: 1 step

Supervised: 4 steps

Cross-Lingual Expansion (CLIR)

Result: the query is expanded from “container” to:

Supervised mode: 1 of 4 steps

Supervised mode: 2 of 4 steps

Supervised mode: 3 of 4 steps

Crowdsourcing

"is the practice of obtaining needed services, ideas, or content by soliciting contributions from a large group of people and especially from the online community rather than from traditional employees or suppliers. […]

Crowdsourcing is different from an ordinary outsourcing since it is a task or problem that is outsourced to an undefined public rather than a specific body."

Source: http://en.wikipedia.org/wiki/Crowdsourcing

Supervised mode: 4 of 4 steps

First: select languages

Second: select parameters

Stemming

Process that removes common endings from words, using the Porter algorithm for English

electric¦al = electric

electric¦ity = electric

electron¦ics = electron
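The idea of suffix stripping can be sketched in a few lines. Note that this is deliberately not the Porter algorithm, which applies several ordered rule sets and can produce shorter stems than the ones shown on the slide; the `SUFFIXES` list and `strip_suffix` function are assumptions chosen only to reproduce the three examples above.

```python
# Simplified suffix stripper (NOT the real Porter algorithm).
SUFFIXES = ("ity", "ics", "al")  # hypothetical, covers only the slide's examples


def strip_suffix(word, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = {w: strip_suffix(w) for w in ("electrical", "electricity", "electronics")}
print(stems)
```

With stemming switched on, a query for one form of a word also matches documents that contain the other forms sharing the same stem.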

Third: check variants

Editing

Checking: IPC

Supervised mode: results

Search examples: clothes for sport

Entering “sports clothing” in the Simple search interface will return 168 results

Entering “sports clothing” in the CLIR interface (in automatic mode) will return 5,449 results

Entering “sports clothing” in the CLIR interface (in supervised mode) will return 1,023 results

Why use CLIR?

A) Search full-text collections simultaneously in many foreign languages

B) Significantly increase the number of relevant results without significantly increasing the number of irrelevant ones

485 results in English titles or abstracts for “sports clothing”

575 results obtained with CLIR searching in titles or abstracts in all languages

C) Have confidence in your searches:

No black box: users can see the Boolean queries that CLIR generates (complex as they are) and have full control over them

D) Have a responsive system even for complex queries

Golden rules

Expansion modes

For a very specific keyword with only one meaning: AUTO (automatic mode)

For any other query, SUPERVISED is recommended

Variants/synonyms

Select words that you would like to appear in your search results

If you have too much noise in the result list, remove generic variants

Parameters

1. Title and abstract: unconstrained distance

2. Claims: sentence/paragraph distance

3. Description: sentence/paragraph distance

Stemming recommended
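To see why a distance limit matters more in long fields such as claims and descriptions, here is a rough sketch of a proximity check. It assumes distance is counted in words, which only approximates the sentence/paragraph distance mentioned above; `within_distance` is a hypothetical helper, not a PATENTSCOPE function.

```python
import re


def within_distance(text, term_a, term_b, max_distance=None):
    """True if both terms occur in text, at most max_distance words apart.

    max_distance=None models the unconstrained setting: the terms only
    need to appear somewhere in the same field.
    """
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    if not pos_a or not pos_b:
        return False
    if max_distance is None:
        return True
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


sample = ("A sports garment made of breathable fabric. "
          "The packaging describes clothing sizes.")

print(within_distance(sample, "sports", "clothing"))                  # unconstrained
print(within_distance(sample, "sports", "clothing", max_distance=5))  # constrained
```

In a short title or abstract, two query words appearing anywhere are probably related; in a long description, requiring them to be close together filters out accidental co-occurrences.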

Technicalities

Compilation of a long list of titles in language pairs

Creation of in-house extraction methodology

The tool learns statistical bilingual dictionaries from the titles

EN, FR, ZH, DE, KO, ES
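The slides do not describe the in-house extraction method, so the following is only a generic sketch of how a statistical bilingual dictionary can be learned from pairs of titles. It scores word pairs by co-occurrence using the Dice coefficient, a swapped-in textbook technique; the title pairs and the `learn_dictionary` function are invented for illustration.

```python
from collections import Counter
from itertools import product


def learn_dictionary(title_pairs):
    """Map each source word to the target word it co-occurs with most strongly."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_title, tgt_title in title_pairs:
        src_words = set(src_title.lower().split())
        tgt_words = set(tgt_title.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair seen in the same title pair.
        pair_count.update(product(src_words, tgt_words))

    best = {}
    for (src, tgt), together in pair_count.items():
        # Dice coefficient: 1.0 when the two words always appear together.
        score = 2 * together / (src_count[src] + tgt_count[tgt])
        if src not in best or score > best[src][1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}


# Invented English/French title pairs.
titles = [
    ("metal container", "conteneur métallique"),
    ("plastic container", "conteneur plastique"),
    ("metal frame", "cadre métallique"),
    ("plastic frame", "cadre plastique"),
]
learned = learn_dictionary(titles)
print(learned)
```

Even this toy version shows the property noted in the slides: no human checks the entries, and more title pairs give the statistics more evidence to separate true translations from accidental co-occurrences.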

Quality of dictionaries: no human intervention

The more titles available, the better the coverage

Chinese Korean Dutch

English Portuguese Italian

French Russian Swedish

German Spanish

Japanese

Disambiguation: the process of identifying the sense of a word in a sentence. Source: http://en.wikipedia.org/wiki/Disambiguation_%28disambiguation%29

Disambiguation is applied to keywords:

1. Technical domains based on the IPC

2. Synonym selection
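Domain-based disambiguation can be pictured as choosing a translation according to the IPC class of the document or query. The sketch below is purely illustrative: the `SENSES_BY_DOMAIN` table and `translate_in_domain` function are hypothetical, and the real system's data and logic are not described in the slides.

```python
# Hypothetical table: French translations of "chip" keyed by IPC class.
SENSES_BY_DOMAIN = {
    "chip": {
        "H01": "puce",    # electric elements, e.g. semiconductor devices
        "B27": "copeau",  # working of wood
    },
}


def translate_in_domain(term, ipc_code, default=None):
    """Pick the translation whose IPC class matches the first three characters of ipc_code."""
    senses = SENSES_BY_DOMAIN.get(term.lower(), {})
    return senses.get(ipc_code[:3].upper(), default)


print(translate_in_domain("chip", "H01L"))
print(translate_in_domain("chip", "B27L"))
```

The same keyword yields a different translation depending on the technical domain, which keeps an electronics query from pulling in woodworking documents.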

Future plans

Improve terminology coverage of already supported languages

Add other languages: over 200,000 titles and abstracts with associated high-quality English translations

Slides and recording

www.wipo.int/patentscope/en/webinar/index.html

patentscope@wipo.int

Thank you
