The EP-INV-Patstat db and preliminary results Andrea Maurino DISCo - Dip. di Informatica, Sistematica e Comunicazione Università di Milano Bicocca viale Sarca 336/14, 20124, Milano (Italy)


Page 1: The EP-INV- Patstat  db and preliminary results

The EP-INV-Patstat db and preliminary results

Andrea Maurino

DISCo - Dip. di Informatica, Sistematica e Comunicazione, Università di Milano Bicocca

viale Sarca 336/14, 20124, Milano (Italy)

Page 2: The EP-INV- Patstat  db and preliminary results

Index

• APE-INV project
• EP-INV-PatStat
• Feedback Web application
• Preliminary results
• Ongoing works

••• ITIS Lab ••• http://www.itis.disco.unimib.it ••• 2

Page 3: The EP-INV- Patstat  db and preliminary results

A preliminary truth

• The world is dirty!
• And real-world data are dirty!

• A mandatory preliminary task before carrying out any analysis or computing any statistic is:

• Clean your data


Page 4: The EP-INV- Patstat  db and preliminary results

Disambiguation of academic inventors: ESF-APE-INV

Project chair: Francesco Lissoni (uniBocconi)
Technical Manager: Andrea Maurino (uniMiB)

Project steps:
• Reclassification of all patents by inventor (INV)
• Matching between inventors and academic scientists (APE)

• Results expected: to produce a freely available database of “Academic Patenting in Europe”


www.academicpatenting.eu

Page 5: The EP-INV- Patstat  db and preliminary results

EP-INV-PatStat


Tables: INVENTORS_INFO, PATSTAT_PUBL_NR, PATSTAT_APPL_ID, DISAMBIGUATION

Page 6: The EP-INV- Patstat  db and preliminary results

Which part of PatStat is affected by disambiguation?

Users should not use these PatStat tables: the APE-INV project provides substitute tables with disambiguated inventors and inventor information.

Source: PatStat documentation

Page 7: The EP-INV- Patstat  db and preliminary results

INVENTORS_INFO

• INVENTORS_INFO table
  • CODINV2
  • NAME-SURNAME
  • COUNTRY / GCOUNTRY
  • STATE
  • REGION / GREGION
  • COUNTY / GCOUNTY
  • CITY / GCITY
  • STREET / GSTREET
  • ZIP / GZIP
  • LONGITUDE
  • LATITUDE
  • GACCURACY

• Fields prefixed with the letter G are the output of a Google-based standardization algorithm; all other fields are cleaned PatStat addresses (e.g. CITY is the cleaned PatStat value and GCITY the Google-standardized one).

• Google information is reported only when GACCURACY is greater than or equal to 6 (i.e. the address is available at street level).
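
The GACCURACY rule can be sketched as follows. This is only an illustration, assuming each INVENTORS_INFO row has been loaded as a dictionary keyed by the field names listed on this slide; the function name `best_city` and the sample values are hypothetical and not part of the APE-INV distribution.

```python
def best_city(row, min_accuracy=6):
    """Prefer the Google-standardized city when GACCURACY reaches the
    threshold; otherwise fall back to the cleaned PatStat city."""
    accuracy = row.get("GACCURACY")
    if accuracy is not None and accuracy >= min_accuracy and row.get("GCITY"):
        return row["GCITY"]
    return row.get("CITY")


# Hypothetical rows, for illustration only.
precise = {"CODINV2": 100, "CITY": "MILANO", "GCITY": "Milan", "GACCURACY": 8}
coarse = {"CODINV2": 101, "CITY": "MILANO", "GCITY": None, "GACCURACY": 4}

print(best_city(precise))
print(best_city(coarse))
```

The same fallback pattern would apply to the other paired fields (COUNTRY / GCOUNTRY, STREET / GSTREET, and so on).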

Page 8: The EP-INV- Patstat  db and preliminary results

From APE-INV to PatStat: PATSTAT_PUBL_NR and PATSTAT_APPL_ID

To connect the DISAMBIGUATION and INVENTORS_INFO tables with the PatStat dataset, the repository includes two further tables:

• PATSTAT_PUBL_NR
  • Links each inventor (as identified by the CODINV2 code in the APE-INV dataset) to her granted patents (PUBLN_NR).

• PATSTAT_APPL_ID
  • Identifies the APPLN_ID corresponding to each PUBLN_NR (NB: in the specific case of EP patents there is a one-to-one correspondence between APPLN_ID and PUBLN_NR).
  • Also reports the PatStat edition that the APPLN_ID refers to.

PATSTAT_APPL_ID

PUBLN_NR  APPLN_ID  PEDITION
1         5         42011
2         6         42011
3         7         42011
4         8         42011

PATSTAT_PUBL_NR

CODINV2  PUBLN_NR
100      1
101      2
102      3
115      4
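
The chain CODINV2 → PUBLN_NR → APPLN_ID can be followed with a plain join. The sketch below loads the sample rows from this slide into an in-memory SQLite database; the column types and the helper name `applications_for` are assumptions made for illustration, since the slides do not describe how the txt files should be loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PATSTAT_PUBL_NR (CODINV2 INTEGER, PUBLN_NR INTEGER);
CREATE TABLE PATSTAT_APPL_ID (PUBLN_NR INTEGER, APPLN_ID INTEGER, PEDITION INTEGER);
""")

# Sample rows copied from the slide.
conn.executemany("INSERT INTO PATSTAT_PUBL_NR VALUES (?, ?)",
                 [(100, 1), (101, 2), (102, 3), (115, 4)])
conn.executemany("INSERT INTO PATSTAT_APPL_ID VALUES (?, ?, ?)",
                 [(1, 5, 42011), (2, 6, 42011), (3, 7, 42011), (4, 8, 42011)])


def applications_for(codinv2):
    """Return (APPLN_ID, PEDITION) pairs for one CODINV2 inventor code."""
    return conn.execute(
        """SELECT a.APPLN_ID, a.PEDITION
           FROM PATSTAT_PUBL_NR AS p
           JOIN PATSTAT_APPL_ID AS a ON a.PUBLN_NR = p.PUBLN_NR
           WHERE p.CODINV2 = ?
           ORDER BY a.APPLN_ID""",
        (codinv2,),
    ).fetchall()


print(applications_for(101))
```

The resulting APPLN_ID values are the keys to use when joining against the PatStat tables of the edition given in PEDITION.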

Page 9: The EP-INV- Patstat  db and preliminary results

DISAMBIGUATION.txt

• DISAMBIGUATION table
  • CODINV2: a stable key generated within the APE-INV project. It uniquely identifies each distinct combination of inventor and address.
  • CODINV: a code assigned to each CODINV2 by the disambiguation procedure. If two or more distinct CODINV2s are found to refer to the same person, they are assigned the same CODINV.

CODINV CODINV2

1 100

1 101

2 102

3 115
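
To read the sample rows as "one person, many inventor-address records", the CODINV2 codes can be grouped by CODINV. A minimal sketch, assuming the DISAMBIGUATION rows have already been parsed into (CODINV, CODINV2) tuples; the function name `group_by_person` is hypothetical.

```python
from collections import defaultdict

# (CODINV, CODINV2) pairs copied from the sample rows above.
disambiguation = [(1, 100), (1, 101), (2, 102), (3, 115)]


def group_by_person(pairs):
    """Map each disambiguated person (CODINV) to its CODINV2 records."""
    groups = defaultdict(list)
    for codinv, codinv2 in pairs:
        groups[codinv].append(codinv2)
    return dict(groups)


persons = group_by_person(disambiguation)
print(persons)
```

In the sample, CODINV 1 collects CODINV2 100 and 101, meaning the two inventor-address combinations were judged to be the same person.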

Page 10: The EP-INV- Patstat  db and preliminary results

FEEDBACK WEB APPLICATION


Page 11: The EP-INV- Patstat  db and preliminary results

Why share data?

• Instead of looking for one golden algorithm, APE-INV proposes data dissemination combined with the recording of users’ feedback.

• Two kinds of users:
  1. Take the data and run (dissemination only): they use the data in their studies uncritically. No benefit for the project, and risky for them (the data are disambiguated according to the state of the art in disambiguation techniques, but we can always do better).
  2. Critical users (dissemination + feedback): they use the data, usually sub-samples of the whole dataset, and can improve data quality through:
     • Hand-checked data and survey work on smaller samples
     • Algorithms that better fit sub-sample specificities (e.g. country, firm, technological field)
     • Data sources external to PatStat that help the disambiguation effort


Page 12: The EP-INV- Patstat  db and preliminary results

How does data dissemination work?

• Access http://www.ape-inv.disco.unimib.it/ with your id and password
• Choose the country or countries of the inventors you need (e.g. my research is on Italian inventors)
• Get the EP-INV dataset and CONTROVERSY.txt

Query results are provided in txt format.

Page 13: The EP-INV- Patstat  db and preliminary results

SOME RESULTS


Page 14: The EP-INV- Patstat  db and preliminary results

Number of academic patents, 1996-2006

Page 15: The EP-INV- Patstat  db and preliminary results

Ownership distribution of academic patents, lower bound estimates

Page 16: The EP-INV- Patstat  db and preliminary results

Ownership distribution of academic patents, upper bound estimates

Page 17: The EP-INV- Patstat  db and preliminary results

ONGOING WORKS


Page 18: The EP-INV- Patstat  db and preliminary results

Temporal Record linkage

• “Panta rei” (Heraclitus): everything flows, everything is constantly changing.

• Databases may keep track of these never-ending changes

• Examples
  • People change names
    • Xin Dong → Xin Luna Dong
  • People change jobs
    • Halevy moves from Univ. of Wa. to Google
  • Nations change
    • Yugoslavia → Serbia-Montenegro → Serbia, Kosovo

• Based on the paper: P. Li, X. L. Dong, A. Maurino, D. Srivastava, “Linking Temporal Records”, VLDB 2011


Page 19: The EP-INV- Patstat  db and preliminary results

An example

person_id  person_name        person_address                              appln_filing_date
110670     ABELE, MANLIO, G.  5 EAST 22ND STREET;NEW YORK, NY 10016       18/10/1990
110670     ABELE, MANLIO, G.  5 EAST 22ND STREET;NEW YORK, NY 10016       06/04/1992
110671     ABELE, MANLIO, G.  5 EAST 22ND STREET, 205;NEW YORK, NY 10010  20/02/1991
110672     ABELE, Manlio, G.  5 East 22nd Street, 205,New York, NY 10010  20/02/1991
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       18/10/1990
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       06/04/1992
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       12/04/1995
110674     ABELE, Manlio, G.  5 East 22nd Street,New York, NY 10016       12/03/1996
110675     Abele, Manlio      250 East 54th St.,New York, NY 10022        19/03/2004
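
The table suggests why exact matching is not enough. The sketch below is not the project's algorithm: it only normalizes case and punctuation (an assumed, deliberately naive rule, with a hypothetical helper `normalize`) and groups the person_ids whose normalized name and address coincide.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())


# One (person_id, name, address) triple per distinct person_id in the table.
records = [
    (110670, "ABELE, MANLIO, G.", "5 EAST 22ND STREET;NEW YORK, NY 10016"),
    (110671, "ABELE, MANLIO, G.", "5 EAST 22ND STREET, 205;NEW YORK, NY 10010"),
    (110672, "ABELE, Manlio, G.", "5 East 22nd Street, 205,New York, NY 10010"),
    (110674, "ABELE, Manlio, G.", "5 East 22nd Street,New York, NY 10016"),
    (110675, "Abele, Manlio", "250 East 54th St.,New York, NY 10022"),
]

# Group person_ids whose normalized (name, address) pair is identical.
groups = {}
for person_id, name, address in records:
    key = (normalize(name), normalize(address))
    groups.setdefault(key, []).append(person_id)

for key, ids in groups.items():
    print(ids, key)
```

Normalization merges the formatting variants (110670 with 110674, 110671 with 110672) but still leaves three separate groups, because the ZIP code, the address, and the form of the name differ across records. Deciding whether such records belong to one inventor whose details changed over time is the problem that temporal record linkage addresses.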

Page 20: The EP-INV- Patstat  db and preliminary results

Experimental Evaluation

[Chart: F-1, Precision, and Recall (vertical axis from 0.5 to 1) for PARTITION, DECAYEDPARTITION, NODECAYADJUST, and ADJUST]

• Effectiveness test:
  • Data set: patent data, 1871 records, 359 entities, 1978-2003
  • Comparison: three existing algorithms, w./o. decayed similarity
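
The slide does not say how precision, recall, and F-1 were computed. One common convention for evaluating record linkage is to score pairs of records placed in the same cluster; the sketch below assumes that pairwise definition, and the toy clusters and function names are hypothetical.

```python
from itertools import combinations


def same_entity_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs


def pairwise_scores(predicted, truth):
    """Pairwise precision, recall, and F-1 of a predicted clustering."""
    pred_pairs = same_entity_pairs(predicted)
    true_pairs = same_entity_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: r1, r2, r3 are truly one entity; r4 is another.
truth = [{"r1", "r2", "r3"}, {"r4"}]
predicted = [{"r1", "r2"}, {"r3", "r4"}]

precision, recall, f1 = pairwise_scores(predicted, truth)
print(precision, recall, f1)
```

Under this definition, wrongly merging r3 with r4 lowers precision, while failing to link r3 to r1 and r2 lowers recall.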

Page 21: The EP-INV- Patstat  db and preliminary results

Thanks!

Questions?