Disambiguating inventor names using deep neural networks Steve Petrie T’Mir Julius

Disambiguating inventor names using deep neural networks

Steve PetrieT’Mir Julius

Swinburne

• Inventors in patent apps do not have unique IDs: identical names same inventor? / different inventors? different names same inventor? / different inventors?

• Goal: disambiguate inventor names assign unique inventor ID

2

Project goal: match inventor names

Swinburne

• Program a hand-crafted algorithm based on, eg: same/similar last name (account for spelling variations) same/similar first name (account for spelling variations) similar application dates (investigate different windows) similar co-authors (account for spelling variations)

• Machine learning (algorithm learns important discriminating features from data), eg: neural networks

3

Options to tackle the problem

Swinburne

• Program a hand-crafted algorithm:+ inner workings are explicit/transparent- requires a lot of programmer effort & time- brittle: a lot of special cases (exceptions) may go unseen/unimplemented (analogous to overfitting)

• Machine learning:+ automatically learns discriminating features from data+ learns features fast+ does not require as much expert knowledge of dataset- inner workings usually not explicit/transparent- overfitting may be a problem 4

Options to tackle the problem

Swinburne

• Biological neuron:

• Artificial neuron (“perceptron”):

Neural networks

[credit: en.wikipedia.org/wiki/Neuron]

[credit: neuralnetworksanddeeplearning.com]

Swinburne

• Deep neural networks (DNNs) multiple hidden layers enables abstraction of concepts

6

Neural networks

[credit: neuralnetworksanddeeplearning.com]

Swinburne

• AlexNet2012 architecture:

DNN example: AlexNet2012

[credit: Krizhevsky et al (2012)]

inputlayer

outputlayer

hidden layers

Swinburne




inputlayer

outputlayer

hidden layers

dog

Swinburne


• Train (labelled):



ship flower elephant

Swinburne



• Deploy (unlabelled):




Swinburne



• Deploy (unlabelled):




Swinburne



[ credit: devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/ ]

Swinburne

• Perhaps a DNN designed to classify:

1.2 Mn training images 1,000 classes

  will perform well when classifying:

430k training comparisons 2 classes

• But patent app data is text, not images!

Can we use a DNN…?

o dogo humano shipo etc…

o matcho non-match

Swinburne

• Need to represent text as numbers

• Convert to vector?

• Convert to image (2D bitmap)? works with previous DNNs designed for image analysis accounts for spelling errors, translations (different string

positions within word/s)

Transforming text to 2D images

Swinburne

• Clean text: remove whitespace remove punctuation convert to uppercase


Swinburne

• 2D map structure:


L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

WANG

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

WANG

Swinburne



L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

PETRIE

WANG

Swinburne

Transforming text to 2D imagesL M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

firstname

Swinburne


D C K/Q B P

Z S A E F

R T I O U

J G Y X V

L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

firstname

lastname

Swinburne


D C K/Q B P

Z S A E F

R T I O U

J G Y X V

L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

firstname

lastname city

Swinburne


D C K/Q B P

Z S A E F

R T I O U

J G Y X V

L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

L M N H W

D C K/Q B P

Z S A E F

R T I O U

J G Y X V

firstname

lastname city

international patentclassification (IPC)

8 A B 7

H 3 C

9 4 0 2

G 1 D

5 F E 6

Swinburne

Transforming text to 2D imagesmatch

non-match

Swinburne

• Process all data within 5 days:

How many comparisons?

n = 12.4Mn inventor names

Swinburne



n – 1 comparisons


Swinburne



n – 2 comparisons


Swinburne



= 7.7x1013 (77 Tn) comparisons

computation time ~ years


Swinburne

• Sort patent apps into “bins” by lastname:

Binning (blocking)

Swinburne


Binning (blocking)

PETRIE

Swinburne


Binning (blocking)

PET

PETRIE

Swinburne


Binning (blocking)

WANG

PET

Swinburne


Binning (blocking)

WANG

WAN

PET

Swinburne


Binning (blocking)

PETERSON

WAN

PET

Swinburne


Binning (blocking)

PETERSON

WAN

PET

Swinburne

# comparisons matches retained

• No binning: 77 Tn 100%

• PETrie: 154 Bn 99.85%

• PETRie: 68 Bn 99.68%

• PETRIe: 38 Bn 99.37%

Binning (blocking)

Swinburne



• PETrie: 154 Bn 99.85%

• PETRie: 68 Bn 99.68%

• PETRIe: 38 Bn 99.37%

• Still a problem: LI: 27k inventors; 360 Mn comparisons LEE: 98k inventors; 5 Bn comparisons

Binning (blocking)

Swinburne



• PETrie: 154 Bn 99.85% many “problem” comparisons are within a small no. of bins

• only consider bins containing > 100 inventors (> 10k comparisons)

if bin key ≤ 2 letters, re-bin with 1st letter of 1st name:Jet Li LI,JJing Li LI,JWei Li LI,W

Binning (blocking)

Swinburne



• PETrie: 154 Bn 99.85% if bin key ≤ 2 letters, re-bin with 1st letter of 1st name

• PETRie: 64 Bn 99.71% only increase 3 4 letters if bin contains > 100 inventors if bin key ≤ 3 letters, re-bin with 1st letter of 1st name

Binning (blocking)

Swinburne



• PETrie: 154 Bn 99.85% if bin key ≤ 2 letters, re-bin with 1st letter of 1st name

• PETRie: 64 Bn 99.71% only increase 3 4 letters if bin contains > 100 inventors if bin key ≤ 3 letters, re-bin with 1st letter of 1st name

• …[15 letters]: 441 Mn 99.54% only increase 14 15 letters if bin contains > 100 invtrs if bin key ≤ 14 letters, re-bin with letters from 1st name

Binning (blocking)

Swinburne

• Precision

• Recall

• Splitting

• Lumping

Preliminary results (labelled data only)

Swinburne

• Precision

• Recall

• Splitting

• Lumping

Preliminary results (labelled data only) • match:non-match ratio in

labelled data is 48:52• Probably more non-matches

in bulk data (falsepos ↑)

Swinburne


Swinburne


Swinburne

• Total: ~ 70h [not including assigning unique IDs]

• Break-down:1. Binning:

~ 1h2. Generating comparison-map images: ~ 36h3. Image classification (deploying DNN for inference): ~ 33h4. Obtaining linked groups (with unique IDs): [not run yet]

Run-times for bulk data processing

Swinburne

• DNN outputs probability of match/non-match for any given invtr-invtr comparison

• However, obtaining unique IDs is not straightforward:

Obtaining unique inventor IDs

A

B C

Swinburne



• Should we give D: same ID…? different ID…?


A

B C

D

Swinburne



• Should we give D: same ID…? different ID…?


A

B C

D

Swinburne



• Should we give D: same ID…? bad for precision (more false pos) different ID…? bad for recall (fewer true pos)


A

B C

D

Swinburne



• Should we give D: same ID…? if # links ≥ n/2 different ID…? if # links < n/2


A

B C

D

Swinburne



• Should we give D: same ID…? if # links ≥ n/2 different ID…? if # links < n/2


A

B C

D

Documents

Disambiguating inventor names using deep neural networks Steve Petrie T’Mir Julius