Upload
hilary-dawson
View
229
Download
0
Embed Size (px)
DESCRIPTION
Swinburne Program a hand-crafted algorithm based on, eg: same/similar last name (account for spelling variations) same/similar first name (account for spelling variations) similar application dates (investigate different windows) similar co-authors (account for spelling variations) Machine learning (algorithm learns important discriminating features from data), eg: neural networks 3 Options to tackle the problem
Citation preview
Disambiguating inventor names using deep neural networks
Steve PetrieT’Mir Julius
Swinburne
• Inventors in patent apps do not have unique IDs: identical names same inventor? / different inventors? different names same inventor? / different inventors?
• Goal: disambiguate inventor names assign unique inventor ID
2
Project goal: match inventor names
Swinburne
• Program a hand-crafted algorithm based on, eg: same/similar last name (account for spelling variations) same/similar first name (account for spelling variations) similar application dates (investigate different windows) similar co-authors (account for spelling variations)
• Machine learning (algorithm learns important discriminating features from data), eg: neural networks
3
Options to tackle the problem
Swinburne
• Program a hand-crafted algorithm:+ inner workings are explicit/transparent- requires a lot of programmer effort & time- brittle: a lot of special cases (exceptions) may go unseen/unimplemented (analogous to overfitting)
• Machine learning:+ automatically learns discriminating features from data+ learns features fast+ does not require as much expert knowledge of dataset- inner workings usually not explicit/transparent- overfitting may be a problem 4
Options to tackle the problem
Swinburne
• Biological neuron:
• Artificial neuron (“perceptron”):
Neural networks
[credit: en.wikipedia.org/wiki/Neuron]
[credit: neuralnetworksanddeeplearning.com]
Swinburne
• Deep neural networks (DNNs) multiple hidden layers enables abstraction of concepts
6
Neural networks
[credit: neuralnetworksanddeeplearning.com]
Swinburne
• AlexNet2012 architecture:
DNN example: AlexNet2012
[credit: Krizhevsky et al (2012)]
inputlayer
outputlayer
hidden layers
Swinburne
• AlexNet2012 architecture:
DNN example: AlexNet2012
[credit: Krizhevsky et al (2012)]
inputlayer
outputlayer
hidden layers
dog
Swinburne
• AlexNet2012 architecture:
• Train (labelled):
DNN example: AlexNet2012
[credit: Krizhevsky et al (2012)]
ship flower elephant
Swinburne
• AlexNet2012 architecture:
• Train (labelled):
• Deploy (unlabelled):
DNN example: AlexNet2012
[credit: Krizhevsky et al (2012)]
ship flower elephant
Swinburne
• AlexNet2012 architecture:
• Train (labelled):
• Deploy (unlabelled):
DNN example: AlexNet2012
[credit: Krizhevsky et al (2012)]
ship flower elephant
Swinburne
• AlexNet2012 architecture:
DNN example: AlexNet2012
[ credit: devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library/ ]
Swinburne
• Perhaps a DNN designed to classify:
1.2 Mn training images 1,000 classes
will perform well when classifying:
430k training comparisons 2 classes
• But patent app data is text, not images!
Can we use a DNN…?
o dogo humano shipo etc…
o matcho non-match
Swinburne
• Need to represent text as numbers
• Convert to vector?
• Convert to image (2D bitmap)? works with previous DNNs designed for image analysis accounts for spelling errors, translations (different string
positions within word/s)
Transforming text to 2D images
Swinburne
• Clean text: remove whitespace remove punctuation convert to uppercase
Transforming text to 2D images
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
WANG
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
WANG
Swinburne
• 2D map structure:
Transforming text to 2D images
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
PETRIE
WANG
Swinburne
Transforming text to 2D imagesL M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
firstname
Swinburne
Transforming text to 2D imagesL M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
firstname
lastname
Swinburne
Transforming text to 2D imagesL M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
firstname
lastname city
Swinburne
Transforming text to 2D imagesL M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
L M N H W
D C K/Q B P
Z S A E F
R T I O U
J G Y X V
firstname
lastname city
international patentclassification (IPC)
8 A B 7
H 3 C
9 4 0 2
G 1 D
5 F E 6
Swinburne
Transforming text to 2D imagesmatch
non-match
Swinburne
• Process all data within 5 days:
How many comparisons?
n = 12.4Mn inventor names
Swinburne
• Process all data within 5 days:
How many comparisons?
n – 1 comparisons
n = 12.4Mn inventor names
Swinburne
• Process all data within 5 days:
How many comparisons?
n – 2 comparisons
n = 12.4Mn inventor names
Swinburne
• Process all data within 5 days:
How many comparisons?
= 7.7x1013 (77 Tn) comparisons
computation time ~ years
n = 12.4Mn inventor names
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
PETRIE
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
PET
PETRIE
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
WANG
PET
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
WANG
WAN
PET
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
PETERSON
WAN
PET
Swinburne
• Sort patent apps into “bins” by lastname:
Binning (blocking)
PETERSON
WAN
PET
Swinburne
# comparisons matches retained
• No binning: 77 Tn 100%
• PETrie: 154 Bn 99.85%
• PETRie: 68 Bn 99.68%
• PETRIe: 38 Bn 99.37%
Binning (blocking)
Swinburne
# comparisons matches retained
• No binning: 77 Tn 100%
• PETrie: 154 Bn 99.85%
• PETRie: 68 Bn 99.68%
• PETRIe: 38 Bn 99.37%
• Still a problem: LI: 27k inventors; 360 Mn comparisons LEE: 98k inventors; 5 Bn comparisons
Binning (blocking)
Swinburne
# comparisons matches retained
• No binning: 77 Tn 100%
• PETrie: 154 Bn 99.85% many “problem” comparisons are within a small no. of bins
• only consider bins containing > 100 inventors (> 10k comparisons)
if bin key ≤ 2 letters, re-bin with 1st letter of 1st name:Jet Li LI,JJing Li LI,JWei Li LI,W
Binning (blocking)
Swinburne
# comparisons matches retained
• No binning: 77 Tn 100%
• PETrie: 154 Bn 99.85% if bin key ≤ 2 letters, re-bin with 1st letter of 1st name
• PETRie: 64 Bn 99.71% only increase 3 4 letters if bin contains > 100 inventors if bin key ≤ 3 letters, re-bin with 1st letter of 1st name
Binning (blocking)
Swinburne
# comparisons matches retained
• No binning: 77 Tn 100%
• PETrie: 154 Bn 99.85% if bin key ≤ 2 letters, re-bin with 1st letter of 1st name
• PETRie: 64 Bn 99.71% only increase 3 4 letters if bin contains > 100 inventors if bin key ≤ 3 letters, re-bin with 1st letter of 1st name
• …[15 letters]: 441 Mn 99.54% only increase 14 15 letters if bin contains > 100 invtrs if bin key ≤ 14 letters, re-bin with letters from 1st name
Binning (blocking)
Swinburne
• Precision
• Recall
• Splitting
• Lumping
Preliminary results (labelled data only)
Swinburne
• Precision
• Recall
• Splitting
• Lumping
Preliminary results (labelled data only) • match:non-match ratio in
labelled data is 48:52• Probably more non-matches
in bulk data (falsepos ↑)
Swinburne
Preliminary results (labelled data only)
Swinburne
Preliminary results (labelled data only)
Swinburne
• Total: ~ 70h [not including assigning unique IDs]
• Break-down:1. Binning:
~ 1h2. Generating comparison-map images: ~ 36h3. Image classification (deploying DNN for inference): ~ 33h4. Obtaining linked groups (with unique IDs): [not run yet]
Run-times for bulk data processing
Swinburne
• DNN outputs probability of match/non-match for any given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
Obtaining unique inventor IDs
A
B C
Swinburne
• DNN outputs probability of match/non-match for any given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
• Should we give D: same ID…? different ID…?
Obtaining unique inventor IDs
A
B C
D
Swinburne
• DNN outputs probability of match/non-match for any given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
• Should we give D: same ID…? different ID…?
Obtaining unique inventor IDs
A
B C
D
Swinburne
• DNN outputs probability of match/non-match for any given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
• Should we give D: same ID…? bad for precision (more false pos) different ID…? bad for recall (fewer true pos)
Obtaining unique inventor IDs
A
B C
D
Swinburne
• DNN outputs probability of match/non-match for any given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
• Should we give D: same ID…? if # links ≥ n/2 different ID…? if # links < n/2
Obtaining unique inventor IDs
A
B C
D
Swinburne
• DNN outputs probability of match/non-match for any given invtr-invtr comparison
• However, obtaining unique IDs is not straightforward:
• Should we give D: same ID…? if # links ≥ n/2 different ID…? if # links < n/2
Obtaining unique inventor IDs
A
B C
D