Also By The Same Author: AKTiveAuthor, A Citation Graph Approach
To Name Disambiguation
AKT DTA Colloquium
January 23, 2006
Duncan McRae-Spencer
• Name ambiguity a problem for automated information extraction.
• Two problems:
1. Same name, different object: David L. Harris (Harvey Mudd College, formerly Stanford and MIT) and David L. Harris (Sandia Labs, Albuquerque).
2. Different name, same object: Professor Nick Jennings, Nicholas Jennings, N. R. Jennings.
• Existing Solutions:
– By-hand disambiguation (eg DBLP).
• Problem: slow, labour-intensive.
– Text and context processing: Li et al (2005).
• Problem: deals with names within text, not document authors.
– Metadata machine-learning techniques: Han et al (2004, 2005).
• Problem: requires a known ‘canonical’ set and 50% of data used in training.
• AKTiveAuthor: Linking together paper authors using metadata analysis.
• Specifically based on the following observation:
– People cite their own work. When they cite an author with a similar name, 95-98% of the time it is the same person.
• Step one: Initial clustering on last name.
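A minimal sketch of the initial clustering step, assuming each paper carries a list of author names in 'First Middle Last' order. The function names and the sample records are illustrative assumptions, not taken from AKTiveAuthor itself.

```python
from collections import defaultdict

def last_name(author):
    """Return a lower-cased last name, assuming 'First Middle Last' order."""
    return author.strip().split()[-1].lower()

def cluster_by_last_name(papers):
    """Map each last name to the ids of papers with an author of that name."""
    clusters = defaultdict(set)
    for paper_id, authors in papers.items():
        for author in authors:
            clusters[last_name(author)].add(paper_id)
    return dict(clusters)

# Hypothetical paper records: id -> author names.
papers = {
    "p1": ["Nicholas Jennings", "David L. Harris"],
    "p2": ["N. R. Jennings"],
    "p3": ["David L. Harris"],
}
clusters = cluster_by_last_name(papers)
```

Each name-cluster is deliberately over-inclusive at this point: both David L. Harrises land in the same cluster, and the later stages are what separate them.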
• Self-citation analysis:
– Within a name-cluster, test papers against each other.
– Does paper A appear in the bibliography of paper B, or vice versa?
– Iteratively use this approach to build groups of papers, each representing one real-world author.
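The self-citation stage can be sketched as a connected-components problem: papers in a name-cluster are nodes, and a citation in either direction links two of them. The union-find structure and the input shapes below are assumptions made for illustration; the sanity check applied before each join is left out here.

```python
def group_by_self_citation(cluster, bibliography):
    """Group papers in one name-cluster linked by citation in either direction.

    cluster: iterable of paper ids that share a last name.
    bibliography: dict mapping a paper id to the set of paper ids it cites.
    """
    parent = {p: p for p in cluster}

    def find(p):
        # Follow parent links to the group representative, halving paths.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a in parent:
        for b in bibliography.get(a, ()):
            if b in parent:  # only citations inside the name-cluster count
                parent[find(a)] = find(b)

    groups = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return sorted(groups.values(), key=lambda g: sorted(g))

# Hypothetical data: b cites a, c cites b and an outside paper x.
groups = group_by_self_citation(
    ["a", "b", "c", "d"],
    {"b": {"a"}, "c": {"b", "x"}, "d": set()},
)
```

Here a, b and c end up in one group through the chain of citations, while d stays on its own.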
• Co-authorship Analysis:
– Standard approach in disambiguation (Han et al) and social network analysis (AKT Ontocopi).
– Use co-authorship relationships to further match the groups created in the self-citation stage.
• Source URL Analysis:
– Extra linking provided using the ‘source URL’ metadata field.
– Links papers by same author on different subjects across one time period.
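A rough sketch of the co-authorship stage, assuming two groups are joined as soon as they share a single co-author name. That threshold, the function name and the placeholder names are assumptions; the slides do not state the exact matching rule. The same shape would work for source URLs by passing URL sets instead of co-author sets.

```python
def merge_by_coauthor(groups, coauthors):
    """Merge groups of papers whose co-author sets overlap.

    groups: list of sets of paper ids (eg from the self-citation stage).
    coauthors: dict mapping a paper id to its set of co-author names,
               excluding the ambiguous author under test.
    """
    merged = []  # list of (paper_ids, coauthor_names) pairs
    for group in groups:
        papers = set(group)
        names = set().union(*(coauthors.get(p, set()) for p in group))
        remaining = []
        for other_papers, other_names in merged:
            if names & other_names:
                # Shared co-author: absorb the existing group.
                papers |= other_papers
                names |= other_names
            else:
                remaining.append((other_papers, other_names))
        merged = remaining + [(papers, names)]
    return [papers for papers, _ in merged]

# Hypothetical co-author names.
result = merge_by_coauthor(
    [{"p1", "p2"}, {"p3"}, {"p4"}],
    {"p1": {"X. Alpha"}, "p2": set(),
     "p3": {"X. Alpha", "Y. Beta"}, "p4": {"Z. Gamma"}},
)
```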
• Sanity Check:
– Before committing to a ‘join’ on any of the three stages, check to see if it’s obviously not the same person.
– Eg Norman L. Johnson and David E. Johnson (self-citation match).
– Eg Earl and Erik Johnson (co-authorship match).
– Eg Nicholas Jennings and N. Jennings allowed.
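One possible form of the sanity check, assuming it only compares first names: an initial is compatible with any first name starting with that letter, while two different full first names conflict. This rule is an assumption chosen to reproduce the three examples above; it would not accept nickname pairs such as Nick and Nicholas.

```python
def first_names_compatible(name_a, name_b):
    """Return False only when the two first names obviously conflict."""
    first_a = name_a.split()[0].rstrip(".").lower()
    first_b = name_b.split()[0].rstrip(".").lower()
    if len(first_a) == 1 or len(first_b) == 1:
        # An initial is compatible with any name starting with that letter.
        return first_a[0] == first_b[0]
    return first_a == first_b

rejected_1 = first_names_compatible("Norman L. Johnson", "David E. Johnson")
rejected_2 = first_names_compatible("Earl Johnson", "Erik Johnson")
allowed = first_names_compatible("Nicholas Jennings", "N. Jennings")
```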
• Metrics:
– Essentially an information retrieval exercise.
• Three measures, each per individual paper:
– Precision: (number of relevant docs retrieved) / (number of docs retrieved).
– Recall: (number of relevant docs retrieved) / (number of relevant docs overall).
– F-measure: Harmonic mean of Precision and Recall, used as generic measure of IR success.
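The three measures can be computed as below. The reading assumed here is that, for a given paper, 'retrieved' is the set of papers the algorithm grouped with it and 'relevant' is the set written by the same real author according to the by-hand data; the sample sets are made up.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for one paper's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical case: three papers grouped, all correct, one missed.
p, r, f = precision_recall_f({"a", "b", "c"}, {"a", "b", "c", "d"})
```

In this example precision is 1.0 and recall is 0.75, the same pattern of high precision and lower recall that a cautious joining strategy tends to produce.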
• Results:
– Tested eight name-clusters, checking against by-hand disambiguated results.
• Precision ranged from 0.991 to 1.000 (mean 0.997).
• Recall ranged from 0.705 to 0.935 (mean 0.818).
• F-measure ranged from 0.824 to 0.965 (mean 0.899).
• Analysis / Conclusions:
– Precision higher than recall, mainly due to sanity check.
– All three methods (self-citation, co-authorship and source URL analysis) needed for best results.
– Heavily-dominated name-clusters give best results (eg Giles (81.6% C Lee Giles)).
– Large and small name-clusters equally good.
• Future Work:
– Original purpose: citation graph services, eg ‘view my papers’, ‘count my citations’, ‘calculate my impact’.
– Improving the disambiguation algorithm: institutional affiliation data, tightening up co-authorship, better initial clustering.
• Questions?