Also By The Same Author: AKTiveAuthor, A Citation Graph Approach To Name Disambiguation

AKT DTA Colloquium

January 23, 2006

Duncan McRae-Spencer

Also By The Same Author

• Name ambiguity a problem for automated information extraction.

• Two problems:
  1. Same name, different object: David L. Harris (Harvey Mudd College, formerly Stanford and MIT) and David L. Harris (Sandia Labs, Albuquerque).
  2. Different name, same object: Professor Nick Jennings, Nicholas Jennings, N. R. Jennings.


• Existing Solutions:
  – By-hand disambiguation (e.g. DBLP).
    • Problem: slow, labour-intensive.
  – Text and context processing: Li et al. (2005).
    • Problem: deals with names within text, not document authors.
  – Metadata machine-learning techniques: Han et al. (2004, 2005).
    • Problem: requires a known ‘canonical’ set, and 50% of the data is used in training.


• AKTiveAuthor: Linking together paper authors using metadata analysis.

• Specifically based on the following observation:
  – People cite their own work. When they cite an author with a similar name, 95-98% of the time it is the same person.

• Step one: Initial clustering on last name.
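The initial clustering step can be sketched as follows. This is a minimal illustration, not the AKTiveAuthor implementation: the `papers` mapping, the `last_name` helper, and the assumption that names are written in "First [Middle] Last" order are all hypothetical.

```python
from collections import defaultdict

def last_name(author):
    """Return a lower-cased last name, assuming 'First [Middle] Last' order."""
    return author.strip().split()[-1].lower()

def cluster_by_last_name(papers):
    """Map each last name to the (paper_id, author) pairs that carry it."""
    clusters = defaultdict(list)
    for paper_id, authors in papers.items():
        for author in authors:
            clusters[last_name(author)].append((paper_id, author))
    return dict(clusters)

# Hypothetical input: paper id -> list of author name strings.
papers = {
    "p1": ["Nicholas Jennings", "David L. Harris"],
    "p2": ["N. R. Jennings"],
    "p3": ["David L. Harris"],
}
clusters = cluster_by_last_name(papers)
```

Each resulting name-cluster still mixes different people with the same last name; separating them is the job of the later stages.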


• Self-citation analysis:
  – Within a name-cluster, test papers against each other.
  – Does paper A appear in the bibliography of paper B, or vice versa?
  – Iteratively use this approach to build groups of papers, each representing one real-world author.
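One way to picture the iterative grouping is as a union-find over the papers in a single name-cluster, joining two papers whenever one cites the other. The sketch below assumes a hypothetical `cites` mapping from paper id to the set of paper ids in its bibliography. It leaves out the sanity check, and the slides do not say which data structure AKTiveAuthor actually uses.

```python
def find(parent, x):
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def self_citation_groups(cluster, cites):
    """Group the papers of one name-cluster by citation links in either direction."""
    parent = {p: p for p in cluster}
    for a in cluster:
        for b in cluster:
            if a < b and (b in cites.get(a, set()) or a in cites.get(b, set())):
                parent[find(parent, a)] = find(parent, b)
    groups = {}
    for p in cluster:
        groups.setdefault(find(parent, p), set()).add(p)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical name-cluster: p2 cites p1, p3 cites p2, p4 cites only an outside paper.
cluster = ["p1", "p2", "p3", "p4"]
cites = {"p2": {"p1"}, "p3": {"p2"}, "p4": {"x9"}}
groups = self_citation_groups(cluster, cites)
```

Here p1 and p3 end up in the same group even though neither cites the other directly, because both are linked through p2.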


• Co-authorship Analysis:
  – Standard approach in disambiguation (Han et al.) and social network analysis (AKT Ontocopi).
  – Use co-authorship relationships to further match the groups created in the self-citation stage.
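A rough sketch of the co-authorship stage: two groups are merged when they share at least one co-author name. The `coauthors` mapping and the exact-string match on co-author names are assumptions made for illustration. The slides do not give the real matching rule, and the sanity check is again omitted.

```python
def merge_by_coauthor(groups, coauthors):
    """Merge groups of paper ids that share at least one co-author name."""
    merged = []  # list of (paper ids, co-author names), names pairwise disjoint
    for group in groups:
        papers = set(group)
        names = set().union(*(coauthors.get(p, set()) for p in group))
        keep = []
        for other_papers, other_names in merged:
            if names & other_names:
                # Shared co-author: absorb the earlier group into this one.
                papers |= other_papers
                names |= other_names
            else:
                keep.append((other_papers, other_names))
        keep.append((papers, names))
        merged = keep
    return sorted(sorted(p) for p, _ in merged)

# Hypothetical groups from an earlier stage, with made-up co-author names.
groups_in = [{"p1", "p2"}, {"p3"}, {"p4"}]
coauthors = {
    "p1": {"Coauthor A"},
    "p3": {"Coauthor A", "Coauthor B"},
    "p4": {"Coauthor C"},
}
merged_groups = merge_by_coauthor(groups_in, coauthors)
```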

• Source URL Analysis:
  – Extra linking provided using the ‘source URL’ metadata field.
  – Links papers by same author on different subjects across one time period.
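The slides do not say how source URLs are compared, so the following is only one plausible reading: treat two groups as linkable when some of their papers were harvested from the same host and directory. The `source_url` mapping, the `source_key` helper, and the host-plus-directory rule are all assumptions.

```python
import posixpath
from urllib.parse import urlsplit

def source_key(url):
    """Reduce a URL to (host, directory), dropping the file name."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), posixpath.dirname(parts.path))

def share_source(group_a, group_b, source_url):
    """True if any paper in group_a shares a source location with one in group_b."""
    keys_a = {source_key(source_url[p]) for p in group_a if p in source_url}
    keys_b = {source_key(source_url[p]) for p in group_b if p in source_url}
    return bool(keys_a & keys_b)

# Hypothetical 'source URL' metadata for three papers.
source_url = {
    "p1": "http://www.example.org/~author/papers/a.pdf",
    "p5": "http://www.example.org/~author/papers/b.pdf",
    "p6": "http://other.example.com/pubs/c.pdf",
}
same = share_source({"p1"}, {"p5"}, source_url)
different = share_source({"p1"}, {"p6"}, source_url)
```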


• Sanity Check:
  – Before committing to a ‘join’ on any of the three stages, check to see if it’s obviously not the same person.
  – E.g. Norman L. Johnson and David E. Johnson (self-citation match).
  – E.g. Earl and Erik Johnson (co-authorship match).
  – E.g. Nicholas Jennings and N. Jennings allowed.
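The three examples above suggest a simple name-compatibility test, sketched below. The rule used here is an assumption rather than the documented AKTiveAuthor check: an initial is taken to match any given name starting with that letter, and two full given names must be identical. It does not handle titles or nicknames such as "Professor Nick Jennings".

```python
def given_tokens(name):
    """Lower-cased given names and initials, without trailing dots."""
    tokens = (t.rstrip(".").lower() for t in name.split()[:-1])
    return [t for t in tokens if t]

def token_compatible(a, b):
    """An initial matches on first letter; two full names must be equal."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b

def plausibly_same_person(name_a, name_b):
    """Reject a join only when the two names clearly conflict."""
    if name_a.split()[-1].lower() != name_b.split()[-1].lower():
        return False
    pairs = zip(given_tokens(name_a), given_tokens(name_b))
    return all(token_compatible(x, y) for x, y in pairs)

blocked_1 = plausibly_same_person("Norman L. Johnson", "David E. Johnson")
blocked_2 = plausibly_same_person("Earl Johnson", "Erik Johnson")
allowed = plausibly_same_person("Nicholas Jennings", "N. Jennings")
```

With this rule the two Johnson pairs are rejected and the Jennings pair is allowed, matching the examples on the slide.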


• Metrics:
  – Essentially an information retrieval exercise.
• Three measures, each per individual paper:
  – Precision: (number of relevant docs retrieved) / (number of docs retrieved).
  – Recall: (number of relevant docs retrieved) / (number of relevant docs overall).
  – F-measure: harmonic mean of Precision and Recall, used as generic measure of IR success.
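The three measures can be computed per paper as in the sketch below. Here `retrieved` is assumed to be the set of papers the algorithm grouped with the paper in question, and `relevant` the set that the by-hand disambiguation assigns to the same author. Both names and the example data are illustrative.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F-measure) for two sets of paper ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical case: the algorithm found three of an author's four papers
# and grouped no wrong papers with them.
retrieved = {"p1", "p2", "p3"}
relevant = {"p1", "p2", "p3", "p4"}
precision, recall, f_measure = precision_recall_f(retrieved, relevant)
```

A group that is correct but incomplete keeps precision at 1.0 and loses only recall.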


• Results:
  – Tested eight name-clusters, checking against by-hand disambiguated results.
• Precision ranged from 0.991 to 1.000 (mean 0.997).
• Recall ranged from 0.705 to 0.935 (mean 0.818).
• F-measure ranged from 0.824 to 0.965 (mean 0.899).


• Analysis / Conclusions:
  – Precision higher than recall, mainly due to sanity check.
  – All three methods (self-citation, co-authorship and source URL analysis) needed for best results.
  – Heavily-dominated name-clusters give best results (e.g. Giles: 81.6% C. Lee Giles).
  – Large and small name-clusters equally good.


• Future Work:
  – Original purpose: citation graph services, e.g. ‘view my papers’, ‘count my citations’, ‘calculate my impact’.
  – Improving the disambiguation algorithm: institutional affiliation data, tightening up co-authorship, better initial clustering.


• Questions?