Name Disambiguation in Digital Libraries
Tan Yee Fan
2005 October 19
WING Group Meeting
Digital libraries
- DBLP, Citeseer, etc.
- Information is stored as metadata records to facilitate searching
  - Author names
  - Titles
  - Publication titles
- Inconsistency in metadata records hinders searching
  - Abbreviation of names and publication titles
  - Typographical errors
Are they the same author?
- Danny Poo
  - Danny C. C. Poo, Teck-Kang Toh, Christopher S. G. Khoo, Glenn Hong. Development of an Intelligent Web Interface to Online Library Catalog Databases. APSEC 1999: 64-7
  - Danny Chiang Choon Poo, Isaac K. C. Tan. Design of an Automatic Annotation Framework for Corporate Web Content. APSEC 2004: 384-391
- Hui Yang
  - Maan A. Kousa, Ahmed K. Elhakeem, Hui Yang. Performance of ATM networks under hybrid ARQ/FEC error control scheme. IEEE/ACM Trans. Netw. 7(6): 917-925 (1999)
  - Hui Yang, Tat-Seng Chua. QUALIFIER: Question Answering by Lexical Fabric and External Resources. EACL 2003: 363-370
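The examples above suggest a first, purely string-level check: could two name strings be variants of the same name? The following is a minimal sketch, not the method from this talk. It assumes the surname is the last token and that an initial matches any given name starting with the same letter; the function names are hypothetical.

```python
import re

def name_tokens(name):
    """Lowercase a name and split it on whitespace, periods, and hyphens."""
    return [t for t in re.split(r"[\s.\-]+", name.lower()) if t]

def token_compatible(a, b):
    """Equal tokens match; a one-letter initial matches any token with that first letter."""
    if a == b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return False

def names_compatible(x, y):
    """True if x and y could be written forms of the same name.

    Assumption: the last token is the surname and must match exactly;
    the shorter given-name list must align with the start of the longer one.
    """
    tx, ty = name_tokens(x), name_tokens(y)
    if not tx or not ty or tx[-1] != ty[-1]:
        return False
    given_x, given_y = tx[:-1], ty[:-1]
    return all(token_compatible(a, b) for a, b in zip(given_x, given_y))

print(names_compatible("Danny C. C. Poo", "Danny Chiang Choon Poo"))  # True
print(names_compatible("Danny Poo", "Isaac K. C. Tan"))               # False
print(names_compatible("Hui Yang", "Hui Yang"))                       # True
```

A check like this can only rule pairs out. The two Hui Yang citations pass it trivially, yet whether they are the same person is exactly the open question.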
Who am I, I am who?
- Author name disambiguation
  - Given a large number of citations, how do we determine which name refers to which author?
- Closely related problem: citation matching
  - Given a large number of citations, how do we determine which citations refer to the same paper?
- Solutions must be scalable
  - DBLP has more than 660,000 citations
  - Citeseer has more than 730,000 documents
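To give a feel for the scale, here is a back-of-the-envelope count of how many pairs a naive all-against-all comparison would touch. It assumes one comparison per pair of citations and takes n = 660,000, the DBLP lower bound quoted above; a real system may compare names or coauthor lists rather than citations, so the figure is only indicative.

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2},
\qquad
n = 660{,}000
\;\Rightarrow\;
\frac{660{,}000 \times 659{,}999}{2}
= 217{,}799{,}670{,}000
\approx 2.2 \times 10^{11}.
\]
```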
Ideas
- Idea 1: determine the research field
  - Unfortunately, paper titles contain few words and some conferences are broad in scope
- Idea 2: use coauthor information
  - An author is likely to collaborate with a select group of people
  - This group is likely to publish a number of papers together
  - To exploit this, find the similarity of coauthor lists
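One simple way to score the similarity of two coauthor lists is Jaccard similarity on the sets of coauthor names. The slides do not say which measure was used, so this is an illustrative choice. The sketch assumes the coauthor names have already been normalized to a common form, and the lists in the example are hypothetical.

```python
def jaccard(list_a, list_b):
    """Jaccard similarity of two coauthor lists: shared names / all names."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 0.0  # two empty lists carry no evidence either way
    return len(a & b) / len(a | b)

# Hypothetical coauthor lists for two citations that share an author name.
paper_1 = ["A. Lee", "B. Ng", "C. Lim"]
paper_2 = ["B. Ng", "C. Lim", "D. Goh"]
print(jaccard(paper_1, paper_2))  # 0.5 (2 shared names out of 4 distinct)
```

Note that the two Danny Poo citations shown earlier have disjoint coauthor lists, so a single pair of papers can score zero even when the author is the same; the signal comes from aggregating over many papers.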
Forward direction: M. Kan = M.-Y. Kan = Min-Yen Kan
- Problem
  - Pairwise comparison of all the coauthor lists is very expensive (it does not finish even after a few days)
- Solution
  - Soft clustering on the coauthor lists using some cheap distance measure
  - Then perform pairwise comparison within the clusters
  - What is a good soft clustering algorithm?
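One well-known scheme with this shape is canopy clustering: a cheap measure groups records into overlapping canopies, and the expensive comparison runs only inside each canopy. The slides leave the choice of algorithm open, so the sketch below is one possible answer, not the talk's method. The cheap measure, the two thresholds, and the record data are all assumptions made for illustration.

```python
from itertools import combinations

def cheap_similarity(a, b):
    """Assumed cheap measure: shared items relative to the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def canopies(records, loose=0.3, tight=0.8):
    """Canopy-style soft clustering.

    records: dict mapping record id -> set of coauthor keys.
    A record joins every canopy whose centre it resembles at the loose
    threshold, so it may belong to several canopies (soft clustering).
    Records that resemble a centre at the tight threshold stop being
    candidate centres.
    """
    remaining = list(records)
    result = []
    while remaining:
        centre = remaining[0]
        canopy = [centre]
        removed = {centre}
        for rid, items in records.items():
            if rid == centre:
                continue
            sim = cheap_similarity(records[centre], items)
            if sim >= loose:
                canopy.append(rid)
            if sim >= tight:
                removed.add(rid)
        result.append(canopy)
        remaining = [rid for rid in remaining if rid not in removed]
    return result

def candidate_pairs(canopy_list):
    """Pairs that share at least one canopy; only these get the expensive comparison."""
    pairs = set()
    for canopy in canopy_list:
        pairs.update(combinations(sorted(canopy), 2))
    return pairs

# Hypothetical records: id -> coauthor surnames.
records = {
    "r1": {"toh", "khoo", "hong"},
    "r2": {"khoo", "hong"},
    "r3": {"chua"},
    "r4": {"chua", "kan"},
    "r5": {"hong", "chua"},
}
groups = canopies(records)
print(groups)                        # r5 appears in both canopies
print(len(candidate_pairs(groups)))  # 6 candidate pairs instead of all 10
```

The loop above still compares each centre against every record, so at DBLP scale the cheap step would itself need an inverted index from coauthor key to records; that part is omitted here.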
Backward direction: This Hang Cui is not that Hang Cui
- Difficult to determine using the metadata alone, without external resources
  - Many authors have several distinct research areas
  - Each research area has different collaborators
- Currently investigating what kind of external resource to use
  - Goooooooooogle for URLs?
The end
But the research has just begun…