Name Disambiguation in Digital Libraries Tan Yee Fan 2005 October 19 WING Group Meeting


Page 1: Name Disambiguation in Digital Libraries

Name Disambiguation in Digital Libraries

Tan Yee Fan

2005 October 19

WING Group Meeting

Page 2: Name Disambiguation in Digital Libraries

Digital libraries

  • DBLP, Citeseer, etc.

  • Information is stored as metadata records to facilitate searching
      • Author names
      • Titles
      • Publication titles

  • Inconsistency in metadata records hinders searching
      • Abbreviation of names and publication titles
      • Typographical errors

Page 3: Name Disambiguation in Digital Libraries

Are they the same author?

  • Danny Poo
      • Danny C. C. Poo, Teck-Kang Toh, Christopher S. G. Khoo, Glenn Hong. Development of an Intelligent Web Interface to Online Library Catalog Databases. APSEC 1999: 64-7
      • Danny Chiang Choon Poo, Isaac K. C. Tan. Design of an Automatic Annotation Framework for Corporate Web Content. APSEC 2004: 384-391

  • Hui Yang
      • Maan A. Kousa, Ahmed K. Elhakeem, Hui Yang. Performance of ATM networks under hybrid ARQ/FEC error control scheme. IEEE/ACM Trans. Netw. 7(6): 917-925 (1999)
      • Hui Yang, Tat-Seng Chua. QUALIFIER: Question Answering by Lexical Fabric and External Resources. EACL 2003: 363-370

Page 4: Name Disambiguation in Digital Libraries

Who am I, I am who?

  • Author name disambiguation
      • Given a large number of citations, how do we determine which name refers to which author?

  • Closely related problem: citation matching
      • Given a large number of citations, how do we determine which citations refer to the same paper?

  • Solutions must be scalable
      • DBLP has more than 660,000 citations
      • Citeseer has more than 730,000 documents
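To make the scalability point concrete, the sketch below counts how many comparisons a naive all-pairs approach would need. It assumes every citation is compared with every other one, which is an illustration rather than a method taken from the slides; `pairwise_comparisons` is a hypothetical helper name, and 660,000 is only the lower bound quoted for DBLP.

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


# Lower bound quoted for DBLP: more than 660,000 citations.
print(pairwise_comparisons(660_000))  # 217799670000
```

That is roughly 2.2 × 10^11 comparisons for DBLP alone, which is why any approach that compares every citation with every other one is unlikely to scale.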

Page 5: Name Disambiguation in Digital Libraries

Ideas

  • Idea 1: determine the research field
      • Unfortunately, paper titles contain few words and some conferences tend to be broad

  • Idea 2: use coauthor information
      • An author is likely to collaborate with a select group of people
      • This group will likely publish a number of papers together
      • Compare the similarity of coauthor lists
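One possible way to score the similarity of two coauthor lists is Jaccard similarity (shared coauthors divided by all distinct coauthors). The slides do not say which measure is used, so this is a minimal sketch under that assumption; `normalize` and `coauthor_similarity` are hypothetical helper names, and the coauthor names in the example are made up.

```python
def normalize(name: str) -> str:
    """Lowercase a name and drop periods so 'B. Tan' and 'b tan' compare equal."""
    return " ".join(name.replace(".", " ").lower().split())


def coauthor_similarity(coauthors_a, coauthors_b) -> float:
    """Jaccard similarity of two coauthor lists: |A & B| / |A | B|."""
    a = {normalize(n) for n in coauthors_a}
    b = {normalize(n) for n in coauthors_b}
    if not a and not b:
        return 0.0  # no evidence either way
    return len(a & b) / len(a | b)


# Made-up coauthor lists for two name variants of one author.
kan_short = ["A. Lee", "B. Tan", "C. Wong"]
kan_full = ["B. Tan", "C. Wong", "D. Lim"]
print(coauthor_similarity(kan_short, kan_full))  # 0.5
```

A high score suggests that two name variants belong to the same person. A low score is weak evidence on its own, because the same author can work with different groups.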

Page 6: Name Disambiguation in Digital Libraries

Forward direction: M. Kan = M.-Y. Kan = Min-Yen Kan

  • Problem
      • Pairwise comparison of all the coauthor lists is very expensive (it does not finish even after a few days)

  • Solution
      • Soft clustering on the coauthor lists using some cheap distance measure
      • Then perform pairwise comparison within the clusters

  • What is a good soft clustering algorithm?
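The two-step solution can be sketched as overlapping blocks. The slides leave the choice of soft clustering algorithm open, so the scheme below is a stand-in and not the author's method: each name variant is placed in one block per coauthor surname (an assumed cheap key), and only variants sharing a block become candidates for the expensive comparison. `block_keys` and `candidate_pairs` are hypothetical names, and the coauthor data is made up.

```python
from collections import defaultdict
from itertools import combinations


def block_keys(coauthors):
    """Cheap keys: the lowercased last token (surname) of each coauthor name."""
    return {name.split()[-1].lower() for name in coauthors if name.split()}


def candidate_pairs(coauthor_lists):
    """Return pairs of name variants that share at least one block.

    coauthor_lists maps a name variant to its list of coauthors.
    A variant can fall into several blocks, so the grouping is soft.
    """
    blocks = defaultdict(set)
    for variant, coauthors in coauthor_lists.items():
        for key in block_keys(coauthors):
            blocks[key].add(variant)

    pairs = set()
    for members in blocks.values():
        # The expensive comparison would run only on these pairs.
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Made-up coauthor lists keyed by name variant.
lists = {
    "M. Kan": ["B. Tan", "C. Wong"],
    "Min-Yen Kan": ["Bee Tan", "D. Lim"],
    "H. Cui": ["E. Ng"],
}
print(candidate_pairs(lists))  # {('M. Kan', 'Min-Yen Kan')}
```

With this kind of blocking, the cost depends on block sizes rather than on the total number of coauthor lists. Very common surnames would still produce large blocks, so a real system would likely need a more selective key.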

Page 7: Name Disambiguation in Digital Libraries

Backward direction: This Hang Cui is not that Hang Cui

  • Difficult to determine using the metadata alone, without external resources
      • Many authors have several distinct research areas
      • Each research area has different collaborators

  • Currently investigating what kind of external resource to use
      • Goooooooooogle for URLs?

Page 8: Name Disambiguation in Digital Libraries

The end

But the research has just begun…