Matching Bibliographic Data from Publication Lists with Large Databases using N-Grams


Mehmet Ali Abdulhayoglu

Bart Thijs

OUTLINE

• Introduction
• Methodology
▫ N-gram Notion
▫ Levenshtein (Edit) Distance Based on N-grams
▫ Kernel Discriminant Analysis
• Results
• Conclusion and Discussions

Introduction

• CVs and publication lists of authors, applicants or institutions are used in evaluative bibliometrics (job promotion, institutional or macro-level assessments).
• These publications generally need to be identified in large databases such as Web of Science or Scopus, a process that requires a lot of manual work.
• Automating the identification of publications in large databases saves time and frees up resources for manual cleaning.

Introduction

• The main issue is dealing with different reference standards, such as APA (American Psychological Association) and MLA (Modern Language Association).
• These standards may order the components differently: co-author names may appear at the very beginning of the reference or at its end. Some standards also abbreviate author or journal names.

Introduction

• Besides diverse standards, matching is complicated by incomplete, erroneous or censored data in the publication list, erroneous indexing in the database, and changes in the publication (title, number or sequence of co-authors, publication year…).
• To capture the textual similarity between the CV references and the publications indexed in bibliometric databases, we applied the notion of N-grams.

N-Grams

Example: The diffusion of H-related literature

• Word N-grams: adjacent sequences of n words from a given string
(the diffusion of) (diffusion of h-related) (of h-related literature)
• Character N-grams: adjacent sequences of n characters from a given string
(__t) (_th) (the) (he_) (e_d) (_di) (dif) and so on
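The two decompositions above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the padding convention (a prefix of n-1 underscores, with spaces written as underscores) is an assumption read off the character example on the slide.

```python
def word_ngrams(text, n=3):
    """Return adjacent sequences of n words, lower-cased."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def char_ngrams(text, n=3, pad="_"):
    """Return adjacent sequences of n characters.

    Assumption: the string is prefixed with n-1 pad characters and spaces
    are written as the pad character, as in the slide example.
    """
    s = text.lower().replace(" ", pad)
    padded = pad * (n - 1) + s
    # One N-gram per character of the (unpadded) string.
    return [padded[i:i + n] for i in range(len(s))]


print(word_ngrams("The diffusion of H-related literature"))
print(char_ngrams("The diffusion")[:7])
```

With prefix padding only, a string of k characters always yields exactly k character N-grams, which keeps the later comparison of two strings simple.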

N-grams

• Word N-grams are suitable for full-text studies but not convenient for this study.
• Character N-grams are powerful and handy, especially for short texts, and require no stemming.
• 3-grams are chosen considering the lengths of the components (e.g. author names, publication year, page).
• Similarity is measured with the method of Kondrak (2005): Levenshtein (edit) distance based on character N-grams.

Modified Levenshtein Distance

• Levenshtein distance: the minimum number of single-character edits needed to change one string into another (Levenshtein, 1966).
• Operations: Add, Remove, Change
• Kondrak (2005) extended this notion by using N-grams instead of single characters.
• For the N-gram based edit distance between strings x and y, a matrix $D$ is constructed where $D_{i,j}$ is the minimum number of edit operations needed to match $x_{1 \ldots i}$ to $y_{1 \ldots j}$.
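A minimal sketch of how such a matrix could be filled is given below. It is not the authors' code. It assumes prefix-padded character 3-grams, a cost of 1 for Add and Remove, and a Change cost equal to the fraction of positions at which two N-grams differ; the last assumption is chosen because it reproduces fractional cells such as 0,33 in the worked matrix of the slides. The function names are hypothetical.

```python
def char_ngrams(text, n=3, pad="_"):
    """Prefix-padded character N-grams (padding convention is an assumption)."""
    s = text.lower().replace(" ", pad)
    padded = pad * (n - 1) + s
    return [padded[i:i + n] for i in range(len(s))]


def ngram_edit_distance(x, y, n=3):
    """Edit distance computed over N-grams instead of single characters."""
    gx, gy = char_ngrams(x, n), char_ngrams(y, n)
    k, l = len(gx), len(gy)
    # D[i][j]: minimum cost of matching the first i N-grams of x
    # to the first j N-grams of y.
    D = [[0.0] * (l + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        D[i][0] = float(i)
    for j in range(1, l + 1):
        D[0][j] = float(j)
    for i in range(1, k + 1):
        for j in range(1, l + 1):
            # Change cost: share of positions where the two N-grams differ.
            change = sum(a != b for a, b in zip(gx[i - 1], gy[j - 1])) / n
            D[i][j] = min(
                D[i - 1][j] + 1.0,         # Remove
                D[i][j - 1] + 1.0,         # Add
                D[i - 1][j - 1] + change,  # Change (0 if identical)
            )
    return D[k][l]


print(round(ngram_edit_distance("diffusion", "diff."), 2))
```

Under these assumptions the distance between "diffusion" and "diff." is one partial change (ffu to ff.) plus four additions.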

Levenshtein Distance

[Figure: the three edit operations: Remove, Add, Change]

3-gram edit distance matrix for "diffusion" (columns) and "diff." (rows)

         ( )   d     i     f     f     u     s     i     o     n
               __d   _di   dif   iff   ffu   fus   usi   sio   ion
( )      0     1     2     3     4     5     6     7     8     9
d  __d   1     0     1     2     3     4     5     6     7     8
i  _di   2     1     0     1     2     3     4     5     6     7
f  dif   3     2     1     0     1     2     3     4     5     6
f  iff   4     3     2     1     0     1     2     3     4     5
.  ff.   5     4     3     2     1     0.33  1.33  2.33  3.33  4.33

Features of the Approach

• Ordering is crucial.
• For example, the similarity between:
the diffusion vs. the diff. : 0.66
the diffusion vs. diff. the : 0.31
• Also, one can find two strings (Xanex and Nexan) that have exactly the same N-gram decomposition, which would give maximum similarity if the order of the N-grams were ignored.
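The effect of ordering can be illustrated with a small sketch. It assumes that the distance is turned into a similarity by dividing by the length of the longer string and subtracting from 1; this normalisation is an assumption, although it is consistent with the first value quoted above. The Change cost (fraction of differing positions) and all function names are likewise assumptions, not the authors' implementation.

```python
def char_ngrams(text, n=3, pad="_"):
    """Prefix-padded character N-grams (padding convention is an assumption)."""
    s = text.lower().replace(" ", pad)
    padded = pad * (n - 1) + s
    return [padded[i:i + n] for i in range(len(s))]


def ngram_edit_distance(x, y, n=3):
    """N-gram edit distance with positional Change cost (assumed)."""
    gx, gy = char_ngrams(x, n), char_ngrams(y, n)
    prev = [float(j) for j in range(len(gy) + 1)]
    for i, a in enumerate(gx, start=1):
        cur = [float(i)]
        for j, b in enumerate(gy, start=1):
            change = sum(p != q for p, q in zip(a, b)) / n
            cur.append(min(prev[j] + 1.0, cur[j - 1] + 1.0, prev[j - 1] + change))
        prev = cur
    return prev[-1]


def ngram_similarity(x, y, n=3):
    """Similarity in [0, 1]: 1 minus distance over the longer length (assumed)."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - ngram_edit_distance(x, y, n) / longest


def bigram_set(s):
    """Unordered set of unpadded character bigrams."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


same_order = ngram_similarity("the diffusion", "the diff.")
swapped = ngram_similarity("the diffusion", "diff. the")
print(same_order > swapped)                       # order-aware measure penalises the swap
print(bigram_set("xanex") == bigram_set("nexan"))  # unordered bigram sets coincide
```

The last line shows why an order-insensitive comparison of N-gram sets would treat Xanex and Nexan as identical, whereas the edit-distance formulation does not.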

Application

• As can be expected, publication lists provide detailed bibliographic information about the publications, such as the title, the journal title, the names of the author and co-author(s), the publication year, the volume, and the first and last page.

Components and Variables

(numbers give the position of each component in the compared string; - = component not used)

Variable  Title  Journal  Co-authors  Volume  Begin Page  Publication Year
SCORE1    2      3        1           4       5           6
SCORE2    2      -        1           3       4           5
SCORE3    1      2        -           3       4           5
SCORE4    1      -        2           -       -           -
SCORE5    1      -        -           -       -           -
SCORE6    1      2        -           -       -           -
SCORE7    -      1        2           -       -           -
SCORE8    -      1        -           -       -           -
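One way to read this table is as a recipe for building the database-side string that each variable compares against the CV reference. The sketch below encodes the component order per variable as listed in the table. The field names, the `build_string` helper and the ", " separator are hypothetical choices made for illustration only.

```python
# Component order per variable, read off the table (first entry = position 1).
COMPONENT_ORDER = {
    "SCORE1": ["co_authors", "title", "journal", "volume", "begin_page", "year"],
    "SCORE2": ["co_authors", "title", "volume", "begin_page", "year"],
    "SCORE3": ["title", "journal", "volume", "begin_page", "year"],
    "SCORE4": ["title", "co_authors"],
    "SCORE5": ["title"],
    "SCORE6": ["title", "journal"],
    "SCORE7": ["journal", "co_authors"],
    "SCORE8": ["journal"],
}


def build_string(record, variable, sep=", "):
    """Concatenate the components of a database record in the order
    the given variable prescribes (separator is an assumption)."""
    return sep.join(str(record[name]) for name in COMPONENT_ORDER[variable])


# Example record taken from the WoS entry used in the slides.
record = {
    "co_authors": "ZHANG, L, THIJS, B, GLANZEL, W",
    "title": "The diffusion of H-related literature",
    "journal": "JOURNAL OF INFORMETRICS",
    "volume": 5,
    "begin_page": 583,
    "year": 2011,
}

for variable in ("SCORE1", "SCORE2", "SCORE5"):
    print(variable, "->", build_string(record, variable))
```

Each resulting string would then be compared with the raw CV reference using the N-gram similarity, giving one score per variable.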

Application

CV: Zhang, L., Thijs, B., Glänzel, W., The diffusion of H-related literature. JOI, 2011, 5 (4), 583-593

Variable  3-Gram Score  WoS string
SCORE1    0.67          ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
SCORE2    0.73          ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature, 5, 583, 2011
SCORE3    0.34          The diffusion of H-related literature, JOURNAL OF INFORMETRICS, 5, 583, 2011
SCORE4    0.65          ZHANG, L, THIJS, B, GLANZEL, W, The diffusion of H-related literature
SCORE5    0.36          The diffusion of H-related literature
SCORE6    0.41          The diffusion of H-related literature, JOURNAL OF INFORMETRICS
SCORE9    0.73          (maximum of the scores)

Using these scores, we would like to decide whether the publication is indexed in the given database. Discriminant analysis is a convenient tool for this purpose.

Kernel Discriminant Analysis

• Since the assumptions of parametric discriminant analysis cannot be taken to hold, this non-parametric method is applied.
• Using a (normal) kernel function, it handles a non-linear mapping through a linear mapping in a feature space.
• As a result, it is based on estimating a non-parametric density function for the observations.
• A smoothing parameter ‘r’ determines the degree of irregularity in the estimated density function. As suggested in Khattree and Naik (2000), we tried several values of ‘r’ and kept the one giving the best solution.
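The density-based rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an isotropic normal kernel with bandwidth r, equal prior probabilities for the two groups, and made-up toy score vectors. Because of these assumptions, the scale of r in the sketch is not comparable to the values of r reported in the results.

```python
import math


def kernel_density(x, sample, r):
    """Kernel density estimate at x: the average of isotropic normal
    kernels with bandwidth r centred on each training observation."""
    d = len(x)
    norm = (2.0 * math.pi) ** (d / 2.0) * r ** d
    total = 0.0
    for point in sample:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, point))
        total += math.exp(-0.5 * sq_dist / r ** 2)
    return total / (len(sample) * norm)


def classify(x, groups, r, priors=None):
    """Assign x to the group with the largest prior-weighted density."""
    if priors is None:
        # Assumption: equal priors for all groups.
        priors = {g: 1.0 / len(groups) for g in groups}
    return max(groups, key=lambda g: priors[g] * kernel_density(x, groups[g], r))


# Toy training data (made up): two similarity scores per pair.
groups = {
    0: [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25]],  # non-matching pairs
    1: [[0.80, 0.70], [0.90, 0.75], [0.70, 0.85]],  # matching pairs
}

print(classify([0.85, 0.80], groups, r=0.2))
print(classify([0.12, 0.18], groups, r=0.2))
```

A small r makes the estimated density follow individual training points closely, while a large r smooths it out, which is why several values are worth trying.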

Kernel Discriminant Analysis

Two sets of variables are analysed:
• Set 1: SCORE1, SCORE2, SCORE3, SCORE4, SCORE5 and SCORE6
• Set 2: SCORE9, SCORE5 and SCORE8
• The first set is chosen to examine the variables that all include the “Title” component and its variations; the second is chosen as a relatively more independent set with “Maximum”, “Title” and “Journal Name”.

Data

• Training Set
6525 real pairs from applicants’ CVs (pairs matched correctly by hand) (Group 1)
3 × 6525 random non-matching pairs (Group 0) for the same publications as in Group 1
• Test Set
2570 new pairs to be classified, entirely different from those in the training set
The publications are queried against a sample of WoS data of size 738715

Results

• false positive: wrongly assigned to Group 1
• false negative: wrongly assigned to Group 0

Set 1
r                     2      3      4      5
accuracy (%)          92.96  90.3   78.25  60.62
false negatives (%)   4.48   13.9   33.31  60.53
false positives (%)   6.20   1.20   0.20   0

Set 2
r                     2      3      4      5
accuracy (%)          91.13  94.9   90.97  82.33
false negatives (%)   2.63   5.68   13.33  27.03
false positives (%)   10.15  2.23   0.62   0.16
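Choosing the smoothing parameter from these tables amounts to picking the r with the highest accuracy for each variable set. The short sketch below does this with the accuracy values reported above. The `best_r` helper is hypothetical, and selecting on accuracy alone is an assumption, since an application might weigh false positives differently.

```python
# Accuracy (%) per smoothing parameter r, as reported in the tables.
accuracy = {
    "set1": {2: 92.96, 3: 90.3, 4: 78.25, 5: 60.62},
    "set2": {2: 91.13, 3: 94.9, 4: 90.97, 5: 82.33},
}


def best_r(table):
    """Return the r with the highest accuracy."""
    return max(table, key=table.get)


for name, table in accuracy.items():
    r = best_r(table)
    print(name, "best r =", r, "accuracy =", table[r])
```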

Results

            Estimated 0  Estimated 1  Total
Observed 0  862          36           898
Observed 1  95           1577         1672
Total       957          1613         2570

• false positives: 36
• false negatives: 95
• 94.3% of publications in Group 1 are classified correctly
• 97.8% of publications estimated to be in Group 1 are classified correctly
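The percentages quoted on this slide can be recomputed from the cells of the classification table. The sketch below assumes that the Observed 1 / Estimated 1 cell is 1577, the value consistent with the row total 1672 and the column total 1613. The function name and the use of the terms recall and precision are illustrative choices.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return recall, precision and accuracy (in %) from the four cells."""
    total = tn + fp + fn + tp
    return {
        "recall": 100.0 * tp / (tp + fn),       # share of Group 1 found
        "precision": 100.0 * tp / (tp + fp),    # share of estimated Group 1 that is correct
        "accuracy": 100.0 * (tp + tn) / total,  # share of all pairs classified correctly
    }


# Cells of the table; tp = 1577 is an assumption derived from the totals.
metrics = classification_metrics(tn=862, fp=36, fn=95, tp=1577)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

The recomputed accuracy agrees with the 94.9 reported for the second variable set at r = 3.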

Results

• The vast majority of the publications in Group 1 are classified correctly; the remaining errors fall in two score ranges:
• Similarity scores between 0.30 and 0.45 for Group 1: false negatives
• Similarity scores higher than 0.45 for Group 0: false positives

Conclusion

• With the proposed model, 95% correct classification is achieved.
• Matches with a similarity score of 0.6267 or higher indicate a precise, correct match.
• For bibliometric evaluation processes, the model will be useful in decreasing the manual work.

However…

Points not to be overlooked!

• Only for papers in English
• “False positives” remain an issue to be solved
• Tolerance of “false positives” depends on the application (micro vs. macro level studies)

Thank you!
