Efficient blocking method for a large scale citation matching
Mateusz Fedoryszak & Łukasz Bolikowski
{matfed,bolo}@icm.edu.pl
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
Citation matching
• Note: it's an instance of the data linkage problem
References
[1] I. Newton, Philosophiae naturalis...
[2] N. Copernicus, De revolutionibus...
ID Title Author
14 De revolutionibus... Copernicus
11 Στοιχεῖα Εὐκλείδης
Why important?
• Clickable interfaces
• Bibliometrics (think: H-index)
• Further analysis (e.g. similarities)
Why difficult?
• Citation extraction errors (in both digital-born and retro-born docs)
• Countless citation styles used inconsistently
• Typos and other human errors
The Problem
[Figure: a reference list to be linked to a document table with columns ID, Title, Author]
Naïve approach
For 1.3M documents and 12M citations it's 15.6 × 10^12 comparisons
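The figure follows from comparing every citation with every document. A quick check of the arithmetic, using only the two counts quoted on the slide:

```python
# Counts quoted on the slide.
documents = 1_300_000
citations = 12_000_000

# The naive approach compares each citation with each document.
comparisons = documents * citations

print(f"{comparisons:.3e}")  # prints 1.560e+13, i.e. 15.6 × 10^12
```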
Select the best candidates
• I'll present a method of candidate selection and how to implement it using Apache Hadoop
Blocking
Fingerprints
[Figure: references and document records (ID, Title, Author) labelled with fingerprints such as AAAA, BBBB, CCCC, EEEE, FFFF]
Workflow
[Figure: MapReduce workflow. Map: citations and documents are turned into (hash, citation ID) and (hash, document ID) records. Reduce: records sharing a hash are joined into (citation ID, document ID) pairs.]
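The workflow can be sketched in plain Python. This is a minimal in-memory illustration of the map and reduce steps, not the Hadoop implementation; the function names and the toy `token_hashes` hash function are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(citations, documents, hash_fn):
    """Emit (hash, (kind, id)) records for every citation and document."""
    for cid, text in citations.items():
        for h in hash_fn(text):
            yield h, ("citation", cid)
    for did, text in documents.items():
        for h in hash_fn(text):
            yield h, ("document", did)


def reduce_phase(records):
    """Group records by hash and pair citations with documents in each bucket."""
    buckets = defaultdict(lambda: {"citation": set(), "document": set()})
    for h, (kind, ident) in records:
        buckets[h][kind].add(ident)
    pairs = set()
    for bucket in buckets.values():
        for cid in bucket["citation"]:
            for did in bucket["document"]:
                pairs.add((cid, did))
    return pairs


def token_hashes(text):
    """Hypothetical hash function: every lowercased token is a hash."""
    return set(text.lower().split())


citations = {"c1": "pawlak 1982 rough sets", "c2": "newton principia"}
documents = {"d1": "zdzislaw pawlak rough sets", "d2": "copernicus revolutionibus"}
pairs = reduce_phase(map_phase(citations, documents, token_hashes))
```

Only `c1` and `d1` share any hash here, so they form the single candidate pair.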
Workflow with tuning
• Before:
  • Compute bucket sizes
  • Reject the ones that are too big
  • Use DistributedCache to disseminate the result
• After:
  • For each citation choose only the most popular candidates
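A minimal in-memory sketch of the two tuning steps. Two points are assumptions made for the example: bucket size is taken to be the number of documents sharing a hash, and "most popular" is taken to mean the documents that share the most hashes with the citation. The function name `select_candidates` is hypothetical; in the Hadoop workflow the bucket sizes would be shared through DistributedCache rather than a local dictionary.

```python
from collections import Counter, defaultdict


def select_candidates(citation_hashes, document_hashes, max_bucket_size, max_candidates):
    # Before: compute bucket sizes and reject buckets that are too big.
    buckets = defaultdict(set)
    for did, hashes in document_hashes.items():
        for h in hashes:
            buckets[h].add(did)
    allowed = {h: docs for h, docs in buckets.items() if len(docs) <= max_bucket_size}

    # After: for each citation keep only the most popular candidates,
    # i.e. the documents met in the largest number of shared buckets.
    result = {}
    for cid, hashes in citation_hashes.items():
        votes = Counter()
        for h in hashes:
            votes.update(allowed.get(h, ()))
        result[cid] = [did for did, _ in votes.most_common(max_candidates)]
    return result


document_hashes = {
    "d1": {"pawlak", "rough", "sets"},
    "d2": {"rough", "paper"},
    "d3": {"rough", "graph"},
}
citation_hashes = {"c1": {"pawlak", "rough", "sets"}}

# The "rough" bucket holds 3 documents, so a limit of 2 rejects it.
strict = select_candidates(citation_hashes, document_hashes, 2, 30)
# With a generous bucket limit, only the top-voted document is kept.
top_one = select_candidates(citation_hashes, document_hashes, 10, 1)
```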
Hash functions
Normalisation
• Lowercase
• Remove
  • diacritics
  • punctuation marks
• Filter out tokens shorter than 3 characters (except numbers)
Normalisation
Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
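The normalisation steps can be sketched as follows. This is one possible implementation, not necessarily the one used in the project; the explicit mapping table is an assumption needed because letters such as "ł" have no Unicode decomposition and are not affected by stripping combining marks.

```python
import re
import unicodedata

# Assumption: a minimal mapping for letters that NFKD does not decompose.
_EXTRA = str.maketrans({"ł": "l", "ø": "o", "đ": "d", "ß": "ss"})


def normalise(text):
    """Lowercase, strip diacritics and punctuation, drop short non-numeric tokens."""
    text = text.lower().translate(_EXTRA)
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every run of non-alphanumeric characters with a space.
    tokens = re.sub(r"[\W_]+", " ", stripped).split()
    # Keep numbers of any length, other tokens only if 3+ characters long.
    return [t for t in tokens if t.isdigit() or len(t) >= 3]


citation = 'Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.'
normalised = " ".join(normalise(citation))
```

For the citation above this reproduces the normalised string shown on the slide.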
Examples
Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
{ author: "Zdzisław Pawlak", year: "1982", title: "Rough sets", journal: "International Journal of Computer & Information Sciences", volume: "11", issue: "5", pages: "341–356"}
Baseline
Citation: pawlak, zdzislaw, 1982, rough, ..., internat, ...
Document: zdzislaw, pawlak, 1982, rough, ..., international, journal, ...
Bigrams
• For a document we use only the author and title fields
Citation: pawlak zdzislaw, zdzislaw 1982, 1982 rough, rough sets, ...
Document: zdzislaw pawlak, rough sets
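A short sketch of bigram hashes for the running example. It assumes that document bigrams are built separately for the author and title fields, which is how the slide's example appears to be constructed; the helper name `bigrams` is hypothetical.

```python
def bigrams(tokens):
    """Return hashes made of every pair of adjacent tokens."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


citation_tokens = (
    "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
)
citation_hashes = set(bigrams(citation_tokens))

# Assumption: author and title fields are hashed independently.
document_hashes = set(bigrams(["zdzislaw", "pawlak"])) | set(bigrams(["rough", "sets"]))

shared = citation_hashes & document_hashes
```

Because the citation lists the surname first and the record lists the given name first, only the title bigram is shared in this example.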
name-year
• For citation:
  • name: any of first 4 distinct text tokens
  • year: any number between 1900 and 2050
Citation: pawlak#1982, zdzislaw#1982, rough#1982, sets#1982
Document: zdzislaw#1982, pawlak#1982
+approximate variant: zdzislaw#1981, pawlak#1981, zdzislaw#1983, pawlak#1983
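The name-year rule can be sketched as below. The function name is hypothetical, and applying the same function to a document's author tokens and year is an assumption; the slide defines the rule for citations only. The approximate variant is modelled here as also emitting the neighbouring years.

```python
def name_year_hashes(tokens, approximate=False):
    """Combine the first 4 distinct text tokens with every plausible year."""
    names = []
    for t in tokens:
        if not t.isdigit() and t not in names:
            names.append(t)
        if len(names) == 4:
            break
    years = [int(t) for t in tokens if t.isdigit() and 1900 <= int(t) <= 2050]
    if approximate:
        # Assumption: the approximate variant tolerates an off-by-one year.
        years = [y + d for y in years for d in (-1, 0, 1)]
    return {f"{n}#{y}" for n in names for y in years}


citation_tokens = (
    "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
)
citation_hashes = name_year_hashes(citation_tokens)
document_hashes = name_year_hashes(["zdzislaw", "pawlak", "1982"], approximate=True)
```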
name-year-pages
• For citation:
  • pages: any sorted pair of numbers, not year
Citation: pawlak#1982#5#11, pawlak#1982#5#341, pawlak#1982#..., pawlak#1982#341#356, zdzislaw#..., zdzislaw#1982#341#356, rough#..., sets#...
Document: zdzislaw#1982#341#356, pawlak#1982#341#356
+approximate & optimistic variant
Intermezzo: citation parsing
Pawlak , Zdzisław ( 1982 ) .
author other author other year other other
...
...
Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.
name-year-num^n
• n = 1..3
• For citation:
  • num^n: any sorted tuple of numbers, not year
pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#..., rough#..., sets#...
+approximate variant
pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#...
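A sketch of how sorted number tuples could be appended to name-year hashes for citations. The function name is hypothetical and the details (numeric sorting, de-duplicating numbers) are assumptions. With n = 2 the same construction yields the sorted pairs used for page ranges.

```python
from itertools import combinations


def name_year_num_hashes(tokens, n):
    """Extend name-year hashes with every sorted n-tuple of the other numbers."""
    names = []
    for t in tokens:
        if not t.isdigit() and t not in names:
            names.append(t)
        if len(names) == 4:
            break
    numbers = [int(t) for t in tokens if t.isdigit()]
    years = [x for x in numbers if 1900 <= x <= 2050]
    hashes = set()
    for year in years:
        # All numbers except the one currently treated as the year.
        others = sorted({x for x in numbers if x != year})
        for combo in combinations(others, n):
            suffix = "#".join(str(x) for x in combo)
            for name in names:
                hashes.add(f"{name}#{year}#{suffix}")
    return hashes


citation_tokens = (
    "pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356".split()
)
triples = name_year_num_hashes(citation_tokens, 3)
pairs = name_year_num_hashes(citation_tokens, 2)
```

The running example has four non-year numbers, so each name gets 4 triples and 6 pairs.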
Evaluation
Test dataset
<ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal"><name><surname>Jemal</surname><given-names>A</given-names></name>, <name><surname>Bray</surname><given-names>F</given-names></name>, <name><surname>Center</surname><given-names>MM</given-names></name>, <name><surname>Ferlay</surname><given-names>J</given-names></name>, <name><surname>Ward</surname><given-names>E</given-names></name>, <etal>et al</etal> (<year>2011</year>) <article-title>Global cancer statistics</article-title>. <source>CA Cancer J Clin</source><volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage><pub-id pub-id-type="pmid">21296855</pub-id></mixed-citation></ref>
2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global cancer statistics. CA Cancer J Clin 61: 69–90
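A rough sketch of how a raw citation string and its PMID could be pulled out of such a `<ref>` element with the Python standard library. The function name is hypothetical, the XML is abbreviated to two authors for brevity, and the whitespace handling is simplistic: it leaves spaces before punctuation, which does not matter once the string is normalised.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the example reference (two authors only).
REF_XML = (
    '<ref id="pone.0052832-Jemal1"><label>2</label>'
    '<mixed-citation publication-type="journal">'
    "<name><surname>Jemal</surname><given-names>A</given-names></name>, "
    "<name><surname>Bray</surname><given-names>F</given-names></name>, "
    "<etal>et al</etal> (<year>2011</year>) "
    "<article-title>Global cancer statistics</article-title>. "
    "<source>CA Cancer J Clin</source><volume>61</volume>: "
    "<fpage>69</fpage>–<lpage>90</lpage>"
    '<pub-id pub-id-type="pmid">21296855</pub-id>'
    "</mixed-citation></ref>"
)


def citation_and_pmid(ref_xml):
    """Return (citation text without the PMID, PMID or None)."""
    ref = ET.fromstring(ref_xml)
    mixed = ref.find("mixed-citation")
    pmid_el = mixed.find("pub-id[@pub-id-type='pmid']")
    pmid = None
    if pmid_el is not None:
        pmid = pmid_el.text
        # Drop the identifier so it cannot leak into the matching input.
        mixed.remove(pmid_el)
    pieces = [p.strip() for p in mixed.itertext() if p.strip()]
    return " ".join(pieces), pmid


text, pmid = citation_and_pmid(REF_XML)
```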
Test dataset
• Based on Open Access Subset of PMC
• Only citations preserving original formatting
• Only citations with PMID assigned
• 528k documents
• 3.6M citations, out of which 321k are resolvable
Metrics
• Recall — the percentage of true citation → document links that are maintained by the heuristic
• Precision — the percentage of citation → document links returned by algorithm that are correct
• Intermediate data — total number of hashes and pairs generated (before selecting the most popular ones)
• Candidate pairs — number of pairs returned by heuristic for further assessment
• F-measure not included intentionally
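The first two metrics can be computed from sets of (citation, document) pairs. A minimal sketch with made-up toy data; the function name and the identifiers are illustrative only.

```python
def precision_recall(candidate_pairs, true_links):
    """Precision and recall of candidate pairs against the true links."""
    correct = candidate_pairs & true_links
    precision = len(correct) / len(candidate_pairs) if candidate_pairs else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall


# Toy data: three true links, four candidates, two of them correct.
true_links = {("c1", "d1"), ("c2", "d2"), ("c3", "d3")}
candidates = {("c1", "d1"), ("c1", "d7"), ("c2", "d2"), ("c3", "d9")}

precision, recall = precision_recall(candidates, true_links)
```

A blocking heuristic is expected to score low on precision, since its job is to keep the true link among the candidates and leave the final decision to a later, more expensive assessment.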
Limits
• Candidate documents per citation
  • 30
  • no limit
• Bucket size
  • 10
  • 100
  • 1000
  • 10000
  • no limit
Recall
hash precision recall intermediate data to assess
bigrams (10000, 30) 0.4% 98.2% 285,908,900 79,329,459
baseline (10000, 30) 0.3% 97.9% 221,212,080 114,223,777
bigrams (100, 30) 2.9% 92.7% 94,693,721 10,446,883
name-year (approx.) 0.0% 92.4% 928,068,651 862,357,212
name-year (strict) 0.1% 90.2% 322,015,088 290,940,929
baseline (10000, 10) 0.9% 88.7% 221,212,080 49,747,843
name-year-num (approx., 1000, 30) 1.2% 88.5% 170,633,938 23,591,933
name-year-num (strict., 1000, 30) 3.6% 88.3% 85,756,601 7,864,129
name-year (strict, 1000, 30) 2.5% 77.9% 28,463,067 9,940,403
name-year (approx., 1000, 30) 1.4% 75.6% 40,726,102 17,098,080
baseline (1000, 30) 0.9% 73.2% 115,822,141 26,083,677
Precision
hash precision recall intermediate data to assess
name-year-pages (strict, optimistic) 98.4% 7.3% 4,787,215 23,734
name-year-num^3 (strict) 84.0% 43.4% 257,639,965 166,128
name-year-pages (approx., optimistic) 78.2% 7.8% 42,478,742 32,182
name-year-pages (strict, pessimistic) 53.7% 42.5% 132,809,210 254,208
name-year-num^3 (approx.) 17.6% 47.1% 617,193,035 860,314
name-year-num^2 (strict) 14.8% 66.6% 141,885,270 1,444,074
bigrams (10, 10) 11.8% 65.6% 84,042,160 1,784,228
Recall/intermediate data
hash precision recall intermediate data to assess
name-year (strict, 1000, 30) 2.5% 77.9% 28,463,067 9,940,403
name-year (approx., 1000, 30) 1.4% 75.6% 40,726,102 17,098,080
name-year-pages (strict, optimistic, 1000, 30) 98.4% 7.3% 4,787,215 23,734
name-year-num (strict., 1000, 30) 3.6% 88.3% 85,756,601 7,864,129
bigrams (100, 30) 2.9% 92.7% 94,693,721 10,446,883
bigrams (10, 30) 11.8% 65.6% 84,042,160 1,793,997
baseline (1000, 30) 0.9% 73.2% 115,822,141 26,083,677
name-year-num (approx., 1000, 30) 1.2% 88.5% 170,633,938 23,591,933
baseline (100, 30) 3.2% 44.0% 91,175,101 4,458,560
name-year-num^2 (strict., 1000, 30) 18.4% 66.6% 141,553,137 1,165,181
baseline (10000, 30) 0.3% 97.9% 221,212,080 114,223,777
Recall vs. intermediate data
Recall/to assess
hash precision recall intermediate data to assess
name-year-pages (strict, optimistic, 1000, 30) 98.4% 7.3% 4,787,215 23,734
name-year-num^3 (strict., 1000, 30) 84.0% 43.4% 257,637,645 165,995
name-year-pages (approx., optimistic, 1000, 30) 78.5% 7.8% 42,478,742 32,042
name-year-pages (strict, pessimistic, 1000, 30) 56.3% 42.5% 132,792,590 242,261
name-year-num^3 (approx., 1000, 30) 19.1% 47.1% 617,046,925 794,284
name-year-num^2 (strict., 1000, 30) 18.4% 66.6% 141,553,137 1,165,181
bigrams (10, 30) 11.8% 65.6% 84,042,160 1,793,997
name-year-pages (approx., pessimistic, 1000, 30) 9.9% 45.8% 172,447,469 1,483,980
name-year-num (strict., 1000, 30) 3.6% 88.3% 85,756,601 7,864,129
name-year-num^2 (approx., 1000, 30) 3.2% 69.8% 359,051,798 7,023,337
baseline (100, 30) 3.2% 44.0% 91,175,101 4,458,560
bigrams (100, 30) 2.9% 92.7% 94,693,721 10,446,883
Recall vs. to assess
Combination
Lost citations
Hash Lost fraction
name-year (approx., 1000, 30) 12.4%
name-year-num^2 (approx., 1000, 30) 12.3%
name-year (strict, 1000, 30) 9.8%
name-year-pages (approx., pessimistic, 1000, 30) 9.0%
baseline (10000, 10) 6.7%
name-year-num (approx., 1000, 30) 6.0%
name-year (strict) 5.8%
name-year-num^2 (strict., 1000, 30) 5.6%
name-year (approx.) 5.1%
name-year-num (strict., 1000, 30) 4.4%
name-year-num^3 (approx., 1000, 30) 4.2%
baseline (1000, 30) 3.7%
Results
Hash sequence Recall Intermediate data To assess
bigrams (10000, 30) 98.17% 285,908,900 79,329,459
name-year-pages (strict, optimistic) → name-year (strict, 1000, 30) → name-year (strict, 10000, 30) → bigrams (10000, 30) 87.64% 187,394,452 41,152,278
name-year-pages (strict, optimistic) → name-year-pages (strict, pessimistic) → bigrams (100, 30) → bigrams (10000, 30) 96.86% 333,701,109 29,818,635
name-year-pages (strict, optimistic) → bigrams (100, 30) → bigrams (10000, 30) 97.76% 202,590,413 30,582,488
name-year-pages (strict, optimistic) → name-year-num^3 (strict) → bigrams (10, 10) → bigrams (100, 30) → bigrams (10000, 30) 97.73% 398,895,930 25,123,164
Future work
• Other combinations
  • After fine-grained assessment
  • Various hash functions at the same time
• Further efficiency tuning
  • Limit number of generated hashes
CoAnSys Project
• An open source framework for mining very large collections of scientific publications
• Contains implementation of the presented workflow
• http://coansys.ceon.pl/