Efficient blocking method for a large scale citation matching Mateusz Fedoryszak & Łukasz Bolikowski {matfed,bolo}@icm.edu.pl Interdisciplinary Centre for Mathematical and Computational Modelling University of Warsaw


DESCRIPTION

An efficient blocking method for a large scale citation matching presented during WOSP 2014


Page 1: Efficient blocking method for a large scale citation matching

Efficient blocking method for a large scale citation matching

Mateusz Fedoryszak & Łukasz Bolikowski
{matfed,bolo}@icm.edu.pl

Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw

Page 2: Efficient blocking method for a large scale citation matching

Citation matching

• Note: this is an instance of the data linkage problem

References
[1] I. Newton, Philosophiae naturalis...
[2] N. Copernicus, De revolutionibus...

ID   Title                  Author
14   De revolutionibus...   Copernicus
11   Στοιχεῖα               Εὐκλείδης

Page 3: Efficient blocking method for a large scale citation matching

Why important?

• Clickable interfaces
• Bibliometrics (think: H-index)
• Further analysis (e.g. similarities)

Why difficult?

• Citation extraction errors (in both digital-born and retro-born docs)

• Countless citation styles used inconsistently

• Typos and other human errors

Page 4: Efficient blocking method for a large scale citation matching

The Problem

[Figure: a References list next to a document table with columns ID, Title and Author]

Page 5: Efficient blocking method for a large scale citation matching

Naïve approach

For 1.3M documents and 12M citations, that is 15.6 × 10^12 comparisons

[Figure: a References list next to a document table with columns ID, Title and Author]
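The figure on this slide follows from comparing every citation with every document. A minimal check of that arithmetic, using only the two counts quoted above (the variable names are illustrative):

```python
# Counts quoted on the slide.
num_documents = 1_300_000
num_citations = 12_000_000

# The naive approach compares every citation with every document.
naive_comparisons = num_documents * num_citations

print(f"{naive_comparisons:.2e}")  # 1.56e+13, i.e. 15.6 × 10^12
```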

Page 6: Efficient blocking method for a large scale citation matching

Select the best candidates

• I'll present a method of candidate selection and how to implement it using Apache Hadoop

[Figure: a References list next to a document table with columns ID, Title and Author]

Page 7: Efficient blocking method for a large scale citation matching

Blocking

[Figure: a References list next to a document table with columns ID, Title and Author]

Page 8: Efficient blocking method for a large scale citation matching

Fingerprints

[Figure: references and document records (columns ID, Title, Author) annotated with fingerprints such as AAAA, BBBB, CCCC, EEEE and FFFF]

Page 9: Efficient blocking method for a large scale citation matching

Workflow

[Figure: MapReduce workflow. Map: citations and documents are turned into (hash, citation ID) and (hash, document ID) records. Reduce: records grouped by hash yield (citation ID, document ID) pairs.]
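The workflow can be illustrated with a small in-memory sketch. It is not the Hadoop implementation: the function name, the dictionary-based input format and the sample IDs and hashes are assumptions made for this example only.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(citation_hashes, document_hashes):
    """Join citations and documents that share at least one hash.

    Both arguments map an ID to the set of hashes generated for it.
    """
    # "Map": group citation IDs and document IDs by hash value (bucket).
    buckets = defaultdict(lambda: (set(), set()))
    for cit_id, hashes in citation_hashes.items():
        for h in hashes:
            buckets[h][0].add(cit_id)
    for doc_id, hashes in document_hashes.items():
        for h in hashes:
            buckets[h][1].add(doc_id)

    # "Reduce": every citation/document combination inside a bucket
    # becomes a candidate pair.
    pairs = set()
    for cit_ids, doc_ids in buckets.values():
        pairs.update(product(cit_ids, doc_ids))
    return pairs


# Made-up sample data.
citations = {"c1": {"pawlak#1982", "rough#1982"}, "c2": {"jemal#2011"}}
documents = {"d1": {"pawlak#1982", "zdzislaw#1982"}, "d2": {"ferlay#2010"}}
print(candidate_pairs(citations, documents))  # {('c1', 'd1')}
```

Only pairs that share a bucket are ever produced, so the all-against-all comparison is avoided.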

Page 10: Efficient blocking method for a large scale citation matching

Workflow with tuning

• Before:
  • Compute bucket sizes
  • Reject too big ones
  • Use DistributedCache to disseminate
• After:
  • For each citation choose only the most popular candidates

[Figure: the MapReduce workflow from the previous slide, with the tuning steps applied before and after it]
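A minimal in-memory sketch of the two tuning steps follows. Two points are assumptions, since the slide does not spell them out: bucket size is taken to be the number of documents sharing a hash, and a candidate's popularity is taken to be the number of hashes it shares with the citation. In a Hadoop job the list of rejected hashes would be shipped to the workers (the slide mentions DistributedCache); here it is just a local dictionary.

```python
from collections import Counter, defaultdict


def tuned_candidates(citation_hashes, document_hashes,
                     max_bucket_size=1000, max_candidates=30):
    # Before: compute bucket sizes and reject hashes whose bucket is too big.
    buckets = defaultdict(set)
    for doc_id, hashes in document_hashes.items():
        for h in hashes:
            buckets[h].add(doc_id)
    allowed = {h: docs for h, docs in buckets.items()
               if len(docs) <= max_bucket_size}

    # After: for each citation keep only the most popular candidates,
    # i.e. the documents that share the most hashes with it.
    result = {}
    for cit_id, hashes in citation_hashes.items():
        votes = Counter()
        for h in hashes:
            votes.update(allowed.get(h, ()))
        result[cit_id] = [doc for doc, _ in votes.most_common(max_candidates)]
    return result


# Made-up sample data: hash "c" is shared by three documents.
citations = {"c1": {"a", "b", "c"}}
documents = {"d1": {"a", "b"}, "d2": {"a"}, "d3": {"c"}, "d4": {"c"}, "d5": {"c"}}
print(tuned_candidates(citations, documents, max_bucket_size=2, max_candidates=1))
# {'c1': ['d1']}
```

The two parameters correspond to the pairs of limits quoted with the hash names in the evaluation, for example (1000, 30).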

Page 11: Efficient blocking method for a large scale citation matching

Hash functions

Page 12: Efficient blocking method for a large scale citation matching

Normalisation

• Lowercase
• Remove
  • diacritics
  • punctuation marks
• Filter out tokens shorter than 3 characters (except numbers)

Page 13: Efficient blocking method for a large scale citation matching

Normalisation

Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.

pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
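The normalisation steps can be sketched as below. This is an illustration under stated assumptions, not the CoAnSys code: the function name is made up, and the small `EXTRA_FOLDS` table is an incomplete, hypothetical mapping for letters such as "ł" that have no Unicode decomposition.

```python
import re
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping
# (illustrative and incomplete; an assumption of this sketch).
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O"})


def normalise(text):
    text = text.translate(EXTRA_FOLDS)
    # Decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and treat punctuation as a token separator.
    tokens = re.sub(r"[\W_]+", " ", stripped.lower()).split()
    # Drop tokens shorter than 3 characters unless they are numbers.
    return [t for t in tokens if len(t) >= 3 or t.isdecimal()]


citation = 'Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.'
print(" ".join(normalise(citation)))
# pawlak zdzislaw 1982 rough sets internat comput inform sci 11 5 341 356
```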

Page 14: Efficient blocking method for a large scale citation matching

Examples

Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.

{ author: "Zdzisław Pawlak", year: "1982", title: "Rough sets", journal: "International Journal of Computer & Information Sciences", volume: "11", issue: "5", pages: "341–356"}

Page 15: Efficient blocking method for a large scale citation matching

Baseline

Citation:
pawlak, zdzislaw, 1982, rough, ..., internat, ...

Document:
zdzislaw, pawlak, 1982, rough, ..., international, journal, ...

Page 16: Efficient blocking method for a large scale citation matching

Bigrams

• For a document we use only the author and title fields

Citation:
"pawlak zdzislaw", "zdzislaw 1982", "1982 rough", "rough sets", ...

Document:
"zdzislaw pawlak", "rough sets"
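A short sketch of bigram hashing over already normalised tokens. Processing each document field separately is an assumption, inferred from the slide listing only one bigram per field for the document.

```python
def bigrams(tokens):
    """Return every pair of adjacent tokens joined by a space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


# Normalised tokens of the example citation.
citation_tokens = ("pawlak zdzislaw 1982 rough sets internat comput "
                   "inform sci 11 5 341 356").split()
citation_hashes = bigrams(citation_tokens)

# For a document only the author and title fields are used; each field is
# processed on its own here (an assumption), so no bigram spans two fields.
document_fields = {"author": ["zdzislaw", "pawlak"], "title": ["rough", "sets"]}
document_hashes = [h for field in document_fields.values() for h in bigrams(field)]

print(citation_hashes[:4])
# ['pawlak zdzislaw', 'zdzislaw 1982', '1982 rough', 'rough sets']
print(document_hashes)
# ['zdzislaw pawlak', 'rough sets']
```

In this example the citation and the document share the hash "rough sets", which is enough to make them a candidate pair.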

Page 17: Efficient blocking method for a large scale citation matching

name-year

• For citation:
  • name: any of the first 4 distinct text tokens
  • year: any number between 1900 and 2050

Citation:
pawlak#1982, zdzislaw#1982, rough#1982, sets#1982

Document:
zdzislaw#1982, pawlak#1982

+ approximate variant: zdzislaw#1981, pawlak#1981, zdzislaw#1983, pawlak#1983
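The name-year rule can be sketched as follows. Two details are assumptions: a "text token" is taken to be any non-numeric token, and the approximate variant is generated on the document side by also emitting the neighbouring years, which is how the slide lists the extra hashes. Function names are illustrative.

```python
from itertools import product


def name_year_hashes(tokens, max_names=4, min_year=1900, max_year=2050):
    """Citation side: (one of the first distinct text tokens) # (year-like number)."""
    names = list(dict.fromkeys(t for t in tokens if not t.isdecimal()))[:max_names]
    years = list(dict.fromkeys(
        t for t in tokens if t.isdecimal() and min_year <= int(t) <= max_year))
    return [f"{name}#{year}" for name, year in product(names, years)]


def document_name_year_hashes(author_tokens, year, approximate=False):
    """Document side: author tokens with the publication year (±1 if approximate)."""
    years = [year - 1, year, year + 1] if approximate else [year]
    return [f"{name}#{y}" for y in years for name in author_tokens]


tokens = ("pawlak zdzislaw 1982 rough sets internat comput "
          "inform sci 11 5 341 356").split()
print(name_year_hashes(tokens))
# ['pawlak#1982', 'zdzislaw#1982', 'rough#1982', 'sets#1982']
print(document_name_year_hashes(["zdzislaw", "pawlak"], 1982))
# ['zdzislaw#1982', 'pawlak#1982']
```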

Page 18: Efficient blocking method for a large scale citation matching

name-year-pages

• For citation:
  • pages: any sorted pair of numbers, not year

Citation:
pawlak#1982#5#11, pawlak#1982#5#341, pawlak#1982#..., pawlak#1982#341#356, zdzislaw#..., zdzislaw#1982#341#356, rough#..., sets#...

Document:
zdzislaw#1982#341#356, pawlak#1982#341#356

+ approximate & optimistic variant

Page 19: Efficient blocking method for a large scale citation matching

Intermezzo: citation parsing

Pawlak   ,       Zdzisław   (       1982   )       .       ...
author   other   author     other   year   other   other   ...

Pawlak, Zdzisław (1982). "Rough sets". Internat. J. Comput. Inform. Sci. 11 (5): 341–356.

Page 20: Efficient blocking method for a large scale citation matching

name-year-num^n

• n = 1..3
• For citation:
  • num^n: any sorted tuple of n numbers, not year

Citation:
pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#..., rough#..., sets#...

Document:
pawlak#1982#5#11#341, pawlak#1982#5#341#356, pawlak#1982#5#11#356, pawlak#1982#11#341#356, zdzislaw#...

+ approximate variant
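A sketch of the citation-side name-year-num^n hashes, built with sorted n-element combinations of the non-year numbers. The function name and parameters are illustrative, and reading "not year" as "exclude the number used as the year" is an assumption. With n = 2 the same function produces name#year#number#number hashes of the shape shown for name-year-pages.

```python
from itertools import combinations


def name_year_num_hashes(tokens, n, max_names=4, min_year=1900, max_year=2050):
    names = list(dict.fromkeys(t for t in tokens if not t.isdecimal()))[:max_names]
    numbers = {int(t) for t in tokens if t.isdecimal()}
    hashes = []
    for year in sorted(x for x in numbers if min_year <= x <= max_year):
        # Every number except the one used as the year, in ascending order,
        # so that each combination is already a sorted tuple.
        others = sorted(numbers - {year})
        for combo in combinations(others, n):
            suffix = "#".join(str(x) for x in combo)
            hashes.extend(f"{name}#{year}#{suffix}" for name in names)
    return hashes


tokens = ("pawlak zdzislaw 1982 rough sets internat comput "
          "inform sci 11 5 341 356").split()
triples = name_year_num_hashes(tokens, 3)
print(len(triples))                       # 16 (4 names × 4 sorted triples)
print("pawlak#1982#5#11#341" in triples)  # True
```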

Page 21: Efficient blocking method for a large scale citation matching

Evaluation

Page 22: Efficient blocking method for a large scale citation matching

Test dataset

<ref id="pone.0052832-Jemal1"><label>2</label><mixed-citation publication-type="journal"><name><surname>Jemal</surname><given-names>A</given-names></name>, <name><surname>Bray</surname><given-names>F</given-names></name>, <name><surname>Center</surname><given-names>MM</given-names></name>, <name><surname>Ferlay</surname><given-names>J</given-names></name>, <name><surname>Ward</surname><given-names>E</given-names></name>, <etal>et al</etal> (<year>2011</year>) <article-title>Global cancer statistics</article-title>. <source>CA Cancer J Clin</source><volume>61</volume>: <fpage>69</fpage>–<lpage>90</lpage><pub-id pub-id-type="pmid">21296855</pub-id></mixed-citation></ref>


Page 25: Efficient blocking method for a large scale citation matching

Test dataset

2 Jemal A, Bray F, Center MM, Ferlay J, Ward E, et al (2011) Global cancer statistics. CA Cancer J Clin 61: 69–90

Page 26: Efficient blocking method for a large scale citation matching

Test dataset

• Based on the Open Access Subset of PMC
• Only citations preserving original formatting
• Only citations with PMID assigned
• 528k documents
• 3.6M citations, of which 321k are resolvable

Page 27: Efficient blocking method for a large scale citation matching

Metrics

• Recall: the percentage of true citation → document links that are maintained by the heuristic

• Precision: the percentage of citation → document links returned by the algorithm that are correct

• Intermediate data: total number of hashes and pairs generated (before selecting the most popular ones)

• Candidate pairs: number of pairs returned by the heuristic for further assessment

• F-measure not included intentionally
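The first two metrics can be computed as in the sketch below, assuming candidate pairs and true links are both represented as sets of (citation ID, document ID) tuples. The sample data is made up.

```python
def recall(candidate_pairs, true_links):
    """Share of true citation → document links kept by the heuristic."""
    return len(candidate_pairs & true_links) / len(true_links)


def precision(candidate_pairs, true_links):
    """Share of returned citation → document links that are correct."""
    return len(candidate_pairs & true_links) / len(candidate_pairs)


candidates = {("c1", "d1"), ("c1", "d2"), ("c2", "d3"), ("c3", "d9")}
truth = {("c1", "d1"), ("c2", "d3"), ("c4", "d4")}

print(recall(candidates, truth))     # 2 of 3 true links are kept
print(precision(candidates, truth))  # 0.5
```

For a blocking stage low precision is expected, because the candidates are still passed on for further assessment.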

Page 28: Efficient blocking method for a large scale citation matching

Limits

• Candidate documents per citation
  • 30
  • no limit

• Bucket size
  • 10
  • 100
  • 1000
  • 10000
  • no limit

Page 29: Efficient blocking method for a large scale citation matching

Recall

Hash                               Precision  Recall  Intermediate data    To assess
bigrams (10000, 30)                     0.4%   98.2%        285,908,900   79,329,459
baseline (10000, 30)                    0.3%   97.9%        221,212,080  114,223,777
bigrams (100, 30)                       2.9%   92.7%         94,693,721   10,446,883
name-year (approx.)                     0.0%   92.4%        928,068,651  862,357,212
name-year (strict)                      0.1%   90.2%        322,015,088  290,940,929
baseline (10000, 10)                    0.9%   88.7%        221,212,080   49,747,843
name-year-num (approx., 1000, 30)       1.2%   88.5%        170,633,938   23,591,933
name-year-num (strict, 1000, 30)        3.6%   88.3%         85,756,601    7,864,129
name-year (strict, 1000, 30)            2.5%   77.9%         28,463,067    9,940,403
name-year (approx., 1000, 30)           1.4%   75.6%         40,726,102   17,098,080
baseline (1000, 30)                     0.9%   73.2%        115,822,141   26,083,677

Page 30: Efficient blocking method for a large scale citation matching

Precision

Hash                                   Precision  Recall  Intermediate data    To assess
name-year-pages (strict, optimistic)       98.4%    7.3%          4,787,215       23,734
name-year-num^3 (strict)                   84.0%   43.4%        257,639,965      166,128
name-year-pages (approx., optimistic)      78.2%    7.8%         42,478,742       32,182
name-year-pages (strict, pessimistic)      53.7%   42.5%        132,809,210      254,208
name-year-num^3 (approx.)                  17.6%   47.1%        617,193,035      860,314
name-year-num^2 (strict)                   14.8%   66.6%        141,885,270    1,444,074
bigrams (10, 10)                           11.8%   65.6%         84,042,160    1,784,228

Page 31: Efficient blocking method for a large scale citation matching

Recall/intermediate data

Hash                                            Precision  Recall  Intermediate data    To assess
name-year (strict, 1000, 30)                         2.5%   77.9%         28,463,067    9,940,403
name-year (approx., 1000, 30)                        1.4%   75.6%         40,726,102   17,098,080
name-year-pages (strict, optimistic, 1000, 30)      98.4%    7.3%          4,787,215       23,734
name-year-num (strict, 1000, 30)                     3.6%   88.3%         85,756,601    7,864,129
bigrams (100, 30)                                    2.9%   92.7%         94,693,721   10,446,883
bigrams (10, 30)                                    11.8%   65.6%         84,042,160    1,793,997
baseline (1000, 30)                                  0.9%   73.2%        115,822,141   26,083,677
name-year-num (approx., 1000, 30)                    1.2%   88.5%        170,633,938   23,591,933
baseline (100, 30)                                   3.2%   44.0%         91,175,101    4,458,560
name-year-num^2 (strict, 1000, 30)                  18.4%   66.6%        141,553,137    1,165,181
baseline (10000, 30)                                 0.3%   97.9%        221,212,080  114,223,777

Page 32: Efficient blocking method for a large scale citation matching

Recall vs. intermediate data

Page 33: Efficient blocking method for a large scale citation matching

Recall/to assess

Hash                                              Precision  Recall  Intermediate data    To assess
name-year-pages (strict, optimistic, 1000, 30)        98.4%    7.3%          4,787,215       23,734
name-year-num^3 (strict, 1000, 30)                    84.0%   43.4%        257,637,645      165,995
name-year-pages (approx., optimistic, 1000, 30)       78.5%    7.8%         42,478,742       32,042
name-year-pages (strict, pessimistic, 1000, 30)       56.3%   42.5%        132,792,590      242,261
name-year-num^3 (approx., 1000, 30)                   19.1%   47.1%        617,046,925      794,284
name-year-num^2 (strict, 1000, 30)                    18.4%   66.6%        141,553,137    1,165,181
bigrams (10, 30)                                      11.8%   65.6%         84,042,160    1,793,997
name-year-pages (approx., pessimistic, 1000, 30)       9.9%   45.8%        172,447,469    1,483,980
name-year-num (strict, 1000, 30)                       3.6%   88.3%         85,756,601    7,864,129
name-year-num^2 (approx., 1000, 30)                    3.2%   69.8%        359,051,798    7,023,337
baseline (100, 30)                                     3.2%   44.0%         91,175,101    4,458,560
bigrams (100, 30)                                      2.9%   92.7%         94,693,721   10,446,883

Page 34: Efficient blocking method for a large scale citation matching

Recall vs. to assess

Page 35: Efficient blocking method for a large scale citation matching

Combination

Page 36: Efficient blocking method for a large scale citation matching

Lost citations

Hash                                              Lost fraction
name-year (approx., 1000, 30)                     12.4%
name-year-num^2 (approx., 1000, 30)               12.3%
name-year (strict, 1000, 30)                      9.8%
name-year-pages (approx., pessimistic, 1000, 30)  9.0%
baseline (10000, 10)                              6.7%
name-year-num (approx., 1000, 30)                 6.0%
name-year (strict)                                5.8%
name-year-num^2 (strict, 1000, 30)                5.6%
name-year (approx.)                               5.1%
name-year-num (strict, 1000, 30)                  4.4%
name-year-num^3 (approx., 1000, 30)               4.2%
baseline (1000, 30)                               3.7%

Page 37: Efficient blocking method for a large scale citation matching

Results

Hash sequences:
(1) bigrams (10000, 30)
(2) name-year-pages (strict, optimistic), name-year (strict, 1000, 30), name-year (strict, 10000, 30), bigrams (10000, 30)
(3) name-year-pages (strict, optimistic), name-year-pages (strict, pessimistic), bigrams (100, 30), bigrams (10000, 30)
(4) name-year-pages (strict, optimistic), bigrams (100, 30), bigrams (10000, 30)
(5) name-year-pages (strict, optimistic), name-year-num^3 (strict), bigrams (10, 10), bigrams (100, 30), bigrams (10000, 30)

Sequence  Recall  Intermediate data    To assess
(1)       98.17%        285,908,900   79,329,459
(2)       87.64%        187,394,452   41,152,278
(3)       96.86%        333,701,109   29,818,635
(4)       97.76%        202,590,413   30,582,488
(5)       97.73%        398,895,930   25,123,164

Page 38: Efficient blocking method for a large scale citation matching

Future work

• Other combinations
  • After fine-grained assessment
  • Various hash functions at the same time

• Further efficiency tuning
  • Limit the number of generated hashes

Page 39: Efficient blocking method for a large scale citation matching

CoAnSys Project

• An open source framework for mining very large collections of scientific publications

• Contains implementation of the presented workflow

• http://coansys.ceon.pl/

Page 40: Efficient blocking method for a large scale citation matching

Thank you! Questions?

Mateusz Fedoryszak
matfed@icm.edu.pl

http://coansys.ceon.pl/
http://adalab.icm.edu.pl/