Large scale citation matching using Apache Hadoop


Mateusz Fedoryszak, Dominika Tkaczyk and Łukasz Bolikowski

What is citation matching?

What is Hadoop?

We are a part of CoAnSys!

Pairwise similarity

Results

Performance tests

Intermezzo: Author indexing

Pawlak, Zdzisław (1982). "Rough sets". International Journal of Parallel Programming 11 (5): 341–356.

Token:  Pawlak  ,      Zdzisław  (      1982  )      .      ...
Label:  author  other  author    other  year  other  other  ...

During citation matching, links from bibliographic entries to the referenced publications are created. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries.

First of all, metadata fields are extracted from a citation string in a process called citation parsing. It is done using a CRF-based parser supplied with CERMINE, a metadata and content extraction tool (Tkaczyk et al., 2012, DAS).
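The result of citation parsing can be pictured as one label per token. The following is a minimal sketch that assumes the tokens have already been labelled; the CRF model itself is not shown, and `group_fields` is a hypothetical helper for illustration, not part of CERMINE.

```python
def group_fields(tokens, labels):
    """Collect tokens into metadata fields, skipping those labelled 'other'."""
    fields = {}
    for token, label in zip(tokens, labels):
        if label != "other":
            fields.setdefault(label, []).append(token)
    return fields


# Tokens and labels taken from the parsing example on the poster.
tokens = ["Pawlak", ",", "Zdzisław", "(", "1982", ")", "."]
labels = ["author", "other", "author", "other", "year", "other", "other"]

fields = group_fields(tokens, labels)
print(fields)
```

The resulting dictionary maps each field name to its tokens, which is the form assumed by the similarity measures used later in the matching.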

Our citation matching algorithm is implemented as part of CoAnSys, an open source framework for mining scientific publications. It is developed at the Centre for Open Science (CeON) in the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw.

Apache Hadoop is an open source implementation of the MapReduce paradigm, which allows big data to be partitioned and processed in small portions.
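To make the paradigm concrete, here is a small in-memory imitation of the map, shuffle and reduce phases. It is only a sketch in plain Python, not Hadoop code: `map_reduce`, `token_mapper` and `sum_reducer` are hypothetical names, and the three input strings are toy data.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper on each record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase, with keys visited in sorted order
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def token_mapper(citation):
    """Emit (token, 1) for every whitespace-separated token."""
    for token in citation.lower().split():
        yield token, 1


def sum_reducer(key, values):
    """Add up the counts collected for one token."""
    return sum(values)


citations = [
    "Newton Philosophiae naturalis",
    "Copernicus De revolutionibus",
    "Newton Opticks",
]
counts = map_reduce(citations, token_mapper, sum_reducer)
print(counts["newton"])
```

In a real cluster each mapper call works on its own portion of the input, which is what allows the data to be processed in parallel.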

Then, the extracted metadata fields are used to assess the similarity between two citations, or between a citation and document metadata, by means of fuzzy similarity metrics which are combined by an SVM.

Given the pairwise similarities, citations are clustered using a simple single-link algorithm. We have tested the clustering correctness on the CORA-ref dataset and compared the results with those achieved by the Joint MLN method (Poon and Domingos, 2007, AI).
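Single-link clustering amounts to taking connected components of the graph whose edges are the pairs judged similar. A minimal sketch using union-find follows; it assumes the classifier has already decided which pairs match, and `single_link_clusters` is a hypothetical name.

```python
def single_link_clusters(items, similar_pairs):
    """Merge items into clusters: two items share a cluster if a chain of
    similar pairs connects them (single-link criterion)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return list(clusters.values())


clusters = single_link_clusters(
    ["c1", "c2", "c3", "c4"],
    [("c1", "c2"), ("c2", "c3")],
)
print(sorted(sorted(c) for c in clusters))
```

Note that c1 and c3 end up together although they were never compared directly; this transitivity is the defining property of single-link clustering.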

- Author similarity: heaviest matching of tokens, an instance of the assignment problem
- Journal similarity: longest common subsequence of characters
- Title similarity: a fraction of common trigrams (character-based)
- Pages similarity: a fraction of common tokens
- Year equality
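The measures listed above can be sketched as follows. Several details are assumptions made for illustration, since the poster does not state them: fractions are normalised by the larger of the two sets or strings, token weights in the author matching are LCS ratios, and the heaviest matching is found by brute force rather than a proper assignment solver. The SVM that combines the features is not shown.

```python
from itertools import permutations


def common_fraction(a, b):
    """Shared items divided by the size of the larger set (assumed normalisation)."""
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def char_trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    if not a or not b:
        return 0.0
    return lcs_length(a.lower(), b.lower()) / max(len(a), len(b))


def title_similarity(t1, t2):
    return common_fraction(char_trigrams(t1), char_trigrams(t2))


def pages_similarity(p1, p2):
    return common_fraction(set(p1.split()), set(p2.split()))


def author_similarity(tokens1, tokens2):
    """Heaviest matching of tokens, brute-forced (suitable for short lists only)."""
    short, long_ = sorted((tokens1, tokens2), key=len)
    if not short:
        return 0.0
    best = max(
        sum(lcs_similarity(s, l) for s, l in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best / len(long_)


def feature_vector(c1, c2):
    """Five features in the order of the list above, ready for a classifier."""
    return [
        author_similarity(c1["author"], c2["author"]),
        lcs_similarity(c1["journal"], c2["journal"]),
        title_similarity(c1["title"], c2["title"]),
        pages_similarity(c1["pages"], c2["pages"]),
        1.0 if c1["year"] == c2["year"] else 0.0,
    ]
```

For realistic author lists the brute-force search should be replaced by an assignment-problem solver such as the Hungarian algorithm.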

Hadoopised citation matching

To limit the number of pairwise comparisons, we have used a heuristic based on author indexing. To be less vulnerable to spelling errors, we have implemented an index supporting retrieval of tokens with edit distance less than or equal to 1.

Instead of putting an exact word w as a key, we put all the rotations of w$ (where $ is a character not present in the alphabet). In a similar manner, to retrieve a word from the index, we create all its rotations; for each rotation r of length n, we return all the keys of length ≤ n that match at least the first n-1 letters of r, and all the keys of length ≤ n+1 that match its first n letters.
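The rotation trick can be sketched with two dictionaries standing in for a prefix-searchable index. This is an illustration under assumptions, not the CoAnSys implementation: `RotationIndex` is a hypothetical class, the query is assumed to get the same `$` suffix as indexed words, and prefix matching is emulated by also storing each rotation without its last character.

```python
from collections import defaultdict

SENTINEL = "$"  # assumed never to occur inside a token


def rotations(word):
    """All rotations of word + SENTINEL."""
    s = word + SENTINEL
    return [s[i:] + s[:i] for i in range(len(s))]


class RotationIndex:
    def __init__(self):
        self.full = defaultdict(set)     # rotation -> tokens
        self.chopped = defaultdict(set)  # rotation minus last char -> tokens

    def add(self, token):
        for rot in rotations(token):
            self.full[rot].add(token)
            self.chopped[rot[:-1]].add(token)

    def lookup(self, query):
        """Tokens within edit distance 1 of query."""
        hits = set()
        for r in rotations(query):
            # keys of length n sharing the first n-1 letters (substitution or exact)
            hits |= self.chopped.get(r[:-1], set())
            # keys of length n-1 equal to the first n-1 letters (query has an extra letter)
            hits |= self.full.get(r[:-1], set())
            # keys of length n+1 starting with all n letters (query lacks a letter)
            hits |= self.chopped.get(r, set())
        return hits


index = RotationIndex()
for name in ["pawlak", "nowak", "kowalski"]:
    index.add(name)

print(index.lookup("pavlak"))
```

Rotating moves the position of the single edit to the end of the string, so one differing, missing or extra letter never disturbs the compared prefix.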

Author index building

Phases: Document retrieval -> Token extraction -> Grouping -> Rotations generation -> Persisting

Data: Documents -> Document IDs and extracted tokens -> Document IDs grouped by tokens -> Rotations for each token

Actual matching

Phases: Document reading -> Reference string extraction -> Heuristic matching -> Selecting the best matching -> Storing resolved references

Data: Documents -> References -> References with candidate documents -> Best matching pairs
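The two pipelines above can be imitated in a few lines of plain Python. This is a sketch under assumptions, not the Hadoop jobs themselves: the records are toy data modelled on the poster's example, an exact-token index stands in for the fuzzy rotation index, `score` is a placeholder for the SVM-based similarity, and all function names are hypothetical.

```python
from collections import defaultdict

documents = {  # toy records modelled on the poster's example table
    14: {"author": "Copernicus", "title": "De revolutionibus"},
    11: {"author": "Εὐκλείδης", "title": "Στοιχεῖα"},
}
references = {  # (citing document, position) -> raw reference string
    ("doc.1", 1): "I. Newton, Philosophiæ naturalis",
    ("doc.1", 2): "N. Copernicus, De revolutionibus",
}


def tokens(text):
    """Lower-cased words with surrounding dots and commas removed."""
    return {t.strip(".,").lower() for t in text.split() if t.strip(".,")}


# Index building: token extraction, then grouping document IDs by token.
author_index = defaultdict(set)
for doc_id, doc in documents.items():
    for tok in tokens(doc["author"]):
        author_index[tok].add(doc_id)


def candidates(reference):
    """Heuristic matching: documents sharing an author token with the reference."""
    found = set()
    for tok in tokens(reference):
        found |= author_index.get(tok, set())
    return found


def score(reference, doc):
    """Placeholder similarity: fraction of document tokens found in the reference."""
    doc_tokens = tokens(doc["author"] + " " + doc["title"])
    return len(doc_tokens & tokens(reference)) / len(doc_tokens)


def resolve(refs):
    """Select the best-scoring candidate for every reference that has one."""
    resolved = {}
    for ref_id, text in refs.items():
        scored = [(score(text, documents[c]), c) for c in candidates(text)]
        if scored:
            resolved[ref_id] = max(scored)[1]
    return resolved


resolved = resolve(references)
print(resolved)
```

Only the reference to Copernicus is resolved, because no document in the toy collection shares an author token with the Newton reference; the heuristic step therefore never hands it to the more expensive scoring.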

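Performance tests: time spent and number of tasks per phase

Phase                                       Time spent         Task No.
                                            [hrs:mins:secs]    Map    Reduce
Index building   All                        0:00:57            13     2
Matching         Citation extraction        0:00:46            13     0
                 Heuristic matching         3:01:38            745    0
                 Selecting the best match   2:51:00            996    1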

Visit us at https://github.com/CeON/CoAnSys/

Figure labels: Map, Sort & shuffle, Reduce

References
[1] I. Newton, Philosophiæ naturalis...
[2] N. Copernicus, De revolutionibus...

ID   Title                  Author
14   De revolutionibus...   Copernicus
11   Στοιχεῖα               Εὐκλείδης

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw

{m.fedoryszak, d.tkaczyk, l.bolikowski}@icm.edu.pl

We have evaluated the efficiency of our solution using the PMC Open Access Subset document set. It consists of over JLW thousand documents containing 12 million citations. The benchmark was performed on our Hadoop cluster, which consists of four 'fat' slave nodes, each containing Jq CPU cores.

Clustering results on CORA-ref:

                     Fold 0    Fold 1    Fold 2    Average   Joint MLN
cluster recall       65.82%    72.50%    77.53%    71.95%    78.10%
pairwise precision   95.21%    97.51%    94.98%    95.90%    97.00%
pairwise recall      93.91%    93.06%    97.43%    94.80%    94.30%
pairwise F1          94.56%    95.23%    96.19%    95.33%    95.63%
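The four measures reported above can be computed from a predicted and a gold clustering as sketched below. The definitions are assumed to be the usual ones (pairwise scores over pairs of citations placed in the same cluster, cluster recall as the fraction of gold clusters recovered exactly); the poster does not spell them out, and `clustering_scores` is a hypothetical name.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def clustering_scores(predicted, gold):
    """Return (pairwise precision, pairwise recall, pairwise F1, cluster recall)."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    common = len(pred_pairs & gold_pairs)
    precision = common / len(pred_pairs) if pred_pairs else 1.0
    recall = common / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    pred_sets = {frozenset(c) for c in predicted}
    gold_sets = {frozenset(c) for c in gold}
    cluster_recall = sum(g in pred_sets for g in gold_sets) / len(gold_sets)
    return precision, recall, f1, cluster_recall


gold = [{"a", "b", "c"}, {"d"}]
predicted = [{"a", "b"}, {"c"}, {"d"}]
precision, recall, f1, cluster_recall = clustering_scores(predicted, gold)
print(precision, recall, f1, cluster_recall)
```

In this toy case every predicted pair is correct, yet only one of the two gold clusters is recovered exactly, which shows why cluster recall is a much stricter measure than the pairwise scores.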