Large scale citation matching using Apache Hadoop


Mateusz Fedoryszak, Dominika Tkaczyk and Łukasz Bolikowski

What is citation matching?

What is Hadoop?

We are part of CoAnSys

Pairwise similarity

Results

Performance tests

Intermezzo: Author indexing

Pawlak, Zdzisław (1982). "Rough sets". International Journal of Parallel Programming 11 (5): 341–356.

Token:  Pawlak  ,      Zdzisław  (      1982  )      .      ...
Label:  author  other  author    other  year  other  other  ...

During citation matching, links from bibliographic entries to the referenced publications are created. Such links indicate topical similarity between the linked texts, are used in assessing the impact of the referenced document, and improve navigation in the user interfaces of digital libraries.

First of all, metadata fields are extracted from a citation string in a process called citation parsing. It is done using the CRF-based parser supplied with CERMINE, a metadata and content extraction tool (Tkaczyk et al., 2012, DAS).

Our citation matching algorithm is implemented as part of CoAnSys, an open source framework for mining scientific publications. It is developed at the Centre for Open Science (CeON) in the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw.

Apache Hadoop is an open source implementation of the MapReduce paradigm, which allows big data to be partitioned and processed in small portions.
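To make the paradigm concrete, here is a minimal single-process sketch of the map, sort & shuffle, and reduce phases. It uses plain Python rather than any Hadoop API, and the function names and the sample documents are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Imitate map -> sort & shuffle -> reduce in a single process."""
    groups = defaultdict(list)
    for record in records:                  # map phase: emit (key, value) pairs
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: collect values per key
    # reduce phase: one reducer call per key, keys visited in sorted order
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical input: (document id, author tokens) pairs.
documents = [
    ("doc1", ["pawlak", "zdzislaw"]),
    ("doc2", ["newton", "isaac"]),
    ("doc3", ["pawlak", "jan"]),
]


def emit_tokens(document):
    """Mapper: emit one (token, document id) pair per author token."""
    doc_id, tokens = document
    for token in tokens:
        yield token, doc_id


def collect_ids(token, doc_ids):
    """Reducer: gather the ids of all documents sharing a token."""
    return sorted(doc_ids)


index = map_reduce(documents, emit_tokens, collect_ids)
print(index["pawlak"])
```

On a real cluster the mapper and reducer calls would run in parallel on separate partitions of the input; the sketch only shows how the data is regrouped between the two phases.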

Then, those pieces of information are used to assess the similarity between two citations, or between a citation and document metadata, by means of fuzzy similarity metrics which are combined by an SVM.

Given the pairwise similarities, citations are clustered using a simple single-link algorithm. We have tested the clustering correctness on the CORA-ref dataset and compared the results with those achieved by the Joint MLN method (Poon and Domingos, 2007, AI).
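Single-link clustering can be sketched as finding connected components: two citations share a cluster whenever a chain of sufficiently similar pairs joins them. The sketch below assumes the pairwise step yields a numeric score per pair and uses a hypothetical threshold of 0.5; the citation ids and scores are made up for illustration and this is not the CoAnSys implementation.

```python
def single_link_clusters(items, linked_pairs):
    """Cluster items so that any chain of linked pairs ends up in one cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a         # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(cluster) for cluster in clusters.values())


def pairs_above(similarities, threshold):
    """Keep only the pairs whose score reaches the (assumed) threshold."""
    return [pair for pair, score in similarities.items() if score >= threshold]


# Hypothetical pairwise scores between five citations.
similarities = {
    ("c1", "c2"): 0.9,
    ("c2", "c3"): 0.8,
    ("c1", "c3"): 0.2,   # weak direct link, but c1 and c3 are joined through c2
    ("c3", "c4"): 0.1,
    ("c4", "c5"): 0.7,
}
clusters = single_link_clusters(
    ["c1", "c2", "c3", "c4", "c5"], pairs_above(similarities, 0.5)
)
print(clusters)
```

Note that c1 and c3 land in the same cluster even though their direct score is low, which is the defining behaviour of single-link clustering.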

- Author similarity: heaviest matching of tokens, an instance of the assignment problem
- Journal similarity: longest common subsequence of characters
- Title similarity: a fraction of common trigrams (character-based)
- Pages similarity: a fraction of common tokens
- Year equality
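The metrics listed above can be sketched as follows. The poster does not state how each value is normalised, so the choices here are assumptions: Jaccard overlap for the "fraction of common" metrics, division by the longer string for the longest common subsequence, and normalised LCS as the weight of a token pair in the author matching. The heaviest matching is found by brute force over permutations, which is only practical for short token lists.

```python
from itertools import permutations


def trigrams(text):
    """Set of character trigrams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def overlap(a, b):
    """Fraction of common elements of two sets (Jaccard overlap, an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def title_similarity(a, b):
    return overlap(trigrams(a), trigrams(b))


def pages_similarity(tokens_a, tokens_b):
    return overlap(set(tokens_a), set(tokens_b))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == other else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def journal_similarity(a, b):
    longest = max(len(a), len(b))
    return lcs_length(a.lower(), b.lower()) / longest if longest else 0.0


def author_similarity(tokens_a, tokens_b):
    """Weight of the heaviest token matching, divided by the longer list length."""
    short, long_ = sorted((tokens_a, tokens_b), key=len)
    if not short:
        return 0.0
    best = max(
        sum(journal_similarity(s, l) for s, l in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best / len(long_)


def year_equal(a, b):
    return a == b


print(title_similarity("Rough sets", "Rough set"))
print(author_similarity(["pawlak", "zdzislaw"], ["zdzislaw", "pawlak"]))
```

For longer author lists, a polynomial-time assignment solver such as the Hungarian algorithm would replace the permutation search.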

Hadoopised citation matching

To limit the number of pairwise comparisons, we have used a heuristic based on author indexing. To be less vulnerable to spelling errors, we have implemented an index supporting retrieval of tokens with edit distance less than or equal to 1.

Instead of using an exact word w as a key, we store all the rotations of w$ (where $ is a character not present in the alphabet). In a similar manner, to retrieve a word from the index, we create all its rotations, and for each rotation r of length n we return all the keys of length ≤ n that match at least the first n-1 letters of r, as well as the keys of length ≤ n+1 that match its first n letters.
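The lookup rule above can be sketched in memory as follows. This is an illustration under assumptions, not the CoAnSys code: the class and method names are hypothetical, the query is rotated together with its $ marker, and a sorted list with binary search stands in for whatever persistent store the real index uses.

```python
from bisect import bisect_left
from collections import defaultdict

END = "$"  # marker assumed to be absent from the alphabet


def rotations(word):
    """All rotations of word + END."""
    marked = word + END
    return [marked[i:] + marked[:i] for i in range(len(marked))]


class RotationIndex:
    def __init__(self, words):
        self.words_by_key = defaultdict(set)
        for word in words:
            for key in rotations(word):
                self.words_by_key[key].add(word)
        self.keys = sorted(self.words_by_key)

    def _keys_with_prefix(self, prefix):
        # Keys sharing a prefix are contiguous in sorted order.
        start = bisect_left(self.keys, prefix)
        for key in self.keys[start:]:
            if not key.startswith(prefix):
                break
            yield key

    def lookup(self, query):
        """Return indexed words within edit distance 1 of the query."""
        found = set()
        for r in rotations(query):
            n = len(r)
            for key in self._keys_with_prefix(r[:n - 1]):
                # Keys of length <= n matching the first n-1 letters of r
                # cover an exact match, a substitution, or a shorter word.
                matches_n_minus_1 = len(key) <= n
                # Keys of length <= n+1 matching all n letters of r
                # cover a word that is one letter longer.
                matches_n = len(key) <= n + 1 and key.startswith(r)
                if matches_n_minus_1 or matches_n:
                    found |= self.words_by_key[key]
        return found


index = RotationIndex(["pawlak", "pawlik", "pawlaks", "pwlak", "pavlik", "newton"])
print(sorted(index.lookup("pawlak")))
```

Rotating the word moves the position of the single differing letter to the end of the key, so that a plain prefix comparison is enough to detect one substitution, insertion, or deletion.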

Author index building

[Figure: author index building pipeline. Stages: Document retrieval, Token extraction, Grouping, Rotations generation, Persisting. Data passed between stages: Documents (doc.1, doc.2); Document IDs and extracted tokens (docId.1, docId.2; token.1, token.2, token.3); Document IDs grouped by tokens; Rotations for each token (rot.2.1, rot.2.2, rot.2.3).]

Actual matching

[Figure: matching pipeline. Stages: Document reading, Reference string extraction, Heuristic matching, Selecting the best matching, Storing resolved references. Data passed between stages: Documents (doc.1, doc.2, doc.3); References (ref.2.1, ref.2.2, ref.2.3); References with candidate documents (ref.2.2 with cand.2.2.1, cand.2.2.2, cand.2.2.3); Best matching pairs (ref.2.2 with cand.2.2.3; in general ref.x.y with cand.x.y.z).]

We have evaluated the efficiency of our solution using the PMC Open Access Subset document set. It consists of over [illegible] thousand documents containing 12 million citations. The benchmark was performed on our Hadoop cluster, which consists of four 'fat' slave nodes, each containing [illegible] CPU cores.

Phase                                       Time spent        Task No.
                                            [hrs:mins:secs]   Map   Reduce
Index building   All                        0:00:57           13    2
Matching         Citation extraction        0:00:46           13
                 Heuristic matching         03:01:38          745
                 Selecting the best match   02:51:00          996   1

                     Fold 0   Fold 1   Fold 2   Average   Joint MLN
cluster recall       65.82%   72.50%   77.53%   71.95%    78.10%
pairwise precision   95.21%   97.51%   94.98%   95.90%    97.00%
pairwise recall      93.91%   93.06%   97.43%   94.80%    94.30%
pairwise F1          94.56%   95.23%   96.19%   95.33%    95.63%

[Figure: MapReduce data flow. Phases: Map, Sort & shuffle, Reduce.]

[Figure: citation matching example. The references
[1] I. Newton, Philosophiæ naturalis...
[2] N. Copernicus, De revolutionibus...
are linked to records of a metadata table:

ID   Title                  Author
14   De revolutionibus...   Copernicus
11   Στοιχεῖα               Εὐκλείδης]

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw

{m.fedoryszak, d.tkaczyk, l.bolikowski}@icm.edu.pl

Software

Visit us at https://github.com/CeON/CoAnSys/
