Large scale citation matching using Apache Hadoop
Mateusz Fedoryszak, Dominika Tkaczyk and Łukasz Bolikowski
What is citation matching?
What is Hadoop?
We are part of CoAnSys
Pairwise similarity
Results
Performance tests
Intermezzo: Author indexing
Pawlak, Zdzisław (1982). "Rough sets". International Journal of Parallel Programming 11 (5): 341–356.

token:  Pawlak   ,       Zdzisław   (       1982   )       .       ...
label:  author   other   author     other   year   other   other   ...
During citation matching, links from bibliographic entries to the referenced publications are created. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries.
First of all, metadata fields are extracted from a citation string in a process called citation parsing. It is done using a CRF-based parser supplied with CERMINE, a metadata and content extraction tool (Tkaczyk et al., 2012, DAS).
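As an illustration of the input that citation parsing works on, the following minimal sketch splits a citation string into word and punctuation tokens and pairs them with field labels. The regular expression, the `tokenize` and `token_features` helpers and the feature names are assumptions made for this example; they are not taken from CERMINE, and the labels are typed in by hand rather than predicted by a CRF.

```python
import re

def tokenize(citation):
    """Split a citation string into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", citation)

def token_features(token):
    """Hypothetical per-token features of the kind a sequence labeller could use."""
    return {
        "is_number": token.isdigit(),
        "is_capitalised": token[:1].isupper(),
        "is_punctuation": not any(ch.isalnum() for ch in token),
    }

tokens = tokenize("Pawlak, Zdzisław (1982).")
# Labels entered by hand for illustration; a trained parser would predict them.
labels = ["author", "other", "author", "other", "year", "other", "other"]

for token, label in zip(tokens, labels):
    print(f"{token!r:12} {label:8} {token_features(token)}")
```

A real parser would learn the mapping from such features to labels from annotated citations; the sketch only shows the shape of the data.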
Our citation matching algorithm is implemented as a part of CoAnSys, an open source framework for mining scientific publications. It is developed at the Centre for Open Science (CeON) in the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw.
Apache Hadoop is an open source implementation of the MapReduce paradigm, which allows big data to be partitioned and processed in small portions.
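The MapReduce idea can be sketched in a few lines of plain Python. This is an in-memory toy under stated assumptions, not Hadoop code: `map_reduce`, `tokens_mapper` and `sum_reducer` are hypothetical names, and the grouping step stands in for Hadoop's sort and shuffle phase.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:                 # map phase: each record is independent
        for key, value in mapper(record):
            groups[key].append(value)      # stands in for sort and shuffle
    # reduce phase: each key is processed independently of the others
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def tokens_mapper(citation):
    """Emit (token, 1) for every whitespace-separated token."""
    for token in citation.lower().split():
        yield token, 1

def sum_reducer(key, values):
    """Sum the counts collected for one token."""
    return sum(values)

counts = map_reduce(["rough sets", "fuzzy sets"], tokens_mapper, sum_reducer)
print(counts)
```

Because mappers see one record at a time and reducers see one key at a time, both phases can be spread over many machines, which is what makes the paradigm suitable for large inputs.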
Then, those pieces of information are used to assess the similarity between two citations, or between a citation and document metadata, by means of fuzzy similarity metrics which are combined by an SVM.
Given the pairwise similarities, citations are clustered using a simple single-link algorithm. We have tested the clustering correctness on the CORA-ref dataset and compared the results with those achieved by the Joint MLN method (Poon and Domingos, 2007, AI).
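Single-link clustering over pairs judged similar can be sketched with a union-find structure: two citations end up in the same cluster whenever a chain of similar pairs connects them. The function name and the assumption that the input is a list of already accepted pairs (for instance, pairs the classifier marked as matching) are ours, not taken from the poster.

```python
def single_link_clusters(n, similar_pairs):
    """Cluster items 0..n-1; items joined by any chain of similar pairs share a cluster."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # merge the two clusters

    clusters = {}
    for item in range(n):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())

print(single_link_clusters(5, [(0, 1), (1, 2), (3, 4)]))
```

Note that items 0 and 2 are clustered together although they were never compared directly, which is the defining property of the single-link criterion.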
- Author similarity: heaviest matching of tokens, an instance of the assignment problem
- Journal similarity: longest common subsequence of characters
- Title similarity: a fraction of common trigrams (character-based)
- Pages similarity: a fraction of common tokens
- Year equality
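The metrics listed above could look roughly as follows. This is a sketch under assumptions: the poster does not state how each fraction is normalised, so dividing by the larger of the two sizes is our choice, as are all function names, the use of the LCS ratio as the token-pair weight for authors, and the brute-force search over token assignments (adequate only for short name lists).

```python
import re
from itertools import permutations

def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_ratio(a, b):
    """Journal-style similarity: LCS length over the longer string (assumed normalisation)."""
    a, b = a.lower(), b.lower()
    return lcs_length(a, b) / max(len(a), len(b)) if a and b else 0.0

def trigram_fraction(a, b):
    """Title-style similarity: shared character trigrams over the larger trigram set."""
    ta = {a.lower()[i:i + 3] for i in range(len(a) - 2)}
    tb = {b.lower()[i:i + 3] for i in range(len(b) - 2)}
    return len(ta & tb) / max(len(ta), len(tb)) if ta and tb else 0.0

def token_fraction(a, b):
    """Pages-style similarity: shared tokens over the larger token set."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    return len(ta & tb) / max(len(ta), len(tb)) if ta and tb else 0.0

def author_similarity(a, b):
    """Heaviest one-to-one matching of name tokens, found by brute force."""
    short, long_ = sorted((a.lower().split(), b.lower().split()), key=len)
    if not short:
        return 0.0
    best = max(
        sum(lcs_ratio(s, l) for s, l in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    return best / len(long_)

def year_equal(a, b):
    """Year equality as a 0/1 feature."""
    return 1.0 if a == b else 0.0
```

For longer token lists, the brute-force search would normally be replaced by a polynomial assignment solver such as the Hungarian algorithm.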
Hadoopised citation matching
To limit the number of pairwise comparisons, we have used a heuristic based on author indexing. To be less vulnerable to spelling errors, we have implemented an index supporting retrieval of tokens with edit distance less than or equal to 1.
Instead of using an exact word w as a key, we put all the rotations of w$ into the index (where $ is a character not present in the alphabet). For example, the rotations of cat$ are cat$, at$c, t$ca and $cat. In a similar manner, to retrieve a word from the index, we create all its rotations, and for each rotation r of length n we return all the keys of length ≤ n that match at least the first n-1 letters of r, and all the keys of length ≤ n+1 that match the first n letters.
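The rotation index described above can be sketched as follows. This is a minimal in-memory illustration, not the CoAnSys implementation: the `RotationIndex` class, the sorted key list with binary search for prefix lookups, and the choice to map keys back to words (rather than to document IDs) are all assumptions made for the example.

```python
from bisect import bisect_left
from collections import defaultdict

END = "$"  # end marker; assumed never to occur inside an indexed word

def rotations(word):
    """All rotations of word + END, e.g. cat -> cat$, at$c, t$ca, $cat."""
    s = word + END
    return [s[i:] + s[:i] for i in range(len(s))]

class RotationIndex:
    def __init__(self, words):
        self._entries = defaultdict(set)       # rotation key -> original words
        for word in words:
            for key in rotations(word):
                self._entries[key].add(word)
        self._keys = sorted(self._entries)     # sorted so prefixes form a contiguous range

    def _keys_with_prefix(self, prefix):
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i]
            i += 1

    def lookup(self, word):
        """Return indexed words within edit distance 1 of word."""
        found = set()
        for r in rotations(word):
            n = len(r)
            for key in self._keys_with_prefix(r[:n - 1]):
                # keys of length <= n share the first n-1 letters (deletion, substitution, exact);
                # keys of length n+1 must share all n letters (insertion)
                if len(key) <= n or (len(key) == n + 1 and key.startswith(r)):
                    found |= self._entries[key]
        return found

index = RotationIndex(["pawlak", "pavlak", "pawlk", "pawlaak", "pawel", "tkaczyk"])
print(sorted(index.lookup("pawlak")))
```

Rotating the word moves the position of a potential spelling error to the end of the key, so that every single-character edit reduces to a prefix query.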
Author index building
Actual matching
[Figure: author index building. Steps: Document retrieval → Token extraction → Grouping → Rotations generation → Persisting. Data passed between the steps: documents (doc.1, doc.2); document IDs and extracted tokens; document IDs grouped by tokens (token.1, token.2, token.3); rotations for each token (rot.2.1, rot.2.2, rot.2.3), each with its document IDs.]
[Figure: actual matching. Steps: Document reading → Reference string extraction → Heuristic matching → Selecting the best matching → Storing resolved references. Data passed between the steps: documents (doc.1, doc.2, doc.3); references (ref.2.1, ref.2.2, ref.2.3); references with candidate documents (ref.2.2 with cand.2.2.1, cand.2.2.2, cand.2.2.3); best matching pairs (ref.x.y with cand.x.y.z).]
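The matching steps named in the figure can be sketched end to end on toy data. Everything below is illustrative: the exact-token author index is a simplification of the fuzzy index described earlier, `score` is a hypothetical token-overlap stand-in for the SVM-combined similarity, and the 0.5 threshold is an arbitrary assumption.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def build_author_index(documents):
    """Map every author token to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, meta in documents.items():
        for token in tokenize(meta["author"]):
            index[token].add(doc_id)
    return index

def score(reference, meta):
    """Hypothetical similarity: share of the document's tokens found in the reference."""
    ref_tokens = set(tokenize(reference))
    doc_tokens = set(tokenize(meta["author"] + " " + meta["title"]))
    return len(ref_tokens & doc_tokens) / len(doc_tokens)

def match_references(references, documents, threshold=0.5):
    index = build_author_index(documents)
    resolved = {}
    for ref_id, text in references.items():
        candidates = set()                       # heuristic matching
        for token in tokenize(text):
            candidates |= index.get(token, set())
        if not candidates:
            continue
        # selecting the best matching candidate
        best = max(sorted(candidates), key=lambda d: score(text, documents[d]))
        if score(text, documents[best]) >= threshold:
            resolved[ref_id] = best              # storing the resolved reference
    return resolved

documents = {
    11: {"author": "Εὐκλείδης", "title": "Στοιχεῖα"},
    14: {"author": "Copernicus", "title": "De revolutionibus"},
}
references = {
    "ref.1": "I. Newton, Philosophiæ naturalis",
    "ref.2": "N. Copernicus, De revolutionibus",
}
resolved = match_references(references, documents)
print(resolved)
```

Only references that share an author token with some document are scored at all, which is how the index keeps the number of pairwise comparisons small.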
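Phase                                Time spent [hrs:mins:secs]   Task No. (Map)   Task No. (Reduce)
Index building: All                  0:00:57                      13               2
Matching: Citation extraction        0:00:46                      13
Matching: Heuristic matching         03:01:38                     745
Matching: Selecting the best match   02:51:00                     996              1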
Visit us at https://github.com/CeON/CoAnSys/
[Figure: MapReduce data flow. Phases: Map, Sort & shuffle, Reduce.]
References
[1] I. Newton, Philosophiæ naturalis...
[2] N. Copernicus, De revolutionibus...

ID   Title                  Author
14   De revolutionibus...   Copernicus
11   Στοιχεῖα               Εὐκλείδης
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
{m.fedoryszak, d.tkaczyk, l.bolikowski}@icm.edu.pl
We have evaluated the efficiency of our solution using the PMC Open Access Subset document set. It consists of over JLW thousand documents containing 12 million citations. The benchmark was performed on our Hadoop cluster, which consists of four 'fat' slave nodes, each containing Jq CPU cores.
                     Fold0   Fold1   Fold2   Average   Joint MLN
cluster recall       65.82   72.50   77.53   71.95     78.10
pairwise precision   95.21   97.51   94.98   95.90     97.00
pairwise recall      93.91   93.06   97.43   94.80     94.30
pairwise F1          94.56   95.23   96.19   95.33     95.63
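The pairwise F1 values follow from the reported precision and recall as their harmonic mean, and the Average column is the mean over the three folds. The short check below recomputes both from the per-fold pairwise precision and recall of our method (95.21/93.91, 97.51/93.06, 94.98/97.43); the helper names are ours.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# (pairwise precision, pairwise recall) per fold
folds = {
    "Fold0": (95.21, 93.91),
    "Fold1": (97.51, 93.06),
    "Fold2": (94.98, 97.43),
}

fold_f1 = {name: round(f1(p, r), 2) for name, (p, r) in folds.items()}
average_f1 = round(mean(fold_f1.values()), 2)
average_precision = round(mean(p for p, _ in folds.values()), 2)
average_recall = round(mean(r for _, r in folds.values()), 2)

print(fold_f1, average_f1, average_precision, average_recall)
```

The average F1 is the mean of the per-fold F1 scores, not the F1 of the averaged precision and recall, which is why it differs slightly from what the Average column's precision and recall would give.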