Table of Contentspublications.aston.ac.uk/30749/1/Geo_tagging_news_stories_using... · 50 Geo-Tagging News Stories Using Contextual Modelling; Md Sadek Ferdous, University of Southampton,

The International Journal of Information Retrieval Research is indexed or listed in the following: ACM Digital Library; Bacon’s Media Directory; Cabell’s Directories; DBLP; Google Scholar; INSPEC; JournalTOCs; MediaFinder; ProQuest Advanced Technologies & Aerospace Journals; ProQuest Computer Science Journals; ProQuest Illustrata: Technology; ProQuest SciTech Journals; ProQuest Technology Journals; Research Library; The Standard Periodical Directory; Ulrich’s Periodicals Directory; Web of Science; Web of Science Emerging Sources Citation Index (ESCI)

Editorial Preface

iv Vishal Bhatnagar, Ambedkar Institute of Advanced Communication Technologies and Research, Department of Computer Science and Engineering, Delhi, India

Research Articles

1 Offlinevs.OnlineSentimentAnalysis:IssuesWithSentimentAnalysisofOnlineMicro-Texts;

Ritesh Srivastava, University of Delhi, New Delhi, India

M.P.S. Bhatia, University of Delhi, New Delhi, India

19 SemanticAnalysisBasedApproachforRelevantTextExtractionUsingOntology;

Poonam Chahal, Manav Rachna International University, Faridabad, India

Manjeet Singh, YMCA University of Science and Technology, Faridabad, India

Suresh Kumar, Manav Rachna International University, Faridabad, India

37 FrequentItemsetMininginLargeDatasetsaSurvey;

Amrit Pal, Indian Institute of Information Technology, Allahabad, India

Manish Kumar, Indian Institute of Information Technology, Allahabad, India

50 Geo-TaggingNewsStoriesUsingContextualModelling;

Md Sadek Ferdous, University of Southampton, Southampton, United Kingdom

Soumyadeb Chowdhury, Singapore Institute of Technology, Singapore, Singapore

Joemon M Jose, University of Glasgow, Scotland, United Kingdom

72 HadoopMapOnlyJobforEncipheringPatient-GeneratedHealthData;

Arushi Jain, Ambedkar Institute of Advanced Communication Technologies and Research, Department of Computer Science and Engineering, New Delhi, India

Vishal Bhatnagar, Ambedkar Institute of Advanced Communication Technologies and Research, Department of Computer Science and Engineering, New Delhi, India

CoPyRightThe International Journal of Information Retrieval Research (IJIRR) (ISSN 2155-6377; eISSN 2155-6385), Copyright © 2017 IGI Global. All rights, including translation into other languages reserved by the publisher. No part of this journal may be reproduced or used in any form or by any means without written permission from the publisher, except for noncommercial, educational use including classroom teaching purposes. Product or company names used in this journal are for identification purposes only. Inclusion of the names of the products or companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark. The views expressed in this journal are those of the authors but not necessarily of IGI Global.

Volume 7 • Issue 4 • October-December-2017 • ISSN: 2155-6377 • eISSN: 2155-6385An official publication of the Information Resources Management Association

InternationalJournalofInformationRetrievalResearch

TableofContents

jtravers

Highlight

DOI: 10.4018/IJIRR.2017100104

International Journal of Information Retrieval ResearchVolume 7 • Issue 4 • October-December 2017

Copyright©2017,IGIGlobal.CopyingordistributinginprintorelectronicformswithoutwrittenpermissionofIGIGlobalisprohibited.

Geo-Tagging News Stories Using Contextual ModellingMd Sadek Ferdous, University of Southampton, Southampton, United Kingdom

Soumyadeb Chowdhury, Singapore Institute of Technology, Singapore, Singapore

Joemon M Jose, University of Glasgow, Scotland, United Kingdom

ABSTRACT

Withtheever-increasingpopularityofLocation-basedServices,geo-taggingadocument-theprocessofidentifyinggeographiclocations(toponyms)inthedocument-hasgainedmuchattentioninrecentyears.Therehavebeenseveralapproachesproposedinthisregardandsomeofthemhavereportedtoachievehigherlevelofaccuracy.Theexistingapproachesperformwellatthecityorcountrylevel,unfortunately,theperformancedegradesduringgeo-taggingatthestreet/localitylevelforaspecificcity.Moreover,thesegeo-taggingapproachesfailcompletelyintheabsenceofaplacementionedinadocument.Inthispaper,analgorithmispresentedtoaddressthesetwolimitationsbyintroducingamodelofcontextswithrespecttoanewsstory.Thealgorithmevolvesaroundtheideathatanewsstorycanbegeo-taggednotonlyusingplace(s)foundinthenews,butalsousingcertainaspectsofitscontext.Animplementationoftheproposedapproachispresentedanditsperformanceisevaluatedonauniquedatasetwherefindingssuggestanimprovementoverexistingapproaches.

KeywoRdSContextual Modelling, Evaluation, Geo-Tagging, Information Retrieval, Text Mining

INTRodUCTIoN

With the ever-increasing popularity of Location-based Services, geo-tagging a document - theprocessofidentifyinggeographiclocations(toponyms)inthedocument-hasgainedmuchattentioninrecentyears.Insuchservices,geographiclocationsactasthegluethatbindtogetherdisparatedocumentsets(suchastextualcontents,imagesandvideos)frommultipledatasources.Devicesthatproducemultimediadocumentssuchasimagesandvideosareequippedwiththecapabilitytohaveadditionalsensors(GPSsensors)thatcangeo-tagtherelateddocumentwithgeographicinformationsuchaslatitudeandlongitudeandtherespectiveinformationisstoredinametadataalongwiththecorrespondingdocument.Webservicesthataccumulatesuchdocuments(e.g.YouTubeandFlickr)canretrievesuchinformationautomatically.Inaddition,suchservicesallowanyusertomanuallytaganymultimediadocumentwithgeographiclocationsincasesthedocumentsarenotgeo-taggedbytheircapturingdevices.Unfortunately,thegeo-taggingprocedureisrathercumbersomefortextualdocumentsandgenerallyreliesonmanualhumaninput.Therehavebeenseveralworkstoaddress

50


51

thislimitationandsomeofthemhavereportedtoachievehighlevelofaccuracyasreportedin(Ding,2000),(Amitay,2004),(Garbin,2005),(Lieberman,2007),(Andogah,2012)and(Ignazio,2014).

Aspartofalarge-scaleproject,wehavebeencollectingnewsstoriesaboutacountryfromthecountry-specificRSSfeedofdifferentonlinenewswebsitesonadailybasisforaroundayear.ThemainideaistoaggregatethisdatasetwithothermodesofpublicdatasuchassocialmediapostsfromTwitter;multimediadatafromimagesharingwebsitessuchasFlickranddatafromwearablesensorssuchaslifeloggersandGPStrackerstocreateauniquemulti-modal(textualaswellasmultimedia)setofdataaboutaparticulargeographiclocation.Thiswillencodeexperiencesfrommultipleuserperspectivesandhasenormouspotentialinexploitingforpublicbenefit.Oneofthecorechallengesfordealingwithsuchheterogeneoussetofdataistodefinetheparametersthatcanbeusedtolinkthemtogetherfordifferentuse-casescenarios.Amongseveralparameters,thespatio-temporalattributepairisthesimplestofchoicesduetotheiromni-presenceinallourdatasetsexceptinnewsstories.

Newsstories,mostlytextual,areequippedwithatemporalattribute(intheformofatimestamp)tohighlightthetimeanddateofpublication,however,lackanyaccompanyingmetadatatopublicisethespatialattribute,eventhougheverynewsgenerallyhasageographicfocusinit(Andogah,2012).Thelackofanyspatialattributemakesitachallengingtasktogeo-taganewsstoryinanautomaticfashion.Togeo-tagourcollectionofnewsstories,wehavebeenlookingforpubliclyavailablegeo-taggingAPIs.CLAVIN(CLAVIN,2016)andCLIFF(Ignazio,2014)and(CLIFF,2015)aretwosuchAPIs.

AfterutilisingCLAVINandCLIFFoverasubsetofournewsdataset,wehavenoticed thefollowingshortcomings:

• Theyfailtogeo-tagadocumentintheabsenceofdirectmentionsofalocation;and• Theyfailtocreateanassociationbetweenafine-grainedlocationandacityincasesaTextual

documenthasbeengeo-tagged.

Whatwemeanbyafine-grainedlocationisatthegranularityofastreetoralocalityinacity.AnexampleofalocalityisChelseainLondonandanexampleofastreetisKing’sRoadinLondon,UK.Withoutaproperassociationbetweensuchfine-grainedlocationsandacity,itopensupthedoorfordisambiguity,sincemanycitiesmaysharethesamenameforalocalityorastreet.Thereasonforourinterestinsuchfine-grainedlocationsisthatitallowsustolinksuchnewswithotherdatasets,especiallylifelogsandGPStrails,whicharesupplementedwithsuchfine-grainedgeo-information.

Inthispaperweinvestigatethewaystheabovementionedproblemscanberectified.Especially,weinvestigatehowamathematicalmodelofcontextwithrespecttoanewsstorycanbedevelopedandhowsuchamodelcanberelatedwithamathematicalmodelofgeo-tagginganditsalgorithmicimplementationtorectifysuchproblems.

Inparticular,weseekanswerstothefollowingresearchquestions:

[RQ-1]:Canwedevelopamathematicalmodelofcontextrelatedtonewsstoriesandrelatewithamathematicalmodelofgeo-tagging?

[RQ-2]:Canweexploitthenewscontextmodeltogeo-tagnewsintheabsenceofdirectmentions?[RQ-3]:Canweexploitthenewscontextmodeltogeo-tagnewsatstreetgranularitylevelconsidering

thedisambiguitythatmayoccur?

Withthisintroduction,thepaperisorganisedasfollows.WedescribetherelatedworkinSection:RELATEDWORK.Section:GEO-TAGGINGMODELLINGintroducesourmathematicalmodel


52

of context and geo-tagging for a news story. In Section: IMPLEMENTATION, we discuss ourimplementationalongwiththealgorithmthatutilisesourmodelofcontextforgeo-tagginganewsstory.WedescribeourevaluationprocedureinSection:EVALUATIONandpresenttheresultsinSection:RESULT.Weanswerourresearchquestions,discusstheadvantagesandhighlightthelimitationsoftheproposedapproachinSection:DISCUSSION.WeconcludeinSection:CONCLUSION.

ReLATed woRK

Oneoftheearliestworksongeo-taggingtextualwebresourceswasreportedin(Ding,2000)wheretheauthorsintroducedheuristictechniquesforautomaticallydetectingthegeographicalscope(s)withintheresource.Thetechniquesreliedontheanalysisoftextualcontentsandexaminingthegeographicaldistributionofhyperlinkswithintheresources.Anevaluationoftheirreportwascarriedoutover150webresourcesandmorethan75%precisionandrecallwasreported.Finally,ageo-awaresearchenginewasdevelopedusingtheirproposedapproachtoshowthesuitabilityoftheirapproach.Theauthorsmainlyfocusedonthecitylevelgranularityanditwasnotinvestigatediftheapproachwouldbesuitableforstreet/localitylevelgranularity.

An influential work for geo-tagging web documents was presented in (Amitay, 2004). Thepaperdescribedadataminingapproachutilisingagazetteer (anatlasenlisting thenamesof allplaces)tolocateplacesmentionedwithinthedocumentaswellastodeterminethegeographicfocus,representingthebroaderlocalitysuchascitiesorstates,ofthedocument.Theauthorsalsodiscussedmechanismstoresolvetwotypesofambiguities:geo/non-geoandgeo/geo.Thefirstambiguitydepictsthescenarioswhenalocationnameissimilartoanynon-geographicname,e.g.Turkey,whereasthesecondambiguity(geo/geo)illustratesthescenarioswhenplacesindifferentcountriessharethesamename,e.g.London,EnglandandLondon,Canada.Basedontheevaluationover600webpages,theauthorsreportedaprecisionof82%forindividualgeo-tagsandaprecisionof91%indeterminingthegeographicfocusofthenews.Theirpaperalsodidnotinvestigateiftheapproachwouldbesuitableforstreet/localitylevelgranularity.

Oneofthemajorchallengesingeo-taggingadocumentistohandledisambiguity.Inthisregard,theauthorsin(Garbin,2005)presentedanapproachbasedonunsupervisedmachinelearningbyaggregating twopublicly available gazetteers. At first, ambiguous locations weredisambiguatedautomaticallybyapplyingpreferenceheuristicswhichactedasatrainingdatasetforthemachinelearner.Next,themachinelearnerwasusedtodisambiguateambiguouslocationsfromotherdata.Theirresultoftheirapproachwascomparedwithahuman-annotatednewscorpuscontaining7,739documentswith78.5%precision.

Liebermanet.al.presentedaSpatio-TextualsearchenginecalledSTEWARDwhichisasystemforgeo-tagging,determininggeographicfocus,querying,andvisualisinggeographiclocationsintextdocumentsin(Lieberman,2007).Theauthorsutilisedadocumenttaggertoextractpotentialreferencesoflocationswhicharethenresolvedtogeographiclocationsusingtwogazetteers.Toresolvedisambiguity,theyformulatedanalgorithmcalledPairStrengthAlgorithm.Thealgorithmutilisedthefrequencycountofambiguouslocations,theirdistancewithinthedocument,theirgeodesicdistanceandtheirpopulations.Thedocumentdistanceisameasurementofoffsetbetweenapairoflocationsfromthestartofthedocument.AnalgorithmcalledContext-AwareRelevancyDetermination(CARD)wasformulatedtodeterminethefocusofthedocument.Then,theproposedapproachwasappliedtodesignanddevelopasystemthatvisualisesthedocumentontoamapallowingdocumentretrievalbasedonspatio-textualqueries.Toresolvedisambiguity,wehaveadoptedasimilarapproachasdocumentdistanceforcalculatingthedistancebetweenacitynameandstreetname(calledvicinityscoreinourapproach,seeSection:IMPLEMENTATION)mentionedinthenews.

WiththeassumptionthateverysingledocumentinanIR(InformationRetrieval)Systemandthequerytoretrievesuchdocumentsfromthesystemhasageographicfocus,Andogahet.al.presentedanapproachfordetermininggeographicscopeofadocument(Andogah,2012).Then, thescope


53

hadbeenutilisedfortoponymresolution,relevancerankingandqueryexpansion.ThegeographicscopewasdeterminedbyextractingnamedentitiesusingaNamedEntityRecognizer(NER)tool.Especially, theyexploited thehierarchicalstructureof thenamedplacesandpeople,particularlypoliticalandgovernmentleaders,todeterminethescopeofthedocument.Then,theyemployedaheuristicalgorithmforresolvingambiguity.Finally,theauthorsdescribedhowtheirapproachcouldbeusedforqueryexpansionandrelevanceranking.AnevaluationwascarriedoveradatasetcalledTR-CoNLLcontaining946documents (Leidner, 2006).They reportedanaccuracyof79%overmanuallyannotatedarticlesfromthedatasetforgeographicresolutionaswellasanaccuracyof71%-80%overmanuallyannotatedarticlesfortoponymresolution.

A recent work on geo-tagging the news article was presented in (Ignazio, 2014) where theauthorsextendedanexistinggeo-taggingAPIcalledCLAVIN(CLAVIN,2016)byapplyingafewheuristicsbasedonthemethoddescribedin(Amitay,2004).Theirapproachdeterminedthefocusofanewsarticleaswellasallplacesmentionedinthearticle.Theyreported95%accuracyoverasmallmanuallyannotateddatasetof75newsand90%-91%accuracyindeterminingfocusatthecountrylevelusingseparate10,000samplesfromtheNewYorkTimesAnnotatedCorpus(Sandhaus,2008)andReutersRCV-1Corpusrespectively(Sandhaus,2004).

ThereareseveralcommercialAPIsavailableforgeo-taggingtextualdocumentssuchasnewsstories,e.g.OpenCalais(OpenCalais,2016),Placespotter(Placespotter,2016)andGeoTag(GeoTag,2016),however,wehavebeenlookingforpubliclyavailableAPIssothattheycanbeextendedtomeet our requirements. We have found two such APIs, namely, CLAVIN (CLAVIN, 2016) andCLIFF(CLIFF,2016).Betweenthesetwo,CLIFFisbasedonCLAVINandhasextendedCLAVIN’scapability.Moreover,ithasbeenreportedtoachievebetterperformancethanCLAVINin(Ignazio,2014). Therefore, we have selected CLIFF for our experiment. CLIFF utilises Stanford NER(StanfordNER,2016)toextractnamedentitiesandthenappliesafewheuristicstogeo-tagthenewsandtodetermineitsgeographicfocus.EventhoughCLIFFcanidentifyastreet/locality,itdoesnotassociateastreetwithacity.Inaddition,alltheworksdiscussedabovemainlyfocusedeitheronacountryoracitylevelanddidnotinvestigateiftheirapproachwouldbesuitableforstreetorlocalitylevelgeo-tagging.Furthermore,theirapproachwouldfailintheabsenceofdirectmentionsoflocations.

Havingbeeninspiredbytheworkof(Lieberman,2007)and(Ignazio,2014),wewouldliketoinvestigateifthementionedshortcomingscanbehandledandtheeffectivenessofgeo-taggingcanbeimprovedbyintroducingandexploitingamathematicalmodelofcontextandgeo-tagginginrelationtoanewsstory.Toourknowledge,thisisthefirstattempttoformaliseacontextwithrespecttoanewsstoryandthentoapplythatcontextforgeo-taggingnewsstories.

Geo-TAGGING ModeLLING

Inthissection,wedefineourmathematicalmodelofcontextofanewsstory.Atfirst,wedefinethetermContext.Next,wedefineamodelofspatiallocationinformation.Finally,werelatecontextualinformationwithspatial location informationbyformallydefining theprocessofgeocodingandgeo-tagging.

Contextual InformationTheterm“Context”(alsoknownascontextualinformation)hasbeenpopularisedfromthedomainofContext-awareservices.Interestingly,whatthetermContextmeansinhighlydebatedandithasbeendefinedinnumerousways.SchilitandTheimerusedthetermcontext-awareforthefirsttimein(Schilit,1994)wheretheydescribedcontextsaslocations,identitiesofnearbypeople,objectsandchangestothoseobjects.Similarly,Ryanetal.regardedcontextsastheuser’slocation,environment,identityandtime(Ryan,1997).Hulletal.representcontextsasdifferentaspectsofthecurrentsituation(Hull,1997).Oneofthemostaccurateandwidely-useddefinitionsisgivenbyAbowdetal.whereacontexthasbeendescribedas“anyinformationthatcanbeusedtocharacterizethesituationofentities


54

(i.e.,whetheraperson,placeorobject)thatareconsideredrelevanttotheinteractionbetweenauserandanapplication,includingtheuserandtheapplicationthemselves”(Abowd,1999).

InthedomainofInformationRetrieval,acontextisusedtodefinethestateofauser(includingtheuser’sspatio-temporalattributes,interests,previousretrievalhistoriesandsoon)andsuchinformationcanbeusedasaknowledgebaseinordertoachievehigheraccuracyduringinformationretrieval(Gross,2002).Interestingly,inaninformationretrievalsystemwhichmainlydealswithnewsstories(asinours),howanewsstoryisstructuredcanoffervaluableinsights,whichcanbesupplementedwiththecontextofausertoachievehighlyaccurateretrievalresults.

Anewsstoryoutlinesafactualstorybyanswering5corequestionsofwho,what,where,whenandwhy(Errico,1997).Oftheseanswers,theanswerofwhohighlightsthenamedentitiessuchaspeopleororganisationsmentionedinthenewswhereastheanswersofwhenandwherehighlightthetemporalandaspatiallocationinformationrespectivelyregardingthenews.Theanswersofwhatandwhyrepresentthecontentsofthenewsandcanbeusedtoclassifythenews.Thecombinedanswersofthesequestionsessentiallydefinethestateofanews,muchlikethewayacontextdefinesthestateofauser.Thismotivatesustodefinethecontextofanewsinthefollowingway:

Definition 1:Thecontextofanewsstoryconsistsofthenamedentitiesandthetemporalandaspatiallocationinformationmentionedinthenewsaswellasthecategoriesdepictingtheclassificationofthenews(seeFigure1).

Itisimportanttounderstandthatthelocationinformationinsuchacontextmerelyrepresentsanaspatiallocationallydescriptivetext,usuallyidentifiedbyaNamedEntityRecognizer(NER)andhence,isnotavalidspatialrepresentationofalocation.

Figure 1. Modelling contexts of a news


55

Mathematically,letP denotethesetofpeople, LN denotethesetofaspatiallocationnames,T denotethesingletonsetdepictingatimestampandO denotethesetoforganisations.Then,thecontextofanewscanbedefinedusingthefollowingset:

INFO P LN T OCONTEXT = ∪ ∪ ∪

where,P P⊆ , LN LN⊆ andO O⊆ .

Spatial Location ModellingAspatiallocationdescribesthephysicallocationofalocationnameandisrepresentedusinggeospatialcoordinatessuchaslatitudeandlongitude(Research,2016).Tomodelaspatiallocation,weassumeatreestructurewheretheworldistherootofthetreeandallcountriesintheworldisitsfirstlevelchildren.Acountryisassumedtobedividedindifferentcitiesconsistingofroads,localitiesandpostcodes.Themodelispresentedbelow.

LetC denotethesetofallcountriesintheworldandCITYc denotethesetofallcitiesinacountry c C∈ .Wedefinethesetofallroadsin city CITYc∈ ofacountry c with Rcity .Eachcountrydefinesitsownformatofapostcode.Withoutspecifyingwhatthatformatis,weassumethatapostcodeisassignedtoacollectionofoneormoreroadswithinaspecificcity.WedenotethesetofpostcodeswithPCcity forcity CITYc∈ .Torelateapostcodewiththecorrespondingroads,wedefinethefollowingfunction.

Definition 2:Let pcToRoads PC P Rcity city: → ( ) bethefunctionthatreturnsthesetofroads

whicharepartofthatpostcodein city CITYc∈ .

Inversely,wecandefineanotherfunctioncalled, roadToPC whichgivenaninputofaroadwillreturnthepostcodeofthatroad.Formally:

Definition 3:Let roadToPC R PCcity city: → bethefunctionthatreturnsthepostcodeofthatroadin city CITYc∈ .

Forexample,ifweassumethatthepostcode p PCcity∈ isassignedforroads r1 , r2 and r3 inthe city CITYc∈ ,then:

pcToRoads p r r r( ) = { }1 2 3, ,

roadToPC r p roadToPC r p and roadToPC r p1 2 3( ) = ( ) = ( ) =,

Anexampleofaroadalongwithitspostcodeis176King’sRoad(depictingtheroad),SW34UP(depictingthepostcode)incityLondon,UK.

Next,wedefinealocalityasthecollectionofseveralpostcodeswithinacityanddenotethesetof localitieswithinacityas LOCcity .Likeabove,wedefinethefollowingfunctionstorelateapostcodewithalocalitywithinacity.


56

Definition 4:LetlocToPC LOC P PCcity city: ( )→ bethefunctionthatreturnsthesetofpostcodeswhicharepartofthatlocalityin city CITYc∈ .

Definition 5:Let pcToLoc PC LOCcity city: → bethefunctionthatreturnsthelocalityofthatpostcodein city CITYc∈ .

Forexample,ifweassumethatalocalityloc LOCcity∈ consistsof p1 , p2 and p3 postcodesin city CITYc∈ ,then:

locToPC loc p p p( ) = { }1 2 3, ,

pcToLoc p loc pcToLoc p loc and pcToLoc p loc1 2 3( ) = ( ) = ( ) =,

AnexampleofalocalityinLondon,UKisChelseaconsistingofseveralpostcodesandoneofitspostcodesisSW34UP.Itmayhappenthatforacountryorevenforacityinacountrythereisnodefinedlocality.Insuchcases,thelocalitysetforthatcity LOCcity( ) willrepresentanemptyset.

Finally,wedefinealocationasanorderedpairconsistingofaroad,alocality,apostcode,acityandacountry.Formally,thesetoflocationsforacity city CITYc∈ incountry c C∈ isdenotedas Lcity andisdefinedas:

L r loc pc city ccity = ( ){ }, , , ,

where:

r R loc LOC pc PC city CITY and c Ccity city city c∈ ∈ ∈ ∈ ∈, , ,

Anexampleofanelementofthelocationsetcanbegivenasfollows:

l r pcToLoc roadToPC r roadToPC r city c= ( )( ) ( )( )1 1 1, , , ,

where l Lcity∈ , roadToPC r1( ) resolves the rode to the corresponding postcode and

pcToLoc roadToPC r1( )( ) resolves the road to the postcode which is then resolved to thecorrespondinglocality.Anexampleofalocationwherealocalityisdefinedis:176King’sRoad,Chelsea,SW34UP,London,UK.Anotherexampleofalocationwherealocalityisnotdefinedis:29EthelbertRoad,CT13NF,Canterbury,UKwhereCanterburyisacity,notalocality,intheUK.

Then,wecandefinethesetoflocationsforacountryc C∈ astheunionofthelocationsetforallcitiesinthatcountryandtheuniversallocationset(denotedasL )canbedefinedastheunionofthelocationsetofallcountriesintheworld.Formally:

L L city CITY and L L c CC city c C= ∪ ∈ = ∪ ∈{ | } { | }


57

Theremayexistdifferentwaysaspatialinformationcanberepresented.Inthispaper,weassumethataspatialinformationisrepresentedusingalocationasdefinedaboveandageographiccoordinateasdefinednext.

Ageographiccoordinateconsistsofalatitudeandlongitude.WedenotethesetoflatitudesasLAT andthesetoflongitudesasLON .Formally,ageographiccoordinateisdenotedasCOORD andisdefinedasanorderedpairoflatitudeandlongitude:

COORD lat lan lat LAT and lon LON= ( ) ∈ ∈{ , | }

Wedenotethesetofspatialinformationas INFOSPATIAL anddefineitasanorderedpairoflocationandcoordinatesinthefollowingway:

INFO Bcoord B L and coord COORDSPATIAL = ( ) ∈ ∈{ , | }

Geocoding and Geo-Tagging Modelling

Torelatecontextualinformation INFOCONTEXT( ) withspatialinformation INFOSPATIAL( ) ,wedefinetheconceptofgeocoding.In(Goldberg,2008),Geocodingisdefinedas:“theactoftransformingaspatiallocationallydescriptivetextintoavalidspatialrepresentationusingapredefinedprocess”.Formally,wedefinethegeocodingprocessasafunctionasfollows:

Definition 6:Let geocoding C INFO T INFOCONTEXT SPATIAL: ( \ )× → bethefunctionthat,giventheinputsofacountryandanycontextualinformationexceptatimestamp,returnsthespatialinformationforthatcontextwithinthatcountry.

Thecountryintheinputofthe geocoding functionisusedtorestrictthegeographicfocusontoaspecificcountryandisutilisedbythegeocodingservicessuchasGeocode.FarmAPI(Geocode,2016).

AsmentionedinSection1,ageo-taggingprocessidentifieslocationswithinadocument.Thisismerelyaninformalambiguousdefinitionasalocationinformationcanbeaspatial(asdefinedinINFOSPATIAL orspatial(asdefinedin INFOCONTEXT ).Arigorousformaldefinition,hence,shouldeliminatesuchambiguity.Furthermore,toresonatewiththescopeofthispaper,wemainlyfocusongeo-taggingnewsstories.Withthisgoalinmind,wedefinethegeo-taggingprocessforanewsstoryasafunctioninthefollowingwaywhereN isthesetofnewsstories:

Definition 7:Letgeo tagging N C P INFOnews SPATIAL− × →: ( ) bethefunctionthat,giventheinputsofanewsandacountry,returnsthesetofspatiallocationinformationwhichareidentifiedinthatnewsandlocatedwithinthatcountry.

AsinDefinition6,thecountryinputinthe geo taggingnews− functionisusedtorestrictthegeographicfocusontoaspecificcountry.

SUMMARy

Ourmodelofgeo-tagginganewsstoryissummarisedinFigure2.Inessence,ourmodelconsistsofdifferentatomicsets(P ,LN ,O ,R ,PC andsoon)andcombinedsets( Lcity , INFOCONTEXT ,


58

INFOSPATIAL andsoon)whichareusedtorepresentdifferententities,attributesandgeographicalproperties.Forexample, INFOCONTEXT encodesthenotionofcontextualinformationwithrespecttoanewsstorywhereas INFOSPATIAL representsaspatiallocationofacitywithinacountryalongwith its coordinates.Themodel alsoconsistsofdifferent functions thatdefine the inter-relationbetweenthespecifiedsetsusingmathematicalfunctions.Amongalldefinedfunctions,geocoding andgeo taggingnews− areofparticularinterest.Thegeocoding functionillustratestheconceptofconvertingacontextintoaspatiallocationsconsistingofthecorrespondingcoordinates.Ontheotherhandthe geo taggingnews− functionillustratestheconceptoftagginganewswithaspatialinformationandprovidesapowerfulabstractionthathidesawaytheinternalgeocodingprocess.

These twofunctions, in reality,provide theblue-print fordevelopingalgorithms thatcanbeutilisedtoimplementanimprovedgeo-tagger.Inthenextsection,weelaboratehowwehaveachievedthesegoals.

IMPLeMeNTATIoN

Toutilisethemodelofcontext,anapplication,calledGeo-Tagger,hasbeendesignedandimplemented.Theapplicationutilisesanalgorithm,calledGeo-Taggingalgorithm(seebelow),whichrevolvesaroundtheideathatthereareotheraspectsofacontext,apartfromalocation,thatcanbeusedtogeo-taganews.Especially,weseektoexploitinformationregardingorganisationsfoundinanewsstoryforgeo-taggingthestory.Insuch,ourideagoesbeyondthetechniquesutilisedbyexistinggeo-taggingtoolssuchasCLAVINandCLIFFwhereonlylocationinformationwasexploitedforgeo-tagging.Otheraspectsofacontextsuchaspeopleandtimestamphavenotbeenconsideredforourcurrentimplementation.Thisisbecauseatimestamphasnogeographiclocationassociatedwith

Figure 2. Geo-tagging model


59

it.Furthermore,eventhoughpeople,particularlypoliticalleaders,havebeenexploitedingeo-tagginganewsin(Andogah,2012),wearguethatthisisquitetrickyandcanbesusceptibletoerrors,sincethelocationofapersonisnotstationary.

The architecture of the Geo-Tagger is illustrated in Figure 3. The application relies on thefollowingcomponents:

• TheStanfordNamedEntityRecognizer (NER),which isused toextractnamedentitiesandaspatial location information such as locations, persons and organisations within a news(StanfordNER,2016)representingthesetofcontextualinformation INFOCONTEXT( ) ;

• Geocode.FarmRESTAPI(Geocode,2016),whichhasbeenusedtogeocodeanamedentity(consideredasanimplementationofthe geocoding function);

• Adatabasefromwhereanewstoryisretrievedandtowheretheresultaftergeo-taggingthenewsisstored;

• Aninputfilecontainingboundingboxcoordinatesofthecorrespondinggeographiclocation;and• TwoalgorithmscalledGeo-TaggingAlgorithmandAmbiguityResolutionAlgorithm(ARAin

short)describedbelow.

Sinceournewscollection(denotedas N )primarilyconsistsofnewsstoriesfromaspecificcountry,themainfocusoftheGeo-Taggeristoidentifylocationswithinthatcountry.Theboundingboxcoordinatesspecifiedintheinputfileisusedtofilteroutanyotherlocationsoutsideofthisboundingbox.Inthisway,theboundingboxcoordinatesintheinputfileactsasthegeographicfocusspecifiedbytheuserandrepresentsacountry c C∈ cforthe geo taggingnews− function.Thisisincontrastwithanyexistingapproachwherethegeographicalfocusisdetectedautomatically.TheadvantageofourapproachwillbediscussedinSection:Advantages.

Theflowforgeo-tagginganewsstoryisasfollows.TheuseroftheGeo-Taggerinputstherequiredboundingboxcoordinatesintotheinputfile.AnewsisfetchedfromthedatabaseandispassedintotheGeo-Taggingalgorithm(Algorithm1)alongwiththeinputfile.Thealgorithmprocessesthefileandoutputsthelocationsalongwiththeircoordinates.Thisinformationisthenstoredbackintothedatabasewithareferenceofthenews,indicatingthatthecorrespondingnewshasbeengeo-tagged.

Figure 3. Architecture of geo-tagger application


60

The geo-tagging algorithm (Algorithm 1) essentially is an implementation of thegeo taggingnews− function.Ittakesanews n N∈( ) andaboundingboxofacountry(representingc C∈ ).Internally,itexploitsthecontextualinformationextractedfromnamedentitiesusingtheStanford NER representing the INFOCONTEXT set. At the first phase, all aspatial locationsLN INFOCONTEXT⊆( ) areprocessedonebyone.Ifsuchalocationpresentsacityhavingthe

coordinates(retrievedusingtheGeocode.FarmAPI)withinthespecifiedboundingbox,thedocumentisgeo-taggedwiththelocation.Ifthelocationrepresentseitheralocalityorastreet,thenthelocationisfedintotheARA(Algorithm2).Thisisbecausesuchalocationmayexistinmorethanonecitywithinthespecifiedboundingbox.TheARAmayreturnaproperlydisambiguatedlocationandifso,thenewsisgeo-taggedwiththelocation.TheARAmayalsoreturnalistoflocationsindicatingthelocationsinthenewsarestillambiguous.Insuchcases,thenewsisgeo-taggedwiththelocationalongwiththisambiguitytagandallmatchedcitiesandisstoredinthedatabase.Atthesecondphase,allorganisations O INFOCONTEXT⊆( ) areextractedfromthenewsusingtheStanfordNER.ThecoordinatesofeachorganisationisthenretrievedusingGeocode.FarmAPIandiftheyarewithintheboundingbox,thenewsisgeo-taggedwiththelocationoftheorganisation.

Algorithm1.Geo-taggingalgorithm

Input: a news story n N∈( ), input file (representing c C∈ )

Output: spatial locations identified in the news (a subset of INFOSPATIAL )1: → use Stanford NER to extract the set INFOCONTEXT( ) of named entities within the news 2: → extract locations LN INFOCONTEXT⊆3:loop for ln LN∈4: if ln represents a city CITYc∈ then5: → utilise the Geocode.Farm API by passing ln and c and retrieve coordinates i INFOSPATIAL∈( ) of ln within c.6: → geotag n with i7: else if ln∈LOCcity or ln∈Rcity (i.e. ln representing a locality or a road in a city) then

8: → call ARA passing the location ln( ) and the named entities INFOCONTEXT( ) as inputs and retrieve coordinates i INFOSPATIAL∈( )9: if i null≠ then10: → geotag n with i11: end loop12: → extract the set of organisations locations O INFOCONTEXT⊆13: loop for o O∈14: →utilise the Geocode.Farm API by passing o and c and retrieve coordinates i INFOSPATIAL∈( ) of o within c15: if i null≠ then16: → geotag n with i17: end loop


61

Algorithm2.Ambiguityresolutionalgorithm

Input: an aspatial location ln( ) , a news n N∈( ), named entities from the news INFOCONTEXT( ) and a country c C∈Output: a set of spatial locations i INFOSPATIAL⊆ containing either one or multiple locations 1: → retrieve i INFOSPATIAL'⊆ the set of coordinates for ln in c using the Geocode.Farm API 2:loop for i i∈ '3: if coordinates in i belong to a city CITYc∈ then

4: → return i associating ln with city5: else if coordinates in i belong to different cities within country c then6: → extract city locations LN INFOCONTEXT⊆7: if only one city is found LN =( )1 then

8: → return i associating ln with city9: else if more than one match is found LN >( )1 then

10: → match between cities extracted in ln and i11: if only a single match is found then12: → return i associating ln with city13: else if more than one match is found then14: → determine the vicinity between ln and the matched cities 15: → choose city having the lowest vicinity score16: → return i associating ln with city17: → if more than one cities having the same vicinity score then

18: → store all cities in a list l1( )19: else if no city is found or list l1 exists then20: → extract O INFOCONTEXT⊆21: → retrieve locations for each organization using similar heuristic and return i associating ln with city22: end loop23: →return l1

TheAmbiguityResolutionAlgorithm(ARA)retrievesthecoordinatesofalocationusingtheGeocode.FarmAPI.Ifthelocation( ln LN∈ ,mainlyalocalityorastreet)isresolvedtoonlyonecityhavingthecoordinateswithintheboundingbox,thelocationisassociatedwiththecityandthenewsisgeo-taggedwiththelocationalongwithitscoordinates.Ifthelocationisresolvedtomorethanone city, cities residingwithin theboundingboxare selected.Then, aspatial city locationsLN INFOCONTEXT⊆( ) fromthenewsareextractedusingtheStanfordNER.Twolistsofcities

arematched.Ifonlyonematchisfound,thelocationisassociatedwiththematchedcity.Ifmorethanonematchisfound,thismeansthatthelocationmayresideinanyofthematchedcities.Inthe


62

nextstep,avicinityscorebetweenthelocationandmatchedcitiesiscalculated.Avicinityscoremeasuresthedistancebetweenthelocationandthematchedcitiesmentionedwithinthenews.Thedistanceitselfismeasuredbycountingthenumberofwordsbetweenthelocationandthecity.Theintuitionisthatalocalityorastreetisgenerallymentionedalongwithitscityinthesamesentenceand/orinthesameparagraph.Thelocationisassociatedwiththecityhavingthelowestvicinityscore.Ifmorethanonecityhasthesameorreasonablyclosevicinityscore,allofthemarestoredinalist(called l1).Asthelastresortforambiguityresolution,allorganisationsO INFOCONTEXT⊆ areextractedfromthenewsusingtheStanfordNER.CoordinatesforeachorganisationareretrievedusingtheGeocode.FarmAPI.Citieshavingthecoordinateswithintheboundingboxarechosenandarematchedagainstthecitieslistedin l1.Ifonlyonematchisfound,thelocationisassociatedwiththecityandthenewsisgeo-taggedwiththelocation.Thereasonforthisisthatoftenanewshavingmentionedalocalityorastreetmaydiscussaboutanorganisationfromthesamecity.However,thismaynotbealwaystrueandthusmayresultinmorethanonematch.Thismeansthatthelocationisstillambiguousandhence,alistofambiguouslocationsisreturnedbythealgorithm.

eVALUATIoN

The studywith thehighest number evaluateddocumentswas reported in (Ignazio, 2014)whichutilisedNewYorkTimesAnnotatedCorpus(Sandhaus,2008)andReutersRCV-1Corpusrespectively(Sandhaus,2004),bothannotatedatcountryandcitylevel.However,theevaluationwasnotconductedforfinergranularity(e.g.street/localitylevel).Toourknowledge,thereisnopubliclyavailabledatasetwhichisannotatedatstreetorlocalitylevel.Hence,acomparativeevaluationwithsuchdatasetswasnotperformed.Instead,wehavedesignedauserstudywithasmallerdatasetfurtherdiscussedinSection:OurDataSetbelow.Thisdatasetwasgeo-taggedbytheGeo-Taggerapplicationandusedfortheuserstudy.Themaingoalofthestudyistodemonstratetheeffectivenessofthealgorithmaswellastoidentifyitslimitations.Thedatasetgenerationprocedure,theuser-studyanditsprotocolsarepresentedbelow.

Reuter data SetForalargescalecomparison,wecomparedthecountrylevelresultidentifiedbyGeo-TaggerwiththereportedresultbyCLIFFusingtheReuterRCV-1Corpus(Sandhaus,2004).TheRCV-1corpusconsistsofover800,000newswhereeachnewsincludesacountrytag.Fromthiscollection,asampleof10,000newswasrandomlyselectedrepresentingourReuterdatasetwhichwerethengeo-taggedusingCLIFFandtheGeo-Tagger.

our data SetOurcollection( n )ofnewsstoriesconsistsofmorethan11,000newsretrievedfromdifferentnewswebsitesforaroundaperiodofoneyear,asdiscussedinSection:INTRODUCTION.Atfirst,wehaveutilisedtheCLIFFAPItogeo-tagallnewsstoriesin n generatingtwodatasets:thefirstset,S1 ,consistingofnewsforwhichCLIFFhasfailedtogenerateanylocationandthesecondset, S2 consistingofnewsthathavebeengeo-taggedbyCLIFF.Then,fromS1 ,wehavechosen100randomnewsrepresentingourfirstevaluationdataset(denotedasEVAL SET−

100)andfromS2 wehave

selected200randomnewsrepresentingoursecondevaluationdataset(denotedasEVAL SET−200

).Next, our application has been utilised to geo-tag these 300 news in EVAL SET−

100 and

EVAL SET−200

whicharethenusedtocarryoutthefollowinguserstudy.


63

User StudyTheweb-baseduserstudyconsistedof6subjects(Female:2;Male:4;agerange:25-35years).Thesubjectswererecruitedbysendinganemailto6differentresearchgroups.SinceEVAL SET−

100

and EVAL SET−200

primarilyconsistedofnewsfromaspecificgeographiclocation,oneoftherecruitmentconditionswasthatthesubjectneededtohavegoodfamiliarityregardingthatgeographiclocation.Thiswastoensurethatthesubjectcanidentifythelocationappropriatelywithinthatspecificregionandcanassociatethestreetwithacitywithhigherconfidence.Wereceived8responseswhoagreedtovoluntarilytakepartinourstudy.Amongthem,wechose6,sincetheothertwosubjectshavenotlivedinthatgeographiclocationformorethanayear.Thechosensubjectslivedintothatgeographiclocationformorethanthreeyears,whichensuredthattheyhavehigherfamiliaritywiththelocation.Moreover,only2subjectswereexpertsinthefieldofIR(subjectarearesearchedinthispaper),2wereComputingScienceresearchersand2wereconductingtheirresearchinEngineering.Wedidnotchooseacrowdsourcedevaluationsinceitwouldbedifficulttoensurethegeographicalknowledgeoftheevaluators,whichissignificanttofulfiltheobjectiveoftheevaluation.

Inordertoevaluatetheevaluationdatasets,weassigned100newsstoriestoeachsubject,whichensuredthateachstorywasevaluatedby2subjects.Inourstudy,eachsubjectwasgivenalinktotheweb-basedstudy,andanaccesskey.Uponaccessingthestudy,alistof100storieswasdisplayedtothem.Uponclickinganewsstory,thetitleandthenewsstorywasdisplayed,withthreequestions.Thesubjectswereaskedtofirstreadthewholecontent.Then,theywererequiredtocompletethefollowingtasks(i.e.toanswerthequestions):

T1: Ifanysortof locationinformation(i.e.nameofacity,country,etc.)relevant to thenewsispresent in thenews story?T1helpedus togather statistics related to thenumberof storieshavingnolocationinformation.Theseresultswillbefurtherusedtoevaluatetheeffectivenessofourapproach,i.e.abilitytofindageo-locationintheabsenceofdirectionmentionsinastory.

T2:Ifastreetname(e.g.JamaicaStreet)oralocalityinformation(e.g.Chelsea)relevanttothenewsispresentexplicitlyinthenews?T2gatheredstatisticsrelatedtothenumberofstorieshavinggranularlocationinformation(streetname,localityetc.).Theseresultswillbefurtherusedtodemonstratetheeffectivenessofourapproachinabsenceofdirectmentionsofastreet/localityinastory.

T3:Thesubjectswereshownalistoflocationsgeo-taggedbyGeo-Tagger,andaskedtochoosetheirrelevantones.T3helpedtofinderroneousgeo-taggedlocations.Theseresultswillbefurtheranalysedtoidentifythelimitationoftheproposedalgorithm.

Uponcompletingalltasks,thesubjectswerecontactedtovoicetheiropinionsontherelevanceofthelocationspresentedtothemduringthestudy.Theresultsobtainedduringthestudyandtheiranalysisalongwiththecollecteduseropinionsarediscussednext.

ReSULT

Inthissection,wepresentourevaluationresults.Atfirst,wepresenttheresultsofourstudywheretheindependentvariablesareuserresponses,CLIFFandGeo-Taggerapplicationwhereasthedependantvariablerepresentstheeffectivenessforeachindependentvariable.Then,wepresentacomparativeresultbetweenCLIFFandGeo-Taggeroverasampleof5,000newsstoriesfromRCV-1corpus.

T1 and T2Atfirst,weanalysetheresultswithrespecttoT1andT2.Theagreementlevel(i.e.bothsubjectshavingthesameopinion)inrelationtoT1was100andforT2was98.Thedisagreementlevelin


64

thecaseofT2isattributedtothefactthatsomesubjectsconsideredahighwayzone(e.g.A390)asapartofthestreetinformation,butothersdidnot.Wealsofoundslightdisagreementwithrespecttolocality.Thehighagreementlevelisattributedtooursubjectdemographics(knowledgeaboutlocation),whichindeedmadetheresultsobtainedfromtheevaluationvalid.

ThecomparisonofCLIFF,Geo-TaggerandUserevaluationovertheevaluationdatasetwithrespecttoT1andT2ispresentedinTable1alongwiththecorrespondingstandarddeviation(SDcolumninTable1)andillustratedinFigure4.From EVAL SET−

100,theGeo-Taggerisableto

geo-tag78%ofthenewsstories.Thismeansthatfortheevaluationdatasetcomprisingof300news,thesuccessratio(indicatingthetotalnumberofnewssuccessfullygeo-taggedwithrespectto300news)forCLIFFis67whereasfortheGeo-Tagger,itis93%-animprovementof26%.

The results from the Geo-Tagger and the user response with respect to T1 and T2 are alsocompared.Thesubjectsidentifiedlocationsin260newsoutof300(87%).Thisisduetothefactthatsubjectshavebeenabletoidentifylocationsfromthenameoforganisationsbecauseoftheirlocalknowledge.However,theyfailedtoidentifyasmanylocationsastheGeo-Tagger(87%vs.93%)sincetheycouldnotidentifythelocationsofsomeorganisations,e.g.primaryschools,pubs,etc.Ontheotherhand,thesubjectsidentifiedstreets/localitiesin143newsoutof300(48%)andhenceperformedbetterthanCLIFF(15%vs48%).Thisisattributedtotheirlocalgeographicalknowledgeallowingthemtoidentifymanystreetsorlocalities.However,theperformanceofthesubjectswasstillinferiorcomparedtotheGeo-Taggerwhichidentifiedstreets/localitiesin233newsoutof300(78%).ThereasonforthisisthattheGeo-Tagger,intheabsenceofastreet/localityname,usedthenamesoftheorganisationstoidentifystreetsorlocalities.Eventhoughthesubjectshadlocalknowledge,theycouldnotresolvethestreet/localityinformationforsomeorganisations.Inthewordsofonesubject:

Table 1. CLIFF vs. geo-tagger vs. user response

CLIFF Geo-Tagger User SD

T1:Location 67% 93% 87% 14%

T2:Street/Locality 15% 78% 48% 32%

Figure 4. Result plot for T1 and T2


65

“Ididnotfindanylocationinthenews.IunderstoodthatthesystemidentifiedthelocationusingXPrimarySchoolmentionedinthenews.However,Ihavenoideawherethisschoolis”.Anothersubjectsaid:“Evenifthenewsdidnothavealocation,Iwaspresentedwithlocationnames,whichseemedrelevanttothenews,forexample,containedthelocationoftheorganisationsheriffcourtpresentedinthenews.”AnotherinterestingobservationisthecomparisonofT1andT2withrespecttoGeo-Tagger.Geo-Taggerwasablefindlocationsin93%ofnewswhereasitcouldfindstreets/localitiesinonly78%ofnews.Thereasonisthatthereweremanynewsinwhichtherewasnomentionofstreetorlocalities.InmanysuchscenariosGeo-Taggerwasabletoidentifystreets/localitiesusingthelocationsoftheidentifiedorganisationsinthenews.However,thereweresituationswheretheStanfordNERcouldnotfindanyorganisationwhichresultedwithnewsnotgeo-taggedwithstreetsorlocalities.

Anon-parametricKruskal-Wallistestisusedtoexaminethesignificanceoftheresultsobtainedfrom the three independent groups (UserResponses,Geo-Tagger andCLIFF), in relation to thenumberofnewsstorieshavinglocationinformationinthem(T1).Thetestresultsshowedsignificantdifferences χ 2

75 172 2 0 001= = <( ). , , .df p .AMann-Whitneytest(post-hoc)wasconductedtofollowupthefindingsbyapplyingaBonferronicorrection,toreportalltheeffectsata0.016levelofsignificance.Thiscorrectionisusedtoreducethechancesofobtainingfalse-positiveresults(typeIerrors),whenmultiplepair-wisetestsareperformedonasinglesetofdata(especially,whenanon-parametrictestisusedasapost-hoctest).WeperformedaBonferronicorrection,bydividingthecriticalpvalue α =( )0 05. bythenumberofcomparisonsbeingmade(i.e.3).Thepost-hoctest

resultsshowedsignificantdifferencesbetweenallpairs p <( )0 001. ,except,UserResponsesand

Geo-Tagger ( p =( )0 018. ). Hence, the statistical test also suggests that the Geo-Tagger issignificantlyeffective(itsabilitytofindlocationsinanewsstory)thanCLIFF.

Anon-parametricKruskal-Wallistestisusedtoexaminethestatisticalsignificanceoftheresultsobtainedfromthethreeindependentgroups,inrelationtothenumberofnewsstorieshavingstreetl o c a t i o n s i n t h e m ( T 2 ) . T h e t e s t r e s u l t s s h owe d s i g n i f i c a n t d i f fe r e n c e sχ 2

233 20 2 0 001= = <( ). , , .df p .AMann-Whitneytest(post-hoc)wasconductedtofollowup the findingsbyapplyingaBonferroni correction, to report all the effects at a0.016 levelofsignificance.Thepost-hoctestresultsshowedsignificantdifferencesbetweenallpairs p <( )0 001. .TheseresultssuggestthattheGeo-Taggerissignificantlyeffective(infindingstreets/localitiesinanewsstory)thanCLIFF.

Insummary,theGeo-TaggerhaseffectivelyfoundspatiallocationsbetterthanCLIFFevenintheabsenceofsuchinformationinanewsstoryusingourcontextualmodel(Section:GEO-TAGGINGMODELLING).

T3Next,weanalysetheresultswithrespecttoT3.TheagreementlevelofthesubjectsinrelationtoT3was95%.Thedifferenceinagreementlevelisattributedtothefactthatsomesubjectschosesubsetlocationsasnon-relevant,evenifthisaccuratelyrepresentedthenews.Forexample,locationsinanewsmayinclude:i)CityXandii)RoadA,CityX.Ourapproachpresentedsuchlocationsbecausewewanted to show thecity level scope for anews tohelp the subjects in identifyingirrelevantstreet/locality information.Moreover, in thecaseofsportsnews(e.g. footballmatchbetweenChelseaVsArsenal,bothinLondon,UK),somesubjectsconsideredthenameoftheteamsasrelevantlocations,buttherestdidnot.Furthermore,somesubjectschoseallirrelevantoptionsandothersdidnot.Thisvariationontheagreement,webelieve,isacommonphenomenonsincedifferentuserswillhavedifferentsubjectiveopinionsregardinghowalocationcanbeinferredevenintheabsenceofadirectmention.Inadditiontheiropinionwillbeinfluenceddependingontheirfamiliaritywithaparticularlocation.


66

TheGeo-Tagger identified1118 streets/localities in300news including repetitiveentries inmanylocations,meaningastreet/localitywasfoundinmorethanonenews.Outofthese,79streets/localitiesweretaggedasirrelevantbytheuserswhichisaround7%of1118,whichwasstatisticallyinsignificantaccordingtoMann-Whitneytest(p<0.001),i.e.demonstratedtheeffectivenessofthealgorithm,indicatinganaccuracyofaround93%(seeFigure5).

Foreachstorywitherroneousstreets/localitiesidentifiedbyasubjectinT3,theresultswerefurtherexploredtofindthenatureoferrorsbythetwoexperts.MostoftheerrorsareattributedtothewayGeocode.FarmAPIgeocodedalocationname.Forexample,whenAmerica/HollandoranyothercountrieswerepassedtoGeocode.FarmAPIwithafocusofacity,theAPIresolvedthenametoastreet(e.g.AmericaStreetorHollandStreet)ofthatcity.Webelievethiscanbecorrectedbyapplyinganexclusionpolicythatwillexcludethenameofacountryoutsidethechosenscopeunlessthenameissucceededbyareferenceofstreet(e.g.AmericaStreet)inanews.Inaddition,theAPIalsocouldnotgeocodeafeworganisationsproperlyandresolvedtheorganisationnameintoacompletelyanotherorganisationhavingasimilarnameinanothercity.Anexampleofsuchanorganisationisasupermarkethavingmultiplebranchesinacityorbranchesinmultiplecities.Inthewordsofanothersubject:“Sometimesthelocationwasnotcorrectfortheorganisation.BecauseIcouldseethenewsaboutcityXandcontainedthereferenceoforganisationAwhichwascompletelyresolvedtoalocationbelongingtocityZ”.Onewaytohandlethisistorestrictthescopeduringorganisationresolutiontothecitymainlyfocusedinthenews.Weplantorectifythisprobleminfuture.

Reuter RCV-1ThecomparativeresultispresentedinTable2.AsevidentfromTable2,Geo-TaggerperformedslightlybetterthanCLIFF.ThisimprovementisattributedtothefactthatGeo-Taggerexploited

Figure 5. Result plot for T3

Table 2. CLIFF vs. geo-tagger using RCV-1

Method Sample Size Accuracy

CLIFF 10,000 90%

Geo-Tagger 10,000 92%


67

organisations,inadditiontoaspatiallocationsandthishelpedGeo-Taggertogeo-tagnewsevenintheabsenceofdirectmentionsofaspatiallocations.TheaccuracyofGeo-Taggeralmostresonate with the results obtained using our data set. This demonstrates the effectivenessofourapproachoverdifferentdata setsofnewsstories.TheaccuracyofCLIFFusingourReuterdatasetisslightlylowerthanwhatwasreportedin(Ignazio,2014).Theexactreasonisdifficulttoestablishsincethesampledistributionin(Ignazio,2014)andusedinthispaperwillmost likelybedifferenteventhoughtheyoriginatedfromthesamedataset.Wecouldnotcomparetheeffectivenessofourapproachinfindingthestreet/localitylevelgranularityoverthis10,000datasetsincethenewsstoriesinRCV-1arenotgeo-taggedatthisgranularityandhence it isnot reported in thispaper.Furthermore, it is important to realise that,eventhoughasmallerdata-setwasusedforcarryingoutthiscomparison,itdoesnotinvalidatethecomparisonwhatsoeverasthesamedatasethasbeenusedforcarryingoutthecomparisonbetweenCLLIFFandourapproach.

dISCUSSIoN

Inthissection,weexploretheresearchquestionsmentionedinSection:INTRODUCTION,highlighttheadvantagesofourapproachanddiscussitslimitations.

Research QuestionsWithrespecttoRQ-1,wehavedevelopedamathematicalmodelofcontextforanewsstory(Section:GEO-TAGGINGMODELLING)consistingofinformationthatanswers5corequestionsofwho,what,where,whenandwhy.Inthisway,thecontextofanewsessentiallycharacterisesthenews,justlikethewayacontextofausercharacteriseshis/hersituation.Thisisasimpleyetpowerfulmodelasseveralof itscomponentssuchaspeople,organisationsand locationscanbeutilisedforgeo-taggingnewsstories.Moreover,someofitscomponentssuchaslocationandcategorycanbeeasilyexpandeddependingonthegranularityofourchoice.Wehavealsoformalisedamodelofspatialinformationandhaveshownhowacontextualinformationcanberelatedwithaspatialinformationbydevelopingamathematicalmodelofgeocodingandgeo-tagging.Ourmodel,inessence,providesablue-printfordevelopingalgorithmstobeutilisedforthegeo-taggingprocess.Wehavediscussedhowwehavedevelopedsuchalgorithmsutilisingourmodelthatencodethegeo-taggingprocessintheimplementationsection.

TheeffectivenessofourmodelbecomesapparentwhenweanswertheRQ-2.Wehaveevaluatedourgeo-taggingprocedureovertwodatasets.Thefirstdatasetconsistsof10,000newswhichhavebeenrandomlycollectedfromtheReuterRCV-1Corpusconsistingof800,000news.Theseconddatasetconsistingofaround11,000newsretrievedfromdifferentnewswebsitesforaroundaperiodofoneyearintheUK.Theresultsofourevaluation(Section:RESULT)clearlyshowthatourmodelofcontextcanbeexploitedtogeo-tagnewsstoriesevenintheabsenceofdirectmentionsoflocations(seeFigure1).Thisisachievedbygeo-taggingnewsstoriesnotonlywithlocations,butalsowithotheraspectsofthecontextmodel.

AsforRQ-3,theresultsoftheevaluationalsodemonstratetheapplicabilityofexploitingdifferentaspectsofthecontextforgeo-taggingnewsatthestreetlevel.Themainchallengeregardingthisistodetectandresolveambiguitythatmayoccuratthisgranularity(street/locality).Ourapproachsuggeststhatitisevenpossibletoresolvelocationsatthestreetlevelevenifthereisnomentionofastreet.Inaddition,ourapproachhasalsobeenabletoassociatestreetswithacitybothmentionedwithinanews(seeFigure5).

AdvantagesAnumberofadvantagesofourapproacharehighlightedbelow:


68

• Ourapproachenablesgeo-taggingnewsstoriesevenintheabsenceofdirectmentionsoflocationsbyutilisingthecontextualinformation,especiallyorganisations;

• Havingtheabilitytogeo-tagnewsatthestreet/localitylevelopensuptheopportunitytointegrateanynewscollectionswithothermulti-modaldatawhicharealreadyequippedwithsuchfine-grainedlocationinformation.Thewholedatasetcanthenbeutilisedtocreateauniquefine-grainedspatio-temporalretrievalsystem;

• AllowingauseroftheGeo-Taggertofixategeographicalscopeswiththehelpofanexternalinputfilewillenablehim/hertogeo-tagnewsoverthesamedatasetfordifferentscopesatdifferenttimes.Unlikeanyexistingapproachthatdeterminessuchscopesautomatically,thiswillenableausertogeo-tagnewsby“zoomingin”,or“zoomingout”,withinageographiclocationfordifferentscenarios.Forexample,ausercanrestrictthefocustoaparticularcitywithinacountrybyusingaparticularboundingboxcoordinatesatonetimeorcanrestrictthefocusoverthewholecountryinanothertimeusinganotherboundingboxcoordinates.

LimitationsThemainlimitationoftheapproachistheinaccuracyofsomeofthestreet/localitylocations.Asitturnsout,mostofsuchinaccuraciesareattributedtotheexternalgeocodingAPIthatisusedtoresolvelocationnamesintogeographiccoordinates.OnepossiblewaytorectifythisproblemistocombinetheresultsfromothergeocodingAPIssuchasGooglePlace(Google,2016)withtheresultofGeocode.FarmAPIandthenapplyapolicytoresolvethecorrectlocation.Weplantoincorporatethisintoourapplicationinfuture.

Anotherlimitationisthewaytheuserstudywascarriedwithasmallerdatasetandwithasmallnumberofsubjects.Bothareattributedtothefollowingreasons:

• Thereisnopubliclyavailabledatasetgeo-taggedwiththegranularityofstreet/locality;• Wehadtoascertainthatthesubjectshavegoodlocalknowledgewithinthegeographicalscope.

Thisruledoutthepossibilityforalarge-scalecrowd-sourcedevaluation.

CoNCLUSIoN

Inthispaper,wehavedevelopedamathematicalmodelofcontextandgeo-taggingwithrespecttonewsstoriesandhaveexploitedthatmodeltogeo-tagnewsstoriesevenintheabsenceofdirectmentionsoflocationsaswellasatthegranularityofstreet/localitylevel.Forthis,wehave incorporated our model with a geo-tagging algorithm and utilised off-the-shelf toolsandexistingGeocodingAPIs.Thedatasetgeneratedafterapplyingourapproachhasbeenevaluatedwith6users.Inaddition,wehaveevaluatedourapproachover10,000newsfromtheReuterdataset.TheresultsdemonstratetheeffectivenessofourapproachagainstexistingpubliclyavailableAPIs.

In future, we plan to extend the evaluation with a larger sample and subjects, once theapproachisimprovedandthentoconductanexperimentwithindependentstyledesign.Attheend,weaim to release thedatasetcontaining locationsat thegranularityof street/localitiesto the research community for their research. As we have found that different subjects havedifferentopinionsregardingafewspecificlocations(e.g.locationsregardinganorganisation),itwillbeinterestingtoincorporateaconfidencelevelwhiletheusersevaluatetheapproach.Then,resultscontainingaverylowconfidencelevelcanbespeciallytreatedorevenexcludedduringtheoverallevaluation.

Theultimategoalofourapproachistointegrateournewsdatacollectionwithanarrayofmulti-modaldatafromdifferentsocialmediasuchasFlickr,Twitter,YouTubeaswellasfromdifferent


69

wearablesensorssuchaslifeloggersandGPStrackersandtodevelopalocation-basedinformationfusionsystemwherelocations,amongotherfactors,willactasthegluetobindtogetheralldatasets.Inthisregard,theproposedapproachwillbeanessentialingredientofthesystem.However,webelievethattheapproachcanbeadaptedforgeo-tagginganynewsstoriesinanyotherscenarioswithlittleornofurthermodifications.


70

ReFeReNCeS

Abowd,G.D.,Dey,A.K.,Brown,P.J.,Davies,N.,Smith,M.,&Steggles,P.(1999,September).Towardsabetterunderstandingofcontextandcontext-awareness.Proceedings of theInternational Symposium on Handheld and Ubiquitous Computing(pp.304-307).Springer.doi:10.1007/3-540-48157-5_29

Amitay,E.,Har’El,N.,Sivan,R.,&Soffer,A.(2004,July).Web-a-where:geotaggingwebcontent.Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval(pp.273-280).ACM.

Andogah,G.,Bouma,G.,&Nerbonne,J.(2012).Everydocumenthasageographicalscope.Data & Knowledge Engineering,81,1–20.doi:10.1016/j.datak.2012.07.002

BericoTechnologies.(n.d.).CLAVIN.Retrievedfromhttps://clavin.bericotechnologies.com/

MIT.(n.d.).CLIFF.CenterforCivicMedia.Retrievedfromhttp://cliff.mediameter.org/

D’Ignazio,C.,Bhargava,R.,Zuckerman,E.,&Beck,L.(2014).Cliff-clavin:Determininggeographicfocusfornews.Proceedings of KDD ‘14.

Ding,J.,Gravano,L.,&Shivakumar,N.(1999).Computinggeographicalscopesofwebresources.

Errico,M.,April,J.,Asch,A.,Khalfani,L.,Smith,M.,&Ybarra,X.(1997).The evolution of the summary news lead.MediaHistoryMonographs-OnLineJournalofMediaHistory.

Garbin,E.,&Mani,I.(2005,October).Disambiguatingtoponymsinnews.Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing(pp.363-370).AssociationforComputationalLinguistics.doi:10.3115/1220575.1220621

Geocode.FarmAPI.(n.d.).Retrievedfromhttps://www.geocode.farm/

Goldberg,D.W.(2008).A geocoding best practices guide.Springfield,IL:NorthAmericanAssociationofCentralCancerRegistries.

GooglePlace,A.P.I.Retrieved25April,2016,fromhttps://developers.google.com/places/

Gross,T.,&Klemke,R.(2002).ContextModellingforInformationRetrieval-RequirementsandApproaches.ProceedingsofICWI(pp.247-254).

Hull,R.,Neaves,P.,&Bedford-Roberts,J.(1997,October).Towardssituatedcomputing.ProceedingsoftheFirstInternationalSymposiumonWearableComputers(pp.146-153).IEEE.doi:10.1109/ISWC.1997.629931

Leidner,J.L.(2006).Anevaluationdatasetforthetoponymresolutiontask.Computers, Environment and Urban Systems,30(4),400–417.doi:10.1016/j.compenvurbsys.2005.07.003

Lieberman,M.D.,Samet,H.,Sankaranarayanan,J.,&Sperling,J.(2007,November).STEWARD:architectureofaspatio-textualsearchengine.Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems(p.25).ACM.doi:10.1145/1341012.1341045

MetaCarta.(n.d.).GeoTag.Retrievedfromhttp://www.metacarta.com/products-platform-geotag.htm

OPENCALAIS.ThomsonReuters.Retrievedfromhttp://new.opencalais.com

Placespotter.Yahoo.Retrievedfromhttps://developer.yahoo.com/boss/geo/docs/key-concepts.html

ResearchDataAustraliaContentProvidersGuide.Whatisspatiallocation?Retrievedfromhttp://guides.ands.org.au/rda-cpg/spatiallocation

Ryan,N.,Pascoe,J.,&Morse,D.(1999).Enhancedrealityfieldwork:Thecontextawarearchaeologicalassistant.Bar International Series,750,269–274.

Sandhaus,E.(2004).ReutersCorpora(RCV1,RCV2,TRC2).Retrievedfromhttp://trec.nist.gov/data/reuters/reuters.html


71

Sandhaus,E.(2008).TheNewYorkTimesannotatedcorpusLDC2008T19.Retrievedfromhttps://catalog.ldc.upenn.edu/LDC2008T19

Schilit,B.N.,&Theimer,M.M.(1994).Disseminatingactivemapinformationtomobilehosts.IEEE Network,8(5),22–32.doi:10.1109/65.313011

StanfordNamedEntityRecognizer(NER),&theStanfordNaturalLanguageProcessingGroup.(n.d.).Retrievedfromhttp://nlp.stanford.edu/software/CRF-NER.shtml

Md Sadek Ferdous is a Research Fellow at the University of Southampton, UK. He received his PhD degree in Computing Science from the University of Glasgow, UK in 2015. His research interests include Information Retrieval, Security, Privacy, Identity Management, Trust Management and Blockchain technologies.

Soumyadeb Chowdhury completed his Masters and PhD in Computing Science from University of Glasgow. Employed as a Research Assistant thereafter working in the Integrated Multimedia City Data project in Urban Big Data Center (funded by EPSRC). Currently, lecturer of InfoComm Technology in Singapore Institute of Technology. Research interests include Information Retrieval, Information Security and Privacy, Human Computer Interaction, Pervasive Sensing Technologies and Visualisation Metaphors and Usable Security.

Please recommend this Publication to your librarianFor a convenient easy-to-use library recommendation form, please visit:http://www.igi-global.com/IJIRR

Volume 7 • Issue 4 • October-December 2017 • ISSN: 2155-6377 • eISSN: 2155-6385An official publication of the Information Resources Management Association

all inquiries regarding iJirr should be directed to the attention of:Zhongyu (Joan) Lu, Editor-in-Chief • [email protected]

all manuscriPt submissions to iJirr should be sent through the online submission system:http://www.igi-global.com/authorseditors/titlesubmission/newproject.aspx

Advanced software development related information retrieval issues • Classification • Clustering approaches • Content and context awareness and environment awareness • Filtering system • Index techniques • Information mining • Information retrieval in cloud computing issues • Information retrieval in education • Information retrieval in healthcare • Information retrieval in science, engineering, and technologies • Information retrieval in social science, social behaviors • Information retrieval with business, commerce, etc. • Information retrieval with Internet of Things • Knowledge mining • Link analysis • Machine learning on documents • Message passing • Metadata and XML retrieval • Mobile computing related information retrieval issues • Multimedia retrieval • Performance measures • Query languages and optimization • Retrieval architecture • Retrieval evaluation • Retrieval languages and operations • Retrieval strategies • Retrieval systems • Retrieval theories • Retrieval with big data technologies • Scalability • Search algorithms • Search engine • Social media related information retrieval issues • Taxonomy theory and applications • Text mining • Text, document, and image retrieval • Web mining

Coverage and major topiCsThe topics of interest in this journal include, but are not limited to:

The mission of the International Journal of Information Retrieval Research (IJIRR) is to provide an outlet for researchers to present their research and obtain inspiration in the areas of information retrieval, computer science, and information science. Focusing on theories, methods, technologies, and tools, IJIRR is aimed towards information engineers, scientists, and related professionals. This journal exhibits expert experiences and state-of-the-art technologies in search and storage of texts, images, videos, and other data to stimulate innovation and exploration of improved approaches for conquering industry problems.

mission

Ideas FoR specIal Theme Issues may be submITTed To The edIToR(s)-In-chIeF

international Journal of information retrieval research

call for articles

Documents

Table of Contentspublications.aston.ac.uk/30749/1/Geo_tagging_news_stories_using... · 50 Geo-Tagging News Stories Using Contextual Modelling; Md Sadek Ferdous, University of Southampton,