131
Transport mode detection using Cellular Signaling Data (Case study of Graz and Vienna, Austria) GEO 620 Master’s Thesis Author Kimberley Chin Jiaqi 16-721-185 Supervised by Dr Haosheng Huang, University of Zurich Christopher Horn, Invenium Data Insights Dr. Ivan Kasanicky, Parkbob Faculty representative Prof. Dr.Robert Weibel 29.06.2018 Department of Geography, University of Zurich

Transport mode detection using Cellular Signaling Data (Case study of Graz …e2eb001e-580e-4528-9d2d... · 2018. 8. 10. · Transport mode detection using Cellular Signaling Data

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

  • TransportmodedetectionusingCellularSignalingData

    (CasestudyofGrazandVienna,Austria)

    GEO620Master’sThesis

    AuthorKimberleyChinJiaqi

    16-721-185

    SupervisedbyDrHaoshengHuang,UniversityofZurichChristopherHorn,InveniumDataInsightsDr.IvanKasanicky,ParkbobFacultyrepresentativeProf.Dr.RobertWeibel

    29.06.2018DepartmentofGeography,UniversityofZurich

  • ii

    ACKNOWLEDGEMENTS

    AsIembarkonmyfinallegofmystudiesattheUniversityofZurich,Iwouldliketoexpressmyheartfeltthankstoeveryonewhohastaught,helpedandsupportedmethroughoutmystudiesandmaster’sthesis.Thisjourneywouldhavebeenalotharderwithout:

    • MysupervisorDrHaoshengHuang.Thankyouforallthesupport,ideasandtimeyouhavegivenmeduringthispastyear.Withoutyourcriticalinputandconstantguidancethiswouldnothavebeenpossible.

    • Co-supervisorsChristopherHornofInveniumDataInsightsandDr.IvanKasanickyofParkbob.Manythanksforgivingmetheopportunitytoworkonatopicinsuchan exciting field, and all your valuable insights and suggestions. Special thanksgoes to to Christopher for providing the data on their own as well as fromAlexandraLechnerofTUGraz.

    • My dear friends and family and especially Nick, for proofreading and moreimportantly, for the much needed emotional support and encouragement youhavegiveninmorewaysthanone.

  • iii

    Contact

    AuthorKimberleyChinRousseaustrasse61,8037Zürich,Switzerlandkimberleychin@hotmail.comSupervisorDrHaoshengHuangGeographicInformationScience(GIS)DepartmentofGeographyUniversityofZurichWinterthurerstrasse1908057Zürich,Switzerlandhaosheng.huang@geo.uzh.chCo-SupervisorChristopherHornScienceTowerWaagner-Biro-Strasse100/118020Graz,[email protected]

    Co-SupervisorDrIvanKanasickyParkbobGmbhTreustrasse22-241200Vienna,[email protected]

  • iv

    ABSTRACT

    TheriseofnewBigDatasourcessuchascellularnetworkdatahasallowedustoobserveandcomprehend human behavior and the interactions between them and the environment on amuch deeper level. This leads to both new research opportunities as well as challenges.Transportmode detection plays a key role directly or indirectly inmany fields such as urbanplanning, epidemiology, transportation science and many more. Improving travel demandsurveysisanimportantdrivingfactorandmotivationinthisresearch.Aspectslikescalabilityofthese alternatives are critical considerations in their development in terms of data collectionand processing. Researchers have looked to Global Positioning Systems (GPS) in the form ofloggers or GPS-enabled mobile phones, as well as Call Detail Records (CDR) as alternatives.Whilethesemethodshaveshownpromisingresults,theyarenotwithoutflaws.The aim of this research is thus to design a methodology that can detect modes oftransportation from another more unknown type of data, cellular signaling data. Cellularsignaling data does not require overhead as it can be described as data crumbs leftover byeveryday usage of one’s cellphone. Based on the results, we can present a deeperunderstanding into thedata characteristics and itspotential inunderstandinghumanmobilityflowsincities.Thisresearchwillpresentasetofsupervisedandunsupervisedmethodsthatareapplied to data that is collected in Vienna andGraz (Austria) in two separate data collectioncampaigns by a group of 2 and 9 participants. The results from the proposedmethods showpromise and are comparable to existing GPS studies of the same aim. The best performingmethod, a hybridmethod of rule-based heuristics and supervised random forestmanaged tocorrectlydistinguishbetweenU-Bahns, S-Bahns, cars,bikesandwalkmodes73%of the time.Rule-based methods perform especially well on rail modes (U-Bahn/S-Bahn). For the moresimilarmodes(cars,bikesforexample),randomforestdoesthebestatdistinguishingbetweenthesemodes.While unsupervisedmethods are not able to achieve the same accuracies, theresultsarestillcomparablewitha68%accuracyachievedwiththepartitioning-around-medoidtechnique.Keywords: Transport mode detection, cellular signaling data, fuzzy logic, rule-based heuristic,randomforest,unsupervisedclustering,principalcomponentanalysis

  • v

    TABLEOFCONTENTS

    ACKNOWLEDGEMENTS............................................................................................................II

    CONTACT................................................................................................................................III

    ABSTRACT...............................................................................................................................IV

    LISTOFFIGURES...................................................................................................................VIII

    LISTOFTABLES........................................................................................................................X

    INTRODUCTION.....................................................................................................1CHAPTER1

    1.1 ContextandMotivation........................................................................................................1

    1.2 Problemstatementandresearchaims.................................................................................3

    1.3 Mainexpectedoutcomes.....................................................................................................5

    1.4 Thesisstructure.....................................................................................................................5

    BACKGROUNDANDRELATEDWORK......................................................................6CHAPTER2

    2.1 Theworldoftransportation..................................................................................................6

    2.2 Transportmodedetection....................................................................................................7

    2.3 TransportmodedetectionusingGlobalPositioningSystemsdata......................................8

    2.3.1 Pre-processing...........................................................................................................9

    2.3.2 Modedetection.......................................................................................................10

    2.4 Transportmodedetectionusingcellularnetworkdata.....................................................16

    2.4.1 Mobilephonenetworkstructure.............................................................................17

    2.4.2 DataGeneration.......................................................................................................17

    2.4.3 SpatialandTemporalGranularity............................................................................19

    2.4.4 Pre-processing.........................................................................................................20

    2.4.5 Modedetection.......................................................................................................23

    2.5 Variableselection................................................................................................................27

    2.6 Ethicalissues.......................................................................................................................31

    2.7 SummaryandResearchGaps.............................................................................................32

  • vi

    DATAANDITSCHARACTERISTICS.........................................................................35CHAPTER3

    3.1 CellularSignalingData........................................................................................................35

    3.1.1 Spatialresolution.....................................................................................................36

    3.1.2 Temporalresolution................................................................................................41

    3.2 Otherdata...........................................................................................................................42

    METHODOLOGY...................................................................................................44CHAPTER4

    4.1 Methodologicalprocedure.................................................................................................44

    4.2 Computingenvironment.....................................................................................................49

    4.3 Pre-processing....................................................................................................................49

    4.4 Featurelist..........................................................................................................................52

    4.4.1 Featuresattheobservationlevel............................................................................52

    4.4.2 Featuresatthetrajectorylevel................................................................................53

    4.5 Rule-basedheuristics..........................................................................................................56

    4.6 FuzzyLogicSystems............................................................................................................61

    4.7 MachineLearning...............................................................................................................65

    4.8 UnsupervisedLearning.......................................................................................................65

    4.9 Variableselection...............................................................................................................67

    RESULTS...............................................................................................................74CHAPTER5

    5.1 Validation............................................................................................................................74

    5.2 Parametersettings..............................................................................................................76

    5.2.1 FuzzyLogicSystems.................................................................................................76

    5.2.2 RandomForest.........................................................................................................77

    5.2.3 UnsupervisedK-meansandPAMwithRF................................................................78

    5.3 Pre-processingandtrip-segmentation...............................................................................78

    5.4 Supervisedmethods...........................................................................................................81

    5.4.1 WithRBHvs.withoutRBH.......................................................................................85

    5.4.2 RandomForestvsFuzzyLogic..................................................................................92

    5.4.3 Variablesformodedetection..................................................................................95

    5.5 Unsupervisedmethods.......................................................................................................97

    5.6 Summaryofmainresults..................................................................................................102

  • vii

    DISCUSSION.......................................................................................................104CHAPTER6

    6.1 ResearchQuestion1.........................................................................................................104

    6.2 ResearchQuestion2.........................................................................................................105

    6.3 Limitationsofthestudy....................................................................................................109

    CONCLUSIONANDFUTUREWORK.....................................................................111CHAPTER7

    REFERENCES........................................................................................................................114

    PersonalDeclaration..................................................................................................................119

  • viii

    LISTOFFIGURES

    Figure1RulesusedinGongetal.'spaperinthemodedetectionprocess(Gongetal.,2012)...11Figure2RulesusedinBohte&Maat’spaperformodedetection(BohteandMaat,2009).......12Figure 3 Fuzzy Logic membership functions generated by human expertise (Axhausen and

    Schüssler,2009)....................................................................................................................13Figure 4 a) Location Area and Base Stations b) Periodic updates c) Handovers d) Mobility

    locationupdates(Calabreseetal.,2015).............................................................................18Figure 5 Illustration of data pre-processing to extract stay locations (Jiang et al., 2013) with

    distanceandtimeconstraintsof300mand10minutes......................................................22Figure6Detectionofstaysusinggeometry(Widhalmetal.,2015).............................................23Figure7Tripdataclusteredtotwosubgroups,drivingandpublictransit.Thearrowsshowthe

    average travel times of the subgroups. Black lines are the travel times as reported byGoogleMaps.(Wangetal.,2010).......................................................................................24

    Figure8FrameworkproposedbyQuetal.(2015)formodedetectionwithCDRdatathroughaspeed split, augmenting the dataset with transportation network information and utilityapproximations.....................................................................................................................25

    Figure 9 Average Euclidean distance between subsequent observations during stationary,walkinganddrivingperiods.(Sohnetal.,2006)...................................................................27

    Figure10Exampleoffirsttwocomponentsbasedontwovariables,wordlengthandnumberoflinesindictionarydefinition.Greenlinesrepresentthevectorswhosemaindirectionleadstothemaximumsumofsquareddistancesfromthepointstothevectors(AbdiandWilliams,2010).....................................................................................................................................30

    Figure11VisualizationofCSDpointsinVienna...........................................................................38Figure12VisualizationofCSDpointsinGraz...............................................................................38Figure 13: Examples of differing spatial resolution of observations, each color representing a

    differentU-Bahntrajectory..................................................................................................37Figure14DistributionofrawCSDobservationstocorrespondingGPSobservations..................39Figure 15 CSD (yellow) and corresponding GPS trajectory (black) when commuting between

    citieswherecellularcoverageisnotasstrong.....................................................................40Figure16DistributionofrawCSDobservationstocorrespondingGPSobservations inGrazand

    Vienna...................................................................................................................................41Figure 17 Distribution of the first quartiles,medians and third quartiles of the time intervals

    betweenobservationsofeachuser......................................................................................42Figure18ExampleoftwoU-Bahntrajectories.Thepointsareindividualobservationsoftheuser

    taking theU-Bahn.Bluecircles representa trip in themorningandpinkcircles representtripsintheafternoon.RedtrackssymbolizetheU-Bahnnetwork......................................45

    Figure19:ExampleofCSD(pentagons)generatedforwalk(yellow)andbike(blue)comparedtotheircorrespondingGPStracks(circles)...............................................................................46

    Figure20:MethodologicalFramework........................................................................................48Figure21:Exampleofattributesextractedforeachobservation................................................53Figure22:Version1ofrule-basedalgorithmwhenallconsiderationsasmentionedinsection4.1

    are taken into account, and when concepts from other GPS studies using rule-basedheuristicsareborrowed........................................................................................................57

  • ix

    Figure 23: Cumulative distribution function of ubahndist. The diagram shows all U-Bahntrajectories have an average distance of less than 200m to the U-Bahn network for theViennadataset......................................................................................................................58

    Figure24Cumulativedistributionofpercentile95accoftrajectoriesofvariousmodesinVienna..............................................................................................................................................59

    Figure 25 Cumulative distribution of vel.rolling2.median of trajectories of various modes inVienna...................................................................................................................................60

    Figure26Secondversionofrule-basedalgorithm,simplifiedandimproved..............................60Figure27Trapezoidalmembership functionsandparametervaluesbasedonnormaliseddata.

    Differentcolorsmeandifferentmemberships,blue-small,red-medium,green-high........64Figure28PlotofZ-scoresafterBorutaalgorithmisrunonvariables.Blueboxplotsrepresentto

    minimal, average andmaximum Z score of a shadow attribute. Red and green boxplotsrepresent Z scores of respectively rejected and confirmed attributes. Yellow are thetentativevariables................................................................................................................69

    Figure29Variablesrankedbasedontheircontributionstothefirstprincipalcomponentinthedatasetforthisstudy(Vienna).............................................................................................70

    Figure30Variables rankedbasedon theircontributions to thesecondprincipalcomponent inthedatasetforthisstudy(Vienna).......................................................................................71

    Figure31ComparisonofeignevaluesandvaluesfromtheBroken-stickmodel..........................73Figure32SensitivityanalysisforFLSparametersofRB_FLELmethodontheViennesedataset. 77Figure33SensitivityanalysisforRFparameters..........................................................................78Figure34ResultsforViennadataset............................................................................................82Figure35ResultsforGrazdata.....................................................................................................82Figure36PrecisionofeachmodeinVienna.................................................................................87Figure37RecallofeachmodeinVienna......................................................................................87Figure38F1ofeachmodeinVienna...........................................................................................87Figure39PrecisionofallmodesinGraz.......................................................................................88Figure40RecallofallmodesinGraz............................................................................................88Figure41F1statisticforallmodesinGraz...................................................................................88Figure42CDFfor95thpercentilespeedsinprivatemodesinVienna.........................................93Figure 43 Example of unmatched bike start point (red circle). Small points represent GPS

    observations, larger pentagons represent CSD observations. Blue indicates bicycle andgreenindicatescars..............................................................................................................94

    Figure44Precision,RecallandF1-scoreforRB_PAMandRB_KMEANSinViennesedataset...101

  • x

    LISTOFTABLES

    Table1Examplesoffuzzyrulesformodedetection(AxhausenandSchüssler,2009).................14Table2DistanceofrawCSDtocorrespoindingGPSpoints in(allpointsandintheurbanstudy

    areas)....................................................................................................................................39Table3Listofmodedetectionmethodsproposedandtheirabbreviations...............................47Table4OverviewofmainR-packagesused.................................................................................50Table5:SpatialresolutionofCSDpointsinmetres......................................................................52Table6:Listofattributesextractedfromeachobservation........................................................52Table7:Listofspatialfeaturesextractedforeachtrajectory......................................................54Table8Listofdescriptivemotionfeaturesextractedfromeachtrajectory................................56Table9Parametervaluesofmembershipfunctionsofnormalizeddata.....................................63Table10ImportanceofcomponentsinPCAofVienna................................................................70Table11InterpretationofCohen'sKappa(McHugh,2012).........................................................76Table12Codeformethods..........................................................................................................79Table13Tableofnumberoftripsextractedfromfirstdatacollection(A)..................................80Table14Tableofnumberoftripsextractedfromseconddatacollection(C).............................80Table15totalmodesharesofdata(A+C)..................................................................................80Table16Precision,RecallandF1valuesforeachmodeinVienna..............................................83Table17Precision,RecallandF1valuesforeachmodeinGraz..................................................84Table18ConfusionmatrixofRule-BasedHeuristics+RandomForestVienna............................89Table19ConfusionmatrixofRule-BasedHeuristics+FuzzyLogicwithExistingLiteratureVienna

    ..............................................................................................................................................90Table 20 Confusionmatrix of Rule-BasedHeuristics + Fuzzy Logicwith variables fromexisting

    literatureGraz.......................................................................................................................91Table21ConfusionmatrixofRule-BasedHeuristics+RandomForestGraz................................91Table22ComparisonofspeedandaccelerationprofilesofwalktrajectoriesinGrazandVienna

    ..............................................................................................................................................92Table23ConfusionmatrixoftrajectoriesthatareassignedmodesintheFLELstepofRB_FLEL93Table24ConfusionmatrixoftrajectoriesthatareassignedmodesintheRFstepofRB_RF......93Table 25 Variables selected for FL by RF and existing literature (in no particular order), and

    variablesselectedbyPCAforunsupervisedmethods..........................................................97Table26ExampleofmedoidsofeachclusterforPAMandthemodeassigned..........................99Table27ClustercentroidsforK-means........................................................................................99

  • xi

    Table28ConfusionmatrixoftrajectorieswhosemodesareassignedbythePAMstep...........100Table29ConfusionmatrixoftrajectorieswhosemodesareassignedbytheK-meansstep.....100Table30PerformanceofunsupervisedmethodsintheViennesedataset................................101Table31PerformanceofunsupervisedmethodsintheGrazdataset.......................................102Table32OverviewofthecompositionofclustersgeneratedinPAMonGrazdataset.............102Table33OverviewofthecompositionofclustersgeneratedinK-meansonGrazdataset.......102

  • 1

    CHAPTER1Introduction

    1.1 ContextandMotivation

    “Researchinhumanmovementintimeandspacehasbeenaroundforatleastoverfivedecades”-(Weiner,1986).

    ResearchinhumanmovementhasbeengivenahugehelpinghandwiththeriseofnewBigDatasourcessuchasmobilephonecalldetailrecordsorsocialmediarecordswithlocationtags.Wenowhavetheabilitytoobserveandcomprehendhumanbehaviorandhowthey interactwiththeir environment on an unprecedented level of detail. This leads to both new researchopportunities aswell as challenges. Such location-based data can give us valuable insights tohumanmovement inbothtimeandspaceoncenewtechniquesaredevelopedtoharnessthispotential (Zooketal.,2015).Theglobal spreadofmobile technologies forcommunicationhasbroughttheworldclosertogetherwhilstalsoresultingintheexistenceofanunparalleleddatasourcecapableofdescribingdiversedealings in theworldofhumanandsocialbehavior.Oneexample of this is the Call Detail Records, the byproduct of billing services for calls, whichincludetimestampsandlocationcoordinatesofthesetransmissions.Thesewidespreaddatasetscan reveal compelling information on patterns on an individual as well as collective scale(Blondel et al., 2015).This is a passive data type which is usually by-products of existingstructuresthatweregeneratedforpurposesthatwerenotforbutcouldpotentiallybeusedforresearch(Chenetal.,2016).Otherpassivedatatypesincludesocialmediadatathathavebeenposted voluntarily byonlineusers (Gonzalez et al., 2008) and transit carddataused in publictransportsystems(Hasanetal.,2013;Liuetal.,2009).Oneimportantgroupofbenefactorsofthisdataisthoseinthefieldoftransportscience.Urbanplanners, policymakers and transportmanagement are interested in howpeople travel, howinfrastructure and the environment affect movement, and of course, in obtaining a realisticpictureoftraveldemand.Indoingso,manyotherfieldscanbenefitfromit.Inthisvein,these

  • 2

    mobilephonetraceshavebeenusedtofurtherseveralaimssuchasestimatinghumanmobilitypatternsandpopulationdistribution(Calabreseetal.,2015;Gonzalezetal.,2008,2010;SevtsukandRatti,2010;Readesetal.,2007),analysisurbanactivities(Jiangetal.,2013;Ricciatoetal.,2017;Widhalmetal.,2015),generatingOrigin-destinationflows(Calabreseetal.,2011;Hornetal.,2017;KalatianandShafahi,2016;Tettamantietal.,2012;Wangetal.,2010),andmanymore.All thesepursuits areof great interest to the fieldof transportation science andplanners arelooking intohowbest toyield this information for cities’ transportation systems.Singapore,arapidlygrowing,denselypopulatedmetropolitoncityisusingdecadelongdemandforecastsontheir public transportion, among other planning sectors such as land use and urbanredevelopemntplanning.TheirSmartMobility20301initiativeisleadingthewayinincorporatingsuchlocationdatatoplanthenationstransitneedsandiscontinueingtogrowinthisarea.Whenwehaveabetterunderstandingofhumanandtrafficflows,wegaingreaterinsightsandmanagement capabilities for traffic congestion, health monitoring, elderly care and evenepidemiology. Oneway this can be done is by understanding the transportmode choices ofpeople and this is achieved through collecting information through travel demand surveys ortraveldiaries.Improvingtraveldemandsurveysisanimportantdrivingfactorandmotivationinthis research. Aspects like scalability of these alternatives are critical considerations in theirdevelopment in termsofdatacollectionandprocessing.Forexample, traditional surveysmaytaketheformofmanualcollectionand labeling, liketelephone interviewsandquestionnaires.Inaccuraciesareintroducedasaresult.ResearchershavelookedtoGlobalPositioningSystems(GPS) in the form of loggers or GPS-enabledmobile phones, as well as CDRs as alternatives.Another motivation is context-aware location-based services. Transportation modes such aswalking,cyclingortraindenotescertaincharacteristicsofauser.Oneuseofthisknowledgeistargeted and customized advertisements that may be deployed to the relevant markets. Aspeople have began to see the large potential of these datasets, and while ourtelecommunications infrastructurehas improved leapsandbounds, sohave thequalityof thecellularnetworkdatathatcomeswithit.AstepupfromtheusualCDRdataisCellularSignalingData (CSD). CSD consists of not some normal CDRs, but other cellular network related dataincludingbothnetworkandevent-drivendata(section3.4.2).Withanadditionalmap-matchingstep, CSD ultimately lends itself an increased spatial and temporal resolution. Similar to adtargeting strategies of some socialmedia platforms, there are potential avenues to generate

    1https://www.lta.gov.sg/content/dam/ltaweb/corp/RoadsMotoring/files/SmartMobility2030.pdf

  • 3

    morerevenuewithmoretargetedandcustomizedservices.Telecommunicationprovidershavetaken notice and begun to invest in developing techniques tomine information and provideaccess to this data. As such, this thesis will attempt to achieve the aims of transport modedetectionwithCSD.ThisresearchisincollaborationwithParkbob,arapidlygrowingcompanythatdeliverscontext-awareparking information todriversaswell asusingpredictivemodelswith real-time, crowdsensed data to provide parking availability information.While this is in the realm of LBS, themotivation for this is largely driven by the desire to understand the transport demand thatdrivestheneedforthisLBS,andishenceliesmoreintheveinoftransportationscience.

    1.2 Problemstatementandresearchaims

    With the aims of transportmodedetection inmind, CSDoffer amuchmore opportunities interms of types of modes and performance due to its higher quality. However, despite thispassivedatatypehavingtheadvantagesoflargesamplesizeandlongobservationperiods,theyalso have obviousweaknesses: cell phone traces can be sparsely sampled in time during idleperiods,theymightprovideonlyalowspatialresolutionandincludenoisestemmingfrompuresignalmovement. Therefore thedatahas tobe carefullyprocessed toextract triporigins anddestinations. Access is also hindered due to varying privacy and business-sensitivityconsiderations. Many of the previous studies involving cellular network data for mobilityanalysishavebeenlimitedtoCDRsandhavecomeupwithmethodstoalleviatetheimpactofthesechallenges(Quetal.,2015;Wangetal.,2010).Itwasonlyrecentlythatcompanieshaveallowed access to this new cellular signaling data whose greater detail means much moreinformationcanbeminedascomparedtoCDR.Themainchallengeherehowever, ishandlingthe still much lower spatial and temporal resolutions associated with this passive data typewithout having to actively solicit other supplementary data in a time and resource intensivemanner. Privacy concerns also mean that existing studies do not have ground truth data toevaluatetheirresults.Becauseofthis,thenumberofmodesthathavebeendistinguishedusingpassive mobile phone data has been rather limited, usually to betweenmotorized and non-motorized,orprivateandpublictransportationmodes.

  • 4

    As such, the aim of this paper is to overcome the restrictions of active data types (GPS) andpassivedatatypes(CDR-only) formodedetectionandproposenovelmethodsforthisrelativenewcomer,CSD,whilealsoaccountingforitslowandirregularspatialandtemporalresolution.ThestudyareaconsistsoftwomajorcitiesinAustria,ViennaandGraz.Themodesoftransportof interestwillbebothprivateandpublictransportationmodes:car,bike,walk, tram,S-Bahn(commutertrains)andU-Bahn(metro).This isalsolimitedbytheamountofactualdatamadeavailabletothisstudy.MethodsthataretakenfromexistingCDRandGPSstudies,whether inpartor inwhole,willbeadjustedsothat theyaremoresuitable todealwiththeuniquedatacharacteristicsofCSD.This thesiswillconsistof twomainparts.First, severalmethodswillbedevelopedusingcombinedapproachesofseveralpopularmodedetectionmethodsproposedbyseveral existing studies. Thiswill include both scenarioswhereby labels are available and themorelikelyoneswherebytheyarenot.Thesecondpartwillevaluatetheperformancesoftheseproposedapproaches,andcomparethembasedonvariousperformancemetrics.

    Research Question 1: Development and Implementation: How can various modes oftransportation (walk, bus, tram, car) be detected from cellular signaling data (CSD)consideringitslowerandmoreirregularspatialandtemporalresolution?Hypothesis 1: Due to the unique characteristics of this dataset, tailoring existingmethodstobeappliedherecandetectvariousmodesoftransportation.Thisispossiblebydistinguishingbetweentheirspatiotemporalcharacteristicsaswellascomplementarycommon sense information such as locations of transport networks. Supervisedmodedetectionmethodsdevelopedherewouldfollowacombinationofrule-basedheuristicsand fuzzy logic systems or machine learning. Unsupervised mode detection methodswouldfollowaclusteringapproachcombinedwithunsupervisedrandomforests.Thesemethodswillusevariablesselectedthroughvariousvariableselectionmeasures.

    Following the implementation of this developed methods the next question addresses theirperformance and quality and determines the bestmethod that should be adopted formodedetection in urban areas for these particular modes of interest. Using various performancemetrics,theresultsoftheseproposedmethodswillbecomparedagainsteachotheraswellasagainstthoseinexistingstudiestogiveanideaofhowwellthechosenmethodperforms.

  • 5

    Research Question 2: Evaluation and comparison: How do these proposed methods(RQ1) perform and compare against each other?Which is the best method of modedetection for detecting these modes of transportation? What are the most usefulfeaturesfortransportmodedetectionusingCSD?Hypothesis 2: The results of these algorithms will be compared against ground truthprovided by the data collectors’ annotations. Due to the noisier and dirtiercharacteristics of CSD as compared to GPS data, it is likely that the inclusion ofcontextualdatasuchasGISdataofthetransportationnetworktosupplementtheCSDwillleadtobetterperformancesofthemethods.

    1.3 Mainexpectedoutcomes

    Themain contributionof this thesis is thusamethodology todetectmodesof transportationfrom CSD data based on their spatial and temporal features. A set of methods that arepermutationsofvariousexistingapproacheswillbeproposedwiththegoalof findingtheonebestsuitedtoCSDafterevaluatingtheirresults.ThiswillbeanovelcontributiontothefieldastransportmodedetectionusingCSDisstillverymuchinitsinfancy.

    1.4 Thesisstructure

    A summary of related works will be presented in the next chapter, chapter 2. It will alsohighlighttheprosandconsofexistingmodedetectionmethodsandhowlessonslearnedfromthem are used in the development and creation of our proposed methods. Following this,chapter 3 will give an overview of the data and chapter 4, themethods proposed formodedetection in this study. The evaluation results will be presented in chapter 5, along withsensitivity analyses of the chosen parameters. Upon discussion of these various methods inChapter6,themethod(/s)deemedasbestandmostsuitedforCSDwillberecommended.Bothresearchquestionsaswellaslimitationsofthestudywillalsobediscussedinthischapter.Lastly,Chapter 7 concludes the study with a summary as well as considerations for future work.

  • 6

    CHAPTER2BACKGROUNDANDRELATEDWORK

    2.1 Theworldoftransportation

    Recentdecadesofmassivepopulationgrowthcombinedwiththehugeinfluxofurbanmigrationhascalled for theneed tomanageoururban resources inamoreeffectiveway tostreamlinealready limited resources. Transportation is one of these key issues that growing populationshavetograpplewith,asit isanunavoidableaspectofeverydaylife.Motivatedbytheneedtobetterservesociety,citiesneedtobeabletoforecastfuturetraveldemandsoastochanneltherightinvestmentsintherightvolumestotherightplaces,suchastolarge-scaletransportationprojects.Much effort has gone into seeking to developmodels that predictwhere andwhenpeopletravelto,howtheydoit,andwhataffectsthesechoices.Informationlikehowtransportnetworksperformintermsofcongestionandflow,orhowrouteandmodechoicesrespondtoroadpricingschemesareextremelyimportantinaidingdensecitieskeepupwiththeincreasingpressures and demands of a growing population. By evaluating that information, decisionmakers are offered valuable insights into urban activities and movement flows, enablingthemselves intomaking thebestandmostprudentdecisions for thegoodof thepeople theyserve (Rasmussen et al., 2015). For example, urbanplanners canmake a citymore livable bymitigating congestion and planning for developments to cater to high volumes of people.Transportplannerscanunderstand themobilitypatterns ingreaterdepth,knowing timesandlocationsof traffichotspotson the roads,aswell ason the transitnetworks.Evencompanieswhowanttostreamlinetheirproductsandservicescandosotothosewhoneeditmost.Inthegranderschemeofthings,modelsthatcanpredicthowpeopletravelintimeandspacecanhelpin the fight to reduce our global carbon footprint. Our reliance on motorized forms oftransportation isoneof thekeydriving forcesofclimatechange, furthermotivating transportresearchers and providers to strive towards more sustainable transport networks, one thatpromotestheuseofnon-motorizedorpublictransportationforexample.

  • 7

    2.2 Transportmodedetection

    Akeywaytoachieveagreaterunderstandingofhumantravelpatternsisanunderstandingofthemodesoftransportationtheytake,aswellasitscorrespondingtemporaldistributions.Thisproblemhasbeentackleddifferentlybasedonthewhatobjectiveoftheresearchersisandcanbe summarised into three main branches (Prelipcean et al., 2017). First is Location-BasedServices(LBS)wherebythegoal istodetectthemodeasclosetorealtimeaspossiblesothatimportantandrelevantinformationcanbegiventothecommutersorinterestedpartiesatthesuitable times and places, such as with Parkbob’s smart car parking application2. This “on-demand”kindofmodedetectionisinlinewithmanycitiesaspirationstowardafullyfunctionalsmartcity,whereresourcescanbeallocatedontheflytoplacesorpeoplewhorequirethem.Another huge and long-standing branch is transportation science, which aims to generatereliable and usable statistical data on usage of the transport system. This data is ued as thefoundation to answer many city planning questions and further many of the applicationsmentioned previously. With such valuable uses of this data, transportation scientists havecontinuallytriedtogatherthisinformationmainlyintheformofactivelysolicitedtraveldiaries,or paper, internet and phone surveys (Rojas et al., 2016; Shen and Stopher, 2014a). Thesetraditionalapproacheshaveproventobeinaccurate,time-consumingandresourceintensiveasit relies on people manually self-reporting their daily activities, travels, and correspondingschedules. Often, these get under-reported due to forgetfulness or the amount of effort itrequires (Bohte andMaat, 2009). Transportation scientists aremotivated to overcome theseproblemsandautomatepartofthisdatacollection,asthepositiveimpactofgoodqualitydataon transportation mode usage is compelling. Thirdly, transport mode detection is of greatinteresttothefieldofhumangeography,whoseobjectivesarelargelytoenrichthesedatasetswithdomain-specific semantics suchwithassociatedPoints-Of-Interests. This fieldof researchhasmethodsofmodedetectionsimilartothatoftransportationscience,buthasoutputsthatcover ahuge scope,using thesehumanmobility trajectories to answeramyriadofquestionssuchaslinguisticevolutionorhumaninteractionpatterns(Prelipceanetal.,2017).Thisresearchwillhaveaimsmore linewith thatof transportationscience, indevelopingmethods tocollectinformation on people’s mode choices. The applications however, are motivated to supportParkbob,whoseservicesliemoreintheLBSrealm.

    2http://www.parkbob.com/

  • 8

    2.3 TransportmodedetectionusingGlobalPositioningSystemsdata

    InMay2000,theUSgovernmentdecidedtoremoveselectiveavailabilityofGlobalpositioningsystems(GPS),whichwasamilitaryefforttowardssecurityreasonstointentionallydegradeGPSsignals.NowGPSdevicescandeterminelocationswithaccuraciesoflessthan10m(BohteandMaat,2009)forvariouspurposesinthecivilianworld.Someexamplesincludeinagriculturetoaccuratelymonitor yielddataor to enablework in poor visibility orweather, aviation for thecontinuousprovisionofreliableandaccurateinformationonflightsaswellasformoreefficientairtrafficmanagement(KaplanandHegarty,2005).Disastermanagement isanotherareathatbenefits from this technology. When little information is available, GPS makes mapping ofdisasterzonespossible.Floodandearthquakepredictioncapabilitiesarealsoimprovedwiththistechnology(KaplanandHegarty,2005)..TransportationisagreatbenefactorofGPS,withmoreaccuratepositioningleadingtobetterscheduleadherenceandtransportdemands,forexample. As compared tomore conventionalmeans of collecting such data through surveys and traveldiaries, collection via GPS devices alleviates many of the formers’ shortcomings, on top ofproviding greater opportunities of quality and quantity of data collected. Providing morecomprehensive information on origins, destinations and the routes taken between them, tripstartandendtimesaswelltriplengthscanbemorerealisticallyachievedbytherespondentastheydonotrelyonmemoryorneedtomaketheefforttopendowntheirschedule.Thedatatends to be more accurate and independent on the respondents perception of durations,distances, and departure/arrival times (Rojas et al., 2016; Shen and Stopher, 2014a).Underreporting is also avoided as the GPS logger captures all movements of participants(Stopheretal.,2008).Thisalsomeansthatdatacollectioncanbedoneoveraprolongedperiodoftime.Furthermore,GPScanalsobeusedinconjunctionwithtraditionalmethodsasameansofverification.Theseadvantageshave led to the riseof incorporating these technologiesasasupplementarytoolortocompletelyreplacetravelsurveys.Now, GPS units are commonplace and are accurate and lightweight enough to make this afeasiblealternative todatacollection, facilitatingmorecompleteanalyses (Bolboletal.,2012;Gongetal.,2012).Morecomplextravelpatternscanbeminedfrominformationonmodesuchas what combinations are taken, route choices in multi-modal trips and how they vary ondifferentdaysoratdifferenttimes.

  • 9

    2.3.1 Pre-processing

    TherawGPSdatasetscollectedareextremely large,sometimescontaininglogs intheorderofmillions.Thisalsoincludesirrelevantdatasuchaswhenthepersonisnottravelling.Combinedwithissueslikesignallossandcoldstarts(whenthedeviceisturnedbackonhandtakestimetorecalculateinformation),asetofpre-processingtechniquesmustbeappliedtoturnthisdatasetinto a comprehendible information source. This usually entails cleaning thedataof noise andthensegmentingthemintoindividualtrajectories,ortripswithstartandendpoints(alsoknownasSegment IdentificationorSI).A commonlyusedmethod for this step is through theuseofrule-basedalgorithms,oftenby identifying stoppointsandassign themas start/endpointsofthetrajectory(ShenandStopher,2014b).Manystudiesuseathresholdof120secondsastheminimumtimeapersonmustbeinthesameplaceforittobeconsideredastartorendpointwhichcouldbeactivitiesormodechangingpoints(ChungandShalaby,2005;Gongetal.,2012;Stopheretal.,2008).Trafficlightchangetimesorbusstoptendtobelowerthanthat,makingita reasonable criterion.Todate, this rule is stillbeingused,butalso supplementedwithotherrules. For example, Schüssler&Axhausen (2009) combine this threshold andpoint density astheircriteriaforactivitydetection.Activitylocationsaredetectedwhenobservationsmeettwocriteria: (1) low speeds (

  • 10

    2.3.2 Modedetection

    Earlier GPS studies differentiated betweenwalking, driving andmotorizedmodes (Bohte andMaat, 2009; Chung and Shalaby, 2005), butmore recent studies have begun to detect publictransportationmodesaswell(AxhausenandSchüssler,2009),andmanygoastepfurtherastoclassify thesemotorizedmodes into the variousmodes of public transportation like buses ortrains(Gongetal.,2012;Rasmussenetal.,2015;Stennethetal.,2011).Input variables that determine modes such as average, maximum and standard deviation ofspeed,accelerationmeasures,averagedwell timeandaverageheadingchangeare frequentlyused in this stage (Gonzalez et al., 2010; Stenneth et al., 2011; Xiao et al., 2015).Due to thegrowingpopularityofgeospatialdata,manymodedetectionstudiesalsoincorporatedatafromexternal sources suchas the transportationnetworkor real timepublic transport information(Asgarietal.,2016;Gongetal.,2012;Stennethetal.,2011;TsuiandShalaby,2006).Especiallyintimesofheavytrafficwhenmovementisslow,itisdifficulttoinfermodesolelyfromvelocity.Theareaswhereabusortramcanbearegenerallyfixed.ThistypeofdatafusionwithGISdatahasproventomakemodedetectionmorerobustandproducemuchbetterresultsthanwhencomparedtothebaselinemethodwithoutcontextualinformation.When it comes to research more in the line of transportation science, there seems to be apreference for inferringmodesusingRule-basedmethods (BohteandMaat, 2009;ChungandShalaby, 2005; Gong et al., 2012) and fuzzy logic systems (Axhausen and Schüssler, 2009;Rasmussenetal.,2015).Somealsousesupervised learningmethodssuchasRandomForests.Thesethreetypesofmethodswillbedescribedingreaterdepthinthefollowingsection.

    Rule-basedheuristics

    Gong et al. (2012) use a rule-based GIS algorithm that automatically processes GPS data todetect5modes.Thealgorithmalsorecognizeswhethermodetransferswithinatriparefeasible.BycombiningGISdata likestreetcenterlines,busroutesandstops,subway lines,stationsandstationentrances,thismethodisabletoachieveapromising82.6%accuracy.First,trajectoriesare split into segments by identifying stop and mode change points. Through a set ofhierarchical rules,walkmodesare inferred firstbasedon speeds.Next, by comparison to thepublictransportationnetwork,railfollowedbybusmodesareinferred.Therestareconsideredascarmodes.Thresholdsusedarebasedonthespecificationsof thecity.Forexample, in thestudy area of NYC, themaximum length of trains was 184m, so the threshold formaximum

  • 11

    distancetoastationtobeconsideredrailmodewassetat200mtoaccountforthefactthattheusercouldbeattheendofthetrainwhile itstoppedatthestation.Otherrulesderivedfromcontextaware information includethethirdrule,wherethemaximumspeedandaccelerationofanexpressbus inNewYorkCity is88km/h,or1.5m/s2.The full setof rules canbe seen inFigure1.Thestudyshowedpromisingresults,howevernotedthattherelativelyloweraccuracyofbusandcarmodeidentificationwasduetothedensestreetnetworksandtheconsequencesoftheurbancanyoneffectwhichsometimescauseaparallelshiftofGPSobservations(Gongetal.,2012).Thiscanleadtomisclassificationofbusmodesascarorwalkmodes.Map-matchingtechniques derived from Chung and Shalaby's (2005) paperwere also applied tomatchwalksegments to street segments. Furthermore, that paper developed a trip reconstruction toolusingGPSdatawitha rule-basedalgorithmaswell,whichachievedanaccuracyof92%of allfour modes of interest. Bohte and Maat (2009) also use straightforward rules (Figure 2) onmeasuresofmaximumandaveragetripspeedstoinfermodes,startingfromtheslowermodes,walk,thenbicycleandcar,followedbypublictrainmodes,duetoitscharacteristiclocationthatisconstrainedbytherailnetwork.Stopheretal. (2008)managetoachievean impressive95%accuracywithanotherhierarchical setof rules togetherwithexternal transportnetworkdata.Furthermore, the studies found that the distinction between bus and car modes was verysensitive to their specific rules, anddue to the similar speedprofiles of bothmodes, there isusuallyahigh tradeoffbetweensuccess rates foronemodeand theother (BohteandMaat,2009;Gongetal.,2012).

    Figure1RulesusedinGongetal.'spaperinthemodedetectionprocess(Gongetal.,2012)

  • 12

    Figure2RulesusedinBohte&Maat’spaperformodedetection(BohteandMaat,2009)

    Another study by Kasahara applied a rule-basedmode detectionmethod, but detected high-speedmodesfirst,andassignedmodestotheindividualobservationsinsteadoftrips(Kasaharaetal.,2017).Observationsofthesamemodearesubsequentlymergedintotrips,providedthetimeperiodislessthanacertainthreshold.However,despitethehighperformance,someopinethismethodstruggleswithlowgeneralizabilityasrulesobtainedfromaonecitymaynotbesoapplicabletoanothercityduetovariousreasonslikethebuiltenvironmentaffectingGPSsignals(or density of cell towers, affecting overall coverage and signal strength). However, thesimplicityandcomprehensibilityofthesemethodsmeanthatitisfeasibletoderiveparametersfor each city (lengths of trains or average distance between stops from the cities transportprovider,forexample).

    FuzzyLogicSystems

    Fuzzy logic (FL) systems are powerful predictive models as they can handle uncertainty andvagueness inawaythat isunderstandablebyhumans.However,thesuccessofafuzzyexpertsystemliesinproperselectionofitsfunctionsandparameters,whichareusuallydonemanually(Das andWinter, 2016a). Unlike crisp sets with hard border values, fuzzy set theory assignsmembership values to an element, introducing the concept of partial membership of thatelementinaset,oranumberofsets.ThewayFLisusedinthesestudiesismostlysubjective,astheempiricalapproachtogeneratingtheserulesisdependentonhumanjudgmenttodefinethem.Schussler&Axhausen(2009)usean open source FL platform to generate trapezoidal membership functions of their fuzzyvariables.Thevariablesweremedianofspeed,95thpercentilespeedandacceleration,andwereexplicitly chosen over average values to make the algorithms more robust against outliers.Figure3showsthemembershipfunctionsofeachvariable.Aminimumofoneruleisdefinedforeach mode based on these membership functions, as seen by the examples in Table 1.

  • 13

    Ambiguity and fuzziness is intentionally introduced through the rules as well as from theoverlappingmembership functions.This canbeespeciallyuseful formodes thathavevariablespeedprofiles,suchasbuseswhichstartandstopfrequently,andspeedchangesdependingonwhether they are in the city center and the stops are close together, and in residential areaswhere there are longer stretches between stops (Tsui and Shalaby, 2006).Modes are finallyinferredbasedonthemembershipvaluesfromtheaggregatedmembershipfunctions.Ramussenetal.(2015)appliedasimilartechniquetotheirstudyareainCopenhagenusingthesame variables, butwith values derived from their own expert knowledge and analysis. Theyalso combined the FL system with a Rule-Basedmethod first sieve out rail trips, due to theassumptionthatraillinesarecharacteristicallydifferenttoroadnetworksandthusatrajectoryalignedwithraillineshasahighpossibilityofbeingarailtip.ThestudyfoundthattheFLruleswerestillinsufficienttoeffectivelydistinguishbetweenbusandcarmodes.Inresponsetothat,theymeasure alignmentof the identifiedGPS stopswith thatof thebus routes. Both studiesappliedfeedbackmechanismsforweirdcombinationssuchascartobicycleorcar-bus-carwereappliedtocorrectthesetomorerealisticmodes.Forexample, ifthesequenceofmodeswerecar-bus-car,thealgorithmwouldflagitandreclassifythetrajectoriesasacartrip.Highspatialand temporal granularity allows for shorter andmore detailed trips thatmay constitute onesinglejourneytobeidentified,allowingforthismethodtoactasasuitablefeedbackalgorithm.

    Figure3FuzzyLogicmembershipfunctionsgeneratedbyhumanexpertise(AxhausenandSchüssler,2009)

  • 14

    Table1Examplesoffuzzyrulesformodedetection(AxhausenandSchüssler,2009)

    Thepaperdidnotreportanyaccuracymeasuresofthisprobabilisticmethod,butcomparedtheresultswiththeofficialcensusdataontravelbehaviorthatwasreleasedafewyearspriortothestud,andconcludedthatthisformofmodedetectionyieldsrealisticandreasonableresults.Astudyreleasedafter thatdesignedamorecomplexFLsystemsoas toclassifymoremodes.Afew more fuzzy variables were added to the list, including proximity to a network like therailwayorbusnetwork.Thismethodwasabletodetectwalking,bicycling,car, ferryboat,sailboat,train,subway,bus,tramandflightmodesusingGPSdataandthesefuzzyvariableswithanaccuracyof91.6%(Biljecki,2010).Thefuzzysystemhadcertaintyfactorsappliedtoeachresulttomeasuretheconfidenceoftheinference.One drawback of FL systems is that these rules tend not to take into account inter-variablecorrelation,andastherulesaregeneratedusingexperts’understandingofthefield,anyclass(mode)additionstothemodelwouldbeextremelycostly(Elkanetal.,1994).Thismayprovetobe a problem when trying to transfer this method that was designed for GPS data to CSD.However, it is still possible to construct a FL model without expert given a set of input andoutputpairs.Thetaskthenisfundamentallyakintodeterminingasystemthatprovidesthebestfittothesepairs(Mendel,1997),andwillbeexploredfurtherinthenextchapter.

    MachineLearning

    Duetocertainlimitationsofsettingrulesandalgorithmsinprograms,somestudieshaveturnedtomachine learningmethods instead.Thesemethodscommonly includeneuralnetworksandtree-basedmodels,amongafewothers.Gonzalesetal.appliedareducedsubsampledGPSdatasettoneuralnetworkstoinfermodes.Thesubsetconsistedonlyofcriticalpoints,whichwerecharacterizedbyheadingchangeanda

  • 15

    minimumspeedsoas to remove redundantdataandminimizeprocessing.The inputs chosenfor the neural network algorithm were the common variables such as acceleration, speed,distances between stop locations, dwell time, aswell as GPS specific variables like estimatedhorizontalaccuracyuncertainty(Gonzalezetal.,2010).Theneuralnetworkwasabletolearntodistinguish betweenwalk, car and bus trips.However, the paper cited that the critical pointsusedintheproposedalgorithmwereinsufficienttoachieveagoodresult.Tsui&Shalaby(2006)proposedahybridmethodthatcombinesthisneuralnetworkwithafuzzylogicsystemformodedetection.ThefuzzyvariableschosenweresimilartotheotherFLstudiesdescribedabove,withtheinclusionofdataquality.However,theparametersofthesevariablesandtheirmembershipfunctionsweresetbyaneuralnetworkalgorithm(NEFCLASS-J).Theirworkmanagedtoidentifymodes(walk,bus,bicycle,car,rail)withanoverallaccuracyof91%,thoughtheperformanceofbus modes was relatively poor due to the considerable overlap of characteristics with othertravelmodes.These,combinedwiththe largevariabilityofmovements inbuseswerecitedasreasonsforthispoorerperformance.Thepaperhowever,didnotreportonthemodeshareoftheactualdata.Severalworksalsoincludetemporalmeasuresintheirlearningmethodssuchastimeofdaytogivecontexttoaprobabilitymodeltoestimatemodechoice(Liaoetal.,2007).Stennethetal.(2011)incorporatelivebusandtraintimeswheninferringbetweenstationary,walking,cycling,bus,driving,andtrainmodes.Theyextractedvariablessuchasaveragespeed,headingchange,acceleration, as well as context-aware information like average bus line closeness, rail lineclosenessaswellasbusstopcloseness.Theauthorsranthesevariablesthroughseverallearningalgorithms (Random Forest, Decision Tree, Naïve Bayes, Bayesian Network and MultilayerPerceptron) and found that Random Forest had the best performance. A few importantstrengths of RF that are relevant to this study are that it is one of the highest performingmachine learning algorithms in terms of accuracy and can run efficiently on large datasets.Estimatesofwhatvariablesareimportantarealsoincludedintheoutput,whichcanbeusefulfor purposes like dimension reduction (Degenhardt et al., 2017). It is also able to generatepairwise proximities between data points that can be used as input in other classificationmethodslikeunsupervisedk-medoidclustering.

  • 16

    2.4 Transportmodedetectionusingcellularnetworkdata

    Pervasive technologies suchasmobilephonescreatedatasets thatgivean inside lookofhowpeople use the city’s infrastructure. Urban planning is one of the greater benefactors of theanalysisof this collectivepersonal locationdata.Mobilephone tracescontribute toamassivepool of passive data that can provide knowledge on the whereabouts and movements ofindividual users. According to the Global System for Mobile Communications Association(GSMA),aninternationaltradebodythatrepresentstheinterestsoftheworld’smobilenetworkoperators,thereareabout7.7billionmobileconnectionsby5billionuniquesubscribersin2017.465 million of these subscribers reside in Europe alone3. This large potential has not goneunnoticedasmanyoftheseoperatorshavebeguntoexperimentwithnewbusinessmodelsthatwouldgeneraterevenuefromboththeirmobilesubscribersaswellasothercustomerssuchastraffic analysis, advertising andmarketing, and social networking companies. As such, it is nosurprisethatthesharingofsuchmobiledatawithresearchcommunitieshasstartedtopickupspeed(Calabreseetal.,2015).Themainhurdle isthe lowerspatialresolution, inconsistentandsometimessparsesamplesofdata. As such, they require a specialized set of techniques for extracting valuable and usableinformation from them. There are various types of cellular network data such as call detailrecordsandcellularsignalingdata.Thelatter,whichistheonethatwillbeusedinthisresearchis known by many names, including floating cellular/phone data, sightings data, and so on.Furthermore, this data type can have varying properties depending onwhether the phone isconnectedtoa2G,3Gor4Gnetwork.Thisgivesanindicationofhowrecentthisdatatypehasbeenincorporatedintosuchresearchfields.Forexample,SevtsukandRatti(2010)addresshowcoarse-grainedcallvolumedatainRomecanbeused to teaseoutpropertiesofusermobility,where they found regularityandpatterns inurbanmobility at different timesof thedays, aswell aswhichdayof theweek itwas.Otherindicators likedemographic,economicand infrastructural indicatorswereusedtosupplementandaccountforthesepatterns.Travelroutescanalsobeestimatedusingcellularnetworkbasedvoronoisandmapmatching(Tettamantietal.,2012).Otherstudiesalsousethefluctuationsof

    3https://www.gsma.com/newsroom/press-release/number-mobile-subscribers-worldwide-hits-5-billion/

  • 17

    signalstrengthinGSMcellulardatatoestimatemoreprecisegeographiccoordinatesforthesepurposes(Thiagarajanetal.,2011).Transportmodedetectionisanotherareaofresearchusingcellularnetworkdatathatisstillinits early stages. This is largely due to data’s lower spatial and temporal granularity, themainchallengestothecomputationofspecificmeasurementsonspeed.However,therehavebeenafew studies attempting to estimate coarse speeds according to the rate of change of theconnected cells, as well as the distances between them (Gonzalez et al., 2008; Reddy et al.,2010a;Sohnetal.,2006).OthershavemadesenseofcoarseCDRdatabyclusteringtraveltimesoftripsintothecorrespondingtransportationmodeclusters(KalatianandShafahi,2016;Wangetal.,2010).Thenext sectionswilldelvedeeper intohowcellularnetworkdata isgenerated,processed,andusedfortransportmodedetection.

    2.4.1 Mobilephonenetworkstructure

    The basic network structure is composed of a Core Network (CN) and Radio Access Network(RAN).TheCNisdividedintoeitherCircuit-Switched(CS)foractivitieslikevoicecallsorPacket-Switched (PS) domains for packet data transfers. Radio communication occurs between themobilephones(terminals)andthebasestationservingthatcell.Assuch,cellsarethesmallestspatialentitiesinthecellularnetwork,withageographiccoveragethatvariesfrommagnitudesofmeters (microcells),up to severalkilometers (macrocells). Several cells togethermakeupaLocationArea(LA)(Janeceketal.,2012;Miaoetal.,2016).ThisstructureisillustratedinFigure4.

    2.4.2 DataGeneration

    Mobile phone positioning occurs whenever a terminal communicates with the network,essentiallywhen a user uses his phone (Chen et al., 2016). Calabrese et al. (2015) categorizemobile phone data into two types, event driven and network driven. Event driven data isgenerated during billed activities such when calls or texts are made, or when data is beingtransferredwhilebrowsing the Internet forexample.These includeCallDetailRecords (CDRs)andInternetProtocoldetailrecords(IPDRs)respectively.Atthisstage,theterminalissaidtobe

  • 18

    in active state, whereby the voice call or data connection is open. At any given time, themajorityofmobileterminalsarenotintheactivestate,butintheidlestate.Eventerminalsthathavetheirdataconnectionpermanentlyswitchedonremain inthe idlestate,switchingtotheactive state only during packet bursts, like data downloads (Janecek et al., 2012). Theinformationineacheventthatisultimatelyrecordedinthedatadependslargelyonthemobilephone provider that operates the network. For example, CDRs could include the IDs of thecallers, receivers, cell towers and start and end time stamps. Similarly, IPDRs will consist ofinformationonInternetusageandothercellulardatarelatedactivities.Network driven data is also known as floating cellular/phone data or signaling data and isgeneratedwheneveraphoneislocalized,i.e.,duringdifferenttypesoflocationupdates(Figure4). Periodic updates occur on a periodic basis as determined by the telecommunicationsprovider to generate periodic information on which cell tower the terminal is currentlyconnectedto.Handoversaregeneratedwhenanactiveterminalmovesbetweentwocellsandlastly,mobility location updates are generatedwhen a terminalmoves between two locationareas.Assuch,dependingonthestate(activevs.idle)oftheterminal,thespatialgranularityofthedatarecordedcanbeatthecellleveloratthelocationarealevel.

    Figure4a)LocationAreaandBaseStationsb)Periodicupdatesc)Handoversd)Mobilitylocationupdates(Calabreseetal.,2015)

  • 19

    2.4.3 SpatialandTemporalGranularity

    The frequency of the data depends on the type ofmobile data generated and is largely userdependent.Chenetal.(2016)foundthatthefrequencyofbotheventandnetworkdrivendatatypesdisplayhighheterogeneity inthenumberoftimesthephonewas localizedoracallwasmade forexample,whereby themajorityofusershavea smallnumberof recordsandonlyafewhavemore a largenumbers. Each voice call generates oneCDR.However, that same callmight generatemultiple network driven data points ifmultiple cells are traversed during thedurationofthecall.Periodicupdatesaretypically intheorderofafewhours(Wildhametal.,2015).Forpurposesofclarity,fromthispointonthetermcellularsignalingdata(CSD)willrefertobotheventdrivenandnetworkdrivendata.AstudythatusedCDRsfoundthattheaveragetimeintervalbetweeneacheventwasabout8hours (Gonzalezetal., 2008).These intervalsaccurately represent the time intervalsbetweeneach call and aremuch longer thandatasets that includenetwork drivendata aswell. In thelatter,asasinglecallmighttriggermultiplenetworkdrivenevents,theseeventsmighttendtobemore clustered together in terms of the times they were recorded(Chen et al., 2016). InCalabrese et al.'s (2011) paper, the average inter-event time intervals that included networkdrivendatawasfoundtobe260minutes,withtheaverageofthequartiles’medianstobelessthan1.5hours.Assuch,thedatawasfineenoughfortheresearcherstoidentifystopsoflesserthanthattime interval.However, it isalso importanttonotethatstopsof lessthan1.5hourswill be missed. This might add to inaccuracies of the processed data especially since somehousehold travel surveys define a stop exceeding 5minutes to be an activity that should berecorded (Chen et al., 2016). As for location area updates, there may be cases whereby noupdates are sent from themobile phone despite large amounts ofmovement if the locationareacoversalargearea,someseveralhundredsofkilometers.In termsof spatial granularity, thesedata typesdiffer in fromGPSdata in the sense that thelocation information has to be estimated using variousmethods. As thismeans that the cellphoneeventsonlycontainapproximatedlocations,thisissignificantlylessaccuratethanthatofGPSdata(Hornetal.,2014).Triangulationisoftenusedandresults incoordinatesthatdonotcorrespondto thecell tower locationbutareanestimateof the terminalsposition.Measuressuchasreceivedsignalstrength,transmissiontimeandanglesofmultipletowersareusedintheestimation if there are multiple base stations in available range (Chen et al., 2016). Thealgorithmshereusedareusuallyundisclosedbythemobileproviderandusebotheventdrivenand network data. Furthermore, most mobile providers do not disclose the structure and

  • 20

    organization of their cellular networks, or the spatial extents of each cell or LAs, which varydepending on the density of towers and the level of urbanization (Widhalm et al., 2015).Experimentsshowthatthespatial resolutionwasfoundtobefromtheorderofa fewmeters(Chenetal.,2016)toabout300m(Calabreseetal.,2011;Jiangetal.,2013)or500m(Hornetal.,2014) in urban areas where the density of cell towers is much higher, to that of severalkilometers(Hornetal.,2014;Widhalmetal.,2015)inrural,lessheavilypopulatedareas.

    2.4.4 Pre-processing

    Severalpre-processingtechniquesmustfirstbeappliedbeforevaluableinformationonhumanpatterns can be extracted. These vary depending on the research aims, but generally, noisereduction techniques in the form of filters or through clusters are applied to filter outinaccuraciesinthedata.Next,theseobservationsarethensegmentedintoindividualsegments,whereassomeGPSstudiesassignmodestotheobservationsandthengroupsimilarconsecutiveobservationsintoamodetrajectory(Kasaharaetal.,2017;Reddyetal.,2010b).

    Noisereduction

    Pre-processingneedstobedonetoreducenoise.Thisisdonetolowertheinfluenceofoutliersthatarisefromvariousphenomenaonthefinalanalysis.Onesuchphenomenon,knownastheping-pong effect, occurs when the terminal bounces back and forth between multiple basestationswhiletheuserisnotmoving(Fiadinoetal.,2012;Miaoetal.,2016).Thisoccursduetofluctuationsinthereceivedsignalstrengthandhenceleadstotheoscillationbetweendifferentcelltowersdespitebeingstationary.Thesefluctuationsinsignalstrengtharealsolikelytohaveanimpactontheestimatedtriangulatedposition,leadingtowhatappearstobedriftsandshiftsinthelocationofthedatapoints.Also,intheeventthereareseveralcelltowerswhosesignalsreachaterminal,theconnectionofthisdevicemayhopbetweenthesetowers.Thismeansthatoutlierscansuddenlyoccurkilometersawaywithinanunrealisticallyshortperiodoftime.Whilesome (like the above) can be the result of localization errors, others can be intentionallytriggeredforprivacyprotectionreasons.Forinstance,arbitraryeventscanbeinsertedintothemobile traces to prevent the creation ofmovement profiles. These events are part of efforts

  • 21

    towardsprivacyprotectionandarecalledtemporarymobilesubscriberidentitiesorTMSI(3GPP,2010)4.Oneapproachisthroughpattern-basedrecognitionandthisrequiresinformationonwhichcelltower the phone is connected to (Iovan et al., 2013; Schlaich et al., 2010). Users with highoscillations between cell towers are identified with a proposed “jumpiness rule” using thenumber of updates and area codes. Another approach does this through speed-basedcorrections.Athresholdischosentodistinguishbetweenwhatisareasonablespeedandwhatis not. Instances that produce values that exceed this threshold are flagged. This can also bedone through a number of ways as explored by Horn et al. (2014). They tested a series offiltering techniques including a recursive naïve filter, recursive look-ahead filter and Kalmanfilter.Outliersare identifiedasdatapointswherethespeedscalculatedareexceptionallyfast,and the thresholdwas set to 250km/h. The recursivenaïve filter simply removes anyoutliersfrom a sequential stream of events whereas the recursive look-ahead filter accounts for thepossibility that theeventbefore theoutlying event is theoutlier instead. TheKalman filter ismorecomplexinthatittakesaprobabilisticapproachandisapopularchoiceindatapredictiontasks including trafficmodelingusingGPSandothersensordata (Faragher,2012). Itproducesestimates of unknown variables in nosy time series through approximating joint probabilitydistributions over the variables in their time frames (Kalman, 1960). Results show that therecursive filtersoutperformedtheKalman filter,andoneof the reasonsproposedwasdue tothetemporalsparsenessofthecellularsignalingdata.Assuch,theformerismoresuitableasanoisereductionmethodformoreirregulardatalikeCSD.

    Tripextraction

    Inmanyofthestudiesusingcellularnetworkdatatominehumanpatterns,individualtripsarefirstextractedbeforetheyareanalyzedformodedetectionoraggregatedformorelarge-scaleanalysissuchasinurbanactivityanalysis.Toachievethis,keyplacesmustbeidentified.Ontopofassigningtheseuserstotheselocations,theremustbeadistinctiononwhethertheseplacesare stopsor theuser ismerelypassing through it. The former canbeplacesof activities, likework,homeor leisure,originsordestinationswhentryingtogenerateOrigin-Destination(OD)matrices,ormoreexactlocationsasstartandendpointsoftrips.Assigningthesestartandend

    43rdGenerationPartnershipProject;TechnicalSpecificationGroupCoreNetworkandTerminals;Numbering,addressingandidentification

  • 22

    points to specific locations can be done using a few methods. A more straightforward andcommonlyusedapproachisbyusingcentroidsofcellareasifthecelltowerlocationsareknown(Gonzalezetal.,2008;Tettamantietal.,2012).Manyotherstudiesusestopdetection to filteroutsignificantplaces,with the rationalebeingthat ifapersonstopsthereforareasonabletimeperiodtheseplacesare important inhumanpatterns. Due to the noisy and raw nature of cellular network data, the same event cansometimesberegisteredasmanyconsecutiveeventsthatcouldberelatedtovariouslocationsinitssurroundings.Theselocationsarefilteredouteitherbyasetofrulespertainingtospace,space and time or speed. A commonly used method here is through spatial and temporalclustering.Wangetal.(2010)workedwithCDRs,anduseanincrementalclusteringalgorithmtoextractthesestoplocations.Aradiuscorrespondingtotheestimatedpositioningerror(setat1km)andminimumdwelltimeweredefinedasthethresholdstoformclusters.Themedoidsoftheseclusterswerefoundandtheremainingpointsintheclustersweredeleted.Consequently,thesemedoidsweresetasthestartandendpointsofeachtrip.Jiangetal.(2013)applysimilarthresholdstocellularsignalingdata,butwithfinerthresholdsof300mand10minutestodetectstaylocations,asillustratedinFigure5.

    Figure5Illustrationofdatapre-processingtoextractstaylocations(Jiangetal.,2013)withdistanceandtimeconstraintsof300mand10minutes.

  • 23

    Another study byWidhalm et al. (2015) uses a similar clustering algorithm to both CDRs andcellular signalingdatabut incorporates thegeometryof the trajectory to filteroutpassingbylocations.Figure6showsanexampleofhowBinI)isnotdetectedasastopandthatinII)is.Asthe study was mining urban activity patterns, the reasoning behind this was that significantextradistancestravelledareoftenmotivatedbyanactivity.

    Figure6Detectionofstaysusinggeometry(Widhalmetal.,2015)

    2.4.5 Modedetection

    Oncethesetripsareextractedtheyarenowreadytobeanalyzedtoidentifythetripmodes.Inordertodoso, informationonthesetripsareextractedastrip features.Unlike inGPSstudiesthat assign modes to individual observations, most cellular network studies only do so aftergroupingtheseobservationsintosegments,treatingthesegmentsasthesmallestunitto infermodes insteadofoneach individualobservation.Due to the infancyofmodedetectionusingmobilephonedata, thestudiesareusually limited toCDR-onlydataandtheexistingmethodsused here can be classified into unsupervised k-means clustering, rule-based heuristics andmachinelearning.

    ClusteringK-meansAwell-knownmethodofmodedetectionusingcellularnetworkdataiswithk-meansclusteringoftraveltimes.Wangetal.(2010)workedwithanonymizedCDRsintheirstudy.Startandendpoints of these tripswere assigned to cells in a grid. Tripswith the sameODswere groupedtogether.Tripsover63minuteswereremovedastheyweredeemedtobetoolongatravelling

  • 24

    timewithinthestudyarea.K-meansunsupervisedclusteringwasthenperformedonthetraveltimesofeachoftheseODgroups,withdistinctionsbetweenweekdaysandweekends.K-meansisacentroid-basedclusteringalgorithmthatusesthemeanvalueofeachcluster (centroid) torepresentthecluster(

    Figure7).ThegoalofK-meansisthustoreducethesumofsquarederrorbetweentheindividualobjects in the cluster and their centroids (Hastie et al., 2009a). The clusteringpartitioned therecords intotwoseparateclusterscorrespondingtothemodesof interest,namelydrivingandpublictransport(Wangetal.,2010),wheretheclusterswiththeshortertraveltimeisassignedto driving and vice versa. There is an assumption of singlemodal trips here, similar tomanymode detection studies of this nature. The error of the inference is then calculated as theaverageofdifferencesbetween the travel timesandobtained fromtheclusteringand thatofreported by Google Maps. Silhouette values to measure how well associated the clustermembersaretotherepresentativeoftheclusterwasalsomeasured,andthisindicatedagoodperformance of the model. Due to lack of official census data of the city with regards totransportationmode, themodel could not be validated against such official records. Kalatianand Shafahi’s (2016) paper also detected walking, and used a similar approach. Their studyworkedon anonymized signaling data and grouped the trips into traffic zones insteadof gridcells,andseparatedbytimeofdaytoaccountfortraffic.Groupingthembythehourmeantthattherewereinsufficientrecordstoperformclusteringwell.Assuchtheyweregroupedintotripsoccurringatsimilarhoursoftheday,suchaswhenpeoplecommutehomefrom4PMto9PM.However,while thepaper stated that validationwasdoneagainst surveyeddata collectedbythecity,theresultsofthatvalidationwerenotincludedinthepaper.Theseclusteringmethodsonlyuseonefeatureofthetrips;thetraveltimeandassigningmodestotheseclustersmaynotbe so straightforward. In a city with a well-integrated public transport system such asmanymajorcities inEurope,traveltimeswhenprivateorpublicmotorizedmodescanbeextremelysimilar,ifnotshorter.

    Figure7Tripdata clustered to twosubgroups,drivingandpublic transit. Thearrows show theaveragetraveltimesofthesubgroups.BlacklinesarethetraveltimesasreportedbyGoogleMaps.(Wangetal.,2010)

  • 25

    Rule-basedmodesplit

    Quetal.(2015)workedwithCDRdatatodetecttransportationmodeusingarule-basedmodesplit algorithm that combined speed, trip distance and a logit model. The paper focused onestimating transportationmodesharesat the traffic zone levelof thecity,andonly lookedatcommutesbetweenworkandhome.Thesehome-work tripswereextracted througha longerobservational period of 3 weeks and was possible as the dataset was not subject toanonymizationevery24hours.Byapproximatingthehomeandworkareasasplaceswheretheuser is mostly found between 8pm – 7am and 9am- 5pm respectively, the travel times aresubsequentlyestimatedas the timedifferencebetween the latest timeone is foundathomeandtheearliesttimeoneisfoundatwork.Thisistoaccountforthefactthatitisunlikelythatausermakesacalljustbeforeleavingoruponarrival.Asaresult,theywereabletoestimatethetraveldistancesandtimesbetweenhomeandwork,andsubsequentlyfromthesetwovalues,the speed. Here, the distinction is also between driving, public transportation and walking,whereeachtriponlyconstitutesoneofthesemodes.

    Figure8FrameworkproposedbyQuetal.(2015)formodedetectionwithCDRdatathroughaspeedsplit,augmentingthedatasetwithtransportationnetworkinformationandutilityapproximations.Basedontheassumptionthat15km/histhemaximumspeedofanon-motorizedmode,Figure8 shows the speed ruleused to split the trips intohighand low speed trips.High-speed tripswhoseaveragedistancetotheunderlyingpublictransportationnetworkcountedasacartrip.Therestarefedthroughthelogitmodel.Forthelowspeedtrips,thedistinctionismadeusingtriplengthsbasedontherationalethatpeopledonotwalkformorethan3km.Thosemorethan3km are also fed through the logitmodel, which is a discrete choicemodel that predicts anindividualschoicebaseonutilityorattractiveness.Forexample,inthestudyareaofBoston,itis

  • 26

    regardedasmoreattractivetousepublic transportation inthecentralBostonregionandcarsfor the surrounding suburb region. This differs from many GPS studies using rule-basedalgorithms that usually detect slowest modes first. This can be attributed to the betterresolutionofdata,enablingmorerepresentativemeasurementsofslowerspeedsinmodeslikewalking, with lesser chances of data inaccuracies resulting in higher speeds. Linear relationsbetween census data and the predictions for each census tract are used to evaluate theperformanceof themodelandthemodeldoeswell for identifyingcarmodes,butnot for theother two. Also, while some areas observe high prediction accuracies, others have largerdeviationsfromthesurveydata.Onereasoncitedwastheconfoundingeffectsofotherfactorssuchasincomeandlandusethatmayhavecausedlargererrorsespeciallyintheirlogitmodel.

    Machinelearning

    Machinelearningissometimesusedinmodedetectionstudiesusingmobilephonedata,morespecifically, the GPS and accelerometer data collected from phone applications. However, astudy by Sohn et al., (2006) applied some of these machine learning methods on cellularnetworkdata,ormorespecificallyGSMdata.Aspecialmobileapplicationwascreatedforthisstudy to capture this data. The dataset they generated were labeled with these modes andincluded signal strength values, cell IDs, as well as the channel numbers of atmost 7 of thenearest cell towers. The method used here assumed that a user is stationary when theobservationshaveaconsistentsetoftowersandsignalstrengths,andmovingwhentherearechanges in these sets. They also found that the Euclidean distances between consecutiveobservationswereproportionalwiththespeedofmovement.In essence, a theoretical fingerprint of the signal strength and constituent cell towers werecreatedforeachobservationandfromthis,sevenfeatureswerechosentotrainthemodel.Thisincluded theEuclideandistance, correlationof signal strengths fromcommoncell towers andnumber of common cell towers between two measurements. The remaining variables werevarious descriptive statistics of Euclidean distances in the variouswindows ofmeasurements.The classifiers were trained with a boosted logistic regression technique with a single-nodedecisiontree.Overallthemodelperformedwellwithanaccuracyof85%thoughtheidentifiedmodes were only stationary, walk and drive. The signal strength information however is notavailabletothisstudyastheywerecollectedbyanappdevelopedbytheresearchers.Thereisstillvalueintheirworkintermsofimportantvariablesthatcanbeusedinourstudy.

  • 27

    Inamorerecentstudy,Asgarietal. (2016)developedanunsupervisedalgorithmthatenablesthemappingofcoarsemobilephonetracesoveramultimodal transportationnetwork,wheremobile trajectories are the observations and hidden states to be predicted are nodes of themultilayer graph. This unsupervised HMM completes the originally sparse trajectory andenriches it with the used modes by leveraging on the transportation layer type and theirtopological properties (i.e. route complexity). Transition probability predicts how likely anindividual moves from one hidden state to another using factors like edge type, speed, andlength. The model performs well and proves that using the transport network improvesperformance.Other studieshavealsoattempted tomatch theobservations to theunderlyingnetwork.

    Figure 9 Average Euclidean distance between subsequent observations during stationary, walking anddrivingperiods.(Sohnetal.,2006)

    2.5 Variableselection

    Methods like machine learning, FL systems and unsupervised clustering have proven to bepowerfultoolsforclassificationtasksinbothGPSandmobilephonedatastudies.Especiallyfordatawithhighdimensionality,oftenselectingareducedsetofrelevantvariableswouldbeideal

  • 28

    if theobjectivewas tobuild a classificationmodel for thepurposesof identification. Thiswillreducetheprocessingtimeandstoragespaceneeded.Furthermore,selectedvariablesmayalsoprovideasuggested framework for futurestudiesusingCSD.To thebestofourknowledgeofexistingworkusingCSD,thereseemtobenodocumentedcasesofvariableselectionprocesses,ordescriptionsofsuchmethods.Assuch,thispaperhasexploredafewtechniquesthatcouldberelevant toour researchaims.Twooptionsareexploredhere:RandomForest (RF),a tree-basedmodelwhereby labelsarerequiredandPrincipalComponentAnalysis (PCA)wheretheyarenot.

    RandomForest

    RandomForest is a tree-based learningmodel thathasbeen shown toperformwell inmodedetectionstudiesusingGPSdata.Labelsareused inthegenerationoftherandomforest(RF).Thesealgorithmshavethepowertohandlehighdimensionaldataandakeystrengthisthatitoutputs themost significant variables, which is especially relevant if the objective is variableselection. Another benefit of using RF is that it is able to account for and balance errors inimbalanced datasets, where one class may be disproportionately more represented thananother.Tree-basedmodelsuselabelstobuildtheirtrees,bysplittingthepopulationintotwoormorehomogenoussetsbasedonthemostimportantvariable.ThisisdecidedbyusingtheGiniindexor entropy to evaluate the quality of a particular split, and is usually used in classificationproblemsratherthanregressionones(Jamesetal.,2013).TheGiniindexisdefinedas:

    where istheproportionof individualsthathaveclasscatnoden.Gini is lowestwhenallobservationsinthenodesbelongtothesameclass,andincreasesastheobservationsofthesamenodehaveamoreevenclassdistribution.ThemeandecreaseofGiniortheinformationgainforsplittingatnodenonvariablexi,isdefinedasthedifferencebetweenimpuritiesofthenodeandtheweightedaveragesoftheirchildnodes:

    Gain(xi,n)=Gini(xi,n)-wLGini(xi,nL)–wRGini(xi,nR)

  • 29

    WherenLandnRaretheleftandrightchildnodesofparentnoden,andtheweightsassignedtotheleftandrightnodesarewLandwRrespectively.Basedonthiscalculation,variablexiwiththelowest impurity is selected to be the basis of the split at node n.An alternative to the Ginicoefficientisanothermeasureofmeandecreaseinaccuracy.Eachtreethatusesthisparticularattributewillcomputevalueseparatelyandthentheaverageofalllossofaccuracyiscalculated(Degenhardtetal.,2017).Forexample,inasampleof100childrenwithvariablesgender,heightandage,halfofthematemeat and the other did not. The most important variable of determining their meatconsumptionstatuswouldbetheonethatproducesthemosthomogenousorpuresetsafterasplitbasedonthatvariable,whereoneresultingsethasahighpercentageofnon-meateatersandtheotherhasahighpercentageofmeateaters.RFisanextensionofthesedecisiontreesinthatitgrowsmultipletrees.ThedefinitionofanRFalgorithmis“RFisaclassifierconsistingofacollection of tree-structured classifiers {h(x, k ), k = 1,...} where the {k } are independentidenticallydistributedrandomvectorsandeachtreecastsaunitvoteforthemostpopularclassatinputx.”(Breiman,2001,p.2).Eachtreewillhaveonevotethatwillcounttowardsthefinalclassification.While RF is generally considered a supervisedmachine learningmethod, it canalsobeadapted inforunsupervised learningtoderiveaproximitymatrix fromunlabeleddata(ShiandHorvath,2006).ThiswillbeelaboratedfurtherinSection4.8.Asforitsroleinvariableselection,therearevariousapproachesproposedtoidentifythemostimportantvariablesbasedonthisranking.Degenhardtetal.(2017)didacomparativestudyofthesemethodsand concluded that theBorutamethodwas themostpowerful approach, andwillbedescribed further inSection4.9.Theoverarchingconcept is toaddrandomness to thesystemandbycollectingresultsfromthissystem,thedeceptiveimpactsofrandomfluctuationsand correlations can be lessened, providing a better picture of which attributes are reallyimportant(Kursaetal.,2010).

    PrincipalComponentAnalysis(PCA)

    PCA is an unsupervised process of transforming data by plotting it on different axes so as toderive a set of smaller representative variables, or principal components (Abdi andWilliams,2010). The aim of PCA is to try and explain asmuch variation as possible in the data. Sincelabeleddata is not required, PCA is especially useful todetermine inputs tomethods such asunsupervisedclustering.Theseprincipalcomponentsareaxeswherebythedataismostspread

  • 30

    outwhenprojectedtoit.Inordertofindtheselines,onederiveseigenvectorsandvalues,whichcomeinpairs.Theeigenvectorisadirectionandtheeigenvalueisanumberrepresentinghowmuch variance there is in the data in that direction, or how spread out the data is in thatdirection.Thefirstprincipalcomponentisthustheeigenvectorwiththehighesteigenvalue,andcanalsobeseenasthelinethatisclosesttotheoriginaldata(AbdiandWilliams,2010).

    Figure10Exampleoffirsttwocomponentsbasedontwovariables,word lengthandnumberof lines indictionarydefinition.Greenlinesrepresentthevectorswhosemaindirectionleadstothemaximumsumofsquareddistancesfromthepointstothevectors(AbdiandWilliams,2010).

    PCAiscommonlyusedasadimensionreductiontechniquewherelargedatasetswithredundantvariables can be discarded without the loss of variation. Each of the resulting principalcomponents will have differing contributions from different variables. The first principalcomponentcanbedominatedbyafewvariables,andthiscanbethebasisofhowtheanalysisisinterpreted. These variables can give an indication of what key combinations of variablesaccountforahighproportionoftotalvarianceinthedata.Butbecausethedataisorthogonallytransformed onto a new coordinate system and because the values are scaled, the variables(principal components) do not explicitly represent the system-produced variables, hence

  • 31

    applyingPCA to thedata setmight cause it to lose interpretability.Nevertheless, PCA canbeusedtoselectthosevariablesthatcontainthemostinformation.KingandJackson(1999)didacomparative research study on the bestmethod of variable selection using PCA, so that thereduced subset you are left with is as representative of the original dataset as possible. Theresultsof thestudywereconclusiveandtheyrecommendedthat theB4methodworkedbestwhen complemented with the Broken-stick model criterion of number of variables to select(Jackson,1993;KingandJackson,1999).ThiswillbedescribedinSection4.9.

    2.6 Ethicalissues

    Despite overcomingmany of the shortcomings ofGPS, the use of CSD in any formhasmanyethicalimplicationsthatneedtobecarefullyconsideredbeforechargingforwardwiththeuseofthis typeofdata. Locationdata frommobilephones can reveal a lotaboutaperson,not justwhere one works and lives, but also visits. Activities like participation in protests, or alcoholconsumptionbasedonfrequencyofbeinglocatedinbarsandpubscanbeinferredandassumedofauser.Thisincludestheirschedulesaswell(Carteretal.,2015).Unsurprisingly,thisisamajorconcern, especially when such information is made available to applications that serve thirdpartiessuchascommercialcompanies(Calabreseetal.,2015).Tocombatthis,regulationsliketheGeneralDataProtectionRegulation5havebeenputinplaceinMay2018,dictatingthatthedata telecommunication companies release must be treated such that it was impossible toassociatethelocationdatawithacellphonenumber.Exceptionsincludecaseswherebyconsentoftheuserswhoaretrackedisexplicitlyconveyed.Assuch,researchersdevelopmethodsthatreflectcomplianceandthatabidebytheseregulations.Someoftheseattemptsincludelocationobsfucation, where locations are slightly altered but within the realms of being useful forservices (Krumm, 2009). Another key change in the GDPR is the strengthening to privacy bydesignprinciplesbymakinga legal requirement.Companiesarenowrequired to includedataprotectionfromtheonsetofdesigninganewsystem,ratherthananadditional featureattheend.

    5TheGeneralDataProtectionRegulation (GDPR)aims toprotectall EUcitizens fromprivacyanddatabreachesinanincreasinglydata-drivenworld.Itwasfirstestablishedin1995

  • 32

    However, despite the efforts made to deliberately encumber any form of matching of thesetrajectories to individual users, it can be argued that it is insufficient to truly protect users’privacies.Whileit isthepredictabilityandrepetitivenessofhumanbehaviorthatmakemobilephonedatavaluableintermsofmobilityresearch,itistheverysamesetoftraitsthatmakesitdifficult to completely anonymise this data. A recent study found that just 4 spatiotemporalobservationsarerequiredtouniquelyidentify95%oftheindividualsintheirtests(deMontjoyeet al., 2013). As such, moving forward, more complex techniques should be designed andimplementedtoprotectindividualprivacy.Furthermore,thereisalsotheissueof“groupprivacy”where people can be targeted on the basis of the social group they belong to. For example,certain groups may be represented more in mobile phone datasets depending on their age,gender, ethnicity etc. as indirect reasons for their level ofmobile phone activity at particulartimesorparticularplaces(Calabreseetal.,2015).Implicationsofbeingrecognizedasaresultofidentifyingwithaparticularsocialgroupstarttobecomeaconcern(Letouzéetal.,2015).Weacknowledgethattheseconstraintstoindividualprivacyareimportantissuestoconsider.Inthe years to come, it is expected that the scrutiny on datamining and its associated privacyconcernswillcontinuetoincrease.Usersofthisdatamustbesensitivetotheirmethodologiesandhowlegalprivacy issuesmight impactthem(Calabreseetal.,2015).Withrisingconsumerconcern, there might be legal challenges that this field runs into if these concerns are notadequatelyaddressedwiththeimplementationofmoreeffectivedesignframeworksdevotedtoprivacyprotection.

    2.7 SummaryandResearchGaps

    Amajormotivationofpursuingthisresearch isthatCSDisapassivedatatype.Whencarryingoutimportanturbanmovementanalyseswhoseresultswillhaveanimpactondecisionsmadebycityandtransportplanners,thisdatamustbeasrepresentativeofthepopulationinquestionaspossible.Sincecellularnetworkdataalreadyexistsandmuchofthepopulationalreadycarrypersonalmobiledevices,th