1
Visual Data Quality Analysis for Taxi GPS Data Zuchao Wang 1,3 Xiaoru Yuan 1 Tangzhi Ye 1 Youfeng Hao 1 Siming Chen 1 Jie Liang 1 Qiusheng Li 2 Haiyang Wang 2 Yadong Wu 2 1) Key Laboratory of Machine Perception (Ministry of Education), and School of EECS, Peking University, Beijing, P.R. China 2) School of Computer Science and Technology, Southwest University of Science and Technology, Sichuang, P.R. China 3) Now at Network Security Research Lab, Qihoo 360 Co. Ltd., Beijing, P.R. China What Quality Problems in This Data? Any systematic way to detect quality problems? Any way to detect unknown problems? Any way to discover strange shapes? Contact [email protected] http://vis.pku.edu.cn Funding This work is supported by NSFC No. 61170204, and partially funded by NSFC Key Project No. 61232012 and the National Program on Key Basic Research Project (973 Program) No. 2015CB- 352500. This work is also supported by PKU-Qihu Joint Data Visual Analytics Research Center. Time: Mar 4 th , 2009, 7-9 am (2 hours) Taxis: 13,080 (50% random sample) Points: 747,431 Data size: 74.2 MB Sampling: 30 secs/point Partition: 5 min pieces See What We Discovered in a Sample! Beijing taxi GPS data Find 8 quality problems Over sampling Data missing Duplication Stopage Long jump V-shape jump Back-forth jump Random movement How Do We Discovered Them? Normal Back-forth jump Over sampling Random movement Interactive discovery of potential problems Project trajectory pieces into spatial, temporal & feature spaces Detect data outliers and clusters (potentially problematic) #sampling points, distance, sum(turning angles) min/max/recsum/logsum of turning angle min/max/recsum/logsum of segment distance min/max/recsum/logsum of segment time interval min/max/recsum/logsum of segment average speed 9:35:02 9:35:32 9:39:02 9:39:32 Feature space definition (19 features) Showing data distribution from different aspects Visual confirmation of suspected data Temporal Spatial Feature Feature Show the spatial shape and point/segment attribute change Help expert users confirm whether there’s a problem Automatic extraction of data with confirmed problems Binary SVM classifiers to extract problematic data Classifier training with interactive labeling Active learning: label data that best optimize classifiers Similarity search: label data similar to problematic ones Next Step ? Select features systematically Consider interactions among different quality problems Improve scalability

Visual Data Quality Analysis for Taxi GPS Datasimingchen.me/docs/vast15_cleaning_poster.pdf352500. This work is also supported by PKU-Qihu Joint Data Visual Analytics Research Center

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

  • VisualDataQualityAnalysisforTaxiGPSDataZuchaoWang1,3 Xiaoru Yuan1 Tangzhi Ye1 Youfeng Hao1 Siming Chen1 Jie Liang1 Qiusheng Li2 HaiyangWang2 Yadong Wu2

    1) KeyLaboratoryofMachinePerception(MinistryofEducation),andSchoolofEECS,PekingUniversity,Beijing,P.R.China2) SchoolofComputerScienceandTechnology,SouthwestUniversityofScienceandTechnology,Sichuang,P.R.China

    3) NowatNetworkSecurityResearchLab,Qihoo 360Co.Ltd.,Beijing,P.R.China

    WhatQualityProblemsinThisData?• Anysystematicwaytodetectqualityproblems?

    • Anywaytodetectunknownproblems?

    • Anywaytodiscoverstrangeshapes?

    [email protected]://vis.pku.edu.cn

    FundingThisworkissupportedbyNSFCNo.61170204,andpartiallyfundedbyNSFCKeyProjectNo.61232012andtheNationalProgramonKeyBasicResearchProject(973Program)No.2015CB-352500.ThisworkisalsosupportedbyPKU-Qihu JointDataVisualAnalyticsResearchCenter.

    • Time:Mar4th,2009,7-9am(2hours)

    • Taxis:13,080 (50%randomsample)

    • Points:747,431• Datasize:74.2MB• Sampling:30secs/point• Partition:5minpieces

    SeeWhatWeDiscoveredinaSample!BeijingtaxiGPSdata Find8qualityproblems

    • Oversampling• Datamissing• Duplication• Stopage• Longjump• V-shapejump• Back-forthjump• Randommovement

    HowDoWeDiscoveredThem?

    Normal Back-forthjump

    Oversampling Randommovement

    Interactivediscoveryofpotentialproblems• Projecttrajectorypiecesintospatial,temporal&feature spaces• Detectdataoutliersandclusters(potentiallyproblematic)

    • #samplingpoints,distance,sum(turningangles)• min/max/recsum/logsumofturningangle• min/max/recsum/logsumofsegmentdistance• min/max/recsum/logsumofsegmenttimeinterval• min/max/recsum/logsumofsegmentaveragespeed

    9:35:029:35:32

    9:39:029:39:32

    Featurespacedefinition(19features)

    Showingdatadistributionfromdifferentaspects

    Visualconfirmationofsuspecteddata

    TemporalSpatialFeatureFeature

    • Showthespatialshapeandpoint/segmentattributechange• Helpexpertusersconfirmwhetherthere’saproblem

    Automaticextractionofdatawithconfirmedproblems• BinarySVMclassifierstoextractproblematicdata• Classifiertrainingwithinteractivelabeling

    • Activelearning:labeldatathatbestoptimizeclassifiers• Similaritysearch:labeldatasimilartoproblematicones

    NextStep? • Selectfeaturessystematically• Considerinteractionsamongdifferentqualityproblems• Improvescalability