Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
VisualDataQualityAnalysisforTaxiGPSDataZuchaoWang1,3 Xiaoru Yuan1 Tangzhi Ye1 Youfeng Hao1 Siming Chen1 Jie Liang1 Qiusheng Li2 HaiyangWang2 Yadong Wu2
1) KeyLaboratoryofMachinePerception(MinistryofEducation),andSchoolofEECS,PekingUniversity,Beijing,P.R.China2) SchoolofComputerScienceandTechnology,SouthwestUniversityofScienceandTechnology,Sichuang,P.R.China
3) NowatNetworkSecurityResearchLab,Qihoo 360Co.Ltd.,Beijing,P.R.China
WhatQualityProblemsinThisData?• Anysystematicwaytodetectqualityproblems?
• Anywaytodetectunknownproblems?
• Anywaytodiscoverstrangeshapes?
[email protected]://vis.pku.edu.cn
FundingThisworkissupportedbyNSFCNo.61170204,andpartiallyfundedbyNSFCKeyProjectNo.61232012andtheNationalProgramonKeyBasicResearchProject(973Program)No.2015CB-352500.ThisworkisalsosupportedbyPKU-Qihu JointDataVisualAnalyticsResearchCenter.
• Time:Mar4th,2009,7-9am(2hours)
• Taxis:13,080 (50%randomsample)
• Points:747,431• Datasize:74.2MB• Sampling:30secs/point• Partition:5minpieces
SeeWhatWeDiscoveredinaSample!BeijingtaxiGPSdata Find8qualityproblems
• Oversampling• Datamissing• Duplication• Stopage• Longjump• V-shapejump• Back-forthjump• Randommovement
HowDoWeDiscoveredThem?
Normal Back-forthjump
Oversampling Randommovement
Interactivediscoveryofpotentialproblems• Projecttrajectorypiecesintospatial,temporal&feature spaces• Detectdataoutliersandclusters(potentiallyproblematic)
• #samplingpoints,distance,sum(turningangles)• min/max/recsum/logsumofturningangle• min/max/recsum/logsumofsegmentdistance• min/max/recsum/logsumofsegmenttimeinterval• min/max/recsum/logsumofsegmentaveragespeed
9:35:029:35:32
9:39:029:39:32
Featurespacedefinition(19features)
Showingdatadistributionfromdifferentaspects
Visualconfirmationofsuspecteddata
TemporalSpatialFeatureFeature
• Showthespatialshapeandpoint/segmentattributechange• Helpexpertusersconfirmwhetherthere’saproblem
Automaticextractionofdatawithconfirmedproblems• BinarySVMclassifierstoextractproblematicdata• Classifiertrainingwithinteractivelabeling
• Activelearning:labeldatathatbestoptimizeclassifiers• Similaritysearch:labeldatasimilartoproblematicones
NextStep? • Selectfeaturessystematically• Considerinteractionsamongdifferentqualityproblems• Improvescalability