DOI: 10.4018/IJDWM.2019100103
International Journal of Data Warehousing and Mining
Volume 15 • Issue 4 • October-December 2019
Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
A Comparative Study of Data Cleaning Tools
Samson Oni, University of Maryland Baltimore County, USA
Zhiyuan Chen, University of Maryland Baltimore County, USA
Susan Hoban, University of Maryland, Baltimore County, USA
Onimi Jademi, University of Maryland, Baltimore County, USA
ABSTRACT
In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of the time and resources of the data analyst. Adequate tools and techniques must be used for data cleaning. There exist a lot of data cleaning tools, but it is unclear how to choose among them in various situations. This research aims at helping researchers and organizations choose the right tools for data cleaning. This article conducts a comparative study of four commonly used data cleaning tools on two real data sets and answers the research question of which tool will be useful in different scenarios.
KEYWORDS
Big Data, Data Cleaning, Data Cleansing, Data Fusion, Data Quality, Data Wrangler, Dirty Data, Open Refine
INTRODUCTION
Data is constantly being produced in every sector. However, data is produced in many forms and at various levels of quality, and some data may have poor quality. Data cleaning, sometimes called data scrubbing or data cleansing, is the detection and removal of errors and inconsistencies from data with the aim of improving data quality. In Big Data processing, data cleaning is a critical step prior to data processing and maintenance (Müller & Freytag, 2005). Data cleaning is important both for data from a single source and for data from multiple sources. Data cleaning is an essential step for the data fusion process, which is the process of merging data from multiple sources (Haghighat, Abdel-Mottaleb, & Alhalabi, 2016). Fusing poor quality data from various sources will cause more issues afterwards. Therefore, adequate cleaning of data from various sources before integration will have a significant impact on the outcome of data fusion.
Cleaning data requires identifying incorrect, invalid or duplicate entries. The quality of data is determined by the degree to which the data in question meets specific needs, which in any case will be higher as the data becomes cleaner (Kandel, Paepcke, Hellerstein, & Heer, 2011). Validity, completeness, accuracy and precision are the measures of data quality (Kandel et al., 2011). The importance of accurate and correct data for the fusion/ETL process cannot be overemphasized.
Data analysts also spend a great deal of time and resources trying to fix data quality problems. Dasu and Johnson (Dasu & Johnson, 2003) emphasized the rule of thumb that more than eighty percent (80%) of the time on a data analysis project is spent on cleaning and preprocessing.
Although there are many data cleaning tools, they often have distinctive features. They also require distinct levels of skill to use and have different costs and learning curves. Determining the best tool for any given cleaning task depends on many factors. However, in practice, users are often not experts on data cleaning tools and technologies, so there is a great need to provide some guidance on how to choose data cleaning tools.
The objective of this paper is to analyze four popular data cleaning tools and determine which tools are appropriate for various scenarios. This paper compares the features of these tools and their performance on cleaning the same data set. Two data sets were used for this experiment. The results may help users choose appropriate data cleaning tools.
This paper makes the following contributions:
• Compares the performance of four data cleaning tools on two real world data sets. The metrics include their features, required platforms and skill level, time of completion, ease of implementation/usage, etc.
• Proposes a guideline for choosing data cleaning tools.
The rest of the paper is organized as follows. A background study is presented first, followed by an overview of various aspects of data cleaning. The methodology section describes the methodology used for the comparison study. The results section describes the results of the study. The discussion and conclusion section presents the guidelines for choosing data cleaning tools and concludes the paper.
LITERATURE REVIEW
There has been a lot of work on data cleaning. The work can be roughly divided into two categories: work on methods to address specific data quality issues and work on more general tools or frameworks that can address multiple data quality issues.
Work on specific data quality issues: Lee et al. (Lee, Lu, Ling, & Ko, 1999) presented several techniques to preprocess records before sorting them so that potentially matching records will be brought close together. Using these techniques, they implemented a data cleaning system that can detect and remove duplicate records.
Various methods of handling missing data were discussed by Luján-Mora (Martinez-Mosquera et al., 2017). The authors proposed algorithms used in an analysis of an incomplete data set. The authors proposed multiple imputation methods, including regression imputation (filling in missing data with values predicted by a regression model) and single hot deck imputation (replacing the missing values with those obtained from similar objects from the same experiments).
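As a rough illustration of regression imputation as described above, the following Python sketch fits a least-squares line on the complete rows and uses it to fill the gaps. The field names `staff` and `students` and the sample values are hypothetical, and the single-predictor linear model is an assumption made for brevity; it is not the algorithm of the cited authors.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_impute(rows, x_key, y_key):
    """Fill missing y values (None) with predictions from the complete rows."""
    known = [(r[x_key], r[y_key]) for r in rows if r[y_key] is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [
        {**r, y_key: slope * r[x_key] + intercept} if r[y_key] is None else r
        for r in rows
    ]


# Hypothetical sample: the third row lacks a student count.
rows = [
    {"staff": 1, "students": 3},
    {"staff": 2, "students": 5},
    {"staff": 3, "students": None},
    {"staff": 4, "students": 9},
]
filled = regression_impute(rows, "staff", "students")
```

Hot deck imputation would instead copy the value from a similar complete record; which of the two is preferable depends on how strongly the fields are related.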
General tools or frameworks: Martinez-Mosquera et al. (Martinez-Mosquera, Luján-Mora, López, & Santos, 2017) looked at modeling data cleaning for Big Data analysis based on previous research for modeling ETL processes using the Unified Modeling Language (UML). They presented two use cases, one for modeling the data cleaning process for web logs and the other for modeling the cleaning process for security logs.
Galhardas et al. (Galhardas, Florescu, Shasha, & Simon, 2000) developed a data cleaning framework called AJAX. Their approach separates the physical and logical levels of data cleaning. The logical level supports the design of the data cleaning workflow and the physical level implements the data cleaning workflow. This framework transforms existing data from one or more data collections to a target schema while eliminating duplicates.
Luján-Mora et al. (Martinez-Mosquera et al., 2017) proposed a technique for data cleaning that can be used for checking data quality issues in security logs. They used predefined rules to combine log data and scan for issues. Detected issues are then corrected before the data set is analyzed. However, their work focused on a specific data type: security logs.
Kandel et al. (Kandel et al., 2011) described Data Wrangler as a tool for the interactive cleaning of data using visual specifications of data transformation scripts. They explained that Data Wrangler combines direct manipulation of visualized data with automatic inference of relevant transformations, which in turn cleanse the data.
Karrar and Ali (Karrar & Ali, 2016) conducted a comparative analysis of SQL Server and Winpure tools using academic and weather data sets. They analyzed two data cleaning tools, while our approach used four tools that are more commonly used in the industry for data cleaning. Porwal and Vora (Porwal & Vora, 2013) also carried out a comparative analysis of two data cleaning algorithms: the Alliance rule algorithm and the Hadclean algorithm. However, they did not compare more full-fledged data cleaning tools. They also did not consider other factors such as usability of the tools.
In a nutshell, there has been a lot of research on the need for data cleaning as well as on data cleaning techniques, tools, and frameworks. However, there is not much work on comparing and choosing data cleaning tools and on the criteria to consider. This paper conducts a comparative study of four commonly used data cleaning tools.
OVERVIEW OF DATA CLEANING
This section describes types of data, the data quality issues considered in this paper, general steps of data cleaning for a single source, and general steps of data cleaning for multiple sources.
Types of Data
With respect to categorization of data, there are three types of data: structured, semi-structured and unstructured data.
1. Structured data: This type of data has a high degree of organization and adheres to a predefined data model. One example is data in a relational database.
2. Semi-structured data: This type of data does not fit into a relational database but has some form of organization for easy analysis. An example is XML data.
3. Unstructured data: This type of data is not organized and does not have a predefined model. Unstructured data is not a good fit for a relational database. Examples are text, PDF files and images.
Irrespective of the type of data, poor data quality will lead to poor analysis and decisions.
Data Quality Issues
Data quality issues can come in various forms, ranging from duplicate data, missing data and errors (like spelling student as stdent) to inconsistent formats. Below are several types of data quality issues considered in this paper.
1. Misspelled data: For example, one column has ‘student’ while another has ‘stdent’, which is misspelled.
2. Duplicate data/records (Table 1): This is when the same information is entered or duplicated in a data set or database.
3. Irrelevant data: This is data in the data set that is not relevant to the work. This kind of data needs to be removed.
4. Mixed ranges: Sometimes data is measured in ranges, e.g., salary, age. Ranges need to be represented consistently and appropriately in the data.
5. Mixed numerical scales: This issue arises when different numerical scales are used within the data set. For example, one million can be represented as 1m while one billion can be represented as 1B, but a computer may not easily interpret this.
6. Multiple representation: Representing the same piece of information in different forms that have the same meaning can cause problems within a data set. For example, using multiple representations for the country United States (i.e., U.S.A, US, United States, United States of America). All these representations mean the same, but using a mixture of several representations for the same information within a data set will cause trouble for analysis.
7. Wrong date format: Different date formats are used in data today, but the mixture of several date formats in one data set can be troublesome. Examples of different formats are 2/12/2018, February 12, 2018, and 2-12-2018. The three dates mean the same, but their presentation differs. Another example of date inconsistency is a mixture of the American (MM/DD/YYYY) and European (DD/MM/YYYY) formats. In the American format the date will be written as 2/12/2018 to mean the 12th of February 2018, while Europeans will represent the same date as 12/2/2018, starting with the day.
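Two of the issues above, multiple representations and mixed date formats, lend themselves to a short illustration. The Python sketch below uses only the standard library; the alias table, the list of accepted date formats and the choice to read ambiguous dates in the American order are all assumptions made for the example, not part of the study.

```python
from datetime import datetime

# Hypothetical alias table; a real project would curate its own.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
    "united states of america": "United States",
}

# Formats are tried in order; reading "2/12/2018" month-first is an assumption.
DATE_FORMATS = ("%m/%d/%Y", "%m-%d-%Y", "%B %d, %Y")


def normalize_country(value):
    """Map known spellings of a country to one canonical label."""
    key = value.replace(".", "").strip().lower()
    return COUNTRY_ALIASES.get(key, value.strip())


def normalize_date(value):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` for an unrecognized date, rather than guessing, leaves the decision to a human reviewer, which matters for values that are ambiguous between the American and European orders.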
General Steps of Data Cleaning for a Single Source
There are several phases involved in data cleaning for a single data source.
• Detect errors and inconsistencies in the data to remove.
• Verify that the error is really an error, not a special feature of the data set (Rahm & Do, 2000). This often requires human interaction.
• Extract erroneous records to a new temporary table.
• Perform cleaning operations on the data in that temporary table.
Most of these processes are already built into different data cleaning tools, as doing this manually would cost a lot of time and resources.
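The phases listed above can be sketched in a few lines of Python. The rows mirror Table 2; the reference list of countries and the correction table are hypothetical stand-ins for the human verification step, and the whole block is a simplified sketch rather than the procedure of any particular tool.

```python
records = [
    {"CampusId": "VH609042", "First Name": "Samson", "Last Name": "James",
     "Country": "Nigria", "Sex": "M"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "India", "Sex": "F"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "India", "Sex": "F"},
]

KNOWN_COUNTRIES = {"Nigeria", "India"}   # hypothetical reference list
CORRECTIONS = {"Nigria": "Nigeria"}      # assumed to be confirmed by a reviewer


def split_suspects(rows):
    """Detect: move rows with an unknown country to a temporary list."""
    clean, suspects = [], []
    for row in rows:
        (clean if row["Country"] in KNOWN_COUNTRIES else suspects).append(row)
    return clean, suspects


def repair(rows):
    """Clean: apply the verified corrections to the extracted rows."""
    return [{**r, "Country": CORRECTIONS.get(r["Country"], r["Country"])}
            for r in rows]


def drop_duplicates(rows):
    """Remove rows that are identical in every field, keeping the first."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


clean, suspects = split_suspects(records)
cleaned = drop_duplicates(clean + repair(suspects))
```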
General Steps of Data Cleaning for Multiple Sources
With multiple data sources, each data source may contain dirty data. In addition, data from one source may contradict or overlap with data from other sources. The process of merging these data is also known as data fusion.
Table 2 depicts a single data source that has a few data quality issues, including misspelling (Nigeria was spelled as Nigria) and duplicated data.
Table 3 and Table 4 show two data sources that need to be integrated. Each data source may have some data quality issues (e.g., Nigeria is misspelled in source 2). Some of these issues can be addressed in the individual source, but others can only be addressed during and after integration.
Table 5 shows integrated data in which cleaning was done in the individual sources alone. Issues like misspelling are addressed, but redundancy and overlapping are not. For example, we have columns for Name, first name and last name, and columns for sex and gender. Table 6 shows clean integrated data where these issues are addressed.
Table 1. An example of a duplicate record

| No | First Name | Last Name | Age | Sex | Phone No |
|---|---|---|---|---|---|
| 1 | Samson | James | 19 | M | 202-298-2014 |
| 2 | Samson | James | 19 | M | 202-298-2014 |
Aside from the issue of data overlapping associated with multiple sources, naming and structural conflicts may occur (Batini, Lenzerini, & Navathe, 1986) (Parent & Spaccapietra, 1998). Naming conflicts arise when different names are used for the same objects across sources, or when the same name is used for different objects across sources. Structural conflicts, meanwhile, occur when the same object is represented differently in different data sources.
Table 2. Data quality issues in a single data source

| CampusId | First Name | Last Name | Country | Sex |
|---|---|---|---|---|
| VH609042 | Samson | James | Nigria | M |
| XV503267 | Jane | Mark | India | F |
| XV503267 | Jane | Mark | India | F |
Table 3. Data in source 1

| CampusId | Name | Address | Sex | Date of Birth |
|---|---|---|---|---|
| VH609042 | Samson James | 100 IRC 21222, MD | 1 | 12-01-1989 |
| XV503267 | Jane Mark | 123 Ocean street, 21223, MD | 0 | 02-01-1988 |
Table 4. Data in source 2

| CampusId | First Name | Last Name | Country | Gender | Course |
|---|---|---|---|---|---|
| VH609042 | Samson | James | Nigria | M | 10 |
| XV503267 | Jane | Mark | | F | 10 |
Table 5. Integrated data with data cleaning in individual sources only

| CampusId | Name | Address | Date of Birth | Sex | First Name | Last Name | Country | Gender | Courses |
|---|---|---|---|---|---|---|---|---|---|
| VH609042 | Samson James | 100 IRC 21222, MD | 12-01-1989 | 1 | Samson | James | Nigeria | M | 10 |
| XV503267 | Jane Mark | 123 Ocean street, 21223, MD | 02-01-1988 | 0 | F | Mark | India | F | 10 |
Table 6. Clean integrated data

| CampusId | FirstName | LastName | Sex | Address | Courses | Date of Birth | Country |
|---|---|---|---|---|---|---|---|
| VH609042 | Samson | James | M | 100 IRC 21222, MD | 10 | 12-01-1989 | Nigeria |
| XV503267 | Jane | Mark | F | 123 Ocean street, 21223, MD | 10 | 02-01-1988 | India |
So the general steps to clean data from multiple sources include: 1) clean data at each source; 2) integrate the data; 3) address data quality issues in the integrated data.
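These three steps can be sketched in Python on rows modeled after Table 3 and Table 4. The mapping of the numeric sex codes to M and F is an assumption inferred from the two tables, the correction table is hypothetical, and the missing country is deliberately left empty rather than guessed.

```python
source1 = [
    {"CampusId": "VH609042", "Name": "Samson James", "Sex": 1,
     "Date of Birth": "12-01-1989"},
    {"CampusId": "XV503267", "Name": "Jane Mark", "Sex": 0,
     "Date of Birth": "02-01-1988"},
]
source2 = [
    {"CampusId": "VH609042", "First Name": "Samson", "Last Name": "James",
     "Country": "Nigria", "Gender": "M"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "", "Gender": "F"},
]

SEX_CODES = {1: "M", 0: "F"}          # assumed coding, inferred from the tables
COUNTRY_FIXES = {"Nigria": "Nigeria"}  # hypothetical verified correction


def clean_source2(rows):
    """Step 1: fix misspellings inside the source; empty country becomes None."""
    return [{**r, "Country": COUNTRY_FIXES.get(r["Country"], r["Country"]) or None}
            for r in rows]


def integrate(rows1, rows2):
    """Steps 2 and 3: join on CampusId, then collapse overlapping columns."""
    by_id = {r["CampusId"]: r for r in rows2}
    merged = []
    for r1 in rows1:
        r2 = by_id.get(r1["CampusId"], {})
        first, _, last = r1["Name"].partition(" ")
        merged.append({
            "CampusId": r1["CampusId"],
            "FirstName": r2.get("First Name") or first,
            "LastName": r2.get("Last Name") or last,
            "Sex": r2.get("Gender") or SEX_CODES.get(r1["Sex"]),
            "Date of Birth": r1["Date of Birth"],
            "Country": r2.get("Country"),
        })
    return merged


integrated = integrate(source1, clean_source2(source2))
```

Collapsing Name into FirstName and LastName, and Sex and Gender into a single column, is what distinguishes the clean result from an integration that only cleans each source separately.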
METHODOLOGY
This section describes the methodology of the comparative study, including the data sets, the four data cleaning tools, and the data cleaning tasks.
Data Sets
In this work, we used a data set on atmospheric and climate research from the U.S. Department of Energy website (www.arm.gov) and a data set about universities (universityData) extracted from Wikipedia. The data from the U.S. Department of Energy website is the Atmospheric Radiation Measurement (ARM) user facility data collected through scientific experiments and routine operations. The observations were made every half an hour. The University data set gives an overview of different universities: when they were established, the number of faculty, staff and students currently enrolled, as well as the total endowment amount each university currently possesses. The information included in the data set is explained in Table 7.
The University data set has 10 variables (p=10), contains over 75,000 records (n=75,043) and is saved as CSV. The ARM data set has 15 variables (p=15), contains over 12,000 records (n=12,762) and is saved as CSV. Table 8 shows the columns in the University data.
Data Quality Issues in Data Sets
Figure 1 and Figure 2 show screenshots of these two data sets, respectively. The two data sets used for the experiments are very messy and have several data quality issues:
Table 7. Properties of data sets

| File Name | No. of Records | No. of Fields | Missing Values | Duplicate Records |
|---|---|---|---|---|
| UniversityData | 75,000 | 10 | 7.89% | 32.7% |
| ARMData | 12,762 | 15 | 27.6% | 0% |
Table 8. Columns of the University data

| Description of Variable | Variable Name in Dataset |
|---|---|
| Name of University | University |
| The monetary amount of endowment the school has | Endowment |
| The total number of faculty employed by school | NumFaculty |
| Number of Doctoral | NumDoctoral |
| Country where the school exists | Country |
| The total number of staff members in the school | NumStaff |
| The year the school was established | Established |
| Number of Postgraduate students | NumPostgrad |
| Number of Undergraduate students | NumUndergrad |
| Total number of all students enrolled | NumStudents |
Figure 1. Screenshot of UniversityData.csv opened with Excel
Figure 2. Screenshot of ARM data opened with Excel
• Inconsistent date values. The University data contains different date formats, while in the ARM data the date column holds both date and time, which have to be separated.
• Inconsistency in abbreviations and terms used, e.g., to indicate United States of America (USA) as the country, some records use terms like US, USA and United States. These might be considered different countries without data cleaning.
• Mixture of numerical and text values.
• Missing information: The University data set contains several NA values, which is not unusual for any form of data set but might be problematic when carrying out data analysis on the data set, while the ARM data has many missing records.
• Values in the University data set are separated by an inconsistent number of double quotes, while those in the ARM data are separated by a space, which does not meet the comma requirement of CSV.
• Duplicate records: The data set is supposed to have only one entry for each university instance. However, as seen in the data, some universities have several entries, with all data sometimes being the same and sometimes having variations, e.g., Lamar University has 33 entries, but there are variations in the values of the last variable, with some showing 13773, 14388 and 14522.
• Outlier records: The ARM data set has some values in some columns far beyond the normal range. That can be problematic while carrying out analysis.
• Missing records in a sequence: The ARM data set was collected at intervals of half an hour. There are many missing records for certain times.
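The last issue in the list, missing records in a half-hourly sequence, can be detected by comparing the observed timestamps against the expected grid. The Python sketch below is one simple way to do so; the function name and the sample timestamps are hypothetical and are not taken from the ARM data.

```python
from datetime import datetime, timedelta


def missing_timestamps(observed, step=timedelta(minutes=30)):
    """Return the expected timestamps that are absent from `observed`."""
    if not observed:
        return []
    present = set(observed)
    current, last = min(observed), max(observed)
    missing = []
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# Hypothetical readings with a gap between 00:30 and 02:00.
observed = [
    datetime(2019, 1, 1, 0, 0),
    datetime(2019, 1, 1, 0, 30),
    datetime(2019, 1, 1, 2, 0),
]
gaps = missing_timestamps(observed)
```

The sketch only finds gaps between the first and last observation; how to fill them (imputation, inference from neighboring records, or leaving them empty) remains a decision for the analyst.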
The purpose of this study is to use several data cleaning tools on the same data sets in order to compare these tools. Through this study, we anticipate gaining better insight into how these different tools work and into the strengths and weaknesses of each tool, as well as coming up with valuable suggestions and discussions about the future of data cleaning techniques and tools.
Tools Used
For this study, we used four different data cleaning tools, namely OpenRefine, R, Python and Data Wrangler. These tools are among the most popular tools used for data cleaning in the real world. OpenRefine, R and Python are open source, which makes them easily accessible for use. Data Wrangler is a commercial tool but has a community version which does a good job of data cleaning. The tools used are described below:
• OpenRefine: OpenRefine (Verborgh & De Wilde, 2013) is a web-based, stand-alone, open source application for data cleanup and transformation to other formats. It operates on rows of data that have cells under columns, which is very similar to relational tables. This tool cleans, reshapes and edits batch, unstructured and messy data. It was formerly known as Google Refine and was called Freebase Gridworks before that. Operations in OpenRefine include faceting (allowing users to narrow down results through several different dimensions), clustering, and reconciling, which all help in the data cleaning process. It also analyzes the data through filtering, faceting and converting the data into a more structured form.
OpenRefine is a standalone application that has a web interface. It is not hosted on the web but can be downloaded and run on the local machine. In other words, it is a desktop application that opens in a browser as a local web server.
Transformation expressions can be written in General Refine Expression Language (GREL), Jython (i.e., Python) and Clojure. Since it is an open source project, its code can be reused in other projects.
OpenRefine carries out cleaning tasks through filtering and faceting, and then converts the data into a more structured format.
• Data Wrangler: Data Wrangler (Kandel et al., 2011) is a Stanford University project that helps analysts clean and prepare diverse, messy data quickly and accurately. It is an interactive tool for data cleaning. Data Wrangler can work with data in two ways. Users can simply paste the data into its web interface, or they can use the web interface to export the operations as Python code and process arbitrary amounts of data. The web interface uses JavaScript and therefore has some performance issues and only supports up to 1000 rows, but users can use it to configure Data Wrangler on a subset of the data and then apply the configuration to the whole data set. The most recent version of this tool is called Trifacta Wrangler.
For the experiment we imported our data into Data Wrangler and the application began to automatically organize and structure our data set. This tool contains strong machine learning algorithms that help suggest common cleaning operations as well as common transformations and aggregations. Data Wrangler allows a mixture of numerical and text values.
• Python: Python is another tool that can be used for data cleaning. It has several modules that can be used to carry out cleaning. One powerful module in Python that is used for data cleaning is Pandas (Python data analysis toolkit). This module is primarily for data analysis, of which data cleaning is a part. Another module in Python that can be useful when carrying out cleaning is the NumPy module. This module is used for scientific computing with Python. It has a powerful N-dimensional array object that is useful for large data sets.
• R: R is a programming language used for statistical computation (John et al., 2016). It has been widely used for data analysis. R has a set of tools that are designed to clean data effectively and comprehensively. The R environment has the capacity to read data in several formats and process these files.
In the cleaning of data using R, four simple steps can be taken, for which R provides great resources:
1. Reading data: R provides adequate resources for reading practically any format into a data frame.
2. Exploratory analysis: After reading the data, users often conduct an initial exploration of the data frame.
3. Exploratory analysis in visual form: During cleaning it is useful to visualize data at each stage. R provides adequate visualization tools. Three powerful visualizations that can be useful during data cleaning are the boxplot, the histogram and the scatterplot.
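A boxplot conventionally marks as outliers the points lying more than 1.5 interquartile ranges beyond the quartiles. The same rule can be applied numerically; the sketch below does so in Python, one of the four tools compared, using the standard `statistics` module. The readings are hypothetical, and the 1.5 multiplier is the common convention rather than a value taken from this study.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sensor readings
flagged = iqr_outliers(readings)
```

Flagged values still need human review, since an extreme reading may be a genuine observation rather than an error.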
The Data Cleaning Tasks
In the experiment the following data cleaning tasks were conducted.
• Dealing with typographical errors or multiple representations:
◦ Cleaning up inconsistent spelling of terms (e.g., “USA”, “U.S.A”, “U.S.”, etc.).
◦ Converting values that are text descriptions of numeric values (e.g., $123 million) to actual numeric values (e.g., 123000000) which are usable for analysis.
◦ Extracting and cleaning values for dates.
• Identifying which rows of a specific column contain a search term.
• Removing duplicate data.
• Separating date and time.
• Handling outliers.
• Handling missing records in a sequence. Here, after using the tool to discover the missing records in a sequence, the user decides how to replace missing values, either by imputation, inference from other records or another method decided by the user.
• Exporting cleaned data to several formats.
• Handling missing fields, duplicate records, inconsistent formats.
• Batch editing of rows and columns.
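One of the tasks above, converting a text description such as $123 million into a plain number, can be sketched in Python with a regular expression. The set of accepted scale words and abbreviations is an assumption made for this example; real endowment fields may need more cases.

```python
import re
from decimal import Decimal

# Hypothetical scale words; extend as the data requires.
SCALE = {
    "": 1,
    "thousand": 10**3, "k": 10**3,
    "million": 10**6, "m": 10**6,
    "billion": 10**9, "b": 10**9,
}

_AMOUNT = re.compile(r"^\s*\$?\s*([\d,]*\.?\d+)\s*([A-Za-z]*)\s*$")


def parse_amount(text):
    """Convert strings like '$123 million' to an int, or None if unparseable."""
    match = _AMOUNT.match(text)
    if not match:
        return None
    number, unit = match.groups()
    factor = SCALE.get(unit.lower())
    if factor is None:
        return None
    return int(Decimal(number.replace(",", "")) * factor)
```

`Decimal` is used instead of `float` so that values such as 1.5 billion convert without rounding surprises.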
Two users with advanced programming skills completed these data cleaning tasks using the four tools. For each data cleaning task, the users applied a data cleaning tool to fix the data quality issues listed in the task. They manually checked the results and repeated the cleaning task until they could find no more related quality issues in the data. The order of applying each tool for each task was randomized to avoid bias introduced by the order.
When comparing the four tools, we focused on the following criteria: key features, platform, scalability, skill level needed, time of completion and ease of implementation.
RESULTS
For each tool we describe its key features, platform, skill level needed, time of completion, ease of implementation, advantages and disadvantages, and accuracy.
Key Features
OpenRefine: It has the following key features:
• Importing data from various data sources, with support for the following formats: CSV, TSV, .xls, .xlsx, JSON, XML, RDF as XML and Google document. Figure 9 shows a screenshot of importing the university data using OpenRefine.
• Facets and filters: OpenRefine allows users to use facets and filters to filter data into subsets for easy usage. This can be done for number, text and date columns. For example, for the University data, if a user facets data on the gender column we will get 2 in female and 1 in male. If the user selects female, then it will show the two rows with female.
• Support for expressions that can be used to create new data from existing data or transform existing data.
• Reconciliation: Reconciliation matches text names or values in the columns to database identifiers in various database ID spaces. It helps resolve inconsistent spelling issues. For example, US, USA and United States can be matched to United States of America. Reconciliation can be done by calling web services or a database API.
• Exporting data: Data can be exported into tab separated values (TSV), comma separated values (CSV), Excel and HTML table.
• Undo/redo: Undo gives the user the flexibility to rectify mistakes. Redo enables the user to repeat a step.
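The idea behind a text facet, counting the rows per distinct value and then narrowing to one value, can be imitated in a few lines of Python. This is only an analogy to make the concept concrete, not how OpenRefine implements facets; the rows reuse the Sex column of Table 2, which has the same split of two and one as the example above.

```python
from collections import Counter


def facet(rows, column):
    """Count how many rows share each distinct value in `column`."""
    return Counter(row.get(column) for row in rows)


def select(rows, column, value):
    """Keep only the rows whose `column` equals the chosen facet value."""
    return [row for row in rows if row.get(column) == value]


rows = [
    {"CampusId": "VH609042", "Sex": "M"},
    {"CampusId": "XV503267", "Sex": "F"},
    {"CampusId": "XV503267", "Sex": "F"},
]
counts = facet(rows, "Sex")
females = select(rows, "Sex", "F")
```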
Data Wrangler: It has the following key features:
• Data Wrangler supports the following six user interactions while using the tool for cleaning.
◦ Select columns
◦ Select rows
◦ Select text within a cell
◦ Edit data within the table
◦ Click bars in the data quality meter
◦ Assign data types, column names and semantic roles.
• Data Wrangler has a suggestion engine that suggests next data cleaning steps.
• Data Wrangler supports automated script generation.
• Data Wrangler allows the user to interact with the data step by step.
• Data Wrangler supports CSV, JSON and TDE data formats.
Python: It has the following key features:
• Features to visualize and explore data.
• Writing customizable code for specific data cleaning tasks.
• Easy integration with other tools or products. Python can call programs written in other languages. Python code can be called from other languages as well.
R: R has the following key features:
• R has many functions that can be used for data cleaning.
• R has good visualization libraries.
• Writing customizable code for specific data cleaning tasks.
Platform and Needed Skill Level
OpenRefine is a web-based application, therefore it is platform independent. It can run on Windows, Linux and Mac. It requires a basic to intermediate skill level.
Data Wrangler runs on Windows and Mac. It requires a basic skill level.
Both Python and R run on all platforms, including Linux, Windows and Mac. They both require an advanced skill level, because users have to know how to program.
Time of Completion
Figure 3 depicts the average completion time of cleaning using all four tools on the University data and the ARM data. The users have a high skill level and are familiar with both R and Python.
Data Wrangler had the fastest completion time, followed by OpenRefine. This is expected, because both tools are highly interactive. Data Wrangler can also suggest data cleaning steps, which leads to an even faster completion time. Using R and Python took much longer because both require customized programming. Using R took less time than using Python, because R has a lot of data analysis functions that are suitable for data cleaning. The relative order of the tools is the same for both data sets.
Ease of Implementation
There is no standard sequence of steps in cleaning data. Sometimes it depends on the specific issues contained in the data, while other times it depends on the user's approach. Due to this fact, we were not able to do a quantitative analysis. However, we gathered feedback from users of these tools. Some users expressed how they felt using OpenRefine and Data Wrangler for data cleaning based on the interactive user interface. Others discussed how they could use Python and R in customized ways. Based on the feedback we got and on our usage of these tools for our experiment, we assigned a scale of 1-3 for the ease of implementation of these tools.
We classified the ease of implementation/usage of these tools on a scale of three (3), with 3 as the easiest to use.
1. Scale 1: High human dependence, low interactivity, little automation and requiring advanced technical skill. This scale means that the user must know what he or she is doing, and the tool gives no suggestions or hints. The effectiveness of the tool depends on the user's knowledge and skills. In addition, the user needs to have advanced technical skills to be able to successfully complete tasks with the tool.
2. Scale 2: High human dependence, high interactivity, some automation and requiring basic to intermediate technical skill. Here the user is expected to know exactly what in the data he or she wants to clean, but the tool interactively helps the user carry out the task. Basic technical skills are needed for basic cleaning, but intermediate technical skills may be needed for complex tasks.
3. Scale 3: Tools in this category are highly interactive, have little to no human dependence, and suggest cleaning steps to the user. A user with no experience or technical skills can use such a tool to achieve the data cleaning tasks.
Figure 4 shows the ease of implementation scale for each tool. OpenRefine has a scale of 2 because this tool is highly interactive and only requires basic to intermediate skills. However, users still need to specify all steps in the data cleaning process, so it has high human dependence and some automation.
Data Wrangler has a scale of 3 because it is highly interactive and only requires basic skills. In addition, it suggests data cleaning steps to users, so it has low human dependence and is highly automated.
Python and R both have a scale of 1 because they have high human dependence, low interactivity and little automation, and require advanced technical skill.
Other aspects: We looked at several other aspects, including the possibility of being embedded in other tools/programs, user interface, mass edits (editing multiple cells at the same time), approach, and compatibility with Big Data.
Both OpenRefine and Data Wrangler are stand-alone and cannot be embedded. R and Python can be embedded in other programs.
Both OpenRefine and Data Wrangler have a graphical user interface. R and Python do not. All of them support mass editing, but R and Python require some coding to do that.
Figure 3. Time of completion in minutes using four data cleaning tools for University Data and ARM Data cleaning
In terms of data cleaning approach, OpenRefine supports simple tasks with a simple click, but for more complex tasks, users need to use the expression language. For Data Wrangler, users only need to click, and the system also suggests data cleaning steps. For R and Python, users have to manually write scripts.
OpenRefine can only support cleaning 5000 records, so it does not directly support cleaning big data. The other three tools can handle big data.
Advantages and Disadvantages
OpenRefine has the following advantages:
• Since this tool is a desktop application that does not need to connect to the internet, the data set is relatively safe and is harder to tamper with.
• Users can use its facet feature to filter the data into subsets.
• It has powerful features to transform data.
• It provides a simple data summarization platform.
OpenRefine has the following disadvantages:
• Google removed support for this tool, and some of its features are redundant.
• The UI is not user friendly; several features are not easy to find.
• OpenRefine is not suitable for processing large data sets due to the 5000-record limit.
• It assumes that data is organized in tabular form, which is not always true.
Figure 4. Ease of implementation of the tools
Data Wrangler has the following advantages:
• Data Wrangler has two views: the Grid view and the Column view.
• This tool also supports data visualization at every step of data cleaning.
• It supports mass editing.
• It uses natural language descriptions of transformations.
• It recommends cleaning steps for uploaded data.
Overall, we found Data Wrangler the most user friendly of the four tools.
Data Wrangler has the following disadvantages:
• It consumes lots of memory.
• Only limited features are available in the free version.
Python has the following advantages:
• Users can customize their solution to fit their needs.
• This tool is easy to integrate into other applications.
Python has the following disadvantages:
• It requires advanced programming skills.
• The learning curve is steep, as it requires the user to learn how to use many modules in Python.
• It is not time effective due to the steep learning curve.
• This method can be complex and difficult to implement.
• Users must have previous knowledge of what steps to take in the cleaning process.
R has the following advantages:
• It is suitable when the data is mainly used for statistical analysis (e.g., sales records).
• It is very easy to visualize data at each stage of cleaning.
R has the following disadvantages:
• It is not a good option for integrating into projects outside the data science domain. Other projects might make use of other programming languages that R doesn't integrate well with.
• It is not time effective due to the steep learning curve.
• This method can be complex and difficult to implement.
• Users must have previous knowledge of what steps to take in the cleaning process.
Accuracy
We were not able to quantify the accuracy of these tools, because users have to go through multiple iterations for each tool, and once some issues are fixed in one iteration, the tool may find more issues that will be fixed in the next iteration. In the experiments, we observed OpenRefine and Data Wrangler to have high accuracy when detecting specific data quality issues (e.g., missing values). But some manual work is needed to fix the issues found (e.g., you can decide to remove missing values or assign some values).
For R and Python, the accuracy depends entirely on the user's skill level, that is, on whether the user can write good programs to detect those issues and solve them.
In the end, most data quality issues were addressed with each tool.
Summary of Comparison
Table 9 summarizes the comparison of these four tools. Following comparison criteria used by Porwal and Vora (Porwal & Vora, 2013) and Karrar and Ali (Karrar & Ali, 2016), we came up with the following metrics: import format, performance time, skill level, platform, ease of implementation, key features, output format, accuracy, possibility of being embedded in other tools/programs, user interface, mass edit, approach, compatibility with big data and their disadvantages.
Table 9. Comparison of OpenRefine, Data Wrangler, Python and R

| Criteria | OpenRefine | Data Wrangler | Python | R |
|---|---|---|---|---|
| Import format | CSV, TSV, Excel (XLS/XLSX), JSON, XML, RDF | Excel (XLS/XLSX), CSV, TEXT | All | All |
| Performance time | Depends on data size and format | Depends on user choice and data size | Depends on user programming skills and level of artifacts in the data | Depends on user programming skills and level of artifacts in the data |
| Key features | Facets and filters, support for expression language, reconciliation | User interactions, suggestion engine, automated script generation | Customizable by user, integrates with other tools, great visualization library | Customizable by user, great visualization library |
| Skill level | Basic to intermediate | Basic | Advanced | Advanced |
| Platform | All platforms | Windows, Mac | All platforms | All platforms |
| Accuracy | High accuracy when detecting specific data quality issues | High accuracy when detecting specific data quality issues | Depends on the user's skill level | Depends on the user's skill level |
| Ease of implementation | 2 | 3 | 1 | 1 |
| Output format | TSV, CSV, Excel and HTML table | CSV, JSON, TDE | User may customize to any format | User may customize to any format |
| Possibility to embed | No, standalone but code is available | No, standalone | Yes | Yes |
| Graphical user interface | Yes | Yes | No | No |
| Edit multiple values | Supports mass edit | Supports mass edit, and it is easy | Supported but requires complicated coding | Supported but requires coding |
| Approach | Simple tasks are carried out with a click, but complex tasks require the expression language | Simple click; also suggests cleaning features to the user | Need to write scripts | Need to write scripts |
| Compatible with big data | No (suitable for only 5000 records) | Yes | Yes | Yes |
| Drawbacks | Google stopped support; advanced features require technical skills | Memory consumption is high; cost implication | Requires good knowledge of programming | Requires knowledge of programming and statistics |
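To make the "Edit multiple values" criterion concrete, the sketch below shows one way a mass edit might be scripted in Python. It is an illustrative assumption rather than code from the study: `normalize_key`, `mass_edit`, the `state` column, and the canonical mapping are all hypothetical. The idea is to map spelling variants of the same value to one canonical form in a single pass, which a GUI tool offers as a built-in mass edit.

```python
def normalize_key(value):
    """Collapse case, periods, commas and extra whitespace into a lookup key."""
    stripped = value.lower().replace(".", "").replace(",", "")
    return " ".join(stripped.split())


def mass_edit(rows, column, canonical):
    """Replace every variant in `column` with its canonical value.

    `canonical` maps normalized keys to the value to write back.
    Returns new rows (the input is not modified) and the number of
    cells that were actually changed.
    """
    edited_count = 0
    result = []
    for row in rows:
        new_row = dict(row)
        key = normalize_key(row[column])
        if key in canonical and row[column] != canonical[key]:
            new_row[column] = canonical[key]
            edited_count += 1
        result.append(new_row)
    return result, edited_count


# Hypothetical records with inconsistent spellings of the same state.
records = [
    {"name": "A", "state": "Maryland"},
    {"name": "B", "state": "maryland "},
    {"name": "C", "state": "MD"},
    {"name": "D", "state": "M.D."},
    {"name": "E", "state": "Virginia"},
]
canonical_states = {"maryland": "MD", "md": "MD"}

edited_records, n_edited = mass_edit(records, "state", canonical_states)
```

The user still has to supply the canonical mapping and the normalization rule, which reflects the table's observation that mass edits are supported in Python but require coding.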
CONCLUSION
The problem of 'dirty' data costs institutions large amounts of money every year. More than 80% of a data analyst's time and resources are spent preparing and cleaning data. This paper conducted a comparative study of four commonly used tools for data cleaning. The results show that OpenRefine, which is an open source tool developed by Google, is a useful tool with several merits: it runs locally, which makes user data more secure; it has a graphical interface; and it supports mass edits. However, OpenRefine requires experience and expertise to use its advanced features. OpenRefine also works better for small datasets.
Data Wrangler has the advantage of being a standalone tool. It is very efficient for big data and has a unique visualization feature at each step, which gives the user an opportunity to preview changes graphically before committing them. It can also recommend data cleaning steps. Overall, it is the easiest to use. However, the free version has limited functionality.
Python and R have the advantage that the user can customize the cleaning in any way he or she wants, and they can be embedded into other tools. Both Python and R offer the same cleaning features, but Python has many modules that support different aspects of cleaning, as well as the ability to use the cleaned data for other analyses. Python and R, however, require strong programming skills, which users may not have. Cleaning with Python and R also takes a lot of time, as each step along the way must be implemented manually.
In conclusion, Data Wrangler is a good starting point for novice users, as many data analysts prefer not to spend too much time cleaning data because they must work on the functionality or usage of the data. It is a good fit for users who do not mind paying for a cleaning tool. For users seeking an open source tool, OpenRefine is a good option. For data engineers who have the time and adequate skills, Python or R is a good option.
One possible direction for future work is to examine how each tool can help in the integration of data from diverse sources.
REFERENCES
Batini, C., Lenzerini, M., & Navathe, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323–364. doi:10.1145/27633.27634
Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal. PMID:24288502
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning (Vol. 479). John Wiley & Sons. doi:10.1002/0471448354
Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). AJAX: an extensible data cleaning tool.
Haghighat, M., Abdel-Mottaleb, M., & Alhalabi, W. (2016). Discriminant Correlation Analysis: Real-Time Feature Level Fusion for Multimodal Biometric Recognition. IEEE Transactions on Information Forensics and Security, 11(9), 1984–1996. doi:10.1109/TIFS.2016.2569061
John, F., & Allison, L. (2016). R and the Journal of Statistical Software. Journal of Statistical Software, 73(2).
Kandel, S., Paepcke, A., Hellerstein, J., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. Paper presented at the Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
Karrar, A. E., & Ali, M. M. (2016). Comparative Analysis of Data Cleaning Tools Using SQL Server and Winpure Tool. International Journal of Computer Applications in Technology, 3(7), 371–377.
Kumar, S., & Nadeem, M. (2008). Extraction, Transformation, Loading (ETL) and Data Cleaning Problems. Journal of Independent Studies and Research on Computing, 6(1).
Lee, M. L., Lu, H., Ling, T. W., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. Paper presented at the 10th International Conference on Database and Expert Systems Applications.
Martinez-Mosquera, D., Luján-Mora, S., López, G., & Santos, L. (2017). Data Cleaning Technique for Security Logs Based on Fellegi-Sunter Theory. Paper presented at the SIGSAND-EuroSymposium, Gdansk, Poland.
Müller, H., & Freytag, J.-C. (2005). Problems, methods, and challenges in comprehensive data cleansing.
Parent, C., & Spaccapietra, S. (1998). Issues and approaches of database integration. Communications of the ACM, 41(5es), 166–178. doi:10.1145/276404.276408
Patel, S. (2012). Requirement to cleanse DATA in ETL process and Why is data cleansing in Business Application? International Journal of Engineering Research and Applications, 2(3).
Porwal, S., & Vora, D. (2013). A Comparative Analysis of Data Cleaning Approaches to Dirty Data. International Journal of Computers and Applications, 62(17).
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3–13.
Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002, November). Conceptual modeling for ETL processes. In Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP (pp. 14-21). ACM.
Verborgh, R., & De Wilde, M. (2013). Using OpenRefine. Packt Publishing Ltd.
Samson Oni is a PhD student in Information Systems at the University of Maryland Baltimore County (UMBC). He obtained his master's degree in computer science from the University of Maryland, Baltimore County. He worked as a Research Assistant at the Imaging Research Center, UMBC. His previous work includes serving as a technical intern for the Joint Centre for Earth Systems (NASA-JCET) at UMBC and as a full-stack developer for the Department of Education, UMBC. His research focus is cyber security and data science, and he has carried out several projects in these domains. Currently, he is a research assistant in Information Systems at UMBC, where he is working on semantic web, blockchain and cybersecurity-related projects. More information can be found at http://www.samdwise.com
Zhiyuan Chen is an Associate Professor in the Department of Information Systems at the University of Maryland Baltimore County. He received a PhD degree in Computer Science from Cornell University in August 2002. He has more than 10 years of extensive research experience in data privacy, privacy preserving data mining, database management, data science, and cyber security. His main research focus is algorithms that preserve the privacy of data while still allowing accurate analysis of the data. He has published over 40 papers in peer reviewed journals and publications, and over 20 of them are in the area of privacy and security. More information can be found at https://userpages.umbc.edu/~zhchen/
Susan Hoban worked with NASA for over two decades, first as a scientist studying comets and the interstellar medium, then as a STEM Educator. Dr. Hoban develops curriculum for professional development of educators for classroom use and informal education venues. Dr. Hoban specializes in integrating hands-on activities with data collection and analysis to develop the habits-of-mind of STEM. Curriculum modules include, but are not limited to rocketry, environmental education, astronomy & astrobiology, computer modeling, STEM music, and robotics for learners of all ages. Dr. Hoban is currently also working on using analytics for cyber security.
Onimi Jademi is a PhD candidate in the Department of Information Systems at the University of Maryland, Baltimore County (UMBC). Her research focuses on natural language processing and machine learning, and its applications especially in the healthcare domain. She has experience with high quality qualitative and quantitative research methods.