
A Comparative Study of Data Cleaning Tools


DOI: 10.4018/IJDWM.2019100103

International Journal of Data Warehousing and Mining, Volume 15 • Issue 4 • October-December 2019

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


Samson Oni, University of Maryland Baltimore County, USA

Zhiyuan Chen, University of Maryland Baltimore County, USA

Susan Hoban, University of Maryland, Baltimore County, USA

Onimi Jademi, University of Maryland, Baltimore County, USA

ABSTRACT

In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of the time and resources of the data analyst. Adequate tools and techniques must be used for data cleaning. There exist a lot of data cleaning tools, but it is unclear how to choose among them in various situations. This research aims at helping researchers and organizations choose the right tools for data cleaning. This article conducts a comparative study of four commonly used data cleaning tools on two real data sets and answers the research question of which tool is useful in different scenarios.

KEYWORDS

Big Data, Data Cleaning, Data Cleansing, Data Fusion, Data Quality, Data Wrangler, Dirty Data, Open Refine

INTRODUCTION

Data is constantly being produced in every sector. However, data is produced in many forms, with various levels of quality, and some data may have poor quality. Data cleaning, sometimes called data scrubbing or data cleansing, is the detection and removal of errors and inconsistencies from data with the aim of improving data quality. In Big Data processing, data cleaning is a critical step prior to data processing and maintenance (Müller & Freytag, 2005). Data cleaning is important both for data from a single source and for data from multiple sources. It is an essential step in the data fusion process, which is the process of merging data from multiple sources (Haghighat, Abdel-Mottaleb, & Alhalabi, 2016). Fusing poor quality data from various sources will cause more issues afterwards. Therefore, adequate cleaning of data from various sources before integration will have a significant impact on the outcome of data fusion.

Cleaning data requires identifying incorrect, invalid or duplicate entries. The quality of data is determined by the degree to which the data in question meets specific needs, which in any case will be higher as the data becomes cleaner (Kandel, Paepcke, Hellerstein, & Heer, 2011). Validity, completeness, accuracy and precision are the measures of data quality (Kandel et al., 2011). The importance of accurate and correct data for the fusion/ETL process cannot be overemphasized.


Data analysts also spend a great deal of time and resources trying to fix data quality problems. Dasu and Johnson (2003) emphasized the rule of thumb that more than eighty percent (80%) of the time on a data analysis project is spent on cleaning and preprocessing.

Although there are many data cleaning tools, they often have distinctive features. They also require distinct levels of skill to use and have different costs and learning curves. Determining the best tools for any given cleaning task depends on many factors. However, in practice, users are often not experts on data cleaning tools and technologies, so there is a great need to provide some guidance on how to choose data cleaning tools.

The objective of this paper is to analyze four popular data cleaning tools and determine which tools are appropriate for various scenarios. This paper compares the features of these tools and their performance on cleaning the same data set. Two data sets were used for this experiment. The results may help users choose appropriate data cleaning tools.

This paper makes the following contributions:

• Compares the performance of four data cleaning tools on two real-world data sets. The metrics include their features, required platforms and skill level, time of completion, ease of implementation/usage, etc.

• Proposes a guideline for choosing data cleaning tools.

The rest of the paper is organized as follows. A background study is presented first, followed by an overview of various aspects of data cleaning. The methodology section describes the methodology used for the comparison study. The results section describes the results of the study. The discussion and conclusion section presents the guidelines for choosing data cleaning tools and concludes the paper.

LITERATURE REVIEW

There has been a lot of work on data cleaning. The work can be roughly divided into two categories: work on methods that address specific data quality issues, and work on more general tools or frameworks that can address multiple data quality issues.

Work on specific data quality issues: Lee et al. (Lee, Lu, Ling, & Ko, 1999) presented several techniques to preprocess records before sorting them so that potentially matching records are brought close together. Using these techniques, they implemented a data cleaning system that can detect and remove duplicate records.

Various methods of handling missing data were discussed by Luján-Mora (Martinez-Mosquera et al., 2017). The authors proposed algorithms used in an analysis of an incomplete data set. The authors proposed multiple imputation methods, including regression imputation (filling in missing data with values predicted by a regression model) and single hot deck imputation (replacing the missing values with those obtained from similar objects from the same experiments).

General tools or frameworks: Martinez-Mosquera et al. (Martinez-Mosquera, Luján-Mora, López, & Santos, 2017) looked at modeling data cleaning for Big Data analysis based on previous research on modeling ETL processes using the Unified Modeling Language (UML). They presented two use cases, one for modeling the data cleaning process for web logs and the other for modeling the cleaning process for security logs.

Galhardas et al. (Galhardas, Florescu, Shasha, & Simon, 2000) developed a data cleaning framework called AJAX. Their approach separates the physical and logical levels of data cleaning. The logical level supports the design of the data cleaning workflow and the physical level implements the data cleaning workflow. This framework transforms existing data from one or more data collections to a target schema while eliminating duplicates.


Luján-Mora et al. (Martinez-Mosquera et al., 2017) proposed a technique for data cleaning that can be used for checking data quality issues on security logs. They used predefined rules to combine log data and scan for issues. Detected issues are then corrected before the data set is analyzed. However, their work focused on a specific data type: security logs.

Kandel et al. (Kandel et al., 2011) described Data Wrangler as a tool for the interactive cleaning of data using visual specifications of data transformation scripts. They explained that Data Wrangler combines direct manipulation of visualized data with automatic inference of relevant transformations, which in turn cleanse the data.

Karrar and Ali (Karrar & Ali, 2016) conducted a comparative analysis of SQL Server and Winpure tools using academic and weather data sets. They analyzed two data cleaning tools, while our approach used four tools that are more commonly used in the industry for data cleaning. Porwal and Vora (Porwal & Vora, 2013) also carried out a comparative analysis of two data cleaning algorithms: the Alliance rule algorithm and the Hadclean algorithm. However, they did not compare more full-fledged data cleaning tools. They also did not consider other factors such as the usability of the tools.

In a nutshell, there has been a lot of research on the need for data cleaning as well as on data cleaning techniques, tools, and frameworks. However, there is not much work on comparing and choosing data cleaning tools and on the criteria to consider. This paper conducts a comparative study of four commonly used data cleaning tools.

OVERVIEW OF DATA CLEANING

This section describes the types of data, the data quality issues considered in this paper, the general steps of data cleaning for a single source, and the general steps of data cleaning for multiple sources.

Types of Data

With respect to categorization, there are three types of data: structured, semi-structured and unstructured data.

1. Structured data: This type of data has a high degree of organization and adheres to a predefined data model. One example is data in a relational database.

2. Semi-structured data: This type of data does not fit into a relational database but has some form of organization for easy analysis. An example is XML data.

3. Unstructured data: This type of data is not organized, nor does it have a predefined model. Unstructured data is not a good fit for a relational database. Examples are text, PDF files and images.

Irrespective of the type of data, poor data quality will lead to poor analysis and decisions.

Data Quality Issues

Data quality issues come in various forms, ranging from duplicate data, missing data and errors (such as spelling student as stdent) to inconsistent formats. Below are several types of data quality issues considered in this paper.

1. Misspelled data: For example, a column has ‘student’, another has ‘stdent’ which is misspelled.

2. Duplicate data/records (Table 1): This is when the same information is entered or duplicated in a data set or database.

3. Irrelevant data: This is data in the data set that is not relevant to the work. This kind of data needs to be removed.

4. Mixed ranges: Sometimes data is measured in ranges, e.g., salary, age. Ranges need to be represented consistently and appropriately in the data.


5. Mixed numerical scales: This issue arises when different numerical scales are used within the data set. For example, one million can be represented as 1m while one billion can be represented as 1B, but a computer may not easily interpret such notations.

6. Multiple representation: Representing the same piece of information in different forms that have the same meaning can cause problems within a data set. For example, the country United States may appear as U.S.A, US, United States, or United States of America. All these representations mean the same thing, but using a mixture of them for the same information within a data set will cause trouble for analysis.

7. Wrong date format: Different date formats are used in data today, but the mixture of several date formats in one data set can be troublesome. Examples of different formats are 2/12/2018, February 12, 2018, and 2-12-2018. The three dates mean the same, but their presentation differs. Another example of date inconsistency is the mixture of the American (MM/DD/YYYY) and European (DD/MM/YYYY) formats. In the American format, the 12th of February 2018 is written as 2/12/2018, while Europeans represent the same date as 12/2/2018, starting with the day.
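Issues 6 and 7 above can be illustrated with a minimal sketch in plain Python. The alias table, the function names and the choice of ISO 8601 as the target date format are assumptions made for this illustration; they are not taken from the tools compared in this paper.

```python
from datetime import datetime

# Hypothetical alias table: every known variant maps to one canonical label.
COUNTRY_ALIASES = {
    "u.s.a": "United States of America",
    "u.s.": "United States of America",
    "us": "United States of America",
    "usa": "United States of America",
    "united states": "United States of America",
    "united states of america": "United States of America",
}


def normalize_country(value):
    """Return the canonical country name, or the trimmed input if it is unknown."""
    cleaned = value.strip()
    return COUNTRY_ALIASES.get(cleaned.lower(), cleaned)


def parse_date(value, day_first=False):
    """Parse a few known date layouts and return an ISO 8601 string (YYYY-MM-DD).

    The analyst has to choose day_first: 2/12/2018 is 12 February in the
    American convention and 2 December in the European one.
    """
    if day_first:
        numeric_formats = ["%d/%m/%Y", "%d-%m-%Y"]
    else:
        numeric_formats = ["%m/%d/%Y", "%m-%d-%Y"]
    for fmt in numeric_formats + ["%B %d, %Y"]:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")
```

Under the American convention, the three example strings from item 7 all map to the same ISO date, while the same numeric string read with `day_first=True` yields a different date. This is why the convention has to be decided per source rather than guessed per value.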

General Steps of Data Cleaning for a Single Source

There are several phases involved in data cleaning for a single data source.

• Detect errors and inconsistencies in the data to remove.

• Verify that the error is really an error, not a special feature of the data set (Rahm & Do, 2000). This often requires human interaction.

• Extract erroneous records to a new temporary table.

• Perform cleaning operations on the data in that temporary table.

Most of these processes are already built into data cleaning tools, as doing this manually costs a lot of time and resources.
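The four phases can be sketched in plain Python on rows modeled on Table 2. This is only an illustration under stated assumptions: the list of valid countries and the correction table are hypothetical stand-ins for the human verification step, and the function names are not part of any tool discussed here.

```python
# Rows modeled on Table 2 of this paper.
records = [
    {"CampusId": "VH609042", "First Name": "Samson", "Last Name": "James",
     "Country": "Nigria", "Sex": "M"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "India", "Sex": "F"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "India", "Sex": "F"},
]

VALID_COUNTRIES = {"Nigeria", "India"}   # hypothetical reference list
CORRECTIONS = {"Nigria": "Nigeria"}      # assumed to be confirmed by a human reviewer


def detect_errors(rows):
    """Phase 1: return the indexes of exact duplicates and of rows with an unknown country."""
    seen, suspects = set(), []
    for index, row in enumerate(rows):
        key = tuple(row.items())
        if key in seen or row["Country"] not in VALID_COUNTRIES:
            suspects.append(index)
        seen.add(key)
    return suspects


def clean_single_source(rows):
    """Phases 3 and 4: move suspect rows to a temporary table, clean them, merge back."""
    suspects = set(detect_errors(rows))
    temp_table = [dict(rows[i]) for i in sorted(suspects)]   # copies; input stays untouched
    kept = [row for i, row in enumerate(rows) if i not in suspects]
    for row in temp_table:
        row["Country"] = CORRECTIONS.get(row["Country"], row["Country"])
        if row not in kept:            # drop rows that duplicate a kept record
            kept.append(row)
    return kept
```

Phase 2 (verification) is represented only by the `CORRECTIONS` table, since deciding whether a suspect value is a true error usually needs a person. Note that the cleaned rows are appended after the untouched ones, so the original row order is not preserved.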

General Steps of Data Cleaning for Multiple Sources

With multiple data sources, each data source may contain dirty data. In addition, data from one source may contradict or overlap with data from other sources. The process of merging these data is also known as data fusion.

Table 2 depicts a single data source that has a few data quality issues, including misspelling (Nigeria was spelled as Nigria) and duplicated data.

Table 3 and Table 4 show two data sources that need to be integrated. Each data source may have some data quality issues (e.g., Nigeria is misspelled in source 2). Some of these issues can be addressed in the individual source, but others can only be addressed during and after integration.

Table 5 shows integrated data in which cleaning was done in the individual sources alone. Issues like misspelling are addressed, but redundancy and overlap are not. For example, there are columns for Name, First Name and Last Name, and columns for both Sex and Gender. Table 6 shows clean integrated data where these issues are addressed.

Table 1. An example of a duplicate record

No  First Name  Last Name  Age  Sex  Phone No
1   Samson      James      19   M    202-298-2014
2   Samson      James      19   M    202-298-2014


Aside from the issue of data overlap associated with multiple sources, naming and structural conflicts may occur (Batini, Lenzerini, & Navathe, 1986; Parent & Spaccapietra, 1998). Naming conflicts arise when different names are used for the same objects across sources, or when the same name is used for different objects across sources. Meanwhile, structural conflicts occur when different representations of the same object arise in different data sources.

Table 2. Data quality issues in a single data source

CampusId  First Name  Last Name  Country  Sex
VH609042  Samson      James      Nigria   M
XV503267  Jane        Mark       India    F
XV503267  Jane        Mark       India    F

Table 3. Data in source 1

CampusId  Name          Address                      Sex  Date of Birth
VH609042  Samson James  100 IRC 21222, MD            1    12-01-1989
XV503267  Jane Mark     123 Ocean street, 21223, MD  0    02-01-1988

Table 4. Data in source 2

CampusId  First Name  Last Name  Country  Gender  Course
VH609042  Samson      James      Nigria   M       10
XV503267  Jane        Mark                F       10

Table 5. Integrated data with data cleaning in individual sources only

CampusId  Name          Address                      Date of Birth  Sex  First Name  Last Name  Country  Gender  Courses
VH609042  Samson James  100 IRC 21222, MD            12-01-1989     1    Samson      James      Nigeria  M       10
XV503267  Jane Mark     123 Ocean street, 21223, MD  02-01-1988     0    F           Mark       India    F       10

Table 6. Clean integrated data

CampusId  FirstName  LastName  Sex  Address                      Courses  Date of Birth  Country
VH609042  Samson     James     M    100 IRC 21222, MD            10       12-01-1989     Nigeria
XV503267  Jane       Mark      F    123 Ocean street, 21223, MD  10       02-01-1988     India


So the general steps to clean data from multiple sources are: 1) clean the data at each source; 2) integrate the data; 3) address data quality issues in the integrated data.
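These three steps can be sketched in plain Python on rows modeled on Tables 3 and 4. Several details are assumptions made for this illustration: the mapping of the numeric Sex codes to M/F is inferred from comparing Table 3 with Table 6, the helper names are hypothetical, and a country that is missing in every source is left missing rather than filled in.

```python
# Rows modeled on Table 3 (source 1) and Table 4 (source 2).
source1 = [
    {"CampusId": "VH609042", "Name": "Samson James",
     "Address": "100 IRC 21222, MD", "Sex": 1, "Date of Birth": "12-01-1989"},
    {"CampusId": "XV503267", "Name": "Jane Mark",
     "Address": "123 Ocean street, 21223, MD", "Sex": 0, "Date of Birth": "02-01-1988"},
]
source2 = [
    {"CampusId": "VH609042", "First Name": "Samson", "Last Name": "James",
     "Country": "Nigria", "Gender": "M", "Course": 10},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "", "Gender": "F", "Course": 10},
]

SEX_CODES = {1: "M", 0: "F"}             # assumption: inferred from Tables 3 and 6
SPELLING_FIXES = {"Nigria": "Nigeria"}


def clean_sources(rows1, rows2):
    """Step 1: fix the issues that are visible inside each source on its own."""
    cleaned1 = [{**row, "Sex": SEX_CODES[row["Sex"]]} for row in rows1]
    cleaned2 = [{**row, "Country": SPELLING_FIXES.get(row["Country"], row["Country"])}
                for row in rows2]
    return cleaned1, cleaned2


def integrate(rows1, rows2):
    """Step 2: join the two sources on the shared CampusId key."""
    by_id = {row["CampusId"]: row for row in rows2}
    return [{**row, **by_id.get(row["CampusId"], {})} for row in rows1]


def resolve_redundancy(rows):
    """Step 3: collapse overlapping columns (Name vs. First/Last Name, Sex vs. Gender)."""
    result = []
    for row in rows:
        first, _, last = row["Name"].partition(" ")
        result.append({
            "CampusId": row["CampusId"],
            "First Name": row.get("First Name") or first,
            "Last Name": row.get("Last Name") or last,
            "Sex": row.get("Gender") or row["Sex"],
            "Address": row["Address"],
            "Courses": row.get("Course"),
            "Date of Birth": row["Date of Birth"],
            "Country": row.get("Country") or None,   # stays missing if no source has it
        })
    return result


cleaned1, cleaned2 = clean_sources(source1, source2)
integrated = resolve_redundancy(integrate(cleaned1, cleaned2))
```

The redundant Name and Gender columns only become visible after the join, which is why step 3 cannot be done at the individual sources.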

METHODOLOGY

This section describes the methodology of the comparative study, including the data sets, the four data cleaning tools, and the data cleaning tasks.

Data Sets

In this work, we used a data set on atmospheric and climate research from the U.S. Department of Energy website (www.arm.gov) and a data set about universities (universityData) extracted from Wikipedia. The data from the U.S. Department of Energy website is the Atmospheric Radiation Measurement (ARM) user facility data collected through scientific experiments and routine operations. The observations were made every half an hour. The University data set gives an overview of different universities: when they were established, the number of faculty, staff and students currently enrolled, as well as the total endowment amount each university currently possesses. The information included in the data sets is explained in Table 7.

The University data set has 10 variables (p=10), contains over 75,000 records (n=75,043) and is saved as CSV. The ARM data set has 15 variables (p=15), contains over 12,000 records (n=12,762) and is saved as CSV. Table 8 shows the columns in the University data.

Data Quality Issues in Data Sets

Figure 1 and Figure 2 show screenshots of these two data sets, respectively. The two data sets used for the experiments are very messy and have several data quality issues:

Table 7. Properties of data sets

File Name       No. of Records  No. of Fields  Missing Values  Duplicate Record
UniversityData  75,000          10             7.89%           32.7%
ARMData         12,762          15             27.6%           0%
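Properties of the kind listed in Table 7 can be obtained by profiling a data set before cleaning. The sketch below is one possible way to do so in plain Python. The paper does not state how its percentages were computed, so the definitions used here (share of missing cells, share of rows that exactly repeat an earlier row) and the set of missing-value markers are assumptions.

```python
# Assumed markers for a missing cell; real data sets may use others.
MISSING_MARKERS = {"", "NA", "N/A", "null"}


def profile(rows):
    """Summarize a table given as a list of rows, each row being a list of cell values."""
    total_cells = sum(len(row) for row in rows)
    missing_cells = sum(
        1 for row in rows for cell in row
        if cell is None or str(cell).strip() in MISSING_MARKERS
    )
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1        # counts every repeat after the first occurrence
        seen.add(key)
    return {
        "records": len(rows),
        "fields": max((len(row) for row in rows), default=0),
        "missing_pct": round(100 * missing_cells / total_cells, 2) if total_cells else 0.0,
        "duplicate_pct": round(100 * duplicates / len(rows), 2) if rows else 0.0,
    }
```

For example, `profile([["Univ A", "NA", "14388"], ["Univ A", "NA", "14388"], ["Univ B", "1900", ""], ["Univ C", "1950", "13773"]])` reports four records, three fields, 25.0 percent missing cells and 25.0 percent duplicate rows. The university names in this example are placeholders.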

Table 8. Columns of the University Data

Description of Variable                            Variable Name in Dataset
Name of University                                 University
The monetary amount of endowment the school has    Endowment
The total number of faculty employed by school     NumFaculty
Number of Doctoral                                 NumDoctoral
Country where the school exists                    Country
The total number of staff members in the school    NumStaff
The year the school was established                Established
Number of Postgraduate students                    NumPostgrad
Number of Undergraduate students                   NumUndergrad
Total number of all students enrolled              NumStudents


Figure 1. Screenshot of UniversityData.csv opened with Excel

Figure 2. Screenshot of ARM data opened with Excel


• Inconsistent date values. The University data contains different date formats, while in the ARM data the date column holds both date and time, which have to be separated.

• Inconsistency in abbreviations and terms used. For example, to indicate the United States of America (USA) as the country, some records use terms like US, USA and United States. These might be treated as different countries without data cleaning.

• Mixture of numerical and text values.

• Missing information: The University data set contains several NA values, which is not unusual for any data set but can be problematic when carrying out data analysis. The ARM data, in turn, has many missing records.

• Inconsistent delimiters: Values in the University data set are separated by an inconsistent number of double quotes, while values in the ARM data are separated by spaces, which does not meet the comma requirement of CSV.

• Duplicate records: The data set is supposed to have only one entry for each university. However, as seen in the data, some universities have several entries, sometimes with all values identical and sometimes with variations; e.g., Lamar University has 33 entries, but the values of the last variable vary, with some showing 13773, 14388 and 14522.

• Outlier records: The ARM data set has some values in some columns far beyond the normal range, which can be problematic when carrying out analysis.

• Missing records in a sequence: The ARM data set was collected at half-hour intervals. There are many missing records for certain times.
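The last two issues lend themselves to a short illustration. The sketch below, in plain Python, lists the half-hourly timestamps that are absent between the first and last observation and flags values outside the interquartile-range fences. The paper does not say which outlier rule was applied, so the 1.5 × IQR rule and both function names are assumptions of this example.

```python
from datetime import datetime, timedelta
from statistics import quantiles


def missing_timestamps(timestamps, step=timedelta(minutes=30)):
    """Return the expected timestamps between the first and last observation that are absent."""
    present = set(timestamps)
    current, end = min(present), max(present)
    missing = []
    while current <= end:
        if current not in present:
            missing.append(current)
        current += step
    return missing


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [value for value in values if value < low or value > high]
```

Once the gaps are known, the analyst still has to decide how to fill them, for instance by imputation or by inference from neighboring records.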

The purpose of this study is to apply several data cleaning tools to the same data sets in order to compare them. Through this study, we expect to gain better insight into how these tools work and into the strengths and weaknesses of each tool, and to come up with valuable suggestions and discussion about the future of data cleaning techniques and tools.

Tools Used

For this study, we used four different data cleaning tools, namely OpenRefine, R, Python and Data Wrangler. These tools are among the most popular tools used for data cleaning in the real world. OpenRefine, R and Python are open source, which makes them easily accessible. Data Wrangler is a commercial tool but has a community version which does a good job of data cleaning. The tools are described below:

• OpenRefine: OpenRefine (Verborgh & De Wilde, 2013) is a web-based, stand-alone, open source application for data cleanup and transformation to other formats. It operates on rows of data that have cells under columns, which is very similar to relational tables. This tool cleans, reshapes and edits batch, unstructured and messy data. It was formerly known as Google Refine and was called Freebase Gridworks before that. Operations in OpenRefine include faceting (allowing users to narrow down results through several different dimensions), clustering, and reconciling, which all help in the data cleaning process. It also analyzes the data through filtering and faceting and converts the data into a more structured form.

OpenRefine is a stand-alone application that has a web interface. It is not hosted on the web but can be downloaded and run on the local machine. In other words, it is a desktop application that opens in a browser as a local web server.

Transformation expressions can be written in General Refine Expression Language (GREL), Jython (i.e., Python) and Clojure. Since it is an open source project, its code can be reused in other projects.


OpenRefine carries out cleaning tasks through filtering and faceting, and then converts the data into a more structured format.

• Data Wrangler: Data Wrangler (Kandel et al., 2011) is a Stanford University project that helps analysts clean and prepare diverse, messy data quickly and accurately. It is an interactive tool for data cleaning. Data Wrangler can work with data in two ways. Users can simply paste the data into its web interface, or they can use the web interface to export the operations as Python code and process arbitrary amounts of data. The web interface uses JavaScript and therefore has some performance issues and only supports up to 1000 rows, but users can use it to configure Data Wrangler on a subset of the data and then apply the configuration to the whole data set. The most recent version of this tool is called Trifacta Wrangler.

For the experiment we imported our data into Data Wrangler and the application began to automatically organize and structure our data set. This tool contains strong machine learning algorithms that help suggest common cleaning steps as well as common transformations and aggregations. Data Wrangler allows a mixture of numerical and text values.

• Python: Python is another tool that can be used for data cleaning. It has several modules that can be used to carry out cleaning. One powerful Python module used for data cleaning is pandas (Python data analysis toolkit). This module is primarily for data analysis, of which data cleaning is a part. Another Python module that can be useful when carrying out cleaning is NumPy. This module is used for scientific computing with Python. It has a powerful N-dimensional array object that is useful for large data sets.

• R: R is a programming language used for statistical computation (John et al., 2016). It has been widely used for data analysis. R has a set of tools that are designed to clean data effectively and comprehensively. The R environment has the capacity to read data in several formats and process these files.

In the cleaning of data using R, the following simple steps can be taken, for which R provides good resources:

1. Reading data: R provides adequate resources for reading data from practically any format into a data frame.

2. Exploratory analysis: After reading the data, users often conduct an initial exploration of the data frame.

3. Exploratory analysis in visual form: During cleaning it is useful to visualize data at each stage. R provides adequate visualization tools. Three powerful visualizations that can be useful during data cleaning are the boxplot, histogram and scatterplot.

The Data Cleaning Tasks

In the experiment, the following data cleaning tasks were conducted.

• Dealing with typographical errors or multiple representations:
  ◦ Cleaning up inconsistent spelling of terms (e.g., “USA”, “U.S.A”, “U.S.”, etc.).
  ◦ Converting values that are text descriptions of numeric values (e.g., $123 million) to actual numeric values (e.g., 123000000) which are usable for analysis.
  ◦ Extracting and cleaning values for dates.

• Identifying which rows of a specific column contain a search term.

• Removing duplicate data.

• Separating date and time.

• Handling outliers.


• Handling missing records in a sequence. Here, after using the tool to discover the missing records in a sequence, the user decides how to replace the missing values, whether by imputation, inference from other records or another method chosen by the user.

• Exporting cleaned data to several formats.

• Handling missing fields, duplicate records and inconsistent formats.

• Batch editing of rows and columns.
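Three of the tasks listed above (converting text descriptions of amounts to numbers, separating date and time, and finding rows that contain a search term) can be sketched in plain Python as follows. The scale table, the accepted input patterns, the timestamp layout and all function names are assumptions chosen for this illustration; the layout of the actual ARM date column is not given in this paper.

```python
import re
from datetime import datetime

# Hypothetical scale words and suffixes, matched case-insensitively.
SCALES = {
    "thousand": 1_000, "k": 1_000,
    "million": 1_000_000, "m": 1_000_000,
    "billion": 1_000_000_000, "b": 1_000_000_000,
}

AMOUNT_PATTERN = re.compile(r"^\s*\$?\s*([\d,]*\.?\d+)\s*([A-Za-z]*)\s*$")


def parse_amount(text):
    """Convert strings such as '$123 million' or '1B' to an integer; return None if unparseable."""
    match = AMOUNT_PATTERN.match(text)
    if match is None:
        return None
    number, scale = match.groups()
    factor = 1 if not scale else SCALES.get(scale.lower())
    if factor is None:
        return None
    return round(float(number.replace(",", "")) * factor)


def split_datetime(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Split a combined timestamp into separate ISO date and time strings."""
    stamp = datetime.strptime(text, fmt)
    return stamp.date().isoformat(), stamp.time().isoformat()


def rows_containing(rows, column, term):
    """Return the rows whose value in the given column contains the search term (case-insensitive)."""
    needle = term.lower()
    return [row for row in rows if needle in str(row.get(column, "")).lower()]
```

Returning `None` for values that cannot be parsed, rather than raising an error, keeps such cells visible as missing so that they can be handled together with the other missing fields.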

Two users with advanced programming skills completed these data cleaning tasks using the four tools. For each data cleaning task, the users applied a data cleaning tool to fix the data quality issues listed in the task. They manually checked the results and repeated the cleaning task until they could not find more related quality issues in the data. The order in which the tools were applied for each task was randomized to avoid bias introduced by the order.

When comparing the four tools, we focused on the following criteria: key features, platform, scalability, skill level needed, time of completion and ease of implementation.

RESULTS

For each tool we describe its key features, platform, skill level needed, time of completion, ease of implementation, advantages and disadvantages, and accuracy.

Key Features

OpenRefine: It has the following key features:

• Importing data from various data sources in the following formats: CSV, TSV, .xls, .xlsx, JSON, XML, RDF as XML and Google document. Figure 9 shows a screenshot of importing the university data using OpenRefine.

• Facets and filters: OpenRefine allows users to use facets and filters to split data into subsets for easy usage. This can be done for number, text and date columns. For example, if a user facets data such as that in Table 2 on the Sex column, the facet shows 2 for female and 1 for male. If the user selects female, then it will show the two rows with female.

• Support for expressions that can be used to create new data from existing data or transform existing data.

• Reconciliation: Reconciliation matches text names or values in the columns to database identifiers in various database ID spaces. It helps resolve inconsistent spelling issues. For example, US, USA and United States can be matched to United States of America. Reconciliation can be done by calling web services or a database API.

• Exporting data: Data can be exported as tab-separated values (TSV), comma-separated values (CSV), Excel and HTML table.

• Undo/redo: Undo gives the user the flexibility to rectify mistakes. Redo enables the user to repeat a step.

Data Wrangler: It has the following key features:

• Data Wrangler supports the following six user interactions while using the tool for cleaning:
  ◦ Select columns
  ◦ Select rows
  ◦ Select text within a cell
  ◦ Edit data within the table
  ◦ Click bars in the data quality meter
  ◦ Assign data types, column names and semantic roles


• Data Wrangler has a suggestion engine that suggests next data cleaning steps.

• Data Wrangler supports automated script generation.

• Data Wrangler allows users to have step-by-step interaction with data.

• Data Wrangler supports CSV, JSON and TDE data formats.

Python: It has the following key features:

• Features to visualize and explore data.

• Writing customizable code for specific data cleaning tasks.

• Easy integration with other tools or products. Python can call programs written in other languages. Python code can be called from other languages as well.

R: R has the following key features:

• R has many functions that can be used for data cleaning.

• R has good visualization libraries.

• Writing customizable code for specific data cleaning tasks.

Platform and Needed Skill Level

OpenRefine is a web-based application and is therefore platform independent. It can run on Windows, Linux and Mac. It requires a basic to intermediate skill level.

Data Wrangler runs on Windows and Mac. It requires a basic skill level.

Both Python and R run on all platforms, including Linux, Windows and Mac. They both require an advanced skill level, because users have to know how to program.

Time of Completion

Figure 3 depicts the average completion time of cleaning using all four tools on the University data and the ARM data. The users have a high skill level and are familiar with both R and Python.

Data Wrangler has the fastest completion time, followed by OpenRefine. This is expected, because both tools are highly interactive. Data Wrangler can also suggest data cleaning steps, which leads to even faster completion. Using R and Python took much longer because both require customized programming. Using R took less time than using Python, because R has a lot of data analysis functions that are suitable for data cleaning. The relative order of the tools is the same for both data sets.

Ease of Implementation

There is no standard sequence of steps in cleaning data. Sometimes it depends on the specific issues contained in the data, while other times it depends on the user’s approach. Due to this fact, we were not able to do a quantitative analysis. However, we gathered feedback from users of these tools. Some users expressed how they felt using OpenRefine and Data Wrangler for data cleaning based on the interactive user interface. Others discussed how they could use Python and R in customized ways. Based on the feedback we got and on our own usage of these tools in our experiment, we rated the ease of implementation of each tool on a scale of 1 to 3.

We classified the ease of implementation/usage of these tools on a scale of one to three, with 3 as the easiest to use.

1. Scale 1: High human dependence, low interactivity, little automation and requiring advanced technical skill. This scale means that the user must know what he or she is doing, and the tool gives no suggestions or hints. The effectiveness of the tool depends on the user’s knowledge and skills. In addition, the user needs to have advanced technical skills to be able to successfully complete tasks with the tool.

2. Scale 2: High human dependence, high interactivity, some automation and requiring basic to intermediate technical skill. Here the user is expected to know exactly what in the data he or she wants to clean, but the tool interactively helps the user carry out the task. Basic technical skills are needed for basic cleaning, but intermediate technical skills may be needed for complex tasks.

3. Scale 3: Tools in this category are highly interactive, have little to no human dependence, and suggest cleaning steps to the user. A user with no experience or technical skills can use such a tool to achieve the data cleaning tasks.

Figure 4 shows the ease of implementation scale for each tool. OpenRefine has a scale of 2 because this tool is highly interactive and only requires basic to intermediate skills. However, users still need to specify all steps in the data cleaning process, so it has high human dependence and some automation.

Data Wrangler has a scale of 3 because it is highly interactive and only requires basic skills. In addition, it suggests data cleaning steps to users, so it has low human dependence and is highly automated.

Python and R both have a scale of 1 because they have high human dependence, low interactivity, little automation and require advanced technical skill.

Other aspects: We looked at several other aspects, including the possibility of being embedded in other tools/programs, user interface, mass edits (editing multiple cells at the same time), approach, and compatibility with Big Data.

Both OpenRefine and Data Wrangler are stand-alone and cannot be embedded. R and Python can be embedded in other programs.

Both OpenRefine and Data Wrangler have a graphical user interface. R and Python do not. All of them support mass editing, but R and Python require some coding to do that.

Figure 3. Time of completion in minutes using four data cleaning tools for University Data and ARM Data cleaning


In terms of data cleaning approach, OpenRefine supports simple tasks with a simple click, but for more complex tasks, users need to use its expression language. For Data Wrangler, users only need to click, and the system also suggests data cleaning steps. For R and Python, users have to write scripts manually.

OpenRefine can only support cleaning 5000 records, so it does not directly support cleaning big data. The other three tools can handle big data.
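One reason a scripting tool is not tied to a fixed record count is that a script can process a file one record at a time. The following is a minimal sketch of that idea using only Python's standard library. The `clean_stream` function, the column names, and the synthetic 20,000-record input are assumptions made for illustration; they are not the datasets or scripts used in this study, and truly large data may call for distributed tooling instead.

```python
import csv
import io


def clean_stream(src, dst, required):
    """Copy rows from src to dst one at a time.

    Rows with a blank value in any `required` column are skipped.
    Returns a (kept, dropped) tuple of row counts.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = dropped = 0
    for row in reader:  # only one row is held in memory at a time
        if any(not (row[col] or "").strip() for col in required):
            dropped += 1
            continue
        writer.writerow(row)
        kept += 1
    return kept, dropped


# Synthetic input (illustrative): 20,000 records, every 10th has a blank "value".
src = io.StringIO()
src.write("id,value\n")
for i in range(20000):
    src.write(f"{i},{'' if i % 10 == 0 else i * 2}\n")
src.seek(0)

dst = io.StringIO()
kept, dropped = clean_stream(src, dst, ["value"])
print(kept, dropped)
```

With real files, `src` and `dst` would be file objects opened with `open(..., newline="")`, so memory use stays roughly constant regardless of the number of records.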

Advantages and Disadvantages

OpenRefine has the following advantages:

• Since this tool is a desktop application that does not need an internet connection, the dataset is relatively safe and is harder to tamper with.
• Users can use its facet feature to filter the data into subsets.
• It has powerful features to transform data.
• It provides a simple data summarization platform.

OpenRefine has the following disadvantages:

• Google removed support for this tool, and some of its features are redundant.
• The UI is not user friendly; several features are not easy to find.
• OpenRefine is not suitable for processing large datasets due to the 5000-record limit.
• It assumes that data is organized in tabular form, which is not always true.

Figure 4. Ease of implementation of the tools

Data Wrangler has the following advantages:

• Data Wrangler has two views: the Grid view and the Column view.
• This tool also supports data visualization at every step of data cleaning.
• It supports mass editing.
• It uses natural language descriptions of transformations.
• It recommends cleaning steps for uploaded data.

Overall, we found Data Wrangler the most user friendly of the four tools.

Data Wrangler has the following disadvantages:

• It consumes lots of memory.
• Only limited features are available in the free version.

Python has the following advantages:

• Users can customize their solution to fit their needs.
• This tool is easy to integrate into other applications.

Python has the following disadvantages:

• It requires advanced programming skills.
• The learning curve is steep, as it requires the user to learn how to use many modules in Python.
• It is not time effective due to the steep learning curve.
• This method can be complex and difficult to implement.
• Users must have previous knowledge of what steps to take in the cleaning process.

R has the following advantages:

• It is suitable when the data is mainly used for statistical analysis (e.g., sales records).
• It is very easy to visualize data at each stage of cleaning.

R has the following disadvantages:

• It is not a good option for integrating into projects outside the data science domain. Other projects might use programming languages that R does not integrate well with.
• It is not time effective due to the steep learning curve.
• This method can be complex and difficult to implement.
• Users must have previous knowledge of what steps to take in the cleaning process.

Accuracy

We were not able to quantify the accuracy of these tools, because users have to go through multiple iterations for each tool, and once some issues are fixed in one iteration, the tool may find more issues that will be fixed in the next iteration. In the experiments, we observed OpenRefine and Data Wrangler to have high accuracy when detecting specific data quality issues (e.g., missing values). However, some manual work is needed to fix the issues found (e.g., you can decide to remove missing values or assign some values).

For R and Python, the accuracy depends entirely on the user’s skill level, that is, whether the user can write good programs to detect those issues and solve them.
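To make the detect-then-fix decision concrete, here is a minimal sketch of what such a user-written program might look like for missing values, using only Python's standard library. The set of markers treated as missing, the helper names, and the sample records are all assumptions made for illustration; they do not reproduce the scripts used in the experiments.

```python
# Assumed markers for a missing value; real datasets may use others.
MISSING_MARKERS = {"", "na", "n/a", "null", "none"}


def is_missing(value):
    """Treat None and common placeholder strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS


def find_missing(rows):
    """Detection step: list (row_index, column) for every missing cell."""
    return [
        (index, column)
        for index, row in enumerate(rows)
        for column, value in row.items()
        if is_missing(value)
    ]


def drop_missing(rows):
    """Fix option 1: remove any row that contains a missing value."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


def fill_missing(rows, defaults):
    """Fix option 2: assign a per-column default to each missing cell."""
    return [
        {col: (defaults.get(col, val) if is_missing(val) else val)
         for col, val in row.items()}
        for row in rows
    ]


# Hypothetical records for illustration only.
records = [
    {"student": "A", "gpa": "3.4"},
    {"student": "B", "gpa": "N/A"},
    {"student": "", "gpa": "2.9"},
]

issues = find_missing(records)
dropped = drop_missing(records)
filled = fill_missing(records, {"student": "unknown", "gpa": "0.0"})
print(issues)
```

Whether to drop or fill remains a judgment call for the user, and the quality of the result depends on choices such as which markers count as missing.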


In the end, most data quality issues are addressed by each tool.

Summary of Comparison

Table 9 summarizes the comparison of these four tools. Following the comparison criteria used by Porwal and Vora (2013) and Karrar and Ali (2016), we came up with the following metrics: import format, performance time, key features, skill level, platform, accuracy, ease of implementation, output format, possibility of being embedded in other tools or programs, user interface, mass edit, approach, compatibility with big data, and disadvantages.

Table 9. Comparison of OpenRefine, Data Wrangler, Python and R

| Criteria | OpenRefine | Wrangler | Python | R |
|---|---|---|---|---|
| Import format | CSV, TSV, Excel (XLS/XLSX), JSON, XML, RDF | Excel (XLS/XLSX), CSV, TEXT | All | All |
| Performance time | Depends on data size and format | Depends on user choice and data size | Depends on user programming skills and level of artifact in data | Depends on user programming skills and level of artifact in data |
| Key features | Facets and filters, support for expression language, reconciliation | User interactions, suggestion engine, automated script generation | Customizable by user, integrates with other tools, great visualization library | Customizable by user, great visualization library |
| Skill level | Basic to intermediate | Basic | Advanced | Advanced |
| Platform | All platforms | Windows, Mac | All platforms | All platforms |
| Accuracy | High accuracy when detecting specific data quality issues | High accuracy when detecting specific data quality issues | Depends on the user’s skill level | Depends on the user’s skill level |
| Ease of implementation | 2 | 3 | 1 | 1 |
| Output format | TSV, CSV, Excel and HTML table | CSV, JSON, TDE | User may customize to any format | User may customize to any format |
| Possibility to be embedded | No, standalone but code is available | No, standalone | Yes | Yes |
| Graphical user interface | Yes | Yes | No | No |
| Edit multiple values | Supports mass edit | Supports mass edit, and it is easy | Supported but requires complicated coding | Supported but requires coding |
| Approach | Simple tasks carried out with a click, but complex tasks require expression language | Simple click, and also suggests cleaning features to the user | Need to write scripts | Need to write scripts |
| Compatible with Big Data | No (suitable for only 5000 records) | Yes | Yes | Yes |
| Drawbacks | Google stopped support, advanced features require technical skills | Memory consumption is high, cost implication | Requires good knowledge of programming | Requires knowledge of programming and statistics |


CONCLUSION

The problem of ‘dirty’ data costs institutions large amounts of money every year. Preparing and cleaning data often takes more than 80% of a data analyst’s time and resources. This paper conducted a comparative study of four commonly used tools for data cleaning. The results show that OpenRefine, which is an open source tool developed by Google, is a useful tool with several merits: it runs locally, which makes user data more secure, and it offers a graphical interface and a mass edit feature. However, OpenRefine requires experience and expertise to use its advanced features. OpenRefine also works better for small datasets.

Data Wrangler has the advantage of being a standalone tool. It is very efficient for big data and has a unique visualization feature at each step, giving the user an opportunity to preview changes graphically before committing them. It can also recommend data cleaning steps. Overall, it is the easiest to use. However, the free version has limited functionality.

Python and R have the advantage that the user can customize the data cleaning any way he or she wants, and they can be embedded into other tools. Both Python and R have the same features for cleaning, but Python has many modules to support different aspects of cleaning and the ability to use the data for other analysis. Python and R, however, require strong programming skills, which may not be available. Python and R also take a lot of time to carry out cleaning, as each step along the way must be implemented manually.

In conclusion, Data Wrangler is a good start for novice users, as many data analysts prefer not to spend too much time cleaning data because they must work on the functionality or usage of the data. It is good for users who do not mind paying for a cleaning tool. For users seeking an open source tool, OpenRefine is a good option. For data engineers who have the time and adequate skills, Python or R is a good option.

One possible direction for future work is to take each tool and look at how it can help in the integration of data from diverse sources.


REFERENCES

Batini, C., Lenzerini, M., & Navathe, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323–364. doi:10.1145/27633.27634

Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal. PMID:24288502

Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning (Vol. 479). John Wiley & Sons. doi:10.1002/0471448354

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). AJAX: an extensible data cleaning tool.

Haghighat, M., Abdel-Mottaleb, M., & Alhalabi, W. (2016). Discriminant Correlation Analysis: Real-Time Feature Level Fusion for Multimodal Biometric Recognition. IEEE Transactions on Information Forensics and Security, 11(9), 1984–1996. doi:10.1109/TIFS.2016.2569061

John, F., & Allison, L. (2016). R and the Journal of Statistical Software. Journal of Statistical Software, 73(2).

Kandel, S., Paepcke, A., Hellerstein, J., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. Paper presented at the Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Karrar, A. E., & Ali, M. M. (2016). Comparative Analysis of Data Cleaning Tools Using SQL Server and Winpure Tool. International Journal of Computer Applications in Technology, 3(7), 371–377.

Kumar, S., & Nadeem, M. (2008). Extraction, Transformation, Loading (ETL) and Data Cleaning Problems. Journal of Independent Studies and Research on Computing, 6(1).

Lee, M. L., Lu, H., Ling, T. W., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. Paper presented at the 10th International Conference on Database and Expert Systems Applications.

Martinez-Mosquera, D., Luján-Mora, S., López, G., & Santos, L. (2017). Data Cleaning Technique for Security Logs Based on Fellegi-Sunter Theory. Paper presented at the SIGSAND-EuroSymposium, Gdansk, Poland.

Müller, H., & Freytag, J.-C. (2005). Problems, methods, and challenges in comprehensive data cleansing.

Parent, C., & Spaccapietra, S. (1998). Issues and approaches of database integration. Communications of the ACM, 41(5es), 166–178. doi:10.1145/276404.276408

Patel, S. (2012). Requirement to cleanse DATA in ETL process and Why is data cleansing in Business Application? International Journal of Engineering Research and Applications, 2(3).

Porwal, S., & Vora, D. (2013). A Comparative Analysis of Data Cleaning Approaches to Dirty Data. International Journal of Computers and Applications, 62(17).

Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3–13.

Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002, November). Conceptual modeling for ETL processes. In Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP (pp. 14-21). ACM.

Verborgh, R., & De Wilde, M. (2013). Using OpenRefine. Packt Publishing Ltd.


Samson Oni is a PhD student of Information Systems at the University of Maryland Baltimore County (UMBC). He obtained his master’s degree in computer science from the University of Maryland, Baltimore County. He worked as a Research Assistant at the Imaging Research Center, UMBC. His previous work includes positions as a technical intern for the Joint Centre for Earth Systems (NASA-JCET) at UMBC and as a full-stack developer for the Department of Education, UMBC. His research focuses on cyber security and data science, and he has carried out several projects in these domains. Currently, he is a research assistant in Information Systems at UMBC, where he is working on semantic web, blockchain and cybersecurity-related projects. More information can be found at http://www.samdwise.com

Zhiyuan Chen is an Associate Professor in the Department of Information Systems at the University of Maryland Baltimore County. He received a PhD degree in Computer Science from Cornell University in August 2002. He has more than 10 years of extensive research experience in data privacy, privacy preserving data mining, database management, data science, and cyber security. His main research focus is on algorithms that preserve the privacy of data while allowing accurate analysis of the data. He has published over 40 papers in peer reviewed journals and publications, and over 20 of them are in the area of privacy and security. More information can be found at https://userpages.umbc.edu/~zhchen/

Susan Hoban worked with NASA for over two decades, first as a scientist studying comets and the interstellar medium, then as a STEM Educator. Dr. Hoban develops curriculum for professional development of educators for classroom use and informal education venues. Dr. Hoban specializes in integrating hands-on activities with data collection and analysis to develop the habits-of-mind of STEM. Curriculum modules include, but are not limited to rocketry, environmental education, astronomy & astrobiology, computer modeling, STEM music, and robotics for learners of all ages. Dr. Hoban is currently also working on using analytics for cyber security.

Onimi Jademi is a PhD candidate in the Department of Information Systems at the University of Maryland, Baltimore County (UMBC). Her research focuses on natural language processing and machine learning, and its applications especially in the healthcare domain. She has experience with high quality qualitative and quantitative research methods.