Workflow Support for Continuous Data Quality Control in a FilteredPush Network. J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, P. Morris, B. Morris, T. Song

Tdwg14 fp-kurator-ludaescher


DESCRIPTION

Workflow Support for Continuous Data Quality Control in a FilteredPush Network. J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song. Presentation given at TDWG 2014, Jönköping, Sweden.

Text of Tdwg14 fp-kurator-ludaescher

  • 1. Workflow Support for Continuous Data Quality Control in a FilteredPush Network. J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song

  • 2. Problem: Data & Metadata Quality. Collections & occurrence data is all over the map, literally (off the map!). DQ issues, e.g.: lat/long transposition, coordinate & projection issues; scientific names (spelling errors, other). Data entry/creation, fuzzy data, naming issues, bit rot, data conversions and transformations, schema mappings, (you name it). Related projects: Filtered-Push, Kurator.

  • 3. What problems are we trying to solve? Detect and flag data quality issues. Repair if possible; ask human curators as needed. Keep track of provenance: automatic repairs, human curators' edits. Employ workflow (semi-)automation. Scientific workflow systems: Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, ... Related technologies: Akka parallel execution platform; script-based automation (e.g. Python) and digital notebooks (iPython).

  • 4. Data Curation Workflow. Dou, Lei, G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science, 9:1614-1619, doi:10.1016/j.procs.2012.04.177

  • 5. Customers of Curation Workflows. Collection managers, who are managing the collections databases: can run curation workflows periodically in the presence of new data and/or new curation services. (Biodiversity) researchers: to perform an analysis in the presence of (partially) dirty data, researchers need to clean or fix dirty data, throw out unfixable data, and report back to the collection managers (cf. FPush).

  • 6. Filtered Push (http://xkcd.com/386/). (1) Kvetch about data; (2) Push to interested parties; (3) Human filter; (4) Change data in databases; (5) Store all assertions. Source: Paul J. Morris

  • 7. Introduction; NEVP Digitization; NEVP Data Flow; Annotations; Duplicates; Quality Control. Diagram labels: Symbiota Instance; Symbiota Instance & DB; Akka curation workflow on FP2, working on DW spreadsheet reports. Source: Paul J. Morris

  • 8. Overall Dataflow. Diagram labels: Access Point; Symbiota Portal; FilteredPush Node; Akka Kurator Workflows; Occurrence Records; Quality Control Annotations; Quality Control Workflow; Quality Controlled Data Set. Source: Paul J. Morris

  • 9. Example Curation Workflow (some steps of a larger workflow). Load Dataset; Scientific Name Validation; Georeference Validation; Collection Date Validation; [Create Annotations into FPush Network]; Output results, translated to spreadsheet with provenance!

  • 10. Curation Workflow Output

  • 11. ... close up. CORRECT: checked and OK. CURATED: checked and fixed. UNABLE_CURATE: internally inconsistent, cannot fix. UNABLED_DET_VALIDITY: not enough data; no external reference found.

  • 12. ... even more close: Spreadsheet Provenance. Assertions made: sign changed coordinates are on the Earth's surface; coordinates not inside country; transposed/sign changed coordinates to place inside country; transposed/sign changed coordinates are near georeference of locality from GeoLocate. Sources used: land data from Natural Earth; country boundary data from GeoCommunity; GeoLocate.

  • 13. Date Validation. Check: collector's life span vs. date collected. Possible outcomes: Valid; Corrected; Unable to validate; Internal inconsistency; Contradicting dates; External inconsistency; Lack of date data.

  • 14. The Logic Behind Each Step. Date Collected: collector's lifetime vs. date collected. Georeference Validation: lat/long valid (on Earth); within a country (shape file), point in polygon. If the georef is bad, then try transpositions, sign-swapping etc. of lat/long. If they match, fix it! Make sure to record in provenance, using the transposed (or sign-fixed) original data (not the GeoLocate).

  • 15. Logic Behind Each Step (cont'd). Scientific Name Validation is customer-dependent: Collection Managers: nomenclature; Researchers: taxonomy (current names). Several remote services: IPNI, GNI, ...

  • 16. Curation Workflow Challenges: Machine Cycles. Scalability & technology issues: clean aggregated data at a FP Node. Headless use of Kepler/COMAD, pros & cons: OK on human cycles, but NOT OK on machine cycles. Akka: parallelizing remote service invocation helps. Non-trivial programming => add another layer on top of Akka .. or ??

  • 17. Challenges: Human Cycles. New Kurator project: enable tool makers. Make it easy to build components (software actors, services) and workflows (gluing services together). Data Curation Workflows Interest Group!? Service builders. Service & workflow registries, cf. myExperiment. Service aggregators, cf. BioVel, DwC validator, ...

  • 18. What is Kurator? NSF-DBI #1356751, Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data. Sept. 2014 to 2017. @Illinois: B. Ludäscher, James Macklin, Tim McPhillips, ... @Harvard: James Hanken, Paul Morris, Bob Morris, ... @TDWG community.

  • 19. Kurator Tenets. Technology agnostic, to the extent we can; avoid reinventing the wheel; one size probably doesn't fit all => deploy curation steps on different wf systems, platforms. For tool makers. Agile, community-driven development. Kurator just started, evolving: get involved now! Kick-off meeting November 17 & 18 @ NCSA (University of Illinois, Urbana-Champaign).

  • 20. How we do it. Build a library of curation services such that curation workflows can be run from various platforms. Scientific workflow systems, e.g. Restflow, Kepler, Taverna, Galaxy. Other platforms, e.g. Akka, Python-based, leveraging existing technologies.

  • 21. How we do it. Open source, community-friendly approach: git repository (NCSA open source projects). Agile software development: NCSA support tools, e.g. JIRA, Bamboo. Inspired by: Small bioinformatics tools manifesto (post-facto), cf. Unix tenets (small tools, use filters, pipes, KISS!); experience with other (sometimes not so agile) development projects.

  • 22. Agile Kurator Development. Interested in looking under the hood? Kurator/Akka curation wf demo: Wed PM. Initial URL: opensource.ncsa.illinois.edu/projects/KURATOR

  • 23. Related Research (Tianhong Song, UC Davis). Analyze linear workflowstory. Use patterns to discover wf design issues (e.g. use before update); then fix them. Parallelize when possible.
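The georeference validation logic described on slide 14 (check that the point is on Earth and inside the stated country; if not, try transposing and sign-swapping lat/long, and record what was done) can be sketched roughly as follows. This is a minimal illustration, not the Kurator or FilteredPush implementation: all function names are hypothetical, the country boundary is a toy rectangle rather than a real shape file, and the status labels are borrowed from slide 11.

```python
from itertools import product


def on_earth(lat, lon):
    """True if the coordinates are within valid lat/long ranges."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def point_in_polygon(lat, lon, polygon):
    """Ray-casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


def candidate_repairs(lat, lon):
    """Yield (description, lat, lon) for transposed and sign-changed variants."""
    for transposed in (False, True):
        a, b = (lon, lat) if transposed else (lat, lon)
        for s_lat, s_lon in product((1, -1), repeat=2):
            if not transposed and s_lat == 1 and s_lon == 1:
                continue  # skip the unchanged original
            parts = []
            if transposed:
                parts.append("transposed")
            if s_lat == -1:
                parts.append("latitude sign changed")
            if s_lon == -1:
                parts.append("longitude sign changed")
            yield ", ".join(parts), s_lat * a, s_lon * b


def validate_georeference(lat, lon, country_polygon):
    """Return (status, (lat, lon), provenance) for one occurrence record."""
    provenance = []
    if on_earth(lat, lon) and point_in_polygon(lat, lon, country_polygon):
        provenance.append("Coordinates are inside country")
        return "CORRECT", (lat, lon), provenance
    provenance.append("Coordinates not inside country")
    for description, new_lat, new_lon in candidate_repairs(lat, lon):
        if on_earth(new_lat, new_lon) and point_in_polygon(
            new_lat, new_lon, country_polygon
        ):
            provenance.append(f"{description} to place coordinates inside country")
            return "CURATED", (new_lat, new_lon), provenance
    provenance.append("No transposition or sign change places coordinates inside country")
    return "UNABLE_CURATE", (lat, lon), provenance


# Toy "country": a rectangle given as (lat, lon) vertices (an assumption for the demo).
TOY_COUNTRY = [(41.0, -74.0), (41.0, -69.0), (43.0, -69.0), (43.0, -74.0)]

if __name__ == "__main__":
    print(validate_georeference(-71.1, 42.4, TOY_COUNTRY))
```

The first repair that lands inside the polygon wins, so the order of candidates is a design choice; a production workflow would presumably also compare against an independent georeference of the locality, as slide 12 mentions.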
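The date check on slides 13 and 14 (collector's life span vs. date collected) might look something like the sketch below. The function name, the return shape, and the mapping of cases to outcome labels are assumptions made for illustration; the slides do not say how a "Corrected" outcome is produced, so this sketch never returns it.

```python
from datetime import date


def validate_date_collected(date_collected, collector_birth, collector_death):
    """Return (outcome, comment) comparing a collection date to a life span.

    Any argument may be None when the value is unknown.
    """
    if date_collected is None:
        return "Unable to validate", "Lack of date data"
    if collector_birth is None or collector_death is None:
        return "Unable to validate", "Collector life span unknown"
    if collector_birth > collector_death:
        # The reference data contradicts itself (assumed mapping to this label).
        return "Unable to validate", "Contradicting dates"
    if collector_birth <= date_collected <= collector_death:
        return "Valid", "Date collected falls within collector's life span"
    return "Unable to validate", "Date collected outside collector's life span"


if __name__ == "__main__":
    print(validate_date_collected(date(1900, 5, 1), date(1850, 1, 1), date(1920, 1, 1)))
```

Returning a comment alongside the outcome keeps the provenance of each decision available for the spreadsheet report.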