Kurator Project Overview (Brief)

DESCRIPTION

eScience Research Round Table (ERRT) @ GSLIS on Data Curation for Biodiversity Informatics

Page 1: Kurator Project Overview (Brief)

KURATOR: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data

Bertram Ludäscher

Graduate School of Library and Information Science (GSLIS)
National Center for Supercomputing Applications (NCSA)

Page 2: Kurator Project Overview (Brief)

ERRT @ GSLIS 10/22/2014 2

Outline

• Kurator:
  – What problems is Kurator tackling and for whom?
  – Curation Workflow Example
  – How we’re going about it
• Not Today:
  – Related Biodiversity Informatics Projects
    • Filtered-Push
    • Exploring Taxon Concepts (ETC)
    • Euler
  – Other Informatics Projects
    • DataONE
    • SKOPE

Page 3: Kurator Project Overview (Brief)

What is Kurator?

• NSF-DBI #1356751 – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
  – Sept. 2014 – 2017
  – @Illinois:
    • B. Ludäscher, James Macklin, Tim McPhillips, …
  – @Harvard:
    • James Hanken, Paul Morris, Bob Morris, …

Page 4: Kurator Project Overview (Brief)

Problem: Data & Metadata Quality

• Collections & occurrence data is all over the map
  – … literally (off the map!)
• Issues:
  – Lat/Long transposition, coordinate & projection issues
  – Scientific Names (spelling errors, other)
  – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Precursor:
  – Filtered-Push Collaboration
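To make the lat/long transposition issue listed above concrete, here is a minimal sketch of a range-based check. The function name `check_coordinates` and the status strings are hypothetical, chosen for illustration; this is not Kurator or Filtered-Push code. It assumes decimal degrees and only catches transpositions where the reported latitude falls outside ±90.

```python
from typing import Optional, Tuple


def check_coordinates(lat: float, lon: float) -> Tuple[str, Optional[Tuple[float, float]]]:
    """Classify a decimal-degree coordinate pair (illustrative only).

    Returns (status, suggestion):
      "ok"           both values are within their valid ranges
      "transposed?"  the pair is invalid as given but valid when swapped
      "invalid"      no valid reading exists, even after swapping
    """
    lat_ok = -90.0 <= lat <= 90.0
    lon_ok = -180.0 <= lon <= 180.0
    if lat_ok and lon_ok:
        return "ok", None
    # The swapped pair is plausible only if the reported longitude fits the
    # latitude range and the reported latitude fits the longitude range.
    if -90.0 <= lon <= 90.0 and -180.0 <= lat <= 180.0:
        return "transposed?", (lon, lat)
    return "invalid", None
```

A pair such as (-120.5, 38.5) is flagged with the swapped pair as a suggested repair. A transposed pair whose values both lie within ±90 passes this check, so a real service would need further evidence, for example the stated country.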

Page 5: Kurator Project Overview (Brief)

What Problems does Kurator try to solve?

• Detect and flag data quality issues

• Repair if possible

• Keep track of provenance
  – automatic repairs
  – human curator edits
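The three goals above (flag, repair, track provenance) can be sketched as a record wrapper that logs every change together with the agent that made it. All class, method, and agent names here are assumptions made for illustration; the slides do not describe Kurator's actual data model.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ProvenanceEntry:
    field_name: str
    old_value: Any
    new_value: Any
    agent: str          # hypothetical convention: "auto:<service>" or "human:<curator>"
    comment: str = ""


@dataclass
class CuratedRecord:
    data: Dict[str, Any]
    flags: List[str] = field(default_factory=list)
    provenance: List[ProvenanceEntry] = field(default_factory=list)

    def flag(self, message: str) -> None:
        """Record a detected quality issue without changing the data."""
        self.flags.append(message)

    def repair(self, field_name: str, new_value: Any, agent: str, comment: str = "") -> None:
        """Change a field and log who changed it and what the old value was."""
        old_value = self.data.get(field_name)
        self.provenance.append(
            ProvenanceEntry(field_name, old_value, new_value, agent, comment)
        )
        self.data[field_name] = new_value
```

Because automatic repairs and human edits go through the same `repair` call, the provenance list gives one ordered history per record, and any repair can be reviewed or undone using the stored old value.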

Page 6: Kurator Project Overview (Brief)

Who are the customers?

• Collection Managers
  – … who are managing the collections databases
  – Can run curation workflows periodically
    • … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
  – To perform an analysis in the presence of (partially) dirty data, researchers need to
    • Clean or fix dirty data
    • Throw out unfixable data
  – Pushing changes to the original data collections and collection managers (cf. FPush)

Page 7: Kurator Project Overview (Brief)

Example: Kepler/Kurator (FPush project)

Page 8: Kurator Project Overview (Brief)

Simplified Example Workflow

• Related Research (Tianhong Song, UC Davis)
  – Analyze linear workflow “story”
  – Use patterns to discover workflow design issues (e.g. use before update); then fix them
  – Parallelize when possible
• Kurator:
  – Allow easy assembly of such workflows
  – For tool makers
  – … and tool users
  – … scalability challenge.

Page 9: Kurator Project Overview (Brief)

Example Output …

Page 10: Kurator Project Overview (Brief)

… close up …

Page 11: Kurator Project Overview (Brief)

How we do it

• Build a library of curation services such that curation workflows can be run from various platforms
  – Scientific workflow systems
    • e.g. Restflow, Kepler, Taverna, Galaxy
  – Other platforms
    • e.g. Akka, Python-based, …
• … leveraging existing technologies
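One way to read "a library of curation services that can run from various platforms" is to keep each service a plain function and put platform-specific glue in thin adapters. The sketch below assumes records travel as one JSON object per line; `normalize_whitespace` and `run_as_filter` are hypothetical names, not part of any Kurator release.

```python
import json
from typing import Any, Callable, Dict, IO

Record = Dict[str, Any]


def normalize_whitespace(record: Record) -> Record:
    """Hypothetical curation service: collapse runs of whitespace in string fields."""
    return {
        key: " ".join(value.split()) if isinstance(value, str) else value
        for key, value in record.items()
    }


def run_as_filter(service: Callable[[Record], Record], source: IO[str], sink: IO[str]) -> None:
    """Adapter: read one JSON record per line, apply the service, write one per line."""
    for line in source:
        if line.strip():  # skip blank lines
            sink.write(json.dumps(service(json.loads(line))) + "\n")
```

Calling `run_as_filter(normalize_whitespace, sys.stdin, sys.stdout)` would turn the service into a command-line filter, while a workflow system could call `normalize_whitespace` directly from its own actor or step wrapper.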

Page 12: Kurator Project Overview (Brief)

How we do it

• Open source, community-friendly approach
  – git repository (NCSA open source projects)
• Agile software development
  – NCSA support tools, e.g. JIRA, Bamboo
• Inspired by
  – Small bioinformatics tools manifesto (post-facto)
  – Unix tenets (small, interoperable tools, … )
  – Experience with other (sometimes not so agile) development projects

Page 13: Kurator Project Overview (Brief)

Kurator: Agile Development

Page 14: Kurator Project Overview (Brief)

Q & A …

• What does data curation, quality control mean in your domain / application / research?
• Are there particular issues that are important to you?
• Join us!
  – Kurator & other Biodiversity Interest
    • Hackers welcome, too.
  – Email: [email protected]

Page 15: Kurator Project Overview (Brief)

Related Research (Tianhong Song)

• Automated Design, Analysis, Optimization of Curation Workflows.

• Idea:

• Example Workflow: [Scientific Name Validation] [GeoRef Validation] [Date Validation]
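The three-step example workflow can be sketched as a list of validator functions applied in order. Everything here is an assumption for illustration: the field names follow Darwin Core terms, the name check only tests for a "Genus epithet" shape (a real service would consult a name authority), and the date check accepts only what `date.fromisoformat` parses.

```python
import re
from datetime import date
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]
Validator = Callable[[Record], List[str]]


def validate_scientific_name(rec: Record) -> List[str]:
    # Shape check only: capitalized genus followed by a lowercase epithet.
    if re.match(r"^[A-Z][a-z]+ [a-z]+", rec.get("scientificName", "")):
        return []
    return ["scientificName: not in 'Genus epithet' form"]


def validate_georef(rec: Record) -> List[str]:
    try:
        lat = float(rec.get("decimalLatitude", ""))
        lon = float(rec.get("decimalLongitude", ""))
    except ValueError:
        return ["georef: coordinates missing or not numeric"]
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return []
    return ["georef: coordinates out of range"]


def validate_date(rec: Record) -> List[str]:
    try:
        date.fromisoformat(rec.get("eventDate", ""))
    except ValueError:
        return ["eventDate: not a parseable ISO date"]
    return []


# The linear "story": each step sees the same record, in a fixed order.
WORKFLOW: Sequence[Validator] = (validate_scientific_name, validate_georef, validate_date)


def run_workflow(rec: Record, steps: Sequence[Validator] = WORKFLOW) -> List[str]:
    """Run every step in order and collect all flags raised along the way."""
    flags: List[str] = []
    for step in steps:
        flags.extend(step(rec))
    return flags
```

Since these validators only read the record, their order does not affect the result; ordering starts to matter once steps also repair fields.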

Page 16: Kurator Project Overview (Brief)

Related Research (Tianhong Song)

• Analyze linear workflow “story”

• Use patterns to discover workflow design issues (e.g. use before update); then fix them

• Parallelize when possible
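The slides do not define the "use before update" pattern or the parallelization test, so the following is one plausible reading rather than the method from the cited research. It assumes each workflow step declares the fields it reads and writes; a use-before-update issue is then an earlier step reading a field that a later step updates, and two steps may run in parallel when neither writes a field the other touches. The `Step` class and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Step:
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def use_before_update(steps: List[Step]) -> List[Tuple[str, str, str]]:
    """Return (earlier, later, field) triples where `earlier` reads a field
    that `later` updates afterwards in the linear order."""
    issues = []
    for i, earlier in enumerate(steps):
        for later in steps[i + 1:]:
            for field_name in sorted(earlier.reads & later.writes):
                issues.append((earlier.name, later.name, field_name))
    return issues


def independent(a: Step, b: Step) -> bool:
    """True if neither step writes a field the other reads or writes."""
    return not (a.writes & (b.reads | b.writes)) and not (b.writes & a.reads)
```

For example, if a georeference check reads `country` before a later step cleans up `country`, `use_before_update` reports that pair, and the fix is to reorder the two steps. A date validation step that touches only `eventDate` is independent of both and could run alongside them.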