ETC & Authors in the Driver’s Seat vs YesWorkflow: Revealing data-/workflow from scripts Kurator: Automating data curation workflows EulerX: Agreeing to disagree about taxonomies Whole-Tale: Reproducible, computational narratives Bertram Ludäscher [email protected] ETC+Authors @ Biosphere 2 2018-01-10..12 Director,Center for Informatics Research in Science & Scholarship (CIRSS) School of Information Sciences (iSchool@Illinois) & National Center for Supercomputing Applications (NCSA) & Department of Computer Science (CS@Illinois) 1

ETC & Authors in the Drivers Seat


Page 1: ETC & Authors in the Drivers Seat

ETC & Authors in the Driver’s Seat vs

YesWorkflow: Revealing data-/workflow from scripts
Kurator: Automating data curation workflows
EulerX: Agreeing to disagree about taxonomies
Whole-Tale: Reproducible, computational narratives

Bertram Ludäscher [email protected]

ETC+Authors @ Biosphere 2, 2018-01-10..12

Director, Center for Informatics Research in Science & Scholarship (CIRSS), School of Information Sciences (iSchool@Illinois)

& National Center for Supercomputing Applications (NCSA) & Department of Computer Science (CS@Illinois)

Page 2: ETC & Authors in the Drivers Seat

Author’s Driving ..

• Curators: dealing with problems of data quality, reuse, interoperability, etc. as soon as they can – but often: “down the road” …

• Authors: address (meta-)data quality upstream – .. at the source, when data is created

=> Resonates with the “empowering scientists” theme we’re pursuing in other projects (e.g. WT, YW ..)

Ludäscher: Workflows & Provenance => Understanding 2

Page 3: ETC & Authors in the Drivers Seat

Provenance (Lineage) matters …

• One of these sold for $180M, the other one for $22K (but could be worth more ... definitely maybe ...)

• Which one would you like to own?

Page 4: ETC & Authors in the Drivers Seat

Provenance (Lineage) matters …

• One of these sold for $180M, the other one for …

• … $450M!!!

Page 5: ETC & Authors in the Drivers Seat

Provenance is: keeping records …

• Grand Canyon’s rock layers are a record of the early geologic history of North America. The Ancestral Puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)

• Not shown: computational archaeologists reconstructing past climate from multiple tree-ring databases => computational provenance is key for transparency & reproducibility

Page 6: ETC & Authors in the Drivers Seat

... and provenance is: Understanding what happened!

Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.

Author: Jkwchui (Based on drawing by Truth-seeker2004)

Page 7: ETC & Authors in the Drivers Seat

Computational Provenance …

• Origin, processing history of artifacts
– data products, figures, ...
– also: underlying workflow => understand methods, data flow, and dependencies

[Figure: “Climate Change Impacts in the United States”, U.S. National Climate Assessment, U.S. Global Change Research Program]

Page 8: ETC & Authors in the Drivers Seat

Rewind: Data Curation Workflows (Filtered-Push … Kepler … Kurator projects)


Page 9: ETC & Authors in the Drivers Seat

Data Curation Workflows & Provenance

• Data curation and data cleaning workflows
– … can be defined using a workflow system
  • workflow = “prospective” provenance (= general recipe)
– ... or using good-old scripts (bash, Python, R, ...)
  • … which is what many “mere mortals” use!

• Script-based workflows
– … benefit from having the workflow exposed and dataflow dependencies revealed

Page 10: ETC & Authors in the Drivers Seat

Workflows <=> Provenance: an important link!

• Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)

• Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)

Page 11: ETC & Authors in the Drivers Seat

ProvONE = W3C PROV + DataONE extensions

[Figure labels: Trace, Workflow, Data (extensible)]

See purl.dataone.org/provone-v1-dev

Page 12: ETC & Authors in the Drivers Seat

Related Projects: NSF DataONE (ProvONE ..) + …

• … NSF SKOPE: system and tools to discover, access, analyze, visualize paleoenvironmental data
– unprecedented ability to explore provenance (detailed, comprehensible record of computational derivation of results)
– for researchers, tinkerers, and modelers

• … NSF Whole Tale:
– leverage & contribute to existing CI to support the whole tale (“living paper”), from workflow run to scholarly publication
– integrate tools & CI (DataONE, Globus, iRODS, NDS, ...) to simplify use and promote best practices.
– driven by science WGs (Archaeology/SKOPE, materials science, astro, bio ..)

Page 13: ETC & Authors in the Drivers Seat

Provenance Support for Reproducible Science
Example: Paleoclimate Reconstruction

Science paper (OA) uses:
• open source code:
– R, PaleoCAR, …

• Is that all we need?
• What was the “workflow”?
• Is there prospective and/or retrospective provenance?

Page 14: ETC & Authors in the Drivers Seat

SKOPE: Synthesized Knowledge Of Past Environments

Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

… implemented as an R Script …

Page 15: ETC & Authors in the Drivers Seat

YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!

• YW annotations in a (Python, R, …) script recreate a workflow view from the script …

Annotation keywords: @BEGIN .. @END .. @IN .. @OUT .. @URI .. @LOG ..

[Figure: YW workflow graph of the data collection script. Node labels follow.]

Inputs and parameters:
– cassette_id
– sample_score_cutoff
– sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv)
– calibration_image (file:calibration.img)

Steps, each followed by the data nodes drawn with it:
– initialize_run: run_log (file:run/run_log.txt)
– load_screening_results: sample_name, sample_quality
– calculate_strategy: rejected_sample, accepted_sample, num_images, energies
– log_rejected_sample: rejection_log (file:/run/rejected_samples.txt)
– collect_data_set: sample_id, energy, frame_number, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw)
– transform_images: corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), total_intensity, pixel_count, corrected_image_path
– log_average_image_intensity: collection_log (file:run/collected_images.csv)
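To make the idea of "recreating a workflow view from the script" concrete, here is a minimal sketch in Python. It is not the YesWorkflow tool: it only handles flat (non-nested) blocks and the keywords @BEGIN, @END, @IN, @OUT and @URI. The annotated script text, the helper names `extract_workflow` and `dataflow_edges`, and the choice of which inputs each step declares are assumptions made for illustration.

```python
import re

# Hypothetical script text, annotated in the style named on the slide.
ANNOTATED_SCRIPT = '''
# @BEGIN collect_data_set
# @IN accepted_sample
# @IN energies
# @OUT raw_image @URI file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
def collect_data_set(accepted_sample, energies):
    pass
# @END collect_data_set

# @BEGIN transform_images
# @IN raw_image
# @IN calibration_image @URI file:calibration.img
# @OUT corrected_image @URI file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
def transform_images(raw_image, calibration_image):
    pass
# @END transform_images
'''

# One annotation per comment line: keyword, name, optional URI template.
TAG = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\S+)(?:\s+@URI\s+(\S+))?")


def extract_workflow(text):
    """Return {step: {"in": [...], "out": [...], "uri": {name: template}}}."""
    steps, current = {}, None
    for line in text.splitlines():
        match = TAG.search(line)
        if not match:
            continue  # ordinary code line, ignored by this sketch
        tag, name, uri = match.groups()
        if tag == "BEGIN":
            current = name
            steps[current] = {"in": [], "out": [], "uri": {}}
        elif tag == "END":
            current = None
        elif current is not None:
            steps[current][tag.lower()].append(name)
            if uri:
                steps[current]["uri"][name] = uri
    return steps


def dataflow_edges(steps):
    """Connect two steps whenever an @OUT name of one is an @IN name of the other."""
    return [
        (producer, name, consumer)
        for producer, p in steps.items()
        for consumer, c in steps.items()
        for name in p["out"]
        if name in c["in"]
    ]


workflow = extract_workflow(ANNOTATED_SCRIPT)
edges = dataflow_edges(workflow)
print(edges)
```

The point of the sketch is that the workflow view comes from comments alone; the function bodies are never executed or analyzed.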

Page 16: ETC & Authors in the Drivers Seat

Paleoclimate Reconstruction (openSKOPE.org)

• … explained using YesWorkflow!

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

[Figure: YW workflow graph of the reconstruction script. Node labels follow.]

Inputs and parameters: master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years

Steps, each followed by the data nodes drawn with it:
– GetModernClimate: PRISM_annual_growing_season_precipitation
– SubsetAllData: dendro_series_for_calibration, dendro_series_for_reconstruction
– CAR_Analysis_unique: cellwise_unique_selected_linear_models
– CAR_Analysis_union: cellwise_union_selected_linear_models
– CAR_Reconstruction_union: raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors
– CAR_Reconstruction_union_output: ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif

Page 17: ETC & Authors in the Drivers Seat

YW Demo Use Cases (IDCC’17)

Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager

Page 18: ETC & Authors in the Drivers Seat

YW-RECON: Prospective & Retrospective Provenance … (almost) for free!

[Figure: the YW workflow graph of the data collection script (same node labels and URI templates as on page 15), shown next to the files left behind by one run:]

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

• URI-templates link conceptual entities to runtime provenance “left behind” by the script author …

• … facilitating provenance reconstruction
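How a URI template can link a conceptual entity such as raw_image to concrete files may be easier to see in code. The following is a minimal sketch, assuming that template variables never contain "/" or "_"; it is not the YW-recon implementation. The helper names `template_to_regex` and `reconstruct` are hypothetical, and the path list is a small hand-picked sample of the listing, not a full run.

```python
import re

# URI template of raw_image, as shown in the workflow graph.
RAW_IMAGE_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"


def template_to_regex(template):
    """Compile 'a/{x}/b_{y}.raw' into a regex with one named group per variable."""
    parts = re.split(r"\{(\w+)\}", template)  # odd positions are variable names
    pattern, seen = "", set()
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)            # literal text
        elif part in seen:
            pattern += "(?P=%s)" % part           # repeated variable: same value
        else:
            seen.add(part)
            pattern += "(?P<%s>[^/_]+)" % part    # assumption: no '/' or '_' in values
    return re.compile(pattern)


def reconstruct(template, paths):
    """Return the variable bindings for every path that fits the template."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in map(regex.fullmatch, paths) if m]


# Hand-picked sample of files from the run/ listing (illustrative subset).
OBSERVED_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/run_log.txt",
]

bindings = reconstruct(RAW_IMAGE_TEMPLATE, OBSERVED_PATHS)

# Which samples were images collected from?
samples = sorted({b["sample_id"] for b in bindings})
# Which energies were used for sample DRT322?
energies_drt322 = sorted({b["energy"] for b in bindings if b["sample_id"] == "DRT322"})
print(samples, energies_drt322)
```

Each matched path yields values for cassette_id, sample_id, energy and frame_number, so questions about a run reduce to filtering these bindings.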

Page 19: ETC & Authors in the Drivers Seat

Q1: What samples did the script run collect images from?

[Figure: YW workflow graph and run/ directory listing, repeated from page 18]

Page 20: ETC & Authors in the Drivers Seat

Q2: What energies were used for image collection from sample DRT322?

[Figure: YW workflow graph and run/ directory listing, repeated from page 18]

Page 21: ETC & Authors in the Drivers Seat

Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?

[Figure: YW workflow graph and run/ directory listing, repeated from page 18]

Page 22: ETC & Authors in the Drivers Seat

Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?

[Figure: YW workflow graph and run/ directory listing, repeated from page 18]
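Questions like Q3 and Q5 link two file patterns through the variables they share. A minimal sketch, assuming the two URI templates from the workflow graph written out by hand as regular expressions; the function name `raw_sources` and the short path list are hypothetical, and the paths are taken relative to the template roots:

```python
import re

# Hand-written regex equivalents of the two URI templates (an assumption of
# this sketch; sample ids are assumed to contain no '/' or '_').
RAW_RE = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw"
)
CORRECTED_RE = re.compile(
    r"data/(?P<sample_id>[^/_]+)/(?P=sample_id)"
    r"_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img"
)
SHARED = ("sample_id", "energy", "frame_number")  # variables in both templates


def raw_sources(corrected_path, raw_paths):
    """Return (raw_path, cassette_id) for raw files sharing all variables."""
    corrected = CORRECTED_RE.fullmatch(corrected_path)
    if corrected is None:
        return []
    hits = []
    for path in raw_paths:
        raw = RAW_RE.fullmatch(path)
        if raw and all(raw[key] == corrected[key] for key in SHARED):
            hits.append((path, raw["cassette_id"]))
    return hits


# Illustrative subset of the raw files in the listing.
RAW_PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
]

q3 = raw_sources("data/DRT322/DRT322_11000eV_030.img", RAW_PATHS)
q5 = raw_sources("data/DRT240/DRT240_10000eV_001.img", RAW_PATHS)
print(q3, q5)
```

The cassette id never appears in the corrected image name; it is recovered only because the raw file that shares sample, energy and frame number carries it in its path.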

Page 23: ETC & Authors in the Drivers Seat

Hybrid Provenance: YW Model + Runtime Observables (file level)

• The YW model can be connected with runtime observables
• => YW-recon (provenance reconstruction)
• Here:
  • What specific files were read, written and where do they occur in the workflow?

Page 24: ETC & Authors in the Drivers Seat

C3-C4 Prospective Provenance

[Figure: YW workflow graph C3_C4_map_present_NA. Node labels follow.]

Inputs:
– SYNMAP_land_cover_map_data (inputs/land_cover/SYNMAP_NA_QD.nc)
– mean_airtemp (file:inputs/narr_air.2m_monthly/air.2m_monthly_{start_year}_{end_year}_mean.{month}.nc)
– mean_precip (file:inputs/narr_apcp_rescaled_monthly/apcp_monthly_{start_year}_{end_year}_mean.{month}.nc)

Steps, each followed by the data nodes drawn with it:
– fetch_SYNMAP_land_cover_map_variable: lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable
– fetch_monthly_mean_air_temperature_data: Tair_Matrix
– fetch_monthly_mean_precipitation_data: Rain_Matrix
– initialize_Grass_Matrix: Grass_variable
– examine_pixels_for_grass: C3_Data, C4_Data
– generate_netcdf_file_for_C3_fraction: C3_fraction_data (file:outputs/SYNMAP_PRESENTVEG_C3Grass_RelaFrac_NA_v2.0.nc)
– generate_netcdf_file_for_C4_fraction: C4_fraction_data (file:outputs/SYNMAP_PRESENTVEG_C4Grass_RelaFrac_NA_v2.0.nc)
– generate_netcdf_file_for_Grass_fraction: Grass_fraction_data (file:outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc)

Page 25: ETC & Authors in the Drivers Seat

What does C4_fraction_data depend on?

[Figure: lineage subgraph of C4_fraction_data next to the overall workflow graph C3_C4_map_present_NA]

Nodes in the lineage of C4_fraction_data:
– inputs: SYNMAP_land_cover_map_data, mean_airtemp, mean_precip
– fetch_SYNMAP_land_cover_map_variable: lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable
– fetch_monthly_mean_air_temperature_data: Tair_Matrix
– fetch_monthly_mean_precipitation_data: Rain_Matrix
– examine_pixels_for_grass: C4_Data
– generate_netcdf_file_for_C4_fraction: C4_fraction_data

C4_fraction_data lineage very similar to overall workflow graph!

Page 26: ETC & Authors in the Drivers Seat

What does Grass_fraction_data depend on?

[Figure: overall workflow graph C3_C4_map_present_NA next to the lineage subgraph of Grass_fraction_data]

Nodes in the lineage of Grass_fraction_data:
– input: SYNMAP_land_cover_map_data
– fetch_SYNMAP_land_cover_map_variable: lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable
– initialize_Grass_Matrix: Grass_variable
– generate_netcdf_file_for_Grass_fraction: Grass_fraction_data

Grass_fraction_data lineage different from overall workflow graph!
- Smaller subgraph
- Depends on only 1 of 3 inputs!
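A "what does X depend on?" query over prospective provenance is a backward traversal of the workflow graph. The sketch below is a minimal illustration in Python. The node names come from the slides, but the exact wiring of inputs to steps in `STEPS` is an assumption, chosen so that the results agree with the two lineage subgraphs shown; the function name `upstream` is likewise hypothetical.

```python
from collections import deque

LON_LAT = ["lon_variable", "lat_variable", "lon_bnds_variable", "lat_bnds_variable"]

# step -> (input data, output data). ASSUMED wiring, consistent with the slides.
STEPS = {
    "fetch_SYNMAP_land_cover_map_variable": (["SYNMAP_land_cover_map_data"], LON_LAT),
    "fetch_monthly_mean_air_temperature_data": (["mean_airtemp"], ["Tair_Matrix"]),
    "fetch_monthly_mean_precipitation_data": (["mean_precip"], ["Rain_Matrix"]),
    "initialize_Grass_Matrix": ([], ["Grass_variable"]),
    "examine_pixels_for_grass": (LON_LAT + ["Tair_Matrix", "Rain_Matrix"],
                                 ["C3_Data", "C4_Data"]),
    "generate_netcdf_file_for_C3_fraction": (LON_LAT + ["C3_Data"], ["C3_fraction_data"]),
    "generate_netcdf_file_for_C4_fraction": (LON_LAT + ["C4_Data"], ["C4_fraction_data"]),
    "generate_netcdf_file_for_Grass_fraction": (LON_LAT + ["Grass_variable"],
                                                ["Grass_fraction_data"]),
}
WORKFLOW_INPUTS = {"SYNMAP_land_cover_map_data", "mean_airtemp", "mean_precip"}


def upstream(target):
    """Return every step and data node that `target` transitively depends on."""
    producers = {out: step for step, (_, outs) in STEPS.items() for out in outs}
    seen, queue = set(), deque([target])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in producers:        # data node: go to the step that produced it
            queue.append(producers[node])
        elif node in STEPS:          # step: go to the data it consumed
            queue.extend(STEPS[node][0])
    seen.discard(target)
    return seen


grass_inputs = upstream("Grass_fraction_data") & WORKFLOW_INPUTS
c4_inputs = upstream("C4_fraction_data") & WORKFLOW_INPUTS
print(sorted(grass_inputs), sorted(c4_inputs))
```

Under this wiring the Grass_fraction_data lineage reaches only the land cover input, while the C4_fraction_data lineage reaches all three inputs.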

Page 27: ETC & Authors in the Drivers Seat

What happens after running the script? Hybrid provenance graph!

• 3 inputs spread across 25 (= 2 x 12 + 1) files

• Do all 3 output files depend on all 25 inputs?

[Figure: workflow graph C3_C4_map_present_NA with the files observed at runtime attached to its input and output data nodes:]
– SYNMAP_land_cover_map_data: inputs/land_cover/SYNMAP_NA_QD.nc
– mean_airtemp: inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.{month}.nc for month = 1..12 (12 files)
– mean_precip: inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.{month}.nc for month = 1..12 (12 files)
– C3_fraction_data: outputs/SYNMAP_PRESENTVEG_C3Grass_RelaFrac_NA_v2.0.nc
– C4_fraction_data: outputs/SYNMAP_PRESENTVEG_C4Grass_RelaFrac_NA_v2.0.nc
– Grass_fraction_data: outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc

Page 28: ETC & Authors in the Drivers Seat

What C4_fraction_data depends on (hybrid) …

[Figure: earlier prospective query result (lineage subgraph of C4_fraction_data) next to the same subgraph with runtime files attached:]
– C4_fraction_data: outputs/SYNMAP_PRESENTVEG_C4Grass_RelaFrac_NA_v2.0.nc
– SYNMAP_land_cover_map_data: inputs/land_cover/SYNMAP_NA_QD.nc
– mean_airtemp: inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.{month}.nc for month = 1..12 (12 files)
– mean_precip: inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.{month}.nc for month = 1..12 (12 files)

Page 29: ETC & Authors in the Drivers Seat

What Grass_fraction_data depends on (hybrid) …

[Figure: three graphs side by side]
– Overall workflow (C3_C4_map_present_NA)
– Upstream of Grass_fraction_data (prospective): fetch_SYNMAP_land_cover_map_variable, initialize_Grass_Matrix and generate_netcdf_file_for_Grass_fraction, with input SYNMAP_land_cover_map_data
– Upstream of Grass_fraction_data (hybrid): the same subgraph with files attached, SYNMAP_land_cover_map_data = inputs/land_cover/SYNMAP_NA_QD.nc and Grass_fraction_data = outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc
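The hybrid question "do all output files depend on all input files?" can be sketched by expanding the input URI templates into file names and attaching them to the upstream inputs of each output. This is a minimal illustration, not the YW-recon implementation: the helper names `expand` and `input_files` are hypothetical, and the `UPSTREAM_INPUTS` table is copied by hand from the lineage results on the slides rather than computed.

```python
from itertools import product

AIR = "inputs/narr_air.2m_monthly/air.2m_monthly_{start_year}_{end_year}_mean.{month}.nc"
PRECIP = ("inputs/narr_apcp_rescaled_monthly/"
          "apcp_monthly_{start_year}_{end_year}_mean.{month}.nc")


def expand(template, **choices):
    """Yield one path per combination of the listed variable values."""
    names = list(choices)
    for combo in product(*(choices[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))


# Variable values as they appear in the observed file names.
observed = dict(start_year=[2000], end_year=[2010], month=range(1, 13))

FILES = {
    "SYNMAP_land_cover_map_data": ["inputs/land_cover/SYNMAP_NA_QD.nc"],
    "mean_airtemp": list(expand(AIR, **observed)),
    "mean_precip": list(expand(PRECIP, **observed)),
}

# Workflow inputs upstream of each output, as stated on the lineage slides.
UPSTREAM_INPUTS = {
    "C4_fraction_data": ["SYNMAP_land_cover_map_data", "mean_airtemp", "mean_precip"],
    "Grass_fraction_data": ["SYNMAP_land_cover_map_data"],
}


def input_files(output):
    """Return the input files an output depends on at the file level."""
    return [path for name in UPSTREAM_INPUTS[output] for path in FILES[name]]


total_files = sum(len(paths) for paths in FILES.values())
print(total_files, len(input_files("C4_fraction_data")),
      len(input_files("Grass_fraction_data")))
```

So the answer suggested by the slides is no: C4_fraction_data depends on all 25 input files, Grass_fraction_data on a single one.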

# @BEGIN Gravitational_Wave_Detection
# @IN fn_d @as FN_Detector
# @IN fn_sr @as FN_Sampling_Rate
# @OUT shifted.wav @as shifted_wave
# @OUT whitenbp.wav @as whitened_bandpass
import numpy as np
from scipy import signal
# @BEGIN Amplitude_Spectral_Density
# @IN strain_H1
# @IN strain_L1
# @PARAM fs
# @OUT psd_H1
# @OUT psd_L1
# @OUT GW150914_ASDs.png @URI …
NFFT = 1*fs
fmin, fmax = 10, 2000

YesWorkflow-annotated scripts

File I/O Events

Log files

Logic rules for reconstructing, querying, and visualizing prospective and retrospective provenance together

upstream(strain_L1_whitenbp) [NW-recon]
– WHITENING: strain_L1_whiten = array([8.494, -1.672, ..., 72.156])
– AMPLITUDE_SPECTRAL_DENSITY: PSD_L1, psd_L1 = scipy.interpolate.interpolate.interp1d object at 0x113969418
– LOAD_DATA: strain_L1 = array([-1.779e-18, -1.765e-18, ..., -1.719e-18])
– BANDPASSING: strain_L1_whitenbp = array([8.184, 19.935,..., -0.684])
– FN_Detector: fn_d = L-L1_LOSC_4_V1-1126259446-32.hdf5
– fs: fs = 4096

upstream(strain_L1_whitenbp) [prospective]
– WHITENING: strain_H1_whiten, strain_L1_whiten
– AMPLITUDE_SPECTRAL_DENSITY: PSD_H1, PSD_L1
– LOAD_DATA: strain_H1, strain_L1
– BANDPASSING: strain_L1_whitenbp
– FN_Detector (file:{Detector}_LOSC_4_V1-...)
– FN_Sampling_rate (file:H-H1_LOSC_{Rate}_V1-...)
– fs

upstream(strain_L1_whitenbp) [URI-recon]
– WHITENING: strain_H1_whiten, strain_L1_whiten
– AMPLITUDE_SPECTRAL_DENSITY: PSD_H1, PSD_L1
– LOAD_DATA: strain_H1, strain_L1
– BANDPASSING: strain_L1_whitenbp
– FN_Detector: L-L1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_4_V1-1126259446-32.hdf5
– FN_Sampling_rate: H-H1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_16_V1-1126259446-32.hdf5
– fs

Provenance Recorders

Function call graph and variable dependencies

Raw runtime observations

YesWorkflow toolkit: Extract annotations and model script as a workflow

YesWorkflow toolkit: Reconstruct script run and retrospective provenance

YesWorkflow toolkit: Render workflow model graphically

Prospective Provenance: user-defined workflow models

Hybrid Provenance

General purpose provenance bridges

Provenance queries: Query provenance (esp. graphs) and visualize results

Provenance Exporters: Query and visualize provenance

noWorkflow toolkit: Query and visualize provenance

Retrospective Provenance: Python runtime observables

prospective + code-level runtime observables subgraph

NW_FILTERED_LINEAGE_GRAPH_FOR_STRAIN_L1_WHITENBP

141 fn_d = 'L-L1_LOSC_4_V1-1126259446-32.hdf5'
142 loaddata = (array([ -1.77955839e-18, ... 1, 1, 1], dtype=uint32)})
142 time_L1 = array([ 1.12625945e+09, ... 8e+09, 1.12625948e+09])
142 strain_L1 = array([ -1.77955839e-18, ... 6e-18, -1.71969299e-18])
151 fs = 4096
153 time = array([ 1.12625945e+09, ... 8e+09, 1.12625948e+09])
155 dt = 0.000244140625
266 NFFT = 4096
270 psd = (array([ 2.22851728e-36, ... e+03, 2.04800000e+03]))
270 freqs = array([ 0.00000000e+00, ... 0e+03, 2.04800000e+03])
270 Pxx_L1 = array([ 2.22851728e-36, ... 5e-46, 1.77059496e-46])
274 psd_L1 = <scipy.interpolate.interp ... 1d object at 0x1095b0260>

whiten (function call):
325 strain = array([ -1.77955839e-18, ... 6e-18, -1.71969299e-18])
325 interp_psd = <scipy.interpolate.interp ... 1d object at 0x1095b0260>
325 dt = 0.000244140625
326 len
326 Nt = 131072
327 rfftfreq = array([ 0.00000000e+00, ... 5e+03, 2.04800000e+03])
327 freqs = array([ 0.00000000e+00, ... 5e+03, 2.04800000e+03])
331 rfft = array([ -2.39692348e-13 + ... 54e-19 +0.00000000e+00j])
331 hf = array([ -2.39692348e-13 + ... 54e-19 +0.00000000e+00j])
332 (np.sqrt(interp_psd(freqs) /dt/2.))
332 white_hf = array([ -3.54798023e+03 + ... 58e+02 +0.00000000e+00j])
333 irfft = array([ 8.49413154, -1. ... .39942945, 72.15659253])
333 white_ht = array([ 8.49413154, -1. ... .39942945, 72.15659253])
334 return = array([ 8.49413154, -1. ... .39942945, 72.15659253])

338 strain_L1_whiten = array([ 8.49413154, -1. ... .39942945, 72.15659253])
362 butter = (array([ 0.0012848 , 0. ... 9166733 , 0.32217438]))
362 bb = array([ 0.0012848 , 0. ... 0. , 0.0012848 ])
362 ab = array([ 1. , -6. ... .9166733 , 0.32217438])
364 filtfilt = array([ 8.18464884, 19. ... .18198039, -0.68432653])
364 strain_L1_whitenbp = array([ 8.18464884, 19. ... .18198039, -0.68432653])

[Figure: full (unfiltered) noWorkflow dependency graph of the LIGO GW150914 script, with one node per function call or variable, labelled by script line. Function clusters: loaddata, whiten, get_filter_coefs, iir_bandstops, filter_data, reqshift, write_wavfile, dq_channel_to_seglist. Output files: GW150914_strain.png, GW150914_ASDs.png, GW150914_strain_whitened.png, GW150914_H1_spectrogram.png, GW150914_L1_spectrogram.png, GW150914_H1_spectrogram_whitened.png, GW150914_L1_spectrogram_whitened.png, GW150914_filter.png, GW150914_H1_strain_unfiltered.png, GW150914_H1_strain_filtered.png, GW150914_H1_whitenbp.wav, GW150914_L1_whitenbp.wav, GW150914_NR_whitenbp.wav, GW150914_H1_shifted.wav, GW150914_L1_shifted.wav, GW150914_NR_shifted.wav, GW150914_H1_ASD_16384.png, GW150914_H1_ASD_16384_zoom.png, GW150914_H1_ASD_4096_zoom.png.]

• Workflow model (graph): Facts (Prolog)
• Reconstructed provenance: Facts (Prolog)
• Run observations: Facts (Prolog)
• prospective + file-level runtime observables


Page 30: ETC & Authors in the Drivers Seat

LIGO example: What strain_L1_whitenbp depends on …

Panels: overall workflow; upstream of strain_L1_whitenbp (prospective).

[Figure: overall workflow GRAVITATIONAL_WAVE_DETECTION. Steps with their descriptions, and the data nodes drawn next to each step:]
• LOAD_DATA: Load hdf5 data. Data: strain_H1, strain_L1, strain_16, strain_4
• AMPLITUDE_SPECTRAL_DENSITY: Amplitude spectral density. Data: ASDs (file:GW150914_ASDs.png), PSD_H1, PSD_L1
• WHITENING: suppress low frequencies noise. Data: strain_H1_whiten, strain_L1_whiten
• BANDPASSING: remove high frequency noise. Data: strain_H1_whitenbp, strain_L1_whitenbp
• STRAIN_WAVEFORM_FOR_WHITENED_DATA: plot whitened data. Data: WHITENED_strain_data (file:GW150914_strain_whitened.png)
• SPECTROGRAMS_FOR_STRAIN_DATA: plot spectrogram for strain data. Data: spectrogram (file:GW150914_{detector}_spectrogram.png)
• SPECTROGRAMS_FOR_WHITEND_DATA: plot spectrogram for whitened data. Data: spectrogram_whitened (file:GW150914_{detector}_spectrogram_whitened.png)
• FILTER_COEFS: Filter signal in time domain (bandpassing). Data: COEFFICIENTS
• FILTER_DATA: filter data. Data: filtered_white_noise_data (file:GW150914_filter.png), strain_H1_filt, strain_L1_filt
• STRAIN_WAVEFORM_FOR_FILTERED_DATA: plot the filtered data. Data: H1_strain_filtered (file:GW150914_H1_strain_filtered.png), H1_strain_unfiltered (file:GW150914_H1_strain_unfiltered.png)
• WAVE_FILE_GENERATOR_FOR_WHITENED_DATA: Make sound files for whitened data. Data: whitened_bandpass_wavefile (file:GW150914_{detector}_whitenbp.wav)
• SHIFT_FREQUENCY_BANDPASSED: shift frequency of bandpassed signal. Data: strain_H1_shifted, strain_L1_shifted
• WAVE_FILE_GENERATOR_FOR_SHIFTED_DATA: Make sound files for shifted data. Data: shifted_wavefile (file:GW150914_{detector}_shifted.wav)
• DOWNSAMPLING: Downsampling from 16384 Hz to 4096 Hz. Data: H1_ASD_SamplingRate (file:GW150914_H1_ASD_{SamplingRate}.png)
• Workflow inputs: FN_Detector (file:{Detector}_LOSC_4_V1-1126259446-32.hdf5), FN_Sampling_rate (file:H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5), fs
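The file: templates in this workflow model carry placeholders such as {Detector} and {DownSampling}. As a rough illustration of how such a template could be matched against actual file names to recover the placeholder values, here is a minimal Python sketch. It is not the YesWorkflow implementation: the helpers template_to_regex and bind_template are hypothetical names introduced for this example, and the file names are the ones shown for the LIGO example.

```python
import re


def template_to_regex(template):
    """Compile a template such as '{Detector}_LOSC_4.hdf5' into a regex.

    Each {name} placeholder becomes a named group; all other text is
    matched literally. (Hypothetical helper, not part of YesWorkflow.)
    """
    # re.split with one capture group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)
        else:
            pattern += "(?P<%s>[^/]+?)" % part
    return re.compile("^" + pattern + "$")


def bind_template(template, filenames):
    """Map each matching file name to the placeholder values it binds."""
    regex = template_to_regex(template)
    bindings = {}
    for name in filenames:
        match = regex.match(name)
        if match:
            bindings[name] = match.groupdict()
    return bindings


files = [
    "L-L1_LOSC_4_V1-1126259446-32.hdf5",
    "H-H1_LOSC_4_V1-1126259446-32.hdf5",
    "H-H1_LOSC_16_V1-1126259446-32.hdf5",
]

detector_files = bind_template("{Detector}_LOSC_4_V1-1126259446-32.hdf5", files)
rate_files = bind_template("H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5", files)

for name, values in detector_files.items():
    print(name, values)
for name, values in rate_files.items():
    print(name, values)
```

Each template matches two of the three files, which is how one declared input can end up spread across several concrete files.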

upstream(strain_L1_whitenbp) [prospective]
• Steps: LOAD_DATA, AMPLITUDE_SPECTRAL_DENSITY, WHITENING, BANDPASSING
• Data: strain_H1, strain_L1, PSD_H1, PSD_L1, strain_H1_whiten, strain_L1_whiten, strain_L1_whitenbp
• Inputs: FN_Detector (file:{Detector}_LOSC_4_V1-...), FN_Sampling_rate (file:H-H1_LOSC_{Rate}_V1-...), fs

upstream(strain_L1_whitenbp) [URI-recon]
• Steps: LOAD_DATA, AMPLITUDE_SPECTRAL_DENSITY, WHITENING, BANDPASSING
• Data: strain_H1, strain_L1, PSD_H1, PSD_L1, strain_H1_whiten, strain_L1_whiten, strain_L1_whitenbp
• FN_Detector: L-L1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_4_V1-1126259446-32.hdf5
• FN_Sampling_rate: H-H1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_16_V1-1126259446-32.hdf5
• fs

upstream(strain_L1_whitenbp) [NW-recon]
• LOAD_DATA: strain_L1 = array([-1.779e-18, -1.765e-18, ..., -1.719e-18])
• AMPLITUDE_SPECTRAL_DENSITY: PSD_L1, psd_L1 = scipy.interpolate.interpolate.interp1d object at 0x113969418
• WHITENING: strain_L1_whiten = array([8.494, -1.672, ..., 72.156])
• BANDPASSING: strain_L1_whitenbp = array([8.184, 19.935,..., -0.684])
• FN_Detector: fn_d = L-L1_LOSC_4_V1-1126259446-32.hdf5
• fs: fs = 4096

Upstream of strain_L1_whitenbp (hybrid YW-NW at the code level)

Upstream of strain_L1_whitenbp (hybrid YW-NW at the file level)

3 inputs spread across 5 (= 2x2+1) files

Does the intermediate data strain_L1_whitenbp depend on all 5 inputs?

• The intermediate data strain_L1_whitenbp depends on only 2 out of the 5 inputs!
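The question above is a reachability query over the dependency graph. The following Python sketch shows the idea on a hand-written graph. The edges are an assumption read off the step and data names in the upstream panels, not an export from YesWorkflow or noWorkflow, and the bracketed input labels such as FN_Detector[L1] are hypothetical names used only to tell the five inputs apart.

```python
def upstream(node, depends_on):
    """Return every node reachable from `node` along dependency edges."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in depends_on.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Assumed data-level dependencies (illustrative only).
depends_on = {
    "strain_L1_whitenbp": ["strain_L1_whiten", "fs"],
    "strain_L1_whiten": ["strain_L1", "PSD_L1"],
    "PSD_L1": ["strain_L1", "fs"],
    "strain_L1": ["FN_Detector[L1]"],
    "strain_H1_whitenbp": ["strain_H1_whiten", "fs"],
    "strain_H1_whiten": ["strain_H1", "PSD_H1"],
    "PSD_H1": ["strain_H1", "fs"],
    "strain_H1": ["FN_Detector[H1]"],
    "strain_4": ["FN_Sampling_rate[4]"],
    "strain_16": ["FN_Sampling_rate[16]"],
}

# The five inputs: two files per file-name input, plus fs.
inputs = {
    "FN_Detector[L1]",
    "FN_Detector[H1]",
    "FN_Sampling_rate[4]",
    "FN_Sampling_rate[16]",
    "fs",
}

lineage = upstream("strain_L1_whitenbp", depends_on)
used_inputs = sorted(lineage & inputs)
print(used_inputs)
```

Under these assumed edges only the L1 detector file and fs are reachable from strain_L1_whitenbp, matching the 2-out-of-5 observation.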


Page 31: ETC & Authors in the Drivers Seat

DwCA Taxon Lookup Workflow

• Declare inputs, outputs, and steps of a script (or workflow) with YW annotations to ...
– communicate provenance graphically (via Graphviz)
– combine different forms of provenance
– query provenance
• Simple YW annotations in comments:
– @BEGIN Step, @END Step
– @IN Data, @OUT Data
– @URI Template, @LOG Pattern
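As a rough illustration of these annotations, the sketch below embeds YW-style comments in a tiny Python script and then pulls them back out with a regular expression. The step and data names (clean_names, raw_names, cleaned_names) and the extract_annotations helper are made up for this example. The real YesWorkflow toolkit does its own parsing and supports more keywords than the four handled here.

```python
import re

# A tiny script whose comments declare one step with one input and one output.
SCRIPT = '''
# @BEGIN clean_names
# @IN raw_names
# @OUT cleaned_names
def clean_names(raw_names):
    return [name.strip().capitalize() for name in raw_names]
# @END clean_names
'''

# Matches comment lines such as "# @IN raw_names"; this sketch accepts
# the keyword in either case.
ANNOTATION = re.compile(r"#\s*@(BEGIN|END|IN|OUT)\s+(\w+)", re.IGNORECASE)


def extract_annotations(source):
    """Return (keyword, name) pairs for each YW-style comment, in order."""
    return [(keyword.upper(), name) for keyword, name in ANNOTATION.findall(source)]


annotations = extract_annotations(SCRIPT)
for keyword, name in annotations:
    print(keyword, name)

# The annotations live in comments, so the script itself runs unchanged.
namespace = {}
exec(SCRIPT, namespace)
cleaned = namespace["clean_names"](["  apis mellifera "])
print(cleaned)
```

Because the declarations are ordinary comments, adding them does not change what the script computes; a separate tool reads them to build the workflow model.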


Page 32: ETC & Authors in the Drivers Seat

Taxon Lookup Workflow: Data View and Process View


Page 33: ETC & Authors in the Drivers Seat

The story of two individual records


• One took the GBIF route, while …

• … the other went all WORMS!

Page 34: ETC & Authors in the Drivers Seat

The aggregate story ...


• How many records were observed as inputs or outputs of workflow steps?

• Were there any NULL values? How many?
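Questions like these amount to simple aggregate queries over record-level provenance. Here is a minimal Python sketch, assuming a hypothetical log in which each observation records the step, the role (in or out), a record id, and the observed value, with None standing in for NULL. The step names and values are placeholders and do not come from the actual workflow run.

```python
from collections import Counter

# Hypothetical record-level observations: (step, role, record_id, value).
observations = [
    ("read_input", "out", 1, "name A"),
    ("read_input", "out", 2, None),
    ("lookup_name", "in", 1, "name A"),
    ("lookup_name", "in", 2, None),
    ("lookup_name", "out", 1, "accepted name A"),
]

# How many records were observed as inputs or outputs of each step?
records_per_step = Counter((step, role) for step, role, _, _ in observations)

# How many distinct records appear anywhere in the log?
distinct_records = len({record_id for _, _, record_id, _ in observations})

# Were there any NULL values, and how many?
null_count = sum(1 for *_, value in observations if value is None)

for (step, role), count in sorted(records_per_step.items()):
    print(step, role, count)
print("distinct records:", distinct_records)
print("NULL values:", null_count)
```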

Page 35: ETC & Authors in the Drivers Seat

Summary I

• YW annotations can be added easily to your scripts to reap workflow benefits
– Documentation of what's important
– Visualization of dependencies
– Querying provenance (prospective, retrospective, and hybrid)

=> make provenance actionable => provenance for self!

=> github.com/yesworkflow-org/yw
=> try.yesworkflow.org


Page 36: ETC & Authors in the Drivers Seat

João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher

Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow

Page 37: ETC & Authors in the Drivers Seat

[Figure: noWorkflow provenance graph for one run of simulate_data_collection. Nodes are function calls and variable assignments labeled with their script line number (for example 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow), repeated once per sample and image. File nodes include cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, the raw images run/raw/q55/<sample>/e<energy>/image_<nnn>.raw and the corrected images run/data/<sample>/<sample>_<energy>eV_<nnn>.img for samples DRT240 and DRT322.]

noWorkflow: not only Workflow!

• Scripts have provenance, too!

• Transparently capture some/all provenance from Python script runs.

• Use filter queries to “zoom” into relevant parts..
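To make "transparently capture provenance from a script run" concrete, here is a minimal sketch that records Python-level calls and returns with the standard library hook `sys.setprofile`. It is only an illustration of the idea, not how noWorkflow itself is implemented; `trace_calls`, `collect` and the toy `transform_image` are hypothetical names chosen to echo the slide.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and record (event, function name, line number) tuples."""
    records = []

    def profiler(frame, event, arg):
        # Only Python-level events are kept; C-level events
        # ('c_call', 'c_return', ...) are ignored in this sketch.
        if event in ("call", "return"):
            records.append((event, frame.f_code.co_name, frame.f_lineno))

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        # Always remove the hook, even if func raises.
        sys.setprofile(None)
    return result, records


def transform_image(raw, calibration):
    # Toy stand-in for an image correction step.
    return raw - calibration


def collect(raws, calibration):
    corrected = []
    for raw in raws:
        corrected.append(transform_image(raw, calibration))
    return corrected


result, records = trace_calls(collect, [5, 7], 1)
for event, name, lineno in records:
    print(event, name, lineno)
```

The script under observation (`collect`) is unchanged; all recording happens in the wrapper, which is the sense in which capture is "transparent". A filter query then amounts to selecting a subset of `records`.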

37

Page 38: ETC & Authors in the Drivers Seat

[Figure: noWorkflow lineage graph for run/data/DRT240/DRT240_11000eV_002.img. Each node is a script line number with the call made or value assigned there:]

simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
Files: calibration.img, run/data/DRT240/DRT240_11000eV_002.img

$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"

$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
	now helper df_style.py
	now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv

..auto-“make” this!

noWorkflow lineage of an image file

Provenance information about Python function calls, variable assignments, etc.
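A lineage query of this kind can be pictured as a backward traversal over "was derived from" edges. The sketch below assumes a hand-written, heavily simplified edge table (`DERIVED_FROM` is hypothetical and much smaller than a real noWorkflow trace) and walks it breadth-first from one output file.

```python
from collections import deque

# Hypothetical, simplified dependency edges: node -> things it was derived from.
DERIVED_FROM = {
    "run/data/DRT240/DRT240_11000eV_002.img": ["transform_image"],
    "transform_image": [
        "run/raw/q55/DRT240/e11000/image_002.raw",
        "calibration.img",
    ],
    "run/raw/q55/DRT240/e11000/image_002.raw": ["collect_next_image"],
    "collect_next_image": ["calculate_strategy"],
    "calculate_strategy": ["cassette_q55_spreadsheet.csv", "sample_score_cutoff"],
    "run/rejected_samples.txt": ["calculate_strategy"],
}


def lineage(target, derived_from):
    """Return every upstream node of target, in breadth-first order."""
    upstream = []
    visited = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for parent in derived_from.get(node, []):
            if parent not in visited:
                visited.add(parent)
                upstream.append(parent)
                queue.append(parent)
    return upstream


TARGET = "run/data/DRT240/DRT240_11000eV_002.img"
upstream = lineage(TARGET, DERIVED_FROM)
for node in upstream:
    print(node)
```

Anything not reachable backward from the target (here the rejection log) drops out, which is the "zoom" effect of filtering the full trace down to one product.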

38

Page 39: ETC & Authors in the Drivers Seat

[Figure: YesWorkflow conceptual model of simulate_data_collection. Blocks: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data: run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log. Inputs: sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id. A lineage query reduces this model to load_screening_results, calculate_strategy, collect_data_set and transform_images, ending in corrected_image.]

[Figure: the noWorkflow trace graph and the image-file lineage shown on the two previous slides, likewise connected by a lineage query.]

YesWorkflow: Conceptual workflow model

noWorkflow: Python trace model

But how do we bridge this gap???

Would like to use YW model to query NW data!
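For readers who have not seen the conceptual side, YesWorkflow models are declared with comment annotations such as `@begin`, `@in`, `@out` and `@end` inside the script. The sketch below uses a trimmed, hypothetical annotation set based on block and data names from the model above, and a toy regex parser; it is not the YesWorkflow tool itself and covers only these four tags.

```python
import re

# Trimmed, hypothetical annotations; the real script declares more blocks and ports.
SCRIPT = """
# @begin simulate_data_collection
# @in sample_spreadsheet
# @in calibration_image
# @out corrected_image

# @begin load_screening_results
# @in sample_spreadsheet
# @out sample_name
# @out sample_quality
# @end load_screening_results

# @begin transform_images
# @in raw_image
# @in calibration_image
# @out corrected_image
# @end transform_images

# @end simulate_data_collection
"""

TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\w+)")


def extract_blocks(source):
    """Map each block name to its declared input and output ports."""
    blocks = {}
    stack = []  # currently open blocks, innermost last
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.groups()
        if tag == "begin":
            blocks[name] = {"in": [], "out": []}
            stack.append(name)
        elif tag == "end":
            stack.pop()
        elif stack:
            # @in / @out attach to the innermost open block.
            blocks[stack[-1]][tag].append(name)
    return blocks


blocks = extract_blocks(SCRIPT)
for name, ports in blocks.items():
    print(name, ports)
```

The resulting block and port names are the vocabulary in which one would like to phrase lineage questions, while the recorded values come from the run-time trace.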

39

Page 40: ETC & Authors in the Drivers Seat

Habemus Pons! We’ve got the Bridge! The bridge is the journey.. (The journey is the destination)

Lineage of image file in terms of YW model, with details from NW provenance

40

Page 41: ETC & Authors in the Drivers Seat

DataONE: Search and Provenance Display

41

Page 42: ETC & Authors in the Drivers Seat

DataONE: Search and Provenance Display

42

Page 43: ETC & Authors in the Drivers Seat

Adding YesWorkflow to DataONE

Yaxing’s script with inputs & output products

Christopher’s YesWorkflow model

Christopher using Yaxing’s outputs as inputs for his script

Christopher’s results can be traced back all the way to Yaxing’s input

43

Page 44: ETC & Authors in the Drivers Seat

Demo Time

44

(Disclaimer)
https://github.com/idaks/dataone-ahm-2016-poster
https://github.com/idaks/wt-prov-summer-2017
https://github.com/yesworkflow-org/yw-idcc-17

Page 45: ETC & Authors in the Drivers Seat

Whole Tale: The next step in the evolution of the scholarly article: The “Living” Paper

• 1st Generation:
– narrative (prose)

• 2nd Generation: plus …
– name .. identify .. include (access to) data

• 3rd Generation: plus …
– name .. reference .. include code (software) ..
– and provenance … and exec environment (containers)

45

Whole Tale

Whole Tale Dashboard

Page 46: ETC & Authors in the Drivers Seat

Whole Tale: What’s in a name?

(1) Whole Tale ⇔ Whole Story:
◦ Support (computational / data) scientists
◦ … along the complete research lifecycle
◦ ... from experiment to (new kind of) publication
◦ ... and back!

(2) Whole Tale ⇔ for the Long Tail of Science
– Easy sharing of your computational narratives, data, and exec-env since 2017!
– Power applications for everyone!

46

Page 47: ETC & Authors in the Drivers Seat

Whole Tale Vision

• Can't reproduce result because:

  • Don't know how to run analysis

  • Can't get the software running

  • Can't pay for the computer or compute power the result was computed on

Source: Bryce Mecum, NCEAS (WT team)

47

Page 48: ETC & Authors in the Drivers Seat

Whole Tale Vision

Addressing reproducibility

48

Data

Code

Execution Environment

Article

Source: Bryce Mecum, NCEAS (WT team)

Page 49: ETC & Authors in the Drivers Seat

Whole Tale Vision

• Living publication (data + code + environment)

• Increase odds of reproducibility

• Encourage investigation of results by making it easy to recreate the environment the result was created in

Article

Source: Bryce Mecum, NCEAS (WT team)

Page 50: ETC & Authors in the Drivers Seat

Whole Tale Vision

Addressing reproducibility

Article

+

Tale

Source: Bryce Mecum, NCEAS (WT team)

Page 51: ETC & Authors in the Drivers Seat

Whole Tale Vision

Tale

Data

Code

D1PROV

Source: Bryce Mecum, NCEAS (WT team)

Page 52: ETC & Authors in the Drivers Seat

Whole Tale Team

NSF-DIBBS award: The Whole Tale: Merging Science and Cyberinfrastructure Pathways ($5M total, over 5 years, 5 teams)

WT Team:

• Illinois (NCSA & iSchool)
  • Bertram Ludäscher (PI), Kandace Turner (PM), Victoria Stodden (coPI), Matt Turk (coPI)
  • Kacper Kowalik (sw-architect), Craig Willis (sw-dev)

• U of Chicago
  • Kyle Chard (coPI), Mihael Hategan (sw-dev)

• UT Austin
  • Niall Gaffney (coPI), Siva Kulasekaran (sw-dev)

• U Notre Dame
  • Jarek Nabrzyski (coPI), Ian Taylor (sw-dev), Adam Brinckman (sw-dev)

• UCSB
  • Matt Jones (coPI), Bryce Mecum (sw-dev)

Page 53: ETC & Authors in the Drivers Seat

DEMO!

53

Page 54: ETC & Authors in the Drivers Seat

Last but not least: Non-unitary syntheses of systematic knowledge

Please

@taxonbytes

Nico Franz

School of Life Sciences, Arizona State University

CIRSS Seminar – Center for Informatics Research in Science and Scholarship

February 17, 2017 – iSchool, University of Illinois Urbana-Champaign

@ http://www.slideshare.net/taxonbytes/franz-2017-uiuc-cirss-non-unitary-syntheses-of-systematic-knowledge 54

Page 55: ETC & Authors in the Drivers Seat

55

Page 56: ETC & Authors in the Drivers Seat

http://taxonbytes.org/wp-content/uploads/2014/10/Peet-BIGCB-2014-Changing-Perspectives-on-Plant-Distributions.pdf

56

Page 57: ETC & Authors in the Drivers Seat

Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)

"Taxonomic concept labels" identify input concept regions

RCC–5 articulations provided for each species-level concept

• Input visualization: MSW3 (2005) versus MSW2 (1993)

Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023

57

Page 58: ETC & Authors in the Drivers Seat

• Alignment visualization: "grey means taxonomically congruent"

Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)

58

Page 59: ETC & Authors in the Drivers Seat

One name & congruent region

Many names & congruent region

One name & non-congruent regions

Many names & non-congruent regions

New names & exclusive regions

• Application of coverage constraint: parent-to-parent articulations (><) are fully defined by alignment signal propagated from their respective children.

⇒ Sensible when complete sampling of children is intended.

Use case 1.a. Aligning Microcebus + Mirza sec. MSW3 (2005)

• Alignment visualization: "grey means taxonomically congruent"

59

Page 60: ETC & Authors in the Drivers Seat

1 in 3 names is unreliable across MSW2/MSW3 classifications

Source: Franz et al. 2016. Two influential primate classifications logically aligned. doi:10.1093/sysbio/syw023

60

Page 61: ETC & Authors in the Drivers Seat

The 'consensus' The 'bible'

The (formerly) federal

'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

Expert views are in conflict

"Just bad"

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

61

Page 62: ETC & Authors in the Drivers Seat

The 'consensus' The 'bible'

The (formerly) federal

'standard'

The 'best', latest regional flora

Impact: Name-based aggregation has created a novel synthesis that nobody believes in

"Controlling the taxonomic variable"

"Just bad"

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

62

Page 63: ETC & Authors in the Drivers Seat

The 'consensus' The 'bible'

The (formerly) federal

'standard'

The 'best', latest regional flora

"Controlling the taxonomic variable"

"Just bad"

Expert views are reconciled

Solution: Instead of aggregating an artificial 'consensus', build translation services

Source: Franz et al. 2016. Controlling the taxonomic variable: […]. RIO Journal. doi:10.3897/rio.2.e10610

63

Page 64: ETC & Authors in the Drivers Seat

Leaving taxon and species headaches …

• To illustrate Euler think of a simpler use case:
• Agreeing to disagree!
• … when there are multiple, legitimate perspectives

• Sorting things out!
– Euler as a taxon concept (& name) “microscope” ...
– .. or scalpel
– .. or ...?

64

Page 65: ETC & Authors in the Drivers Seat

Yi-Yun Cheng¹, Nico Franz², Jodi Schneider¹, Shizhuo Yu³, Thomas Rodenhausen⁴, Bertram Ludäscher¹

¹ School of Information Sciences, University of Illinois at Urbana-Champaign; ² School of Life Sciences, Arizona State University; ³ Department of Computer Science, University of California at Davis; ⁴ School of Information, University of Arizona

Agreeing to Disagree: Reconciling Conflicting Taxonomic Views using a Logic-based Approach

Acknowledgments

Support of the authors’ research through the National Science Foundation is kindly acknowledged (DEB-1155984, DBI-1342595, and DBI-1643002). The authors thank Professor Kathryn La Barre for her comments and suggestions. We would also like to thank Dr. Laetitia Navarro and Jeff Terstriep for help with creating map overlays in QGIS.

CONCLUSION

• Our logic-based taxonomy alignment approach can be used to solve crosswalking issues. We will be able to mitigate the membership condition problems that occur in equivalent crosswalking.

• RCC-5 approach preserves the original taxonomies while providing an alignment view. We can solve data integration problems that happen in the more coarse-grained relative crosswalking, which otherwise is subjected to information loss.

• Our study also underscores the benefits of designing different alignment workflows (Bottom-up vs. Top-down) to match the needs of specific taxonomy alignment problems.

Bottom-up approach: seems to work well whenever we have non-overlapping relationships at the leaf-level (lowest-level) articulations, and we are not sure how the higher-level concepts should be aligned.

Top-down approach: seems favorable when there is an expectation of certain higher-level articulations in conjunction with under-specified, complex, and often overlapping leaf-level relations.

RELATED WORK

• Taxonomy Alignment Problems (TAP): Taxonomies T1, T2 are inter-linked via a set of input articulations A, defined as RCC-5 relations, to yield a “merged” taxonomy T3.

• Euler/X

Articulations – a constraint or rule that defines a relationship (a set constraint) between two concepts from different taxonomies.

Region Connection Calculus (RCC-5)

Possible Worlds – When encoding and solving TAPs via ASP, the different answer sets represent alternative taxonomy merge solutions or possible worlds (PWs).

INTRODUCTION

Tina: Hey Amy, can you recommend a signature dish from where you live?

Amy: Oh, definitely the half-smokes from the Northeast! They are these tasty half-pork and half-beef sausages.

Tina: What a coincidence! We have half-smokes in the South, too! Where do you live in the Northeast? New York? Boston?

Amy: Wrong guesses! Where do you live in the South?

Tina and Amy together: Washington, D.C.

[The two of them look at each other, confused.]

“In the face of incompatible information or data structures among users or among those specifying the system, attempts to create unitary knowledge categories are futile. Rather, parallel or multiple representational forms are required…” (Bowker & Star, 2000).

CASE 1 RESULTS: CEN vs. NDC

• State-level alignments are all congruent (Bottom-up)
• Inferred new articulations for regional-level alignments

CASE 2 RESULTS: CEN vs. TZ

Figure 3. (Left) CEN-NDC taxonomy alignment problem with 49 input articulations between TCEN and TNDC

Figure 4. (Right) The unique possible world (PW) T3 reconciling TCEN and TNDC via inferred relationships

Figure 1. National Diversity Council map (NDC) vs. Census Bureau map (CEN)

• GitHub link: https://github.com/EulerProject/ASIST17

• Email:[email protected]

[Map labels. NDC: West, Southwest, Southeast, Midwest, Northeast. CEN: West, South, Midwest, Northeast. TZ: Pacific, Mountain, Central, Eastern.]

RESEARCH DESIGN

Step 1. Supply input taxonomies T1 and T2
Step 2. Formulate RCC-5 articulations between T1 and T2
Step 3. Iteratively edit articulations in Euler/X

RCC-5 relations between two regions X and Y:

Congruence: X == Y
Inclusion: X > Y
Inverse Inclusion: X < Y
Overlap: X >< Y
Disjointness: X ! Y

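[Figure 2 flow: input taxonomies T1 and T2 with articulations A are passed to Euler/X, which reports N possible worlds. N=1 yields the merged taxonomy T3; N=0 (inconsistent) or N>1 (ambiguous) leads back to Add/Edit Articulations A.]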

[Figures 3 to 6 and 10: alignment graphs. Node labels are CEN, NDC and TZ concepts (regions and states, e.g. CEN.South, NDC.Southeast, TZ.Central), state-level congruences of the form CEN.IL == NDC.IL, merged nodes such as CEN.USA NDC.USA, and combined concepts such as CEN.South*TZ.Central and TZ.Eastern\CEN.South, with regions labeled R1 to R9. Legends, in the order extracted:
Nodes: CEN 4, newComb 18, comb 1, TZ 4. Edges: input 6, inferred 37.
Nodes: CEN 54, NDC 55. Edges: isa_CEN 53, isa_NDC 54, Art. 49.
Nodes: CEN 3, NDC 4, comb 51. Edges: input 61, inferred 3, overlaps inferred 3.
Nodes: CEN 5, TZ 5. Edges: isa_CEN 4, isa_TZ 4, Art. 12.
Nodes: CEN 4, comb 1, TZ 4. Edges: input 7, overlaps input 6, overlaps inferred 1, inferred 1.]

Figure 2. The process of aligning taxonomies T1 and T2 with Euler/X

Figure 5. Top-down input alignments between TCEN and TTZ

Figure 6. The unique PW for the TCEN with TTZ alignment

Figure 10. Combined concepts solution for TCEN and TTZ

taxonomy CEN Census_Regions
(USA Northeast Midwest South West)
(Northeast CT MA ME NH NJ NY PA RI VT)
(Midwest IL IN IA KS MI MN MO NE ND OH SD WI)
(South AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV)
(West AZ CA CO ID MT NV NM OR UT WA WY)

taxonomy NDC National_Diversity_Council
(USA Midwest Northeast Southeast Southwest West)
(Northeast CT DC DE MD MA ME NH NJ NY PA RI VT)
(Midwest IA IL IN KS MI MN MO ND NE OH SD WI)
(Southeast AL AR FL GA KY LA MS NC SC TN VA WV)
(Southwest AZ NM OK TX)
(West CA CO ID MT NV OR WA WY UT)

articulations CEN NDC
[CEN.AL equals NDC.AL]
[CEN.AR equals NDC.AR]
[CEN.AZ equals NDC.AZ]
[CEN.CA equals NDC.CA]
[CEN.CO equals NDC.CO]
[CEN.CT equals NDC.CT]
[CEN.DC equals NDC.DC]
[CEN.DE equals NDC.DE]
[CEN.FL equals NDC.FL]
[CEN.GA equals NDC.GA]
[CEN.IA equals NDC.IA]
[CEN.ID equals NDC.ID]
[CEN.IL equals NDC.IL]
[CEN.IN equals NDC.IN]
[CEN.KS equals NDC.KS]
[CEN.KY equals NDC.KY]
[CEN.LA equals NDC.LA]
[CEN.MA equals NDC.MA]
[CEN.MD equals NDC.MD]
[CEN.ME equals NDC.ME]
[CEN.MI equals NDC.MI]
[CEN.MN equals NDC.MN]
...

Quick Scan!

taxonomy CEN Census_Regions
(USA Midwest South West Northeast)

taxonomy TZ Time_Zone
(USA Pacific Mountain Central Eastern)

articulations CEN TZ
[CEN.Midwest disjoint TZ.Pacific]
[CEN.Midwest overlaps TZ.Eastern]
[CEN.Midwest overlaps TZ.Mountain]
[CEN.Northeast is_included_in TZ.Eastern]
[CEN.South disjoint TZ.Pacific]
[CEN.South overlaps TZ.Central]
[CEN.South overlaps TZ.Eastern]
[CEN.South overlaps TZ.Mountain]
[CEN.USA equals TZ.USA]
[CEN.West disjoint TZ.Central]
[CEN.West disjoint TZ.Eastern]
[CEN.West overlaps TZ.Mountain]
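As a sanity check on the bottom-up case, the region-level RCC-5 relations between CEN and NDC can be computed directly from the state lists in the two taxonomies, treating each region as the set of its states. This is a minimal sketch under that extensional assumption (non-empty regions, identical state concepts); it is plain set comparison, not the logic-based ASP reasoning that Euler/X performs, and the names `rcc5` and `align` are hypothetical.

```python
# State lists copied from the CEN and NDC taxonomy definitions.
CEN = {
    "Northeast": "CT MA ME NH NJ NY PA RI VT",
    "Midwest": "IL IN IA KS MI MN MO NE ND OH SD WI",
    "South": "AL AR DE DC FL GA KY LA MD MS NC OK SC TN TX VA WV",
    "West": "AZ CA CO ID MT NV NM OR UT WA WY",
}
NDC = {
    "Northeast": "CT DC DE MD MA ME NH NJ NY PA RI VT",
    "Midwest": "IA IL IN KS MI MN MO ND NE OH SD WI",
    "Southeast": "AL AR FL GA KY LA MS NC SC TN VA WV",
    "Southwest": "AZ NM OK TX",
    "West": "CA CO ID MT NV OR WA WY UT",
}


def rcc5(x, y):
    """Return the RCC-5 symbol relating two non-empty collections of members."""
    x, y = set(x), set(y)
    if x == y:
        return "=="  # congruence
    if x > y:
        return ">"   # inclusion: x properly contains y
    if x < y:
        return "<"   # inverse inclusion: x is properly contained in y
    if x & y:
        return "><"  # overlap: shared members, neither contains the other
    return "!"       # disjointness


def align(t1, t2):
    """Compute the relation for every pair of regions across two taxonomies."""
    return {
        (a, b): rcc5(states_a.split(), states_b.split())
        for a, states_a in t1.items()
        for b, states_b in t2.items()
    }


relations = align(CEN, NDC)
for (cen_region, ndc_region), symbol in relations.items():
    print(f"CEN.{cen_region} {symbol} NDC.{ndc_region}")
```

Making every pairwise relation explicit in this way mirrors the requirement on the merged taxonomy; the value of a reasoner shows up when leaf-level membership is not fully known and relations have to be inferred rather than computed.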

Page 66: ETC & Authors in the Drivers Seat

Two Taxonomies: NDC vs CEN

“…in the face of incompatible information or data structures among users or among those specifying the system, attempts to create unitary knowledge categories are futile. Rather, parallel or multiple representational forms are required” [Bowker & Star, 2000, p.159]

[Maps. National Diversity Council map (NDC): West, Southwest, Southeast, Midwest, Northeast. US Census Bureau map (CEN): West, South, Midwest, Northeast.]

Source: Yi-Yun (Jessica) Cheng (PhD student, iSchool @ Illinois)

Page 67: ETC & Authors in the Drivers Seat

The taxonomies

11/01/17 Cheng

• The Census Regions Map (CEN) consists of four regions: West, Midwest, Northeast, and South, i.e., the contiguous 48 states and Washington D.C.

[Map: CEN regions West, South, Midwest, Northeast]

Page 68: ETC & Authors in the Drivers Seat

The taxonomies

• The National Diversity Council Map (NDC) consists of five regions: West, Southwest, Midwest, Northeast, Southeast, the 48 states and Washington D.C.

[Map: NDC (with states), regions West, Southwest, Southeast, Midwest, Northeast]

• NDC splits South into SW and SE

• Do NDC and CEN agree on “West”? “Midwest”? …

• How can we sort this out?

Page 69: ETC & Authors in the Drivers Seat

Sorting things out…

11/01/17 Cheng

[Figure: input taxonomies T1 = CEN (CEN.USA with CEN.Midwest, CEN.South, CEN.West, CEN.Northeast) and T2 = NDC (NDC.USA with NDC.Northeast, NDC.Southeast, NDC.Midwest, NDC.Southwest, NDC.West), shown first side by side and then linked by articulations. Nodes: CEN 5, NDC 6. Edges: is_a (CEN) 4, is_a (NDC) 5, articulations 9.]

• Given:– taxonomiesT1,T2– andrelationsT1~T2

(articulations,alignment)• Find:

– mergedtaxonomyT3• Suchthat:

– T1,T2arepreserved– allpairwiserelationsare

explicit

T1 T2

Page 70: ETC & Authors in the Drivers Seat

5 ways to relate concepts (regions)

• Idea: relate concepts X and Y with articulations

• Articulation language: Region Connection Calculus (RCC5): congruence, inclusion, inverse inclusion, overlap, disjointness
– Congruence: X == Y
– Inclusion: X > Y
– Inverse inclusion: X < Y
– Overlap: X >< Y
– Disjointness: X ! Y

[Figure: CEN and NDC region-level taxonomies linked by RCC5 articulations, e.g. CEN.Midwest == NDC.Midwest and CEN.USA == NDC.USA. Nodes: CEN 5, NDC 6. Edges: is_a (CEN) 4, is_a (NDC) 5, articulations 9.]
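One common way to make the five relations precise is to read each concept as a non-empty set of instances (here: the area or the states a region covers). The set-based reading below is an assumption for illustration; the slides only give the symbols. The block needs the `amsmath` and `amssymb` packages.

```latex
\begin{align*}
X \mathrel{==} Y &\iff X = Y \\
X > Y &\iff Y \subsetneq X \\
X < Y &\iff X \subsetneq Y \\
X \mathrel{><} Y &\iff X \cap Y \neq \emptyset,\ X \setminus Y \neq \emptyset,\ Y \setminus X \neq \emptyset \\
X \mathrel{!} Y &\iff X \cap Y = \emptyset
\end{align*}
```

Under this reading exactly one of the five relations holds for any pair of non-empty sets, which is why an articulation can also be given as a disjunction when the exact relation is unknown.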

Page 71: ETC & Authors in the Drivers Seat

Merged taxonomy T3

[Figure: inputs T1 (CEN) and T2 (NDC), the articulations T1 ~ T2, and the merged taxonomy T3. In T3, CEN.USA/NDC.USA and CEN.Midwest/NDC.Midwest appear as merged congruent concepts. T3 nodes: CEN 3, NDC 4, congruent 2. T3 edges: is_a (input) 8, overlaps (input) 3.]

Page 72: ETC & Authors in the Drivers Seat

How we align two taxonomies T1 and T2

• Step 1. Supply input taxonomies T1 and T2

• Step 2. Describe the relationships between T1 and T2

• Step 3. Iteratively edit articulations in Euler/X

[Figure: workflow loop. T1, T2 and the articulations A are fed to Euler/X, which computes N possible worlds. If N = 1, the result is the merged taxonomy T3. If N = 0 (inconsistent) or N > 1 (ambiguous), add or edit articulations A and run again.]

• … but where do the articulations come from??
– expert opinion
– automatically derived from data

Page 73: ETC & Authors in the Drivers Seat

Case 1: Census Region vs. National Diversity Council

[Figure: CEN map (West, South, Midwest, Northeast) next to the NDC map with states (West, Southwest, Southeast, Midwest, Northeast)]

• … but where do the articulations come from??
– automatically derived from data
– expert input

Page 74: ETC & Authors in the Drivers Seat

[Figure: bottom-up input for CEN vs NDC. Each state in CEN is articulated as congruent (==) to the same state in NDC, e.g. CEN.IL == NDC.IL, while the region-level concepts are left unarticulated. Nodes: CEN 54, NDC 55. Edges: isa_CEN 53, isa_NDC 54, articulations 49.]

Page 75: ETC & Authors in the Drivers Seat

[Figure: merged result for CEN vs NDC. Nodes: CEN 3, NDC 4, combined 51. Edges: input 61, inferred 3; overlaps inferred 3.]

USA, Midwest and state-level alignments are all congruent

Page 76: ETC & Authors in the Drivers Seat

[Figure: merged CEN vs NDC graph (nodes: CEN 3, NDC 4, combined 51; edges: input 61, inferred 3; overlaps inferred 3)]

The overlapping relations are automatically derived from data
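A rough Python sketch of how region-level relations could be derived bottom-up once the states are congruent: each region is treated as the set of states it contains, and the RCC5 relation follows from comparing the sets. The CEN assignments below follow the Census regions for a handful of states; the NDC assignments are a toy guess chosen to reproduce the relations shown on the slides, not the actual NDC map. The function names are hypothetical, and Euler/X itself uses logic-based reasoning rather than this set comparison.

```python
def rcc5(x, y):
    """RCC5 relation between two non-empty sets, in the slide notation."""
    if x == y:
        return "=="   # congruence
    if x > y:
        return ">"    # inclusion: x properly contains y
    if x < y:
        return "<"    # inverse inclusion: x is properly contained in y
    if x & y:
        return "><"   # overlap: shared and unshared members on both sides
    return "!"        # disjointness


# Small illustrative subset of states per region (NDC side is an assumption).
CEN = {
    "West": {"CA", "AZ"},
    "South": {"TX", "FL", "DC"},
    "Northeast": {"NY"},
    "Midwest": {"IL"},
}
NDC = {
    "West": {"CA"},
    "Southwest": {"AZ", "TX"},
    "Southeast": {"FL"},
    "Northeast": {"NY", "DC"},
    "Midwest": {"IL"},
}


def derive_articulations(t1, t2):
    """Compare every region of t1 with every region of t2."""
    return {(a, b): rcc5(sa, sb) for a, sa in t1.items() for b, sb in t2.items()}


articulations = derive_articulations(CEN, NDC)
for (a, b), relation in sorted(articulations.items()):
    if relation != "!":
        print(f"CEN.{a} {relation} NDC.{b}")
```

With this toy data, DC is what makes CEN.South overlap NDC.Northeast, and the split of the South into Southwest and Southeast shows up as one overlap and one inclusion.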

Page 77: ETC & Authors in the Drivers Seat

[Figure: merged CEN vs NDC graph (nodes: CEN 3, NDC 4, combined 51; edges: input 61, inferred 3; overlaps inferred 3)]

DC is in both the South (CEN) and the Northeast (NDC)

Page 78: ETC & Authors in the Drivers Seat

Case 2: Census Region vs Time Zone

[Figure: time zone map (Pacific, Mountain, Central, Eastern) next to the CEN map (West, South, Midwest, Northeast)]

• … but where do the articulations come from??
– automatically derived from data
– expert input

Page 79: ETC & Authors in the Drivers Seat

[Figure: top-down input and output for CEN vs TZ. Input: region-level articulations such as CEN.Northeast < TZ.Eastern and CEN.USA == TZ.USA (nodes: CEN 5, TZ 5; edges: isa_CEN 4, isa_TZ 4, articulations 12). Output, one possible world (nodes: CEN 4, combined 1, TZ 4; edges: input 7, overlaps input 6, overlaps inferred 1, inferred 1).]

Top-down regional alignment

Page 80: ETC & Authors in the Drivers Seat

How do we know if our ‘expert articulations’ are correct?

[Figure: map partitioned into regions R1 to R9]

GIS solution as the ground truth.

Page 81: ETC & Authors in the Drivers Seat

[Figure: regions R1 to R9 and the resulting graph of combined concepts, e.g. CEN.South*TZ.Central, CEN.South\TZ.Eastern and TZ.Eastern\CEN.South. Nodes: CEN 4, new combined 18, combined 1, TZ 4. Edges: input 6, inferred 37.]

Combined concepts solution for regional-level alignments
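The combined concepts in the figure use `A*B` for the part shared by two overlapping concepts and `A\B` for the part of A outside B. A minimal Python sketch of that split follows, assuming each concept can be represented as a set of atomic cells; the cell names `r1` to `r4` and the function `combined_concepts` are hypothetical and do not correspond to the R1 to R9 regions on the slide.

```python
DIFF = "\\"  # the backslash used in names such as CEN.South\TZ.Central


def combined_concepts(name_a, a, name_b, b):
    """Split two concepts into A*B, A\\B and B\\A, keeping non-empty parts only."""
    parts = {
        name_a + "*" + name_b: a & b,   # shared part
        name_a + DIFF + name_b: a - b,  # part of A outside B
        name_b + DIFF + name_a: b - a,  # part of B outside A
    }
    return {name: cells for name, cells in parts.items() if cells}


# Hypothetical atomic cells standing in for GIS polygons.
south = {"r1", "r2", "r3"}
central = {"r2", "r3", "r4"}

parts = combined_concepts("CEN.South", south, "TZ.Central", central)
for name, cells in parts.items():
    print(name, sorted(cells))
```

Each overlapping pair yields up to three new concepts, which is consistent with the alignment graph growing much larger once combined concepts are added.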

Page 82: ETC & Authors in the Drivers Seat

Do the taxonomies have to be spatial in order to use RCC-5?

• No! The more typical cases for taxonomy alignment are between non-spatial taxonomies
– for which no “GIS route” or direct visual cues about regional extensions are available
– the use of RCC-5 as an alignment vocabulary is a suitable approach to perform a wide range of multi-hierarchy reconciliations

Page 83: ETC & Authors in the Drivers Seat

Conclusion & Discussion

• Underscores the benefits of designing different alignment workflows (bottom-up vs. top-down)
– Bottom-up: non-overlapping relationships at the lowest-level articulations, not sure how to align the higher-level concepts
– Top-down: when there are often overlapping leaf-level relations. Expert input will frequently be needed to establish such expectations under the top-down approach

https://github.com/EulerProject/[email protected]

Page 84: ETC & Authors in the Drivers Seat

Implications

• Logic-based taxonomy alignment approach
– Disambiguates name-based taxonomy alignment over time
• 40% of the concepts in biology taxonomies undergo name changes over time (Franz et al., 2016)
– May mitigate problems in equivalent crosswalking
• Membership condition problem that was often criticized in crosswalking
– Preserves the original taxonomies while providing an alignment view
• Solves data integration problems that happen in the more coarse-grained relative crosswalking


Page 85: ETC & Authors in the Drivers Seat

Some History

• … Aristotle …
• … Euler …
• …
• … Greg Whitbread …

• [BPB93] J. H. Beach, S. Pramanik, and J. H. Beaman. Hierarchic taxonomic databases. Advances in Computer Methods for Systematic Biology: Artificial Intelligence, Databases, Computer Vision, 1993.

• [Ber95] Walter G. Berendsohn. The concept of “potential taxa” in databases. Taxon, 44:207–212, 1995.

• [Ber03] Walter G. Berendsohn. MoReTax – Handling Factual Information Linked to Taxonomic Concepts in Biology. No. 39 in Schriftenreihe für Vegetationskunde. Bundesamt für Naturschutz, 2003.

• [GG03] M. Geoffroy and A. Güntsch. Assembling and navigating the potential taxon graph. In [Ber03], pages 71–82, 2003.

• [TL07] Thau, D., & Ludäscher, B. (2007). Reasoning about taxonomies in first-order logic. Ecological Informatics, 2(3), 195-209.

• [FP09] Franz, N. M., & Peet, R. K. (2009). Perspectives: towards a language for mapping relationships among taxonomic concepts. Systematics and Biodiversity, 7(1), 5-20.

• …