Anubhav Jain
FireWorks workflow software
MAVRL workshop | Nov 2014
Energy & Environmental Technologies, Berkeley Lab
• There was no real “system” for running jobs
• Everything was very VASP specific
• No error detection / failure recovery
• When there was a mistake, it would take a week of manual labor to fix and rerun
• The first attempt was a horrible mash-up of things we had already built
  – Complicated by having 2 people “in charge”
• Sometimes it is better to start from a blank piece of paper with 1 leader
• #1 Google hit for “Python workflow software”
  – now even beats Adobe Fireworks for the #1 spot for “Fireworks workflow”!
• Won NERSC award for innovative use of HPC
• Used in many applications
  – genomics to computer graphics
  – this is not an “internal code” for running crystals
• Doc page ~200 hits/week
  – 1/10th of Materials Project
• What is FireWorks and why use it?
• Practical: learn to use FireWorks
[Diagram (shown twice): the old manual process — calc1 → restart → try_2; each stage meant “scp files/qsub”, “wait for finish”, then “retry failures/copy files/qsub again”]
[Diagram: the LAUNCHPAD holds FW 1–FW 4; a ROCKET LAUNCHER / QUEUE LAUNCHER pulls FireWorks and runs each in its own directory (Directory 1, Directory 2, …)]
You can scale without human effort. Easily customize what gets run where.
• Easy-to-install
  – FW currently at NERSC, SDSC, group clusters
  – Blue Gene planned
• Work within the limits of queue policies
• Pack jobs automatically
No job left behind!
[Diagram: each LAUNCH records what machine, what time, what directory, what the output was, when it was queued, when it started running, and when it was completed]
• both job details (scripts + parameters) and launch details are automatically stored
• Soft failures, hard failures, human errors
• We’ve been through it many times now…
• No longer a week’s effort
  – “lpad detect_lostruns --rerun” OR
  – “lpad rerun -s FIZZLED”
Xiaohui can be replaced by digital Xiaohui, programmed into FireWorks
• Submitting millions of jobs
  – Easy to lose track of what was done before
• Multiple users submitting jobs
• Sub-workflow duplication
[Diagram: two workflows each containing an identical step “A”]
Duplicate Job detection (if two workflows contain an identical step, ensure that the step is only run once and relevant information is still passed)
• Within workflow, or between workflows
• Completely flexible
Now seems like a good time to bring up the last few lines of the OUTCAR of all failed jobs...
• Ridiculous amount of documentation and tutorials
  – complete strangers are experts w/o my help
  – but many grad students/postdocs still complain w/o reading the docs
• Built-in tasks
  – run BASH/Python scripts
  – file transfer (incl. remote)
  – write/copy/delete files
• Paper in submission
  – happy to share preprint
• What is FireWorks and why use it?
• Practical: learn to use FireWorks
[Diagram: FW 1 = Spec + FireTask 1 + FireTask 2]
• Each FireWork is run in a separate directory, maybe on a different machine, within its own batch job (in queue mode)
• The spec contains parameters needed to carry out FireTasks
• FireTasks are run in succession in the same directory
• A FireWork can modify the Spec of its children based on its output (pass information) through a FWAction
• The FWAction can also modify the workflow
[Diagram: child FireWorks — FW 2 (Spec + FireTask 1) and FW 3 (Spec + FireTasks 1–3) — connected to their parent by FWActions]
[Diagram: example workflow — two FireWorks with specs input_array: [1, 2, 3] and input_array: [4, 5, 6], each running 1. Sum input array, 2. Write to file, 3. Pass result to next job; their sums (6 and 15) feed a final FireWork with spec input_data: [6, 15], which sums and writes the data and then copies the result to the home dir]
from fireworks import FireTaskBase, FWAction

class MyAdditionTask(FireTaskBase):
    _fw_name = "My Addition Task"

    def run_task(self, fw_spec):
        input_array = fw_spec['input_array']
        m_sum = sum(input_array)
        print("The sum of {} is: {}".format(input_array, m_sum))
        with open('my_sum.txt', 'a') as f:
            f.writelines(str(m_sum) + '\n')
        # store the sum; push the sum to the input array of the next sum
        return FWAction(stored_data={'sum': m_sum},
                        mod_spec=[{'_push': {'input_array': m_sum}}])
See also: http://pythonhosted.org/FireWorks/guide_to_writing_firetasks.html
[Diagram (recap): the same example workflow — the input_array: [1, 2, 3] and input_array: [4, 5, 6] FireWorks feed input_data: [6, 15] into the final step, which also copies the result to the home dir; outputs: 6 and 15!]
from fireworks import Firework, FWorker, LaunchPad, Workflow
from fireworks.core.rocket_launcher import rapidfire
from fireworks.user_objects.firetasks.fileio_tasks import FileTransferTask

# set up the LaunchPad and reset it
launchpad = LaunchPad()
launchpad.reset('', require_password=False)

# create a Workflow consisting of AdditionTask FWs + file transfer
# (MyAdditionTask as defined on the previous slide)
fw1 = Firework(MyAdditionTask(), {"input_array": [1, 2, 3]}, name="pt 1A")
fw2 = Firework(MyAdditionTask(), {"input_array": [4, 5, 6]}, name="pt 1B")
fw3 = Firework([MyAdditionTask(),
                FileTransferTask({"mode": "cp", "files": ["my_sum.txt"], "dest": "~"})],
               name="pt 2")
wf = Workflow([fw1, fw2, fw3], {fw1: fw3, fw2: fw3}, name="MAVRL test")
launchpad.add_wf(wf)

# launch the entire Workflow locally
rapidfire(launchpad, FWorker())
• lpad get_wflows -d more
• lpad get_fws -i 3 -d all
• lpad webgui
• Also rerun features
See all reporting at official docs: http://pythonhosted.org/FireWorks
• There are a ton in the documentation and tutorials — just try them!
  – http://pythonhosted.org/FireWorks
• I want an example of running VASP!
  – https://github.com/materialsvirtuallab/fireworks-vasp
  – https://gist.github.com/computron/
    ▪ look for “fireworks-vasp_demo.py”
  – Note: demo is only a single VASP run
  – multiple VASP runs require passing directory names between jobs
    ▪ currently you must do this manually
    ▪ in future, perhaps build this into FireWorks
• It is not an accident that we are able to support so many advanced features in such a short time
  – many features not found anywhere else!
• FireWorks is designed to:
  – leverage modern tools
  – be extensible at a fundamental level, not post-hoc feature additions
fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'To be, or not to be,'
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'that is the question:'
links:
  1:
  - 2
metadata: {}
(this is YAML, a bit prettier for humans but less pretty for computers)
The same JSON document will produce the same result on any computer (with the same Python functions).
fws:
- fw_id: 1
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'To be, or not to be,'
- fw_id: 2
  spec:
    _tasks:
    - _fw_name: ScriptTask
      script: echo 'that is the question:'
links:
  1:
  - 2
metadata: {}
Just some of your search options:
• simple matches
• match in array
• greater than/less than
• regular expressions
• match subdocument
• Javascript function
• MapReduce…
All for free, and all on the native workflow format!
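Because each workflow is stored as a JSON/BSON document, those searches are ordinary MongoDB query documents. A few of the options above, written as the Python dicts you would pass to a MongoDB driver (the field names here are illustrative, not a fixed schema):

```python
# MongoDB query documents for searching workflows/FireWorks.
# (Field names are illustrative examples, not a guaranteed schema.)

simple_match = {"name": "MAVRL test"}                  # simple match
array_match  = {"spec.input_array": 3}                 # match inside an array
range_query  = {"fw_id": {"$gt": 1, "$lte": 3}}        # greater than / less than
regex_query  = {"name": {"$regex": "^MAVRL"}}          # regular expression
subdoc_query = {"spec._tasks._fw_name": "ScriptTask"}  # match in subdocument

queries = [simple_match, array_match, range_query, regex_query, subdoc_query]
```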
Use MongoDB’s dictionary update language to allow for JSON document updates
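A sketch of what such a dictionary-style update looks like. The toy applier below mirrors the spirit of MongoDB’s `$set`/`$push` operators on a plain spec dict (illustration only; it is not how FireWorks or MongoDB implement updates internally):

```python
# Toy applier for MongoDB-style dictionary updates on a plain JSON spec.
# Mimics the idea of Mongo's $set / $push operators (illustration only).

def apply_update(doc, update):
    for field, value in update.get("$set", {}).items():
        doc[field] = value                       # overwrite a field
    for field, value in update.get("$push", {}).items():
        doc.setdefault(field, []).append(value)  # append to an array field
    return doc

spec = {"input_array": [1, 2, 3]}
apply_update(spec, {"$push": {"input_array": 6}})   # like mod_spec's _push
apply_update(spec, {"$set": {"status": "updated"}})
# spec is now {"input_array": [1, 2, 3, 6], "status": "updated"}
```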
Workflows can create new workflows or add to the current workflow:
• a recursive workflow
• calculation “detours”
• branches
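The “detour” idea can be sketched as a tiny toy runner (again, an illustration of the concept, not FireWorks code): when a step finishes, it may return extra steps that are spliced in ahead of the steps that were originally next.

```python
from collections import deque

# Toy workflow runner where a finished step may inject new steps ("detours").
def run_workflow(first_steps):
    log = []
    queue = deque(first_steps)
    while queue:
        step = queue.popleft()
        detours = step(log)                      # a step may return new steps
        if detours:
            queue.extendleft(reversed(detours))  # splice detours in front
    return log

def relax(log):
    log.append("relax")
    if "converged" not in log:       # not converged yet -> detour and retry
        return [mark_converged, relax]
    return None

def mark_converged(log):
    log.append("converged")

def static(log):
    log.append("static")

# relax -> (detour: converge, relax again) -> static
history = run_workflow([relax, static])
```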
• Theme: a worker machine pulls a job & runs it
• Variation 1:
  – different workers can be configured to pull different types of jobs via config + MongoDB
• Variation 2:
  – worker machines sort the jobs by a priority key and pull the highest-priority matching jobs
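Both variations boil down to a query plus a sort at pull time. A toy in-memory version (the `category` and `priority` field names are illustrative):

```python
# Toy version of a worker pulling jobs: filter by category, sort by priority.
jobs = [
    {"id": 1, "category": "gpu", "priority": 5},
    {"id": 2, "category": "cpu", "priority": 9},
    {"id": 3, "category": "cpu", "priority": 2},
]

def pull_job(pool, worker_category):
    """Variation 1: match the worker's category; Variation 2: highest priority first."""
    matching = [j for j in pool if j["category"] == worker_category]
    if not matching:
        return None
    job = max(matching, key=lambda j: j["priority"])
    pool.remove(job)   # "pull": claim the job out of the shared pool
    return job

first = pull_job(jobs, "cpu")   # highest-priority cpu job
second = pull_job(jobs, "cpu")  # the remaining cpu job
```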
[Diagram: the Queue Launcher (running on the Hopper head node) keeps the queue filled with many “thruput” jobs]
• more complex queuing schemes also possible
  – it’s always the same “pull and run”, or a slight variation on it!
1. Job wakes up when PBS runs it
2. Grabs the latest job description from an external DB (pull)
3. Runs the job based on the DB description
• Multiple processes pull and run jobs simultaneously
  – It is all the same thing, just sliced* different ways!
[Diagram: 1 large job — mpirun → Node 1, Node 2, …, Node n; each node runs an independent process doing “Query Job → job A/B/…/X → update DB” for mol a, mol b, …, mol x]
*get it? wink wink
because jobs are JSON, they are completely serializable!
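Serializable here just means a job survives a round trip through JSON text, so it can be stored in the database by one machine and reconstituted, identically, on another. For example:

```python
import json

# A FireWork-style job description is a plain JSON document
fw_doc = {
    "fw_id": 1,
    "spec": {"_tasks": [{"_fw_name": "ScriptTask",
                         "script": "echo 'To be, or not to be,'"}]},
}

wire = json.dumps(fw_doc)    # store in the database / send over the network
rebuilt = json.loads(wire)   # any machine can reconstitute the exact job
assert rebuilt == fw_doc     # same document -> same behavior anywhere
```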
• When a job runs, a separate thread periodically pings an “alive” signal to the database
• If that alive signal doesn’t appear for some time, the job is dead
  – this method is robust for all types of failures
• The ping thread is also reused to track the output files and report the results to the database
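The heartbeat mechanism can be sketched with a plain thread and a timestamp (a toy model of the idea, not FireWorks’ internals): the job’s ping thread refreshes a “last seen” time, and a detector declares the job lost once that time goes stale.

```python
import threading
import time

# Toy heartbeat: a ping thread refreshes last_ping; a detector checks staleness.
class Heartbeat:
    def __init__(self, interval=0.05):
        self.last_ping = time.monotonic()
        self._stop = threading.Event()
        self._t = threading.Thread(target=self._loop, args=(interval,), daemon=True)
        self._t.start()

    def _loop(self, interval):
        while not self._stop.is_set():
            self.last_ping = time.monotonic()  # in FireWorks this would be an
            time.sleep(interval)               # "alive" record written to the DB

    def stop(self):                            # simulates the job dying
        self._stop.set()
        self._t.join()

def is_lost(hb, timeout=0.3):
    """Detector: no ping for `timeout` seconds -> the job is presumed dead."""
    return time.monotonic() - hb.last_ping > timeout

hb = Heartbeat()
time.sleep(0.2)
alive_check = is_lost(hb)   # pings are still arriving, so not lost
hb.stop()                   # job "dies" -- no more pings
time.sleep(0.5)
lost_check = is_lost(hb)    # last ping is now stale, so lost
```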