View
624
Download
3
Category
Tags:
Preview:
DESCRIPTION
http://open-bio.org/bosc2004/accepted_abstracts.html
Citation preview
BioMakeBioMakeChris MungallChris Mungall
Berkeley Drosophila Genome Berkeley Drosophila Genome ProjectProject
cjm@fruitfly.orgcjm@fruitfly.org
build networksbuild networks
Many bioinformaticians spend large amounts Many bioinformaticians spend large amounts of time coding and running of time coding and running build/make build/make networksnetworks
A build network is a recipe describing the A build network is a recipe describing the execution of a collection of execution of a collection of interdependent interdependent heterogeneous tasksheterogeneous tasks sequence analysis pipelinessequence analysis pipelines data ‘compilation’data ‘compilation’
importing, transforming and exporting dataimporting, transforming and exporting data LIMSLIMS
Error prone, tedious repetitive code and hard Error prone, tedious repetitive code and hard to configureto configure
Existing approachesExisting approaches
Run tasks by hand, or with ad-hoc scriptsRun tasks by hand, or with ad-hoc scripts doesn’t scale, leads to insanitydoesn’t scale, leads to insanity
Unix/GNU makefilesUnix/GNU makefiles ++: concise, generic, high level abstraction++: concise, generic, high level abstraction --: limited expressive power, hacky--: limited expressive power, hacky
makefile replacements (cons,scons,ant,build…)makefile replacements (cons,scons,ant,build…) geared towards software developmentgeared towards software development
Bio compute pipeline software (biopipe, Bio compute pipeline software (biopipe, enspipe)enspipe) excellent for certain tasksexcellent for certain tasks not completely genericnot completely generic
biomake: executable biomake: executable computational protocolscomputational protocols
A declarative language for specifying A declarative language for specifying build networksbuild networks concise, Turing-complete, highly configurableconcise, Turing-complete, highly configurable
Dependency managementDependency management e.ge.g mask genomic sequence prior to mask genomic sequence prior to
genefindinggenefinding
Local and remote job executionLocal and remote job execution compute farm job managementcompute farm job management
Filesystem or database orientedFilesystem or database oriented
Example Genomic Example Genomic Sequence Analysis Sequence Analysis
PipelinePipeline Prepare and cut assembled sequence into slicesPrepare and cut assembled sequence into slices Download latest NR peptide dataset and index itDownload latest NR peptide dataset and index it Blast genomic slices against NR and other Blast genomic slices against NR and other
datasetsdatasets RepeatMask genomic slices and run genefinders RepeatMask genomic slices and run genefinders
on masked sequenceon masked sequence Synthesise gene models and do peptide Synthesise gene models and do peptide
analysis on resultsanalysis on results Store everything in a relational db, and prepare Store everything in a relational db, and prepare
files for export to publicfiles for export to public
Target dependenciesTarget dependencies
AssembledGenomic Sequence
ChunkedSequence
RepeatMaskedSequence
GenePredictions
BlastAlignments
FastaDB(local)
FastaDB
(remote)
BlastIndexedFastaDB
RelationalDatabase
GeneModels
BlastP & HMMAlignments
XMLExpo
rt
XMLImport
FlatfileExport
Specifying TargetsSpecifying Targets
ChunkedSequence
BlastAlignments
FastaDB(local)
BlastIndexedFastaDB
formatdb(D) run: formatdb -i D
blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out
A generic targetpattern has a nameand arguments
A target pattern has tags for specifying dependencies, actions and filesystem or database IDs
BioMake ExecutionBioMake ExecutionTARGET:blast(blastx, my.seq, nr.fly, -)
SUBTARGET:formatdb(nr.fly)
RUN: formatdb -i nr -p TOUT: nr.fly.{psq,pin,phr} formatdb-nr.fly.OK
RUN: blastall -p blastx -i my.seq -d nr.fly
OUT:my.seq-blast/my.seq.nr.fly.blastx.outmy.seq-blast/my.seq.nr.fly.blastx.out.OK
formatdb(D) run: formatdb -i D
blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out
Targets can be nestedTargets can be nested
bop(S,B) run: apollo -bop -s S B -o target flat: B.game.xml
store(XML) run: xml2db XML
store(bop(my.seq, genscan(repeatmask(my.seq, drosophila))))
target instantiations can be thought of as a skolem IDs
repeatmask(S,Org) run: repeatmask S -a Org flat: S.mask
genscan(S) run: genscan S flat: S-pred/bn(S).genscan.out
IterationIteration
Pipelines frequently involve iterating Pipelines frequently involve iterating over collections of data:over collections of data: Perform a sequence analysis on every entry Perform a sequence analysis on every entry
in a multi-fasta format filein a multi-fasta format file Perform a peptide analysis on every gene Perform a peptide analysis on every gene
prediction in some genscan outputprediction in some genscan output Query a database for a list of IDs and Query a database for a list of IDs and
perform some task on eachperform some task on each
biomake has language constructs for biomake has language constructs for iterationiteration
Iterating over datasetsIterating over datasets
splitfasta(F) run: splitfasta.pl -d F-split -md5 F flat: F-split/bn(F).pathlist comment: splitfasta.pl is part of the biomake distro
analyze_multifasta(F) iterate: analyze_seq(S) where S in splitfasta(F)
analyze_seq(S) req: genie(S) blast(blastx,S,nr,-)
MultiFasta>seq1TAGGTATTGGTTAGGTGCGTCCTC>seq2GCGGTATAGCTTTTCCTTCTCTCT>seq3CAAAGCAGAGATATATTTATTCGC
>seq1TAGGTATTGGTTAGGTGCGTCCTC
>seq2GCGGTATAGCTTTTCCTTCTCTCT
>seq3CAAAGCAGAGATATATTTATTCGC
seq1.genie.out
seq1.nr.blastx.out
seq2.genie
seq2.nr.blastx.out
seq3.nr.blastx.out
seq3.genie
Controlling the runmodeControlling the runmode
Tasks can be run locally or on a Tasks can be run locally or on a compute farm, synchronously or compute farm, synchronously or asynchronouslyasynchronously wrapper provided for PBSwrapper provided for PBS
runmode:runmode: tag states the mode and tag states the mode and wrapper for a particular target patternwrapper for a particular target pattern can be set globally and per-patterncan be set globally and per-pattern
special status targets provide special status targets provide execution statusexecution status
runmode examplerunmode example
blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out runmode: async(qsubwrap)
The blast job will beexecuted on the computefarm via qsubwrap (comes with biomake distro)
Upon submission, the status targetstatus_run(blast(P,S,D,A))will be generated; oncompletion, the targetstatus_ok (blast(P,S,D,A))will be generated
biomake can automaticallyhandle moving data inand out between user’sfilesystem (or db) and local cluster nodes
DatastoresDatastores
BioMake persists targets in DatastoresBioMake persists targets in Datastores The The flat:flat: tag flattens targets to unique tag flattens targets to unique
datastore IDsdatastore IDs Datastore can be filesystem or relational Datastore can be filesystem or relational
databasedatabase default is filesystemdefault is filesystem can be set globally or per targetcan be set globally or per target
e.g. analysis result targets can be stored on filesystem, e.g. analysis result targets can be stored on filesystem, status targets stored in DBstatus targets stored in DB
NFS traffic can be avoided on compute farm by NFS traffic can be avoided on compute farm by storing targets and status targets in a databasestoring targets and status targets in a database
Asynchronous ExecutionAsynchronous Execution
local machinerunning biomake
scheduler
node
NFS
For each target T to be built:1) biomake fetches status of T skips T if status = ok/run2) biomake stores status_run(T)3) biomake creates a runner agent script and submits it to the cluster4) continue onto next target
1) agent fetches any input data2) agent runs command (eg blast)synchronously3) agent stores result4) agent stores status of T as‘ok’ or ‘err’
AGENT
AGENT
Specifying rulesSpecifying rules
Pipeline systems often require a rule Pipeline systems often require a rule basebase
only do nuc to nuc alignments on one species or only do nuc to nuc alignments on one species or two recently diverged speciestwo recently diverged species
use sequence ontology hierarchy to decide use sequence ontology hierarchy to decide analyses or parametersanalyses or parameters
biomake protocols can have prolog facts biomake protocols can have prolog facts and rules embedded inside themand rules embedded inside them
biomake distro comes with SO prolog db biomake distro comes with SO prolog db and rules for graph traversaland rules for graph traversal
<data relation=“fastadb” cols=“ID,SeqAlphabet,SOType,Org” del=“ws”>protein.fly.fstprotein.fly.fst aa
aapolypeptidepolypeptide D melanogasterD melanogaster
est.fly.fstest.fly.fst nnaa
ESTEST D melanogasterD melanogaster
cdna.fly.fstcdna.fly.fst nnaa
cDNAcDNA D melanogasterD melanogaster</data><prolog> nucdb(D):- fastadb(D,na,_,_). pepdb(D):- fastadb(D,aa,_,_).</prolog>
formatdb(D) run: formatdb -i D -p ‘T’ {pepdb(D)} run: formatdb -i D -p ‘F’ {nucdb(D)}
Embedding prolog factsEmbedding prolog facts
BioMake module systemBioMake module system
The biomake core language is generic The biomake core language is generic no bioinformatics-specific code or tweaksno bioinformatics-specific code or tweaks
biomake uses a module systembiomake uses a module system biomake distro comes withbiomake distro comes with
biosequence_analysis modulebiosequence_analysis module Sequence Ontology prolog db and rulesSequence Ontology prolog db and rules scripts for handling bioinformatics datascripts for handling bioinformatics data
biomake extensibilitybiomake extensibility
biomake is a declarative languagebiomake is a declarative language embodies both logical and functional embodies both logical and functional
paradigmsparadigms targets are actually higher order functionstargets are actually higher order functions standard FP functions available in fp standard FP functions available in fp
modulemodule cons,map,grep,filter,fold,…cons,map,grep,filter,fold,…
goal: expressive power, concise goal: expressive power, concise specifications and simplicityspecifications and simplicity
BioMake in useBioMake in use
currently we’re using biomake for…currently we’re using biomake for… analysis of repeat families found in analysis of repeat families found in
orthologous and paralogous intronsorthologous and paralogous introns Building the Gene Ontology databaseBuilding the Gene Ontology database
……but we are still dependent on but we are still dependent on legacy pipeline code for many legacy pipeline code for many analysesanalyses
Running biomakeRunning biomake
Get distro from Get distro from http://skam.sourceforge.nethttp://skam.sourceforge.net Requires XSB PrologRequires XSB Prolog
http://xsb.sourceforge.nethttp://xsb.sourceforge.net Run via command lineRun via command line
similar to unix make commandsimilar to unix make command Works on both OS X and LinuxWorks on both OS X and Linux Relational datastore requires mysql (Pg Relational datastore requires mysql (Pg
soon)soon) Better docs coming soon, lots of examplesBetter docs coming soon, lots of examples
AcknowledgementsAcknowledgements
Shengqiang ShuSima MisraErwin FriseEric SmithMark YandellGeorge HartzellChris SmithSimon Prochnik
Jon TupyJosh KaminkerKaren EilbeckNomi HarrisSuzanna LewisGerry Rubin
Problem SpecificationProblem Specification A build network consist of multiple A build network consist of multiple TargetsTargets
e.ge.g the output from a the output from a blastxblastx alignment of alignment of my.seqmy.seq to to the protein database the protein database nr.flynr.fly
Targets have a Targets have a logical patternlogical pattern e.ge.g blastblast alignment using alignment using PP of some seq of some seq SS vs some db vs some db
DD Targets are Targets are dependentdependent on other Targets on other Targets
e.ge.g blast depends on the indexing of db blast depends on the indexing of db DD using using formatdbformatdb
Upstream changes trigger downstream actionsUpstream changes trigger downstream actions Targets are built by Targets are built by running actionsrunning actions
““formatdb -p T -i nr.flyformatdb -p T -i nr.fly”” ““blastall -p blastx -i my.seq -d nr.flyblastall -p blastx -i my.seq -d nr.fly””
Recommended