This is a talk I've given twice at Stanford recently. It's essentially a brain dump of my thoughts on being a Bioinformatician with lots of links to useful tools.
Tips & Tricks for Software Engineering in Bioinformatics
Presented by: Joel Dudley
Who is this guy?
[Figure: Avg. time spent programming (hours) vs. Age (years), ages 5 to 32]
http://www.megasoftware.net
Kumar S. and Dudley J. “Bioinformatics software for biologists in the genomics era.” Bioinformatics (2007) vol. 23 (14) pp. 1713-7
Bioinformatics Philosophy
Build Your Toolbox
Learn UNIX!
Be a jack of all trades, but master of one.
http://oreilly.com/news/graphics/prog_lang_poster.pdf
C/C++, PHP, R, Perl, Python, Ruby, Java, LISP, VB
Java is not just for Java
http://www.jython.org
http://jruby.codehaus.org
Simplified Wrapper and Interface Generator (SWIG)
http://www.swig.org/
Greasy-fast C library
Doughy-soft scripting language
Frameworks are Friends
BioBike
Stand on the slumped, dandruff-covered shoulders of millions of computer nerds.
Don’t trust yourself (or your hard disk).
#!/usr/bin/perl
# 472-byte qrpff, Keith Winstein and Marc Horowitz <[email protected]>
# MPEG 2 PS VOB file -> descrambled output on stdout.
# usage: perl -I <k1>:<k2>:<k3>:<k4>:<k5> qrpff
# where k1..k5 are the title key bytes in least to most-significant order
s''$/=\2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,qT,@b=map{ord qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9|256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7^O)^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"\xb\ntd\xbz\x14d")[_/16%8]);E^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>>=8)+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/\$$&/g;s/q/pack+/g;eval
Don’t be afraid to use more than three letters to define a variable!
Architecture Accomplishment
Object-Oriented Software Design Decisions
module GraphBuilder
  LINE_TYPES = [:solid, :dashed, :dotted]

  module Nodes
    SHAPE_TYPES = [:rectangle, :roundrectangle, :ellipse, :parallelogram, :hexagon, :octagon,
                   :diamond, :triangle, :trapezoid, :trapezoid2, :rectangle3d]

    class BaseNode
      attr_accessor :label, :geometry, :fill_colors, :outline, :degree, :data

      def initialize(opts={})
        @opts = {
          :form => :ellipse,
          :height => 50.0,
          :width => 50.0,
          :label => "GraphNode#{self.object_id}",
          :line_type => :solid,
          :fill_color => {:R=>255, :G=>204, :B=>0, :A=>255},
          :fill_color2 => nil,
          :data => {},
          :outline_color => {:R=>0, :G=>0, :B=>0, :A=>255}, # Set to nil or {:R=>0,:G=>0,:B=>0,:A=>0} for no outline
        }.merge(opts)
        @data = @opts[:data] # for storing application-specific data
        @label = Labels::NodeLabel.new(@opts[:label])
        @geometry = {:pos_x=>0.0, :pos_y=>0.0, :width=>1.0, :height=>1.0}
        @fill_colors = [@opts[:fill_color], nil]
        @outline = {:line_type=>@opts[:line_type], :color=>@opts[:outline_color]}
        @degree = {:in=>0, :out=>0}
      end

      def clone_params
        {
          :label => text,
          :fill_color => @fill_colors.first,
          :form => @form,
          :height => @geometry[:height],
          :width => @geometry[:width]
        }
      end
    end

    class ShapeNode < BaseNode
      attr_accessor :form

      def initialize(opts={})
        super
        @form = @opts[:form]
        @geometry[:height] = @opts[:height]
        @geometry[:width] = @opts[:width]
      end
      # (excerpt ends here; the rest of the class is not shown)
    end
  end
end
To subclass or not to subclass? Use mixins!

require 'bigdecimal/math' # provides BigMath.log for the fallback below

class Array
  def arithmetic_mean
    self.inject(0.0) { |sum,x| x = x.real if x.is_a?(Complex); sum + x.to_f } / self.length.to_f
  end

  def geometric_mean
    begin
      Math.exp(self.select { |x| x > 0.0 }.collect { |x| Math.log(x) }.arithmetic_mean)
    rescue Errno::ERANGE
      Math.exp(self.select { |x| x > 0.0 }.collect { |x| BigMath.log(x,50) }.arithmetic_mean)
    end
  end

  def median
    sorted = self.sort # the median is only meaningful on ordered values
    if sorted.length.odd?
      sorted[sorted.length / 2]
    else
      upper_median = sorted[sorted.length / 2]
      lower_median = sorted[(sorted.length / 2) - 1]
      [upper_median,lower_median].arithmetic_mean
    end
  end

  # Sample standard deviation (divides by n - 1)
  def standard_deviation
    mean = self.arithmetic_mean
    deviations = self.map { |x| x - mean }
    sqr_deviations = deviations.map { |x| x**2 }
    sum_sqr_deviations = sqr_deviations.inject(0.0) { |sum,x| sum + x }
    Math.sqrt(sum_sqr_deviations/(self.length - 1).to_f)
  end
  alias_method :sd, :standard_deviation

  def shuffle
    sort_by { rand }
  end

  def shuffle!
    self.replace shuffle
  end
end
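The slide title recommends mixins, while the snippet reopens Array directly. As a rough sketch of the mixin alternative, the statistics can live in a module that any enumerable class includes. The names DescriptiveStats and ExpressionProfile are hypothetical and do not come from the talk.

```ruby
# Hypothetical mixin: statistics for anything that responds to #to_a.
module DescriptiveStats
  def arithmetic_mean
    values = to_a
    values.inject(0.0) { |sum, x| sum + x.to_f } / values.length
  end

  def median
    sorted = to_a.sort
    mid = sorted.length / 2
    sorted.length.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end

# Hypothetical domain class: it gains the statistics without subclassing Array.
class ExpressionProfile
  include Enumerable
  include DescriptiveStats

  def initialize(values)
    @values = values
  end

  # Enumerable (and therefore #to_a) only needs #each.
  def each(&block)
    @values.each(&block)
  end
end

profile = ExpressionProfile.new([2.0, 8.0, 4.0])
puts profile.arithmetic_mean
puts profile.median

# A single object can also be extended without touching the Array class itself.
counts = [1, 2, 3, 4].extend(DescriptiveStats)
puts counts.median
```

The trade-off is that callers must opt in with include or extend, but the core Array class stays untouched for every other library in the process.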
Documenting code sucks! Automate it.
• Come up with a convention for your “headers”
• Use automated documentation generation tools
  • JavaDoc
  • RDoc
  • Pydoc / Epydoc
• Save code snippets in a searchable repository
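One possible header convention, sketched here for RDoc: a one-line summary, a labeled list of parameters, the return value, and an indented usage line. The function gc_content is a hypothetical example, not code from the talk. Running the rdoc command over a file like this should turn the comment into HTML documentation.

```ruby
# Computes the GC content of a nucleotide sequence.
#
# sequence:: String of nucleotides (case-insensitive). Characters other than
#            G and C count toward the length only.
#
# Returns a Float between 0.0 and 1.0, or 0.0 for an empty sequence.
#
#   gc_content("GATTACA")
def gc_content(sequence)
  return 0.0 if sequence.empty?
  sequence.upcase.count("GC") / sequence.length.to_f
end

puts gc_content("GGCCAATT")
```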
A little performance optimization goes a long way
• General tools
  • DTrace
  • strace
  • gdb
• Language specific
  • ruby-prof
  • Psyco/Pyrex
  • JBoss Profiler/JIT
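Before reaching for a full profiler, it often helps to time two candidate implementations side by side. The sketch below is an illustration under assumptions: the helper elapsed_seconds and both reverse-complement functions are hypothetical, and the timings will vary by machine.

```ruby
COMPLEMENT = { 'A' => 'T', 'C' => 'G', 'G' => 'C', 'T' => 'A' }.freeze

# Straightforward version: look up every base in a Hash.
def reverse_complement_slow(seq)
  seq.reverse.each_char.map { |base| COMPLEMENT[base] }.join
end

# Alternative: String#tr performs the substitution in a single built-in call.
def reverse_complement_fast(seq)
  seq.reverse.tr('ACGT', 'TGCA')
end

# Times a block with the monotonic clock and returns elapsed seconds.
def elapsed_seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

sequence = 'ACGT' * 50_000

slow = elapsed_seconds { reverse_complement_slow(sequence) }
fast = elapsed_seconds { reverse_complement_fast(sequence) }
puts format('hash lookup: %.4f s', slow)
puts format('String#tr:   %.4f s', fast)
```

Ruby's standard Benchmark module or ruby-prof give more detailed reports; the point here is only that measuring comes before optimizing.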
Working with data
# Copyright © 1996-2007 SRI International, Marine Biological Laboratory, DoubleTwist Inc.,
# The Institute for Genomic Research, J. Craig Venter Institute, University of California at San Diego, and UNAM. All Rights Reserved.
#
#
# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com/ptools/flatfile-format.html .
#
# Species: E. coli K-12
# Database: EcoCyc
# Version: 11.5
# File Name: dnabindsites.dat
# Date and time generated: August 6, 2007, 17:32:33
#
# Attributes:
#    UNIQUE-ID
#    TYPES
#    COMMON-NAME
#    ABS-CENTER-POS
#    APPEARS-IN-BINDING-REACTIONS
#    CITATIONS
#    COMMENT
#    COMPONENT-OF
#    COMPONENTS
#    CREDITS
#    DATA-SOURCE
#    DBLINKS
#    INSTANCE-NAME-TEMPLATE
#    INVOLVED-IN-REGULATION
#    LEFT-END-POSITION
#    REGULATED-PROMOTER
#    RELATIVE-CENTER-DISTANCE
#    RIGHT-END-POSITION
#    SYNONYMS
#
UNIQUE-ID - BS86
TYPES - DNA-Binding-Sites
ABS-CENTER-POS - 4098761
CITATIONS - 94018613
CITATIONS - 94018613:EV-EXP-IDA-BINDING-OF-CELLULAR-EXTRACTS:3310246267:martin
CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
COMPONENT-OF - TU00064
INVOLVED-IN-REGULATION - REG0-5521
TYPE-OF-EVIDENCE - :BINDING-OF-CELLULAR-EXTRACTS
//
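To show what "parsing" a flat file like this involves, here is a minimal sketch of a reader for the attribute-value layout above. It assumes only what the sample shows: "#" comment lines, "ATTRIBUTE - value" pairs, and "//" as the record terminator. The function name parse_flatfile is hypothetical, and anything else the real format specification allows is not handled.

```ruby
# Parses "ATTRIBUTE - value" records terminated by "//" into an Array of Hashes.
# Attributes may repeat (e.g. CITATIONS), so every value is stored in an Array.
def parse_flatfile(text)
  records = []
  current = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    if line == '//'
      records << current unless current.empty?
      current = Hash.new { |hash, key| hash[key] = [] }
    else
      attribute, value = line.split(' - ', 2)
      current[attribute] << value.to_s
    end
  end
  records << current unless current.empty?
  records
end

sample = <<~RECORD
  # File Name: dnabindsites.dat
  UNIQUE-ID - BS86
  TYPES - DNA-Binding-Sites
  ABS-CENTER-POS - 4098761
  CITATIONS - 94018613
  CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
  COMPONENT-OF - TU00064
  //
RECORD

sites = parse_flatfile(sample)
puts sites.first['UNIQUE-ID'].first
puts sites.first['CITATIONS'].length
```

Every script that touches the file needs code like this, which is the motivation for loading the data once into a store that can be queried.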
http://www.oracle.com/technology/products/berkeley-db/index.html
If you can represent most of your data as key/value pairs, then at the very least use a BerkeleyDB
In most cases a relational database is an appropriate choice for bioinformatics data
• Clean and consolidated (vs. a rat's nest of files and folders)
• Improved performance (memory usage and file I/O)
• Data consistency through constraints and transactions
• Easily portable (SQL-92 standard)
• Querying (asking questions about data) vs. parsing (reading and loading data)
• Commonly used data processing functions can be implemented as stored procedures
“But I’m a scientist, not a DBA! Harrumph!”
“...SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine...”
http://www.sqlite.org
But seriously, don’t write any SQL (What?)
Object Relational Mapper (ORM)
Relational Database (MySQL, PostgreSQL, Oracle, etc.)
Model
Instance
Beyond the RDBMS
http://strokedb.com/
http://www.hypertable.org
http://incubator.apache.org/couchdb
Thinking in Parallel
Loosely Coupled
• Each task is independent
• No synchronous inter-task communication
• Example: Computing a Maximum Likelihood Phylogeny for every gene family in the Panther Database
• Software: OpenPBS, SGE, Xgrid, Platform LSF
Tightly Coupled
• Tasks are interdependent
• Synchronous inter-task communication via messaging interface
• Example: Monte Carlo simulation of 3D protein interactions in cytoplasm
• Software: OpenMPI, MPICH, PVM
Use your idle CPU cores!
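For loosely coupled work on a single machine, one simple approach is to fork a child process per task and collect the results through pipes. The sketch below assumes a Unix-like system where fork is available. The helper parallel_map is hypothetical, and real projects may prefer an existing library or a cluster scheduler.

```ruby
require 'etc'

# Runs the block on each item in its own child process and returns the
# results in input order. At most `workers` children run at a time.
def parallel_map(items, workers: Etc.nprocessors, &block)
  items.each_slice(workers).flat_map do |batch|
    jobs = batch.map do |item|
      reader, writer = IO.pipe
      reader.binmode
      writer.binmode
      pid = fork do
        reader.close
        writer.write(Marshal.dump(block.call(item)))
        writer.close
        exit!(0) # leave the child without running the parent's at_exit hooks
      end
      writer.close # the parent only reads
      [pid, reader]
    end
    jobs.map do |pid, reader|
      result = Marshal.load(reader.read)
      reader.close
      Process.wait(pid)
      result
    end
  end
end

# Each task is independent: count G and C bases in one sequence.
gc_counts = parallel_map(%w[GATTACA GGCC ATAT]) { |seq| seq.count('GC') }
p gc_counts
```

Because the tasks never talk to each other, the same pattern scales from local cores to a batch queue: only the mechanism that launches the tasks changes.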
Start thinking in terms of MapReduce (old hat for Lisp programmers!)
Image source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result)); [1]
map(String key, String value):
  // key: Sequence alignment file name
  // value: multiple alignment
  for each exon w in value:
    EmitIntermediate(w, CpGIndex);

reduce(String key, Iterator values):
  // key: an exon
  // values: a list of CpG Index Values
  float result = 0;
  for each v in values:
    result += ParseFloat(v);
  Emit(AsString(result / length(values))); [1]
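The pseudocode above can be tried out in a single process. This is a minimal sketch, not a distributed implementation: the helper map_reduce and the lambdas are hypothetical, and the CpG index values are made-up illustration data.

```ruby
# Minimal in-process MapReduce: map every input, group the emitted
# [key, value] pairs by key (the "shuffle" step), then reduce each group.
def map_reduce(inputs, mapper:, reducer:)
  emitted = inputs.flat_map { |key, value| mapper.call(key, value) }
  grouped = emitted.group_by { |key, _value| key }
  grouped.map { |key, pairs| [key, reducer.call(key, pairs.map { |_key, value| value })] }.to_h
end

# Word count: emit [word, 1] per word, then sum the counts.
count_words = ->(_name, contents) { contents.split.map { |word| [word.downcase, 1] } }
sum_counts  = ->(_word, counts) { counts.sum }

documents = {
  'a.txt' => 'the quick brown fox',
  'b.txt' => 'The lazy dog and the fox'
}
word_counts = map_reduce(documents, mapper: count_words, reducer: sum_counts)
puts word_counts['the']

# Mean CpG index per exon: emit [exon, index] per alignment, then average.
emit_cpg = ->(_file, cpg_by_exon) { cpg_by_exon.to_a }
mean     = ->(_exon, values) { values.sum / values.length.to_f }

alignments = {
  'family1.aln' => { 'exon1' => 0.5, 'exon2' => 0.2 }, # made-up values
  'family2.aln' => { 'exon1' => 0.7 }
}
mean_cpg = map_reduce(alignments, mapper: emit_cpg, reducer: mean)
puts mean_cpg['exon1']
```

Since the mapper and reducer only see their own arguments, a framework is free to run them on different cores or machines.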
http://sourceforge.net/projects/cloudburst-bio/
MapReduce Implementations
http://hadoop.apache.org/core/
http://skynet.rubyforge.org/
http://discoproject.org/
http://labs.trolltech.com/page/Projects/Threads/QtConcurrent
Embracing Hardware
Single Instruction, Multiple Data (SIMD)
Graphics Processing Unit (GPU): Not just fun and games
GPU Programming is Getting Easier
Compute Unified Device Architecture (CUDA): http://www.nvidia.com/cuda
OpenCL: http://s08.idav.ucdavis.edu/munshi-opencl.pdf
Field Programmable Gate Arrays (FPGA)
Playing nice with others
Data Interchange Formats
• JSON
• YAML
• XML
• Microformats
• RDF
person = {
  "name": "Joel Dudley",
  "age": 32,
  "height": 1.83,
  "urls": [
    "http://www.joeldudley.com/",
    "http://www.linkedin.com/in/joeldudley"
  ]
}

VS.

<person>
  <name>Joel Dudley</name>
  <age>32</age>
  <height>1.83</height>
  <urls>
    <url>http://www.joeldudley.com/</url>
    <url>http://www.linkedin.com/in/joeldudley</url>
  </urls>
</person>
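Part of JSON's appeal is that it maps directly onto native data structures. As a small sketch using Ruby's standard json library, the record from the slide parses into a Hash with typed values and serializes back without a custom parser.

```ruby
require 'json'

json_text = <<~JSON_TEXT
  {
    "name": "Joel Dudley",
    "age": 32,
    "height": 1.83,
    "urls": [
      "http://www.joeldudley.com/",
      "http://www.linkedin.com/in/joeldudley"
    ]
  }
JSON_TEXT

# Parsing yields plain Ruby objects: Hash, Array, String, Integer, Float.
person = JSON.parse(json_text)
puts person['name']
puts person['urls'].length

# Round trip: a Hash of basic types serializes back to JSON text.
puts JSON.generate(person)
```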
Web Services
• Remote Procedure Call (RPC)
• Representational State Transfer (REST)
• SOAP
• ActiveResource Pattern
class Video < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"

  ## To search by categories and tags
  def self.search_by_tags(*options)
    from_urls = []
    if options.last.is_a? Hash
      excludes = options.slice!(options.length-1)
      if excludes[:exclude].kind_of? Array
        from_urls << excludes[:exclude].map{|keyword| "-"+keyword}.join("/")
      else
        from_urls << "-"+excludes[:exclude]
      end
    end
    from_urls << options.find_all{|keyword| keyword =~ /^[a-z]/}.join("/")
    from_urls << options.find_all{|category| category =~ /^[A-Z]/}.join("%7C")
    from_urls.delete_if {|x| x.empty?}
    self.find(:all,:from=>"/feeds/api/videos/-/"+from_urls.reverse.join("/"))
  end
end

class User < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Standardfeed < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Playlist < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

search = Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
puts search.entry.length

## video information of id = ZTUVgYoeN_o
vid = Video.find("ZTUVgYoeN_o")
puts vid.group.content[0].url

## video comments
comments = Video.find_custom("ZTUVgYoeN_o").get(:comments)
puts comments.entry[0].link[2].href

## searching with category/tags
results = Video.search_by_tags("Comedy")
puts results[0].entry[0].title
# more examples:
# Video.search_by_tags("Comedy", "dog")
# Video.search_by_tags("News","Sports","football", :exclude=>"soccer")
Teamwork
Be Agile
Manifesto for Agile Software Development
We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:
• Individuals and interactions over processes and tools
• Working software over comprehensive documentation
• Customer collaboration over contract negotiation
• Responding to change over following a plan
That is, while there is value in the items on the right, we value the items on the left more.
http://agilemanifesto.org/
Be Agile
As a [role], I want to [goal], so I can [reason].
Storyboard
Feedback
Iterate!
Unit Testing
Acceptance Testing
Automate Development
http://nant.sourceforge.net/
http://www.capify.org/
http://www.scons.org/
Lightweight Tools for Project Management
Closing Remarks
• Focus on the goal (Biology/Medicine)
• Don’t be clever (you’ll trick yourself)
• Value your time
• Outsource everything but genius
• Use the tools available to you
• Have fun!