Tips & Tricks for Software Engineering in Bioinformatics Presented by: Joel Dudley

Tips And Tricks For Bioinformatics Software Engineering

DESCRIPTION

This is a talk I've given twice at Stanford recently. It's essentially a brain dump of my thoughts on being a bioinformatician, with lots of links to useful tools.

Page 1: Tips And Tricks For Bioinformatics Software Engineering

Tips & Tricks for Software Engineering in Bioinformatics

Presented by: Joel Dudley

Page 2: Tips And Tricks For Bioinformatics Software Engineering

Who is this guy?

Page 3: Tips And Tricks For Bioinformatics Software Engineering

[Figure: average time spent programming (hours, 0 to 10) plotted against age (years, 5 to 32)]

Page 4: Tips And Tricks For Bioinformatics Software Engineering

http://www.megasoftware.net

Page 5: Tips And Tricks For Bioinformatics Software Engineering

Kumar S. and Dudley J. “Bioinformatics software for biologists in the genomics era.” Bioinformatics (2007) vol. 23 (14) pp. 1713-7

Page 6: Tips And Tricks For Bioinformatics Software Engineering

Bioinformatics Philosophy

Page 7: Tips And Tricks For Bioinformatics Software Engineering

Build Your Toolbox

Page 8: Tips And Tricks For Bioinformatics Software Engineering

Learn UNIX!

Page 9: Tips And Tricks For Bioinformatics Software Engineering

Be a jack of all trades, but master of one.

http://oreilly.com/news/graphics/prog_lang_poster.pdf

Page 10: Tips And Tricks For Bioinformatics Software Engineering

C/C++, PHP, R, Perl, Python, Ruby, Java, Lisp, VB

Page 11: Tips And Tricks For Bioinformatics Software Engineering

Java is not just for Java

http://www.jython.org

http://jruby.codehaus.org

Page 12: Tips And Tricks For Bioinformatics Software Engineering

Simplified Wrapper and Interface Generator (SWIG)

http://www.swig.org/

Greasy-fast C library

Doughy-soft scripting language

Page 13: Tips And Tricks For Bioinformatics Software Engineering

Frameworks are Friends

BioBike

Page 14: Tips And Tricks For Bioinformatics Software Engineering

Stand on the slumped, dandruff-covered shoulders of millions of computer nerds.

Page 15: Tips And Tricks For Bioinformatics Software Engineering
Page 16: Tips And Tricks For Bioinformatics Software Engineering

Don’t trust yourself (or your hard disk).

Page 17: Tips And Tricks For Bioinformatics Software Engineering

#!/usr/bin/perl
# 472-byte qrpff, Keith Winstein and Marc Horowitz <[email protected]>
# MPEG 2 PS VOB file -> descrambled output on stdout.
# usage: perl -I <k1>:<k2>:<k3>:<k4>:<k5> qrpff
# where k1..k5 are the title key bytes in least to most-significant order

s''$/=\2048;while(<>){G=29;R=142;if((@a=unqT="C*",_)[20]&48){D=89;_=unqb24,qT,@b=map{ord qB8,unqb8,qT,_^$a[--D]}@INC;s/...$/1$&/;Q=unqV,qb25,_;H=73;O=$b[4]<<9|256|$b[3];Q=Q>>8^(P=(E=255)&(Q>>12^Q>>4^Q/8^Q))<<17,O=O>>8^(E&(F=(S=O>>14&7^O)^S*8^S<<6))<<9,_=(map{U=_%16orE^=R^=110&(S=(unqT,"\xb\ntd\xbz\x14d")[_/16%8]);E^=(72,@z=(64,72,G^=12*(U-2?0:S&17)),H^=_%64?12:0,@z)[_%8]}(16..271))[_]^((D>>=8)+=P+(~F&E))for@a[128..$#a]}print+qT,@a}';s/[D-HO-U_]/\$$&/g;s/q/pack+/g;eval

Don’t be afraid to use more than three letters to define a variable!

Page 18: Tips And Tricks For Bioinformatics Software Engineering

Architecture

Accomplishment

Object-Oriented Software Design Decisions

Page 19: Tips And Tricks For Bioinformatics Software Engineering

module GraphBuilder
  LINE_TYPES = [:solid,:dashed,:dotted]

  module Nodes
    SHAPE_TYPES = [:rectangle,:roundrectangle,:ellipse,:parallelogram,:hexagon,:octagon,:diamond,:triangle,:trapezoid,:trapezoid2,:rectangle3d]

    class BaseNode
      attr_accessor :label,:geometry,:fill_colors,:outline,:degree,:data

      def initialize(opts={})
        @opts = {
          :form=>:ellipse,
          :height=>50.0,
          :width=>50.0,
          :label=>"GraphNode#{self.object_id}",
          :line_type=>:solid,
          :fill_color => {:R=>255,:G=>204,:B=>0,:A=>255},
          :fill_color2 => nil,
          :data => {},
          :outline_color=>{:R=>0,:G=>0,:B=>0,:A=>255}, # Set to nil or {:R=>0,:G=>0,:B=>0,:A=>0} for no outline
        }.merge(opts)
        @data = @opts[:data] # for storing application-specific data
        @label = Labels::NodeLabel.new(@opts[:label])
        @geometry = {:pos_x=>0.0,:pos_y=>0.0,:width=>1.0,:height=>1.0}
        @fill_colors = [@opts[:fill_color],nil]
        @outline = {:line_type=>@opts[:line_type],:color=>@opts[:outline_color]}
        @degree = {:in=>0,:out=>0}
      end

      def clone_params
        {
          :label=>text,
          :fill_color=>@fill_colors.first,
          :form=>@form,
          :height=>@geometry[:height],
          :width=>@geometry[:width]
        }
      end
    end

    class ShapeNode < BaseNode
      attr_accessor :form

      def initialize(opts={})
        super
        @form = @opts[:form]
        @geometry[:height] = @opts[:height]
        @geometry[:width] = @opts[:width]
      end

Page 20: Tips And Tricks For Bioinformatics Software Engineering

To Subclass or not to subclass? Use mixins!

require 'bigdecimal/math' # provides BigMath, used as the fallback in geometric_mean

class Array
  def arithmetic_mean
    self.inject(0.0) { |sum,x| x = x.real if x.is_a?(Complex); sum + x.to_f } / self.length.to_f
  end

  def geometric_mean
    begin
      Math.exp(self.select { |x| x > 0.0 }.collect { |x| Math.log(x) }.arithmetic_mean)
    rescue Errno::ERANGE
      Math.exp(self.select { |x| x > 0.0 }.collect { |x| BigMath.log(x,50) }.arithmetic_mean)
    end
  end

  def median
    sorted = self.sort # the middle element is only the median once the values are ordered
    if sorted.length.odd?
      sorted[sorted.length / 2]
    else
      upper_median = sorted[sorted.length / 2]
      lower_median = sorted[(sorted.length / 2) - 1]
      [upper_median,lower_median].arithmetic_mean
    end
  end

  def standard_deviation
    mean = self.arithmetic_mean
    deviations = self.map { |x| x - mean }
    sqr_deviations = deviations.map { |x| x**2 }
    sum_sqr_deviations = sqr_deviations.inject(0.0) { |sum,x| sum + x }
    Math.sqrt(sum_sqr_deviations/(self.length - 1).to_f)
  end
  alias_method :sd, :standard_deviation

  def shuffle
    sort_by { rand }
  end

  def shuffle!
    self.replace shuffle
  end
end

Page 21: Tips And Tricks For Bioinformatics Software Engineering

Documenting code sucks! Automate it.

• Come up with a convention for your “headers”

• Use automated documentation generation tools

  • JavaDoc

  • RDoc

  • Pydoc / Epydoc

• Save code snippets in a searchable repository
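To illustrate the header convention and the generator tools together, here is a minimal sketch in Python, one of the languages the talk lists. The function `gc_content` and its docstring layout are invented for this example; the point is only that pydoc can turn a consistent header into reference text without extra work.

```python
import pydoc


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence.

    Args:
        sequence: DNA string; case is ignored.

    Returns:
        A float between 0.0 and 1.0, or 0.0 for an empty sequence.
    """
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# pydoc builds the reference text straight from the docstring above.
documentation = pydoc.render_doc(gc_content, renderer=pydoc.plaintext)
print(documentation)
```

Running `python -m pydoc` on a module produces the same text from the command line.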

Page 22: Tips And Tricks For Bioinformatics Software Engineering

A little performance optimization goes a long way

• General tools

  • DTrace

  • strace

  • gdb

• Language specific

  • Ruby-prof

  • Psyco/Pyrex

  • JBoss Profiler/JIT
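As a rough illustration of measuring before optimizing, the sketch below profiles a small k-mer counter with Python's standard-library `cProfile`. This is a stand-in: `cProfile` is not one of the tools named on the slide, and `count_kmers` is a hypothetical workload chosen only to give the profiler something to measure.

```python
import cProfile
import io
import pstats


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence."""
    counts = {}
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


profiler = cProfile.Profile()
profiler.enable()
kmer_counts = count_kmers("ACGT" * 50_000, 3)
profiler.disable()

# Summarize the five most expensive calls by cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report shows where the time actually goes, which is usually a better guide than intuition about which line is slow.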

Page 23: Tips And Tricks For Bioinformatics Software Engineering

Working with data

Page 24: Tips And Tricks For Bioinformatics Software Engineering

# Copyright © 1996-2007 SRI International, Marine Biological Laboratory, DoubleTwist Inc.,
# The Institute for Genomic Research, J. Craig Venter Institute, University of California at San Diego, and UNAM. All Rights Reserved.
#
#
# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com/ptools/flatfile-format.html .
#
# Species: E. coli K-12
# Database: EcoCyc
# Version: 11.5
# File Name: dnabindsites.dat
# Date and time generated: August 6, 2007, 17:32:33
#
# Attributes:
#    UNIQUE-ID
#    TYPES
#    COMMON-NAME
#    ABS-CENTER-POS
#    APPEARS-IN-BINDING-REACTIONS
#    CITATIONS
#    COMMENT
#    COMPONENT-OF
#    COMPONENTS
#    CREDITS
#    DATA-SOURCE
#    DBLINKS
#    INSTANCE-NAME-TEMPLATE
#    INVOLVED-IN-REGULATION
#    LEFT-END-POSITION
#    REGULATED-PROMOTER
#    RELATIVE-CENTER-DISTANCE
#    RIGHT-END-POSITION
#    SYNONYMS
#
UNIQUE-ID - BS86
TYPES - DNA-Binding-Sites
ABS-CENTER-POS - 4098761
CITATIONS - 94018613
CITATIONS - 94018613:EV-EXP-IDA-BINDING-OF-CELLULAR-EXTRACTS:3310246267:martin
CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
COMPONENT-OF - TU00064
INVOLVED-IN-REGULATION - REG0-5521
TYPE-OF-EVIDENCE - :BINDING-OF-CELLULAR-EXTRACTS
//
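To show what working with a flat file like this involves, here is a minimal parsing sketch. It assumes only what the excerpt shows: `#` comment lines, one `ATTRIBUTE - value` pair per line, repeatable attributes, and `//` closing each record. Anything else the real format allows (such as multi-line values) is not handled, and `parse_flatfile` is a hypothetical helper, not part of any EcoCyc tooling.

```python
def parse_flatfile(text):
    """Parse attribute-value records separated by '//' lines.

    Returns a list of dicts mapping each attribute name to a list of
    values, because attributes such as CITATIONS can repeat.
    """
    records = []
    current = {}
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        attribute, _, value = line.partition(" - ")
        current.setdefault(attribute, []).append(value)
    if current:
        records.append(current)  # tolerate a missing final '//'
    return records


sample = """\
# Species: E. coli K-12
UNIQUE-ID - BS86
TYPES - DNA-Binding-Sites
ABS-CENTER-POS - 4098761
CITATIONS - 94018613
CITATIONS - 14711822:EV-COMP-AINF-SIMILAR-TO-CONSENSUS:3310246267:martin
COMPONENT-OF - TU00064
//
"""

records = parse_flatfile(sample)
print(records[0]["UNIQUE-ID"])       # ['BS86']
print(len(records[0]["CITATIONS"]))  # 2
```

Every script that touches the file has to repeat this parsing step, and every value arrives as an untyped string.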

Page 25: Tips And Tricks For Bioinformatics Software Engineering

http://www.oracle.com/technology/products/berkeley-db/index.html

If you can represent most of your data as key/value pairs, then at the very least use a BerkeleyDB
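A small sketch of the key/value approach follows. Berkeley DB itself usually needs a third-party binding, so this example substitutes Python's standard-library `dbm` module, which offers the same on-disk key/value model using whichever DBM backend is installed. The gene annotations are invented for illustration.

```python
import dbm
import os
import tempfile

# Hypothetical gene-symbol -> description pairs, used only for illustration.
annotations = {
    "lacZ": "beta-galactosidase",
    "recA": "DNA recombination and repair protein",
}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "annotations")

    # "c" creates the database file if it does not exist yet.
    with dbm.open(path, "c") as store:
        for gene, description in annotations.items():
            # Keys and values are stored as bytes.
            store[gene.encode("utf-8")] = description.encode("utf-8")

    # Reopen read-only: a lookup goes to the disk-backed store,
    # with no need to parse a whole file into memory first.
    with dbm.open(path, "r") as store:
        looked_up = store[b"recA"].decode("utf-8")

print(looked_up)
```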

Page 26: Tips And Tricks For Bioinformatics Software Engineering

In most cases a relational database is an appropriate choice for bioinformatics data

• Clean and consolidated (vs. a rat's nest of files and folders)

• Improved performance (memory usage and file I/O)

• Data consistency through constraints and transactions

• Easily portable (SQL-92 standard)

• Querying (asking questions about data) vs. parsing (reading and loading data)

• Commonly used data processing functions can be implemented as stored procedures

Page 27: Tips And Tricks For Bioinformatics Software Engineering

“But I’m a scientist, not a DBA! Harrumph!”

“...SQLite is a software library that implements a self-contained, serverless, zero-configuration, transactional SQL database engine...”

http://www.sqlite.org
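The sketch below shows how little setup SQLite needs, using Python's standard-library `sqlite3` module and an in-memory database. The table names, gene, and positions are hypothetical; the example only illustrates constraints, a transaction, and querying instead of parsing.

```python
import sqlite3

connection = sqlite3.connect(":memory:")  # in-memory database, no server
connection.execute("PRAGMA foreign_keys = ON")
connection.executescript("""
    CREATE TABLE genes (
        id     INTEGER PRIMARY KEY,
        symbol TEXT NOT NULL UNIQUE
    );
    CREATE TABLE binding_sites (
        id         INTEGER PRIMARY KEY,
        gene_id    INTEGER NOT NULL REFERENCES genes(id),
        center_pos INTEGER NOT NULL CHECK (center_pos > 0)
    );
""")

# A transaction: either every row is stored or none is.
with connection:
    connection.execute("INSERT INTO genes (id, symbol) VALUES (1, 'lacZ')")
    connection.executemany(
        "INSERT INTO binding_sites (gene_id, center_pos) VALUES (?, ?)",
        [(1, 365500), (1, 365620)],  # made-up positions
    )

# Querying instead of parsing: ask a question, get an answer.
site_count = connection.execute(
    "SELECT COUNT(*) FROM binding_sites b JOIN genes g ON g.id = b.gene_id "
    "WHERE g.symbol = ?",
    ("lacZ",),
).fetchone()[0]
print(site_count)

# The foreign key rejects a site that points at a gene that does not exist.
try:
    with connection:
        connection.execute(
            "INSERT INTO binding_sites (gene_id, center_pos) VALUES (99, 10)"
        )
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True
print(constraint_enforced)
```

Replacing `":memory:"` with a file path gives a single-file database that can be copied around like any other data file.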

Page 28: Tips And Tricks For Bioinformatics Software Engineering

But seriously, don’t write any SQL (What?)

Object Relational Mapper (ORM)

Relational Database (MySQL, PostgreSQL, Oracle, etc.)

Model

Instance
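To make the model/instance idea concrete, here is a deliberately tiny, hand-rolled mapper over SQLite. It is a toy written for this example, not a real ORM library: the `Mapper` class and `Gene` model are hypothetical, and production ORMs add type handling, relationships, migrations, and safer SQL generation.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Gene:
    """Model: one class per table, one instance per row."""
    symbol: str
    chromosome: str


class Mapper:
    """Toy mapper that derives its SQL from a dataclass definition."""

    def __init__(self, connection, model):
        self.connection = connection
        self.model = model
        self.table = model.__name__.lower()
        self.columns = [field.name for field in fields(model)]
        column_list = ", ".join(f"{name} TEXT" for name in self.columns)
        connection.execute(
            f"CREATE TABLE IF NOT EXISTS {self.table} ({column_list})"
        )

    def save(self, instance):
        """Store one instance as one row."""
        placeholders = ", ".join("?" for _ in self.columns)
        values = [getattr(instance, name) for name in self.columns]
        self.connection.execute(
            f"INSERT INTO {self.table} VALUES ({placeholders})", values
        )

    def find_by(self, **criteria):
        """Return instances whose columns equal the given values."""
        where = " AND ".join(f"{name} = ?" for name in criteria)
        column_list = ", ".join(self.columns)
        rows = self.connection.execute(
            f"SELECT {column_list} FROM {self.table} WHERE {where}",
            list(criteria.values()),
        )
        return [self.model(*row) for row in rows]


connection = sqlite3.connect(":memory:")
genes = Mapper(connection, Gene)
genes.save(Gene("BRCA1", "17"))
genes.save(Gene("TP53", "17"))
genes.save(Gene("CFTR", "7"))

on_chr17 = genes.find_by(chromosome="17")
print(sorted(gene.symbol for gene in on_chr17))
```

The calling code never contains SQL: it creates instances, saves them, and asks for instances back.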

Page 29: Tips And Tricks For Bioinformatics Software Engineering

Beyond the RDBMS

http://strokedb.com/

http://www.hypertable.org

http://incubator.apache.org/couchdb

Page 30: Tips And Tricks For Bioinformatics Software Engineering

Thinking in Parallel

Page 31: Tips And Tricks For Bioinformatics Software Engineering

Loosely Coupled

• Each task is independent

• No synchronous inter-task communication

• Example: Computing a Maximum Likelihood Phylogeny for every gene family in the Panther Database

• Software: OpenPBS, SGE, Xgrid, Platform LSF

Tightly Coupled

• Tasks are interdependent

• Synchronous inter-task communication via messaging interface

• Example: Monte Carlo simulation of 3D protein interactions in cytoplasm

• Software: OpenMPI, MPICH, PVM

Page 32: Tips And Tricks For Bioinformatics Software Engineering

Use your idle CPU cores!
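A minimal sketch of putting idle cores to work on independent tasks, using Python's standard-library `concurrent.futures`. The GC-content task and the helper names are hypothetical stand-ins for a heavier per-sequence computation.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def gc_fraction(sequence):
    """CPU-bound stand-in task: fraction of G/C bases in one sequence."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_fractions(sequences, executor_class=ProcessPoolExecutor):
    """Score independent sequences in parallel, one worker per CPU core."""
    workers = os.cpu_count() or 1
    with executor_class(max_workers=workers) as executor:
        # map() returns results in input order, whatever order tasks finish in.
        return list(executor.map(gc_fraction, sequences))


if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on import.
    print(gc_fractions(["GGCC", "ATAT", "ATGC"]))
```

Processes are the default here because each one can occupy its own core; passing `executor_class=ThreadPoolExecutor` is usually the better fit when tasks mostly wait on disk or network.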

Page 33: Tips And Tricks For Bioinformatics Software Engineering

Start thinking in terms of MapReduce (old hat for Lisp programmers!)

Image source: http://code.google.com/edu/parallel/mapreduce-tutorial.html

Page 34: Tips And Tricks For Bioinformatics Software Engineering

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

[1]

Page 35: Tips And Tricks For Bioinformatics Software Engineering

map(String key, String value):
  // key: sequence alignment file name
  // value: multiple alignment
  for each exon w in value:
    EmitIntermediate(w, CpGIndex);

reduce(String key, Iterator values):
  // key: an exon
  // values: a list of CpG index values
  double result = 0;
  for each v in values:
    result += ParseFloat(v);
  Emit(AsString(result / length(values)));

[1]
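The pseudocode above can be sketched as runnable single-process Python. Two assumptions are made here that the slide does not state: "CpG index" is taken to mean the observed/expected CpG ratio, (CG count × length) / (C count × G count), and each alignment is represented as a list of (exon, sequence) rows. A real MapReduce framework would distribute the map and reduce calls across machines; this version only shows the data flow.

```python
from collections import defaultdict


def cpg_index(sequence):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    sequence = sequence.upper()
    c_count, g_count = sequence.count("C"), sequence.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return sequence.count("CG") * len(sequence) / (c_count * g_count)


def map_alignment(filename, alignment):
    """Map step: emit one (exon, CpG index) pair per aligned sequence."""
    for exon, sequence in alignment:
        yield exon, cpg_index(sequence)


def reduce_exon(exon, values):
    """Reduce step: average the CpG index values collected for one exon."""
    return exon, sum(values) / len(values)


def map_reduce(inputs):
    """Run map, group intermediate pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for filename, alignment in inputs.items():
        for key, value in map_alignment(filename, alignment):
            grouped[key].append(value)
    return dict(reduce_exon(key, values) for key, values in grouped.items())


# Hypothetical input: alignment file name -> list of (exon, sequence) rows.
inputs = {
    "family1.aln": [("exon1", "CGCG"), ("exon2", "ATAT")],
    "family2.aln": [("exon1", "ACGT")],
}
averages = map_reduce(inputs)
print(averages)  # {'exon1': 3.0, 'exon2': 0.0}
```

Because each map call depends only on its own input and each reduce call only on one key's values, both phases can be spread over as many workers as are available.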

Page 36: Tips And Tricks For Bioinformatics Software Engineering

http://sourceforge.net/projects/cloudburst-bio/

Page 37: Tips And Tricks For Bioinformatics Software Engineering

MapReduce Implementations

http://hadoop.apache.org/core/

http://skynet.rubyforge.org/

http://discoproject.org/

http://labs.trolltech.com/page/Projects/Threads/QtConcurrent

Page 38: Tips And Tricks For Bioinformatics Software Engineering

Embracing Hardware

Page 39: Tips And Tricks For Bioinformatics Software Engineering

Single Instruction, Multiple Data (SIMD)

Page 40: Tips And Tricks For Bioinformatics Software Engineering

Graphics Processing Unit (GPU): Not just fun and games

Page 41: Tips And Tricks For Bioinformatics Software Engineering
Page 42: Tips And Tricks For Bioinformatics Software Engineering

GPU Programming is Getting Easier

Compute Unified Device Architecture (CUDA)

http://www.nvidia.com/cuda

OpenCL

http://s08.idav.ucdavis.edu/munshi-opencl.pdf

Page 43: Tips And Tricks For Bioinformatics Software Engineering
Page 44: Tips And Tricks For Bioinformatics Software Engineering

Field Programmable Gate Arrays (FPGA)

Page 45: Tips And Tricks For Bioinformatics Software Engineering

Field Programmable Gate Arrays (FPGA)

Page 46: Tips And Tricks For Bioinformatics Software Engineering

Playing nice with others

Page 47: Tips And Tricks For Bioinformatics Software Engineering

Data Interchange Formats

• JSON

• YAML

• XML

• Microformats

• RDF

Page 48: Tips And Tricks For Bioinformatics Software Engineering

person = {
  "name": "Joel Dudley",
  "age": 32,
  "height": 1.83,
  "urls": [
    "http://www.joeldudley.com/",
    "http://www.linkedin.com/in/joeldudley"
  ]
}

VS.

<person>
  <name>Joel Dudley</name>
  <age>32</age>
  <height>1.83</height>
  <urls>
    <url>http://www.joeldudley.com/</url>
    <url>http://www.linkedin.com/in/joeldudley</url>
  </urls>
</person>
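One practical difference between the two encodings shows up when reading them back. The sketch below loads the same person record from both formats with Python's standard library; the variable names are chosen for this example.

```python
import json
import xml.etree.ElementTree as ET

json_text = """
{"name": "Joel Dudley", "age": 32, "height": 1.83,
 "urls": ["http://www.joeldudley.com/",
          "http://www.linkedin.com/in/joeldudley"]}
"""

xml_text = """
<person>
  <name>Joel Dudley</name>
  <age>32</age>
  <height>1.83</height>
  <urls>
    <url>http://www.joeldudley.com/</url>
    <url>http://www.linkedin.com/in/joeldudley</url>
  </urls>
</person>
"""

# JSON maps directly onto dicts, lists, numbers, and strings.
person_from_json = json.loads(json_text)

# XML yields a tree of text nodes; types must be converted by hand.
root = ET.fromstring(xml_text.strip())
person_from_xml = {
    "name": root.findtext("name"),
    "age": int(root.findtext("age")),
    "height": float(root.findtext("height")),
    "urls": [url.text.strip() for url in root.iter("url")],
}

print(person_from_json == person_from_xml)  # True
```

The JSON side is one call, while the XML side needs a field-by-field conversion that must be kept in sync with the document layout.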

Page 49: Tips And Tricks For Bioinformatics Software Engineering

Web Services

• Remote Procedure Call (RPC)

• Representational State Transfer (REST)

• SOAP

• ActiveResource Pattern

Page 50: Tips And Tricks For Bioinformatics Software Engineering

class Video < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"

  ## To search by categories and tags
  def self.search_by_tags(*options)
    from_urls = []
    if options.last.is_a? Hash
      excludes = options.slice!(options.length-1)
      if excludes[:exclude].kind_of? Array
        from_urls << excludes[:exclude].map{|keyword| "-"+keyword}.join("/")
      else
        from_urls << "-"+excludes[:exclude]
      end
    end
    from_urls << options.find_all{|keyword| keyword =~ /^[a-z]/}.join("/")
    from_urls << options.find_all{|category| category =~ /^[A-Z]/}.join("%7C")
    from_urls.delete_if {|x| x.empty?}
    self.find(:all,:from=>"/feeds/api/videos/-/"+from_urls.reverse.join("/"))
  end
end

class User < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Standardfeed < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

class Playlist < ActiveYouTube
  self.site = "http://gdata.youtube.com/feeds/api"
end

Page 51: Tips And Tricks For Bioinformatics Software Engineering

search = Video.find(:first, :params => {:vq => 'ruby', :"max-results" => '5'})
puts search.entry.length

## video information of id = ZTUVgYoeN_o
vid = Video.find("ZTUVgYoeN_o")
puts vid.group.content[0].url

## video comments
comments = Video.find_custom("ZTUVgYoeN_o").get(:comments)
puts comments.entry[0].link[2].href

## searching with category/tags
results = Video.search_by_tags("Comedy")
puts results[0].entry[0].title
# more examples:
# Video.search_by_tags("Comedy", "dog")
# Video.search_by_tags("News","Sports","football", :exclude=>"soccer")

Page 52: Tips And Tricks For Bioinformatics Software Engineering

Teamwork

Page 53: Tips And Tricks For Bioinformatics Software Engineering

Be Agile

Manifesto for Agile Software Development

We are uncovering better ways of developing software by doing it and helping others do it. Through this work we have come to value:

• Individuals and interactions over processes and tools

• Working software over comprehensive documentation

• Customer collaboration over contract negotiation

• Responding to change over following a plan

That is, while there is value in the items on the right, we value the items on the left more.

http://agilemanifesto.org/

Page 54: Tips And Tricks For Bioinformatics Software Engineering

Be Agile

As a [role], I want to [goal], so I can [reason].

Storyboard

Feedback

Iterate!

Unit Testing

Acceptance Testing
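As one way to connect a user story to unit tests, the sketch below uses Python's standard-library `unittest`. The story, the `reverse_complement` function, and the test names are all invented for this example.

```python
import unittest


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(sequence.upper()))


class ReverseComplementTest(unittest.TestCase):
    """Story: as a biologist, I want the opposite strand of my sequence,
    so I can design a reverse primer."""

    def test_known_sequence(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")

    def test_applying_twice_restores_the_input(self):
        twice = reverse_complement(reverse_complement("GATTACA"))
        self.assertEqual(twice, "GATTACA")

    def test_rejects_non_dna_characters(self):
        with self.assertRaises(KeyError):
            reverse_complement("ATGX")


# Run the tests explicitly so the outcome is available as a value.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReverseComplementTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

In a project layout the same test class would normally live in its own file and be run with `python -m unittest` on every iteration.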

Page 55: Tips And Tricks For Bioinformatics Software Engineering

Automate Development

http://nant.sourceforge.net/

http://www.capify.org/

http://www.scons.org/

Page 56: Tips And Tricks For Bioinformatics Software Engineering

Lightweight Tools for Project Management

Page 57: Tips And Tricks For Bioinformatics Software Engineering

Closing Remarks

• Focus on the goal (Biology/Medicine)

• Don’t be clever (you’ll trick yourself)

• Value your time

• Outsource everything but genius

• Use the tools available to you

• Have fun!