Bio IT World Europe 2010

Cloud Computing and Genomics

Richard Holland
BioITWorld Europe 2010 Cloud Computing Workshop

Cloud Computing and Genomics

● Company background
● Who we support
● Sequence analysis in the cloud
● Genomic data in the cloud
● Cloud summary

Raison d'être

● Founded by 3 ex-Ensembl staff.

● Commercial services for open-source bioinformatics.

● Support open-source through collaborations.

● RedHat of bioinformatics.

What we do

● Help you choose the right solution for the problem.

● Right tools for the right jobs.
● For any chosen open-source system:
  ✔ Support and train.
  ✔ Modify, extend, and improve.
  ✔ Integrate private data/systems.
  ✔ Manage private mirrors.

What we do

● Genomic data analysis pipelines
  ✔ SNP calling
  ✔ miRNA detection
  ✔ Probe mapping
  ✔ Assembly and annotation
  ✔ Etc.

● Results integrated into other systems.

Open-source

● Used wherever possible.
● Produced wherever possible.
● Contribute all changes.
● Cutting-edge.
● Quality of concept (not necessarily code).
● Rapidly adjustable.

Cloud

● Low-cost for ad-hoc work.

● Scalable.
● Low-maintenance.
● More secure.
● Accessible.
● No licences.
● No installation.


Ensembl®

● Not only a genome browser, but also:
  ● Genome databases.
  ● Perl API and BioMart data warehouse.

● Collaboration agreement with EMBLEM.
● Exchange knowledge.
● Share revenue.

● EBI are working with Amazon.

Ensembl is a registered trademark of Genome Research Ltd., and is developed by WTSI/EBI.

Taverna

● Workflow design and execution.
● Collaboration agreement with the University of Manchester (in progress).
● Exchange knowledge.
● Share revenue.

● Taverna are actively researching cloud deployment.

TraitTag

● Trait detection in seedlings.
● Avoid waiting for adult plants.
● Collaboration agreement with Plant Bioscience Ltd. (John Innes Centre).
● Exchange knowledge.
● Share revenue.

● Already cloud-capable.

Others

● Other knowledge areas:
  ● O|B|F Bio* projects
    – e.g. BioJava, BioPerl, BioSQL, etc.
  ● BioMart
  ● DebianMed

● No formal collaborations with these.

Big Pharma

● Drug development.
● Pharmacogenomics.
● Software-as-a-service.
● Operational bioinformatics.

● Traditionally suspicious of both open-source and the cloud. This is changing.

Agri-biotech

● Chemical development.
● Seed development.
● Software-as-a-service.
● Operational bioinformatics.

● Into open-source (DuPont has BioPerl people), ready to explore the cloud.

Biotech SMEs

● Anything and everything.
● Highly specialised.
● Poorly funded.
● Lack in-house bioinformatics.
● Many don't even know what it is.
● One-off analyses and consulting projects.

● Love anything that saves them money.

Others

● CROs.
● Universities.
● Government institutes.
● Animal health an emerging area.

● Public bodies are hardest to convince about open-source and the cloud.

● Funding agencies often the cause.


eHive

● Used for all Eagle's data analysis pipelines.
● Developed by the Ensembl Compara project.
● Generic framework (yet another one?).
● Perl.
● Ensembl-aware but independent.
● Scalable and robust.
● Open-source.

eHive

Severin et al. eHive: An Artificial Intelligence workflow system for genomic analysis. BMC Bioinformatics 2010, 11:240

eHive

● Works out-of-the-box on LSF.
● Can run standalone on a single machine.
● Modified to work on Condor and SGE.
● Modified to work on Amazon EC2 without needing Condor/SGE/LSF etc. (crafty trick)

eHive

● Standalone pipelines.
● Packaged into VMs (black-box appliances).
● Doesn't scale well.
● Cumbersome to distribute.
● What about the cloud?

eHive

● Easy to set up same pipeline and interface on EC2.

● But doesn't solve scaling, and clusters are hard to implement in the cloud.

● Crafty trick – self-replicating self-terminating instances.

● Scales to as many parallel instances as required (up to a preset limit).

● Keeps idle instances alive until just short of the end of the billed hour, just in case more work arrives.
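The scaling behaviour described on this slide can be sketched as two small decision functions. This is only an illustrative sketch, not Eagle's actual implementation: the function names, the `jobs_per_instance` parameter and the five-minute shutdown margin are assumptions made for the example. It relies on EC2 charging per started instance-hour, as it did at the time, so an idle worker costs nothing extra until its current hour runs out.

```python
import math

BILLING_PERIOD_S = 3600   # EC2 charged per started instance-hour at the time
SHUTDOWN_MARGIN_S = 300   # assumed safety margin before the next billed hour


def instances_to_launch(pending_jobs, running, max_instances, jobs_per_instance=1):
    """Return how many extra workers to start, capped at the preset limit."""
    wanted = math.ceil(pending_jobs / jobs_per_instance)
    return max(0, min(wanted, max_instances) - running)


def should_terminate(idle, uptime_s):
    """An idle worker shuts itself down only just before its next billed hour."""
    if not idle:
        return False
    seconds_into_hour = uptime_s % BILLING_PERIOD_S
    return seconds_into_hour >= BILLING_PERIOD_S - SHUTDOWN_MARGIN_S


if __name__ == "__main__":
    # 120 queued jobs, 10 workers already up, limit of 50 workers.
    print(instances_to_launch(120, 10, 50))
    # Idle worker 10 minutes into its hour: stays alive in case work arrives.
    print(should_terminate(True, 600))
```

Each worker could evaluate both functions on every loop: the first when it sees a backlog, the second when it finds nothing to do.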

eHive on EC2 – an example

● A big pharma customer had some microarrays.

● Probe sequence data provided by vendors (Affymetrix, Agilent, etc.).

● Need high-quality mapping with quality scores and other metadata.

● Some chips have public mappings but not to the required standard/format.

● Many chips have no public mappings at all.


[Diagram. Labels: Probe sequences; Library of mapping pipelines; eHive+EC2; Extended Ens API; Ens funcgen db; Reports.]


● Getting data in/out no problem – very small.
● Mix-up with keys led to key-per-project.
● Blackboard and results MySQL performance.
● Needed to prove that the job management is working.
  ● Not firing up too many machines.
  ● Not forgetting to shut them down.
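One way to demonstrate the two job-management checks above (not too many machines, none left running) is to replay a log of launch and terminate events. The sketch below is hypothetical: the event tuple layout, the function name and the instance IDs are invented for illustration, and the source does not say how the proof was actually produced.

```python
def audit_instance_log(events, max_instances):
    """Replay (timestamp, action, instance_id) events and summarise them.

    action is "launch" or "terminate"; the layout is an assumption for this sketch.
    """
    running = set()
    peak = 0
    for _, action, instance_id in sorted(events):
        if action == "launch":
            running.add(instance_id)
            peak = max(peak, len(running))
        elif action == "terminate":
            running.discard(instance_id)
    return {
        "peak": peak,                            # most machines alive at once
        "within_limit": peak <= max_instances,   # never fired up too many
        "left_running": sorted(running),         # forgotten instances, if any
    }


if __name__ == "__main__":
    log = [
        (0, "launch", "i-a"),
        (1, "launch", "i-b"),
        (5, "terminate", "i-a"),
        (6, "launch", "i-c"),
        (9, "terminate", "i-b"),
    ]
    print(audit_instance_log(log, max_instances=2))
```

An empty `left_running` list and `within_limit` being true together give the evidence the slide asks for.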


● Scales nicely to about 50 machines.
● Beyond that MySQL is the bottleneck.
● Performance-tuning MySQL raises limit.
● Deals better with lots of small input files and/or input as a database table.
● S3 slow but still plenty fast enough for projects like this (1 or 2 extra hours in the context of 100s is not much).

● Really need multiple-mount-RO EBS.


● Customer data was sent to us encrypted, transferred to Amazon encrypted, only decrypted once inside Amazon.

● Results transferred out in the same way.
● All processing internal to Amazon.
● Usual steps to prevent access – firewall, stop unused daemons, disable logins, etc.


Hosted data services

● Genomic data can be big, but can also be small.

● Analysed data vs. raw data.
● Generic repositories vs. specialised resources.
● Organisations still sensitive about query intercept and log mining.
● Leads to in-house or secured third-party hosting.

Hosted data services

● Simplest resources can run in a VM.
● Ensembl web browser (but not database).
● Functional, effective, if a bit slow.
● VM files slow to download, expensive to ship.
● Can achieve the same effect using the cloud.
● Distribute AMIs, or host instances?

Hosted data services – an example

● A big pharma needs Ensembl in-house.
● Fed up with in-house maintenance.
● Wanted to try it in the cloud.
● Must run entirely within their own Amazon account.


[Diagram. Labels: Eagle a/c; Customer a/c; Public Ens DB; Private Ens DB; Database AMI; Browser AMI; Database instance; Browser instance.]


● Ensembl public data in us-east region.
● Migration to other regions can be slow and expensive.
● No guaranteed update schedule.
● Updating it ourselves slow and expensive.

● How to avoid having to maintain multiple instances for multiple customers?

Pistoia Sequence Services

● Pistoia Alliance is an industry alliance of big pharma and related companies.

● Sharing pre-competitive resources.
● Pistoia Sequence Services proof-of-concept
● To share Ensembl and related services.
● Eagle in partnership with Cognizant Technology Solutions were successful participants.


● Already had the solution – an Ensembl AMI.
● Need to make it capable of running more than just Ensembl – easy.
● Some of the extra services need small amounts of private data, secured and partitioned – fairly easy.

● Need to secure it, and scale it – pretty hard.


[Diagram. Labels: Users; Zeus load balancer with SSL; Usage and Billing; Ensembl; PlasMapper; Mapping service; etc.; Public mirror; Private data.]


● All user connections via SSL.
● Authentication using SSO against customer auth servers.
● OpenAM TrafficScript plugin for Zeus.
● SAML2 to LDAP, ActiveDirectory, etc.

● App servers firewalled.
● Load balancer connections only.
● No inter-connection except to RDBMS.
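The firewall rules on this slide amount to a small allow-list of connections between tiers. The sketch below models that allow-list in a few lines so the policy can be checked mechanically. The role names are hypothetical labels chosen for the example, the SSO traffic to customer auth servers is not modelled, and real enforcement would live in the firewall or security-group configuration rather than in application code.

```python
# Allowed (source, destination) pairs, using assumed role names.
ALLOWED_CONNECTIONS = {
    ("user", "load_balancer"),        # all user connections arrive via SSL
    ("load_balancer", "app_server"),  # app servers accept the load balancer only
    ("app_server", "rdbms"),          # the only onward connection an app server may make
}


def connection_allowed(src_role, dst_role):
    """Return True only if the pair is explicitly on the allow-list."""
    return (src_role, dst_role) in ALLOWED_CONNECTIONS


if __name__ == "__main__":
    for pair in [("user", "app_server"), ("app_server", "app_server"),
                 ("load_balancer", "app_server")]:
        print(pair, connection_allowed(*pair))
```

Default-deny is the design choice here: anything not listed, including app server to app server, is refused.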


● In future could add almost any sequence-related service that is/can be web-enabled.

● PoC runs until Christmas, with free access for all Pistoia members.
● [email protected]
● [email protected]

● Full commercial service launches Q1 2011 subject to demand and PoC success.




Cloud is bad

● Expensive (always-on solutions).
● Slow data transfer in/out (it's the internet).
● Not big (S3 5GB limit).
● Not fast (HTC not HPC).
● Myths and misconceptions.
● Over-hyped.
● Misused.

Cloud is good

● Scalable.
● Upgradeable.
● Secure.
● Low-cost (for ad-hoc work).
● Accessible.
● Standardised.
● Fast enough.
● Big enough.
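The contrast between "expensive (always-on solutions)" and "low-cost (for ad-hoc work)" comes down to simple arithmetic on billed instance-hours. The sketch below makes that arithmetic explicit. The hourly rate of 10 cents and the workload sizes are made-up figures for illustration, not real Amazon prices or numbers from the talk, and the function names are hypothetical.

```python
import math


def adhoc_cost_cents(instances, hours_each, rate_cents_per_hour):
    """Cost of a burst: each instance is billed for every started hour."""
    return instances * math.ceil(hours_each) * rate_cents_per_hour


def always_on_cost_cents(instances, days, rate_cents_per_hour):
    """Cost of keeping the same instances running around the clock."""
    return instances * days * 24 * rate_cents_per_hour


if __name__ == "__main__":
    rate = 10  # assumed price in cents per instance-hour
    burst = adhoc_cost_cents(instances=50, hours_each=1.5, rate_cents_per_hour=rate)
    month = always_on_cost_cents(instances=50, days=30, rate_cents_per_hour=rate)
    print(f"ad-hoc burst: {burst / 100:.2f}  always-on month: {month / 100:.2f}")
```

Working in whole cents keeps the sums exact, and rounding each instance up to a full hour reflects per-hour billing.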

Choose your tools to suit the job

● Don't assume your paradigm will translate.

● Rephrase or optimise.

● Do it right, reap the rewards.

● Do it wrong, better off not bothering.

Thanks

● Jason Stowe at CycleComputing.
● Matt Wood at Amazon.
● Cambridge Healthtech Institute.
● Our partners at Cognizant.
● Our collaborators at the EBI, WTSI, JIC, and University of Manchester.
● All producers of open-source bioinformatics software and data.

[email protected]

http://www.eaglegenomics.com/