Cloud Computing and Genomics
Richard Holland
BioITWorld Europe 2010 Cloud Computing Workshop
Cloud Computing and Genomics
● Company background
● Who we support
● Sequence analysis in the cloud
● Genomic data in the cloud
● Cloud summary
Raison d'être
● Founded by 3 ex-Ensembl staff.
● Commercial services for open-source bioinformatics.
● Support open-source through collaborations.
● RedHat of bioinformatics.
What we do
● Help you choose the right solution for the problem.
● Right tools for the right jobs.
● For any chosen open-source system:
  ✔ Support and train.
  ✔ Modify, extend, and improve.
  ✔ Integrate private data/systems.
  ✔ Manage private mirrors.
What we do
● Genomic data analysis pipelines
  ✔ SNP calling
  ✔ miRNA detection
  ✔ Probe mapping
  ✔ Assembly and annotation
  ✔ Etc.
● Results integrated into other systems.
Open-source
● Used wherever possible.
● Produced wherever possible.
● Contribute all changes.
● Cutting-edge.
● Quality of concept (not necessarily code).
● Rapidly adjustable.
Cloud
● Low-cost for ad-hoc work.
● Scalable.
● Low-maintenance.
● More secure.
● Accessible.
● No licences.
● No installation.
Ensembl®
● Not only a genome browser, but also:
  ● Genome databases.
  ● Perl API and BioMart data warehouse.
● Collaboration agreement with EMBLEM.
  ● Exchange knowledge.
  ● Share revenue.
● EBI are working with Amazon.
Ensembl is a registered trademark of Genome Research Ltd., and is developed by WTSI/EBI.
Taverna
● Workflow design and execution.
● Collaboration agreement with the University of Manchester (in progress).
  ● Exchange knowledge.
  ● Share revenue.
● Taverna are actively researching cloud deployment.
TraitTag
● Trait detection in seedlings.
● Avoid waiting for adult plants.
● Collaboration agreement with Plant Bioscience Ltd. (John Innes Centre)
  ● Exchange knowledge.
  ● Share revenue.
● Already cloud-capable.
Others
● Other knowledge areas:
  ● O|B|F Bio* projects
    – e.g. BioJava, BioPerl, BioSQL, etc.
  ● BioMart
  ● DebianMed
● No formal collaborations with these.
Big Pharma
● Drug development.
● Pharmacogenomics.
● Software-as-a-service.
● Operational bioinformatics.
● Traditionally suspicious of both open-source and the cloud. This is changing.
Agri-biotech
● Chemical development.
● Seed development.
● Software-as-a-service.
● Operational bioinformatics.
● Into open-source (DuPont has BioPerl people), ready to explore the cloud.
Biotech SMEs
● Anything and everything.
● Highly specialised.
● Poorly funded.
● Lack in-house bioinformatics.
● Many don't even know what it is.
● One-off analyses and consulting projects.
● Love anything that saves them money.
Others
● CROs.
● Universities.
● Government institutes.
● Animal health an emerging area.
● Public bodies are hardest to convince about open-source and the cloud.
● Funding agencies often the cause.
eHive
● Used for all Eagle's data analysis pipelines.
● Developed by the Ensembl Compara project.
● Generic framework (yet another one?).
● Perl.
● Ensembl-aware but independent.
● Scalable and robust.
● Open-source.
eHive
Severin et al. eHive: An Artificial Intelligence workflow system for genomic analysis. BMC Bioinformatics 2010, 11:240
eHive
● Works out-of-the-box on LSF.
● Can run standalone on a single machine.
● Modified to work on Condor and SGE.
● Modified to work on Amazon EC2 without needing Condor/SGE/LSF etc. (crafty trick)
eHive
Standalone pipelines
Packaged into VMs (black-box appliances)
Doesn't scale well
Cumbersome to distribute
What about the cloud?
eHive
● Easy to set up same pipeline and interface on EC2.
● But doesn't solve scaling, and clusters are hard to implement in the cloud.
● Crafty trick – self-replicating self-terminating instances.
● Scales to as many parallel instances as required (up to a preset limit).
● Keeps idle instances alive until just short of the end of their current billed hour, just in case more work arrives.
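The self-replicating, self-terminating behaviour on this slide comes down to two small decisions that each instance makes for itself. The sketch below is only an illustration of that idea, not eHive's or Eagle's actual code. The function names, the 55-minute cut-off and the one-job-per-instance default are hypothetical, hourly billing is assumed, and the real EC2 launch and terminate calls are left out.

```python
import math

SECONDS_PER_HOUR = 3600


def instances_to_launch(pending_jobs, running_instances, max_instances,
                        jobs_per_instance=1):
    """How many extra instances to start (self-replication).

    Assumes each instance works on `jobs_per_instance` jobs at a time and
    that the total never exceeds the preset `max_instances` limit.
    """
    wanted = math.ceil(pending_jobs / jobs_per_instance)
    target = min(wanted, max_instances)
    return max(0, target - running_instances)


def should_terminate(is_idle, uptime_seconds, cutoff_seconds=55 * 60):
    """Whether an instance should shut itself down (self-termination).

    A busy instance never terminates. An idle one waits until it is past
    the cut-off within its current hour of uptime, so that new work can
    still use time that has already been paid for.
    """
    if not is_idle:
        return False
    seconds_into_hour = uptime_seconds % SECONDS_PER_HOUR
    return seconds_into_hour >= cutoff_seconds


if __name__ == "__main__":
    # 120 jobs waiting, 10 instances running, limit of 50 instances.
    print(instances_to_launch(120, 10, 50))
    # Idle 20 minutes into the hour: stay alive for now.
    print(should_terminate(True, 20 * 60))
    # Idle 57 minutes into the second hour: shut down.
    print(should_terminate(True, SECONDS_PER_HOUR + 57 * 60))
```

Keeping both decisions as pure functions means the policy can be checked without starting any machines, which matters when the aim is to show that the job management neither over-provisions nor leaves instances running.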
eHive on EC2 – an example
● A big pharma customer had some microarrays.
● Probe sequence data provided by vendors (Affymetrix, Agilent, etc.).
● Need high-quality mapping with quality scores and other metadata.
● Some chips have public mappings but not to required standard/format.
● Many chips have no public mappings at all.
eHive on EC2 – an example
Probe sequences
Ens funcgen db
Reports
Library of mapping pipelines
Extended Ens API
eHive+EC2
eHive on EC2 – an example
● Getting data in/out no problem – very small.
● Mix-up with keys led to key-per-project.
● Blackboard and results MySQL performance.
● Needed to prove that the job management is working.
  ● Not firing up too many machines.
  ● Not forgetting to shut them down.
eHive on EC2 – an example
● Scales nicely to about 50 machines.
● Beyond that MySQL is the bottleneck.
● Performance-tuning MySQL raises limit.
● Deals better with lots of small input files and/or input as a database table.
● S3 slow but still plenty fast enough for projects like this (1 or 2 extra hours in the context of 100s is not much).
● Really need EBS volumes that can be mounted read-only on multiple instances at once.
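A central job table that every worker reads from and writes to (the "blackboard") becomes the bottleneck once enough machines share it, because each claim is a round trip to the same database. The sketch below illustrates that pattern only. It uses Python's built-in sqlite3 module in place of MySQL, and the table layout, column names and function names are hypothetical rather than eHive's real schema.

```python
import sqlite3


def create_blackboard(conn, job_inputs):
    """Create a hypothetical job table and load one row per input."""
    conn.execute(
        "CREATE TABLE job (id INTEGER PRIMARY KEY, input TEXT, "
        "status TEXT DEFAULT 'READY', worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO job (input) VALUES (?)",
        [(job_input,) for job_input in job_inputs],
    )
    conn.commit()


def claim_job(conn, worker_name):
    """Claim the oldest READY job, or return None if nothing was claimed."""
    row = conn.execute(
        "SELECT id, input FROM job WHERE status = 'READY' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Only mark the job if it is still READY, so a job that another
    # worker took in the meantime is not claimed twice.
    cursor = conn.execute(
        "UPDATE job SET status = 'CLAIMED', worker = ? "
        "WHERE id = ? AND status = 'READY'",
        (worker_name, row[0]),
    )
    conn.commit()
    return row if cursor.rowcount == 1 else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_blackboard(conn, ["chipA", "chipB"])
    print(claim_job(conn, "worker-1"))
    print(claim_job(conn, "worker-2"))
    print(claim_job(conn, "worker-3"))
```

Every claim costs at least two statements against the shared table, so the load on the database grows with the number of workers. That is consistent with the observation that tuning the database raises the limit.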
eHive on EC2 – an example
● Customer data was sent to us encrypted, transferred to Amazon encrypted, only decrypted once inside Amazon.
● Results transferred out in the same way.
● All processing internal to Amazon.
● Usual steps to prevent access – firewall, stop unused daemons, disable logins, etc.
Hosted data services
● Genomic data can be big, but can also be small.
● Analysed data vs. raw data.
● Generic repositories vs. specialised resources.
● Organisations still sensitive about query intercept and log mining.
● Leads to in-house or secured third-party hosting.
Hosted data services
● Simplest resources can run in a VM.
● Ensembl web browser (but not database).
● Functional, effective, if a bit slow.
● VM files slow to download, expensive to ship.
● Can achieve the same effect using the cloud.
● Distribute AMIs, or host instances?
Hosted data services – an example
● A big pharma needs Ensembl in-house.
● Fed up with in-house maintenance.
● Wanted to try it in the cloud.
● Must run entirely within their own Amazon account.
Hosted data services – an example
Eagle a/c
Customer a/c
Public Ens DB
Database instance
Browser instance
Private Ens DB
Database AMI
Browser AMI
Hosted data services – an example
● Ensembl public data in us-east region.
● Migration to other regions can be slow and expensive.
● No guaranteed update schedule.
● Updating it ourselves slow and expensive.
● How to avoid having to maintain multiple instances for multiple customers?
Pistoia Sequence Services
● Pistoia Alliance is an industry alliance of big pharma and related companies.
● Sharing pre-competitive resources.
● Pistoia Sequence Services proof-of-concept
  ● To share Ensembl and related services.
● Eagle in partnership with Cognizant Technology Solutions were successful participants.
Pistoia Sequence Services
● Already had the solution – an Ensembl AMI.
● Need to make it capable of running more than just Ensembl – easy.
● Some of the extra services need small amounts of private data, secured and partitioned – fairly easy.
● Need to secure it, and scale it – pretty hard.
Pistoia Sequence Services
Users
Zeus load balancer with SSL
Usage and Billing
Mapping service
PlasMapper
Ensembl
Public mirror
Private data
etc.
Pistoia Sequence Services
● All user connections via SSL.
● Authentication using SSO against customer auth servers.
  ● OpenAM TrafficScript plugin for Zeus.
  ● SAML2 to LDAP, ActiveDirectory, etc.
● App servers firewalled.
  ● Load balancer connections only.
  ● No inter-connection except to RDBMS.
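The firewalling rules on this slide amount to a short whitelist of permitted connections between tiers. The sketch below models only that whitelist so it can be checked in isolation. The tier names are hypothetical labels, the SSO traffic to customer auth servers is left out, and this is not the configuration used in the proof-of-concept.

```python
# Hypothetical tier names. Each pair is (source, destination) and mirrors
# one rule from the slide. Anything not listed is refused.
ALLOWED_CONNECTIONS = {
    ("user", "load_balancer"),        # all user connections arrive via SSL
    ("load_balancer", "app_server"),  # app servers accept the load balancer only
    ("app_server", "rdbms"),          # the only onward link from an app server
}


def is_allowed(source, destination):
    """Return True if the policy permits a connection between two tiers."""
    return (source, destination) in ALLOWED_CONNECTIONS


if __name__ == "__main__":
    print(is_allowed("user", "load_balancer"))
    print(is_allowed("user", "app_server"))
    print(is_allowed("app_server", "app_server"))
```

Writing the policy as data rather than as scattered firewall rules makes it easy to confirm that users cannot reach an app server directly and that app servers cannot talk to one another.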
Pistoia Sequence Services
● In future could add almost any sequence-related service that is/can be web-enabled.
● PoC runs until Christmas, with free access for all Pistoia members.
  ● [email protected]
  ● [email protected]
● Full commercial service launches Q1 2011 subject to demand and PoC success.
Cloud is bad
● Expensive (always-on solutions).
● Slow data transfer in/out (it's the internet).
● Not big (S3 5GB limit).
● Not fast (HTC not HPC).
● Myths and misconceptions.
● Over-hyped.
● Misused.
Cloud is good
● Scalable.
● Upgradeable.
● Secure.
● Low-cost (for ad-hoc work).
● Accessible.
● Standardised.
● Fast enough.
● Big enough.
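The contrast between "expensive (always-on solutions)" and "low-cost (for ad-hoc work)" is simple arithmetic once instances are billed by the hour. The sketch below works through it with made-up numbers. The rate of 10 cents per instance-hour, the function names and the job sizes are all hypothetical and do not come from the talk or from any real price list.

```python
import math

HOURS_PER_YEAR = 24 * 365


def always_on_cost_cents(rate_cents_per_hour, instances=1,
                         hours=HOURS_PER_YEAR):
    """Cost of keeping instances running all the time."""
    return rate_cents_per_hour * instances * hours


def ad_hoc_cost_cents(rate_cents_per_hour, instances, run_hours):
    """Cost of a one-off run, with each instance billed in whole hours."""
    return rate_cents_per_hour * instances * math.ceil(run_hours)


if __name__ == "__main__":
    rate = 10  # hypothetical price in cents per instance-hour
    # One instance left on for a year.
    print(always_on_cost_cents(rate))
    # Fifty instances used for a single 6.5-hour analysis.
    print(ad_hoc_cost_cents(rate, instances=50, run_hours=6.5))
```

Under these assumed numbers a single always-on machine costs far more over a year than a burst of fifty machines for one analysis. The same platform can therefore look expensive or cheap depending on how it is used.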
Choose your tools to suit the job
● Don't assume your paradigm will translate.
● Rephrase or optimise.
● Do it right, reap the rewards.
● Do it wrong, better off not bothering.
Thanks
● Jason Stowe at CycleComputing.
● Matt Wood at Amazon.
● Cambridge Healthtech Institute.
● Our partners at Cognizant.
● Our collaborators at the EBI, WTSI, JIC, and University of Manchester.
● All producers of open-source bioinformatics software and data.