guy-coates
This talk covers the current challenges and opportunities for using cloud computing for data-heavy research computing. Talk given at the Marcus Evans "Cloud Computing in the Pharmaceutical Industry" conference, Frankfurt 2011.
Outline
Background
Cloud Experiences
Barriers
Future Directions
The Sanger Institute
Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus, Cambridge, UK.
Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.
All data is made publicly available.
• Websites, ftp, direct database access, programmatic APIs.
Lost in the clouds...
Victory!
Our Cloud Experiences
Hype Cycle
Awesome!
Just works...
Ensembl
Ensembl is a system for genome annotation.
Data visualisation / mining web services.
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.
Compute pipeline (HPTC workload)
• Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
• Software is Open Source (Apache license).
• Data is free for download.
We have web services and HPTC workloads running on IaaS.
Why Cloud?
Web services
• Was hosted in a single datacentre at the Genome Campus, UK.
• 1 datacentre = single point of failure.
• Access slow if you were not in western Europe.
Cloud Application
• Build worldwide network of mirrors on IaaS.
HPC
• People want to run the Ensembl HPC pipeline on their own data.
• Requires a skilled bioinformatician to get the software running, and access to an HPC cluster.
Cloud Application
• Build HPC SaaS.
• Users deploy ready-to-run Ensembl code on AWS; it self-assembles into an HPC cluster and analyses their data.
Hype Cycle
Web services / Some HPC
That was easy...
Hype cycle
Sequencing informatics
DNA sequencing
Economic Trends:
The cost of sequencing halves every 12 months.
• cf. Moore's Law
The Human Genome Project:
• 13 years.
• 23 labs.
• $500 Million.
A human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 10,000s of genomes.
Trend will continue:
• Generation 3 sequencers are on their way.
• $500 genome is probable within 5 years.
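As a rough check on this trend, the figures on the slide can be plugged into a simple halving model. The sketch below assumes a clean exponential decline starting from $10,000 with a 12-month halving time; the function names are illustrative only and real costs will not follow the curve exactly.

```python
import math

def cost_after(years, start_cost=10_000.0, halving_time_years=1.0):
    """Cost per genome after `years`, assuming it halves every `halving_time_years`."""
    return start_cost * 0.5 ** (years / halving_time_years)

def years_to_reach(target_cost, start_cost=10_000.0, halving_time_years=1.0):
    """Years until the cost falls to `target_cost` under the same assumption."""
    return halving_time_years * math.log2(start_cost / target_cost)

# Figures from the slide: $10,000 today, $500 predicted within 5 years.
print(f"Cost after 5 years: ${cost_after(5):,.2f}")
print(f"Years to reach $500: {years_to_reach(500):.1f}")
```

Under these assumptions the cost falls below $500 a little after four years, so the "$500 genome within 5 years" estimate is consistent with the stated halving rate.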
The scary graph
Peak yearly capillary sequencing: 30 Gbase
Current weekly sequencing: 3000 Gbase
[Figure: Disk storage in terabytes (0 to 6000) by year, 1994 to 2009]
Managing Growth
We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.
Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.
Moore's law will not save us.
• Transistor / disk density: Td = 18 months
• Sequencing cost: Td = 12 months
• Sequencing output: Td = 3-6 months
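To make the mismatch concrete, the sketch below does two pieces of arithmetic with the numbers from the talk: the disk needed for one week of sequencing (3000 Gbase at 16 bytes per base, plus the 10x intermediate space), and how far each doubling time gets you over three years. It assumes decimal units, that the 16 bytes/base figure applies to all weekly output, and a three-year window chosen purely for illustration.

```python
BYTES_PER_BASE = 16           # from the slide
INTERMEDIATE_MULTIPLIER = 10  # intermediate analysis needs 10x the raw data
WEEKLY_GBASE = 3000           # current weekly sequencing, from the earlier slide

def weekly_storage_tb(gbase=WEEKLY_GBASE):
    """Return (raw TB, intermediate TB) needed for one week of sequencing."""
    raw_tb = gbase * 1e9 * BYTES_PER_BASE / 1e12
    return raw_tb, raw_tb * INTERMEDIATE_MULTIPLIER

def growth_factor(months, doubling_time_months):
    """How many times larger a quantity gets after `months` of steady doubling."""
    return 2.0 ** (months / doubling_time_months)

raw, intermediate = weekly_storage_tb()
print(f"One week of sequencing: {raw:.0f} TB raw, {intermediate:.0f} TB intermediate")

# Compare the doubling times on the slide over an assumed 36-month window.
for label, td in [("Transistor / disk density", 18),
                  ("Sequencing cost", 12),
                  ("Sequencing output (slow end)", 6),
                  ("Sequencing output (fast end)", 3)]:
    print(f"{label}: x{growth_factor(36, td):,.0f} in 3 years")
```

Over the same three years, hardware density improves about 4x while sequencing output grows between 64x and 4096x, which is the gap the slide is pointing at.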
What do you need to do sequencing?
[Diagram boxes: Sample prep; Sequencer; Analysis software; LIMS system / data tracking; Data repository; External repository; HPC resource; Integrated compute]
What IT do you need to do sequencing?
[Diagram boxes: Sample prep; Sequencer; Analysis software; LIMS system / data tracking; Data repository; External repository; HPC resource; Integrated compute. One part of the diagram is marked "Part covered in the grant".]
This is really hard...
We have a whole division of HPC specialists, LIMS developers and bioinformaticians.
What about smaller labs with 1 or 2 sequencers?
...and then change it.
Sequencing informatics is massively fluid.
• New chemistry.
• More sequencing machines.
• New analysis software.
Constant cycle of development and deployment.
How can cloud help?
What can we put on the Cloud?
[Diagram boxes: Sample prep; Sequencer; Analysis software; LIMS system / data tracking; Data repository; External repository; HPC resource; Integrated compute]
Does it Cloud?
How do we decide what to cloud?
Rule of thumb borrowed from HPC.
• Small data / high CPU workloads work better in distributed environments.
IO Bound / Large data
CPU Bound / small data
Data size per Genome
• Sequencing data (raw data): TB
• Alignments: 200 GB
• Sequence + quality data: 500 GB
• Variation data: 1 GB
• Individual features: 3 MB
• Tracking / LIMS: 100s of Kbytes
Labels: Unstructured data (flat files); Structured data (databases)
Data size per Genome (same data types and sizes as above)
Labels added: Cloud Friendly; Cloud Unfriendly
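One way to see why the large flat-file data is the awkward part is to ask how long each data type would take to upload. The sketch below uses the per-genome sizes from this slide and, as an assumption, the 25 Mbytes/s Cambridge to EC2 Dublin rate quoted later in the talk; decimal units are assumed, and the raw data and LIMS entries are left out because the slide gives no exact figure for them.

```python
RATE_BYTES_PER_S = 25e6  # assumed: Cambridge -> EC2 Dublin rate quoted later in the talk

# Per-genome sizes from the slide, in bytes (decimal units assumed).
SIZES = {
    "Sequence + quality data": 500e9,
    "Alignments": 200e9,
    "Variation data": 1e9,
    "Individual features": 3e6,
}

def transfer_seconds(size_bytes, rate=RATE_BYTES_PER_S):
    """Time to move `size_bytes` at a sustained `rate` in bytes per second."""
    return size_bytes / rate

for name, size in SIZES.items():
    seconds = transfer_seconds(size)
    if seconds >= 3600:
        print(f"{name}: {seconds / 3600:.1f} hours")
    else:
        print(f"{name}: {seconds:.2f} seconds")
```

At that rate the sequence and alignment files take hours per genome to move, while the variation data and individual features move in well under a minute.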
Can we Cloudify Sequencing?
[Diagram boxes: Sample prep; Sequencer; Analysis software; Data repository; External repository; HPC resource; Integrated compute; LIMS system / data tracking]
What are the blockers?
HPC infrastructure is now available in the cloud.
• Good enough for 95% of sequencing.
Doing big data is hard:
1. You have to get the data there first.
2. You may not be allowed to put the data there.
Moving data is hard
Tools:
• FTP and ssh/rsync are not suited to wide-area networks.
• WAN tools: gridFTP / FDT / Aspera.
Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1 TB to Dublin.
• 23 hours to move 1 TB to East coast.
What speed should we get?
• Once we leave JANET (UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
Do you have fast enough disks at each end to keep the network full?
Why not just ship disks?
• Logistical nightmare.
• Format issues, corruption, slow.
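The transfer times quoted above follow directly from the measured rates. The sketch below reproduces them and adds, for comparison, what a fully used 2 Gbit/s link would give; it assumes decimal units (1 TB = 10^12 bytes) and a sustained rate with no protocol overhead, so the theoretical figure is an upper bound rather than an expectation.

```python
TB = 1e12  # bytes in a decimal terabyte (assumed unit convention)
MB = 1e6   # bytes in a decimal megabyte

def transfer_hours(size_bytes, rate_bytes_per_s):
    """Hours needed to move `size_bytes` at a sustained rate in bytes per second."""
    return size_bytes / rate_bytes_per_s / 3600.0

RATES = {
    "Cambridge -> EC2 Dublin (measured)": 25 * MB,
    "Cambridge -> EC2 East coast (measured)": 12 * MB,
    "2 Gbit/s site link if fully used (theoretical)": 2e9 / 8,
}

for route, rate in RATES.items():
    print(f"{route}: {transfer_hours(TB, rate):.1f} hours per TB")
```

The measured rates correspond to roughly 10% and 5% of the nominal 2 Gbit/s site link, which is why the question of what speed to expect matters.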
Networking
How do we improve data transfers across the public internet?
• CERN approach: don't.
• Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.
Can it work for cloud?
• Buy dedicated bandwidth to a provider.
• Ties you in.
• Should they pay?
We need good connectivity to everywhere.
Data Security
Are you allowed to put data on the cloud?
Default policy:
“Our data is confidential/important/critical to our business. We must keep our data on our computers.”
What does “My System” mean?
Purchased computer in my data centre
Leased computer in my data centre
Purchased computer in a co-lo facility
Traditionally outsourced IT service
IaaS on a cloud provider
SaaS on a cloud provider
My System Not my system
Root / Admin Access?
Encrypted/ Non encrypted?
VPN / inside or outside firewall?
Legal / IP agreement in place?
How confidential is the data?
Publicly available genome data
Anonymised datasets (e.g. individual genomes with no identifiers)
Trade Secret / Patentable data
Low Risk High Risk
Personally identifiable datasets
Reasons to be optimistic:
Most (all?) data security issues can be dealt with.
• But the devil is in the details.
• Data can be put on the cloud, if care is taken.
It is probably more secure there than in your own data centre.
• Can you match AWS data availability guarantees?
Are cloud providers different from any other organisation you outsource to?
Outstanding Issues
Audit and compliance:
• If you need IP agreements above your provider's standard T&Cs, how do you push them through?
Geographical boundaries mean little in the cloud.
• Data can be replicated across national boundaries without the end user being aware.
Moving personally identifiable data outside of the EU is potentially problematic.
• (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
• More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).
Private Cloud to the rescue?
Sequencing increasingly takes place in large consortiums.
• E.g. International Cancer Genome Consortium (http://www.icgc.org)
Can we do private clouds within the consortium?
Traditional Collaboration
[Diagram boxes: Sequencing Centre + DCC; Sequencing centre (several); IT (several)]
Cloud Collaborations
[Diagram boxes: Sequencing Centre; Sequencing centre (several); Private Cloud IaaS / SaaS]
Private Cloud
Advantages:
• LIMS / analysis software easily shared with consortium.
• Small organisations leverage expertise of big IT organisations.
• Academia tends to be linked by fast research networks.
• Moving data is easier.
• Consortium will be signed up to data-access agreements.
• Simplifies data governance.
Problems:
• Big change in funding model.
• Are big centres set up to provide private cloud services?
• Selling services is hard if you are a charity.
• Can we do it as well as the big internet companies?
Cloud data archives
Dark Archives
Storing data in an archive is not particularly useful.
• You need to be able to access the data and do something useful with it.
Data in current archives is “dark”.
• You can put/get data, but cannot compute across it.
• Is data in an inaccessible archive really useful?
Example problem:
“We want to run our pipeline across 100TB of data currently in EGA/SRA.”
We will need to de-stage the data to Sanger, and then run the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
Cloud / Computable archives
Move the compute to the data.
• Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.
Federated between centres.
• Grid software built on top of cloud components.
• Avoids scaling problems inherent in putting everything in one place.
[Diagram boxes: CPU (several); Data; VM]
Acknowledgements
Sanger
• Phil Butcher
• James Beal
• Pete Clapham
• Simon Kelley
• Gen-Tao Chiang
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken
EBI
• Glenn Proctor
• Steve Keenan