Sanger Institute's experiences with the cloud. Given at Green Datacentre & Cloud Control 2011
Cloud Experiences
Wellcome Trust Sanger Institute
[email_address]
The Sanger Institute
- Funded by the Wellcome Trust, the 2nd largest research charity in the world.
- ~700 employees.
- Based on the Hinxton Genome Campus, Cambridge, UK.
- Large-scale genomic research.
- Sequenced 1/3 of the human genome (the largest single contributor).
- Active cancer, malaria, pathogen and genomic variation / human health studies.
- All data is made publicly available.
- Websites, FTP, direct database access, programmatic APIs.
DNA Sequencing
- [slide shows raw sequence reads: TCTTTATTTTAGCTGGACCAGACC...]
- 250 million fragments of 75-108 bases each, vs. a human genome of 3 Gbases.
Moore's Law
- Compute/disk doubles every 18 months.
- Sequencing doubles every 12 months.
Economic Trends:
- The Human Genome Project:
- 13 years, 23 labs, $500 million.
- A human genome today:
- 3 days, 1 machine, $8,000.
- The trend will continue:
- A $500 genome is probable within 3-5 years.
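The two doubling times on the Moore's Law slide imply a steadily widening gap; a quick illustrative calculation (assuming simple exponential growth at the stated rates):

```python
# Sequencing output doubles every 12 months; compute/disk capacity
# every 18 months (figures from the slides). Fold-growth after t years:
def fold_growth(years, doubling_months):
    return 2 ** (years * 12 / doubling_months)

for years in (1, 3, 5):
    seq = fold_growth(years, 12)   # sequencing demand
    cpu = fold_growth(years, 18)   # compute/disk supply
    print(f"after {years} years: demand x{seq:.1f}, "
          f"supply x{cpu:.1f}, shortfall x{seq / cpu:.2f}")
```

After 3 years demand has grown 8x but capacity only 4x: buying the same hardware budget every year covers an ever-smaller fraction of the sequencing output.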
The scary graph
- Peak yearly capillary sequencing: 30 Gbase.
- Current weekly sequencing: 6,000 Gbase.
Our Science
UK10K Project
- Decode the genomes of 10,000 people in the UK.
- Will improve the understanding of human genetic variation and disease.
Genome Research Limited; "Wellcome Trust launches study of 10,000 human genomes in UK"; 24 June 2010; www.sanger.ac.uk/about/press/2010/100624-uk10k.html
New scale, new insights . . . to common disease
- Coronary heart disease
- Hypertension
- Bipolar disorder
- Arthritis
- Obesity
- Diabetes (types I and II)
- Breast cancer
- Malaria
- Tuberculosis
Cancer Genome Project
- Cancer is a disease caused by abnormalities in a cell's genome.
- Detailed changes:
- Sequencing hundreds of cancer samples.
- First comprehensive look at cancer genomes:
- Lung cancer, malignant melanoma, breast cancer.
- Identify driver mutations for:
- Improved diagnostics, development of novel therapies, targeting of existing therapeutics.
"Lung Cancer and melanoma laid bare"; 16 December 2009; www.sanger.ac.uk/about/press/2009/091216.html
IT Challenges
Managing Growth
- Analysing the data takes a lot of compute and disk space.
- Finished sequence is the start of the problem, not the end.
- Growth of compute & storage:
- Storage/compute demand doubles every 12 months.
- Moore's law will not save us.
The $1000 genome*
*Informatics not included
Sequencing data flow
- Sequencer -> processing/QC -> comparative analysis -> datastore -> Internet.
- Volumes: raw data (10 TB) -> sequence (500 GB) -> alignments (200 GB) -> variation data (1 GB) -> features (3 MB).
- The datastore holds structured data (databases) and unstructured data (flat files).
Data centre
- 4 x 250 m² data centres.
- 2-4 kW/m² cooling.
- 1.8 MW power draw.
- 1.5 PUE.
- Overhead aircon, power and networking.
- Allows counter-current cooling.
- Focus on power- and space-efficient storage and compute.
- Technology refresh:
- 1 data centre is an empty shell.
- Rotate into the empty room every 4 years and refurbish.
- "Fallow field" principle.
Our HPC Infrastructure
- Compute:
- 8,500 cores.
- 10 GigE / 1 GigE networking.
- High-performance storage:
- 1.5 PB DDN 9000 & 10000 storage.
- Lustre filesystem.
- LSF queuing system.
Ensembl
- Data visualisation / mining web services.
- www.ensembl.org
- Provides web / programmatic interfaces to genomic data.
- 10k visitors / 126k page views per day.
- Compute pipeline (HPTC workload):
- Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
- Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
- Software is open source (Apache licence). Data is free for download.
Sequencing data flow
- Sequencer -> processing/QC (HPC compute pipeline) -> comparative analysis -> datastore -> Internet (web / database infrastructure).
- Volumes: raw data (10 TB) -> sequence (500 GB) -> alignments (200 GB) -> variation data (1 GB) -> features (3 MB).
- The datastore holds structured data (databases) and unstructured data (flat files).
Annotation
- [slide shows raw sequence with annotated features highlighted]
Why Cloud?
Web Services
- Ensembl has a worldwide audience.
- Historically, web site performance was not great, especially for non-European institutes.
- Pages were quite heavyweight, not properly cached, etc.
- The web team spent a lot of time re-designing the code to make it more streamlined.
- Greatly improved performance.
- But coding can only get you so far:
- 150-240 ms round-trip time from Europe to the US.
- We need a set of geographically dispersed mirrors.
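No amount of code streamlining removes the speed-of-light cost, which is why mirrors are needed. A rough illustration (the 20-round-trip figure is an assumption for the sketch, not a measurement of the Ensembl site):

```python
# A page whose rendering needs sequential round trips pays the full
# RTT each time. round_trips = 20 is an assumed, illustrative figure.
rtt_ms = 200          # Europe -> US round trip (from the slide's range)
round_trips = 20
latency_s = rtt_ms * round_trips / 1000
print(f"{latency_s:.1f} s of pure network latency")
```

A local mirror cuts the RTT to a few milliseconds, so the same page pays almost nothing in latency.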
Colocation
- Real machines in a co-lo facility in California.
- Hardware was initially configured on site.
- 16 servers, SAN storage, SAN switches, SAN management
appliance, Ethernet switches, firewall, out-of-band management
etc.
- Shipped to the co-lo for installation.
- Sent a person to California for 3 weeks.
- Spent 1 week getting stuff into/out of customs.
- Additional infrastructure work.
- Incredibly time consuming.
- Really don't want to end up having to send someone on a plane
to the US to fix things.
Cloud Opportunities
- We wanted more mirrors:
- US east coast, Asia-Pacific.
- Investigations into AWS were already ongoing.
- Many people would like to run the Ensembl webcode to visualise their own data.
- Non-trivial for the non-expert user.
- Can we distribute AMIs instead?
- Can we eat our own dog food?
- Run the mirror sites from the AMIs?
What we actually did
- [diagram: Sanger connected to AWS over a VPN]
Building a mirror on AWS
- Application development was required
- Significant code changes required to make the webcode mirror
aware.
- Mostly done for the original co-location site.
- Some software development / sysadmin work needed.
- Preparation of OS images, software stack configuration.
- VPN configuration.
- A significant amount of tuning was required.
- Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
- Lots of people are doing Apache/MySQL on AWS, so there is a good amount of best practice available.
Traffic
Is it cost effective?
- Lots of misleading cost statements are made about cloud:
- "Our analysis only cost $500." "CPU is only $0.085/hr."
- What are we comparing against?
- Doing the analysis once? Continually?
- Buying a $2,000 server?
- Leasing a $2,000 server for 3 years?
- Using $150 of time at your local supercomputing facility?
- Buying a $2,000 server but having to build a $1M datacentre to put it in?
- Requires the dreaded Total Cost of Ownership (TCO) calculation:
- hardware + power + cooling + facilities + admin/developers etc.
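For scale, the two sound-bite figures can at least be combined into a concrete amount of compute (taking the quoted per-core-hour rate at face value):

```python
# How much compute a $500 analysis buys at the quoted EC2 rate.
rate = 0.085               # $/CPU-hour (from the slide)
budget = 500.0
cpu_hours = budget / rate
print(f"{cpu_hours:,.0f} CPU-hours")
```

Whether that is cheap depends entirely on the comparison baseline, which is the point of the TCO discussion above.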
Breakdown:
- Comparing costs to the real co-lo:
- Power and cooling costs are all included.
- Admin costs are the same, so we can ignore them.
- The same people are responsible for both.
- Cost for the co-location facility:
- $120,000 hardware + $51,000/yr colo.
- $91,000 per year (3-year hardware lifetime).
- Cost for the AWS site:
- We can run 3 AWS mirrors for 90% of the cost of 1 co-lo mirror.
- It is not free!
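The annualised co-lo figure above follows directly from the slide's numbers:

```python
# Annualised cost of the co-lo mirror, from the figures on the slide:
hardware = 120_000          # one-off purchase
lifetime_years = 3          # hardware lifetime
colo_per_year = 51_000      # facility fee
annual = hardware / lifetime_years + colo_per_year
print(f"${annual:,.0f} per year")
```

Amortising the hardware over its 3-year lifetime and adding the yearly facility fee gives the $91,000/yr baseline the AWS costs are compared against.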
Advantages
- No physical hardware.
- Work can start as soon as we enter our credit card numbers...
- No US customs, FedEx etc.
- Less hardware:
- No firewalls, SAN management appliances etc.
- Much simpler management infrastructure.
- AWS gives you out-of-band management for free.
- No hardware issues.
- Easy path for growth:
- No space constraints.
- No need to get tin decommissioned / re-installed at the co-lo.
- Add more machines until we run out of cash.
Downsides
- Underestimated the time it would take to make the web-code
mirror-ready.
- Not a cloud specific problem, but something to be aware of when
you take big applications and move them outside your home
institution.
- Curation of software images takes time.
- Regular releases of new data and code.
- The Ensembl team now has a dedicated person responsible for the cloud.
- Somebody has to look after the systems.
- Management overhead does not necessarily go down.
Going forward
- Change code to remove all dependencies on Sanger.
- Make the AMIs publicly available.
- Today we have MySQL servers + data.
- Data generously hosted on Amazon public datasets.
- Allow users to simply run their own sites.
HPC Workloads
Why HPC in the Cloud?
- We already have a data centre.
- We are not seeking to replace our existing infrastructure.
- That would not be cost effective.
- But: long lead times for installing kit.
- ~3-6 months from idea to going live.
- Longer than the science can wait.
- The ability to burst capacity might be useful.
- Test environments:
- Test at scale.
- Large clusters for a short amount of time.
Distributing analysis tools
- Sequencing is becoming a commodity.
- Informatics / analysis tools need to be a commodity too.
- Analysis currently requires a significant amount of domain knowledge.
- Complicated software installs, relational databases etc.
- Goal:
- Researcher with no IT knowledge can take their sequence data,
upload it to AWS, get it analysed and view the results.
Life Sciences HPC Workloads
- [diagram: workloads plotted on two axes, tightly coupled (MPI) vs. embarrassingly parallel and CPU-bound vs. IO-bound; modelling/docking, simulation and genomics marked]
Our Workload
- Embarrassingly Parallel.
- Lots of single-threaded jobs; 10,000s of jobs.
- Core algorithms in C.
- Perl pipeline manager to generate and manage the workflow.
- Batch scheduler to execute jobs on nodes.
- MySQL database to hold results & state.
- Moderate memory sizes.
- IO bound.
- Fast parallel filesystems.
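The embarrassingly parallel shape described above (many independent single-threaded jobs whose results land in a shared store) can be sketched in a few lines of Python; `analyse` is a hypothetical stand-in for one single-threaded C job, and a dict stands in for the MySQL results database:

```python
from multiprocessing import Pool

def analyse(job_id):
    # Hypothetical stand-in for one single-threaded C job. Each job is
    # independent of every other, so any number can run concurrently.
    return job_id, sum(range(job_id))       # fake per-job result

if __name__ == "__main__":
    jobs = range(1, 101)                    # the real pipeline runs 10,000s
    with Pool(processes=8) as pool:
        results = dict(pool.map(analyse, jobs))   # MySQL stand-in
    print(len(results), "jobs completed")
```

Because no job talks to any other, throughput scales with core count, which is exactly why this workload suits both LSF farms and rented cloud nodes.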
[repeat of the Life Sciences HPC Workloads diagram, highlighting genomics]
Different Architectures
- [diagram: traditional HPC (CPUs + fat network + POSIX global filesystem + batch scheduler) vs. cloud (CPUs + thin network + local storage per node + S3; Hadoop?)]
[repeat of the Life Sciences HPC Workloads diagram]
Careful choice of problem:
- Choose a simple part of the pipeline.
- Re-factor all the code that expects a global filesystem and make it use S3.
- Why not use Hadoop?
- We have production code that works nicely inside Sanger.
- Vast effort to port the code, for little benefit.
- Questions about stability for multi-user systems internally.
- Build a self-assembling HPC cluster:
- Code which will spin up AWS images that self-assemble into an HPC cluster with a batch scheduler.
- Cloud allows you to simplify.
- Sanger compute cluster is shared.
- Lots of complexity in ensuring applications/users play nicely
together.
- AWS clusters are unique to a user/application.
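The "re-factor to use S3" step above amounts to putting a thin storage interface between the pipeline and its files. A minimal sketch under assumed names (`LocalStore` is hypothetical; a second class exposing the same `put`/`get` methods would wrap the S3 API, leaving pipeline code unchanged):

```python
import os

class LocalStore:
    """POSIX-backed key/value store. An S3Store exposing the same
    put/get interface would let identical pipeline code run against
    the object store instead of a global filesystem."""

    def __init__(self, root):
        self.root = root

    def put(self, key, data):
        path = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get(self, key):
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()
```

Pipeline steps then call `store.put(...)` / `store.get(...)` rather than opening paths directly, so the choice of backend becomes configuration rather than code.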
The real problem: the Internet
- Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link):
- Cambridge -> EC2 east coast: 12 Mbytes/s (96 Mbits/s).
- Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s).
- 11 hours to move 1 TB to Dublin.
- 23 hours to move 1 TB to the east coast.
- What speed should we get?
- Once we leave JANET (UK academic network) finding out what the
connectivity is and what we should expect is almost
impossible.
- Do you have fast enough disks at each end to keep the network
full?
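The transfer times quoted above follow directly from the measured rates:

```python
# Sanity check of the quoted transfer times (1 TB = 10**12 bytes).
def hours_to_move(terabytes, mbytes_per_s):
    return terabytes * 1e12 / (mbytes_per_s * 1e6) / 3600

print(f"to Dublin at 25 MB/s:         {hours_to_move(1, 25):.1f} h")
print(f"to the east coast at 12 MB/s: {hours_to_move(1, 12):.1f} h")
```

At these rates even one weekly run (~6 Tbases of raw data) cannot realistically be shipped over the public internet, which motivates the networking options on the next slide.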
Networking
- How do we improve data transfers across the public internet?
- The CERN approach: don't. A dedicated 10 Gbit network runs between CERN and the T1 centres.
- Can it work for cloud?
- Buy dedicated bandwidth to a provider.
- Ties you in. Should they pay?
- What happens when you want to move?
Summary
- Moving existing HPC applications is painful.
- Small-data / high-CPU applications work really well.
- Large-data applications work less well.
Data Security
Are you allowed to put data on the cloud?
- Default policy:
- Our data is confidential/important/critical to our business.
- We must keep our data on our computers.
- Apart from when we outsource it already.
Reasons to be optimistic:
- Most (all?) data security issues can be dealt with.
- But the devil is in the details.
- Data can be put on the cloud, if care is taken.
- It is probably more secure there than in your own data-centre.
- Can you match AWS data availability guarantees?
- Are cloud providers different from any other organisation you
outsource to?
Outstanding Issues
- Audit and compliance:
- If you need IP agreements above your provider's standard T&Cs, how do you push them through?
- Geographical boundaries mean little in the cloud:
- Data can be replicated across national boundaries without the end user being aware.
- Moving personally identifiable data outside of the EU is potentially problematic.
- (It can be problematic within the EU too; privacy laws are not as harmonised as you might think.)
- More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).
Private Cloud to the rescue?
- Can we do something different?
Traditional Collaboration
- [diagram: sequencing centres, each with its own IT, feeding a DCC (sequencing centre + archive)]
Dark Archives
- Storing data in an archive is not particularly useful.
- You need to be able to access the data and do something useful
with it.
- Data in current archives is dark.
- You can put/get data, but cannot compute across it.
- Is data in an inaccessible archive really useful?
Private Cloud Collaborations
- [diagram: sequencing centres sharing private cloud IaaS / SaaS facilities]
Private Cloud
- Advantages:
- Small organisations leverage the expertise of big IT organisations.
- Academia tends to be linked by fast research networks.
- Moving data is easier (move compute to the data via VMs).
- Consortium members will be signed up to data-access agreements.
- Simplifies data governance.
- Problems:
- Big change in funding model.
- Are big centres set up to provide private cloud services?
- Selling services is hard if you are a charity.
- Can we do it as well as the big internet companies?
Summary
- Cloud is a useful tool.
- Will not replace our local IT infrastructure.
- Porting existing applications can be hard.
- Do not underestimate time / people.
- Still need IT staff.
- End up doing different things.
Acknowledgements
- Sanger:
- Phil Butcher, James Beal, Pete Clapham, Simon Kelley, Gen-Tao Chiang
- Steve Searle, Jan-Hinnerk Vogel, Bronwen Aken
- EBI:
- Glenn Proctor, Steve Keenan