Cloud Technical Challenges

Guy Coates

Wellcome Trust Sanger Institute

[email protected]

Outline

Background

Cloud Experiences

Barriers

Future Directions

The Sanger Institute

Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus, Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.

All data is made publicly available.
• Websites, ftp, direct database access, programmatic APIs.

Lost in the clouds...

Victory!

Our Cloud Experiences

Hype Cycle

Awesome!

Just works...

Ensembl

Ensembl is a system for genome annotation.

Data visualisation / mining web services.
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.

Compute pipeline (HPTC workload)
• Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.

• Software is Open Source (Apache license).
• Data is free for download.

We have web services and HPTC workloads running on IaaS.

http://www.ensembl.org/

Why Cloud?

Web services
• Was hosted in a single datacentre at the Genome Campus, UK.
• 1 datacentre = single point of failure.
• Access slow if you were not in western Europe.

Cloud Application
• Build worldwide network of mirrors on IaaS.

HPC
• People want to run the Ensembl HPC pipeline on their own data.
• Requires a skilled bioinformatician to get the software running and access to an HPC cluster.

Cloud Application
• Build HPC SaaS.
• Users deploy ready-to-run Ensembl code on AWS; it self-assembles into an HPC cluster and analyses their data.

Hype Cycle

Web services / Some HPC

That was easy...

Hype cycle

Sequencing informatics

DNA sequencing

Economic Trends:

Cost of sequencing halves every 12 months.
• cf Moore's Law

The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.

A Human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 10,000s of genomes.

Trend will continue:
• Generation 3 sequencers are on their way.
• $500 genome is probable within 5 years.
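As a rough check on the trend, the sketch below projects the cost of a genome from the $10,000 figure, assuming the cost keeps halving smoothly every 12 months. The function names and the smooth-exponential assumption are illustrative and not from the talk.

```python
import math

def projected_cost(cost_now: float, years: float, halving_time_years: float = 1.0) -> float:
    """Cost after `years`, assuming it halves every `halving_time_years` (assumption)."""
    return cost_now * 0.5 ** (years / halving_time_years)

def years_to_reach(cost_now: float, target: float, halving_time_years: float = 1.0) -> float:
    """Years until the cost falls to `target` under the same halving assumption."""
    return halving_time_years * math.log2(cost_now / target)

cost_in_5_years = projected_cost(10_000, 5)   # cost after five halvings
years_to_500 = years_to_reach(10_000, 500)    # time to reach a $500 genome

print(f"Cost in 5 years: ${cost_in_5_years:.2f}")
print(f"Years to reach $500: {years_to_500:.1f}")
```

Under these assumptions the $500 genome arrives a little before the five-year mark, which is consistent with "probable within 5 years".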

The scary graph

[Figure: disk storage in terabytes (y-axis, 0 to 6000) by year (x-axis, 1994 to 2009).]

Peak yearly capillary sequencing: 30 Gbase

Current weekly sequencing: 3000 Gbase

Managing Growth

We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.

Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.
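To make the gigabase-to-gigabyte gap concrete, here is a minimal sketch that applies the two figures from this slide (16 bytes per base, 10x intermediate space) to the 3000 Gbase weekly output quoted earlier. It uses decimal units (1 TB = 10^12 bytes), and the function name is hypothetical.

```python
BYTES_PER_BASE = 16        # from the slide
INTERMEDIATE_FACTOR = 10   # intermediate analysis needs ~10x the raw data (from the slide)

def storage_tb(gigabases: float) -> tuple[float, float]:
    """Return (raw_tb, intermediate_tb) for a sequencing output given in Gbase."""
    raw_bytes = gigabases * 1e9 * BYTES_PER_BASE
    raw_tb = raw_bytes / 1e12  # decimal terabytes (assumption)
    return raw_tb, raw_tb * INTERMEDIATE_FACTOR

raw_tb, scratch_tb = storage_tb(3000)  # one week of sequencing at 3000 Gbase
print(f"Raw: {raw_tb:.0f} TB, intermediate working space: {scratch_tb:.0f} TB")
```

Taking the slide's numbers at face value, a single week of sequencing implies tens of terabytes of raw data and several hundred terabytes of working space, although the working space is presumably reclaimed once analysis completes.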

Moore's law will not save us.
• Transistor/disk density: Td = 18 months
• Sequencing cost: Td = 12 months
• Sequencing output: Td = 3-6 months
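The mismatch between these doubling times compounds quickly. The sketch below compares growth factors over a three-year window, assuming each quantity follows a clean exponential with the doubling time Td quoted above. The three-year window and the names are illustrative choices.

```python
def growth_factor(years: float, doubling_time_months: float) -> float:
    """Multiplicative growth over `years` for a quantity that doubles every Td months."""
    return 2 ** (years * 12 / doubling_time_months)

YEARS = 3  # illustrative window, not from the talk

doubling_times_months = {
    "Transistor/disk density": 18,
    "Sequencing cost": 12,
    "Sequencing output (Td = 6)": 6,
    "Sequencing output (Td = 3)": 3,
}

factors = {name: growth_factor(YEARS, td) for name, td in doubling_times_months.items()}
for name, factor in factors.items():
    print(f"{name}: x{factor:,.0f} in {YEARS} years")
```

Over three years hardware density improves by a factor of 4, while sequencing output grows by a factor of between 64 and 4096, so the hardware curve cannot absorb the data curve.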

What do you need to do sequencing?

[Diagram: sequencer, analysis software, LIMS system / data tracking, sample prep, data repository, external repository, HPC resource, integrated compute.]

What IT do you need to do sequencing?


[Diagram: data repository, external repository, sample prep, HPC resource, integrated compute.]

Part covered in the grant

This is really hard...

We have a whole division of HPC specialists, LIMS developers and bioinformaticians.

What about smaller labs with 1 or 2 sequencers?

...and then change it.

Sequencing informatics is massively fluid.
• New chemistry.
• More sequencing machines.
• New analysis software.

Constant cycle of development and deployment.

How can cloud help?

What can we put on the Cloud?



[Diagram: sample prep, data repository, external repository, HPC resource, integrated compute.]

Does it Cloud?

How do we decide what to cloud?

Rule of thumb borrowed from HPC.
• Small data / high CPU work better in distributed environments.

IO Bound / Large data

CPU Bound / small data

Sequencing Data

Data size per genome, largest first:
• Raw data (TB)
• Sequence + quality data (500 GB)
• Alignments (200 GB)
• Variation data (1 GB)
• Individual features (3 MB)
• Tracking / LIMS (100s of Kbytes)

[Diagram axes: structured data (databases) / unstructured data (flat files); Cloud Friendly / Cloud Unfriendly.]
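One way to see why the smaller data products are the cloud-friendly ones is to estimate how long each takes to upload. The sketch below uses the per-genome sizes on this slide and the 25 Mbytes/s Cambridge to EC2 Dublin rate reported later in the talk. Decimal units and the variable names are assumptions, and raw data is omitted because the slide gives only "TB" for it.

```python
RATE_MBYTES_PER_S = 25  # Cambridge -> EC2 Dublin rate quoted later in the talk

# Per-genome sizes from the slide, in bytes (decimal units assumed).
sizes_bytes = {
    "Sequence + quality data": 500e9,
    "Alignments": 200e9,
    "Variation data": 1e9,
    "Individual features": 3e6,
}

def transfer_seconds(size_bytes: float, rate_mbytes_per_s: float = RATE_MBYTES_PER_S) -> float:
    """Time to move `size_bytes` at a sustained rate given in Mbytes/s."""
    return size_bytes / (rate_mbytes_per_s * 1e6)

transfer_times = {name: transfer_seconds(size) for name, size in sizes_bytes.items()}
for name, seconds in transfer_times.items():
    print(f"{name}: {seconds:,.2f} s ({seconds / 3600:.2f} h)")
```

At that rate the sequence and alignment files take hours per genome, while variation data and features move in seconds.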

Can we Cloudify Sequencing?

[Diagram: sequencer, analysis software, sample prep, data repository, external repository, HPC resource, integrated compute.]


What are the blockers?

HPC infrastructure is now available in the cloud.
• Good enough for 95% of sequencing.

Doing big data is hard:

1. You have to get the data there first.

2. You may not be allowed to put the data there.

Moving data is hard

Tools:
• Standard tools (FTP, ssh/rsync) are not suited to wide-area networks.
• WAN tools: gridFTP / FDT / Aspera.

Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1 TB to Dublin.
• 23 hours to move 1 TB to East coast.

What speed should we get?
• Once we leave JANET (UK academic network), finding out what the connectivity is and what we should expect is almost impossible.

Do you have fast enough disks at each end to keep the network full?

Why not just ship disks?
• Logistical nightmare.
• Format issues, corruption, slow.
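The transfer figures on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 Gbit/s = 1000 Mbit/s) and uses hypothetical helper names. It also shows how little of the 2 Gbit/s site link the measured rates use, and the sustained disk throughput needed to fill that link.

```python
SITE_LINK_GBIT = 2.0  # site link speed from the slide

def hours_to_move(size_tb: float, mbytes_per_s: float) -> float:
    """Hours needed to move `size_tb` terabytes at a sustained rate in Mbytes/s."""
    return size_tb * 1e12 / (mbytes_per_s * 1e6) / 3600

def link_utilisation(mbytes_per_s: float, link_gbit: float = SITE_LINK_GBIT) -> float:
    """Fraction of the site link used by a transfer running at `mbytes_per_s`."""
    return (mbytes_per_s * 8) / (link_gbit * 1000)

# Disk throughput (Mbytes/s) needed at each end to keep the link full.
line_rate_mbytes = SITE_LINK_GBIT * 1000 / 8

for site, rate in {"EC2 Dublin": 25, "EC2 East coast": 12}.items():
    print(f"{site}: {hours_to_move(1, rate):.1f} h per TB, "
          f"{link_utilisation(rate):.1%} of the site link")
print(f"At line rate ({line_rate_mbytes:.0f} Mbytes/s): "
      f"{hours_to_move(1, line_rate_mbytes):.1f} h per TB")
```

The measured rates correspond to roughly 5 to 10% of the site link, so the bottleneck lies somewhere beyond it, which is the point of the "what speed should we get?" question.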

Networking

How do we improve data transfers across the public internet?
• CERN approach: don't.
• Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.

Can it work for cloud?
• Buy dedicated bandwidth to a provider.
• Ties you in.
• Should they pay?

We need good connectivity to everywhere.

Data Security

Are you allowed to put data on the cloud?

Default policy:

“Our data is confidential/important/critical to our business. We must keep our data on our computers.”

What does “My System” mean?

Spectrum from "My System" to "Not my system":
• Purchased computer in my data centre
• Leased computer in my data centre
• Purchased computer in a co-lo facility
• Traditionally outsourced IT service
• IaaS on a cloud provider
• SaaS on a cloud provider

Questions:
• Root / Admin access?
• Encrypted / non-encrypted?
• VPN / inside or outside firewall?
• Legal / IP agreement in place?

How confidential is the data?

Scale from Low Risk to High Risk:
• Publicly available genome data
• Anonymised datasets (e.g. individual genomes with no identifiers)
• Trade secret / patentable data
• Personally identifiable datasets

Reasons to be optimistic:

Most (all?) data security issues can be dealt with.
• But the devil is in the details.
• Data can be put on the cloud, if care is taken.

It is probably more secure there than in your own data-centre.
• Can you match AWS data availability guarantees?

Are cloud providers different from any other organisation you outsource to?

Outstanding Issues

Audit and compliance:
• If you need IP agreements above your provider's standard T&Cs, how do you push them through?

Geographical boundaries mean little in the cloud.
• Data can be replicated across national boundaries without the end user being aware.

Moving personally identifiable data outside of the EU is potentially problematic.
• (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
• More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).

Private Cloud to rescue?

Sequencing increasingly takes place in large consortiums.
• E.g. International Cancer Genome Consortium (http://www.icgc.org)

Can we do private clouds within the consortium?

Traditional Collaboration

[Diagram: a sequencing centre + DCC linked to several other sequencing centres, each with its own IT.]

Cloud Collaborations

[Diagram: several sequencing centres sharing a private cloud (IaaS / SaaS).]




Private Cloud

Advantages:
• LIMS / analysis software easily shared with consortium.
• Small organisations leverage expertise of big IT organisations.
• Academia tends to be linked by fast research networks.
• Moving data is easier.
• Consortium will be signed up to data-access agreements.
• Simplifies data governance.

Problems:
• Big change in funding model.
• Are big centres set up to provide private cloud services?
• Selling services is hard if you are a charity.
• Can we do it as well as the big internet companies?

Cloud data archives

Dark Archives

Storing data in an archive is not particularly useful.
• You need to be able to access the data and do something useful with it.

Data in current archives is “dark”.
• You can put/get data, but cannot compute across it.
• Is data in an inaccessible archive really useful?

Example problem:

“We want to run our pipeline across 100TB of data currently in EGA/SRA.”

We will need to de-stage the data to Sanger, and then run the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.

Cloud / Computable archives

Move the compute to the data.
• Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.

Federated between centres
• Grid software built on top of cloud components.
• Avoids scaling problems inherent in putting everything in one place.

[Diagram: VMs placed on CPUs that sit alongside the data at each centre.]
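The "move the compute to the data" idea can be sketched as a dispatcher that sends a VM image to whichever centre already holds the dataset, rather than de-staging the data. Everything here is hypothetical (the `Centre` class, the centre names, the dataset identifiers); it illustrates the routing decision only and is not any real archive's API.

```python
from dataclasses import dataclass, field

@dataclass
class Centre:
    """A hypothetical archive site with compute attached to its data."""
    name: str
    datasets: set[str]
    running: list[str] = field(default_factory=list)

    def run(self, vm_image: str, dataset: str) -> str:
        # Stand-in for launching a VM on compute attached to the data.
        self.running.append(f"{vm_image} on {dataset}")
        return self.name

def dispatch(centres: list[Centre], vm_image: str, dataset: str) -> str:
    """Send the VM to the first centre that already holds the dataset."""
    for centre in centres:
        if dataset in centre.datasets:
            return centre.run(vm_image, dataset)
    raise LookupError(f"no centre holds {dataset}")

# Two federated centres, each holding a different dataset (illustrative names).
centres = [
    Centre("centre-a", {"dataset-1"}),
    Centre("centre-b", {"dataset-2"}),
]
placed_at = dispatch(centres, "pipeline-vm", "dataset-2")
print(f"Workload placed at {placed_at}")
```

Only the VM image crosses the network in this model; the dataset stays where it is, and each centre scales its own compute.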

Acknowledgements

Sanger

• Phil Butcher
• James Beal
• Pete Clapham
• Simon Kelley
• Gen-Tao Chiang
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken

EBI

• Glenn Proctor
• Steve Keenan
