Next Generation Genomics: Petascale data in the life sciences Guy Coates Wellcome Trust Sanger Institute gmpc@sanger.ac.uk

Page 1: Next Generation Genomics: Petascale data in the life sciences Guy Coates Wellcome Trust Sanger Institute gmpc@sanger.ac.uk

Next Generation Genomics: Petascale data in the life sciences

Guy Coates

Wellcome Trust Sanger Institute

gmpc@sanger.ac.uk

Page 2

Outline

DNA Sequencing and Informatics

Managing Data

Sharing Data

Adventures in the Cloud

Page 3

The Sanger Institute

Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus, Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.

All data is made publicly available.
• Websites, ftp, direct database access, programmatic APIs.

Page 4

DNA sequencing

Page 5

Next-generation Sequencing

Life sciences is drowning in data from our new sequencing machines.

Traditional sequencing:
• 96 sequencing reactions carried out per run.

Next-generation sequencing:
• 52 Million reactions per run.

Machines are cheap(ish) and small.
• Small labs can afford one.
• Big labs can afford lots of them.

Page 6

Economic Trends

Cost of sequencing halves every 12 months.
• cf Moore's Law

The Human genome project:
• 13 years.
• 23 labs.
• $500 Million.

A Human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 1000s and 10,000s of genomes.

Changes in sequencing technology are going to continue this trend.
• “Next-next” generation sequencers are on their way.
• $500 genome is probable within 5 years.
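The "$500 genome within 5 years" estimate can be sanity-checked against the 12-month halving time. The sketch below is only that arithmetic, assuming the halving continues smoothly from the $10,000 figure; the function names are illustrative.

```python
import math


def projected_cost(cost_now, years, halving_years=1.0):
    """Cost after `years` if the cost halves every `halving_years`."""
    return cost_now * 0.5 ** (years / halving_years)


def years_to_reach(cost_now, target, halving_years=1.0):
    """Years until the cost falls to `target` under the same assumption."""
    return halving_years * math.log2(cost_now / target)


if __name__ == "__main__":
    for year in range(6):
        print(f"year {year}: ${projected_cost(10_000, year):,.0f}")
    print(f"$500 reached after about {years_to_reach(10_000, 500):.1f} years")
```

Under that assumption the cost falls below $500 between year 4 and year 5, which is consistent with the slide.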

Page 7

Output Trends

Our peak “old generation” sequencing:
• August 2007: 3.5 Gbases/month.

Current output:
• Jan 2010: 4 Tbases/month.

1000x increase in our sequencing output.
• In August 2007, total size of GenBank was 200 Gbases.

Improvements in chemistry continue to increase the output of machines.

[Figure: bar chart, y-axis Gbases (0 to 4500); bars labelled Capillary (3.5) and Illumina (4000); x-axis label Jan 2010.]

Page 8

The scary graph

[Figure: graph annotated with “Instrument upgrades” and “Peak yearly capillary sequencing”.]

Page 9

Managing Growth

We have exponential growth in storage and compute.
• Storage/compute doubles every 12 months.
  2009: ~7 PB raw.

Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.

Moore's law will not save us.
• Transistor/disk density: Td = 18 months.
• Sequencing cost: Td = 12 months.

[Figure: “Disk Storage” chart, Terabytes (0 to 6000) against Year (1994 to 2009).]
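A rough sketch of the arithmetic behind these bullets, using only the slide figures (16 bytes per base, 10x intermediate space, Td of 12 and 18 months). Decimal units (1 TB = 10^12 bytes) and the function names are assumptions made for illustration.

```python
BYTES_PER_BASE = 16        # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_FACTOR = 10   # from the slide: analysis needs ~10x the raw data


def monthly_storage_tb(gbases_per_month):
    """Return (raw TB, intermediate TB) for one month of sequencing."""
    raw_tb = gbases_per_month * 1e9 * BYTES_PER_BASE / 1e12
    return raw_tb, raw_tb * INTERMEDIATE_FACTOR


def growth_gap(years, td_sequencing=1.0, td_disk=1.5):
    """Factor by which sequencing output outgrows disk density after `years`."""
    return 2 ** (years / td_sequencing - years / td_disk)


if __name__ == "__main__":
    raw, scratch = monthly_storage_tb(4000)  # 4 Tbases/month (Jan 2010)
    print(f"raw: {raw:.0f} TB/month, intermediate: {scratch:.0f} TB/month")
    for years in (1, 3, 6):
        print(f"after {years} years the gap is {growth_gap(years):.1f}x")
```

With these doubling times the gap itself doubles every three years, which is why buying denser disks alone does not keep up.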

Page 10

Sequencing Informatics

Page 11

DNA Sequencing

TCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTG

AAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTA

TGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGC

ATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTG

TGCACTCCAGCTTGGGTGACACAG CAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGG

AAATAATCAGTTTCCTAAGATTTTTTTCCTGAAAAATACACATTTGGTTTCA

ATGAAGTAAATCG ATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC

250 Million * 75-108 base fragments

Human Genome (3 Gbases)
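These two numbers together imply an average depth of coverage. The sketch below is derived arithmetic, not a figure from the talk, and it assumes all 250 million fragments come from a single 3 Gbase genome.

```python
GENOME_BASES = 3e9  # human genome size from the slide


def mean_coverage(num_fragments, fragment_length, genome_bases=GENOME_BASES):
    """Average number of times each base is covered by a fragment."""
    return num_fragments * fragment_length / genome_bases


if __name__ == "__main__":
    for length in (75, 108):
        depth = mean_coverage(250e6, length)
        print(f"{length}-base fragments: about {depth:.2f}x coverage")
```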

Page 12

Alignment

Find the best match of fragments to a known genome / genomes.
• “grep” for DNA sequences.
• Use more sophisticated algorithms that can do fuzzy matching.
  Real DNA has insertions, deletions and mutations.
  Typical algorithms are maq, bwa, ssaha, blast.

Look for differences.
• Single base pair differences (SNP).
• Larger insertions/deletions/mutations.

Typical experiment:
• Compare cancer cell genomes with healthy ones.

Reference: ...TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC....

Query: CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA
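To make "fuzzy grep" concrete, here is a deliberately naive sketch that slides the query along the reference and counts mismatches at each offset. It handles substitutions only (no insertions or deletions) and is not how maq, bwa, ssaha or blast work internally; the function names are illustrative. The two sequences are the ones shown above.

```python
REFERENCE = "TTTGCTGAAACCCAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTCGGTCATCACCAGCATTCTC"
QUERY = "CAAGTGACGCCATCCAGCGTGACCACTGCATTTTTCTAGGTCATCACCAGCA"


def mismatches_at(reference, query, offset):
    """List (reference position, reference base, query base) for each mismatch."""
    window = reference[offset:offset + len(query)]
    return [(offset + i, r, q)
            for i, (r, q) in enumerate(zip(window, query)) if r != q]


def best_ungapped_hit(reference, query):
    """Return (offset, mismatches) for the placement with fewest mismatches."""
    offsets = range(len(reference) - len(query) + 1)
    best = min(offsets, key=lambda o: len(mismatches_at(reference, query, o)))
    return best, mismatches_at(reference, query, best)


if __name__ == "__main__":
    offset, diffs = best_ungapped_hit(REFERENCE, QUERY)
    print(f"best offset: {offset}, mismatches: {len(diffs)}")
    for position, ref_base, query_base in diffs:
        print(f"position {position}: reference {ref_base}, query {query_base}")
```

The single difference it reports is the kind of candidate SNP the slide refers to. Real aligners index the reference rather than scanning it, which is what makes 250 million fragments tractable.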

Page 13

Assembly

Assemble fragments into a complete genome.
• Typical experiment: collect a reference genome for a new species.

“De-novo” assembly.
• Assemble fragments with no external data.
• Harder than it looks.
  Non-uniform coverage, low depth, non-unique sequence (repeats).

Alignment based assembly.
• Align fragments to a related genome.
• Starting scaffold which can then be refined.
  Eg H. neanderthalensis is being assembled against a H. sapiens sequence.

Page 14

Cancer Genomes

Cancer is a disease caused by abnormalities in a cell's genome.

Page 15

Mutation Details

Lung carcinoma genome
• Nature 2010 463; 184-90.

22,910 mutations

58 rearrangements

334 copy number segments

Page 16

Analysing Cancer Genomes

Cancer genomes contain a lot of genetic damage.
• Many of the mutations in cancer are incidental.
• Initial mutation disrupts the normal DNA repair/replication processes.
• Corruption spreads through the rest of the genome.

Today: Find the “driver” mutations amongst the thousands of “passengers”.
• Identifying the driver mutations will give us new targets for therapies.

Tomorrow: Analyse the cancer genome of every patient in the clinic.
• Variations in a patient's and a cancer's genetic makeup play a major role in how effective a particular drug will be.
• Clinicians will use this information to tailor therapies.

Page 17

International Cancer Genome Project

Many cancer mutations are rare.
• Low signal-to-noise ratio.

How do we find the rare but important mutations?
• Sequence lots of cancer genomes.

International Cancer Genome Project.
• Consortium of sequencing and cancer research centres in 10 countries.

Aim of the consortium.
• Complete genomic analysis of 50 different tumor types (50,000 genomes).

Page 18

Past Collaborations

[Diagram: four boxes labelled “Sequencing centre” and one labelled “Sequencing centre + DCC”, linked by “Data”.]

Page 19

Future Collaborations

[Diagram: four boxes labelled “Sequencing centre”, linked by “Federated access”.]

Collaborations are short term: 18 months-3 years.

Page 20

Genomics Data

Data size per genome:
• Intensities / raw data (2 TB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• Variation data (1 GB)
• Individual features (3 MB)

Data types: unstructured data (flat files); structured data (databases).

Users: sequencing informatics specialists; clinical researchers, non-informaticians.

Page 21

Where can grid technologies help us?

Managing data.

Sharing data.

Making our software resources available.

Page 22

Managing Data

Page 23

Bulk Data

Data size per genome:
• Intensities / raw data (2 TB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• Variation data (1 GB)
• Individual features (3 MB)

Data types: unstructured data (flat files); structured data (databases).

Users: sequencing informatics specialists.

Page 24

Bulk Data Management

We thought we were really good at it.
• All samples that come through the sequencing lab are bar-coded and tracked (Laboratory Information Systems).
• Sequencing machines fed into an automated analysis pipeline.
• All the data was tracked, analysed and archived appropriately.

Strict meta-data controls.
• Experiments do not start in the wet-lab until the investigator has supplied all the required data privacy and archiving requirements.
  Anonymised data → straight into the archive.
  Identifiable data → private/controlled archives.
  Some data held back until journal publication.

Page 25

[Diagram: sequencers (Seq 1 ... Seq 38) → data pull by “suckers” → staging area (500 TB) → compute farm (analysis/QC pipeline, alignment/assembly) → final repository (Oracle), 100 TB / yr.]

Page 26

It turns out we were looking in the wrong place

We had been focused on the sequencing pipeline.
• For many investigators, data coming off the end of the sequencing pipeline is where they start.
• Investigators take the mass of finished sequence data out of the archives, onto our compute farms and “do stuff”.

Huge explosion of data and disk use all over the institute.
• We had no idea what people were doing with their data.

Page 27

[Diagram: LIMS managed data: sequencers (Seq 1 ... Seq 38) → data pull by “suckers” → staging area (500 TB) → compute farm (analysis/QC pipeline, assembly/alignment) → final repository (Oracle), 100 TB / yr. Marked “?” and “Unmanaged”: compute farm, compute farm disk, and data from collaborators / 3rd party sequencing.]

Page 28

Accidents waiting to happen...

From: <User A> (who left 12 months ago)

I find the <project> directory is removed. The original directory is "/scratch/<User B> (who left 6 months ago)"

..where is it ?

If this problem cannot be solved, I am afraid that <project> cannot be released.

Page 29

An idea whose time had come

Forward thinking groups had hacked up file tracking systems for their unstructured data.
• They could not keep track of where their results were.
• Problem exacerbated by student turnover (summer students, PhD students on rotation).

Big wins with little effort.
• Disk space usage dropped by 2/3.
  Lots of individuals keeping copies of the same data set “so I know where it is”.
• Team leaders are happy that their data is where they think it is.
  Important stuff is on filesystems that are backed up etc.

But:
• Systems are ad-hoc, quick hacks.
• We want an institute wide, standardised system.
  Invest in people to maintain/develop it.

Page 30

iRODS

iRODS: Integrated Rule-Oriented Data System.

Produced by the DICE (Data Intensive Cyber Environments) group at U. North Carolina, Chapel Hill.

Successor to SRB.

Page 31

iRODS

[Diagram of iRODS components:]
• ICAT: catalogue database.
• Rule engine: implements policies.
• iRODS server: data on disk.
• iRODS server: data in database.
• User interface: WebDAV, icommands, FUSE.

Page 32

Basic Features

Catalogue:
• Puts data on disk and keeps a record of where it is.
• Add query-able metadata to files.

Rules engine.
• “Do things” to files based on file data and metadata.
  Eg move data between fast/archival storage.
• Implement policies.
  Experiment A data should be publicly viewable, but experiment B is restricted to certain users until 6 months after deposition.

Efficient.
• Copes with PB of data and 100,000M+ files.
• Fast parallel data transfers across local and wide area network links.
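The catalogue and policy ideas can be illustrated with a toy model. This is not the iRODS API or its rule language: every class, function and metadata key below is hypothetical, and "6 months" is approximated as 183 days.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

EMBARGO = timedelta(days=183)  # assumption: "6 months" taken as 183 days


@dataclass
class CatalogueEntry:
    logical_name: str
    physical_path: str
    deposited: date
    metadata: dict = field(default_factory=dict)


class Catalogue:
    """Toy catalogue: remembers where each file lives plus its metadata."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.logical_name] = entry

    def query(self, **wanted):
        """Return entries whose metadata matches every key=value given."""
        return [e for e in self._entries.values()
                if all(e.metadata.get(k) == v for k, v in wanted.items())]


def may_read(entry, user, today, restricted_users):
    """Policy sketch: public data is open; restricted data is limited to
    named users until the embargo period after deposition has passed."""
    if entry.metadata.get("access") == "public":
        return True
    if today >= entry.deposited + EMBARGO:
        return True
    return user in restricted_users
```

The point of a rules engine is that a check like `may_read` is enforced by the data system itself rather than by each group's scripts.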

Page 33

Advanced Features

Extensible
• Link the system out to external services.
  Eg external databases holding metadata, external authentication systems.

Federated
• Physically and logically separated iRODS installs can be federated.
• Allows a user at institute A to seamlessly access data at institute B in a controlled manner.
• Supports replication. Useful for disaster recovery/backup scenarios.

Policy enforcement
• Enforces data sharing / data privacy rules.

Page 34

What are we doing with it?

Piloting it for internal use.
• Help groups keep track of their data.
• Move files between different storage pools.
  Fast scratch space ↔ warehouse disk ↔ Offsite DR centre.
• Link metadata back to our LIMS/tracking databases.

We need to share data with other institutions.
• Public data is easy: FTP/http.
• Controlled data is hard:
  Encrypt files and place on private FTP dropboxes.
  Cumbersome to manage and insecure.

Proof of concept to use iRODS to provide controlled access to datasets.
• Will we get buy-in from the community?

Page 35

Sharing data

Page 36

Structured Data

Data size per genome:
• Intensities / raw data (2 TB)
• Alignments (200 GB)
• Sequence + quality data (500 GB)
• Variation data (1 GB)
• Individual features (3 MB)

Data types: unstructured data (flat files); structured data (databases).

Users: clinical researchers, non-informaticians.

Page 37

Raw Genomes are not useful

TCCTCTCTTTATTTTAGCTGGACCAGACCAATTTTGAGGAAAGGATACAGACAGCGCCTGGAATTGTCAGACATATACCAAATCCCTTCTGTTGATTCTGCTGACAATCTATCTGAAAAATTGGAAAGGTATGTTCATGTACATTGTTTAGTTGAAGAGAGAAATTCATATTATTAATTATTTAGAGAAGAGAAAGCAAACATATTATAAGTTTAATTCTTATATTTAAAAATAGGAGCCAAGTATGGTGGCTAATGCCTGTAATCCCAACTATTTGGGAGGCCAAGATGAGAGGATTGCTTGAGACCAGGAGTTTGATACCAGCCTGGGCAACATAGCAAGATGTTATCTCTACACAAAATAAAAAAGTTAGCTGGGAATGGTAGTGCATGCTTGTATTCCCAGCTACTCAGGAGGCTGAAGCAGGAGGGTTACTTGAGCCCAGGAGTTTGAGGTTGCAGTGAGCTATGATTGTGCCACTGCACTCCAGCTTGGGTGACACAGCAAAACCCTCTCTCTCTAAAAAAAAAAAAAAAAAGGAACATCTCATTTTCACACTGAAATGTTGACTGAAATCATTAAACAATAAAATCATAAAAGAAAAATAATCAGTTTCCTAAGAAATGATTTTTTTTCCTGAAAAATACACATTTGGTTTCAGAGAATTTGTCTTATTAGAGACCATGAGATGGATTTTGTGAAAACTAAAGTAACACCATTATGAAGTAAATCGTGTATATTTGCTTTCAAAACCTTTATATTTGAATACAAATGTACTCC

Genomes need to be annotated.
• Locations of genes.
• Functions of genes.
• Relationships between genes (homologues, functional groups).
• Links to the medical/scientific literature.

Page 38

Ensembl

Ensembl is a system for genome annotation.

Compute pipeline.
• Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.

Data visualisation.
• www.ensembl.org
• Provides web interface to genomic data.
• 10k visitors / 126k page views per day.

Data access and mining.
• OO Perl / Java APIs.
• Direct SQL access.
• Bulk data download.
• BioMart, DAS

Software is Open Source (Apache license). Data is free for download.

Page 39

Example annotation

Page 40

Example annotation

Page 41

Example annotation

Page 42

Sharing data with Web Services

Page 43

Distributed Annotation Service

Labs may have data that they want to view with Ensembl.
• Put data into context with everything else.

DAS is a web-services protocol that allows sharing of annotation information.
• Developed at Cold Spring Harbor Lab and extended by Sanger Institute and others.

DAS information:
• Metadata:
  Description of the dataset, features supported.
  This can be optionally registered/validated at das.registry.org.
• Data:
  Object type.
  Co-ordinates (typically genome species/version and position).
  Stylesheet (how should the data be displayed, eg histogram, color gradient).
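As a rough illustration of a DAS exchange, the sketch below builds a features request and parses a response. The server, source name and sample XML are invented; the URL layout and element names follow the general DAS 1.x pattern and should be checked against the specification at biodas.org before use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def features_url(server, source, reference, start, stop):
    """Build a DAS-style features request for one segment of a sequence."""
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server}/das/{source}/features?{query}"


# Hand-written, abbreviated sample response (not output from a real server).
SAMPLE_RESPONSE = """<DASGFF><GFF version="1.0">
<SEGMENT id="1" start="1000" stop="2000">
  <FEATURE id="example_feature_1" label="Example feature">
    <TYPE id="example_type">example</TYPE>
    <START>1200</START>
    <END>1500</END>
    <ORIENTATION>+</ORIENTATION>
  </FEATURE>
</SEGMENT>
</GFF></DASGFF>"""


def parse_features(xml_text):
    """Extract object type and co-ordinates for each feature in a response."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.find("TYPE").get("id"),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
            })
    return features


if __name__ == "__main__":
    print(features_url("http://das.example.org", "my_source", "1", 1000, 2000))
    print(parse_features(SAMPLE_RESPONSE))
```

Because each feature carries only a type and co-ordinates on a shared reference, a viewer such as Ensembl can overlay a lab's data on everything else.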

Page 44

DAS community

Currently ~600 DAS providers spread across 45 institutions and 18 countries.

[Figure annotation: “Removal of non-responsive services”.]

Page 45
Page 46
Page 47
Page 48

BioMart

Provides query based access to structured data.
• Collaboration between CSHL, European Bioinformatics Institute and Ontario Institute for Cancer Research.

“Tell me the function of genes that have substitution mutations in breast-cancer samples.”

Answering this requires queries across multiple databases.
• Mutations are stored in COSMIC, Cancer Genome database.
• Gene function is stored in Ensembl.

BioMart provides a unified entry point to these databases.
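The example question is a join across two sources on a shared gene identifier. The sketch below shows that shape with two in-memory stand-ins; it is not the BioMart API, and the gene IDs and function strings are placeholders rather than real data.

```python
# Stand-in for a mutation database such as COSMIC (placeholder records).
MUTATIONS = [
    {"gene_id": "GENE_A", "sample_type": "breast", "mutation": "substitution"},
    {"gene_id": "GENE_B", "sample_type": "breast", "mutation": "deletion"},
    {"gene_id": "GENE_C", "sample_type": "lung", "mutation": "substitution"},
]

# Stand-in for gene annotation such as Ensembl (placeholder records).
GENE_FUNCTION = {
    "GENE_A": "placeholder function A",
    "GENE_B": "placeholder function B",
    "GENE_C": "placeholder function C",
}


def functions_of_mutated_genes(sample_type, mutation):
    """Find matching genes in one source, then look them up in the other."""
    gene_ids = {m["gene_id"] for m in MUTATIONS
                if m["sample_type"] == sample_type and m["mutation"] == mutation}
    return {g: GENE_FUNCTION[g] for g in sorted(gene_ids) if g in GENE_FUNCTION}


if __name__ == "__main__":
    print(functions_of_mutated_genes("breast", "substitution"))
```

The join only works because both sources use the same identifiers, which is what makes separate databases federatable.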

Page 49

BioMart

[Diagram: source data (Oracle, CSV, MySQL) → transform / import → three marts, federatable through common IDs → query → interfaces (XML, GUI, Perl, SOAP/REST, Java).]

Page 50
Page 51
Page 52
Page 53
Page 54
Page 55

Clouds

Page 56

Disclaimer

This talk will use Amazon/EC2.

We tested it.

It is not a commercial endorsement.

Other cloud providers exist.

It is shorthand; feel free to insert your favourite cloud provider instead.

Page 57

Cloud-ifying Ensembl

Website
• LAMP stack.
• Ports easily to Amazon.
• Provides virtual world-wide co-lo.

Compute Pipeline
• HPTC workload.
• Compute pipeline is a harder problem.

Page 58

Expanding markets

There are going to be lots of new genomes that need annotating.
• Sequencers moving into small labs, clinical settings.
• Limited informatics / systems experience.
  Typically postdocs/PhD students who have a “real” job to do.

We have already done all the hard work on installing the software and tuning it.
• Can we package up the pipeline, put it in the cloud?

Goal: End user should simply be able to upload their data, insert their credit-card number, and press “GO”.

Page 59

Gene Finding

DNA

HMM Prediction

Alignment with known proteins

Alignment with fragments recovered in vivo

Alignment with other genes and other species

Page 60

Compute Pipeline

Architecture:
• OO perl pipeline manager.
• Core algorithms are C.
• 200 auxiliary binaries.

Workflow:
• Investigator describes analysis at high level.
• Pipeline manager splits the analysis into parallel chunks.
  Typically 50k-100k jobs.
• Sorts out the dependencies and then submits jobs to a DRM.
  Typically LSF or SGE.
• Pipeline state and results are stored in a mysql database.

Workflow is embarrassingly parallel.
• Integer, not floating point.
• 64 bit memory address is nice, but not required.
  64 bit file access is required.
• Single threaded jobs.
• Very IO intensive.
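The split, order and submit pattern in the workflow can be sketched in a few lines. This is not the Ensembl pipeline manager (which is OO Perl): the job names and the two-stage analysis are invented, and the sketch relies on `graphlib` from the Python 3.9+ standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def chunk_ranges(length, chunk_size):
    """Split a sequence of `length` bases into (start, end) chunks."""
    return [(start, min(start + chunk_size, length))
            for start in range(0, length, chunk_size)]


def build_jobs(length, chunk_size):
    """Return {job name: set of prerequisite jobs} for a toy analysis."""
    dependencies = {}
    align_jobs = []
    for start, end in chunk_ranges(length, chunk_size):
        predict = f"predict:{start}-{end}"
        align = f"align:{start}-{end}"
        dependencies[predict] = set()    # chunks are independent of each other
        dependencies[align] = {predict}  # each align step needs its prediction
        align_jobs.append(align)
    dependencies["merge"] = set(align_jobs)  # final step waits for everything
    return dependencies


def submission_waves(dependencies):
    """Group jobs into waves that could be submitted to a DRM together."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(submission_waves(build_jobs(250, 100)), 1):
        print(f"wave {number}: {wave}")
```

In a real deployment each wave would become a batch of LSF or SGE submissions, with job state recorded in a database rather than held in memory.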

Page 61

Running the pipeline in practice

Requires a significant amount of domain knowledge.

Software install is complicated.
• Lots of perl modules and dependencies.
• Apache wrangling if you want to run a website.

Need a well tuned compute cluster.
• Pipeline takes ~500 CPU days for a moderate genome.
  Ensembl chewed up 160k CPU days last year.
• Code is IO bound in a number of places.
• Typically need a high performance filesystem.
  Lustre, GPFS, Isilon, Ibrix etc.
• Need large mysql database.
  100GB-TB mysql instances, very high query load generated from the cluster.
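To put the CPU-day figures in cluster terms, the sketch below converts them into sustained core counts and ideal wall-clock time. It assumes perfect parallel scaling and ignores the IO bottlenecks noted above, so real runs would take longer; the function names and the 100-core example are illustrative.

```python
def average_busy_cores(cpu_days_per_year, days_per_year=365):
    """Cores that must be busy all year round to deliver this many CPU days."""
    return cpu_days_per_year / days_per_year


def ideal_wall_days(cpu_days, cores):
    """Wall-clock days for a job, assuming perfect parallel scaling."""
    return cpu_days / cores


if __name__ == "__main__":
    print(f"160k CPU days/year is about {average_busy_cores(160_000):.0f} cores kept busy")
    print(f"500 CPU days on 100 cores: {ideal_wall_days(500, 100):.1f} days at best")
```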

Page 62

How does this port to cloud environments?

Creating the software stack / machine image.
• Creating images with software is reasonably straightforward.
• Getting queuing system etc running requires jumping through some hoops.

Mysql databases
• Lots of best practice on how to do that on EC2.

But it took time, even for experienced systems people.
• (You will not be firing your system-administrators just yet!).

Page 63

Moving data is hard

Moving large amounts of data across the public internet is hard.
• Commonly used tools are not suited to wide-area networks.
  There is a reason gridFTP/FDT/Aspera exist.

Data transfer rates (gridFTP/FDT):
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1 TB to Dublin.
• 23 hours to move 1 TB to East coast.

What speed should we get?
• Once we leave JANET (UK academic network) finding out what the connectivity is and what we should expect is almost impossible.
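The transfer times follow directly from the measured rates. A small sketch of that calculation, assuming decimal units (1 TB = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a sustained rate for the whole transfer:

```python
def transfer_hours(terabytes, mbytes_per_second):
    """Hours needed to move `terabytes` at a sustained rate in Mbytes/s."""
    seconds = terabytes * 1e12 / (mbytes_per_second * 1e6)
    return seconds / 3600


if __name__ == "__main__":
    for site, rate in (("EC2 Dublin", 25), ("EC2 East coast", 12)):
        print(f"1 TB to {site} at {rate} Mbytes/s: {transfer_hours(1, rate):.1f} hours")
```

At 2 TB of raw data per genome, even the faster link moves roughly one genome per day.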

Page 64

IO Architecture

[Diagram comparing two architectures:]
• CPUs on a fat network sharing a POSIX global filesystem, driven by a batch scheduler.
VS
• CPUs on a thin network, each with local storage, using hadoop/S3.

Page 65

Storage / IO is hard

No viable global filesystems on EC2.

NFS has poor scaling at the best of times.
• EC2 has poor inter-node networking. > 8 NFS clients, everything stops.

“The cloud way”: store data in S3.
• Web based object store.
  Get, put, delete objects.
• Not POSIX.
  Code needs re-writing / forking.
• Limitations: cannot store objects > 5GB.

Nasty hacks:
• Subcloud: commercial product that allows you to run a POSIX filesystem on top of S3.
  Interesting performance, and you are paying by the hour...
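Why "not POSIX" forces code changes can be shown with a toy object store. This is an in-memory stand-in, not the S3 API: the class and helper names are hypothetical, and the size limit is a constructor argument so the 5GB restriction can be imitated with a few bytes.

```python
class ToyObjectStore:
    """Whole-object get/put/delete only: no seek, no append, no rename."""

    def __init__(self, max_object_bytes):
        self.max_object_bytes = max_object_bytes
        self._objects = {}

    def put(self, key, data):
        if len(data) > self.max_object_bytes:
            raise ValueError("object exceeds the store's size limit")
        self._objects[key] = bytes(data)

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        del self._objects[key]


def put_large(store, key, data):
    """Work around the size limit by splitting data into numbered parts."""
    limit = store.max_object_bytes
    parts = [data[i:i + limit] for i in range(0, len(data), limit)]
    for number, part in enumerate(parts):
        store.put(f"{key}.part{number:05d}", part)
    return len(parts)


def get_large(store, key, num_parts):
    """Reassemble data stored with put_large."""
    return b"".join(store.get(f"{key}.part{n:05d}") for n in range(num_parts))


def patch_bytes(store, key, offset, new_bytes):
    """An in-place write becomes: fetch everything, edit, upload everything."""
    data = bytearray(store.get(key))
    data[offset:offset + len(new_bytes)] = new_bytes
    store.put(key, bytes(data))
```

Code that expects to seek within a large file, or append to it, has to be restructured around whole-object reads and writes like these.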

Page 66

Going forward

Page 67

Cloud vs HPTC

Re-writing apps to use S3 or hadoop/HDFS is a real hurdle.
• Not an issue for new apps.
• But new apps do not exist in isolation.
• Barrier for entry is much lower for file-systems.

Am I being a reactionary old fart?
• 15 years ago clusters of PCs were not real supercomputers.
• ...then beowulf took over the world.

Big difference: porting applications between the two architectures was easy.
• MPI/PVM etc.

Will the market provide “traditional” compute clusters in the cloud?

Page 68

Networking

How do we improve data transfers across the public internet?
• CERN approach: don't.
• Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.

Our collaborations are different.
• We have relatively short lived and fluid collaborations (1-2 years, many institutions).
• As more labs get sequencers, our potential collaborators also increase.
• We need good connectivity to everywhere.

Page 69

Can we turn the problem on its head?

Fixing the internet is not going to be cost effective for us.

Amazon fixing the internet may be cost effective for them.
• Core to their business model.
• All we need to do is get data into Amazon, and then everyone else can get the data from there.

Cloud as virtual co-location site.
• Mass datastores.
• Host mirror sites for our web services.

Requires us to invest in fast links to Amazon.
• It changes the business dynamic.
• We have effectively tied ourselves to a single provider.
  Expensive mistake if you change your mind, or your provider goes <pop>.

Page 70

Identity management

Web services for linking databases together are mature.
• They are currently all public.

There will be demand for restricted services.
• Patient identifiable data.

Our next big challenge.
• Lots of solutions:
  openID, shibboleth, aspis, globus etc.
• Finding consensus will be hard.
• Culture shock.

Page 71

Acknowledgements

Sanger Institute

Phil Butcher

ISG
• James Beal
• Gen-Tao Chiang
• Pete Clapham
• Simon Kelley

Cancer-genome Project
• Adam Butler
• John Teague

STFC
• David Corney
• Jens Jensen

Page 72

Sites of interest

http://www.ensembl.org

http://www.sanger.ac.uk/cosmic

http://www.biomart.org

http://www.biodas.org

http://www.icgc.org