GCC Genomics Core Computing





Current situation: GCC 1.0

[Diagram: current cluster on the UZ network]

• Roche 454 sequencer → UZ NAS storage
• 3 compute nodes: 8 cores, 16 GB RAM each (one node with 2 TB local disk)

Per run:

• ~1 million reads
• ~2 GB raw data

New sequencer: 1000× increase

• 1.1 TB per run (200 Gbp)
• ~1,000 million reads
• 8-day run!

Basic analysis of 1 full run

< 1 week on 3 nodes with 48 GB RAM and 8 CPU cores each (and needs 7 TB of disk space)

Sequencing at full capacity = all 24 CPU cores at full capacity
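A back-of-envelope check of these figures (a minimal sketch: the constants are taken from the slides above, the derived values are rough estimates only):

```python
# Back-of-envelope sizing for one HiSeq run, using the figures from the slides.
# All constants come from the deck; derived values are rough estimates only.

GBP_PER_RUN = 200          # gigabase pairs per run
READS_PER_RUN = 1_000e6    # ~1,000 million reads
RAW_TB_PER_RUN = 1.1       # raw data per run
RUN_DAYS = 8               # sequencer run time
NODES, CORES_PER_NODE = 3, 8

read_length = GBP_PER_RUN * 1e9 / READS_PER_RUN              # ~200 bp per read
data_rate_MB_s = RAW_TB_PER_RUN * 1e6 / (RUN_DAYS * 86400)   # average MB/s produced by the sequencer
cpu_hours = NODES * CORES_PER_NODE * 7 * 24                  # "< 1 week on 24 cores" as an upper bound

print(f"average read length : ~{read_length:.0f} bp")
print(f"average data rate   : ~{data_rate_MB_s:.1f} MB/s over the run")
print(f"basic analysis      : <= {cpu_hours} CPU-hours per run")
```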

Meta-analyses & post-analyses

• Several-fold higher needs than basic run analyses
• Integrate multiple runs (e.g., patients versus controls, families, etc.)
• Integrate with previous data
• Integrate with publicly available data
  – RNA-Seq + gene expression data from GEO
• Integrate with other data sources
  – DNA-Seq + RNA-Seq + Methyl-Seq
• Integrate with genome browsers
  – Galaxy, UCSC, Ensembl
• Make analysis pipelines available to users as a service
• Custom analyses as a service or in collaboration

Ideal computing setup

[Diagram: ideal setup across three levels, UZ – gbiomed – VSC, with a 500 MB/s link to storage]

• UZ (UZ-patient data)
  – Existing cluster: 3 nodes, 8 cores and 16 GB RAM each (one with 2 TB local disk); UZ NAS storage
  – Additional RAM (32 GB or 48 GB per node)
  – Additional storage? DAS or NAS? Dell, NetApp?
  – Software: CASAVA, CLCBio, Roche
• gbiomed: flexible computing
  – Servers, storage, switches
  – Software: academic tools, CLCBio?
• VSC: High Performance Computing (HPC)
  – ~100 CPUs, 6 GB RAM per core; NetApp + DDN storage
  – Cost: computing 0.5 EUR per CPU-hour; storage 750-1500 EUR per TB
  – Software: CASAVA (parallelized by the user), academic tools (bowtie, bwa, …), CLCBio?
• Distributed computing middleware: Open-MPI, SGE, Torque/PBS
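To make "distributed computing" on such a cluster concrete, here is a minimal sketch of submitting per-lane alignment jobs through a batch scheduler. Everything in it is illustrative: the paths, queue resources and use of bwa are assumptions rather than the actual GCC setup; only the tool and scheduler names (bwa, Torque/PBS) come from the diagram above.

```python
# Illustrative only: generate one Torque/PBS job per lane of a run and submit it
# with qsub. Paths and resource requests are hypothetical placeholders.
import subprocess
import textwrap

REFERENCE = "/data/ref/hg19.fa"          # hypothetical reference index
RUN_DIR = "/data/runs/run_001"           # hypothetical run directory
LANES = range(1, 9)                      # a HiSeq flow cell has 8 lanes

for lane in LANES:
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #PBS -N align_lane{lane}
        #PBS -l nodes=1:ppn=8,walltime=24:00:00
        # Align one lane with 8 threads; bwa aln/samse is one possible academic tool.
        bwa aln -t 8 {REFERENCE} {RUN_DIR}/lane{lane}.fastq > {RUN_DIR}/lane{lane}.sai
        bwa samse {REFERENCE} {RUN_DIR}/lane{lane}.sai {RUN_DIR}/lane{lane}.fastq \\
            > {RUN_DIR}/lane{lane}.sam
        """)
    # qsub reads the job script from stdin and returns the job id.
    subprocess.run(["qsub"], input=script, text=True, check=True)
```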

To be discussed

• How can the HiSeq 2000 choose between the UZ and KU Leuven networks to send run data to storage?
  – 1 Gb
  – ~350 GB per run, compressed (transfer-time sketch after this list)
• Where to store data after secondary analysis?
  – Cheap storage
  – External HDDs
  – Tape
• Who does what?
  – Jeroen / Jan for UZ?
  – Stein / Gert / Raf for Biomed?
• Can we already buy additional RAM for the UZ cluster?
• Can we connect gbiomed servers directly to UZ storage?
  – What are the requirements?
• Estimate load over the 3 levels
  – # users
  – # runs
  – Difficult to estimate now; evaluate after 1 year
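Assuming the "1 Gb" above refers to the network line rate, a quick transfer-time estimate for the compressed run data (a minimal sketch: the 350 GB figure is from the slide, the line rates are nominal and ignore protocol overhead):

```python
# Rough transfer times for ~350 GB of compressed run data at nominal line rates.
# Real throughput will be lower (protocol overhead, disk speed, shared links).
compressed_gb = 350                      # GB per run, compressed (from the slide)

for name, gbit_per_s in [("1 Gbit link", 1), ("10 Gbit link", 10)]:
    seconds = compressed_gb * 8 / gbit_per_s   # GB -> Gbit, divided by line rate
    print(f"{name}: ~{seconds / 3600:.1f} h at line rate")
```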

What’s next

• Decide on gbiomed hardware
• List of things needed at UZ
• Start testing CASAVA on the UZ system and on the VSC
• Test CLCBio on the UZ system for Illumina data
  – Test with 1000 Genomes data

Storage

• How much do we need? (sizing sketch after this list)
  – 1.1 TB per run
  – 7 TB of space during analysis
• BUT: keep only runs that are being analyzed
  – ~3 at a time?
  – 10-15 TB
• After analysis:
  – Data delivered to client
  – Data compressed and moved to offline storage
• Cheap HDD array?
• Tape?
• External HDDs?
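A minimal sizing sketch based on the figures above; the assumption that raw data for ~3 runs stays online while only one run at a time needs the full 7 TB working space is illustrative, not stated on the slide:

```python
# Rough estimate of "active" storage, keeping only runs that are being analysed.
# Assumption for illustration: raw data for ~3 concurrent runs is kept online,
# but only one run at a time needs the full 7 TB analysis working space.
RAW_TB_PER_RUN = 1.1      # from the slide
WORK_TB_PER_ANALYSIS = 7  # from the slide
CONCURRENT_RUNS = 3       # "~3 at a time?"

active_tb = CONCURRENT_RUNS * RAW_TB_PER_RUN + WORK_TB_PER_ANALYSIS
print(f"~{active_tb:.1f} TB active storage")   # ~10.3 TB, in line with the 10-15 TB estimate
```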

Proposal for GCC2.0 (ideas under construction)

[Diagram: proposed GCC 2.0 layout]

• Sequencers: Illumina HiSeq 2000 and Roche 454
• UZ: existing computing nodes (8 cores, 16 GB RAM each; one with 2 TB disk) + UZ NetApp storage
  – Patient-related data and non-patient-related data (e.g., model organisms, cell lines, …)
• gbiomed computing nodes: 32 cores / 256 GB RAM plus 2 × (8 cores / 48 GB RAM)
  – Fast interconnect; high I/O bandwidth
  – Non-patient-related data
• ICTS/VSC: NetApp + DDN storage; VSC (existing), pay per CPU-hour
• ! = to create, to test, or to open a 10 Gb link

GCC2.0 features

• Divide and conquer: solution at 3 levels
  – UZ: for UZ-patient-related data (protected)
  – gbiomed: ad hoc, flexible computing for research (non-UZ-patient-related data)
  – VSC: high-performance computing (non-UZ-patient-related data)
• Storage (too expensive to duplicate)
  – VSC storage with gbiomed access (create a 10 Gb fast interconnect from ICTS to gbiomed)
  – UZ storage with gbiomed access (create an 'open-access' policy for non-patient-related data)
  – gbiomed ad hoc storage (HDDs in the local servers)
• Computing
  – VSC for HPC
  – Servers in UZ (patient-related data)
  – Servers in gbiomed (for research-related ad hoc analyses, web services, development, software testing, …)
    • Requires fast (10 Gb Ethernet) access to ICTS storage and fast (and open) access to UZ open storage

GCC2.0 Cost, Timing & Effort estimates

• Budget from Stichting tegen Kanker
  – 200-250 kEUR left for computing
• A solution for the first 3 years should be possible (excluding bioinformatics manpower)
• Budget spread between VSC-gbiomed-UZ: to be decided internally in the genomics core
• VSC: x%
  – Storage (EUR 86,400 for 32 TB; ~80 TB is needed for 25 runs per year) (unit-price check after this list)
  – Computing time (EUR 29,594 for 55,000 CPU-hours)
• gbiomed local servers and local storage: y%
• UZ additional storage: z%
• Software licenses (CLCBio) (price quote requested)
  – More investments needed over time (e.g., new hardware covers only 3 years)
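A quick check of the unit prices implied by the quoted VSC figures (a minimal sketch: only the quoted numbers are from the slide, and the 80 TB extrapolation assumes linear pricing):

```python
# Unit prices implied by the quoted VSC figures (all inputs from the slide).
storage_eur, storage_tb = 86_400, 32
compute_eur, cpu_hours = 29_594, 55_000
tb_needed = 80   # ~80 TB for 25 runs per year

print(f"storage : {storage_eur / storage_tb:.0f} EUR/TB "
      f"(~{tb_needed * storage_eur / storage_tb / 1000:.0f} kEUR for {tb_needed} TB if pricing is linear)")
print(f"compute : {compute_eur / cpu_hours:.2f} EUR per CPU-hour")   # consistent with ~0.5 EUR/CPU-hour earlier
```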

• Timing: 31 August 2010?
• Estimated effort (to be discussed)
  – VSC:
    • Create a 10 Gb Ethernet link to gbiomed (cost?)
    • … man-days for startup and testing (network connections, storage, software)
    • Maintenance included in the price
  – Genomics Core bioinformaticians (VRC, CME)
    • … man-days for startup and testing
  – gbiomed IT:
    • … man-days for setting up local servers & integration with ICTS storage
    • … FTE for maintenance of local servers
  – UZ: … man-days for additional storage and setting up the NetApp share

Hurdles to overcome

• 1) 10 Gb Ethernet link between VSC and gbiomed
  – For non-UZ-patient-related data
  – To transfer Illumina data to the VSC
  – To run ad hoc analyses on local gbiomed servers connected to the VSC storage, without the need to duplicate the storage solution and the data (too costly)
  – An absolute requirement
  – Currently not available
  – A necessary investment for future VSC-BMW interactions
• 2) UZ-patient-related data cannot be transferred to VSC storage, nor computed at the VSC
  – Can the VSC provide a secure transfer, storage and computing environment for UZ data? If not, data analysis and storage for UZ data remain in UZ.
• 3) Link between UZ storage and gbiomed for non-patient-related data
  – A gbiomed-UZ 10 Gb link is possible in principle. Perhaps during a transition period (while waiting for the 10 Gb link VSC-gbiomed)?

Alternatives

• All-in-one solution
• PSSCLabs
• Public tender

Bioinformatics analyses

• Estimated effort from a Genomics Core bioinformatician for the basic analysis of 1 run: ~2-3 man-days
  – Included in the service fee?
  – This analysis will not be satisfactory for most projects

• Fee-based bioinformatics and data analysis service for more advanced analyses?

• Many users have a bioinformatician in the group or already collaborate with bioinformaticians

• Contribution through the service fee to GCC hardware & maintenance costs, and software licenses

• Estimated effort:
  – Either only basic analysis services are offered: ½ FTE bioinformatics postdoc
  – Or basic plus advanced bioinformatics services are offered: 1 FTE bioinformatics postdoc