The ELIXIR Compute Platform: An environment for …

ELIXIR-EXCELERATE is funded by the European Commission within the Research Infrastructures programme of Horizon 2020, grant agreement number 676559.

The ELIXIR Compute Platform: An environment for Analysing Life-Science Data

Steven Newhouse, Head of Technical Services, EMBL-EBI

Genomics vs High Energy Physics

• Both are excellent examples of Big Data but Genomics data is:

• More complex and variable, used in more demanding ways

• Growth is accelerating faster than physics data

• Greater uncertainty on short timescales => less time to respond

• Less community-wide investment in software and infrastructure

• Sequencing and imaging machines provide thousands of data sources

• Research data deposited into repositories before publishing

• Health data retained inside organisational firewalls

• Tony Wildish, Genomics vs. Physics, HEPIX 2016, https://indico.cern.ch/event/531810/sessions/208405/#20161019

EMBL: European Molecular Biology Laboratory

Over 1600 people and more than 80 nationalities

• Hamburg: Structural biology

• Heidelberg: Life sciences

• Rome: Epigenetics and neurobiology

• Cambridge (EMBL-EBI): Bioinformatics

• Grenoble: Structural biology

• Barcelona: Tissue biology and disease modelling

Data Resources at EMBL-EBI

Literature & ontologies

• Experimental Factor Ontology

• Gene Ontology

• BioStudies

• Europe PMC

Chemical biology

• ChEBI

• ChEMBL

• SureChEMBL

Molecular structures

• Protein Data Bank in Europe

• Electron Microscopy Data Bank

Gene, protein & metabolite expression

• Expression Atlas

• MetaboLights

• PRIDE

• RNAcentral

Protein sequences, families & motifs

• InterPro

• Pfam

• UniProt

Genes, genomes & variation

• Ensembl

• Ensembl Genomes

• GWAS Catalog

• Metagenomics portal

Systems

• BioModels

• BioSamples

• Enzyme Portal

• IntAct

• Reactome

Molecular Archives

• European Nucleotide Archive

• European Variation Archive

• European Genome-phenome Archive

• ArrayExpress

Cross domain resources

Ever Increasing Demands

Storage growth at EMBL-EBI is still 40-50% a year.

Increasingly ‘interesting’ data is being generated and held in national or local repositories.

• Integration challenges
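To put the 40-50% annual growth figure in perspective, the following is a minimal sketch of the compound-growth arithmetic. It assumes the growth rate stays constant from year to year, which the slide does not claim; the function names are illustrative only.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for a quantity to double at a constant compound annual growth rate."""
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    return math.log(2) / math.log(1 + annual_growth)


def projected_size(current: float, annual_growth: float, years: float) -> float:
    """Size after `years` of compound growth, in the same unit as `current`."""
    return current * (1 + annual_growth) ** years


if __name__ == "__main__":
    # Assumption: the 40-50% range from the slide, held constant over five years.
    for rate in (0.40, 0.50):
        print(
            f"{rate:.0%} a year: doubles every {doubling_time_years(rate):.2f} years, "
            f"grows {projected_size(1.0, rate, 5):.1f}x in 5 years"
        )
```

Under that assumption, storage roughly doubles every two years, so any capacity plan is overtaken quickly.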

Big data, big demand

• ~27 million requests to EMBL-EBI websites every day

• 200 petabytes of storage capacity in our data centres

• EMBL-EBI delivered 152 million jobs to its users in 2016

• Scientists at over 3.2 million unique IP addresses use EMBL-EBI websites

Data Centre Infrastructure

Campus (Hinxton) [90 racks]

Leased Data Centre (Hemel Hempstead) [90 racks]

Leased Data Centre (Slough) [10 racks]

JANET – UK Academic Network

• Raw Storage:

• Object Store – 101 PB

• NAS – 70 PB

• HPC Storage – 22 PB

• Tape – 22 PB

• Analysis Capacity:

• HTC: 22,000 job slots

• HPC: 7,000 job slots

• Cloud: 6,000 vCPUs

• Virtual infrastructure: 1,500 cores

Network links (diagram legend): 20 Gb/s, 10 Gb/s, 1 Gb/s

ELIXIR – Research Infrastructure for Life Science

• Compute: Access, Exchange & Compute on sensitive data

• Data: Sustain core data resources

• Tools: Services & connectors to drive access and exploitation

• Standards: Integration and interoperability of data and services

• Training: Professional skills for managing and exploiting data

ELIXIR Compute Platform: Integration with communities

The transfer of large volume, electronic confidential, human data

https://www.elixir-europe.org/events/elixir-webinar-transfer-large-volume-data

ELIXIR Compute Platform: Integrating Existing Services

https://www.elixir-europe.org/platforms/compute

The ELIXIR Nodes and their collaboration with European e-Infrastructures form the technical and resource foundation of the ELIXIR Compute Platform.

A geographically distributed Authentication & Authorisation Infrastructure (AAI) in operation.

Integrated Cloud & Compute and Storage & File Transfer Services that are provided by the individual ELIXIR Nodes and which will be discoverable through ELIXIR.

Moving data between sites is one key capability of the ELIXIR Compute Platform.

Raising the level of abstraction through platforms that promote distributed workflow execution

ELIXIR Cloud & Compute

ELIXIR Cloud capacities surveyed here: confirmed capacity, counting only the DK, DE, EBI, FI, FR and SUI nodes

> 60,000 compute cores

> 24,000 TB of storage

> 3,000 compute users

Resource allocation decisions are made by the nodes

ELIXIR Data Transfer and Storage

• PID and Metadata Registry

• Minimal metadata for tracking and downloading data available

• Example implementation integrating GridFTP and Handles capturing minimal metadata; automatic Handle resolving

• Next step: Integration with RDA collections API and specification

• File Transfer

• Deployed FTS3 integrated with ELIXIR AAI supporting multiple protocols (GridFTP, HTTPS, S3, …)

• Command line and web UI

• Performance tests between GridFTP, Aspera, HTTP and other protocols are still ongoing (ELIXIR-ES)

• Reference Data Set Distribution Service

• RDSDS planned, designed and developed at EMBL-EBI with support from EUDAT2020
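The idea behind the PID and Metadata Registry can be illustrated with a minimal in-memory sketch: a persistent identifier maps to a small metadata record, and resolving the identifier returns the current download location even after the data has moved. This is not the Handle System API or the ELIXIR implementation; the class names, the metadata fields, the example PID and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class DataRecord:
    """Hypothetical minimal metadata needed to track and download one data set."""
    pid: str         # persistent identifier, written here in a Handle-like "prefix/suffix" form
    url: str         # current download location
    size_bytes: int  # lets a client check a transfer completed
    checksum: str    # lets a client verify integrity


class PidRegistry:
    """Toy registry: PIDs stay fixed while the location behind them may change."""

    def __init__(self) -> None:
        self._records: Dict[str, DataRecord] = {}

    def register(self, record: DataRecord) -> None:
        if record.pid in self._records:
            raise KeyError(f"PID already registered: {record.pid}")
        self._records[record.pid] = record

    def resolve(self, pid: str) -> str:
        """Return the current download URL for a PID."""
        return self._records[pid].url

    def relocate(self, pid: str, new_url: str) -> None:
        """Point an existing PID at a new location without changing the PID."""
        self._records[pid] = replace(self._records[pid], url=new_url)


if __name__ == "__main__":
    registry = PidRegistry()
    # Hypothetical identifier and hosts, for illustration only.
    registry.register(DataRecord(
        pid="21.T00000/example-0001",
        url="gsiftp://node-a.example.org/data/reference.fa",
        size_bytes=3_100_000_000,
        checksum="md5:0123456789abcdef0123456789abcdef",
    ))
    registry.relocate("21.T00000/example-0001", "https://node-b.example.org/data/reference.fa")
    print(registry.resolve("21.T00000/example-0001"))
```

Keeping the identifier stable while the URL changes is what lets a transfer service or a workflow refer to a data set without knowing which node currently holds it.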

Interaction with e-Infrastructures

Communities

ELIXIR Compute Platform

EOSC (EGI, EUDAT & Indigo)

Commercial Providers

ELIXIR Nodes

At a GLIF meeting a long time ago, in a galaxy far, far away…

• EGI.eu still here!

• Coordinating EOSC-Hub

• EOSC: European Open Science Cloud

• Federating cloud resources and services

• Enabling open science around open data

• https://www.glif.is/meetings/2010/plenary/newhouse-egi.pdf

Future Compute Platform: ELIXIR-GA4GH Analysis Environment

• Integrate user federation ELIXIR AAI into local compute and data deployments

• Rationalise an ELIXIR-wide Data Distribution Network – starting with Reference datasets

• Drive ELIXIR Compute Platform support for hybrid (public/private & cloud/HPC) deployments – e.g. OpenStack, SGE, etc.

• Develop Task Distribution Network using Task orchestration engines – e.g. Kubernetes

• Support national or regional workflow choreography engines – e.g. CWL, Nextflow, Galaxy, etc.

Infrastructure Requirements

• Data Sources:

• EMBL-EBI has a lot of data! Science DMZ to improve access.

• Cloud Resources:

• From within ELIXIR Nodes, national providers, others all federated through EOSC

• Commercial cloud providers: HelixNebula (T-Systems & RHEA), AWS, GCP, MSA, …

• Data Sinks:

• Strategic placement of reference data sets & tactical placement of analysis data

Underlying Network Infrastructure with dynamic dedicated virtual links?

Hybrid Cloud Future

• Cost model of public clouds:

• Good for transitory activities, e.g. 1000 cores for 2 months

• Bad for long-term activities, e.g. 17PB for 5 years growing 0.5PB/month in + out

• How can we present our on-site storage externally?

• Replicate and sync to the cloud: Existing file based access model

• On-demand caching: Existing file based access model with smart layer

• Direct network access over http: Read/Write whole object

• How much bandwidth is needed to support these models?
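The two examples above can be turned into rough numbers with a short sketch. It uses no prices, since none are given; it only sizes the workloads. The assumptions are mine: decimal petabytes, 30-day months, linear growth sampled at the start of each month, and 0.5 PB/month read as the volume that has to cross the network each month.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month
BITS_PER_PB = 8 * 10**15            # assumption: decimal petabyte (10^15 bytes)


def core_hours(cores: int, days: int) -> int:
    """Total core-hours for a burst of `cores` running continuously for `days`."""
    return cores * days * 24


def pb_months(initial_pb: float, growth_pb_per_month: float, months: int) -> float:
    """Cumulative storage held, in PB-months, with linear growth sampled monthly."""
    return sum(initial_pb + growth_pb_per_month * m for m in range(months))


def sustained_gbps(pb_per_month: float) -> float:
    """Average link rate in Gb/s needed to move `pb_per_month` every month."""
    return pb_per_month * BITS_PER_PB / SECONDS_PER_MONTH / 1e9


if __name__ == "__main__":
    # Transitory activity: 1000 cores for 2 months (taken as 60 days).
    print(f"burst compute: {core_hours(1000, 60):,} core-hours")
    # Long-term activity: 17 PB growing by 0.5 PB/month for 5 years.
    print(f"storage held:  {pb_months(17, 0.5, 60):,.0f} PB-months")
    print(f"final size:    {17 + 0.5 * 60:.0f} PB")
    # Bandwidth needed just to keep up with 0.5 PB/month, ignoring peaks and overhead.
    print(f"sustained link rate: {sustained_gbps(0.5):.2f} Gb/s")
```

The sustained rate is an average with no headroom; bursty traffic or traffic in both directions at once would need more than this figure suggests.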

Scaling out from EMBL-EBI’s Data Centres

JANET/GEANT

Public Clouds

http (Object Store)

NFS (Scale out storage)

Web Sites

Web Services

ELIXIR Clouds

Individual Users

EBI Hybrid Cloud

Summary

• Network remains key factor

• Even more so for big data in clouds

• Business & service models remain complex

• Both local ISP connectivity and public cloud providers

• Can network paths on demand help (or hinder) here?

• Danger in just adding another complexity layer!

• But could provide USP for performance & cost with public clouds

www.elixir-europe.org

@ELIXIREurope /company/elixir-europe

Thank you – steven.newhouse@ebi.ac.uk

Acknowledgements: ELIXIR Compute Platform & EMBL-EBI Technical Services Cluster

compute-exco@elixir-europe.org
