AWS Government, Education, and Nonprofit Symposium Washington, DC | June 25-26, 2015

Accelerating Time to Science: Transforming Research in the Cloud

Jamie Kinney - @jamiekinney
Director of Scientific Computing, a.k.a. “SciCo” – Amazon Web Services

Dr. Michael Ernst - @brookhavenlab
Director, RHIC and ATLAS Computing Facility - Brookhaven National Laboratory

©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Agenda

• An introduction to scientific computing on AWS

• How are researchers using AWS today?

• Case study: How the ATLAS experiment is using AWS

• Q & A


What do we mean by scientific computing?

Scientific computing refers to the application of simulation, mathematical modeling, and quantitative analysis to analyze and solve scientific problems.


How is AWS used for scientific computing?

• High Performance Computing (HPC) for engineering and simulation

• High-throughput computing (HTC) for data-intensive analytics

• Hybrid supercomputing centers

• Collaborative research environments

• Citizen science

• Science-as-a-Service


Why do researchers love using AWS?

Time to science: access research infrastructure in minutes

Low cost: pay-as-you-go pricing

Globally accessible: easily collaborate with researchers around the world

Secure: a collection of tools to protect data and privacy

Scalable: access to effectively limitless capacity

Elastic: easily add or remove capacity


Why does AWS care about scientific computing?

• We want to improve our world by accelerating the pace of scientific discovery

• It is a great application of AWS with a broad customer base

• The scientific community helps us innovate on behalf of all customers

– Streaming data processing and analytics

– Exabyte scale data management solutions and exaflop scale compute

– Collaborative research tools and techniques

– New AWS regions

– Significant advances in low-power compute, storage, and data centers

– Efficiencies that will lower our costs and therefore pricing for all customers


Research grants

AWS provides free usage credits to help researchers:

• Teach advanced courses

• Explore new projects

• Create resources for the scientific community

aws.amazon.com/grants


Peering with all global research networks

Image courtesy John Hover - Brookhaven National Lab


Restricted-access genomics on AWS

aws.amazon.com/genomics

How are researchers using AWS today?


High-throughput computing at scale

The Large Hadron Collider experiments @ CERN involve thousands of researchers from over 40 countries and produce tens of PB of data each year.

The ATLAS and CMS experiments are using AWS for Monte Carlo simulations and analysis of LHC data.


Data-intensive computing

The Square Kilometer Array (SKA) will link 250,000 radio telescopes together, creating the world’s most sensitive telescope. The SKA will generate zettabytes of raw data, publishing exabytes annually over 30-40 years.

Researchers are using AWS to develop and test:

• Data processing pipelines

• Image visualization tools

• Exabyte-scale research data management

• Collaborative research environments

aws.amazon.com/solutions/case-studies/icrar/


High Performance Computing

Simulations in the automotive sector

• Crash and materials simulations

• Fluid and thermal dynamics simulations

• Car body aerodynamics

• Electronics and electromagnetic simulations

Honda materials science simulations on AWS:

• Deploying scalable HPC clusters on AWS Spot Instances – up to 1,000 C3 instances

• Running more simulations than before, for more accurate results

“Cloud offers us an opportunity, as we can innovate faster than before.” - Ayumi Tada, IT System Administrator, Honda R&D


Schrödinger and Cycle Computing: Computational chemistry for better solar power

Simulation by Mark Thompson of the University of Southern California to see which of 205,000 organic compounds could be used for photovoltaic cells for solar panel material.

An estimated 264 years of computation time, completed in 18 hours.

• 156,314 core cluster, 8 regions

• 1.21 petaFLOPS (Rpeak)

• $33,000 or 16¢ per molecule

loosely coupled
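
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check: it assumes the 264-year estimate refers to a single serial run and that a year is 365 days, neither of which is stated on the slide. The variable names are illustrative.

```python
# Figures quoted on the slide.
compounds = 205_000
total_cost_usd = 33_000
cores = 156_314
wall_clock_hours = 18
estimated_serial_years = 264  # assumption: estimate for one serial run

HOURS_PER_YEAR = 365 * 24  # assumption: 365-day years

# Cost per screened compound (the slide quotes 16 cents).
cost_per_molecule = total_cost_usd / compounds

# Core-hours consumed and the implied price of one core-hour.
core_hours = cores * wall_clock_hours
cost_per_core_hour = total_cost_usd / core_hours

# Wall-clock speedup relative to the serial estimate.
speedup = estimated_serial_years * HOURS_PER_YEAR / wall_clock_hours

print(f"cost per molecule:  ${cost_per_molecule:.2f}")
print(f"core-hours used:    {core_hours:,}")
print(f"cost per core-hour: ${cost_per_core_hour:.4f}")
print(f"wall-clock speedup: {speedup:,.0f}x")
```

The per-molecule result rounds to the 16¢ quoted above. A speedup close to the core count is what one would expect from a loosely coupled workload, where each compound can be evaluated independently.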


Science-as-a-Service

Globus Genomics, DNAnexus, and Seven Bridges Genomics offer inexpensive, easy-to-use, and secure platforms for processing and analyzing genomic data.

The Weather Company pushes four gigabytes of data to AWS each second in order to deliver 15 billion forecasts each day to their customers around the world.

aws.amazon.com/solutions/case-studies/the-weather-company/

Case Study: Brookhaven National Laboratory

ATLAS: Accelerating Scientific Discovery with AWS


Accelerating Scientific Discovery in the Cloud

Michael Ernst

Brookhaven National Laboratory

June 25, 2015

AWS Government, Education, and Nonprofit Symposium, Washington, D.C.


The Large Hadron Collider at CERN (27 km), with the LHCb, CMS, ALICE, and ATLAS experiments


New Physics Frontiers in LHC Run 2


Big data: Not a buzz word when it comes to ATLAS


LEP dataset: a few TB

ATLAS dataset: 160 PB

NDN


ATLAS workload: Managed by PanDA


Leveraging the AWS Spot market for compute-hungry HEP

• Cloud resources are very valuable to HEP experimental computing, and HEP generally is a big user

• In the past, experimental HEP has used commercial cloud resources little – we want to change that

• We are compute-limited in our science – cloud resources can enrich the science

• Clouds have (cost-efficient) room for us if our workload is fine-grained and flexible, even when the resource occupancy is high

• Just as there’s room for sand in a full jar of rocks, there’s room for us

• Joint project with AWS Scientific Computing team and ESnet

• Scoped out a pilot centered on representative HEP/ATLAS workflows

• AWS contributes precious technical expertise and credits for trial runs

• ESnet contributes expertise and network gear at the AWS/ESnet peering points

• ESnet participation is central to AWS waiving the egress fee (conditions apply)

• Which brings us to our new fine-grained data processing system


• We’ve leveraged new developments in our Workload Management System (PanDA), our parallel software framework, powerful networking, and efficient I/O and storage to implement a new approach to event processing – a fine-grained event service

• An extension to PanDA that allows it to manage event-level workloads (instead of file-level workloads, where hundreds of events are clustered)

• Object stores (e.g. S3) provide highly scalable storage for many small event-scale outputs

Applicable to any workflow (not just HEP) able to support fine-grained partitioning of the processing and its output

Data-intensive, network-centric, platform-agnostic computing

• An increasingly important paradigm in the scientific computing community
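
To make the idea of fine-grained partitioning concrete, here is a minimal sketch of splitting one job into small event ranges and writing one small output object per range. It is not PanDA's interface: the names EventRange, split_into_event_ranges and run_until_preempted are hypothetical, the processing step is a placeholder, and a plain dictionary stands in for an object store such as S3.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EventRange:
    """A small, independently processable slice of a job (hypothetical type)."""
    job_id: str
    first_event: int
    last_event: int  # inclusive

    def output_key(self) -> str:
        # One small output object per event range.
        return f"{self.job_id}/events_{self.first_event:06d}-{self.last_event:06d}.out"


def split_into_event_ranges(job_id: str, n_events: int,
                            events_per_range: int) -> List[EventRange]:
    """Partition events 0..n_events-1 into consecutive ranges."""
    if events_per_range < 1:
        raise ValueError("events_per_range must be at least 1")
    ranges = []
    for first in range(0, n_events, events_per_range):
        last = min(first + events_per_range, n_events) - 1
        ranges.append(EventRange(job_id, first, last))
    return ranges


def process(event_range: EventRange) -> bytes:
    """Placeholder for the real per-event processing."""
    count = event_range.last_event - event_range.first_event + 1
    return f"processed {count} events".encode()


def run_until_preempted(ranges: List[EventRange],
                        object_store: Dict[str, bytes],
                        ranges_before_preemption: int) -> List[EventRange]:
    """Process ranges in order, storing each output as soon as it is ready.

    Returns the ranges that were not finished and must be rescheduled.
    """
    for event_range in ranges[:ranges_before_preemption]:
        object_store[event_range.output_key()] = process(event_range)
    return ranges[ranges_before_preemption:]


# In-memory dict standing in for an object store such as S3.
object_store: Dict[str, bytes] = {}
ranges = split_into_event_ranges("job42", n_events=10, events_per_range=4)
remaining = run_until_preempted(ranges, object_store, ranges_before_preemption=2)
print(sorted(object_store))
print(remaining)
```

Because each range's output is stored as soon as it is produced, a worker that disappears partway through loses at most the range it was working on, and the unfinished ranges can be handed to another worker.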


ATLAS simulated event production is currently running on EC2 using the event service

• PanDA “Site” at BNL sends jobs to EC2 Spot Market VMs

• Exercising scaling to >50k concurrent jobs, entering production soon

• Event Service maximizes return on short-lived job slots (~1h)

• Leverages capability from the BNL Tier 1 to elastically and transparently expand workloads into cloud resources: after dedicated resources are fully utilized, jobs overflow into the cloud to accommodate peak demands
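
The overflow behavior in the last bullet can be illustrated with a toy placement rule: fill the free dedicated slots first, and send only the excess to the cloud. This is a sketch of the idea, not the BNL implementation; place_jobs and its arguments are hypothetical names.

```python
from typing import Dict, List


def place_jobs(job_ids: List[str], free_local_slots: int) -> Dict[str, str]:
    """Assign jobs to dedicated ("local") slots first, overflowing to "cloud".

    Jobs are taken in queue order; once the free local slots are used up,
    every remaining job is placed in the cloud.
    """
    placement = {}
    for index, job_id in enumerate(job_ids):
        placement[job_id] = "local" if index < free_local_slots else "cloud"
    return placement


queue = ["sim-1", "sim-2", "sim-3", "sim-4", "sim-5"]
placement = place_jobs(queue, free_local_slots=3)
print(placement)
```

With three free dedicated slots and five queued jobs, the first three stay local and the last two overflow to the cloud. When the queue fits within the dedicated slots, nothing is sent to the cloud.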


Using cloud resources effectively: A policy-based cloud scheduler

Policy

Fully transparent to Workload Management System (e.g. PanDA). Elastically expands pool of compute resources according to user-defined policy.

Demand-driven, policy-based programmatic instantiation and contraction of cloud resources
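
One way to picture demand-driven, policy-based scaling is as a function from current job demand and a user-defined policy to a target number of cloud VMs. The sketch below is illustrative only: ScalingPolicy, its fields, and the one-dimensional policy (jobs per VM with lower and upper bounds) are assumptions, not the scheduler shown on the slide.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """User-defined limits for the cloud pool (hypothetical fields)."""
    jobs_per_vm: int  # how many jobs one VM can run at once
    min_vms: int      # never shrink below this
    max_vms: int      # never grow beyond this (e.g. a budget cap)


def target_vm_count(demand_jobs: int, policy: ScalingPolicy) -> int:
    """Number of VMs needed for the demand, clamped to the policy bounds."""
    wanted = math.ceil(demand_jobs / policy.jobs_per_vm)
    return max(policy.min_vms, min(policy.max_vms, wanted))


def scaling_action(running_vms: int, demand_jobs: int, policy: ScalingPolicy) -> int:
    """Positive: VMs to start. Negative: VMs to retire. Zero: no change."""
    return target_vm_count(demand_jobs, policy) - running_vms


policy = ScalingPolicy(jobs_per_vm=8, min_vms=0, max_vms=100)
print(scaling_action(running_vms=0, demand_jobs=20, policy=policy))     # expand
print(scaling_action(running_vms=5, demand_jobs=20, policy=policy))     # contract
print(scaling_action(running_vms=90, demand_jobs=10_000, policy=policy))  # capped
```

Here demand_jobs is meant as queued plus running jobs, so that contraction does not retire VMs that are still busy. Keeping the decision in a pure function separate from the code that actually starts or stops VMs is one way the workload management system can remain unaware of the scaling.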


Elastic cluster: “Flexible and nimble” provisioning

Programmatically instantiates compute resources in the cloud

Designed to serve:

• Peak demands

• Users without dedicated resources

• Dynamic creation of specific resource types (e.g. DB, storage, DTNs)

Goal: setup time <5% of total compute time
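
The setup-time goal can be expressed as a simple ratio check. The sketch assumes that "total compute time" means setup plus useful compute on the same resource, which the slide does not spell out; the function names and the example durations are illustrative.

```python
def setup_fraction(setup_minutes: float, compute_minutes: float) -> float:
    """Share of a resource's lifetime spent on setup (assumed definition)."""
    return setup_minutes / (setup_minutes + compute_minutes)


def meets_setup_goal(setup_minutes: float, compute_minutes: float,
                     limit: float = 0.05) -> bool:
    """True when setup stays below the stated 5% goal."""
    return setup_fraction(setup_minutes, compute_minutes) < limit


# Example: a one-hour slot with 2 minutes of setup versus 5 minutes of setup.
print(meets_setup_goal(setup_minutes=2, compute_minutes=58))
print(meets_setup_goal(setup_minutes=5, compute_minutes=55))
```

Under this reading, a resource that lives for about an hour has roughly three minutes available for setup before it misses the goal.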

HTCondor


Architectural overview from the facility perspective


Connecting AWS Facilities to the Research Community

Figure labels: 100G R&E Exchange; Direct Connect, ESnet Pilot 2x10G; AWS Planned 100G to PNWG; Seattle; Direct Connect, ESnet Pilot 1x10G


Image Authoring and Runtime Configuration

Design goals:

• Useful for ATLAS, but usable by other VOs.

• Eliminate runtime RPM installation. Fatal with O(1000) startups.

• Images deterministically reproducible. No snapshotting.

• Provide the ability for other users to do it themselves (make toolset public).

• Flexibility between build-time and runtime customization. Both options OK.

• Open source only. Only use functions/services for which open source equivalents exist (EC2, S3).

• Off-the-shelf, non-cloud (Puppet, Hiera, Condor, Yum) wherever possible. Off-the-shelf cloud (cloud-init, Imagefactory/Oz) only where needed.

• Keep custom parts small, simple, and/or optional.

10,000 ft summary:

• Imagefactory 1.1.7 generates VMs from merged hierarchical templates.

• Masterless Puppet consumes a single Hiera file (injected via cloud-init write_files) at boot.
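
The boot-time step in the summary can be sketched as a small generator for cloud-init user data: one write_files entry drops the Hiera file onto the VM, and one runcmd entry runs Puppet without a master. This is an illustration under assumptions, not the toolset described here: the file paths, the manifest name and build_user_data are hypothetical, and the output relies on YAML parsers accepting JSON syntax.

```python
import json


def build_user_data(hiera_yaml: str) -> str:
    """Return cloud-config user data that injects a Hiera file and applies Puppet."""
    cloud_config = {
        # cloud-init's write_files module writes each entry to disk at boot.
        "write_files": [
            {
                "path": "/etc/puppet/hieradata/node.yaml",  # hypothetical path
                "permissions": "0644",
                "content": hiera_yaml,
            }
        ],
        # Masterless run: apply a manifest already baked into the image.
        "runcmd": [
            ["puppet", "apply", "/etc/puppet/manifests/site.pp"],  # hypothetical path
        ],
    }
    # The first line marks the payload as cloud-config. The body is emitted
    # as JSON on the assumption that the YAML parser accepts JSON syntax.
    return "#cloud-config\n" + json.dumps(cloud_config, indent=2) + "\n"


user_data = build_user_data("role: worker\n")
print(user_data)
```

Keeping all per-instance settings in a single injected file means the image itself can stay generic, which fits the goal of flexibility between build-time and runtime customization.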


Build Framework


Final remarks

• ATLAS has met the challenge of data-intensive computing at a scale not seen before

• Resource virtualization - integration of storage, compute and network - in a seamless manner, including cloud and local resources

• A rather complete and still growing set of AWS services to instantiate VMs, allocate storage, and network dynamically

• New innovations like the Event Service allow ATLAS to efficiently harvest EC2 spot market resources to meet its computing growth needs

• The joint project with the AWS Scientific Computing team and ESnet has been crucial to the successful implementation


Additional resources

• aws.amazon.com/hpc

• aws.amazon.com/big-data

• aws.amazon.com/grants

• aws.amazon.com/genomics

• aws.amazon.com/compliance

• aws.amazon.com/security


Thank You.

This presentation will be loaded to SlideShare the week following the Symposium.

http://www.slideshare.net/AmazonWebServices
