
(BDT311) MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads | AWS re:Invent 2014


Page 1

November 14, 2014 | Las Vegas, NV

Jason Stowe, Cycle Computing

Patrick Saris, USC

David Hinz, HGST

Page 2

BETTER ANSWERS FASTER


Page 3

We believe access to

Cloud Cluster Computing

accelerates invention & discovery

Page 4

Cluster Computing is Everywhere

• Strategic Answers

• Speed & Agility


Page 5

Jevons Paradox

• UK in the 1860s: "we need a fixed amount of steam power"

• People thought: more efficient coal use = use less coal

• Jevons disagreed!

Page 6

Jevons Paradox

• Jevons was contrarian:

Increasing efficiency in turning coal to steam, making the interface simpler to consume, radically increases demand.

Page 7

Cloud helps capacity… Fixed clusters are:

Too small when needed most,

Too large every other time…

But this work is hard to move: data scheduling, encryption, multi-AZ, security, etc.

Cycle powers access at scale

Page 8

Cycle solutions help access

[Diagram: a drug designer submits work through cloud orchestration and a data workflow; an internal cluster (500 servers, 100% full) bursts molecule data to a 10,600-server cluster container in the cloud]

40 years of drug design in 9 hours

3 new compounds, $4,372 in Spot

Page 9

Thanks to cloud, people can:

Ask the right questions

Get better answers, faster

Page 10

Record Scale, Enterprise Speed

• Very innovative work by:

– Patrick Saris, USC

– David Hinz, HGST

• Both will show the importance of:

– Asking the right question, regardless of scale

– Getting results faster to increase throughput

Page 11

November 14, 2014 | Las Vegas, NV

Patrick Saris, University of Southern California

Page 12

[Pie chart of energy sources]

Fossil Fuels: 79%
Nuclear: 10%
Renewables: 11% (Biomass 5.6%, Hydroelectric 3.1%, Wind 2.0%, Solar 0.4%, Geothermal 0.3%)

Source: U.S. Energy Information Administration, Monthly Energy Review, Table 1.2

Page 13
Page 14

+

Donor

Acceptor

-V

Page 15

+

Donor

Acceptor

-V

Page 16

Indigo, Chlorophyll, Graphite fragments

Page 17
Page 18

[Scatter plots: Calculation vs. Experiment, both axes 0-25]

Page 19

Indigo, Chlorophyll, Graphite fragments

Page 20

nitrogens

Page 21

[Energy-level diagram: structures 1-4, levels at -3.66 eV and -2.65 eV]

Mat Halls, Schrodinger Inc.

Page 22
Page 23

nitrogen

3,473

Page 24

phenyl

3,473

553,855

Page 25
Page 26
Page 27

+

Donor

Acceptor

-V

Page 28
Page 29
Page 30

'Band gap' of parent structure

Pages 31-40
Page 41

Production Cycle Deployment (first live deployment 2008)

[Architecture diagram: jobs and scheduled data flow from an internal HPC cluster and file system (PBs), if an internal cluster exists, to an auto-scaling external HPC cluster environment backed by blob data (S3), a Cloud Filer, and Glacier cold storage]

Page 42

Metric                   Count
Compute Hours of Work    2,312,959 hours
Compute Days of Work     96,373 days
Compute Years of Work    264 years
Molecule Count           205,000 materials
Run Time                 < 18 hours
Max Scale (cores)        156,314 cores across 8 regions
Max Scale (instances)    16,788 instances
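The totals in this table are internally consistent; a quick sanity check of the unit conversions (hours to days to years) reproduces the reported figures:

```python
# Sanity-check the MegaRun totals: the compute-days and compute-years
# figures follow directly from the reported compute hours.
compute_hours = 2_312_959

compute_days = compute_hours / 24    # hours -> days
compute_years = compute_days / 365   # days -> years

print(round(compute_days))   # 96373, matching the slide
print(round(compute_years))  # 264, matching the slide
```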

Page 43

How did we do this?

[Diagram: a JUPITER distributed queue and data store feeding auto-scaling execute nodes]

Automated in 8 cloud regions, 4 continents, double resiliency

…14 nodes controlling 16,788 instances
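Those fleet numbers imply a large control-plane fan-out; a rough back-of-the-envelope from the reported scale figures:

```python
# Rough ratios from the reported MegaRun scale figures.
instances = 16_788
cores = 156_314
control_nodes = 14

# Each control node oversaw roughly 1,200 instances.
instances_per_controller = instances / control_nodes
# About 9.3 cores per instance on average, since the fleet mixed instance types.
cores_per_instance = cores / instances

print(round(instances_per_controller), round(cores_per_instance, 1))  # 1199 9.3
```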

Pages 44-46
Page 47

© 2014 HGST, INC.

David Hinz
Global Director, IS&T
Cloud and DataCenter Computing Solutions

Cost Effective High Performance Computing on Amazon Web Services

Page 48


Agenda

• Who is HGST and how is HPC used?

• HGST’s AWS HPC Journey

• Use of Cloudability for Cost Analysis and RI Planning

• What’s Next

Page 49


[Product portfolio diagram: Capacity Enterprise, Performance Enterprise, Cloud & Datacenter, Enterprise SSD (+3 acquisitions in 2013); 7200 RPM & CoolSpin HDDs, Ultrastar®, Ultrastar® & MegaScale DC™, 10K & 15K HDDs, PCIe, SAS]

HGST History

Founded in 2003 through the combination of the hard drive businesses of IBM, the inventor of the hard drive, and Hitachi, Ltd ("Hitachi")

Acquired by Western Digital in 2012

More than 4,200 active worldwide patents

Headquartered in San Jose, California

Approximately 42,000 employees worldwide

Develops innovative, advanced hard disk drives (HDD), enterprise-class solid state drives (SSD), external storage solutions and services

Delivers intelligent storage devices that tightly integrate hardware and software to maximize solution performance

[Image: Ultrastar He6 with HelioSeal™ technology]

Page 50


HPC Modeling and Simulation: HGST's Innovation Engine

• Improved Mechanical Innovation
- Internal/External Mechanical Structural Analysis of HDD
- Critical Lubricant Attributes and Physics
- Airflow / He inside HDD
- Optimal combination of HDD head and media compositions, spindle design, lubricants
- Storage Array: HDD location, airflow investigations

• Faster Areal Density Improvements
- Micro magnetic analysis for Heat Assisted Magnetic Recording (HAMR)
- Head-Medium Spacing (HMS)

[Diagram: Magnetic Medium, Magnetic Head Sensors, Magnetic Spacing, Trailing Edge of Slider]

HPC Doing The "Physics Work" Driving HGST Innovation

Page 51


HGST's HPC AWS Evolution

• Stage 1: PoC
- First HPC PoC: Sept 2013

• Stage 2: Small Start
- 1st HPC Production Cluster: Nov 2013

• Stage 3: Optimize Workloads And Flexibility
- AWS C3 Deployments: Jan 2014
- 4th HPC Production Cluster: June 2014

• Stage 4: Lower Cost
- Use Spot And Reserved Instances: Oct / Nov / Dec 2014

• Stage 5: Business Metrics
- Utilization and cost reports to HGST engineers: Dec 2014

Page 52


Stage 1 + 2 + 3: Shape and Scale Compute

Fluid Dynamics: 1.4x overall throughput gain
Molecular Dynamics: 1.67x initial overall throughput gain
Micro Magnetics:

Parameter Sweeps    Throughput Gain
Model 1             1.23x - 1.78x
Model 2             1.01x - 1.67x
Model 3             1.23x - 1.69x
Model 4             Up to 2.7x

Simulation Type                     Throughput Increase
Head Drive Interface Vacuum Gaps    1.99x
Vacuum Gap "collection"             4.00x
Media Grains for HAMR (FePt/C)      2.03x
4 Carbon Molecule Clusters          5.67x

Page 53


Stage 3: Optimize For Workload Flexibility

Not all workloads require the same compute resources 24 x 7 x 365

Page 54


Stage 4: Spot Instances For Advanced HDD Research

[Diagram: new drive head design workloads; submit jobs, orchestrate HPC clusters over VPN; encrypt, route data to AWS, return results; HPC cluster of 1,024 cores of Spot instances, backed by EBS]

Simulated 22 advanced head designs across 3 materials possibilities = 15 compute years

Used AWS c3 instances

6x faster run-time: ran in 5 days, not 30!

Total cost: $4,026.02
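The 5-day figure follows from the cluster size; a quick check that 15 compute-years on 1,024 cores lands near the reported wall-clock time:

```python
# Back-of-the-envelope: wall-clock days to burn 15 core-years on 1,024 cores.
core_years = 15
cores = 1024

core_hours = core_years * 365 * 24   # total work in core-hours
wall_days = core_hours / cores / 24  # ideal wall-clock days at full utilization

print(round(wall_days, 1))  # ~5.3 days, in line with "ran in 5 days"
```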

Page 55


World's Largest F500 Cloud Run: Transforming drive design to store the world's data

[Diagram: new drive head design workloads; submit jobs, orchestrate HPC clusters over VPC; encrypt, route data to AWS, return results; cluster with 70,908 cores of Spot instances, backed by EBS]

Run 1 Million drive head designs = 70.75 core-years

90x throughput: ran in 8 hours, not 30 days!

3 days from idea to running!

70,908 cores, 729 TFLOPS

c3, r3 with Intel IvyBridge

Cost: $5,594, Spot Instances
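The headline ratios on this slide check out: 90x is exactly the wall-clock ratio of 30 days to 8 hours, and the stated core-years of work line up with the cluster size:

```python
# Verify the headline ratios reported for the HGST MegaRun.
baseline_days = 30
actual_hours = 8
speedup = baseline_days * 24 / actual_hours  # 90x throughput claim

core_years = 70.75
cores = 70_908
# Ideal wall-clock hours for that much work on the full fleet (~8.7 h,
# consistent with the reported "ran in 8 hours" plus ramp-up).
ideal_hours = core_years * 365 * 24 / cores

print(speedup, round(ideal_hours, 1))
```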

Page 56


Stage 4 + 5: Optimize For Utilization and Cost

Great Solutions Are Available To Ease Optimization Effort

Page 57


Summary

• HGST’s HPC AWS Journey ~15 months

• Take The Right Steps Along The Journey To the Cloud

• Pick The Right Partners and Tools For Success

• Continually Evaluate Environment and Needs

Page 58


Thank You

Page 59

Take advantage of efficiency

• Find more uses for this efficient, inexpensive compute

Please ask the right questions, get answers quickly

Go invent and discover!

Page 60

THANK YOU
