
Science in the Clouds and Beyond

Lavanya Ramakrishnan

Computational Research Division (CRD) &
National Energy Research Scientific Computing Center (NERSC)
Lawrence Berkeley National Lab

The goal of Magellan was to determine the appropriate role for cloud computing for science

Magellan users by Program Office:
•  Advanced Scientific Computing Research: 17%
•  Biological and Environmental Research: 9%
•  Basic Energy Sciences (Chemical Sciences): 10%
•  Fusion Energy Sciences: 10%
•  High Energy Physics: 20%
•  Nuclear Physics: 13%
•  Advanced Networking Initiative (ANI) Project: 3%
•  Other: 14%

Figure: user survey of reasons for interest in cloud computing (responses, 0-90% axis). Survey categories:
•  Access to additional resources
•  Access to on-demand (commercial) paid resources closer to deadlines
•  Ability to control software environments specific to my application
•  Ability to share setup of software or experiments with collaborators
•  Ability to control groups/users
•  Exclusive access to the computing resources / ability to schedule independently of other groups/users
•  Easier to acquire/operate than a local cluster
•  Cost associativity (i.e., I can get 10 CPUs for 1 hour now or 2 CPUs for 5 hours at the same cost)
•  MapReduce programming model / Hadoop
•  Hadoop File System
•  User interfaces / science gateways: use of clouds to host science gateways and/or access to cloud resources through science gateways

Magellan was architected for flexibility and to support research

Magellan testbed (architecture overview):
•  Compute servers: 720 Nehalem dual quad-core 2.66 GHz nodes, 24 GB RAM and 500 GB disk each; totals of 5,760 cores, 40 TF peak, 21 TB memory, 400 TB disk
•  Interconnect: QDR InfiniBand
•  Flash storage servers: 10 compute/storage nodes, 8 TB high-performance flash, 20 GB/s bandwidth
•  Big memory servers: 2 servers, 2 TB memory
•  Global storage: 1 PB GPFS
•  Gateway nodes: 27
•  I/O nodes: 9
•  Management nodes: 2
•  Networking: aggregation switch and router, ESnet at 10 Gb/s, ANI at 100 Gb/s (future)
•  Archival storage

Science + Clouds = ?

•  Data-intensive science technologies from the cloud
•  Business model for science
•  Performance and cost

Scientific applications with minimal communication are best suited for clouds

Figure: application runtime relative to Magellan (non-VM) for GAMESS, GTC, IMPACT, fvCAM, and MAESTRO256 on Carver, Franklin, Lawrencium, EC2-Beta-Opt, Amazon EC2, and Amazon CC (lower is better).
Source: Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud, CloudCom 2010.

Scientific applications with minimal communication are best suited for clouds

Figure: runtime relative to Magellan for MILC and PARATEC on Carver, Franklin, Lawrencium, EC2-Beta-Opt, and Amazon EC2 (lower is better).

The principal decrease in bandwidth occurs when switching to TCP over IB.

Figure: HPCC PingPong bandwidth (GB/s) vs. number of cores (32-1024) for IB, TCPoIB, 10G TCPoEth, and Amazon CC (higher is better); annotations mark the best case (native IB) and roughly 2x and 5x gaps relative to it.
Source: Evaluating Interconnect and Virtualization Performance for High Performance Computing, ACM Performance Evaluation Review, 2012.
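For readers who want to see what the HPCC ping-pong numbers measure, below is a minimal sketch of a ping-pong microbenchmark written with mpi4py and NumPy (both assumptions; the actual HPCC benchmark is C/MPI code, and the 1 MiB message size and 100 iterations here are arbitrary choices). Two ranks bounce a buffer back and forth; latency is the average one-way time and bandwidth is message size divided by that time.

    # Illustrative ping-pong microbenchmark (not HPCC itself).
    # Run with two ranks, e.g.: mpirun -n 2 python pingpong.py
    import time
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    NBYTES = 1 << 20                  # 1 MiB message (arbitrary choice)
    ITERS = 100                       # round trips to average over
    buf = np.zeros(NBYTES, dtype=np.uint8)

    comm.Barrier()                    # line both ranks up before timing
    start = time.perf_counter()
    for _ in range(ITERS):
        if rank == 0:
            comm.Send(buf, dest=1)    # send the buffer to the partner...
            comm.Recv(buf, source=1)  # ...and wait for it to come back
        elif rank == 1:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    elapsed = time.perf_counter() - start

    if rank == 0:
        one_way = elapsed / (2 * ITERS)                 # seconds per one-way message
        print(f"latency   ~ {one_way * 1e6:.1f} us")
        print(f"bandwidth ~ {NBYTES / one_way / 1e9:.2f} GB/s")

In practice latency is measured with tiny messages and bandwidth with large ones; the single message size above keeps the sketch short.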

Ethernet connections are unable to cope with significant amounts of network contention

Figure: HPCC random-order bandwidth (GB/s) vs. number of cores (32-1024) for IB, TCPoIB, 10G TCPoEth, Amazon CC, 10G TCPoEth VM, and 1G TCPoEth (higher is better); annotation marks the best case.

The principal increase in latency occurs for TCP over IB even at mid-range concurrency

Figure: HPCC PingPong latency (us) vs. number of cores (32-1024) for IB, TCPoIB, 10G TCPoEth, Amazon CC, 10G TCPoEth VM, and 1G TCPoEth (lower is better); annotations mark the best case and a roughly 40x gap.

Latency is affected by contention to a greater degree than bandwidth

Figure: HPCC RandomRing latency (us) vs. number of cores (32-1024) for IB, TCPoIB, 10G TCPoEth, Amazon CC, 10G TCPoEth VM, and 1G TCPoEth (lower is better); annotations mark the best case and a roughly 6x gap for the 10G VM case.

Clouds require significant programming and system administration support


•  STAR performed real-time analysis of data coming from Brookhaven National Lab
•  First time the data was analyzed in real time to this extent
•  Leveraged an existing OS image from a NERSC system
•  Started out with 20 VMs at NERSC and expanded to ANL

On-demand access for scientific applications might be difficult if not impossible

Figure: number of cores required to run a job immediately upon submission to Franklin.

Public clouds can be more expensive than in-house large systems


Component                        Cost
Compute Systems (1.38B hours)    $180,900,000
HPSS (17 PB)                     $12,200,000
File Systems (2 PB)              $2,500,000
Total (Annual Cost)              $195,600,000

Assumes 85% utilization and zero growth in HPSS and File System data. Doesn't include the 2x-10x performance impact that has been measured.

This still only captures about 65% of NERSC's $55M annual budget: no consulting staff, no administration, no support.
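As a quick sanity check on the compute line above, here is a back-of-the-envelope sketch of the implied unit price. It assumes the 1.38 billion hours are billed at a single uniform rate, which the table does not state; the numbers are copied from the table, nothing else is added.

    # Implied unit price of the compute line in the table above.
    compute_cost = 180_900_000    # USD per year, from the table
    hours = 1.38e9                # hours per year, from the table
    print(f"${compute_cost / hours:.3f} per hour")   # prints roughly $0.131 per hour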

Cloud is a business model and can be applied to HPC centers


Traditional Enterprise IT vs. HPC Centers:
•  Typical load average: ~30% (enterprise IT) vs. ~90% (HPC centers)
•  Computational needs: enterprise IT has bounded requirements, sufficient to meet customer demand or transaction rates; HPC centers have virtually unbounded requirements, since scientists always have larger, more complicated problems to simulate or analyze
•  Scaling approach: enterprise IT scales in, with emphasis on consolidating onto a node using virtualization; HPC centers scale out, with applications running in parallel across multiple nodes

Cloud vs. HPC Centers:
•  NIST definition: clouds offer resource pooling, broad network access, measured service, rapid elasticity, and on-demand self-service; HPC centers offer resource pooling, broad network access, and measured service, with limited rapid elasticity and on-demand self-service
•  Workloads: clouds target high-throughput workloads with modest data; HPC centers run highly synchronous, large-concurrency parallel codes with significant I/O and communication
•  Software stack: clouds provide flexible, user-managed custom software stacks; HPC centers provide access to parallel file systems and low-latency, high-bandwidth interconnects, with preinstalled, pre-tuned application software stacks for performance

Science + Clouds = ?

•  Data-intensive science technologies from the cloud
•  Business model for science
•  Performance and cost

MapReduce shows promise but current implementations have gaps for scientific applications

Use cases: high-throughput workflows; scaling up from desktops

Gaps:
•  File system: non-POSIX
•  Language: Java
•  Input and output formats: mostly line-oriented text
•  Streaming mode: restrictive input/output model (see the sketch below)
•  Data locality: what happens with multiple inputs?
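To make the streaming restriction concrete, here is a minimal sketch of a Hadoop Streaming mapper: input arrives as plain text lines on stdin and output must be emitted as tab-separated key/value lines on stdout, which is a poor fit for binary scientific formats. The job paths, file names, and reducer named in the comments are illustrative assumptions, not the Magellan configuration.

    # mapper.py -- minimal Hadoop Streaming mapper sketch.
    # A streaming job is launched roughly like this (jar location and paths illustrative):
    #   hadoop jar hadoop-streaming.jar \
    #       -input /user/alice/sequences -output /user/alice/counts \
    #       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
    # The same framework can also read a POSIX file system (e.g., GPFS) via file:// paths
    # instead of HDFS, trading away data locality.
    import sys

    for line in sys.stdin:              # streaming hands the mapper one text line at a time
        record = line.strip()
        if not record:
            continue
        key = record.split()[0]         # e.g., treat the first token as the key
        print(f"{key}\t1")              # emit key<TAB>value, the default text format streaming expects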

Streaming adds a performance overhead

Figure: performance overhead of Hadoop streaming mode.
Source: Evaluating Hadoop for Science, in submission.

High performance file systems can be used with MapReduce at lower concurrency

Figure: TeraGen (1 TB) completion time (minutes) vs. number of maps (up to 3000) on HDFS and GPFS, with linear and exponential trend lines (lower is better).

Data operations impact the performance differences

Schemaless databases show promise for scientific applications

Diagram: several "manager.x" processes feed a central schemaless database ("Brain"); example from the Materials Project, www.materialsproject.org. Source: Michael Kocher, Daniel Gunter.
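As an illustration of what "schemaless" buys a science project, the sketch below stores two differently shaped records in one MongoDB collection via pymongo. MongoDB/pymongo and all field, database, and collection names here are assumptions for the example, not details of the Materials Project deployment.

    # Two documents with different shapes can live in the same collection -- no schema migration needed.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # illustrative connection string
    coll = client["materials_demo"]["calculations"]     # database/collection names are made up

    coll.insert_one({
        "formula": "LiFePO4",
        "energy_eV": -191.33,                           # illustrative value
        "kpoints": [4, 4, 4],
    })
    coll.insert_one({
        "formula": "Si",
        "band_gap_eV": 1.1,                             # different fields, same collection
        "provenance": {"code": "VASP", "version": "5.x"},
    })

    for doc in coll.find({"formula": "Si"}):            # query by any field that exists
        print(doc)

The point is that new document shapes can be added as workflows evolve, without a schema migration.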

Data centric infrastructure will need to evolve to handle large scientific data volumes

The Joint Genome Institute, the Advanced Light Source, and other facilities are all facing a data tsunami.

Cloud is a business model and can be applied at DOE supercomputing centers

•  Current-day cloud computing solutions have gaps for science
   –  performance, reliability, stability
   –  programming models are difficult for legacy apps
   –  security mechanisms and policies
•  HPC centers can adopt some of the technologies and mechanisms
   –  support for data-intensive workloads
   –  allow custom software environments
   –  provide different levels of service


Acknowledgements

•  US Department of Energy DE-AC02-05CH11231
•  Magellan
   –  Shane Canon, Tina Declerck, Iwona Sakrejda, Scott Campbell, Brent Draney
•  Amazon Benchmarking
   –  Krishna Muriki, Nick Wright, John Shalf, Keith Jackson, Harvey Wasserman, Shreyas Cholia
•  Magellan/ANL
   –  Susan Coghlan, Piotr T Zbiegiel, Narayan Desai, Rick Bradshaw, Anping Liu
•  NERSC
   –  Jeff Broughton, Kathy Yelick
•  Applications
   –  Jared Wilkening, Gabe West, Ed Holohan, Doug Olson, Jan Balewski, STAR collaboration, K. John Wu, Alex Sim, Prabhat, Suren Byna, Victor Markowitz
