Architectural Trends and Programming Model Strategies for Large-Scale Machines
Katherine Yelick
U.C. Berkeley and Lawrence Berkeley National Lab
http://titanium.cs.berkeley.edu
http://upc.lbl.gov
Architecture Trends
• What happened to Moore's Law?
• Power density and system power
• Multicore trend
• Game processors and GPUs
• What this means to you
Moore’s Law is Alive and Well
2X transistors/chip every 1.5 years: called "Moore's Law"
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
But Clock Scaling Bonanza Has Ended
• Processor designers forced to go "multicore":
  • Heat density: a faster clock means hotter chips
  • More cores with lower clock rates burn less power
• Declining benefits of "hidden" Instruction Level Parallelism (ILP)
  • Last generation of single-core chips probably over-engineered
  • Lots of logic/power spent finding ILP, but it wasn't in the apps
• Yield problems
  • Parallelism can also be used for redundancy
  • IBM Cell processor has 8 small cores; a blade system with all 8 sells for $20K, whereas a PS3 is about $600 and uses only 7
Clock Scaling Hits Power Density Wall
[Figure: power density (W/cm²) vs. year, 1970-2010, for processors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through Pentium and P6, rising toward reference levels labeled Hot Plate, Nuclear Reactor, Rocket Nozzle, and Sun's Surface.]
Source: Patrick Gelsinger, Intel®
Scaling clock speed (business as usual) will not work
Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
• Clock speed is not
• Number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Petaflop with ~1M Cores By 2008
[Figure: Top500 performance trend (data from top500.org), 1993-2014, on a log scale from 1 Gflop/s to 1 Eflop/s, showing the SUM, #1, and #500 lines. Annotations: "1 PFlop system in 2008", "6-8 years", "Common by 2015?".]
Slide source: Horst Simon, LBNL
NERSC 2005 Projections for Computer Room Power (System + Cooling)
[Figure: projected computer-room power in megawatts, 2001-2017, rising toward ~70 MW, stacked by contribution: Cooling, N8, N7, N6, N5b, N5a, NGF, Bassi, Jacquard, N3E, N3, PDSF, HPSS, Misc.]
Power is a system-level problem, not just a chip-level problem.
Concurrency for Low Power
• Highly concurrent systems are more power efficient
  • Dynamic power is proportional to V²fC
  • Increasing frequency (f) also increases supply voltage (V): a more-than-linear effect
  • Increasing cores increases capacitance (C) but has only a linear effect
• Hidden concurrency burns power
  • Speculation, dynamic dependence checking, etc.
• Push parallelism discovery to software (compilers and application programmers) to save power
• Challenge: Can you double the concurrency in your algorithms every 18 months?
Cell Processor (PS3) on Scientific Kernels
• Very hard to program: explicit control over memory hierarchy
Game Processors Outpace Moore’s Law
• Traditionally too specialized (no scatter, no inter-core communication, no double-precision floating point), but the trend is toward generalization
• Still have control over memory hierarchy/partitioning
[Figure: speedup of game/graphics processors on scientific kernels.]
What Does this Mean to You?
• The world of computing is going parallel
  • Number of cores will double every 12-24 months
  • Petaflop (1M processor) machines common by 2015
• More parallel programmers to hire ☺
• Climate codes must have more parallelism
  • Need for fundamental rewrites and new algorithms
  • Added challenge of combining with adaptive algorithms
• New programming model or language likely ☺
  • Can HPC benefit from investments in parallel languages and tools?
  • One programming model for all?
Performance on Current Machines
Oliker, Wehner, Mirin, Parks and Worley
• Current state-of-the-art systems attain around 5% of peak at the highest available concurrencies
  • Note: the current algorithm uses OpenMP where possible to increase parallelism
• Unless we can do better, peak performance of the system must be 10-20x the sustained requirement (see the check below)
• Limitations (from separate studies of scientific kernels)
  • Not just memory bandwidth: latency, compiler code generation, ...
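A quick check of the 10-20x figure: if a machine sustains only a fraction e of its peak, the peak must be 1/e times the sustained requirement, so at the ~5% observed here,

    \[
    \text{Peak}_{\text{required}} \;=\; \frac{\text{Sustained requirement}}{e}
    \;=\; \frac{\text{Sustained requirement}}{0.05}
    \;=\; 20 \times \text{Sustained requirement}.
    \]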
[Figure: percent of theoretical peak (3%-11%) at concurrencies P=256, P=336, and P=672 for Power3, Itanium2, Earth Simulator, X1, and X1E.]
Strawman 1km Climate Computer
Oliker, Shalf, Wehner
• Cloud-system-resolving global atmospheric model
  • 0.015° x 0.02° horizontally with 100 vertical levels
• Caveat: a back-of-the-envelope calculation, with many assumptions about scaling the current model (an arithmetic check follows below)
• To achieve simulation time 1000x faster than real time:
  • ~10 Petaflops sustained performance requirement
  • ~100 Terabytes total memory
  • ~2 million horizontal subdomains
  • ~10 vertical domains
  • ~20 million processors at 500 Mflop/s each sustained, inclusive of communication costs
  • 5 MB memory per processor
  • ~20,000 nearest-neighbor send-receive pairs per subdomain per simulated hour, of ~10KB each
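The headline numbers above are mutually consistent; a quick check using only figures from the slide:

    \[
    \begin{aligned}
    (2\times 10^{6}\ \text{horizontal subdomains}) \times (10\ \text{vertical domains}) &\approx 2\times 10^{7}\ \text{processors},\\
    (2\times 10^{7}\ \text{processors}) \times (500\ \text{Mflop/s each}) &= 10^{16}\ \text{flop/s} = 10\ \text{Pflop/s sustained},\\
    (100\ \text{TB total memory}) \div (2\times 10^{7}\ \text{processors}) &= 5\ \text{MB per processor}.
    \end{aligned}
    \]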
A Programming Model Approach:
Partitioned Global Address Space (PGAS) Languages
What, Why, and How
Parallel Programming Models
• Easy parallel software is still an unsolved problem!
• Goals:
  • Making parallel machines easier to use
  • Ease algorithm experiments
  • Make the most of your machine (network and memory)
  • Enable compiler optimizations
• Partitioned Global Address Space (PGAS) Languages
  • Global address space like threads (programmability)
  • One-sided communication
  • Currently static (SPMD) parallelism, like MPI
  • Local/global distinction, i.e., layout matters (performance)
Partitioned Global Address Space
• Global address space: any thread/process may directly read/write data allocated by another
• Partitioned: data is designated as local or global (a short UPC sketch follows this slide)
[Figure: a global address space spanning processes p0 ... pn; each process holds private local variables (l) and pointers/data in the shared space (g, and objects x, y with per-process values such as x: 1, x: 5, x: 7).]
By default:
• Object heaps are shared
• Program stacks are private
• 3 current languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Emphasis in this talk on UPC and Titanium (based on Java)
• 3 emerging languages: X10, Fortress, and Chapel
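To make the model concrete, here is a minimal UPC sketch (illustrative, not taken from the talk): the shared array is partitioned one element per thread, the ordinary int stays private to each thread, and remote elements are read with plain one-sided accesses.

    #include <upc.h>
    #include <stdio.h>

    shared int x[THREADS];   /* shared: element i lives in thread i's partition */
    int l;                   /* private: every thread has its own copy */

    int main(void) {
        l = MYTHREAD;                 /* purely local work */
        x[MYTHREAD] = l + 1;          /* write my element of the shared array */
        upc_barrier;                  /* make all writes visible */
        if (MYTHREAD == 0) {
            int sum = 0;
            for (int i = 0; i < THREADS; i++)
                sum += x[i];          /* one-sided reads; remote ones become gets */
            printf("sum over %d threads = %d\n", THREADS, sum);
        }
        return 0;
    }

With the Berkeley UPC compiler this would be built and launched with upcc and upcrun; the data layout (one element per thread here) is what gives the local/global distinction its performance meaning.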
PGAS Languages for Hybrid Parallelism
• PGAS languages are a good fit to shared memory machines
  • Global address space implemented as reads/writes
  • Current UPC and Titanium implementations use threads
  • Working on System V shared memory for UPC
• PGAS languages are a good fit to distributed memory machines and networks
  • Good match to modern network hardware
  • Decoupling data transfer from synchronization can improve bandwidth
PGAS Languages on Clusters:
One-Sided vs. Two-Sided Communication
• A one-sided put/get message can be handled directly by a network interface with RDMA support
  • Avoids interrupting the CPU or storing data from the CPU (pre-posts)
• A two-sided message needs to be matched with a receive to identify the memory address where the data should be put (see the sketch below)
  • Offloaded to the network interface in networks like Quadrics
  • Need to download match tables to the interface (from the host)
[Figure: a one-sided put message carries the destination address plus the data payload and can be deposited into memory directly by the network interface; a two-sided message carries a message id plus the data payload and must be matched with a receive involving the host CPU.]
Joint work with Dan Bonachea
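As a simplified contrast, here is a hedged C/MPI sketch using MPI-2 one-sided operations as a stand-in for the GASNet/UPC puts measured on the following slides; the function name exchange, the buffers buf and remote, and the fixed ranks 0 and 1 are assumptions for illustration, and MPI_Init/MPI_Finalize are assumed to be handled by the caller.

    #include <mpi.h>

    /* Illustrative only: buf and remote are valid N-element buffers on both ranks. */
    void exchange(double *buf, double *remote, int N, int rank) {
        /* Two-sided: the receiver must post a matching recv that names the address. */
        if (rank == 0)
            MPI_Send(buf, N, MPI_DOUBLE, 1, /*tag=*/0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(remote, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* One-sided: the sender names the remote window and offset itself;
           no receive is posted, so an RDMA-capable NIC can deposit the data directly. */
        MPI_Win win;
        MPI_Win_create(remote, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_fence(0, win);
        if (rank == 0)
            MPI_Put(buf, N, MPI_DOUBLE, /*target_rank=*/1,
                    /*target_disp=*/0, N, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);
        MPI_Win_free(&win);
    }

The essential difference is visible in the calls: MPI_Recv supplies the destination address on the receiving side, whereas MPI_Put (like a GASNet put) names the remote location from the sending side, so the remote CPU need not be involved in placing the payload.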
One-Sided vs. Two-Sided: Practice
[Figure: flood bandwidth (MB/s, up is good) vs. message size (10 bytes to 1,000,000 bytes) for GASNet put (non-blocking) vs. MPI on the NERSC Jacquard machine (Opteron processors), with an inset of relative bandwidth (GASNet/MPI) ranging from about 1.0 to 2.4.]
• InfiniBand: GASNet vapi-conduit and OSU MVAPICH 0.9.5
• Half power point (N½) differs by one order of magnitude
• This is not a criticism of the implementation!
Joint work with Paul Hargrove and Dan Bonachea
GASNet: Portability and High-Performance
GASNet is better for latency across machines.
[Figure: 8-byte roundtrip latency in microseconds (down is good) for MPI ping-pong vs. GASNet put+sync on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; reported values range from about 4.5 to 24.2 usec, with GASNet lower than MPI on each platform.]
Joint work with UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
GASNet is at least as high (comparable) for large messages.
[Figure: flood bandwidth for 2MB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; absolute bandwidths of roughly 225-1504 MB/s are annotated on the bars.]
Joint work with UPC Group; GASNet design by Dan Bonachea
GASNet: Portability and High-Performance
GASNet excels at mid-range sizes, which is important for overlap.
[Figure: flood bandwidth for 4KB messages as a percent of hardware peak (up is good), MPI vs. GASNet, on Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, and SP/Fed; absolute bandwidths of roughly 152-763 MB/s are annotated on the bars.]
Joint work with UPC Group; GASNet design by Dan Bonachea
Communication Strategies for 3D FFT
Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
Decompositions: chunk = all rows with the same destination; slab = all rows in a single plane with the same destination; pencil = 1 row.
• Three approaches:
  • Chunk:
    • Wait for 2nd-dimension FFTs to finish
    • Minimize number of messages
  • Slab:
    • Wait for the chunk of rows destined for one processor to finish
    • Overlap with computation
  • Pencil (sketch below):
    • Send each row as it completes
    • Maximize overlap and match the natural layout
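A minimal sketch of the pencil strategy, written here with non-blocking MPI sends rather than the one-sided UPC puts used in the actual study; row_fft_1d and dest_of_row are hypothetical placeholders for the local 1-D FFT (e.g., an FFTW call) and the transpose's destination map.

    #include <mpi.h>
    #include <complex.h>

    /* Placeholder helpers (hypothetical): a real code would call FFTW here
       and use the transpose's actual layout map. */
    void row_fft_1d(double complex *row, int len) { (void)row; (void)len; }
    int  dest_of_row(int r, int nprocs)           { return r % nprocs; }

    /* Pencil strategy: as soon as a row's 1-D FFT finishes, send that single row,
       overlapping the remaining FFT computation with communication. */
    void pencil_transpose(double complex *rows, int nrows, int rowlen,
                          int nprocs, MPI_Request *reqs) {
        for (int r = 0; r < nrows; r++) {
            row_fft_1d(&rows[r * rowlen], rowlen);            /* compute this row */
            MPI_Isend(&rows[r * rowlen], 2 * rowlen, MPI_DOUBLE,
                      dest_of_row(r, nprocs), /*tag=*/r, MPI_COMM_WORLD, &reqs[r]);
        }
        MPI_Waitall(nrows, reqs, MPI_STATUSES_IGNORE);        /* drain outstanding sends */
    }

The trade-off named on the next slide is visible here: each row becomes its own small message, which maximizes overlap but is only profitable when per-message costs are low, as with one-sided puts.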
NAS FT Variants Performance Summary
• Slab is always best for MPI; the small-message cost is too high
• Pencil is always best for UPC; more overlap
[Figure: best MFlop rates per thread for all NAS FT benchmark versions (Chunk with NAS FT + FFTW, best NAS Fortran/MPI, best MPI (always slabs), best UPC (always pencils)) on Myrinet 64, InfiniBand 256, Elan3 256, Elan3 512, Elan4 256, and Elan4 512; an annotation marks 0.5 Tflop/s.]
Making PGAS Real: Applications and Portability
Coding Challenges: Block-Structured AMR
• Adaptive Mesh Refinement (AMR) is challenging
  • Irregular data accesses and control from boundaries
  • Mixed global/local view is useful
AMR Titanium work by Tong Wen and Philip Colella
Titanium AMR benchmark available
Language Support Helps Productivity
C++/Fortran/MPI AMR
• Chombo package from LBNL
• Bulk-synchronous communication:
  • Pack boundary data between procs
  • All optimizations done by the programmer
Titanium AMR
• Entirely in Titanium
• Finer-grained communication
  • No explicit pack/unpack code
  • Automated in the runtime system
General approach
• Language allows programmer optimizations
• Compiler/runtime does some automatically
Work by Tong Wen and Philip Colella; Communication optimizations joint with Jimmy Su
[Figure: lines of code (0-30,000) for Titanium vs. C++/Fortran/MPI (Chombo), broken down by component: AMRElliptic, AMRTools, Util, Grid, AMR, Array.]
Conclusions
• Future hardware performance improvements
  • Mostly from concurrency, not clock speed
  • New commodity hardware coming (sometime) from games/GPUs
• PGAS Languages
  • Good fit for shared and distributed memory
  • Control over locality and (for better or worse) SPMD
  • Available for download
    • Berkeley UPC compiler: http://upc.lbl.gov
    • Titanium compiler: http://titanium.cs.berkeley.edu
• Implications for Climate Modeling
  • Need to rethink algorithms to identify more parallelism
  • Need to match the needs of future (efficient) algorithms
Extra Slides
Parallelism Saves Low Power
• Exploit explicit parallelism to reduce power
    Power = C * V² * F            Performance = Cores * F
• Using additional cores
  • Allows reduction in frequency and supply voltage without decreasing (original) performance
  • Power supply reduction leads to large power savings (worked check below)
    Power = 2C * V² * F           Performance = 2 Cores * F
    Power = 4C * (V²/4) * (F/2)   Performance = 4 Lanes * (F/2)
    Power = (C * V² * F)/2        Performance = 2 Cores * F
• Additional benefits
  • Small/simple cores give more predictable performance
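A worked check of the third case above, assuming (as the slide does) that dynamic power scales as C·V²·F and that halving the frequency allows the supply voltage to be roughly halved as well:

    \[
    P_{\mathrm{new}} = (4C)\left(\frac{V}{2}\right)^{2}\left(\frac{F}{2}\right)
    = \frac{1}{2}\,C V^{2} F = \frac{1}{2}\,P_{\mathrm{old}},
    \qquad
    \mathrm{Perf}_{\mathrm{new}} = 4\ \mathrm{cores}\times\frac{F}{2}
    = 2\times \mathrm{Perf}_{\mathrm{old}}.
    \]

So four slower, lower-voltage cores deliver twice the original performance for half the power, which is the "concurrency for low power" argument in equation form.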
Where is Fortran?
Source: Tim O’Reilly