“Future Technologies” (WP8) Prototypes

Iris Christadler, Dr. Herbert Huber

Leibniz Supercomputing Centre, Germany

Prototype Overview (1/2)

CEA “GPU/CAPS”

1U Tesla Server T1070 (CUDA, CAPS, DDT), Intel Harpertown nodes

Take advantage of accelerators more easily. Compare HMPP with other approaches to programming accelerators.

CINECA

I/O Subsystem (SSD, Lustre, pNFS)

Assess the applicability of new file system and storage technologies.

CINES-LRZ “LRB/CS”

Hybrid SGI ICE/UV/Nehalem-EP & Nehalem-EX/ClearSpeed/Larrabee

Evaluate a hybrid system architecture containing thin nodes, fat nodes and compute accelerators with a shared file system.

CSCS “UPC/CAF”

Prototype PGAS language compilers (CAF + UPC for Cray XT systems)

Understand the usability and programmability of PGAS languages.

EPCC “FPGA”

Maxwell – FPGA prototype (VHDL support & consultancy + software licenses (e.g., Mitrion-C))

Assess the potential of high-level languages for using FPGAs in HPC. Compare energy efficiency with other solutions.

SC09, “Future Technologies” (WP8) Prototypes, Outlook Deliverable D8.3.2

Prototype Overview (2/2)

FZJ “Cell & FPGA interconnect”

eQPACE (PowerXCell8i cluster with special network processor)

Gain deep expertise in communication network issues. Extend the application domain of the QPACE system.

LRZ “RapidMind”

RapidMind Multi-Core Development Platform (automatic code generation for x86, GPUs and Cell)

Assess the potential of data stream languages. Compare RapidMind with other approaches for programming accelerators or multi-core systems.

NCF “ClearSpeed”

ClearSpeed CATS 700 units

Evaluate ClearSpeed accelerator hardware for large-scale applications.

SNIC-KTH

Air-cooled blade system from Supermicro with AMD Istanbul processors & QDR IB (subject to EC approval)

Evaluate and optimize energy efficiency and packing density of commodity hardware.

Experiences with the prototypes will be reported in Deliverable D8.3.2 [http://www.prace-project.eu/documents/public-deliverables-1/]

A SELECTION OF RESULTS

The teaser

Rinf


Euroben results - accelerator languages

[Figure: Accelerator Languages (absolute performance). Mflops (logarithmic axis) for peak perf, mod2am, mod2as and mod2f. Series: MKL (8 Nehalem cores), CUDA (1 C1060), CellSs (1 PowerXCell8i), Cn (1 CSX700). Labels on the chart: 94%, 81%, 79% and 78% of peak.]

[Figure: Accelerator Languages (% peak perf). % of peak performance (logarithmic axis) for mod2am, mod2as and mod2f. Series: MKL, CUDA, CellSs, Cn.]

mod2f/MKL: single-threaded only

Euroben results - GPGPU languages

[Figure: Performance Comparison (dense matrix-matrix mul.) on Nvidia C1060. Gflops versus matrix size (m). Series: CUDA, CAPS, CUDA+MPI 4x4, RapidMind, OpenCL, MKL (8 cores Nehalem).]
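As a rough illustration of how a Gflops figure for dense matrix-matrix multiplication can be derived, the sketch below times a naive product of two m x m matrices and converts the run time using the conventional count of 2·m³ floating-point operations. It is not the Euroben kernel or any of the compared implementations; the function names, the pure-Python kernel and the matrix size are assumptions made for this example only.

```python
import time


def matmul(a, b):
    """Naive dense product of two square matrices given as lists of rows."""
    # Transpose b once so the inner loop pairs a row of a with a column of b.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def matmul_flops(m):
    """Conventional operation count for an m x m product: 2 * m^3."""
    return 2 * m ** 3


def gflops(m, seconds):
    """Convert a measured run time for an m x m product into Gflop/s."""
    return matmul_flops(m) / seconds / 1e9


def percent_of_peak(measured_gflops, peak_gflops):
    """Express a measured rate as a percentage of a given peak rate."""
    return 100.0 * measured_gflops / peak_gflops


if __name__ == "__main__":
    m = 64  # hypothetical matrix size, kept small for a quick run
    a = [[1.0] * m for _ in range(m)]
    b = [[2.0] * m for _ in range(m)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    print(f"m={m}: {gflops(m, elapsed):.4f} Gflop/s")
```

The same conversion applies to any implementation: only the timed kernel changes, while the operation count depends on the matrix size alone, which is what makes results for different languages comparable at a given m.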

Euroben results - productivity

[Figure: Development Time versus Performance (dense matrix-matrix mul.). Axes: development time in days and performance in Mflops (logarithmic). Legend: performance, total time, first version.]

* OpenCL and CUDA+MPI port based on existing CUDA port

** RapidMind developer included


time for benchmarking

First IO-Results

PROTOTYPES

A glimpse of what you will find in Deliverable D8.3.2


eQPACE

Extend communication capabilities of eQPACE to make it suitable for a wider range of applications. Reach a top position in the Green500 list (FZJ).

• Hardware: PowerXCell8i processor nodes with custom 3D-torus interconnect.
• Benchmarks: HPL, Euroben kernels, torus network benchmark, applications & iterative solvers.
• Programming environments: Cell SDK & CellSs
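To make the notion of a 3D-torus interconnect concrete, the following sketch computes the nearest neighbours of a node and the minimal hop count between two nodes when every axis wraps around. The torus dimensions and function names are hypothetical and say nothing about the actual eQPACE layout or its network processor.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbours of `coord` in a 3D torus of shape `dims`.

    Each axis wraps around, so node 0 and node dims[axis] - 1 are adjacent.
    Neighbours are distinct only if every dimension is at least 3.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimal number of hops between nodes a and b, using wrap-around links."""
    total = 0
    for x, y, d in zip(a, b, dims):
        delta = abs(x - y)
        total += min(delta, d - delta)
    return total


if __name__ == "__main__":
    dims = (4, 4, 4)  # hypothetical torus shape
    print(torus_neighbors((0, 0, 0), dims))
    print(torus_hops((0, 0, 0), (3, 2, 1), dims))
```

The wrap-around links are what distinguish a torus from a plain mesh: the longest distance along an axis is half the axis length rather than the full length.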

RapidMind

Evaluation of the RapidMind programming model (LRZ).

• Hardware:
– CPUs (Nehalem EP, AMD Opteron)
– GPUs (Nvidia Tesla and Quadro FX)
– Cell (QS22-blade cluster)
• Software: RapidMind allows writing code that can run on x86 cores as well as on accelerators like GPUs and Cell.
– Evaluate ease-of-use & portability
– Assess RapidMind performance on different architectures
– Compare RapidMind with other accelerator languages

[Figure: RapidMind mod2am. Gflops versus matrix size (m). Series: x86-dp (8 cores Nehalem), cuda-dp (C1060), glsl-sp (FX 5800).]

LRZ-CINES

Evaluation of a hybrid system architecture containing thin nodes, fat nodes and compute accelerators with a shared file system (CINES, LRZ).

• Hardware:
– SGI ICE (Nehalem EP)
– SGI UV (Nehalem EX)
– ClearSpeed CSX700
• Benchmarks:
– Euroben kernels
– Synthetic BMs: HPL, Rinf, Intel MPI Benchmark, Apex-MAP
– Application BMs: Gadget, RAxML, SPECFEM3D_GLOBE

Hybrid technology demonstrator

Evaluating GPGPU with CAPS HMPP (CEA).

• Hardware: Tesla servers connected to Bull servers via PCI-E.
• Software: CAPS HMPP makes it possible to exploit the potential of GPGPUs by simply adding preprocessor directives to legacy Fortran and C codes.

[Figure: CAPS hmpp mod2am. Gflops versus matrix size (m).]

[Figure: CUDA mod2am. Gflops versus matrix size (m).]

Maxwell FPGA

Evaluate the performance and usability of the HARWEST Compiling Environment (EPCC).

• Hardware: FPGA prototype “Maxwell” (32 FPGAs) from both Alpha Data Ltd and Nallatech Ltd using Virtex-4 FPGAs supplied by Xilinx Corp.
• Benchmarks: 4 Euroben kernels
• Languages:
– VHDL
– HCE


PGAS languages

Evaluate ease of use of the PGAS programming model (CSCS).

• Hardware: Cray XT5
• Compiler: Cray Compiler Environment (CCE)
• Evaluation of the compiler:
– Functional correctness
– Conformance with language standards
– Usability for existing CAF and UPC benchmarks/applications
• Benchmarks from Rice University, George Washington University and the Lawrence Berkeley National Laboratory

ClearSpeed/PetaPath

Evaluate ClearSpeed-Petapath system (NCF).

• Hardware: 114 ClearSpeed CSX700 cards
• Language: Cn
• Benchmarks:
– 4 Euroben kernels
– 4 Applications
  • Astronomy
  • Geophysics
  • Numerical mathematics
  • Medical tomography

XC4-IO

• Compare the performance of storage infrastructure access, using different hardware configurations and file system architectures (CINECA).


SNIC-KTH

Evaluate energy efficiency of high density commodity parts (SNIC-KTH).

• Hardware: AMD Istanbul
• Benchmarks: Euroben, STREAM, IMB, Gromacs, CFD
• Measure power consumption per component
• Adjust fan speed and fan power
• Assess energy management features of AMD Istanbul (control of voltage and frequency of components)

[Figure: SNIC-KTH Preliminary Results (Gromacs)]

RESEARCH ACTIVITIES

Results will be reported in Deliverable D8.3.2.


Parallel GPU

Evaluation of GPGPU programming languages (CSC).

• Languages:
– CUDA+MPI
– OpenCL
• Benchmarks:
– GPU-HMMER
– Euroben kernels
• Hardware:
– Tesla
– AMD Firestream
– CEA WP8 Prototype

Advanced PGAS Programming

Evaluate usability of PGAS

upc_barrier; upc_forall (sc=0; sc

Research on power efficiency

Evaluate power consumption of components (STFC, PSNC).

• Hardware: ClearSpeed, Tesla, Firestream, Cell, Power6.
• Different workloads: stand-by, neutral, real life, artificial stress.
• Assess CPU, memories, accelerators, HDDs, cooling fans, backplane, power supply.
• Power measurements with: clamp meters, PDUs with built-in ammeters, values from system management software
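A minimal sketch of how power readings such as those from a PDU or clamp meter could be turned into an energy-efficiency figure: timestamped power samples are integrated with the trapezoidal rule to obtain energy, and the operation count of the workload is divided by that energy. The sample values, the operation count and all function names are invented for illustration and are not measurements from any prototype.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples, sorted by time, with the trapezoidal rule."""
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


def average_power_watts(samples):
    """Average power over the sampled interval."""
    duration = samples[-1][0] - samples[0][0]
    return energy_joules(samples) / duration


def mflops_per_watt(total_flop, samples):
    """Sustained Mflop/s divided by average power, which equals Mflop per joule."""
    return total_flop / energy_joules(samples) / 1e6


if __name__ == "__main__":
    # Hypothetical readings: (seconds since start, watts)
    samples = [(0.0, 100.0), (10.0, 200.0), (20.0, 200.0)]
    total_flop = 7.0e9  # hypothetical operation count of the workload
    print(energy_joules(samples), "J")
    print(average_power_watts(samples), "W")
    print(mflops_per_watt(total_flop, samples), "Mflops/W")
```

Working from energy rather than a single power reading keeps the result meaningful when consumption varies between the stand-by, real-life and stress phases of a run.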

Contact information:
Dr. Herbert Huber (WP8 Leader), [email protected]
Iris Christadler (WP8 Co-Leader), [email protected]
Leibniz Supercomputing Centre, Germany

THANK YOU FOR YOUR ATTENTION!

COMMENTS? QUESTIONS?

