accULL (HAC Leganés)

accULL: An User-directed Approach to Heterogeneous Programming

Ruymán Reyes, Iván López-Rodríguez, Juan J. Fumero, Francisco de Sande

Dept. E.I.O. y Computación, Univ. de La Laguna, 38271 La Laguna, Spain

International Workshop on Heterogeneous Architectures and Computing

Leganés, July 13, 2012

Outline

1 Heterogeneous Architectures

2 accULL: An Early OpenACC Implementation

3 Results

4 Conclusions and Future Work

Introduction

The rise of GPUs: impressive results

Successfully used for general purpose computing (GPGPU)

Heterogeneous Architectures

But ...

It is not Easy!

A GPU is not a CPU

GPUs are inherently SIMD processors

CPUs and GPUs tackle the processing of tasks differently

CPUs excel at serial processing

GPUs are better suited to applications that demand high floating-point throughput at lower power consumption

Parallel languages: MPI (distributed memory, DM) and OpenMP (shared memory, SM)

They are not valid for programming GPUs

New programming models are required...

GPGPU Programming

Today's software stack:

CUDA from NVIDIA

Pros: performance, easier than OpenCL

Con: only for NVIDIA hardware

CUDA Code Example

    __global__ void mmkernel(float *a, float *b, float *c, int n,
                             int m, int p) {
        int i = blockIdx.x * 32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < p; ++k) sum += b[i + n * k] * c[k + p * j];
        a[i + n * j] = sum;
    }
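To make the indexing of the kernel above explicit, here is a minimal sequential sketch in plain C. It assumes, as the index expressions suggest, that the matrices are stored column-major, with b of size n x p, c of size p x m and a of size n x m. The name `mm_reference` and the interpretation of `m` as the number of columns of a are assumptions made for illustration; the slides do not state them.

```c
/* Sequential C equivalent of the mmkernel CUDA kernel (a sketch, not the
 * authors' code). Assumed layout: column-major, b is n x p, c is p x m,
 * a is n x m. Each (i, j) pair is what a single CUDA thread would compute. */
void mm_reference(float *a, const float *b, const float *c,
                  int n, int m, int p)
{
    for (int j = 0; j < m; ++j) {        /* plays the role of blockIdx.y */
        for (int i = 0; i < n; ++i) {    /* blockIdx.x * 32 + threadIdx.x */
            float sum = 0.0f;
            for (int k = 0; k < p; ++k)
                sum += b[i + n * k] * c[k + p * j];
            a[i + n * j] = sum;
        }
    }
}
```

On the GPU the two outer loops disappear: the launch grid supplies every (i, j) pair, which is why the programmer has to pick block and grid dimensions by hand.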

OpenCL: Open Computing Language

A framework developed by the Khronos Group

A standard

OpenCL programs execute across heterogeneous platforms: CPUs + GPUs + other processors

Pros: can be used with any device, it is a standard

Cons: more complex than CUDA, immature

Common Problems

1 The programmer needs to know low-level details of the architecture

2 Source code needs to be rewritten: one version for the CPU, a different version for the GPU

3 Good performance requires a great effort in parameter tuning

4 CUDA and OpenCL are new and complex for non-experts

Our claim: new models and tools are needed if we want GPUs to be widely used in HPC

Is there anything new on the horizon?

hiCUDA

PGI accelerator model

CAPS HMPP

OpenACC

hiCUDA

Translates each directive into a CUDA call

It is able to use the GPU Shared Memory

Only works with NVIDIA devices

The programmer still needs to know hardware details

hiCUDA Code Example:

    ...
    #pragma hicuda global alloc c[*][*] copyin

    #pragma hicuda kernel mxm tblock(N/16, N/16) thread(16, 16)
    #pragma hicuda loop_partition over_tblock over_thread
    for (i = 0; i < N; i++) {
    #pragma hicuda loop_partition over_tblock over_thread
        for (j = 0; j < N; j++) {
            double sum = 0.0;
            ...

PGI accelerator model

It is a higher level (directive-based) approach

Fortran and C are supported

Precursor to OpenACC

PGI Accelerator Model Code Example:

    #pragma acc data copyin(b[0:n*l], c[0:m*l]) copy(a[0:n*m])
    {
    #pragma acc region
        {
    #pragma acc loop independent
            for (j = 0; j < n; j++)
            {
    #pragma acc loop independent
                for (i = 0; i < l; i++) {
                    double sum = 0.0;
                    for (k = 0; k < m; k++) {
                        sum += b[i + k*l] * c[k + j*m];
                    }
                    a[i + j*l] = sum;
                }
            }
        }
    }

OpenACC: introduced in November 2011 at SuperComputing 2011

A directive-based language

Aims to become a standard

Supported by: Cray, NVIDIA, PGI and CAPS

A single source code for CPU/GPU

Platform independent

Easier for beginners

OpenACC Code Example:
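As a minimal sketch of what OpenACC code looks like, consider a simple SAXPY loop. The function name `saxpy` and its parameters are illustrative and not taken from the talk. Only standard OpenACC 1.0 directives are used; a compiler without OpenACC support ignores the pragmas and runs the loop sequentially, which illustrates the single-source property for CPU and GPU.

```c
/* y = alpha * x + y, annotated with standard OpenACC 1.0 directives.
 * Illustrative sketch: names and data sizes are assumptions. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    /* x is only read on the device; y is copied in and back out. */
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    {
#pragma acc loop independent
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }
}
```

Compared with the CUDA kernel shown earlier, there is no explicit kernel function, launch configuration or memory transfer call: the data clauses and the loop directive carry that information.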

accULL: Our OpenACC implementation

accULL is a framework developed to support OpenACC programs

accULL = YaCF + Frangollo

It is a two-layer implementation: Compiler + RunTime Library

YaCF: the compiler

YaCF (Yet Another Compiler Framework) is the compiler framework we have developed

Some features:

It is a source-to-source (StS) compiler

Written in Python from scratch with an OO approach

Receives C99 as input

It is able to generate CUDA/OpenCL kernels from annotated code

A driver for compiling OpenACC directives has been added

YaCF translates the directives into Frangollo calls

A public-domain development

Frangollo: the RunTime

Frangollo

It is a runtime that supports execution on heterogeneous platforms

1 Encapsulates the hardware issues

2 Is able to run on NVIDIA devices using CUDA

3 Is able to manage a wider range of devices using OpenCL

Compilation flow

Frangollo's responsibilities

1 Manages the memory

2 Initializes the devices

3 Launches the kernels

It makes the programmer's life easier!

Frangollo: Memory Management

A program workflow

Frangollo: Structure

Interface layer: A door to Frangollo

Some functions in the C interface:

registerVar

launchKernel

getNumDevices

Abstract layer

Frangollo uses a class hierarchy

All classes in this layer are abstract

Device layer

Encapsulates all target-language-related functions

New platforms could be added in the future
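The layering idea (an interface that stays fixed, an abstract set of operations, and interchangeable device backends) can be sketched in plain C with a table of function pointers standing in for abstract classes. Everything in this sketch is hypothetical: `device_ops`, `host_backend` and `round_trip` are illustrative names, not Frangollo's actual types or functions, and the backend simply uses host memory where a real one would call CUDA or OpenCL.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "abstract layer": operations every backend must provide. */
typedef struct device_ops {
    const char *name;
    void *(*alloc)(size_t bytes);
    void (*copy_in)(void *dev, const void *host, size_t bytes);
    void (*copy_out)(void *host, const void *dev, size_t bytes);
    void (*release)(void *dev);
} device_ops;

/* Hypothetical "device layer": a host-memory backend used as a stand-in
 * for a CUDA or OpenCL backend. */
static void *host_alloc(size_t bytes) { return malloc(bytes); }
static void host_copy_in(void *dev, const void *host, size_t bytes)
{
    memcpy(dev, host, bytes);
}
static void host_copy_out(void *host, const void *dev, size_t bytes)
{
    memcpy(host, dev, bytes);
}
static void host_release(void *dev) { free(dev); }

static const device_ops host_backend = {
    "host", host_alloc, host_copy_in, host_copy_out, host_release
};

/* Hypothetical "interface layer": backend-agnostic code that copies data
 * to the device and back. Returns 0 on success, -1 if allocation fails. */
int round_trip(const device_ops *ops, const int *src, int *dst, size_t count)
{
    size_t bytes = count * sizeof *src;
    void *dev = ops->alloc(bytes);
    if (dev == NULL)
        return -1;
    ops->copy_in(dev, src, bytes);
    ops->copy_out(dst, dev, bytes);
    ops->release(dev);
    return 0;
}
```

Under this design, supporting a new platform means supplying one more `device_ops` table; code written against the interface does not change.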

Results

Platforms

M1: A Desktop computer

Intel Core i7 930 processor (2.80 GHz)

1MB of L2 cache, 8MB of L3 cache, shared by the four cores

4 GB RAM

2 GPU devices attached:

Tesla C1060 with 3 GB memory (M1a)

Tesla C2050 (Fermi) with 4 GB memory (M1b)

Accelerator platform is CUDA 4.0

M1a/M1b mimic the setup of an average OpenACC developer

She can purchase a GPU card and plug it into her desktop computer

It is a relatively cheap platform

M2: A cluster node

2 quad-core Intel Xeon E5410 (2.25 GHz) processors

24 GB memory

One attached Fermi C2050 card with 448 CUDA cores and 4 GB memory

Accelerator platform: CUDA 4.0

M2 is a node of a common multinode cluster

Today's clusters combine multicore processors and GPU devices, so we can take advantage of OpenACC

This kind of compute node has higher acquisition and maintenance costs than M1

M3: A second cluster

M3 is a shared memory system

4 Intel Xeon E7 4850 CPUs

2.50 MB L2 cache and 24 MB L3 cache (for all its 10 cores)

6 GB of memory per core

Accelerator platform: Intel OpenCL SDK 1.5, running on the CPU

M3 showcases an alternative use of OpenCL

There are implementations of OpenCL targeting shared memory systems

Using CPU-targeted OpenCL platforms along with OpenACC represents an interesting alternative to OpenMP programming

Some of our Experiments

Blocked Matrix Multiplication (M×M)

Rodinia Benchmark

The Rodinia Benchmark suite comprises compute-heavy applications

It covers a wide range of applications

OpenMP, CUDA and OpenCL versions are available for most of the codes in the suite

From them, we have selected:

Needleman-Wunsch (NW)

HotSpot (HS)

Speckle Reducing Anisotropic Diffusion (SRAD)

Matrix Multiplication

Sketch of M×M in OpenACC

    /* Row-major storage: a is L x N, b is L x M, c is M x N */
    #pragma acc kernels name("mxm") copy(a[L*N]) \
                        copyin(b[L*M], c[M*N] ...)
    {
    #pragma acc loop private(i, j) collapse(2)
        for (i = 0; i < L; i++)
            for (j = 0; j < N; j++)
                a[i * N + j] = 0.0;
        /* Iterate over blocks */
        for (ii = 0; ii < L; ii += tile_size)
            for (jj = 0; jj < N; jj += tile_size)
                for (kk = 0; kk < M; kk += tile_size) {
                    /* Iterate inside a block */
    #pragma acc loop collapse(2) private(i, j, k)
                    for (j = jj; j < min(N, jj + tile_size); j++)
                        for (i = ii; i < min(L, ii + tile_size); i++)
                            for (k = kk; k < min(M, kk + tile_size); k++)
                                a[i * N + j] += (b[i * M + k] * c[k * N + j]);
                }
    }
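The tiled loop nest can be checked on the host before any accelerator is involved. The sketch below is a self-contained C99 version with the directives omitted, so it compiles with any C compiler, plus a naive triple loop as a reference. It assumes row-major storage with a of size L x N, b of size L x M and c of size M x N, and therefore uses row strides N, M and N; the function names `blocked_mxm`, `naive_mxm` and `min_int` are illustrative.

```c
/* Smaller of two ints; stands in for the min() used in the slide. */
static int min_int(int x, int y) { return x < y ? x : y; }

/* Tiled product a = b * c. Assumed row-major: a is L x N, b is L x M,
 * c is M x N. Sketch only; OpenACC directives deliberately left out. */
void blocked_mxm(double *a, const double *b, const double *c,
                 int L, int M, int N, int tile_size)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] = 0.0;

    /* Iterate over blocks */
    for (int ii = 0; ii < L; ii += tile_size)
        for (int jj = 0; jj < N; jj += tile_size)
            for (int kk = 0; kk < M; kk += tile_size)
                /* Iterate inside a block */
                for (int j = jj; j < min_int(N, jj + tile_size); j++)
                    for (int i = ii; i < min_int(L, ii + tile_size); i++)
                        for (int k = kk; k < min_int(M, kk + tile_size); k++)
                            a[i * N + j] += b[i * M + k] * c[k * N + j];
}

/* Untiled reference with the same layout. */
void naive_mxm(double *a, const double *b, const double *c,
               int L, int M, int N)
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < M; k++)
                sum += b[i * M + k] * c[k * N + j];
            a[i * N + j] = sum;
        }
}
```

Testing with non-square sizes such as L = 3, M = 2, N = 4 is worthwhile, because stride mistakes stay hidden when all three dimensions are equal.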

Floating point performance for M×M in M2
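The slides do not state how floating point performance is counted. Assuming the conventional count of one multiply and one add per inner-loop iteration, the product of an L x M matrix by an M x N matrix completed in t seconds would be reported as:

```latex
\[
  \text{GFLOP/s} \;=\; \frac{2\,L\,M\,N}{t \cdot 10^{9}}
\]
```

Whether t includes host-device transfers is a further assumption that changes the figure considerably for small matrices.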

Floating point performance comparison between OpenMP, accULL, PGI and hiCUDA in M1

accULL achieves the second-best performance

Comparison between the OpenMP-gcc implementation and Frangollo+OpenCL in M3 (SM system, 40 cores)

Needleman-Wunsch

Performance comparisons of NW in M1b

accULL performs worse than the native versions

Performance comparisons of NW in M3 (SM, 40 cores)

The OpenMP versions outperform their OpenCL counterparts

HotSpot

Performance comparison of different implementations showing efficiency over native CUDA code in M1

In this case, accULL performs similarly to hiCUDA

Speed-Up comparison with native CUDA code in M1b (Fermi)
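The slides do not define speedup or efficiency. Under the usual convention, which is an assumption here, both compare the wall-clock time of an implementation X against a reference implementation (native CUDA or OpenMP, depending on the plot):

```latex
\[
  S_{X} \;=\; \frac{T_{\text{ref}}}{T_{X}},
  \qquad
  E_{X} \;=\; 100\,\% \times \frac{T_{\text{ref}}}{T_{X}}
\]
```

With this reading, an efficiency below 100% means X is slower than the reference, and a speedup above 1 means it is faster.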

Efficiency w.r.t. Intel-OpenMP in M3 (SM, 40 cores)

Speedup over the OpenMP implementation in M1b

Speedup over the OpenMP implementation in M3

Conclusions I

accULL

First OpenACC implementation with support for both CUDA and OpenCL

It supports most of the standard

We validate accULL using codes from widely available benchmarks on GPUs and CPUs

It meets the requirements of a non-expert developer

Conclusions II

accULL

YaCF can be used as a fast-prototyping tool to explore optimizations

Frangollo can be detached from YaCF and combined with a production-ready compiler

Some issues can be tackled within Frangollo independently of the compiler:

Memory allocation

Kernel scheduling

Data splitting

Overlapping of computation and communications

Parallel reduction implementation

Future work

There are plenty of opportunities to improve performance

To implement 2D arrays as cudaMatrix or OCLImages to improve non-contiguous memory access

To complete the implementation of the asynchronous calls for better performance

Multi-GPU support

To explore different possibilities of integration with MPI

Integration of Frangollo with a production-ready compiler

New backend for FPGAs

Thank you for your attention!

accULL: An User-directed Approach to Heterogeneous Programming

http://accull.wordpress.com/

This work has been partially supported by the EU (FEDER), the Spanish MEC (contracts TIN2008-06570-C04-03 and TIN2011-24598), HPC-EUROPA2 and the Canary Islands Government, ACIISI

F. de Sande, fsande@ull.es
