
1

Multi-Core Architectures for Numerical Simulation

Lehrstuhl für Informatik 10 (Systemsimulation)

Universität Erlangen-Nürnberg

www10.informatik.uni-erlangen.de

Siemens Simulation Center

11. November 2009

H. Köstler, J. Habich, J. Götz, M. Stürmer, S. Donath, T. Gradl, D. Ritter,

C. Feichtinger, K. Iglberger (LSS Erlangen and RRZE)

U. Rüde (LSS Erlangen, ruede@cs.fau.de)

In collaboration with RRZE and many more

2

Overview

Intro

Who we are

How fast are computers today?

Technological Trends

GPUs, Cell, and others

Example: Flow Simulation with Lattice Boltzmann Methods

Computational Haemodynamics using the PlayStation

Conclusions

3

The LSS Mission

Development and Analysis of Computer Methods for Applications in Science and Engineering

[Diagram: LSS at the intersection of computer science, mathematics, and applications from the physical and engineering sciences]

4

Who is at LSS (and does what?)

• C. Feichtinger

• S. Donath

• C. Mihoubi

• J. Götz

• S. Ganguly

• K. Pickl

• S. Bogner

• !"##"#$%"&'(")"#'$

Alumni

Prof. G. Horton (Univ. of Magdeburg)

Prof. El Mostafa Kalmoun (Cadi Ayyad University, Morocco)

Dr. M. Kowarschik (Siemens Health Care)

Dr. M. Mohr (Geophysics, TU München)

Dr. F. Hülsemann (EDF, Paris)

Dr. B. Bergen (Los Alamos, USA)

Dr. N. Thürey (ETH Zürich)

Dr. J. Härdtlein (Bosch GmbH)

C. Möller (Navigon)

Dr. U. Fabricius (Elektrobit)

Dr. Th. Pohl (Siemens Health Care)

J. Treibig (RRZE)

C. Freundl (YAGER Development)

Laser Simulation

Prof. Dr. C. Pflaum

Supercomputing

J. Götz

Numerical Algorithms

H. Köstler

Complex Flows

K. Iglberger

B. Berneker

M. Wohlmuth

C. Jandl

Kai Hertel

J. Werner

T. Gradl

M. Stürmer

F. Deserno

D. Ritter

B. Gmeiner

S. Geißelsöder

• T. Dreher

• Dr. W. Degen

• T. Preclik

• D. Bartuschat

• S. Strobl

• Li Yi

5

How much is a PetaFlops?

10⁶ = 1 MegaFlops: Intel 486, 33 MHz PC (~1989)

10⁹ = 1 GigaFlops: Intel Pentium III, 1 GHz (~2000)

If every person on earth does one operation every 6 seconds, all humans together have 1 GigaFlops performance (less than a current laptop)

10¹² = 1 TeraFlops: HLRB-I, 1344 processors, ~2000

10¹⁵ = 1 PetaFlops: >100,000 processor cores; Roadrunner/Los Alamos, June 2008

• If every person on earth runs a 486 PC, we all together have an aggregate performance of 6 PetaFlops.

HLRB-II: 63 TFlops

HLRB-I: 2 TFlops
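As a quick check of the two estimates above (assuming a world population of roughly 6.8 billion in 2009, an assumption not stated on the slide):

\[
6.8\times10^{9}\,\text{people}\times\tfrac{1}{6}\,\tfrac{\text{ops}}{\text{s}} \approx 1.1\times10^{9}\,\tfrac{\text{ops}}{\text{s}} \approx 1\ \text{GFlops}
\]
\[
6.8\times10^{9}\,\text{people}\times 10^{6}\,\tfrac{\text{ops}}{\text{s}} \approx 6.8\times10^{15}\,\tfrac{\text{ops}}{\text{s}} \approx 6\ \text{to}\ 7\ \text{PFlops}
\]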

6

Where Does Computer Architecture Go?

Computer architects have capitulated: It may not be possible anymore to exploit progress in semiconductor technology for automatic performance improvements

Even today a single core CPU is a highly parallel system:

superscalar execution, complex pipeline, ... and additional tricks

Internal parallelism is a major reason for the performance increases until now, but ...

There is a limited amount of parallelism that can be exploited automatically

Multi-core systems concede the architects' defeat:

Architects can no longer build faster single-core CPUs, even with more transistors available

Clock rate increases only slowly (due to power considerations)

Therefore architects have started to put several cores on a chip:

programmers must use them directly

7

What are the consequences?

For the application developers “the free lunch is over”

Without explicitly parallel algorithms, the performance potential cannot be used any more

it will become increasingly important to use instruction level parallelism (such as vector units)

For performance-critical applications:

CPUs will have 2, 4, 8, 16, ..., 128, ..., ??? cores - maybe sooner than we are ready for this

At the high end we will have to deal with systems with millions of cores
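A minimal illustration of what explicit parallelism means for a programmer (not taken from the slides; the function axpy and its parameters are chosen for the example only): the loop is distributed over the cores with OpenMP, and its simple body is a natural candidate for the SIMD vector units.

/* y[i] = y[i] + a * x[i], parallelized explicitly over all cores */
void axpy(long n, float a, const float *x, float *y)
{
  #pragma omp parallel for               // thread-level parallelism: one chunk of iterations per core
  for (long i = 0; i < n; ++i)
    y[i] += a * x[i];                    // simple body, suited for SIMD vectorization
}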

8

Trends in Computer Architecture

On Chip Parallelism for everyone

instruction level

SIMD-like vectorization

multicore (with caches or local memory)

Off Chip parallelism

for large scale parallel systems

Accelerator hardware

GPUs

Cell processor

Limits to clock rate

Limits to memory bandwidth

Limits to memory latency

Multi-Core Activities at LSS

Architectures

IBM Cell

GPU

• Nvidia

• AMD/ATI

Conventional multi-core architectures (Intel, AMD)

Applications

Finite elements, PDEs, multigrid methods, flow solvers (LBM)

Image processing, medical engineering applications

3D real-time simulation for industrial control

See papers, reports, Master's and Bachelor's theses:

http://www10.informatik.uni-erlangen.de/Publications/

9

10

Multi Core Architectures

IBM-Sony-Toshiba Cell Processor

GPU: Nvidia or AMD/ATI

11

The STI Cell Processor

hybrid multicore processor based on IBM Power architecture

(simplified) PowerPC core

runs operating system

controls execution of programs

multiple co-processors (8, on Sony PS3 only 6 available)

operate on fast, private on-chip memory

optimized for computation

vectorization: „float4“ data type

DMA controller copies data from/to main memory

• multi-buffering can hide main memory latencies completely for streaming-like applications

• loading local copies has low and known latencies

memory with multiple channels and banks can be exploited if many memory transactions are in-flight

12

IBM Cell Processor

Available Cell systems:

Roadrunner

Blades

Playstation 3

Cell Architecture: 9 cores on a chip

13

14

GPUs

massively parallel SIMD-like execution on several hundred compute units

typical performance values (Nvidia Fermi, soon):

2.7 TFlops single precision possible

630 GFlops double precision

4+x GByte memory „on board“

150+x GByte/sec memory bandwidth

additionally vectorization in „warps“ (16 floats)

ATI Radeon HD 4870

Costs: 150 €

Interface: PCI-E 2.0 x16

Shader Clock: 750 MHz

Memory Clock: 900 MHz

Memory Bandwidth: 115 GB/s

FLOPS: 1200 GFLOPS

Max Power Draw: 160 W

Framebuffer: 1024 MB

Memory Bus: 256 bit

Shader Processors: 800

15

Nvidia GeForce GTX 295

Costs: 450 €

Interface: PCI-E 2.0 x16

Shader Clock: 1242 MHz

Memory Clock: 999 MHz

Memory Bandwidth: 2x112 GB/s

FLOPS: 2x894 GFLOPS

Max Power Draw: 289 W

Framebuffer: 2x896 MB

Memory Bus: 2x448 bit

Shader Processors: 2x240

16

GPU: AMD Stream Processor

17

AMD Stream Architecture (cont'd)

18

ATI Radeon 3870 (RV670) / FireStream 9170

19

Example 1: Flow Simulation on Cell

20

LBM Optimized for Cell

memory layout

optimized for DMA transfers

information propagating between patches is reordered on the SPE and stored sequentially in memory for simple and fast exchange

code optimization

kernels hand-optimized in assembly language

SIMD-vectorized streaming and collision

branch-free handling of bounce-back boundary conditions
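The branch-free bounce-back handling can be sketched as follows (an illustration with assumed names, not the actual LSS kernel): instead of an if-statement per lattice link, a bit-select decides for four cells at once whether the value streamed in from the neighbour or the cell's own opposite-direction distribution is used.

#include <spu_intrinsics.h>

/* Hypothetical helper: branch-free bounce-back for four lattice cells at once.
 * obstacle_mask lanes are all-ones (0xFFFFFFFF) where the neighbour in
 * direction i is a solid cell and zero where it is fluid. */
static inline vector float
stream_with_bounce_back( vector float f_from_neighbour,   // value streamed in along direction i
                         vector float f_opposite,         // the cell's own value in direction -i
                         vector unsigned int obstacle_mask )
{
  // spu_sel takes bits from the first operand where the mask is 0 and from
  // the second operand where the mask is 1, so no branch is required.
  return spu_sel( f_from_neighbour , f_opposite , obstacle_mask );
}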

21

Simulation of Metal Foams

Free Surface Flows

Applications:

Engineering: metal foam simulations

Computer graphics: special effects

Based on LBM:

Mesoscopic approach to solving the Navier-Stokes equations

Good for complex boundary conditions

Details: D3Q19 model, BGK collision and grid compression
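For orientation, a plain-C sketch of the BGK collision step for a single D3Q19 cell is shown below (standard D3Q19 weights and lattice velocities; the function name bgk_collide is chosen for this illustration, and the sketch omits the SIMD vectorization, grid compression, and streaming step of the actual solver):

#define Q 19

/* discrete velocities c_i of the D3Q19 model */
static const int c[Q][3] = {
  { 0, 0, 0},
  { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
  { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
  { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
  { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}
};

/* lattice weights w_i */
static const double w[Q] = {
  1.0/3.0,
  1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0,
  1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0,
  1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0
};

/* BGK collision for one cell: relax the 19 distributions f towards equilibrium */
void bgk_collide(double f[Q], double omega)
{
  double rho = 0.0, ux = 0.0, uy = 0.0, uz = 0.0;
  for (int i = 0; i < Q; ++i) {          /* macroscopic density and momentum */
    rho += f[i];
    ux  += f[i] * c[i][0];
    uy  += f[i] * c[i][1];
    uz  += f[i] * c[i][2];
  }
  ux /= rho;  uy /= rho;  uz /= rho;

  const double usq = ux*ux + uy*uy + uz*uz;
  for (int i = 0; i < Q; ++i) {
    const double cu  = c[i][0]*ux + c[i][1]*uy + c[i][2]*uz;
    const double feq = w[i] * rho * (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*usq);
    f[i] += omega * (feq - f[i]);        /* relaxation with parameter omega = 1/tau */
  }
}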

22

Performance Results

[Bar chart: LBM performance on a single core (8x8x8 channel flow) for Xeon 5160, PPE, and SPE (*on Local Store without DMA transfers), comparing straightforward C code with SIMD-optimized assembly; extracted values: 2.0, 4.8, 10.4, and 49.0]

23

Performance Results

[Bar chart: performance scaling from 1 to 6 SPEs; extracted values: 42, 81, 93, 94, 94, 95]

24

Performance Results

[Bar chart: Xeon 5160* vs. Playstation 3, each for 1 core and 1 full CPU; extracted values: 9.1, 11.7, 21.1, and 43.8 (*performance-optimized code by LB-DC)]

25

Programming the Cell-BE

the hard way

control SPEs using management libraries

issue DMAs by language extensions

do address calculations manually

exchange main memory addresses, array sizes etc.

synchronization using mailboxes, signals or libraries

frameworks

Accelerated Library Framework (ALF) and Data, Communication, and Synchronization (DaCS) by IBM

Rapidmind SDK

accelerated libraries

single-source-compiler

IBM’s xlc-cbe-sse, uses OpenMP

26

Naive SPU implementation: A[] = A[]*c

#include <spu_intrinsics.h>   // spu_splats, spu_mul
#include <spu_mfcio.h>        // mfc_get, mfc_put, DMA tag handling

volatile vector float ls_buffer[8] __attribute__((aligned(128)));

void scale( unsigned long long gs_buffer,  // main memory address of vector
            int number_of_chunks,          // number of chunks of 32 floats
            float factor ) {               // scaling factor
  vector float v_fac = spu_splats(factor);           // create SIMD vector with all
                                                     // four elements being factor
  for ( int i = 0 ; i < number_of_chunks ; ++i ) {
    mfc_get( ls_buffer , gs_buffer , 128 , 0 ,0,0);  // DMA reading i-th chunk
    mfc_write_tag_mask( 1 << 0 );                    // wait for DMA...
    mfc_read_tag_status_all();                       // ...to complete
    for ( int j = 0 ; j < 8 ; ++j )
      ls_buffer[j] = spu_mul( ls_buffer[j] , v_fac );  // scale local copy using SIMD
    mfc_put( ls_buffer , gs_buffer , 128 , 0 ,0,0);  // DMA writing i-th chunk
    mfc_write_tag_mask( 1 << 0 );                    // wait for DMA...
    mfc_read_tag_status_all();                       // ...to complete
    gs_buffer += 128;                                // incr. global store pointer
  }
}

27

Remove latencies using multi-buffering

volatile vector float ls_buffer[3][8] __attribute__((aligned(128)));

...

mfc_get( ls_buffer[0] , gs_buffer , 128 , 0 ,0,0);               // request first chunk
for (int i = 0; i < number_of_chunks; ++i) {
  int cur  = ( i ) % 3;                                          // buffer no. and DMA tag for i-th chunk
  int next = (i+1) % 3;                                          // ditto for (i-2)-th and (i+1)-th chunk
  if (i < number_of_chunks-1) {
    mfc_write_tag_mask( 1 << next );                             // make sure the (i-2)-th chunk...
    mfc_read_tag_status_all();                                   // ...has been stored
    mfc_get( ls_buffer[next] , gs_buffer+128 , 128 , next ,0,0); // request (i+1)-th chunk
  }
  mfc_write_tag_mask( 1 << cur );                                // wait until i-th chunk...
  mfc_read_tag_status_all();                                     // ...is available
  for (int j = 0; j < 8; ++j)
    ls_buffer[cur][j] = spu_mul(ls_buffer[cur][j], v_fac);       // scale local copy using SIMD
  mfc_put( ls_buffer[cur] , gs_buffer , 128 , cur ,0,0);         // store i-th chunk
  gs_buffer += 128;
}
mfc_write_tag_mask( 1 | 2 | 4 );                                 // wait for all...
mfc_read_tag_status_all();                                       // ...outstanding DMAs

28

Example 2: LBM on Graphics Cards

Johannes Habich, M.Sc. (Johannes.Habich@rrze.uni-erlangen.de), 24.09.2008

29

OpenMP vs. CUDA

• Alignment constraints must be met

• No cache; cache-line granularity of memory accesses must be considered explicitly

• CUDA: divide the domain into many small pieces (blocks of a few threads each)

• OpenMP: divide the domain into a few huge chunks, one per thread (see the sketch below)

[Diagram: the same six cells 0-5, partitioned for CUDA into Block 1 and Block 2 with threads 0-2 each, and for OpenMP into large contiguous chunks handled by threads 0-2]
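A minimal sketch of the OpenMP strategy from this slide (illustrative names, not the actual solver): a static schedule hands each thread one large contiguous chunk of cells, whereas a CUDA version would instead launch many small blocks of threads over the same cells.

/* one sweep over a 1D array of cells, split into a few huge contiguous chunks */
void sweep(double *cells, long n, void (*update_cell)(double *))
{
  #pragma omp parallel for schedule(static)   // static schedule: one big chunk per thread
  for (long i = 0; i < n; ++i)
    update_cell(&cells[i]);
}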

30

Performance Results for GPUs

• Up to 6 times the CPU performance if well implemented

• Less than CPU performance if straightforwardly implemented

• Use padding to circumvent the performance breakdown (80% loss); see the sketch below
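One possible way to add such padding is sketched below; the granularity of 16 floats and the index function are assumptions for illustration, since the value that avoids the breakdown depends on the particular GPU and data layout.

#include <stddef.h>

#define PAD 16   /* pad each row to a multiple of 16 floats (assumed granularity) */

/* padded row length, so that every lattice row starts on an aligned boundary */
static inline size_t padded_width(size_t nx)
{
  return (nx + PAD - 1) / PAD * PAD;
}

/* linear index of distribution q at cell (x,y,z) in a padded structure-of-arrays layout */
static inline size_t idx(size_t q, size_t x, size_t y, size_t z,
                         size_t nxp, size_t ny, size_t nz)
{
  return ((q * nz + z) * ny + y) * nxp + x;
}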

Arbitrary Geometries

31

Part IV

Conclusions

Computer Science X - System Simulation Group, Markus Stürmer (markus.stuermer@informatik.uni-erlangen.de)

Evolution of processors: Improvements

[Table: which improvements each architecture provides, with columns Std.-CPU, CBEA, and GPU and rows pipelining, superscalar execution, out-of-order execution, wider buses, SIMD, multithreading, multiprocessing, caches, hardware prefetcher, resource virtualization, and local storage; the rows are grouped by instruction-, data-, and thread-level parallelism and by data transfer]

33

Conclusions

There is no way around Multi-Core architectures in the foreseeable future

Multi-Core Accelerators have excellent performance potential, but ....

they are not suitable for all algorithms

they are tricky to program

many results published in the literature are too optimistic

• people tend to make unfair comparisons

accelerators and their programming environments are changing very quickly

what about robustness, numerical accuracy, etc?

there is a good chance that we will see these technologies in the future, but likely in a different guise

34

35

Thank you for your attention

Questions?

Slides, papers, reports, theses, and animations are available for

download at: www10.informatik.uni-erlangen.de

Recommended