
Radically Simplified GPU Programming in F# and .NET

Luc Bläser, Institute for Software, HSR Rapperswil

Oskar Knobel, Philipp Kramer

Joint Project with QuantAlea

Daniel Egloff, Xiang Zhang, Dany Fabian

F# User Meeting, 3 Sep 2014

GPU Programming Today

Massively parallel power

□ Very specific pattern: vector-parallelism

High obstacles

□ Particular algorithms needed

□ Machine-centric programming models

□ Poor language and runtime integration

Good excuses against it, unfortunately

□ Too difficult, costly, error-prone, marginal benefit

(Image: GPU with 5760 cores)

Our Goal

GPU parallel programming for (almost) everyone

Radical simplification

□ No GPU experience required

□ Fast development

□ High performance comes automatically

□ Guaranteed memory safety

Broad community

□ .NET in general: C#, F#, VB etc.

□ Based on the Alea cuBase F# runtime

Alea Dataflow Programming Model

Dataflow

□ Graph of operations

□ Data propagated through graph

Reactive

□ Feed input in arbitrary intervals

□ Listen for asynchronous output

The Descriptive Power

Program is purely descriptive

□ What, not how

Efficient execution behind the scenes

□ Vector-parallel operations

□ Stream operations on GPU

□ Minimize memory copying

□ Hybrid multi-platform scheduling

□ Tune degree of parallelization

□ …

Operation

Unit of calculation (typically vector-parallel)

Input and output ports

Port = stream of typed data

Consumes input, produces output
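The operation concept (ports as streams of typed data, operations consuming input batches and producing output batches) can be modeled abstractly. A minimal Python sketch for illustration only, not the Alea API; all names here are hypothetical:

```python
from queue import Queue

class MapOperation:
    """A dataflow operation: consumes batches from its input port,
    applies an element-wise function, and emits to its output port."""

    def __init__(self, f):
        self.f = f
        self.input = Queue()   # input port: stream of batches
        self.output = Queue()  # output port: stream of batches

    def step(self):
        batch = self.input.get()                      # consume one input batch
        self.output.put([self.f(x) for x in batch])   # produce one output batch

op = MapOperation(lambda x: x * x)
op.input.put([1, 2, 3])
op.step()
print(op.output.get())  # [1, 4, 9]
```

In the real model, operations such as MatrixProduct and Splitter additionally have several named input or output ports, and stepping is driven asynchronously by the scheduler rather than by an explicit `step()` call.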

Map

Input: T[]

Output: U[]

MatrixProduct

Left: T[,] Right: T[,]

Output: T[,]

Splitter

Input: Tuple<T, U>

First: T Second: U

Graph

(Diagram: Random → Pairing → Map → Average; the ports carry int, float[] and Pair<float>[] data)

let randoms = Random<single>(0.f, 1.f)
let coordinates = Pairing<single>()
let inUnitCircle = Map<Pair<single>, single>
    (fun p -> if p.Left * p.Left + p.Right * p.Right <= 1.f
              then 1.f else 0.f)
let average = new Average<single>()

randoms.Output.ConnectTo(coordinates.Input)
coordinates.Output.ConnectTo(inUnitCircle.Input)
inUnitCircle.Output.ConnectTo(average.Input)
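The graph estimates π by Monte Carlo: points are sampled uniformly in the unit square, and the fraction falling inside the unit circle approaches π/4. A plain Python sketch of the same computation on the CPU (illustration only, not the Alea API):

```python
import random

def estimate_pi(n, seed=42):
    """Monte Carlo pi: sample n points in [0,1)^2 and count those
    inside the unit circle; 4 * (fraction inside) approaches pi."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(n)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / n

print(estimate_pi(1_000_000))  # prints an estimate close to 3.14
```

This also explains the `4.f * x` in the `OnReceive` handler below: the Average operation delivers the fraction inside the circle, which must be multiplied by 4 to yield π.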

Dataflow

Send data to input port

Receive from output port

All asynchronous

(Diagram: Random → Pairing → Map → Average; Send(1000) and Send(1000000) each trigger an asynchronous Write(…))

average.Output.OnReceive(fun x -> Console.WriteLine(4.f * x))

randoms.Input.Send(1000)
randoms.Input.Send(1000000)

Short Fluent Notation

let randoms = new Random<single>(0.f, 1.f)
randoms.Pairing()
       .Map(fun p -> if p.Left * p.Left + p.Right * p.Right <= 1.f
                     then 1.f else 0.f)
       .Average()
       .OnReceive(fun x -> Console.WriteLine(4.f * x))

randoms.Send(1000)
randoms.Send(1000000)

Algebraic Computation

A * B + C

(Diagram: MatrixProduct feeding MatrixSum; inputs A, B, C)

let product = MatrixProduct<float>()
let sum = MatrixSum<float>()

product.Output.ConnectTo(sum.Left)
sum.Output.OnReceive(Console.WriteLine)

product.Left.Send(A)
product.Right.Send(B)
sum.Right.Send(C)
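The graph computes A * B + C once all three inputs have arrived. For reference, a plain Python sketch of the same arithmetic with nested lists (illustration only, not the Alea API):

```python
def mat_mul(a, b):
    """Naive matrix product of nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_add(a, b):
    """Element-wise matrix sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
print(mat_add(mat_mul(A, B), C))  # [[20, 22], [43, 51]]
```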

Iterative Computation

bᵢ₊₁ = A ∙ bᵢ (until bᵢ₊₁ ≈ bᵢ)

let source = Splitter<float[,], float[]>()
let product = source.First.Multiply(source.Second)
let steady = product.Compare(source.Second, fun x y ->
    Abs(x - y) < 1E-6)
let next = source.First.Merge(product)
let branch = steady.Switch(next)

branch.False.ConnectTo(source.Input)
branch.True.OnReceive(Console.WriteLine)

source.Send((A, b0))

(Diagram: Splitter splits (A, bᵢ) into A and bᵢ; MatrixVectorProduct computes bᵢ₊₁; Predicate compares bᵢ₊₁ with bᵢ into a bool; Merger rebuilds (A, bᵢ₊₁); Switch feeds (A, bᵢ₊₁) back to the Splitter or emits the result)
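The cycle in the graph realizes the fixed-point iteration bᵢ₊₁ = A ∙ bᵢ with the same stopping criterion as the Compare functor (component-wise difference below 1E-6). A plain Python sketch of the equivalent sequential loop (illustration only, not the Alea API):

```python
def mat_vec(a, b):
    """Matrix-vector product with nested lists."""
    return [sum(row[j] * b[j] for j in range(len(b))) for row in a]

def iterate(a, b, eps=1e-6, max_steps=10_000):
    """Repeat b <- A * b until successive vectors differ by less than eps
    in every component (the graph's Predicate), or max_steps is reached."""
    for _ in range(max_steps):
        nb = mat_vec(a, b)
        if all(abs(x - y) < eps for x, y in zip(nb, b)):
            return nb  # steady state: Switch routes to the True branch
        b = nb         # not steady: Switch feeds (A, b) back to the Splitter
    return b

# Contractive example: b converges towards the fixed point [0, 0].
print(iterate([[0.5, 0.0], [0.0, 0.5]], [1.0, 1.0]))
```

Note that without normalization this converges only for contractive A; the dataflow formulation makes the cycle explicit but leaves the numerical conditions to the user.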

Current Scheduler

Operations implement a GPU and/or a CPU variant

GPU operations combined into streams

Memory copy only when needed

Host scheduling with .NET TPL

Automatic free memory management

(Diagram: operations scheduled on GPU and CPU, with COPY steps only at the boundaries between them)

Operation Catalogue

Prefabricated generic operations

□ Switch, Merger, Splitter, Predicate

□ Map, Reduce, Average, Pairing

□ Random, MatrixProduct, MatrixSum, MatrixVectorProduct, VectorSum, ScalarProduct

□ More to come…

Custom operations can be added

Good performance

□ Nearly as fast as native CUDA C (within a 10-20% margin)

Related Work

Rx.NET / TPL Dataflow

□ Single input and output port

□ Not for GPU

Xcelerit

□ Not reactive: single flow per graph

□ No generic operations with functors

MSR PTasks / Dandelion

□ Synchronous receive, C++ only, no generic operations

□ .NET LINQ integration (pull instead of push)

FastFlow

□ Not reactive (synchronous run of the graph)

□ More low-level C++ tasks, no functors

Conclusions

Simple but powerful GPU parallelization in .NET

□ No low-level GPU artefacts

□ Fast and condensed problem formulation

□ Efficient and safe execution by the scheduler

The descriptive paradigm is the key

□ Reactive makes it very general: cycles, infinite streams etc.

□ Practical suitability depends on operations

Future directions

□ Advanced schedulers: multiple GPUs, clusters, optimizations

□ Larger operation catalogue
