93
Tornado VM: Running Java on GPUs and FPGAs Juan Fumero, PhD http://jjfumero.github.io QCon-London 2020, 3rd March 2020

Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado VM: Running Java on GPUs and FPGAs

Juan Fumero, PhDhttp://jjfumero.github.io

QCon-London 2020, 3rd March 2020

Page 2: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Agenda1. Motivation & Background

2. TornadoVM

• API - examples

• Runtime & Just In Time Compiler

• Live Task Migration

• Demos

3. Performance Results

4. Related Work & Future Directions

5. Conclusions

2

Page 3: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Who am I?Dr. Juan Fumero

Lead Developer of TornadoVM

Postdoc @ University of Manchester

[email protected]

@snatverk

3

Page 4: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Motivation

4

Page 5: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Why should we care about GPUs/FPGAs, etc.?

Intel Ice Lake (10nm)8 cores HT, AVX(512 SIMD)~1TFlops* (including the iGPU)~ TDP 28WSource: Intel docs

NVIDIA GP 100 – Pascal - 16nm60 SMs, 64 cores each3584 FP32 cores10.6 TFlops (FP32)TDP ~300 WattsSource: NVIDIA docs

Intel FPGA Stratix 10 (14nm)Reconfigurable Hardware~ 10 TFlopsTDP ~225WattsSource: Intel docs

CPU GPU FPGA

5

Page 6: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

What is a GPU? Graphics Processing Unit

Contains a set of Stream Multiprocessor cores (SMx)* Pascal arch. 60 SMx* ~3500 CUDA cores

Users need to know:

A) Programming model (normally CUDA or OpenCL)

B) Details about the architecture are essential to achieve performance

* Non sequential consistency, manual barriers, etc.

Source: NVIDIA docs

6

Page 7: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

What is an FPGA? Field Programmable Gate Array

You can configure the design of your hardware after manufacturing

It is like having "your algorithms directly wired on hardware" with only the parts you need

7

Page 8: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Current Computer Systems & Prog. Lang.

8

Page 9: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Ideal System for Managed Languages

9

Page 10: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM

10

Page 11: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Demo: Kinect Fusion with TornadoVM

https://github.com/beehive-lab/kfusion-tornadovm

* Computer Vision Application* ~7K LOC* Thousands of OpenCL LOC generated.

11

Page 12: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Overview

12

Tasks = Methods

Task-Schedulers = Group of Methods

APIAnnotations

Page 13: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Overview

13

Tasks = Methods

Task-Schedulers = Group of Methods

APIAnnotations

Runtime

Data-Flow & Optimizer

TornadoVM Bytecode Generation

Page 14: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Overview

14

Tasks = Methods

Task-Schedulers = Group of Methods

APIAnnotations

Runtime

Data-Flow & Optimizer

TornadoVM Bytecode Generation

ExecutionEngine

Bytecode interpreter

Device Drivers

Page 15: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Overview

15

Tasks = Methods

Task-Schedulers = Group of Methods

APIAnnotations

Runtime

Data-Flow & Optimizer

TornadoVM Bytecode Generation

ExecutionEngine

Bytecode interpreter

Device Drivers

Just-In-Time Compiler

Device's heap

Compiler /Graal JIT

Extensions

Page 16: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Overview

16

Tasks = Methods

Task-Schedulers = Group of Methods

APIAnnotations

Runtime

Data-Flow & Optimizer

TornadoVM Bytecode Generation

ExecutionEngine

Bytecode interpreter

Device Drivers

Just-In-Time Compiler

Device's heap

Compiler /Graal JIT

Extensions

• OpenJDK 8 > 141• OpenJDK 11• GraalVM 19.3.0• OpenCL >= 1.2• Support for:

• NVIDIA GPUs• Intel HD Graphics• AMD GPUs• Intel Altera FPGAs• Xilinx FPGAs• Multi-core CPUs

Page 17: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado API – example

class Compute {public static void mxm(Matrix2DFloat A, Matrix2DFloatB,

Matrix2DFloat C, final int size) {

for (int i = 0; i < size; i++) {for (int j = 0; j < size; j++) {

float sum = 0.0f;for (int k = 0; k < size; k++) {

sum += A.get(i, k) * B.get(k, j);}C.set(i, j, sum);

}}

}}

17

Page 18: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado API – example

class Compute {public static void mxm(Matrix2DFloat A, Matrix2DFloatB,

Matrix2DFloat C, final int size) {

for (@Parallel int i = 0; i < size; i++) {for (@Parallel int j = 0; j < size; j++) {

float sum = 0.0f;for (int k = 0; k < size; k++) {

sum += A.get(i, k) * B.get(k, j);}C.set(i, j, sum);

}}

}}

We add the parallel annotation as a hint for the compiler.

18

Page 19: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado API – example

class Compute {public static void mxm(Matrix2DFloat A, Matrix2DFloatB,

Matrix2DFloat C, final int size) {

for (@Parallel int i = 0; i < size; i++) {for (@Parallel int j = 0; j < size; j++) {

float sum = 0.0f;for (int k = 0; k < size; k++) {

sum += A.get(i, k) * B.get(k, j);}C.set(i, j, sum);

}}

}}

TaskSchedule ts = new TaskSchedule("s0");ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)

.streamOut(matrixC)

.execute();

19

Page 20: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado API – example

class Compute {public static void mxm(Matrix2DFloat A, Matrix2DFloatB,

Matrix2DFloat C, final int size) {

for (@Parallel int i = 0; i < size; i++) {for (@Parallel int j = 0; j < size; j++) {

float sum = 0.0f;for (int k = 0; k < size; k++) {

sum += A.get(i, k) * B.get(k, j);}C.set(i, j, sum);

}}

}}

TaskSchedule ts = new TaskSchedule("s0");ts.task("t0", Compute::mxm, matrixA, matrixB, matrixC, size)

.streamOut(matrixC)

.execute();

$ tornado Compute

20

To run:

tornado command is just an alias to Java and all the parameters to enable TornadoVM

Page 21: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Demo: Running Matrix Multiplication

21

https://github.com/jjfumero/qconlondon2020-tornadovm

Page 22: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Compiler & Runtime Overview

22

Page 23: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM & Dynamic Languages

23

Page 24: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM & Dynamic Languages

24

Page 25: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Demo 2: Node.js example

25

https://github.com/jjfumero/qconlondon2020-tornadovm

Page 26: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Compiler & Runtime Overview

26

Page 27: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Compiler & Runtime Overview

27

Page 28: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM JIT Compiler Specializations

28

Page 29: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

FPGA Specializations

void compute(float[] input,float[] output) {

for (@Parallel int i = 0; …) }for (int j = 0; ...) {

// Computation}

}}

29

From slowdowns without Specializations to 240x with Automatic Specializations on Intel FPGAs

Page 30: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM: VM in a VM

30

Page 31: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM: VM in a VM

31

Page 32: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

32

Page 33: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

33

Page 34: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

34

Page 35: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

35

Page 36: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

36

Page 37: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

37

Page 38: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

38

Page 39: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM Bytecodes - Example

39

Page 40: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Batch Processing: 16GB into 1GB GPU

40

Page 41: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Batch Processing: 16GB into 1GB GPU

41

Page 42: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Batch Processing: 16GB into 1GB GPU

42

Page 43: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Live Task Migration

43

Page 44: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Dynamic Reconfiguration

44

Page 45: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Dynamic Reconfiguration

45

Page 46: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Dynamic Reconfiguration

46

Page 47: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

How is the decision made?

• End-to-end: including JIT compilation time

• Peak Performance: without JIT and after warming-up

• Latency: does not wait for all threads to finish

47

Page 48: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Demo Live Task Migration –Server/Client App

48

https://github.com/jjfumero/qconlondon2020-tornadovm

Page 49: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

New compilation tier for Heterogeneous Systems

49

Page 50: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

New compilation tier for Heterogeneous Systems

50

Page 51: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Related Work

51

Page 52: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Related Work (in the Java context)

ProjectProduction-

ReadySupported Devices

Live Task Migration

Compiler Specializations

Dynamic Languages

Sumatra No AMD GPUs No No No

Marawacc NoMulti-core, GPUs

No No No

JaBEE No NVIDIA GPUs No No No

RootBeer No NVIDIA GPUs No No No

Aparapi YesGPUs, multi-core

No No No

IBM GPU J9 Yes NVIDIA GPUs No No No

grCUDA No (*) NVIDIA GPUs No No Yes

TornadoVM Not yet (*)Multi-core, GPUs,FPGAs

Yes Yes Yes

52

Page 53: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Related Work (in the Java context)

ProjectProduction-

ReadySupported Devices

Live Task Migration

Compiler Specializations

Dynamic Languages

Sumatra No AMD GPUs No No No

Marawacc NoMulti-core, GPUs

No Yes Yes

JaBEE No NVIDIA GPUs No No No

RootBeer No NVIDIA GPUs No No No

Aparapi YesGPUs, multi-core

No No No

IBM GPU J9 Yes NVIDIA GPUs No No No

grCUDA No (*) NVIDIA GPUs No No Yes

TornadoVM Not yet (*)Multi-core, GPUs,FPGAs

Yes Yes Yes

53

Page 54: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Ok, cool! What about performance?

54

Page 55: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Performance

- NVIDIA GTX 1060- Intel FPGA Nallatech 385a- Intel Core i7-7700K

* TornadoVM performs up to 7.7x over the best device (statically).* Up to >4500x over Java sequential

55

Page 56: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Performance on GPUs, iGPUs, and CPUs

56

Page 57: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

More details in our papers!

57

https://github.com/beehive-lab/TornadoVM/blob/master/assembly/src/docs/Publications.md

Page 58: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Limitations &

Future Work

58

Page 59: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Limitations

We inherit limitations from the underlying Programming Model:

• No object support (except for a few cases)

• No recursion

• No dynamic memory allocation (*)

• No support for exceptions (*)

59

Page 60: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Future Work

• GPU/FPGA full capabilities• Exploitation of Tier-memories such as local memory (in progress)

• Policies for energy efficiency

• Multi-device within a task-schedule

• More parallel skeletons (reductions, stencil, scan, filter, …)

• PTX Backend for NVIDIA

60

Page 61: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Current Applicability of TornadoVM

61

Page 62: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

EU H2020 E2Data Project

https://e2data.eu/

"End-to-end solutions for Big Data deployments that fully exploit heterogeneous hardware"

European Union’s Horizon H2020 research and innovation programme under grant agreement No 780245

62

Page 63: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

E2Data Project – Distributed H. System with Apache Flink & TornadoVM

https://e2data.eu/

63

Page 64: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

How TornadoVM is currently being used in Industry?

64

For example in Health Care and Machine Learning

Page 65: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

65

Using TornadoVM for the training phase (2M patients):

From ~2615s to 185s (14x)

Thanks to Gerald Mema from Exus for sharing the numbers and the use case

How TornadoVM is currently being used in Industry?

Page 66: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

To sum up …

66

Page 67: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

TornadoVM available on Github and DockerHub

67

https://github.com/beehive-lab/TornadoVM

$ docker pull beehivelab/tornado-gpu

#And run!$ ./run_nvidia.sh javac.py YourApp$ ./run_nvidia.sh tornado YourApphttps://github.com/beehive-lab/docker-tornado

Page 68: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Team

• Academic staff:

Christos Kotselidis

• Research staff:

Juan Fumero

Athanasios Stratikopoulos

Foivos Zakkak

Florin Blanaru

Nikos Foutris

• PhD Students:

Michail Papadimitriou

Maria Xekalaki

• Interns:

Undergraduates:

Gyorgy Rethy

Mihai-Christian Olteanu

Ian Vaughan

68

We are looking for collaborations (industrial & academics) -> Talk to us!

• Alumni:

James Clarkson

Benjamin Bell

Amad Aslam

Page 69: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Takeaways

69

Page 70: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Takeaways

70

Page 71: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Takeaways

71

> 4500x vs Hotspot

Page 72: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Thank you so much for your attention

This work is partially supported by the EU Horizon 2020 E2Data 780245

Contact: Juan Fumero <[email protected]>

72

Page 73: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Back up slides

73

Page 74: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

We could potentially use ALL devices!

CPU Cores:* 4-8 cores per CPU* Local cache (L1-L3)

GPU cores:* Thousands of cores per GPU card* > 60 cores per SM* Small caches per SM* Global memory within the GPU* Few thread/schedulers per SM

FPGAs:* Chip with LUTs, BRAMs, and wires to* Normally global memory within the chip

74

Page 75: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Current Computer Systems

75

Page 76: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Current Computer Systems & Prog. Lang.

76

Page 77: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Current Computer Systems & Prog. Lang.

77

Page 78: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado API – Map-Reduce

78

class Compute {public static void map(float[] input, float[] output) {

for (@Parallel int i = 0; i < size; i++) {output[i] = Math.sqrt(input[i]);

}}public static void reduce(@Reduce float[] data) {

for (@Parallel int i = 0; i < size; i++) {data[0] += data[i];

}}

}

Page 79: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Tornado API – Map-Reduce

class Compute {public static void map(float[] input, float[] output) {

for (@Parallel int i = 0; i < size; i++) {output[i] = Math.sqrt(input[i]);

}}public static void reduce(@Reduce float[] data) {

for (@Parallel int i = 0; i < size; i++) {data[0] += data[i];

}}

}

TaskSchedule ts = new TaskSchedule("MapReduce");ts.streamIn(input)

.task("map", Compute::map, input, output)

.task("reduce", Compute::reduce, output)

.streamOut(output)

.execute();

github.com/beehive-lab/TornadoVM/tree/master/examples

79

Page 80: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Still, why should we care about GPUs/FPGAs, etc?

Performance for each device against Java hotspot:* Up to 4500x by using a GPU* 240x by using an FPGA

80

Page 81: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

How to Program? E.g., OpenCL

1. Query OpenCL Platforms

2. Query devices available

3. Create device objects

4. Create an execution context

5. Create a command queue

6. Create and compile the GPU Kernels

7. Create <GPU> buffers

8. Create buffers and send data (Host -> Device)

10. Send data back (Device -> Host)

11. Free Memory

9. Run <GPU> Kernel

81

Page 82: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

How the OpenCL Generated Kernel looks like?

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void vectorAdd(__global uchar *_heap_base, ulong _frame_base, .. ){

int i_9, i_11, i_4, i_3, i_13, i_14, i_15;long l_7, l_5, l_6;ulong ul_0, ul_1, ul_2, ul_12, ul_8, ul_10;

__global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

// BLOCK 0ul_0 = (ulong) _frame[6];ul_1 = (ulong) _frame[7];ul_2 = (ulong) _frame[8];i_3 = get_global_id(0);// BLOCK 1 MERGES [0 2 ]i_4 = i_3;for(;i_4 < 256;) {// BLOCK 2l_5 = (long) i_4;l_6 = l_5 << 2;l_7 = l_6 + 24L;ul_8 = ul_0 + l_7;i_9 = *((__global int *) ul_8);ul_10 = ul_1 + l_7;i_11 = *((__global int *) ul_10);ul_12 = ul_2 + l_7;i_13 = i_9 + i_11;*((__global int *) ul_12) = i_13;i_14 = get_global_size(0);i_15 = i_14 + i_4;i_4 = i_15;

}// BLOCK 3

return;}

Access to the Java frame

Access the data within the frame

Access the arrays (skip object header)

Operation

Final Store

private void vectorAdd(int[] a, int[] b, int[] c) {for (@Parallel int i = 0; i < c.length; i++) {

c[i] = a[i] + b[i];}

}

82

Page 83: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Example in VHDL (using structural modelling)

83

library ieee;use ieee.std_logic_1164.all;

entity half_adder is -- Entityport (a, b: in std_logic;

sum, carry: out std_logic);end half_adder;

architecture structure of half_adder is -- Architecturecomponent xor_gate -- xor component

port (i1, i2: in std_logic;o1: out std_logic);

end component;

component and_gate -- and componentport (i1, i2: in std_logic;

o1: out std_logic);end component;

beginu1: xor_gate port map (i1 => a, i2 => b, o1 => sum);u2: and_gate port map (i1 => a, i2 => b, o1 => carry);

end structure;

Page 84: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Using OpenCL instead

84

library ieee;use ieee.std_logic_1164.all;

entity half_adder is -- Entityport (a, b: in std_logic;

sum, carry: out std_logic);end half_adder;

architecture structure of half_adder is -- Architecturecomponent xor_gate -- xor component

port (i1, i2: in std_logic;o1: out std_logic);

end component;

component and_gate -- and componentport (i1, i2: in std_logic;

o1: out std_logic);end component;

beginu1: xor_gate port map (i1 => a, i2 => b, o1 => sum);u2: and_gate port map (i1 => a, i2 => b, o1 => carry);

end structure;

Industry is pushing for OpenCL on FPGAs!

_kernel void sum(float a,float b,__global float*result){

result[0] = a + b;}

Page 85: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

More About FPGA Support

$ tornado YourProgram

$ tornado –Dtornado.fpga.aot.bitstream=<path> YourProgram

$ tornado –Dtornado.fpga.emulation=True YouProgram

85

Page 86: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

FPGA Support

86

void compute(float[] input,float[] output) {

for (@Parallel int i = 0; …) }for (@Parallel int j = 0; ...)

{// Computation

}}

}

Java

TornadoVM

Physical hardware

Program FPGAs within your favourite IDE: Eclipse, IntelliJ, …

Page 87: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

FPGA Specializations

With Compiler specializations, TornadoVM performs from 5x to 240x against Java Hostpot for DFT!!!

void compute(float[] input,float[] output) {

for (@Parallel int i = 0; …) }for (int j = 0; ...) {

// Computation}

}}

87

Non-specialized version Specialized version

Page 88: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Specializations: reductions

88

void reduce(float[] input, @Reduce float[] output) {for (@Parallel int i = 0; I < N; I++) {

output[0] += input[I];}

}

… but how?

Page 89: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Reduction Specializations via Snippets

With reduction-specializations we execute the code within 80% of the native (manual written code)

89

Page 90: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Demo - code specialization with Graal

90

$ tornado --igv YourProgram

Page 91: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Performance: FPGA vs Multi-threading Java

* TornadoVM on FPGA is up to 19x over Java multi-threads (8 cores)* Slowdown for small sizes

ector dd lac Sc oles RenderTrac ody D T

Speedup gainst ava

Multi t readed

Small Size Medium Size arge Size

91

Page 92: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

JEP - 8047074GOALS Implemented in Tornado?

No syntactic changes to Java 8 parallel stream API (Own API)

Autodetection of hardware and software stack

Heuristic to decide when to offload to GPU gives perf gains

Performance improvement for embarrassingly parallel workloads

Code accuracy has the same (non-) guarantees you can get with multi core parallelism

Code will always run with fallback to normal CPU execution if offload fails In progress!

Will not expose any additional security risks Under research

Offloaded code will maintain Java memory model correctness (find JSR) Under formal specification(several trade-offs have to be considered)

Where possible enable JVM languages to be offloaded Plan to integrate with Truffle. E.g., FastR-GPU: https://bitbucket.org/juanfumero/fastr-

gpu/src/default/

http://openjdk.java.net/jeps/8047074

92

Page 93: Tornado VM: Running Java on GPUs and FPGAs · Agenda 1. Motivation & Background 2. TornadoVM • API - examples • Runtime & Just In Time Compiler • Live Task Migration • Demos

Additional features

Additional Features (not included JEP 8047074) Implemented in Tornado?

Include GPUs, integrated GPU, FPGAs, multi-cores CPUs

Live-task migration between devices

Code specialization for each accelerator

Potentially accelerate existing Java libraries (Lucene)

Automatic use of tier-memory on the device (e.g., local memory) < In progress>

Virtual Shared Memory (OpenCL 2.0) < In progress>

93