Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support Kyle C. Hale and Peter Dinda

Source: v3vee.org/talks/vee16.pdf


Hybrid Runtimes

runtime not limited to abstractions exposed by syscall interface

HPDC ‘15

opportunity for leveraging privileged HW features

the runtime IS the kernel

CURRENT OS/R MODEL

[Diagram: layered stack. User-mode: PARALLEL APP and PARALLEL RUNTIME. Kernel-mode: GENERAL PURPOSE OS. Bottom: HARDWARE.]

THE HYBRID RUNTIME

[Diagram: two stacks side by side. Left: PARALLEL APP and PARALLEL RUNTIME in user-mode, GENERAL PURPOSE OS in kernel-mode, HARDWARE. Right: PARALLEL APP on HYBRID RUNTIME on HARDWARE.]

kernel mashup of OS and runtime


Racket

Legion

NDPC

NESL

[Figure: speedup over Linux (y-axis, 1 to 1.4) vs. Legion processor count in cores (x-axis: 1, 2, 4, 8, 16, 32, 64, 128, 200, 220), for Xeon Phi and x64. Annotations: 1.19x speedup over Linux; 1.37x speedup over Linux.]


TWO ENABLING TOOLS


kernel framework

HVM: THE HYBRID VIRTUAL MACHINE

bridge HRT with legacy OS

OUTLINE

Background/Overview

Nautilus

Deployment Models

Hybrid Virtual Machine

Multiverse & Future Work

NAUTILUS aerokernel

[Diagram: PARALLEL APP on RUNTIME on the Nautilus primitives & utilities on HARDWARE, with the user-mode/kernel-mode boundary marked.]

Nautilus primitives & utilities (HRT can use or not use any of them): THREADS, SYNC, PAGING, EVENTS, HW INFO, BOOTSTRAP, TIMERS, IRQS, CONSOLE


Nautilus under the hood


Nautilus

Parallel Runtime System


kernel primitives should be SIMPLE and FAST

runtime developer can easily reason about them!


start with familiar interfaces

condition variables

threads

mutexes/locks

fork/join parallelism

memory management
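
As a rough illustration of what these familiar interfaces look like to a runtime developer, the sketch below uses Python's standard threading module as a stand-in. It is not the Nautilus API, which the slides do not show; the function name fork_join_sum and its parameters are hypothetical.

```python
import threading

def fork_join_sum(values, num_workers=4):
    """Sum `values` using fork/join threads, a mutex, and a condition variable."""
    total = 0
    finished = 0
    lock = threading.Lock()            # mutex protecting `total` and `finished`
    done = threading.Condition(lock)   # condition variable bound to that mutex

    def worker(chunk):
        nonlocal total, finished
        partial = sum(chunk)
        with done:                     # acquires the underlying mutex
            total += partial
            finished += 1
            done.notify_all()          # wake any thread waiting on `finished`

    chunks = [values[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:                  # fork
        t.start()
    with done:                         # block until every worker has reported
        done.wait_for(lambda: finished == num_workers)
    for t in threads:                  # join
        t.join()
    return total

print(fork_join_sum(list(range(100))))
```

The same four ingredients (threads, a lock, a condition variable, and a join) are the ones the slide lists as the starting point.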


Parallel Runtime System

logical CPUs

map runtime’s logical view of machine onto physical HW

threads

unified, shared address space

threads can operate preemptively or cooperatively*

*for runtimes that require more determinism

thread creation is FAST

[Figure: cycles (y-axis, 0 to 7,000,000) at x-axis ticks 2, 4, 8, 16, 32, 64, for Nautilus and Linux (pthreads). Annotations: ~3ms; ~90µs.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM
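
To relate the cycle axis to the millisecond and microsecond annotations, the sketch below converts between cycles and time. The 2.2 GHz clock is an assumption, chosen because it matches the cycle/time pairs reported for ROS/HRT communication elsewhere in the deck; the slides do not state the clock rate of this machine. The helper names are hypothetical.

```python
CLOCK_HZ = 2_200_000_000  # assumed clock rate (2.2 GHz); not stated on the slide

def cycles_to_ms(cycles):
    """Convert a cycle count to milliseconds at the assumed clock rate."""
    return cycles / CLOCK_HZ * 1e3

def us_to_cycles(microseconds):
    """Convert microseconds to cycles at the assumed clock rate."""
    return microseconds * CLOCK_HZ / 1e6

# Top of the plotted y-axis, in milliseconds
print(f"7,000,000 cycles = {cycles_to_ms(7_000_000):.2f} ms")
# The ~90 microsecond annotation, in cycles
print(f"90 us = {us_to_cycles(90):,.0f} cycles")
```

Under that assumption, the top of the y-axis corresponds to roughly 3 ms, which is in line with the "~3ms" annotation.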

user-mode software events are SLOW

[Figure: trigger latency in cycles (y-axis, 0 to 35,000; lower is better) for software events (cond. var., futex) and hardware (IPI).]

nautilus event triggers are FASTER

[Figure: same axes, adding naut. cond. var. and naut. cond. var. w/IPI.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM

[Figure: trigger latency in cycles (y-axis, 0 to 1800) for IPI and Nemo. Annotation: ~100 cycles.]


memory

static identity map

[Diagram: physical memory identity-mapped at bootup. Page size labels: 1GB ... 2MB ...]

eliminates expensive page faults

reduces TLB misses + shootdowns

increases performance under virtualization
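
A quick back-of-the-envelope calculation shows why large pages help: the sketch below counts how many mappings are needed to cover the 128GB of RAM quoted for the Opteron testbed at each page size. The 4KB row is added for comparison and is the standard x86_64 base page size, not a number from the slides; the helper name pages_needed is hypothetical.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30

def pages_needed(memory_bytes, page_bytes):
    """Number of pages needed to cover memory_bytes, rounded up."""
    return -(-memory_bytes // page_bytes)

ram = 128 * GIB  # RAM size quoted for the x86_64 Opteron testbed
for label, size in [("4KB", 4 * KIB), ("2MB", 2 * MIB), ("1GB", GIB)]:
    print(f"{label} pages: {pages_needed(ram, size):,}")
```

Fewer, larger mappings mean fewer distinct translations competing for TLB entries, which is the effect the slide summarizes.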

NUMA

[Diagram: two NUMA domains. Labels: first-touch, stack.]

runtime has FULL control over thread placement and memory layout


philix

[Diagram: host and Xeon Phi. Labels: philix host utility, Intel MPSS stack, custom OS, virtual console.]

Nautilus is fairly small

Nautilus                  25,000 lines of mostly C
Legion                    43,000 lines of C++
Additions for Legion      800 lines of C/C++
Additions for Xeon Phi    1350 lines of C


What do we lose?


familiar environment

driver ecosystem


protection/isolation


DEDICATED

[Diagram: APPLICATION on HRT on HARDWARE; full HW access.]

PARTITIONED

[Diagram: HARDWARE cores CORE 0 … CORE K-1 and CORE K … CORE N-1 divided between the LINUX STACK and the APPLICATION on the HRT; full HW access.]

VM

[Diagram: a VMM on HARDWARE hosting VIRTUAL MACHINE 0 and VIRTUAL MACHINE 1. Labels: LINUX STACK; PARALLEL APP on HRT.]

Hybrid Virtual Machine

[Diagram: a VMM with HVM on HARDWARE. Regular OS vcores run the LINUX STACK; HRT vcores run the PARALLEL APP on the HRT.]


GENERAL VIRTUALIZATION MODEL vs. SPECIALIZED VIRTUALIZATION MODEL

[Diagram labels: user-mode; kernel-mode; PARALLEL APP; PARALLEL RUNTIME; GENERAL PURPOSE OS; regular OS (ROS); HRT; HVM; NODE HARDWARE.]

performance path

legacy functionality from ROS via HVM

ROS/HRT setup

[Diagram: a Virtual Machine containing ROS vcores and HRT vcores, with IPIs between vcores.]

[Diagram: the same Virtual Machine with its memory. Labels: VM memory, ROS visible memory.]

merged address space

[Diagram: ROS vaddr space (higher half: ROS kernel; app + runtime code & data) next to HRT vaddr space (higher half: Aerokernel; app + runtime code & data, merged).]

ROS/HRT communication

[Diagram: LINUX, HYPERVISOR + HVM, and HRT with a shared data page. Requests shown: HRT reboot request, Address space merger, HRT function invocation.]

[Diagram: same components with a shared data page and comm. area. Request shown: Synchronous operation request.]

ROS/HRT communication

Operation                                    Cycles    Time
Addr space merge                             ~33K      15µs
Asynch function invocation                   ~25K      11µs
Synch function invocation (remote socket)    ~1060     482ns
Synch function invocation (same socket)      ~790      359ns
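
The cycle and time columns can be cross-checked against each other. The sketch below derives the clock rate implied by the two synchronous-invocation rows and then converts the rounded cycle counts of the other rows back to microseconds. The helper names are hypothetical, and the derived clock rate is an inference from the reported numbers, not a figure stated in the slides.

```python
def implied_ghz(cycles, seconds):
    """Clock rate (GHz) implied by a cycles/time pair."""
    return cycles / seconds / 1e9

def cycles_to_us(cycles, ghz):
    """Convert cycles to microseconds at a given clock rate in GHz."""
    return cycles / (ghz * 1e3)

# Cycle/time pairs as reported for synchronous function invocation
remote = implied_ghz(1060, 482e-9)
same = implied_ghz(790, 359e-9)
print(f"implied clock: {remote:.2f} GHz (remote socket), {same:.2f} GHz (same socket)")

# Applying ~2.2 GHz to the rounded cycle counts of the slower operations
print(f"addr space merge: {cycles_to_us(33_000, 2.2):.1f} us")
print(f"asynch invocation: {cycles_to_us(25_000, 2.2):.1f} us")
```

Both rows point to a clock of about 2.2 GHz, and the rounded cycle counts land close to the reported 15µs and 11µs.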

ROS boot

[Diagram: the ROS BSP issues an INIT-SIPI-SIPI sequence to each of the ROS APs, followed by BOOTSTRAP on each AP.]

HRT boot

HVM sets cores up in long mode*

*no need for INIT-SIPI-SIPI sequence

HVM sets up registers, system tables: GDT, IDT, TSS, REGISTERS, INIT PAGE TABLES

HVM boots cores simultaneously


LINUX FORK + EXEC ~ 714µs

HVM + HRT CORE BOOT ~ 61µs
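
For scale, the ratio between the two figures can be computed directly. This is plain arithmetic on the two numbers above; the variable names are illustrative only.

```python
fork_exec_us = 714.0  # LINUX FORK + EXEC, as reported
hrt_boot_us = 61.0    # HVM + HRT CORE BOOT, as reported

ratio = fork_exec_us / hrt_boot_us
print(f"HRT core boot is about {ratio:.1f}x faster than fork + exec")
```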

how do we take a legacy runtime to the HRT + X model?

PORT

porting a runtime/app to a new OS environment is…

DIFFICULT

TIME-CONSUMING

ERROR-PRONE

development cycle:

do {
    ADD FUNCTION
    REBUILD
    BOOT HRT
} while (HRT falls over)

much of the functionality is NOT ON THE CRITICAL PATH

we want to make this easier



give us your legacy runtime

we automatically transform it to run as an HRT

bridged with a legacy OS in kernel mode

(AUTOMATIC HYBRIDIZATION)


rebuild with our toolchain

LEGACY APP/RUNTIME

[Diagram: the LEGACY APP/RUNTIME, together with the MULTIVERSE RUNTIME LAYER and an AEROKERNEL BINARY, runs like a normal application on LINUX over the HYPERVISOR + HVM. Next to it, the BOOTED AEROKERNEL hosts the LEGACY APP/RUNTIME and MULTIVERSE RUNTIME LAYER in kernel-mode.]

Hybrid Virtual Machine: bridge HRT with legacy OS

Multiverse: automatic transformation: legacy app+runtime to HRT

4 parallel runtimes (Legion, Racket, NDPC, NESL)


thank you

my webpage: halek.co
lab: presciencelab.org
download nautilus: nautilus.halek.co
v3vee project: v3vee.org


backups


thread fork


interrupt driven execution


memory and paging

static identity map

[Diagram: physical memory identity-mapped at bootup. Page size labels: 1GB ... 2MB ...]

eliminates expensive page faults

reduces TLB misses + shootdowns

increases performance under virtualization

[Figure: unified TLB misses (y-axis, log scale, 1 to 1,000,000) for Linux and Nautilus.]

thread creation is PREDICTABLE

[Figure: cycles (y-axis, log scale, 1000 to 1e8; lower is better) for Linux (pthreads) and Nautilus. Label: kernel threads.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM


binding to physical processor is GUARANTEED

RUNTIME control over

virtual CPU

physical

interrupts & events


asynchronous notifications

[Diagram: timeline from EVENT COMPLETES (trigger) to DEPENDENT TASK EXECUTES; the interval between the two is labeled trigger latency.]

user-mode software events are SLOW

[Figure: trigger latency in cycles (y-axis, 0 to 35,000; lower is better) for cond. var., futex, and hardware (IPI).]

nautilus event triggers are FASTER

[Figure: same axes, adding naut. cond. var. and naut. cond. var. w/IPI.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM

[Diagram: two handle_event() handlers.]


asynchronous, IPI-based execution


NOT POSSIBLE IN LINUX USERSPACE

[Figure: trigger latency in cycles (y-axis, 0 to 1800; lower is better) for IPI and Nemo. Annotation: ~100 cycles.]

this gives us an incremental path for creating HRTs

start out with a working HRT system

then pull functionality into the HRT for hot spots


RACKET

most widely used Scheme implementation

downloaded ~300 times/day

complex runtime with JIT

[Diagram: the LEGACY APP/RUNTIME with the MULTIVERSE RUNTIME LAYER and AEROKERNEL BINARY on LINUX over the HYPERVISOR + HVM, next to the BOOTED AEROKERNEL hosting the LEGACY APP/RUNTIME and MULTIVERSE RUNTIME LAYER. Arrow labels: system call, system call, ROS/HRT communication.]

Nautilus events: Broadcast wakeups on x64

[Figure: cycles to wakeup (y-axis, 0 to 2.5×10^6) per mechanism.]

Mechanism                   µ          σ          min      max
pthread condvar             995795     544512     17538    2.17277e+06
futex wakeup                370630     199680     16402    1.89553e+06
Aerokernel condvar          265820     159421     3258     612959
Aerokernel condvar + IPI    132417     98637.4    7842     464015
broadcast IPI               12827.3    2931.32    1252     57467
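
To put the reported means side by side, the sketch below expresses each mechanism's mean wakeup cost as a multiple of the broadcast IPI mean. The pairing of the two Aerokernel rows with their statistics follows the reading order of the slide and is an assumption; the dictionary and variable names are illustrative.

```python
# Mean cycles to wakeup, as reported for broadcast wakeups on x64.
# The two Aerokernel entries are matched to their statistics by reading
# order, which is an assumption.
mean_cycles = {
    "pthread condvar": 995795.0,
    "futex wakeup": 370630.0,
    "Aerokernel condvar": 265820.0,
    "Aerokernel condvar + IPI": 132417.0,
    "broadcast IPI": 12827.3,
}

baseline = mean_cycles["broadcast IPI"]
slowdown = {name: mean / baseline for name, mean in mean_cycles.items()}

for name, factor in slowdown.items():
    print(f"{name:26s} {factor:6.1f}x the broadcast IPI mean")
```

Read this way, the kernel-mode condition variables sit between the user-mode mechanisms and the raw broadcast IPI.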

Nautilus events: Wakeup deviation on x64

[Figure: CDF (y-axis, 0 to 1) of σ (x-axis ticks 1000, 10000, 100000, 1×10^6) for pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, and broadcast IPI. Annotations: HW lower bound; user-mode events; Nemo events (kernel-mode); 2x.]

Broadcast wakeups on x64

[Figure: cycles to wakeup (y-axis, 0 to 2.5×10^6) for pthread condvar (µ = 995795, σ = 544512, min = 17538, max = 2.17277e+06), futex wakeup (µ = 370630, σ = 199680, min = 16402, max = 1.89553e+06), and broadcast IPI (µ = 12827.3, σ = 2931.32, min = 1252, max = 57467).]

Unicast IPIs on phi

[Figure: CDF (y-axis, 0 to 1) vs. cycles measured from BSP (core 224), x-axis 710 to 800. 95th percentile = 746 cycles.]

Roundtrip IPIs on phi

[Figure: CDF (y-axis, 0 to 1) vs. cycles measured from BSP (core 224), x-axis 2080 to 2200. 95th percentile = 2105 cycles.]

Multicast IPIs on phi

[Figure: CDF (y-axis, 0 to 1) vs. σ, x-axis 15 to 18.5. 95th percentile σ = 31 cycles.]

Wakeup deviation on x64

[Figure: CDF (y-axis, 0 to 1) of σ (x-axis ticks 1000, 10000, 100000, 1×10^6) for pthread condvar, futex broadcast, and broadcast IPI. Annotations: HW lower bound; user-mode events; ~70x.]

thread context switch on x64

[Figure: cycles (y-axis, 0 to 7000) per thread type.]

Thread type           µ          σ          min     max
Linux (pthreads)      1412.67    114.377    1352    2438
Aerokernel threads    1269.52    48.3902    1176    1391

107

thread create + launch on phi

[Figure: cycles on a log-scale y-axis from 10000 to 1×10^7. Series: Linux (pthreads), Linux (kernel threads), Nautilus kernel threads.]

108

thread create + launch (many threads) on x64

[Figure: cycles (y-axis, 0 to 1.2×10^7) versus number of threads (x-axis: 2, 4, 8, 16, 32, 64). Series: Linux pthreads, Linux kernel threads, Nautilus threads.]

109

thread create + launch (many threads) on phi

[Figure: cycles (y-axis, 0 to 1.6×10^8) versus number of threads (x-axis: 2, 4, 8, 16, 32, 64, 128, 228). Series: Linux pthreads, Linux kernel threads, Nautilus threads.]

110

Unicast IPIs on x64

[Figure: CDF of cycles measured from BSP (core 0); x-axis spans 1000 to 2400 cycles. 95th percentile = 1728 cycles.]

111

Roundtrip IPIs on x64

[Figure: CDF of cycles measured from BSP (core 0); x-axis spans 3600 to 5000 cycles. 95th percentile = 4640 cycles.]

112

Multicast IPIs on x64

[Figure: CDF of σ; x-axis spans 2000 to 8000 cycles. 95th percentile σ = 2812 cycles.]

113

Unicast IPI vs memory polling

[Figure: CDF of cycles measured from BSP (core 0); x-axis spans 0 to 2000 cycles. Series: unicast IPI, synchronous event (mem polling).]

114

Unicast IPI vs memory polling

[Figure: CDF of cycles measured from BSP (core 0); x-axis spans 0 to 2000 cycles. Series: unicast IPI, synchronous event (mem polling). Annotations: coherence network, interrupt network.]

115

Unicast IPI vs memory polling

[Figure: CDF of cycles measured from BSP (core 0); x-axis spans 0 to 2000 cycles. Series: unicast IPI, synchronous event (mem polling), projected remote syscall.]

116

Unicast IPI vs memory polling

[Figure: CDF of cycles measured from BSP (core 0); x-axis spans 0 to 2000 cycles. Series: unicast IPI, synchronous event (mem polling), projected remote syscall. Annotation: cost of dest. handling.]

117

Single wakeup on phi

[Figure: cycles to wakeup for three mechanisms.]

pthread condvar: µ = 25722.7, min = 24333, max = 27834, σ = 618.407
futex wakeup: µ = 15637.4, min = 13311, max = 22343, σ = 1173.37
unicast IPI: µ = 740.85, min = 732, max = 761, σ = 5.71205

118

Wakeup deviation on phi

[Figure: CDF of σ; log-scale x-axis from 10 to 1×10^7. Series: pthread condvar, futex broadcast, broadcast IPI.]

119

Wakeup deviation on phi

[Figure: CDF of σ; log-scale x-axis from 10 to 1×10^7. Series: pthread condvar, futex broadcast, broadcast IPI. Annotations: HW lower bound, user-mode events, ~79000x.]

120

Single wakeup on x64 (CDF)

[Figure: CDF of cycles to wakeup; x-axis spans 0 to 40000 cycles. Series: U. mode condvar, Futex wakeup, Oneway IPI.]

121

Single wakeup on phi (CDF)

[Figure: CDF of cycles to wakeup; x-axis spans 0 to 30000 cycles. Series: U. mode condvar, Futex wakeup, Oneway IPI.]

122

Single wakeup on phi

[Figure: cycles to wakeup for five mechanisms; y-axis from 0 to 30000 cycles.]

pthread condvar: µ = 25722.7, min = 24333, max = 27834, σ = 618.407
futex wakeup: µ = 15637.4, min = 13311, max = 22343, σ = 1173.37
Aerokernel condvar: µ = 9014.73, min = 5528, max = 14074, σ = 2507.14
Aerokernel condvar + IPI: µ = 6483.89, min = 5953, max = 7305, σ = 213.517
unicast IPI: µ = 740.85, min = 732, max = 761, σ = 5.71205

123

Many wakeups on phi

[Figure: cycles to wakeup for five mechanisms; y-axis from 0 to 1.2×10^7 cycles.]

pthread condvar: µ = 5.64881e+06, min = 31743, max = 1.17807e+07, σ = 3.08505e+06
futex wakeup: µ = 2.25096e+06, min = 28545, max = 4.55009e+06, σ = 1.27551e+06
Aerokernel condvar: µ = 1.02075e+06, min = 7199, max = 7.25596e+06, σ = 1.1411e+06
Aerokernel condvar + IPI: µ = 558155, min = 8267, max = 3.93142e+06, σ = 428849
broadcast IPI: µ = 1153.68, min = 1016, max = 1225, σ = 17.1824

124

Wakeup deviation on phi

[Figure: CDF of σ; log-scale x-axis from 10 to 1×10^7. Series: pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, broadcast IPI.]

125

Wakeup deviation on phi

[Figure: CDF of σ; log-scale x-axis from 10 to 1×10^7. Series: pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, broadcast IPI. Annotations: HW lower bound, Nemo events (kernel-mode), user-mode events, 4x.]

126

Broadcast wakeups on phi

[Figure: cycles to wakeup for three mechanisms.]

pthread condvar: µ = 5.64881e+06, min = 31743, max = 1.17807e+07, σ = 3.08505e+06
futex wakeup: µ = 2.25096e+06, min = 28545, max = 4.55009e+06, σ = 1.27551e+06
broadcast IPI: µ = 1153.68, min = 1016, max = 1225, σ = 17.1824

127

Single wakeup on x64 (CDF)

[Figure: CDF of cycles to wakeup; x-axis spans 0 to 40000 cycles. Series: U. mode condvar, Futex wakeup, K. mode condvar, K. mode condvar + IPI, Oneway IPI.]

128

Single wakeup on phi (CDF)

[Figure: CDF of cycles to wakeup; x-axis spans 0 to 30000 cycles. Series: U. mode condvar, Futex wakeup, K. mode condvar, K. mode condvar + IPI, Oneway IPI.]

129

racket multiverse overheads

[Figure: runtime in seconds (axis from 0 to 90) per benchmark: fannkuch-redux, binary-tree-2, fasta, fasta-3, nbody, spectral-norm, mandelbrot-2. Series: Native, Virtual, Multiverse.]

130

system call overheads

[Figure: runtime in cycles (log scale, 100 to 1×10^8) per call: getpid, gettimeofday, fwrite, stat, read, getcwd, open, close, mmap. Series: Virtual, Multiverse. Annotation: ~40µs.]