NUMA Profiling for Dynamic Dataflow Applications

Manuel Selva, Lionel Morel, Kevin Marquet
CITI - INRIA SOCRATE, Université de Lyon
September 29th, 2015

Outline: Introduction, Motivation, Profiling DF Programs, CPU Profiling, Memory Profiling



CMPs (chip multiprocessors) are everywhere

• Intel Nehalem - 4 cores - 2009
• Samsung Exynos - 2 x 4 cores - 2012
• Kalray MPPA - 256 cores - 2013

In the headlines

• David P.: “The Trouble with Multicore”

• Herb S.: “Welcome to the Jungle”

• Ed L.: “The Problem with Threads”

• Timothy R.: “Mind the Gap...”

• David P.: “The Hail Mary of Programming”


But Programming them is Hard ...


Dataflow

[Figure: example dataflow graph. Actors: Parser; Text.Y, Text.U, Text.V; Mot.Y, Mot.U, Mot.V; Merger; Display. Text.Y is drawn as three parallel instances.]

• Actors exchange data only through FIFO channels
• Different forms of parallelism
  • Task
  • Pipeline
  • Data
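
As an illustration of "actors exchange data only through FIFO channels", here is a minimal sketch of a bounded FIFO in C. The type `fifo_t`, the functions `fifo_init`, `fifo_push`, `fifo_pop`, `fifo_count` and the capacity of 512 tokens are hypothetical names chosen for this example; they are not taken from the RVC-Cal runtime. The sketch is single-threaded: a channel shared between two cores would additionally need atomic or fenced index updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_CAPACITY 512 /* hypothetical size; a power of two so index wrap-around stays consistent */

/* Hypothetical bounded channel between one producer actor and one consumer actor. */
typedef struct {
    int    tokens[FIFO_CAPACITY];
    size_t head; /* total number of tokens popped so far */
    size_t tail; /* total number of tokens pushed so far */
} fifo_t;

static void fifo_init(fifo_t *f) {
    f->head = 0;
    f->tail = 0;
}

/* Number of tokens currently waiting in the channel. */
static size_t fifo_count(const fifo_t *f) {
    return f->tail - f->head;
}

/* Appends one token; returns false when the channel is full. */
static bool fifo_push(fifo_t *f, int token) {
    if (fifo_count(f) == FIFO_CAPACITY) return false;
    f->tokens[f->tail % FIFO_CAPACITY] = token;
    f->tail++;
    return true;
}

/* Removes the oldest token; returns false when the channel is empty. */
static bool fifo_pop(fifo_t *f, int *token) {
    assert(token != NULL);
    if (fifo_count(f) == 0) return false;
    *token = f->tokens[f->head % FIFO_CAPACITY];
    f->head++;
    return true;
}
```

An actor can fire only when its input FIFOs hold enough tokens and its output FIFOs have room, which is what the two boolean return values model.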


Dataflow Applications Examples

• Medical image processing [Albers2012]
• Software Defined Radio [Dardaillon2014]
• Video Decoding [Lucarz09]


The setting


The question

Do DF applications scale? If not, why?


Does it scale?

[Figure: speedup vs. single-core execution (y-axis, 1 to 3) against number of cores (x-axis, 1 to 12), one curve per input. HEVC decoding, 200 frames, 33 actors.]


What are the reasons for that?

• Are the applications well written?
  • ⇒ blame the app designer.
• Are the runtimes well implemented?
  • ⇒ blame the runtime designer.
• Is the model of computation really the right one?
  • Programmer tricked into some idiosyncrasies?
  • ⇒ blame the language designer.


Problem Statement

How to identify and understand performance bottlenecks in dataflow programs?

Contribution: CPU/memory profiling to analyse (and fix) bottlenecks in dataflow programs


Preliminary: Which Software?

RVC-Cal - [Yviquel13]

• Dynamic Dataflow
• Dedicated to video codec applications
• Many applications available (hevc, h264, gzip, zigbee)
• Active community


Preliminary: Dataflow Execution Model

[Figure: dataflow execution model. A compiler translates the actor graph (A, B, C, D) into code, and a mapper assigns the actors to Core 1 and Core 2, which share the RAM.]


⇒ 1 thread per core; actors are scheduled within each thread
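
The "one thread per core, actors scheduled within the thread" model can be sketched as a round-robin loop over the actors mapped to one core. This is only an illustration under stated assumptions: `actor_t`, `run_partition` and `countdown_fire` are hypothetical names, thread creation and core pinning are left out, and the stop condition (a full pass in which no actor fires) is only valid for a single partition, since in a real multi-core runtime tokens may still arrive from actors on other cores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* An actor is modeled as a function that fires once if it can,
 * and reports whether it did. */
typedef bool (*actor_fire_fn)(void *state);

typedef struct {
    const char   *name;
    actor_fire_fn fire;
    void         *state;
} actor_t;

/* Runs the actors mapped to one core in round-robin order until a full
 * pass fires none of them. Returns the total number of firings. */
static size_t run_partition(actor_t *actors, size_t n_actors) {
    assert(actors != NULL || n_actors == 0);
    size_t firings = 0;
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < n_actors; i++) {
            if (actors[i].fire(actors[i].state)) {
                firings++;
                progress = true;
            }
        }
    }
    return firings;
}

/* Toy actor for the example: fires while its token budget is positive. */
static bool countdown_fire(void *state) {
    int *remaining = state;
    if (*remaining <= 0) return false;
    (*remaining)--;
    return true;
}
```

With this structure, load imbalance shows up directly: a core whose partition contains one very expensive actor keeps its thread busy while the other threads run out of fireable actors.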


Preliminary: Which Architecture?

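[Figure: two-socket NUMA machine. Each Xeon X5650 socket has six cores (Core 1 to Core 6, Core 7 to Core 12) with per-core L1 and L2 caches (core domain), plus an L3 cache, a memory controller, and a QPI link (uncore domain). Each socket is attached to its own memory bank (Memory Bank 1, Memory Bank 2).]

• Commodity HW
• NUMA
• PMU
• Linux-supported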


Goal: Identify and Understand Performance Bottlenecks in Dataflow Programs

[Figure: the two-socket Xeon X5650 NUMA diagram, annotated: correlate HW profiling to the DF graph.]


CPU Profiling


Goal: Identify and Understand Cores Imbalance

[Figure: the two-socket Xeon X5650 NUMA diagram, annotated: exec time → actors.]


Cores Balance

[Figure: work distribution by core (%, 0 to 100) against number of cores (1 to 12), one stacked bar per configuration with one segment per core (Core 1 to Core 12). HEVC, input: Kimono, 200 frames. Bar labels as extracted: 28, 27, 27, 29, 29, 31, 34, 36, 36, 43, 54, 100. Annotation: single actor: Inter pred.]


Diagnosis?

The application is not parallel enough:
• Split the Interframe Prediction actor! [Jerbi14]
• Split other actors as well ...
• Parallelize the sequential code inside actors?


Total Work Time is Increasing!

[Figure: total work time (cycles, y-axis from 3·10^10 to 8·10^10) against number of cores (1 to 12). HEVC, input: Kimono, 200 frames. Annotation: +49%.]

Total Work Time = sum of CPU time over all cores used
Question: where does this overhead come from?
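
The definition above can be written down as a short sketch: sum the per-core CPU time, then compare a multi-core run against the single-core run. The function names `total_work_time` and `overhead_percent` are hypothetical, and the cycle counts used in the usage note are invented for illustration; they are not the measurements behind the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Total work time = sum of the CPU time (here in cycles) of every core used. */
static uint64_t total_work_time(const uint64_t *core_cycles, size_t n_cores) {
    assert(core_cycles != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n_cores; i++) {
        total += core_cycles[i];
    }
    return total;
}

/* Extra work of a multi-core run relative to the single-core run, in percent. */
static double overhead_percent(uint64_t single_core_total, uint64_t multi_core_total) {
    assert(single_core_total > 0);
    return 100.0 * ((double)multi_core_total - (double)single_core_total)
                 / (double)single_core_total;
}
```

For example, a single-core run of 4000 cycles and a four-core run of 1500 + 1500 + 1480 + 1480 = 5960 cycles give an overhead of 49%: the program finishes sooner, but the machine as a whole does more work.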


Memory Profiling


Goal: Identify and Understand Memory Usage

[Figure: the two-socket Xeon X5650 NUMA diagram, annotated: mem. traffic → FIFOs.]


NUMA - Performance Monitoring Unit

[Figure: the two-socket Xeon X5650 NUMA diagram with its PMUs highlighted: a PMU on each core shown (core domain) and a PMU in each uncore domain.]


Hardware profiling mechanisms

• Hard to program


A library for NUMA Profiling

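[Figure: software stack for accessing the PMU. Hardware: the PMU, which requires writing assembler and running in supervisor mode. Linux kernel: the perf_event_open() system call and the /dev/cpu/msr kernel module. User level: Linux Perf, PAPI, numap, Intel PCM.]

• Memory bandwidth profiling
• Memory access sampling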
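
To give an idea of what the kernel interface named above looks like, here is a minimal sketch that counts user-space CPU cycles around a function call with the Linux `perf_event_open()` system call. It is not numap's API, which the slides do not show. The helper names `perf_open`, `count_cycles` and `busy_loop` are hypothetical, and the sketch uses the generic cycle counter rather than the model-specific memory events needed for bandwidth profiling or access sampling.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so go through syscall(2). */
static int perf_open(struct perf_event_attr *attr, pid_t pid, int cpu) {
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}

/* Counts user-space CPU cycles spent in work(arg) on the calling thread.
 * Returns 0 and stores the count on success, -1 if the counter cannot be
 * opened or read (for example when perf_event_paranoid forbids it, or
 * inside a restricted container). */
static int count_cycles(void (*work)(void *), void *arg, uint64_t *cycles) {
    assert(work != NULL && cycles != NULL);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;       /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1; /* count user-space cycles only */
    attr.exclude_hv = 1;

    int fd = perf_open(&attr, 0 /* calling thread */, -1 /* any CPU */);
    if (fd < 0) return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    ssize_t n = read(fd, cycles, sizeof *cycles);
    close(fd);
    return n == (ssize_t)sizeof *cycles ? 0 : -1;
}

/* Toy workload for the example: a short arithmetic loop. */
static void busy_loop(void *arg) {
    unsigned long n = *(unsigned long *)arg;
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < n; i++) sink += i;
}
```

Even this simplest case needs an attribute structure, a raw system call and three ioctls, which is the "hard to program" point: a library can hide these steps behind start/stop calls.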


Using numap for memory bandwidth usage

[Figure: the two-socket Xeon X5650 NUMA diagram with its PMUs highlighted.]


DF applications saturate memory bandwidth?


Main Memory Bandwidth Usage

[Figure: average bandwidth (GB/s, 0 to 25) for reads and writes against number of cores (1 to 12), with horizontal lines marking the read max bandwidth and the write max bandwidth. HEVC, input: Kimono, 200 frames.]

DF applications saturate memory bandwidth? NO!
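
A bandwidth figure like the one plotted is typically derived from a counter of memory transfers and the elapsed time. The sketch below shows that arithmetic under an explicit assumption: each counted event moves one 64-byte cache line. The function names `bandwidth_gb_per_s` and `utilization_pct` are hypothetical and the numbers in the usage note are invented.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumption: one counted event transfers one 64-byte line */

/* Average bandwidth in GB/s from a transfer count and the elapsed time. */
static double bandwidth_gb_per_s(uint64_t line_transfers, double seconds) {
    assert(seconds > 0.0);
    return (double)line_transfers * CACHE_LINE_BYTES / 1e9 / seconds;
}

/* Share of the maximum bandwidth actually used, in percent. */
static double utilization_pct(double measured_gb_per_s, double max_gb_per_s) {
    assert(max_gb_per_s > 0.0);
    return 100.0 * measured_gb_per_s / max_gb_per_s;
}
```

For instance, 156,250,000 counted line reads over 2 seconds correspond to 5 GB/s; against a hypothetical 20 GB/s read maximum that is 25% utilization, far from saturation.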


Do you pay for too many distant accesses?

[Figure: the two-socket Xeon X5650 NUMA diagram with its PMUs highlighted.]


Sampled address: @ = 0x7123CFF

Associate memory accesses to actors and FIFOs


Communication Cost

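[Figure: share of memory accesses (%, 0 to 100) served by each level (L1, LFB, L2, L3, RemoteCache, LocalRAM, RemoteRAM) against number of cores (1 to 12), with the average memory latency (cycles) printed for each bar. HEVC, input: Kimono, 200 frames. Average memory latency labels as extracted: 21.1, 20.0, 19.4, 16.0, 16.0, 14.5, 12.9, 11.4, 10.0, 9.5, 8.2, 7.9. Largest segment labels as extracted: 89, 90, 79, 75, 64, 57, 52, 47, 47, 39, 39, 37. Other segment labels range from 11 to 32.]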

[Figure: the two-socket NUMA diagram of the Intel X5650 (Westmere).]


A small part of the accesses is responsible for a large share of the latency.
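
The statement above follows from simple weighting: each memory level contributes its access share times its latency. The sketch below computes the average latency and the share of it caused by one level. The struct `level_stat_t`, the function names, and the numbers in the usage note are hypothetical; they are not the measured HEVC values.

```c
#include <assert.h>
#include <stddef.h>

/* Share of accesses served by one memory level and the latency of that level. */
typedef struct {
    const char *level;
    double      access_pct;     /* % of sampled accesses served by this level */
    double      latency_cycles; /* average latency of one such access */
} level_stat_t;

/* Average memory latency = sum over levels of (access share x latency). */
static double average_latency(const level_stat_t *levels, size_t n) {
    assert(levels != NULL);
    double weighted = 0.0;
    for (size_t i = 0; i < n; i++) {
        weighted += levels[i].access_pct * levels[i].latency_cycles;
    }
    return weighted / 100.0;
}

/* Percentage of the total latency that is caused by level i. */
static double latency_share_pct(const level_stat_t *levels, size_t n, size_t i) {
    assert(i < n);
    double avg = average_latency(levels, n);
    assert(avg > 0.0);
    return levels[i].access_pct * levels[i].latency_cycles / avg;
}
```

With invented values of 90% of accesses in L1 at 4 cycles, 8% in L3 at 40 cycles and 2% in remote RAM at 300 cycles, the average latency is 12.8 cycles, and the 2% of remote accesses account for about 47% of it.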


Where to Optimize?

[Figure: dataflow graph with two edges marked "High latency".]

The profiler gives us:
• High-latency data exchanges at the dataflow level

We plan on using this for:
• Feeding this information to the mapping heuristics


Conclusion

Proposition

• Main goal: improve scalability of DF programs
• How: understand performance bottlenecks in DF programs
• Approach: connect HW-level performance monitoring to the DF runtime

Contributions

• numap: memory profiling for NUMA architectures
• Connection to the RVC-Cal runtime
• Memory profiling of video decoders


Perspectives

Short-term

• Continue analysis of memory sampling results
• Build more intelligent (re-)mapping decisions

Mid- and Long-term

• Compare resource usage of DF-written decoders with traditional thread-based implementations (e.g. ffmpeg)
• Integrate DF notions (i.e. data dependencies) into the OS kernel
• Adapt runtime strategies to many-core architectures
• Run and adapt multiple DF applications simultaneously


Bibliography

[Albers2012] A. H. R. Albers and P. H. N. de With. Task complexity analysis and QoS management for mapping dynamic video-processing tasks on a multi-core platform. Journal of Real-Time Image Processing, 7(3):185–202, 2012.

[Dardaillon2014] Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, Jérôme Martin, and Henri-Pierre Charles. A compilation flow for parametric dataflow: Programming model, scheduling, and application to heterogeneous MPSoC. In Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES ’14, pages 8:1–8:10, New York, NY, USA, 2014. ACM.

[Dashti2013] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quéma, and Mark Roth. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’13, pages 381–394, New York, NY, USA, 2013. ACM.

[David2013] Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 33–48, New York, NY, USA, 2013. ACM.

[Jerbi14] Khaled Jerbi, Daniele Renzi, Damien de Saint-Jorre, Hervé Yviquel, Mickaël Raulet, Claudio Alberti, and Marco Mattavelli. Development and optimization of high level dataflow programs: the HEVC decoder design case. In 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, United States, November 2014.

[Lucarz09] I. Amer, C. Lucarz, G. Roquier, M. Mattavelli, M. Raulet, J.-F. Nezan, and O. Deforges. Reconfigurable video coding on multicore. IEEE Signal Processing Magazine, 26(6):113–123, November 2009.

[Molka2009] Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, PACT ’09, pages 261–270, Washington, DC, USA, 2009. IEEE Computer Society.

[Yviquel13] Hervé Yviquel, Antoine Lorence, Khaled Jerbi, Gildas Cocherel, Alexandre Sanchez, and Mickaël Raulet. Orcc: Multimedia development made easy. In Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, pages 863–866. ACM, 2013.


Communication Overhead On NUMA

[Figure: the two-socket Xeon X5650 NUMA diagram (core and uncore domains, one memory bank per socket, sockets linked by QPI).]


• Remote vs local latency: +30% [Molka2009, David2013]


• Cache coherency protocol, QPI overhead: latency × 4 [Molka2009]


• Memory controllers and QPI links contention: latency × 5 [Dashti2013]


Why build a “dataflow” profiler?

Why not use a “regular” profiler alone?

Because they are generally too low-level:
• Too far from the programmer’s way of thinking
• May know about threads, but not actors
• Will not be aware of data dependencies between actors


Preliminary: Dataflow Actors Internals

[Figure: application graph A → B → C]

C code generated for actor B:

    int fifo_AB[512];   /* input FIFO of B, filled by A */
    int fifo_BC[512];   /* output FIFO of B, drained by C */

    void action1() {
        int in = pop(fifo_AB);
        int out = in * 42 + 7;
        push(fifo_BC, out);
    }

    void action2() {
        ...
    }

$$\text{Work time}(B) = \sum_{a \in \text{actions}} \text{cpu time}(a)$$
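
The sum above can be accumulated at run time by timing every action firing and adding the result to the owning actor. The sketch below is an assumption-laden illustration: it measures thread CPU time in nanoseconds with POSIX `clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...)` instead of the cycle counts used in the slides, and `actor_profile_t`, `thread_cpu_ns`, `fire_profiled` and the toy `action1` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Accumulated profile of one actor. */
typedef struct {
    const char *name;
    uint64_t    work_ns; /* sum of the CPU time of all its action firings */
    uint64_t    firings; /* number of action firings */
} actor_profile_t;

/* CPU time consumed so far by the calling thread, in nanoseconds. */
static uint64_t thread_cpu_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Fires one action and adds its CPU time to the actor's work time. */
static void fire_profiled(actor_profile_t *p, void (*action)(void)) {
    assert(p != NULL && action != NULL);
    uint64_t start = thread_cpu_ns();
    action();
    p->work_ns += thread_cpu_ns() - start;
    p->firings++;
}

/* Toy action for the example: a short arithmetic loop. */
static void action1(void) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000; i++) sink += i;
}
```

Because the scheduler runs inside one thread per core, wrapping each firing this way attributes time to actors, which a thread-level profiler cannot do.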


Sample correlation

[Figure: application graph A → B → C]

C code generated for actor B:

    int fifo_AB[512];
    int fifo_BC[512];

    void action1() {
        int in = pop(fifo_AB);
        int out = in * 42 + 7;
        push(fifo_BC, out);
    }

Memory regions shown: fifo, stack


PMU Sample
• PC = 0x7867BC
• @ = 0xEFC234A
• latency = 50 cycles

Correlation (PC against the generated code, @ against the fifo and stack regions) ⇒

Dataflow Sample
• B:action1
• fifo_AB
• latency = 50 cycles
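
The correlation step can be sketched as two range lookups: the sampled program counter is matched against the address ranges of the generated actions, and the sampled data address against the address ranges of the FIFOs. The types `region_t`, `pmu_sample_t`, `df_sample_t` and the functions `find_region` and `correlate` are hypothetical, and the address ranges in the usage note are invented so that the sample values from the slide fall inside them; the slides do not say how the actual implementation stores these ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A named address range: the code of one action, or the buffer of one FIFO. */
typedef struct {
    const char *name;
    uintptr_t   start;
    size_t      size;
} region_t;

/* Raw sample as delivered by the PMU. */
typedef struct {
    uintptr_t pc;             /* address of the sampled instruction */
    uintptr_t addr;           /* data address that was accessed */
    unsigned  latency_cycles; /* latency of the access */
} pmu_sample_t;

/* The same sample expressed in dataflow terms. */
typedef struct {
    const char *action;
    const char *object;
    unsigned    latency_cycles;
} df_sample_t;

/* Returns the name of the region containing addr, or "unknown". */
static const char *find_region(const region_t *regions, size_t n, uintptr_t addr) {
    for (size_t i = 0; i < n; i++) {
        if (addr >= regions[i].start && addr - regions[i].start < regions[i].size) {
            return regions[i].name;
        }
    }
    return "unknown";
}

/* Maps a PMU sample to the action that issued it and the FIFO it touched. */
static df_sample_t correlate(pmu_sample_t s,
                             const region_t *code, size_t n_code,
                             const region_t *data, size_t n_data) {
    assert(code != NULL && data != NULL);
    df_sample_t out = {
        find_region(code, n_code, s.pc),
        find_region(data, n_data, s.addr),
        s.latency_cycles
    };
    return out;
}
```

With a hypothetical code range of 0x786700 to 0x7867FF for B:action1 and a buffer range of 0xEFC2000 to 0xEFC27FF for fifo_AB, the sample (PC = 0x7867BC, @ = 0xEFC234A, latency = 50 cycles) becomes the dataflow sample (B:action1, fifo_AB, 50 cycles).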