NUMA Profiling for Dynamic Dataflow Applications

Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling

NUMA Profiling for Dynamic DataflowApplications

Manuel Selva – Lionel Morel – Kevin Marquet

CITI - INRIA SOCRATEUniversité de Lyon

September 29th, 2015

1 / 28


CMP are everywhere

Intel Nehalem - 4 cores -2009 Kalray MPPA - 256 cores - 2013

SamsungExynos - 2 x 4cores - 2012

On the headlines

• David P.: “The Trouble with Multicore”

• Herb S.: “Welcome to the Jungle”

• Ed L.: “The Problem with Threads”

• Timothy R.: “Mind the Gap...”

• David P.: “The Hail Mary of Programming”

2 / 28


But Programming them is Hard ...

3 / 28


Dataflow

Parser

Text.Y

Text.U

Text.V

Mot.Y

Mot.U

Mot.V

Merger Display

Text. YText. YText. Y

• Actors exchanging data only through FIFO channels• Different forms of parallelism

• Task• Pipeline• Data

4 / 28


Dataflow Applications Examples

Medical imageprocessing

[Albers2012]

Software Defined Radio[Dardaillon2014]

Video Decoding [Lucarz09]

5 / 28


The setting

6 / 28


The question

6 / 28

Do DF applications scale?If not, why?


Does it scale?

1 2 3 4 5 6 7 8 9 10 11 121

1.5

2

2.5

3

Different inputsHEVC decoding

200 frames33 Actors

Number of cores

Spe

edup

vssi

ngle

-cor

e

7 / 28


What are the reasons for that?

• Are the applications well written?• ⇒ blame the app designer.

• Are the runtimes well implemented?• ⇒ blame the runtime designer.

• Is the model of computation really the goodone?

• Programmer tricked into some idiosyncracies?• ⇒ blame the language designer.

8 / 28


Problem Statement

How to identify and understand performancebottlenecks in dataflow programs?

Contribution: CPU/memory profiling to analyse (and fix)bottlenecks on dataflow programs

9 / 28


Preliminary: Which Software?

RVC-Cal - [Yviquel13]

• Dynamic Dataflow• Dedicated to video codec applications• Many applications available (hevc, h264, gzip, zigbee)• Active community

10 / 28


Preliminary: Dataflow Execution Model

AB

CD

C; D;B;A;

Compiler

Core 1 Core 2Core 1 Core 2

RAM

Mapper

A; C; D; B;

11 / 28



AB

CD

C; D;B;A;

Compiler


RAM

Mapper

A; C; D; B;

11 / 28



AB

CD

C; D;B;A;

Compiler


RAM

Mapper

A; C; D; B;

⇒1 thread per core - actors scheduled within thread

11 / 28


Preliminary: Which Architecture?

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

• Commodity HW• NUMA• PMU• linux-supported

12 / 28


Goal: Identify and Understand PerformanceBottlenecks in Dataflow Programs

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

Correlatehw profilingto the DFgraph

13 / 28


CPU Profiling

14 / 28


Goal: Identify and Understand Cores Imbalance

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

Exec time→ actors

15 / 28


Cores Balance

1 2 3 4 5 6 7 8 9 10 11 120

20

40

60

80

100

Single actor:Inter pred.

200 framesInput: KimonoHEVC

2827272929

3134

3636

4354

100

Number of cores

Wor

kdi

strib

utio

nby

core

(%)

Core 1 Core 2 Core 3 Core 4 Core 5 Core 6


16 / 28


Cores Balance

1 2 3 4 5 6 7 8 9 10 11 120

20

40

60

80

100

Single actor:Inter pred.Single actor:Inter pred.


2827272929

3134

3636

4354

100

Number of cores

Wor

kdi

strib

utio

nby

core

(%)



16 / 28


Diagnosis?

The application is not parallel enough:• Split the Interframe Prediction actor! [Jerbi14]• Split other actors as well ...• Parallelize the sequential code inside actors?

17 / 28


Total Work Time is Increasing!

1 2 3 4 5 6 7 8 9 10 11 123

4

5

6

7

8

·1010

+49%

Input: Kimono

200 frames

HEVC

Number of cores

Tota

lWor

kTi

me

(cyc

les)

Total Work Time = Sum of cpu time for all cores usedQuestion: where does this overhead come from?

18 / 28

Memory Profiling


Goal: Identify and Understand Memory Usage

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

Mem. traffic→ FIFOs

19 / 28


NUMA - Performance Monitoring Unit

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

PMU PMU PMU PMU

PMU PMU

20 / 28


NUMA - Performance Monitoring Unit

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

PMU PMU PMU PMU

PMU PMU

Hardware profiling mechanisms

• Hard to program

20 / 28


A library for NUMA Profiling

Hardware

Writ

eas

sem

bler

Run

insu

perv

isor

PMU

LinuxKernel

perf_event_open()system call

Kernel module/dev/cpu/msr

Linux Perf PAPI numap Intel PCM

• Memory bandwidth profiling• Memory access sampling

21 / 28


Using numap for memory bandwidth usage

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

PMU PMU PMU PMU

PMU PMU

22 / 28


Using numap for memory bandwidth usage

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

PMU PMU PMU PMU

PMU PMU

DF applicationssaturate memorybandwidth?

22 / 28


Main Memory Bandwidth Usage

1 2 3 4 5 6 7 8 9 10 11 120

5

10

15

20

25

Write max bandwidth

Read max bandwidth

Input: Kimono

200 frames

HEVC

Number of cores

Ave

rage

Ban

dwid

th(G

B/s

)ReadWrite

DF applications saturate memory bandwidth? NO!23 / 28


Do you pay for too many distant accesses?

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

PMU PMU PMU PMU

PMU PMU

24 / 28


Do you pay for too many distant accesses?

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

PMU PMU PMU PMU

PMU PMU

@=0x7123CFF

Associatemem accessesto actorsand FIFOs

24 / 28


Communication Cost

1 2 3 4 5 6 7 8 9 1011120

20

40

60

80

100


Average MemoryLatency (cycles)21

.120

.019

.416

.016

.014

.512

.911

.410

.09.

58.

27.

9

18 18 25 25 21

14 17 26 32 2518 17

14 17 2211 16 16 1619 18 1789 90

79 7564 57 52 47 47 39 39 37

Number of cores

%of

acce

sses

L1 LFB L2 L3 RemoteCache LocalRAM RemoteRAM

25 / 28

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Intel X5650Westemere


Communication Cost

1 2 3 4 5 6 7 8 9 1011120

20

40

60

80

100


Average MemoryLatency (cycles)21

.120

.019

.416

.016

.014

.512

.911

.410

.09.

58.

27.

9 Average MemoryLatency (cycles)21

.120

.019

.416

.016

.014

.512

.911

.410

.09.

58.

27.

9

18 18 25 25 21

14 17 26 32 2518 17

14 17 2211 16 16 1619 18 1789 90

79 7564 57 52 47 47 39 39 37

Number of cores

%of

acce

sses

L1 LFB L2 L3 RemoteCache LocalRAM RemoteRAM

A small part of the accesses are responsible for a largeshare of the latency.

25 / 28


Where to Optimize?

High latency

Hig

hla

tenc

y

The profiler gives us:• High latency data exchanges at the dataflow level

We plan on using this for:• Feeding this information to the mapping heuristics

26 / 28


Conclusion

Proposition

• Main goal: Improve scalability of DF programs• How: Understand performance bottlenecks in DF programs• Approach: connect HW-level performance monitoring to

DF runtime

Contributions

• numap: memory profiling for NUMA architectures• Connection to the RVC-Cal runtime• Memory profiling of video decoders

27 / 28


Perspectives

Short-term

• Continue analysis of memory sampling results• Build more intelligent (re-)mapping decisions

Mid- and Long-term

• Compare ressource usage of DF-written decoders withtraditionnal thread-based implementations (eg ffmpeg)

• Integrate DF notions (ie data-dependencies) into OS kernel• Adapt runtime strategies to many-core architectures• Run and adapt multiple DF applications simultaneously

28 / 28

Bibliography I

I A H R Albers and P H N de With.Task complexity analysis and qos management for mappingdynamic video-processing tasks on a multi-core platform.Journal of Real-Time Image Processing, 7(3):185–202,2012.

I Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, JérômeMartin, and Henri-Pierre Charles.A compilation flow for parametric dataflow: Programmingmodel, scheduling, and application to heterogeneous mpsoc.

In Proceedings of the 2014 International Conference onCompilers, Architecture and Synthesis for EmbeddedSystems, CASES ’14, pages 8:1–8:10, New York, NY, USA,2014. ACM.

Bibliography II

I Mohammad Dashti, Alexandra Fedorova, Justin Funston,Fabien Gaud, Renaud Lachaize, Baptiste Lepers, VivienQuema, and Mark Roth.Traffic management: A holistic approach to memoryplacement on numa systems.In Proceedings of the Eighteenth International Conferenceon Architectural Support for Programming Languages andOperating Systems, ASPLOS ’13, pages 381–394, NewYork, NY, USA, 2013. ACM.

I Tudor David, Rachid Guerraoui, and Vasileios Trigonakis.Everything you always wanted to know aboutsynchronization but were afraid to ask.In Proceedings of the Twenty-Fourth ACM Symposium onOperating Systems Principles, SOSP ’13, pages 33–48, NewYork, NY, USA, 2013. ACM.

Bibliography III

I Khaled Jerbi, Daniele Renzi, Damien de Saint-Jorre, HervéYviquel, Mickaël Raulet, Claudio Alberti, and MarcoMattavelli.Development and optimization of high level dataflowprograms: the HEVC decoder design case.In 48th Asilomar Conference on Signals, Systems andComputers, Pacific Grove, United States, November 2014.

I I. Amer, C. Lucarz, G. Roquier, M. Mattavelli, M. Raulet, J.-F.Nezan, and O. Deforges.Reconfigurable video coding on multicore.Signal Processing Magazine, IEEE, 26(6):113 –123,november 2009.

Bibliography IV

I Daniel Molka, Daniel Hackenberg, Robert Schone, andMatthias S. Muller.Memory performance and cache coherency effects on anintel nehalem multiprocessor system.In Proceedings of the 2009 18th International Conference onParallel Architectures and Compilation Techniques, PACT’09, pages 261–270, Washington, DC, USA, 2009. IEEEComputer Society.

I Herve Yviquel, Antoine Lorence, Khaled Jerbi, GildasCocherel, Alexandre Sanchez, and Mickael Raulet.Orcc: Multimedia development made easy.In Proceedings of the 21st ACM International Conference onMultimedia, MM ’13, pages 863–866. ACM, 2013.

Communication Overhead On NUMA

Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

3 / 6


Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

Remote vs local latency+30%

[Molka2009, David2013]

3 / 6


Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

Cache coherencyprotocol

QPI overheadlat. * 4

[Molka2009]

3 / 6


Core 1L1L2

...Core 6

L1L2

L3

Mem. Ctrl QPI

Xeon X5650

Memory Bank 1

Core 7L1L2

...Core 12

L1L2

L3

Mem. CtrlQPI

Xeon X5650

Memory Bank 2

Coredomain

Uncoredomain

Memory controlers andQPI links contention

lat. * 5[Dashti2013]

3 / 6

Why build a “dataflow” profiler?

Why not use a “regular” profiler alone?

Because they are generally too low-level:• Distance to programmer’s thinking is too long• May know about threads, but not actors• Will not be aware of data dependencies between actors

4 / 6

Preliminary: Dataflow Actors Internals

A B C

Application graph

int[512] fifo_AB;int[512] fifo_BC;void action1() {int in = pop(fifo_AB);int out = in * 42 + 7;push(fifo_BC, tmp);

}

void action2(){...

}

C code generated for actor B

Work time(B) =∑a∈actions

cpu time (a)

5 / 6

Sample correlation

A B C

Application graph


}


fifo

stack

6 / 6

Sample correlation

A B C

Application graph


}


PMU Sample

PC = 0x7867BC@ = 0xEFC234A

latency = 50 cyclesfifo

stack

6 / 6

Sample correlation

A B C

Application graph


}


PMU Sample

PC = 0x7867BC@ = 0xEFC234A

latency = 50 cyclesfifo

stack

B:action1fifo_AB

latency = 50 cycles

Dataflow Sample

Correlation

6 / 6

Documents

NUMA Profiling for Dynamic Dataflow Applications