Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
NUMA Profiling for Dynamic DataflowApplications
Manuel Selva – Lionel Morel – Kevin Marquet
CITI - INRIA SOCRATEUniversité de Lyon
September 29th, 2015
1 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
CMP are everywhere
Intel Nehalem - 4 cores -2009 Kalray MPPA - 256 cores - 2013
SamsungExynos - 2 x 4cores - 2012
On the headlines
• David P.: “The Trouble with Multicore”
• Herb S.: “Welcome to the Jungle”
• Ed L.: “The Problem with Threads”
• Timothy R.: “Mind the Gap...”
• David P.: “The Hail Mary of Programming”
2 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
But Programming them is Hard ...
3 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Dataflow
Parser
Text.Y
Text.U
Text.V
Mot.Y
Mot.U
Mot.V
Merger Display
Text. YText. YText. Y
• Actors exchanging data only through FIFO channels• Different forms of parallelism
• Task• Pipeline• Data
4 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Dataflow Applications Examples
Medical imageprocessing
[Albers2012]
Software Defined Radio[Dardaillon2014]
Video Decoding [Lucarz09]
5 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
The setting
6 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
The question
6 / 28
Do DF applications scale?If not, why?
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Does it scale?
1 2 3 4 5 6 7 8 9 10 11 121
1.5
2
2.5
3
Different inputsHEVC decoding
200 frames33 Actors
Number of cores
Spe
edup
vssi
ngle
-cor
e
7 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
What are the reasons for that?
• Are the applications well written?• ⇒ blame the app designer.
• Are the runtimes well implemented?• ⇒ blame the runtime designer.
• Is the model of computation really the goodone?
• Programmer tricked into some idiosyncracies?• ⇒ blame the language designer.
8 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Problem Statement
How to identify and understand performancebottlenecks in dataflow programs?
Contribution: CPU/memory profiling to analyse (and fix)bottlenecks on dataflow programs
9 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Preliminary: Which Software?
RVC-Cal - [Yviquel13]
• Dynamic Dataflow• Dedicated to video codec applications• Many applications available (hevc, h264, gzip, zigbee)• Active community
10 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Preliminary: Dataflow Execution Model
AB
CD
C; D;B;A;
Compiler
Core 1 Core 2Core 1 Core 2
RAM
Mapper
A; C; D; B;
11 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Preliminary: Dataflow Execution Model
AB
CD
C; D;B;A;
Compiler
Core 1 Core 2Core 1 Core 2
RAM
Mapper
A; C; D; B;
11 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Preliminary: Dataflow Execution Model
AB
CD
C; D;B;A;
Compiler
Core 1 Core 2Core 1 Core 2
RAM
Mapper
A; C; D; B;
⇒1 thread per core - actors scheduled within thread
11 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Preliminary: Which Architecture?
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
• Commodity HW• NUMA• PMU• linux-supported
12 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Goal: Identify and Understand PerformanceBottlenecks in Dataflow Programs
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
Correlatehw profilingto the DFgraph
13 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
CPU Profiling
14 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Goal: Identify and Understand Cores Imbalance
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
Exec time→ actors
15 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Cores Balance
1 2 3 4 5 6 7 8 9 10 11 120
20
40
60
80
100
Single actor:Inter pred.
200 framesInput: KimonoHEVC
2827272929
3134
3636
4354
100
Number of cores
Wor
kdi
strib
utio
nby
core
(%)
Core 1 Core 2 Core 3 Core 4 Core 5 Core 6
Core 7 Core 8 Core 9 Core 10 Core 11 Core 12
16 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Cores Balance
1 2 3 4 5 6 7 8 9 10 11 120
20
40
60
80
100
Single actor:Inter pred.Single actor:Inter pred.
200 framesInput: KimonoHEVC
2827272929
3134
3636
4354
100
Number of cores
Wor
kdi
strib
utio
nby
core
(%)
Core 1 Core 2 Core 3 Core 4 Core 5 Core 6
Core 7 Core 8 Core 9 Core 10 Core 11 Core 12
16 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Diagnosis?
The application is not parallel enough:• Split the Interframe Prediction actor! [Jerbi14]• Split other actors as well ...• Parallelize the sequential code inside actors?
17 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Total Work Time is Increasing!
1 2 3 4 5 6 7 8 9 10 11 123
4
5
6
7
8
·1010
+49%
Input: Kimono
200 frames
HEVC
Number of cores
Tota
lWor
kTi
me
(cyc
les)
Total Work Time = Sum of cpu time for all cores usedQuestion: where does this overhead come from?
18 / 28
Memory Profiling
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Goal: Identify and Understand Memory Usage
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
Mem. traffic→ FIFOs
19 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
NUMA - Performance Monitoring Unit
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
PMU PMU PMU PMU
PMU PMU
20 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
NUMA - Performance Monitoring Unit
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
PMU PMU PMU PMU
PMU PMU
Hardware profiling mechanisms
• Hard to program
20 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
A library for NUMA Profiling
Hardware
Writ
eas
sem
bler
Run
insu
perv
isor
PMU
LinuxKernel
perf_event_open()system call
Kernel module/dev/cpu/msr
Linux Perf PAPI numap Intel PCM
• Memory bandwidth profiling• Memory access sampling
21 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Using numap for memory bandwidth usage
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
PMU PMU PMU PMU
PMU PMU
22 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Using numap for memory bandwidth usage
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
PMU PMU PMU PMU
PMU PMU
DF applicationssaturate memorybandwidth?
22 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Main Memory Bandwidth Usage
1 2 3 4 5 6 7 8 9 10 11 120
5
10
15
20
25
Write max bandwidth
Read max bandwidth
Input: Kimono
200 frames
HEVC
Number of cores
Ave
rage
Ban
dwid
th(G
B/s
)ReadWrite
DF applications saturate memory bandwidth? NO!23 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Do you pay for too many distant accesses?
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
PMU PMU PMU PMU
PMU PMU
24 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Do you pay for too many distant accesses?
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
PMU PMU PMU PMU
PMU PMU
@=0x7123CFF
Associatemem accessesto actorsand FIFOs
24 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Communication Cost
1 2 3 4 5 6 7 8 9 1011120
20
40
60
80
100
200 framesInput: KimonoHEVC
Average MemoryLatency (cycles)21
.120
.019
.416
.016
.014
.512
.911
.410
.09.
58.
27.
9
18 18 25 25 21
14 17 26 32 2518 17
14 17 2211 16 16 1619 18 1789 90
79 7564 57 52 47 47 39 39 37
Number of cores
%of
acce
sses
L1 LFB L2 L3 RemoteCache LocalRAM RemoteRAM
25 / 28
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Intel X5650Westemere
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Communication Cost
1 2 3 4 5 6 7 8 9 1011120
20
40
60
80
100
200 framesInput: KimonoHEVC
Average MemoryLatency (cycles)21
.120
.019
.416
.016
.014
.512
.911
.410
.09.
58.
27.
9 Average MemoryLatency (cycles)21
.120
.019
.416
.016
.014
.512
.911
.410
.09.
58.
27.
9
18 18 25 25 21
14 17 26 32 2518 17
14 17 2211 16 16 1619 18 1789 90
79 7564 57 52 47 47 39 39 37
Number of cores
%of
acce
sses
L1 LFB L2 L3 RemoteCache LocalRAM RemoteRAM
A small part of the accesses are responsible for a largeshare of the latency.
25 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Where to Optimize?
High latency
Hig
hla
tenc
y
The profiler gives us:• High latency data exchanges at the dataflow level
We plan on using this for:• Feeding this information to the mapping heuristics
26 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Conclusion
Proposition
• Main goal: Improve scalability of DF programs• How: Understand performance bottlenecks in DF programs• Approach: connect HW-level performance monitoring to
DF runtime
Contributions
• numap: memory profiling for NUMA architectures• Connection to the RVC-Cal runtime• Memory profiling of video decoders
27 / 28
Introduction Motivation Profiling DF Programs Cpu Profiling Memory Profiling
Perspectives
Short-term
• Continue analysis of memory sampling results• Build more intelligent (re-)mapping decisions
Mid- and Long-term
• Compare ressource usage of DF-written decoders withtraditionnal thread-based implementations (eg ffmpeg)
• Integrate DF notions (ie data-dependencies) into OS kernel• Adapt runtime strategies to many-core architectures• Run and adapt multiple DF applications simultaneously
28 / 28
Bibliography I
I A H R Albers and P H N de With.Task complexity analysis and qos management for mappingdynamic video-processing tasks on a multi-core platform.Journal of Real-Time Image Processing, 7(3):185–202,2012.
I Mickaël Dardaillon, Kevin Marquet, Tanguy Risset, JérômeMartin, and Henri-Pierre Charles.A compilation flow for parametric dataflow: Programmingmodel, scheduling, and application to heterogeneous mpsoc.
In Proceedings of the 2014 International Conference onCompilers, Architecture and Synthesis for EmbeddedSystems, CASES ’14, pages 8:1–8:10, New York, NY, USA,2014. ACM.
Bibliography II
I Mohammad Dashti, Alexandra Fedorova, Justin Funston,Fabien Gaud, Renaud Lachaize, Baptiste Lepers, VivienQuema, and Mark Roth.Traffic management: A holistic approach to memoryplacement on numa systems.In Proceedings of the Eighteenth International Conferenceon Architectural Support for Programming Languages andOperating Systems, ASPLOS ’13, pages 381–394, NewYork, NY, USA, 2013. ACM.
I Tudor David, Rachid Guerraoui, and Vasileios Trigonakis.Everything you always wanted to know aboutsynchronization but were afraid to ask.In Proceedings of the Twenty-Fourth ACM Symposium onOperating Systems Principles, SOSP ’13, pages 33–48, NewYork, NY, USA, 2013. ACM.
Bibliography III
I Khaled Jerbi, Daniele Renzi, Damien de Saint-Jorre, HervéYviquel, Mickaël Raulet, Claudio Alberti, and MarcoMattavelli.Development and optimization of high level dataflowprograms: the HEVC decoder design case.In 48th Asilomar Conference on Signals, Systems andComputers, Pacific Grove, United States, November 2014.
I I. Amer, C. Lucarz, G. Roquier, M. Mattavelli, M. Raulet, J.-F.Nezan, and O. Deforges.Reconfigurable video coding on multicore.Signal Processing Magazine, IEEE, 26(6):113 –123,november 2009.
Bibliography IV
I Daniel Molka, Daniel Hackenberg, Robert Schone, andMatthias S. Muller.Memory performance and cache coherency effects on anintel nehalem multiprocessor system.In Proceedings of the 2009 18th International Conference onParallel Architectures and Compilation Techniques, PACT’09, pages 261–270, Washington, DC, USA, 2009. IEEEComputer Society.
I Herve Yviquel, Antoine Lorence, Khaled Jerbi, GildasCocherel, Alexandre Sanchez, and Mickael Raulet.Orcc: Multimedia development made easy.In Proceedings of the 21st ACM International Conference onMultimedia, MM ’13, pages 863–866. ACM, 2013.
Communication Overhead On NUMA
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
3 / 6
Communication Overhead On NUMA
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
Remote vs local latency+30%
[Molka2009, David2013]
3 / 6
Communication Overhead On NUMA
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
Cache coherencyprotocol
QPI overheadlat. * 4
[Molka2009]
3 / 6
Communication Overhead On NUMA
Core 1L1L2
...Core 6
L1L2
L3
Mem. Ctrl QPI
Xeon X5650
Memory Bank 1
Core 7L1L2
...Core 12
L1L2
L3
Mem. CtrlQPI
Xeon X5650
Memory Bank 2
Coredomain
Uncoredomain
Memory controlers andQPI links contention
lat. * 5[Dashti2013]
3 / 6
Why build a “dataflow” profiler?
Why not use a “regular” profiler alone?
Because they are generally too low-level:• Distance to programmer’s thinking is too long• May know about threads, but not actors• Will not be aware of data dependencies between actors
4 / 6
Preliminary: Dataflow Actors Internals
A B C
Application graph
int[512] fifo_AB;int[512] fifo_BC;void action1() {int in = pop(fifo_AB);int out = in * 42 + 7;push(fifo_BC, tmp);
}
void action2(){...
}
C code generated for actor B
Work time(B) =∑a∈actions
cpu time (a)
5 / 6
Sample correlation
A B C
Application graph
int[512] fifo_AB;int[512] fifo_BC;void action1() {int in = pop(fifo_AB);int out = in * 42 + 7;push(fifo_BC, tmp);
}
C code generated for actor B
fifo
stack
6 / 6
Sample correlation
A B C
Application graph
int[512] fifo_AB;int[512] fifo_BC;void action1() {int in = pop(fifo_AB);int out = in * 42 + 7;push(fifo_BC, tmp);
}
C code generated for actor B
PMU Sample
PC = 0x7867BC@ = 0xEFC234A
latency = 50 cyclesfifo
stack
6 / 6
Sample correlation
A B C
Application graph
int[512] fifo_AB;int[512] fifo_BC;void action1() {int in = pop(fifo_AB);int out = in * 42 + 7;push(fifo_BC, tmp);
}
C code generated for actor B
PMU Sample
PC = 0x7867BC@ = 0xEFC234A
latency = 50 cyclesfifo
stack
B:action1fifo_AB
latency = 50 cycles
Dataflow Sample
Correlation
6 / 6