Improving GStreamer performance on large pipelines: from profiling to optimization
GStreamer Conference 2015
8-9 October 2015, Dublin, Ireland
Miguel París
[email protected]
Who I am
Miguel París
● Software Engineer
● Telematic Systems Master's
● Researcher at Universidad Rey Juan Carlos (Madrid, Spain)
● Kurento real-time manager
● Twitter: @mparisdiaz
Overview
GStreamer is quite good for developing multimedia apps, tools, etc. in an easy way.
It could be more efficient.
The first step: measuring / profiling
● Main principle: “you cannot improve what you cannot measure”
Detecting bottlenecks
Measuring the gain of the possible solutions
Comparing different solutions
● In large pipelines, a “small” performance improvement can make a “big” difference
The same holds when running a lot of pipelines on the same machine
Profiling levels
● Different levels of detail: the more detailed, the more overhead (typically)
● High level
Threads num: ps -o nlwp <pid>
CPU: top, perf stat -p <pid>
● Medium level
time-profiling: how much time is spent in each GstElement (using GstTracer)
● Easy way to determine which elements are the bottlenecks
● do_push_buffer_(pre|post), do_push_buffer_list_(pre|post)
● Reducing the overhead as much as possible
– Avoid memory alloc/free: all timestamps are stored in statically pre-allocated memory
– Avoid logs: all entries are logged at the end of the execution
– Post-processing: logs in CSV format that can be processed by an R script
latency-profiling: latency added by each Kurento Element (using GstMeta)
● Low level: which functions spend the most CPU (using callgrind)
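As a minimal sketch of the high-level checks above (the commands are the ones named on the slide; the pid below is a placeholder, here the shell's own, that you would replace with the pid of your GStreamer process):

```shell
# High-level profiling of a running process from the outside.
PID=$$   # placeholder: use the pid of your GStreamer process

# Number of threads (nlwp = number of light-weight processes)
ps -o nlwp= -p "$PID"

# CPU counters for 2 seconds, if perf is installed (Linux only)
if command -v perf >/dev/null 2>&1; then
    perf stat -p "$PID" -- sleep 2
fi
```

A steadily growing thread count or an unexpectedly high context-switch count in the `perf stat` output is usually the first hint of where to dig deeper.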
Applying solutions
● Functions: top-down. Repeat this process:
1) Remove unnecessary code
2) Reduce calls
   a) Is more than one call really needed?
   b) Reuse results (CPU vs. memory trade-off)
3) Go into lower-level functions
● GstElements
1) Remove unnecessary elements
2) Reduce/Reuse elements
Study case I
● The one2many case
● What do we want to improve?
Increase the number of senders on a machine
Reduce the consumed resources with a fixed number of viewers
Study case I: The pipeline

[Slide shows the full pipeline graph dump: two KmsWebrtcEndpoint bins with all elements in PLAYING state, each containing the WebRTC transport stack (GstNiceSrc/GstNiceSink, GstDtlsSrtpEnc/GstDtlsSrtpDec with GstSrtpEnc/GstSrtpDec and GstFunnel), a GstRtpBin with RTP sessions, SSRC/PT demuxers and jitter buffers, VP8/Opus payloaders and depayloaders, GstRTPRtxQueue, and KmsAgnosticBin2 branches built from tees, queues and fakesinks]
Study case I (cont.)
● Analyzing the sender part of the pipeline
● We detected that:
funnel is quite inefficient
● https://bugzilla.gnome.org/show_bug.cgi?id=749315
srtpenc does unnecessary work
● https://bugzilla.gnome.org/show_bug.cgi?id=752774
funnel: time-profiling (nanoseconds)

    pad                       mean      e_mean      e_min (accumulative)
 1  dtlssrtpenc1:src          163034.5  163034.478  49478
 2  funnel:src                170207.5  7173        2029
 3  srtp-encoder:rtp_src_1    317373.9  147166.435  57318
 4  :proxypad40               716469.7  399095.739  105379
 5  rtpbin1:send_rtp_src_1    781019    64371.783   1832
 6  rtpsession3:send_rtp_src  784436    3417        859
 7  :proxypad35               802532    18096       5632
 8  rtprtxqueue3:src          806016.1  3484.174    1245
 9  rtpvp8pay1:src            834627.3  28611.217   8957
10  :proxypad46               905171.5  69938.136   21206
11  kmswebrtcep0:video_src_1  912607    7435.455    2126
12  kmsagnosticbin2-1:src_0   918833.2  6226.227    2283
13  queue3:src                925268.2  6434.955    2486
funnel: callgrind profiling
● IDEA: look for chain functions to see the accumulated CPU usage of the downstream flow.
● CPU percentages (downstream, ordered by Incl. in kcachegrind)

100   - gst_rtp_base_payload_chain
93.99 - gst_rtp_rtx_queue_chain + gst_rtp_rtx_queue_chain_list
90.90 - gst_rtp_session_chain_send_rtp_common
80.13 - gst_srtp_enc_chain + gst_srtp_enc_chain_list
53.35 - srtp_protect
19.51 - gst_funnel_sink_chain_object
 9.82 - gst_pad_sticky_events_foreach
 8.79 - gst_base_sink_chain_main
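A minimal sketch of such a callgrind run. The short `audiotestsrc` pipeline here is only a stand-in; you would point valgrind at your real media server process instead:

```shell
# Run a short pipeline under callgrind and inspect the result in kcachegrind.
# valgrind and gst-launch-1.0 are assumed to be installed.
if command -v valgrind >/dev/null 2>&1 && command -v gst-launch-1.0 >/dev/null 2>&1; then
    valgrind --tool=callgrind --callgrind-out-file=callgrind.out \
        gst-launch-1.0 -q audiotestsrc num-buffers=100 ! fakesink
    # Then: kcachegrind callgrind.out
    # Sort by "Incl." and search for *_chain symbols to follow the downstream flow.
fi
```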
funnel: callgrind graph

[Slide shows the kcachegrind call graph for the chain functions above]
funnel: solution

● Applying solution type 2.a): send sticky events only once
● Add a property to the funnel element (“forward-sticky-events”)
If set to FALSE, sticky events are not forwarded on sink pad changes.

Results
CPU improvement: ~100%
Time before: 147166 ns
Time after: 5829 ns
srtpenc
● Applying solution type 1)
● srtpenc: remove unnecessary rtp/rtcp checks
https://bugzilla.gnome.org/show_bug.cgi?id=752774
CPU improvement: 2.89 / (100 – 58.90) = 7%
Other examples
● g_socket_receive_message: most of its CPU usage is wasted in error management
https://bugzilla.gnome.org/show_bug.cgi?id=752769
latency-profiling
● Mark buffers with a timestamp using GstMeta
● Adds considerable overhead
Sampling (do not profile every buffer)
GstMeta pool?
● DEMO (WebRtcEp + FaceOverlay)
Real-time profiling
WebRTC, decoding, video processing, encoding...
General remarks (BufferLists)
Use BufferLists whenever you can
Pushing buffers through pads is not free
Really important in large pipelines
Pushing BufferLists through pads costs the same CPU as pushing a single buffer
Pushing BufferLists through some elements costs the same CPU as pushing a single buffer. E.g.: tee, queue
Kurento has funded and participated in the BufferList support of a lot of elements
Open discussion: queue: add a property to allow pushing all queued buffers together
● https://bugzilla.gnome.org/show_bug.cgi?id=746524
General remarks (BufferPool)
Extend the usage of BufferPool
A significant CPU % is spent allocating/freeing buffers
Nowadays, memory is much cheaper than CPU, so let's take advantage of this
Example
Buffers of different sizes, but always < 1500 bytes, are allocated
Configure a BufferPool to generate buffers of 1500 bytes and reuse them in a BaseSrc, Queue, RtpPayloader, etc.
General remarks (Threading)
GStreamer could be improved a lot in threading aspects
Each GstTask has its own thread, which is idle most of the time
A lot of threads → too many context switches → wasted CPU
The Kurento team proposes using thread pools and avoiding blocking threads
Kurento has funded the development of the first implementation of TaskPool (thanks Sebastian ;) )
● http://cgit.freedesktop.org/~slomo/gstreamer/log/?h=task-pool
● It is not finished, let's try to push it forward
Ambitious architecture change
● Sync vs Async
● Move to a reactive architecture
Conclusion/Future work
● Take performance into account
● Performance can be as important as whether a feature works properly
● Processing-time restrictions
● Embedded devices
Automatic profiling
Reduce manual work
Continuous integration: pass criteria to accept a commit
Warnings
Thank you
Miguel París
[email protected]
http://www.kurento.org
http://www.github.com/kurento
[email protected]
Twitter: @kurentoms
http://www.nubomedia.eu
http://www.fi-ware.org
http://ec.europa.eu