Improving GStreamer performance on large pipelines: from profiling to optimization
GStreamer Conference 2015
8-9 October 2015, Dublin, Ireland
Miguel París
[email protected]
Who I am
Miguel París
● Software Engineer
● Telematic Systems Master's
● Researcher at Universidad Rey Juan Carlos (Madrid, Spain)
● Kurento real-time manager
● Twitter: @mparisdiaz
Overview
GStreamer is quite good for developing multimedia apps, tools, etc. in an easy way.
It could be more efficient.
The first step: measuring / profiling
● Main principle: “you cannot improve what you cannot measure”
Detecting bottlenecks
Measuring the gain of the possible solutions
Comparing different solutions
● In large pipelines, a “small” performance improvement can make a “big” difference
The same holds when running a lot of pipelines on the same machine
Profiling levels
● Different levels of detail: the more detailed, the more overhead (typically)
● High level
Threads num: ps -o nlwp <pid>
CPU: top, perf stat -p <pid>
● Medium level
time-profiling: how much time is spent in each GstElement (using GstTracer)
● Easy way to determine which elements are the bottlenecks
● do_push_buffer_(pre|post), do_push_buffer_list_(pre|post)
● Reducing the overhead as much as possible
– Avoid memory alloc/free: all timestamps are stored in statically pre-allocated memory
– Avoid logs: all entries are logged at the end of the execution
– Post-processing: logs in CSV format that can be processed by an R script
latency-profiling: latency added by each Kurento Element (using GstMeta)
● Low level: which functions spend the most CPU (using callgrind)
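As a minimal sketch of the high-level checks above (the commands are the ones named on the slide; the pid below is a placeholder, here the shell's own, that you would replace with the pid of your GStreamer process):

```shell
# High-level profiling of a running process from the outside.
PID=$$   # placeholder: use the pid of your GStreamer process

# Number of threads (nlwp = number of light-weight processes)
ps -o nlwp= -p "$PID"

# CPU counters for 2 seconds, if perf is installed (Linux only)
if command -v perf >/dev/null 2>&1; then
    perf stat -p "$PID" -- sleep 2
fi
```

A steadily growing thread count or an unexpectedly high context-switch count in the `perf stat` output is usually the first hint of where to dig deeper.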
Applying solutions
● Functions: top-down. Repeat this process:
1) Remove unnecessary code
2) Reduce calls
   a) Is more than one call really needed?
   b) Reuse results (CPU vs. memory trade-off)
3) Go into lower-level functions
● GstElements
1) Remove unnecessary elements
2) Reduce/Reuse elements
Study case I
● The one2many case
● What do we want to improve?
Increase the number of senders on a machine
Reduce the consumed resources with a fixed number of viewers
Study case I: The pipeline

[Slide shows the full pipeline graph dump: two KmsWebrtcEndpoint bins with all elements in PLAYING state, each containing the WebRTC transport stack (GstNiceSrc/GstNiceSink, GstDtlsSrtpEnc/GstDtlsSrtpDec with GstSrtpEnc/GstSrtpDec and GstFunnel), a GstRtpBin with RTP sessions, SSRC/PT demuxers and jitter buffers, VP8/Opus payloaders and depayloaders, GstRTPRtxQueue, and KmsAgnosticBin2 branches built from tees, queues and fakesinks]
Study case I (cont.)
● Analyzing the sender part of the pipeline
● We detected that:
funnel is quite inefficient
● https://bugzilla.gnome.org/show_bug.cgi?id=749315
srtpenc does unnecessary work
● https://bugzilla.gnome.org/show_bug.cgi?id=752774
funnel: time-profiling (nanoseconds)

    pad                       mean      e_mean      e_min (accumulative)
 1  dtlssrtpenc1:src          163034.5  163034.478  49478
 2  funnel:src                170207.5  7173        2029
 3  srtp-encoder:rtp_src_1    317373.9  147166.435  57318
 4  :proxypad40               716469.7  399095.739  105379
 5  rtpbin1:send_rtp_src_1    781019    64371.783   1832
 6  rtpsession3:send_rtp_src  784436    3417        859
 7  :proxypad35               802532    18096       5632
 8  rtprtxqueue3:src          806016.1  3484.174    1245
 9  rtpvp8pay1:src            834627.3  28611.217   8957
10  :proxypad46               905171.5  69938.136   21206
11  kmswebrtcep0:video_src_1  912607    7435.455    2126
12  kmsagnosticbin2-1:src_0   918833.2  6226.227    2283
13  queue3:src                925268.2  6434.955    2486
funnel: callgrind profiling
● IDEA: look for chain functions to see the accumulated CPU usage of the downstream flow.
● CPU percentages (downstream, ordered by Incl. in kcachegrind)

100   - gst_rtp_base_payload_chain
93.99 - gst_rtp_rtx_queue_chain + gst_rtp_rtx_queue_chain_list
90.90 - gst_rtp_session_chain_send_rtp_common
80.13 - gst_srtp_enc_chain + gst_srtp_enc_chain_list
53.35 - srtp_protect
19.51 - gst_funnel_sink_chain_object
 9.82 - gst_pad_sticky_events_foreach
 8.79 - gst_base_sink_chain_main
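A minimal sketch of such a callgrind run. The short `audiotestsrc` pipeline here is only a stand-in; you would point valgrind at your real media server process instead:

```shell
# Run a short pipeline under callgrind and inspect the result in kcachegrind.
# valgrind and gst-launch-1.0 are assumed to be installed.
if command -v valgrind >/dev/null 2>&1 && command -v gst-launch-1.0 >/dev/null 2>&1; then
    valgrind --tool=callgrind --callgrind-out-file=callgrind.out \
        gst-launch-1.0 -q audiotestsrc num-buffers=100 ! fakesink
    # Then: kcachegrind callgrind.out
    # Sort by "Incl." and search for *_chain symbols to follow the downstream flow.
fi
```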
funnel: callgrind graph

[Slide shows the kcachegrind call graph for the chain functions above]
funnel: solution

● Applying solution type 2.a): send sticky events only once
● Add a property to the funnel element (“forward-sticky-events”)
If set to FALSE, sticky events are not forwarded on sink pad changes.

Results
CPU improvement: ~100%
Time before: 147166 ns
Time after: 5829 ns
srtpenc
● Applying solution type 1)
● srtpenc: remove unnecessary rtp/rtcp checks
https://bugzilla.gnome.org/show_bug.cgi?id=752774
CPU improvement: 2.89 / (100 – 58.90) = 7%
Other examples
● g_socket_receive_message: most of its CPU usage is wasted in error management
https://bugzilla.gnome.org/show_bug.cgi?id=752769
latency-profiling
● Mark buffers with a timestamp using GstMeta
● Adds considerable overhead
Sampling (do not profile every buffer)
GstMeta pool?
● DEMO (WebRtcEp + FaceOverlay)
Real-time profiling
WebRTC, decoding, video processing, encoding...
General remarks (BufferLists)
Use BufferLists whenever you can
Pushing buffers through pads is not free
Really important in large pipelines
Pushing BufferLists through pads costs the same CPU as pushing a single buffer
Pushing BufferLists through some elements costs the same CPU as pushing a single buffer. E.g.: tee, queue
Kurento has funded and participated in the BufferList support of a lot of elements
Open discussion: queue: add a property to allow pushing all queued buffers together
● https://bugzilla.gnome.org/show_bug.cgi?id=746524
General remarks (BufferPool)
Extend the usage of BufferPool
A significant CPU % is spent allocating/freeing buffers
Nowadays, memory is much cheaper than CPU, so let's take advantage of this
Example
Buffers of different sizes, but always < 1500 bytes, are allocated
Configure a BufferPool to generate buffers of 1500 bytes and reuse them in a BaseSrc, Queue, RtpPayloader, etc.
General remarks (Threading)
GStreamer could be improved a lot in threading aspects
Each GstTask has its own thread, which is idle most of the time
A lot of threads → too many context switches → wasted CPU
The Kurento team proposes using thread pools and avoiding blocking threads
Kurento has funded the development of the first implementation of TaskPool (thanks Sebastian ;) )
● http://cgit.freedesktop.org/~slomo/gstreamer/log/?h=task-pool
● It is not finished, let's try to push it forward
Ambitious architecture change
● Sync vs Async
● Move to a reactive architecture
Conclusion/Future work
● Take performance into account
● Performance can be as important as whether a feature works properly
● Processing-time restrictions
● Embedded devices
Automatic profiling
Reduce manual work
Continuous integration: pass criteria to accept a commit
Warnings
Thank you
Miguel París
[email protected]
http://www.kurento.org
http://www.github.com/kurento
[email protected]
Twitter: @kurentoms
http://www.nubomedia.eu
http://www.fi-ware.org
http://ec.europa.eu