
A distributed object detector-tracker aided video encoder for smart camera networks

Srivatsa Bhargava Jagarlapudi
Department of Electrical Communication Engineering, RBCCPS, IISc, Bangalore
[email protected]

Pushkar Gorur*
Corporate R&D, Qualcomm, Bangalore
[email protected]

Bharadwaj Amrutur
Department of Electrical Communication Engineering, RBCCPS, IISc, Bangalore
[email protected]

ABSTRACT

In this paper, we propose a Region of Interest (ROI) modulated H.264 video encoder system, based on a distributed object detector-tracker framework, for smart camera networks. The locations of objects of interest, as determined by the detector-tracker, are used to semantically partition each frame into regions assigned multiple levels of importance. A distributed architecture is proposed to implement the object detector-tracker framework and mitigate its computational cost. Further, a rate control algorithm with a modified Rate-Distortion (RD) cost is proposed to determine the Quantization Parameter (QP) and skip decision of Macro Blocks based on their relative levels of importance. Our experiments show that the proposed system achieves up to a 3x reduction in bitrate without significant reduction in the PSNR of the ROI (the head-shoulder region of pedestrians). We also demonstrate the trade-off between total computational cost and compression possible with the proposed distributed detector-tracker framework.

KEYWORDS

ROI aided video encoding, Edge analytics, Distributed object detection and tracking, Rate control.

ACM Reference format:
Srivatsa Bhargava Jagarlapudi, Pushkar Gorur, and Bharadwaj Amrutur. 2017. A distributed object detector-tracker aided video encoder for smart camera networks. In Proceedings of ICDSC 2017, Stanford, CA, USA, September 5–7, 2017, 7 pages.
https://doi.org/10.1145/3131885.3131920

* Contributions by the author during his PhD at the ECE Department, IISc Bangalore.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICDSC 2017, September 5–7, 2017, Stanford, CA, USA

© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-5487-5/17/09...$15.00
https://doi.org/10.1145/3131885.3131920

Figure 1: Smart camera network architecture with a shared aggregator node that runs the object detector, and cameras running a low-complexity tracker and video encoder. D-Frame: frame on which the object detector is run. T-Frame: frame in which objects are tracked locally on the edge camera node.

1 INTRODUCTION

We are approaching an inflection point in the large scale deployment of networked surveillance cameras to monitor our public spaces. Centralized analysis and storage of the captured streams will become a significant challenge due to the increased bit-rate requirements of the aggregated video streams. Applications such as face recognition, Automatic Number Plate Recognition, etc., will require high resolution video streams to be captured from these cameras in order to reliably analyze them. Since, in surveillance applications, we are typically interested in certain objects such as pedestrians, vehicles, etc., we can encode the image regions corresponding to these objects of interest with relatively higher fidelity and skip the background regions, reducing the bit-rate without affecting the end application. This has to be done close to the compute constrained edge camera nodes to reduce upstream data-rate requirements. The idea of using Region of Interest (ROI) detection to aid video encoding has been explored in many works [14], [8], [9], [5], [2]. Most of this work has focused on modeling the background and using the estimated foreground mask as the Region of Interest (for computational feasibility), irrespective of the semantic relevance of the foreground blob to the application. In [7], a sparse-dense spatial sampling based cascaded segmentation scheme was proposed to identify the Macro Blocks containing moving objects and perform the skip decision.

Due to recent developments in Deep Neural Networks (DNNs), many methods to train reliable and generic object detectors, such as [6] and [12], have been proposed in the literature. These object detectors are computationally expensive due to the very deep neural networks used to compute the features. A study [3] reported that DNN architectures such as VGG Net consume around 10-13 Watts of power when run on an Nvidia TX1 (a significantly powerful embedded SoC platform). Hence, it is infeasible to run such object detectors on every frame on the relatively less powerful processors used in edge camera nodes.

We propose to use a DNN based object detector in a distributed detector-tracker framework to mitigate the aforementioned computational complexity problem. As shown in Figure 1, the computationally expensive object detector is executed on an aggregator node. This functionality of the aggregator can be shared by multiple camera nodes, which run a low-complexity tracker to track object movements locally while using the object detector (running on the aggregator node) periodically in round-robin fashion. The object categories and their locations determined by the detector-tracker are used to semantically partition the frame into regions with multiple levels of importance for the application. Further, the importance level is used to modulate the QP used for the corresponding Macro Blocks (MBs) during encoding. Specifically, in this work, we address the problem of encoding the head-shoulder region of pedestrians with the highest fidelity (Region of Interest, ROI) and the torso region with relatively lower fidelity (Region of Reduced Interest, RORI), and skipping the background region (Region of No Interest, RONI) in all the P-frames. A similar scheme can be adopted for a different application by redefining these regions accordingly.

In Glimpse [4], a distributed scheme to recognize and track objects by offloading the recognition computation to a server was explored. A frame differencing based active caching scheme was proposed to determine frames with significant movement that could be offloaded to the server for recognition. In our scheme, by contrast, cameras perform object detection in round-robin fashion. The active caching scheme is complementary to the scheme proposed here and could further improve performance by making the detection interval adaptive to movement.

Following are the key contributions of this work:

• An ROI aided video encoder system that uses information from an integrated object detection and tracking framework.

Figure 2: Block diagram of the proposed system showing the distributed object detection-tracking framework, integrated with the modified H.264 encoder.

Figure 3: Representative frame showing the different categories of Macro Blocks computed by the proposed framework. Clustering of multiple pedestrians into one foreground blob, and a single pedestrian causing multiple foreground blobs, can both be observed in this frame.

• A low complexity tracker that lends itself to a distributed object detector-tracker framework on a camera network, proposed and implemented in order to mitigate the computational complexity of using a DNN based object detector such as [6].

• A scheme to modulate the QPs of Macro Blocks to achieve low loss in the ROI while meeting the required bit-rate constraints.

2 THE PROPOSED SYSTEM

The proposed ROI aided video encoder system, shown in Figure 2, comprises a distributed object detector-tracker framework; an MB level interest map generator (referred to as the video analytics engine), which analyzes each incoming frame to generate the ROI information; and an H.264 encoder with a modified rate control algorithm that utilizes the ROI information during encoding.

2.1 Video Analysis

The video analytics engine accepts an incoming frame and marks its Macro Blocks according to their levels of importance. We define the following four categories for a Macro Block:

• Region of Interest Macro Block (ROI-MB)
• Region of Reduced Interest Macro Block (RORI-MB)
• Region of No Interest Macro Block (RONI-MB)
• Un-Explained Macro Block (UnExp-MB)

The MBs are categorized into these classes based on the locations of the objects of interest and the foreground mask. Specifically, given the locations of pedestrians, the MBs contained in the upper one-third portion of a pedestrian bounding box are marked as ROI-MBs and those in the lower two-thirds portion are marked as RORI-MBs. The Macro Blocks in the foreground region that are not inside any pedestrian bounding box are classified as UnExp-MBs, and the MBs in the background region are classified as RONI-MBs, as shown in Figure 3.
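To make this mapping concrete, the following is a minimal Python sketch of the categorization rule, assuming 16x16 macroblocks and axis-aligned pedestrian boxes; all names here are illustrative and not taken from the authors' implementation.

```python
import numpy as np

MB = 16  # H.264 macroblock size in pixels

# Interest levels, ordered to match alpha_RONI <= alpha_RORI <= alpha_UnExp <= alpha_ROI.
RONI, RORI, UNEXP, ROI = 0, 1, 2, 3

def interest_map(frame_h, frame_w, pedestrian_boxes, fg_mask):
    """Label every macroblock with one of the four interest levels.

    pedestrian_boxes: list of (x, y, w, h) boxes from the detector-tracker.
    fg_mask: per-pixel boolean foreground mask from GMM segmentation.
    """
    rows, cols = frame_h // MB, frame_w // MB
    labels = np.full((rows, cols), RONI, dtype=np.uint8)

    # Foreground MBs not covered by any pedestrian box below stay UnExp.
    for r in range(rows):
        for c in range(cols):
            if fg_mask[r * MB:(r + 1) * MB, c * MB:(c + 1) * MB].any():
                labels[r, c] = UNEXP

    for (x, y, w, h) in pedestrian_boxes:
        left, right = x // MB, (x + w) // MB + 1
        top, split = y // MB, (y + h // 3) // MB + 1
        bottom = (y + h) // MB + 1
        # Upper one-third of the box: head-shoulder region (ROI).
        labels[top:split, left:right] = np.maximum(labels[top:split, left:right], ROI)
        # Lower two-thirds: torso/legs (RORI), never demoting an ROI MB.
        labels[split:bottom, left:right] = np.maximum(labels[split:bottom, left:right], RORI)
    return labels
```

Using np.maximum rather than plain assignment ensures that when boxes overlap, a macroblock keeps its highest interest level.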

2.2 Object Detection and Tracking

We use the object detection framework “Faster-RCNN” [6], with a VGG-16 network for feature extraction, to detect pedestrians. Though we consider only the pedestrian's head-shoulder region as ROI in this paper, the use of a generic object detection framework means the proposed system can easily be extended to include more classes as objects of interest.

In order to mitigate the challenge posed by computational complexity and to handle noisy detections due to occlusions, we developed a low-complexity tracker that makes use of low level features such as the foreground mask, sparse optical flow (KLT) vectors, and color histograms to track the detected pedestrians. Our framework currently computes the foreground mask by modeling the background using a Gaussian Mixture Model (GMM) [15] at all pixels. Use of the cascaded segmentation scheme proposed in [7] (which is complementary to our work) would further reduce the computational cost.
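For reference, OpenCV's MOG2 background subtractor implements the adaptive GMM of [15], so a foreground mask and blob bounding boxes of the kind used by the tracker can be obtained along these lines (parameter values are illustrative, not the paper's settings):

```python
import cv2

# MOG2 implements the improved adaptive GMM of [15].
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=False)

def foreground_blobs(frame_bgr, min_area=64):
    """Return a binary foreground mask and bounding boxes of FG blobs."""
    fg_mask = subtractor.apply(frame_bgr)
    fg_mask = cv2.medianBlur(fg_mask, 5)  # suppress speckle noise
    contours, _ = cv2.findContours(
        fg_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) >= min_area]
    return fg_mask, boxes
```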

Since running the object detector on every frame is computationally expensive, each camera node sends a JPEG encoded frame to the aggregator node once every N frames, and the object detector is run on it. In the remaining N-1 frames, the tracker running on the camera node locally tracks the detected objects. Hence we have two categories of frames: D-Frames, on which the object detector is run, and T-Frames, in which the objects are tracked (locally at the edge camera node) using low complexity features. The formulation used for tracking objects in D-Frames is similar to the one specified in [1]. Let $y_d^j$ represent the bounding box corner and dimensions of the $j$th object detection, and $h_d^j$ the feature descriptor (color histogram) of the $j$th detection, in the $t$th D-Frame. Let $N_T(t)$ be the number of tracklets, $X_i(t)$ the state of the position of tracklet $i$ in the $t$th frame, $A$ the data association matrix that relates the detections to the tracklets, $d_B(\cdot,\cdot)$ the Bhattacharyya distance between the color histograms of a detection and a tracklet template, and $u(\cdot)$ the uniform distribution. The observation likelihood of detections in a D-Frame can be written as

p(y_d^j \mid X, A) = u(y_d^j)^{A_{j0}} \prod_{i=1}^{N_T(t)} \mathcal{N}(y_d^j \mid X_i(t), \Sigma_d)^{A_{ji}}    (1)

p(h_d^j \mid X, A) = u(h_d^j)^{A_{j0}} \prod_{i=1}^{N_T(t)} \left( \lambda e^{-\lambda d_B(h_d^j(t),\, h_T^i(t-1))} \right)^{A_{ji}}    (2)

Note that in the case of D-Frames (which occur once every N frames), a single detection per pedestrian is obtained (barring missing or false positive detections, which are managed by standard tracker formulations). Many works such as [1], [10] proposed trackers based on the “tracking-by-detection” approach, which assumes the availability of object detections for every frame. We track objects in D-Frames using a Kalman filter based state update. The Bhattacharyya distance between color histograms is used as the distance metric to solve for data association, associating each measurement in the current frame to the closest matching tracklet template in the previous two frames. The object tracking routine executed on D-Frames is described in Algorithm 1.
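The core of that data association step can be sketched as follows, using OpenCV's Bhattacharyya histogram distance; this simplified version matches against a single set of tracklet templates with a greedy assignment and an illustrative gating threshold, rather than the paper's full two-frame formulation.

```python
import cv2
import numpy as np

def associate_detections(det_hists, tracklet_hists, max_dist=0.4):
    """Greedily match D-Frame detections to tracklets by Bhattacharyya
    distance between normalized color histograms. Returns a dict
    {detection_index: tracklet_index}; unmatched detections would seed
    new tracklets once confirmed T_birth times."""
    if not det_hists or not tracklet_hists:
        return {}
    cost = np.array([[cv2.compareHist(d, t, cv2.HISTCMP_BHATTACHARYYA)
                      for t in tracklet_hists] for d in det_hists])
    matches = {}
    for _ in range(min(len(det_hists), len(tracklet_hists))):
        j, i = np.unravel_index(np.argmin(cost), cost.shape)
        if cost[j, i] > max_dist:  # gate implausible associations
            break
        matches[j] = i
        cost[j, :] = np.inf  # each detection and tracklet used at most once
        cost[:, i] = np.inf
    return matches
```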

To reduce the computational cost, the objects are tracked in the remaining (N-1) T-Frames using a low-complexity tracker. If we use only the foreground blob bounding boxes as measurements, the observation likelihood takes a complex form, since we can encounter foreground (FG) blobs comprising more than one pedestrian, as illustrated in Figure 3. The number of pedestrians clustering to form a foreground blob is represented by the clique count variable C, which needs to be estimated for every foreground blob to disambiguate and successfully track the pedestrians. The observation likelihood in this case will be

p(y_b^j \mid X, A, C) = u(y_b^j)^{A_{j0}} f_y(y_b^j \mid X, A, C)^{(1 - A_{j0})}

p(h_b^j \mid X, A, C) = u(h_b^j)^{A_{j0}} f_h(h_b^j \mid X, A, C)^{(1 - A_{j0})}    (3)

Here, the observation likelihood of the foreground blob bounding box assumes a skew normal distribution form, since the left-top and bottom-right coordinates of the FG blob bounding box are the minimum and maximum values of the individual tracklet bounding box locations, which are assumed to be Gaussian distributed.

f_y(y_b^j \mid X, A, C) = \sum_{i=1}^{N_T} \Big\{ \mathcal{N}(y_b \mid X_i, \sigma_X^i) \cdot A_{ji} \prod_{n=1,\, n \neq j}^{N_T} \Big( 0.5 - 2 \cdot \mathrm{erf}\Big( \frac{y_{bX} - X_n}{\sqrt{2}\, \sigma_X^n} \Big) \Big)^{A_{jn}} \Big\}    (4)

Due to the clustering of multiple trajectories into common foreground blobs, the difficulty in disambiguating individual trajectory locations manifests as difficulty in performing ML estimation of the parameters of the skew-normal distribution. This is further aggravated by inaccurate foreground segmentation, which results in a single object generating multiple foreground blobs, hence requiring spatial clustering of blobs that belong to the same object. As a result, the methods used in the standard “tracking by detection” approaches cannot be directly applied. Hence, we make use of sparsely computed optical flow vectors (computed at Shi-Tomasi corner points) to cluster the foreground blobs based on the originating points of the flow vectors, and we estimate the clique count as the number of clusters in the orientation histogram of the filtered flow vectors. Noisy optical flow vectors are discarded by filtering using NCC scores and a forward-backward consistency check. Further, the location of each pedestrian in the current frame within a foreground blob is computed after solving for data association using color histogram matching. Finally, object motion vectors are refined to minimize the histogram distance by searching uniformly within a MB region. The tracking routine executed on T-Frames is described in Algorithm 2.
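A minimal sketch of the flow filtering step: Shi-Tomasi corners are tracked with pyramidal KLT, and vectors that fail a forward-backward consistency check are discarded. The NCC-score filter mentioned above is omitted, and the thresholds are illustrative.

```python
import cv2
import numpy as np

def filtered_flow(prev_gray, curr_gray, fb_thresh=1.0):
    """Return (start_points, end_points) of sparse KLT flow vectors that
    survive a forward-backward consistency check."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    fwd, st_f, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    # Track the forward endpoints back; a reliable vector returns close
    # to its starting corner.
    bwd, st_b, _ = cv2.calcOpticalFlowPyrLK(curr_gray, prev_gray, fwd, None)
    fb_err = np.linalg.norm(pts - bwd, axis=2).ravel()
    keep = (st_f.ravel() == 1) & (st_b.ravel() == 1) & (fb_err < fb_thresh)
    return pts.reshape(-1, 2)[keep], fwd.reshape(-1, 2)[keep]
```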

Data: Input frame frame_t, list of tracklets trackletList, frame number fNum, unsupported measurements list unsuppMeasList.
Result: List of updated tracklets updatedTrackletList.

detList = ObjectDetector(frame_t)
predict_all_tracklet_states_kalman(trackletList)
for det_i in detList do
    sim1 = computeSimilarity(det_i, allTrackletStates(fNum - 1))
    sim2 = computeSimilarity(det_i, allTrackletStates(fNum - 2))
    i_assoc = resolveAssociations(sim1, sim2)
    if i_assoc then
        tracklet.associate(det_i)
    else
        unsuppMeasList.append(unsuppObj(det_i))
    end
end
for tracklet_j in trackletList do
    correct_prediction_kalman(tracklet_j)
    if tracklet_j.lastUpdate ≤ (fNum - T_death) then
        delete(tracklet_j)
    end
end
for unsuppObj in unsuppMeasList do
    if unsuppObj.numUpdates ≥ T_birth then
        createTracklet(unsuppObj)
    end
end
trackletList.Update()

Algorithm 1: Tracker routine run on D-Frames

Data: Input frame frame_t, list of tracklets trackletList, frame number fNum, unsupported measurements list unsuppMeasList.
Result: List of updated tracklets updatedTrackletList.

fgBlobList, currFrame_m = run_FGBG_segm(frame_t)
trackletKPs = compute_ST_corners(prevFrame_m)
predict_all_tracklet_states(trackletList)
sparseOFVList = getSparseOFV(prevFrame_m, currFrame_m, trackletKPs)
for tracklet in trackletList do
    trackletClique = {fgBlob ∈ fgBlobList | frac_OFV(fgBlob, tracklet) ≥ τ}
    clustered_fg_blobs.append(trackletClique)
end
update_fgBlobList(clustered_fg_blobs)
for fgBlob in fgBlobList do
    fgBlobClique = {tracklet ∈ trackletList | frac_OFV(fgBlob, tracklet) ≥ τ}
    C_j = update_blob_clique(fgBlobClique, predicted_tracklets)
end
for tracklet in trackletList do
    init_mv = mean(tracklet_OFVs)
    final_mv = argmin_mv dist(C_j · tracklet.hist, hist(tracklet.loc + mv))
    correct_prediction_kalman(tracklet.loc + final_mv)
end
for tracklet in trackletList do
    if tracklet.lastUpdate < (fNum - T_death) then
        delete(tracklet)
    end
end

Algorithm 2: Tracker routine run on T-Frames

2.3 Modulating the Quantization Parameter and Skip Decision

The Macro Blocks are categorized into the ROI-MB, RORI-MB, RONI-MB and UnExp-MB categories by analyzing each frame. The Quantization Parameter of each MB should be modulated using this information, to encode the ROI-MBs with the least loss and the RORI-MBs with relatively lower loss. In an H.264 video encoder, Rate Distortion Optimization (RDO) is performed to compute the mode and QP that minimize the cost shown in Equation 5, where D(QP) is the measure of distortion caused by quantization, R(QP) is the bitrate cost, and λ is the Lagrange multiplier (which arises from the constrained optimization problem of minimizing the distortion subject to the bitrate constraint).

J(QP) = D(QP) + λ R(QP)    (5)


Many ROI aided rate control approaches have been proposed in the recent literature. In [13], an enhancement layer based ROI aided rate control scheme for scalable video coding was proposed, focusing on independent decoding of ROI slices without any temporal constraints; an additional distortion term modeling the distortion of the error concealment layer was added to the RD cost function of the slices belonging to the ROI. In [11], the authors propose a rate control algorithm for HEVC based on the R-λ model that uses a user selected constant ratio between the ROI and non-ROI bitrates. In our approach, we specify relative interest parameters for the regions to allow a soft allocation of bit budgets across the various regions. For this, the distortion term D(QP) in the RD cost function of each MB is modulated with a relative interest parameter α_MB, as shown in Equation 6. The parameter α_MB can take one of the four values α_ROI_MB ≥ α_UnExp_MB ≥ α_RORI_MB ≥ α_RONI_MB, which are set depending on the relative interest of the regions for the end application. During the RDO, the MBs corresponding to higher importance regions see an increased distortion cost and are automatically assigned smaller QPs than MBs corresponding to lower interest regions. UnExp-MBs can be encoded with the same level of importance as ROI-MBs, in order to mitigate the loss of PSNR in the ROI region due to inaccurate object detection and tracking. If a Macro Block is categorized as a RONI-MB, we mark it as a skip macro block in the P-frame. Hence, background macro blocks get encoded only in the I-frames, with the QP determined by α_RORI_MB, and are skipped in the P-frames.

J(QP) = α_MB D(QP) + λ R(QP)    (6)
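In effect, the RDO loop evaluates this modulated cost per macroblock. Below is a minimal sketch of the QP selection implied by Equation 6, with the α values taken from the experiments in Section 3; the λ(QP) model is the widely used H.264 one and is an assumption here, since the paper does not state which model its x264 modification inherits.

```python
# Relative-interest weights used in the experiments of Section 3.
ALPHAS = {"ROI": 1000.0, "UNEXP": 500.0, "RORI": 50.0, "RONI": 1.0}

def lambda_h264(qp):
    # Widely used H.264 Lagrange multiplier model (an assumption here).
    return 0.85 * 2.0 ** ((qp - 12) / 3.0)

def choose_qp(mb_label, qp_candidates, distortion, rate):
    """Pick the QP minimizing J(QP) = alpha_MB * D(QP) + lambda * R(QP).

    distortion(qp) and rate(qp) are callables returning the measured
    distortion and bit cost of this macroblock at a given QP."""
    alpha = ALPHAS[mb_label]
    return min(qp_candidates,
               key=lambda qp: alpha * distortion(qp) + lambda_h264(qp) * rate(qp))
```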

3 EXPERIMENTAL RESULTS

To evaluate the proposed ROI aided video encoder system, we integrated the described object detector-tracker framework with the x264 encoder, which was modified to implement the proposed rate control mechanism. As mentioned in Section 1, in this work we explicitly address the problem of encoding the head-shoulder region of pedestrians in surveillance videos as the ROI and the torso region as the RORI. To the best of our knowledge, no standard dataset directly provides annotations of the head-shoulder region. Hence, we created our own dataset comprising two surveillance videos (the porch and entranceRoad sequences) with ROI annotations. This dataset is made publicly available.¹

As can be visually observed in Figure 4, the proposed ROI aided encoder encodes the head-shoulder region with higher fidelity and the background with much lower fidelity.

To quantitatively analyze the performance, we evaluated the proposed system by plotting the rate-distortion curves shown in Figure 5 and Figure 6, wherein the PSNR was computed over the ROI regions that were manually annotated in our dataset. These plots were obtained by running the experiment on 300-frame sequences of 1280x720 resolution (with α_ROI_MB = 1000, α_RORI_MB = 50, α_UnExp_MB = 500, α_RONI_MB = 1). As observed in Figure 5 and Figure 6, the proposed ROI aided encoder yields significant bitrate savings of up to 3x over the default x264 rate control algorithm (for the same PSNR values). It can also be observed that as the interval between successive detections increases, the PSNR in the true ROI region reduces. This is due to tracker drift, which grows with the detection interval and results in an increase in the number of unexplained MBs.

¹ The YUV files of these sequences and annotations can be downloaded from: https://goo.gl/JzuYks

Figure 4: Representative frame crop showing the head-shoulder region encoded at 2 Mbps with and without ROI from the entranceRoad sequence. Left: without ROI aided encoding. Right: with ROI aided encoding.

Figure 5: Bitrate vs PSNR (luma) of the ground truth ROI for the entranceRoad video, with the detector running once every 1, 3 and 5 frames, with and without the proposed ROI aided rate control. Encoder run at the 10 fps setting.
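For reference, the ROI-restricted luma PSNR used in these curves can be computed as below; this is a straightforward reimplementation from the definition, not the authors' evaluation script.

```python
import numpy as np

def roi_psnr(ref_luma, dec_luma, roi_mask):
    """Luma PSNR over annotated ROI pixels only (8-bit video).

    ref_luma, dec_luma: uint8 arrays of the original and decoded Y planes.
    roi_mask: boolean array marking the annotated ROI pixels.
    """
    diff = ref_luma.astype(np.float64) - dec_luma.astype(np.float64)
    mse = np.mean(diff[roi_mask] ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```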

In Figure 7, the per-frame average PSNR values of the ROI, RORI and RONI Macroblocks are plotted for the entranceRoad sequence encoded at 2 Mbps at the 10 fps setting (using the same α_ROI_MB, α_RORI_MB, α_UnExp_MB, α_RONI_MB settings as given above). As can be observed from the graph, the modified rate control algorithm assigns lower Quantization Parameters to ROI regions than to RORI and RONI regions, resulting in the observed PSNR order PSNR_ROI ≥ PSNR_RORI ≥ PSNR_RONI.

Figure 6: Bitrate vs PSNR (luma) of the ground truth ROI for the porch video, with the detector running once every 1, 3 and 5 frames, with and without the proposed ROI aided rate control. Encoder run at the 10 fps setting.

Figure 7: PSNR in the computed ROI, RORI and RONI regions for each frame of the entranceRoad sequence encoded at 2 Mbps and 10 fps.

Figure 8 shows that the object detection-tracking framework is robust to compression of the D-Frames up to a compression ratio of 0.15 at a reduced resolution of 576x324. Hence, the proposed distributed object detector and low-complexity tracker can be implemented without significant network bandwidth overhead from the transmission of JPEG compressed D-Frames from the edge cameras to the aggregator node. As can be observed by comparing column 3 and column 5 of Table 1, the overhead of transmitting the JPEG compressed D-Frames is small compared to the data rate of the video stream. It may also be noted that this overhead is limited to the link between the edge camera node and the aggregator node and does not affect the backhaul data rate.

Figure 8: PSNR of the Region of Interest when the D-Frames were compressed with a JPEG encoder at various compression ratios before running the object detector, with a 3-frame interval between object detector runs.

The proposed ROI aided video encoder with the object detection-tracking framework is currently implemented on a desktop core-i5 CPU and a Titan Black GPU, with the object detector (faster-RCNN) and foreground-background segmentation running on the GPU and the rest of the framework running on the CPU in single threaded mode. For practical deployment, foreground-background segmentation could be implemented on a much smaller GPU on an edge camera node, since it is computationally far less expensive than the object detector, which needs the relatively larger GPU of the aggregator node (in our experiments, when executed on the same GPU, FG-BG segmentation requires 8 ms per frame, whereas the object detector requires 300 ms per frame). The average run times of our current implementation of each component of the proposed system on 1280x720 resolution sequences are given in Table 2. The object detector uses a VGG-16 network for feature extraction, which performs 178 G operations to detect objects in a 1280x720 image. Table 1 lists the computational costs involved in running the ROI aided encoder in various simulated system configurations (using execution time as a surrogate for computational complexity) and the resulting bitrate savings. Table 1 also shows that running the entire ROI aided encoder on the edge camera node results in a 3x reduction in bitrate over the standard x264 encoder, but at a computational cost of 7.25x. In the distributed detector-tracker framework, with 3 edge camera nodes sharing the detector, we obtain a 2.74x reduction in bitrate at a computational cost of 3.85x; with 5 edge camera nodes sharing the detector, we notice bitrate savings of only 1.63x at a computational cost of 3.12x.


| System configuration | Edge camera compute per frame | Camera to aggregator data rate | Aggregator compute per frame | Backhaul data rate per camera | Total compute per camera per frame |
| Without ROI | 52.3 ms (1x) | 7.5 Mbps (1x) | N.A. | 7.5 Mbps (1x) | 52.3 ms (1x) |
| Encode, detection and tracking on edge camera | 379.5 ms (7.256x) | 2.5 Mbps (0.33x) | N.A. | 2.5 Mbps (0.33x) | 379.5 ms (7.256x) |
| Encode on edge camera; detection and tracking on aggregator node | 61.15 ms (1.169x) | 8.53 Mbps (1.137x) | 327.2 ms (6.25x) / camera | 2.5 Mbps (0.33x) | 379.5 ms (7.419x) |
| Encode and tracking on edge camera; detection on aggregator (N=3) | 100.383 ms (1.919x) | 3.422 Mbps (0.456x) | 303.2 ms (5.797x) / 3 cameras | 2.8 Mbps (0.364x) | 201.45 ms (3.85x) |
| Encode and tracking on edge camera; detection on aggregator (N=5) | 102.79 ms (1.965x) | 5.43 Mbps (0.72x) | 303.2 ms (5.797x) / 5 cameras | 4.6 Mbps (0.613x) | 163.43 ms (3.12x) |

Table 1: Comparison of computational costs and achieved bitrates for encoding the entranceRoad sequence at 32 dB ROI PSNR in various simulated system configurations. (Average computation time per frame is used as a surrogate measure of computational cost.)

| System component | Average execution time per frame |
| Object Detector | 300 ms (on Titan Black GPU) |
| D-Frame Tracker | 27.2 ms (on core-i5 CPU, 1 thread) |
| T-Frame Tracker | 54.1 ms (on core-i5 CPU, 1 thread) |
| H.264 Encoding | 52.3 ms (on core-i5 CPU, 1 thread) |

Table 2: Average per-frame execution times of the proposed system components

This clearly demonstrates the tradeoff possible between computation and bitrate savings, and indicates the existence of an optimal number of nodes that can share the detector (running on the aggregator node).

4 CONCLUSIONS AND FUTURE WORK

We have proposed an ROI aided video encoder using a generic object detection and tracking framework and demonstrated that up to 3x bitrate savings are achievable, without any significant reduction in PSNR computed over the Regions of Interest, compared to encoding without ROI information. Based on our study of various system configurations for implementing ROI aided encoding in camera networks, we demonstrated that significant bitrate savings are achievable at a reduced computational cost when a distributed detector-tracker framework is used. Our experiments also indicate scope for Power-Rate-Distortion optimization for such encoders in smart camera networks.

REFERENCES
[1] Sileye Ba, Xavier Alameda-Pineda, Alessio Xompero, and Radu Horaud. 2016. An on-line variational Bayesian model for multi-person tracking from cluttered scenes. Computer Vision and Image Understanding 153 (2016), 64–76.
[2] Sebastian Brutzer, Benjamin Hoferlin, and Gunther Heidemann. 2011. Evaluation of background subtraction techniques for video surveillance. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 1937–1944.
[3] Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An Analysis of Deep Neural Network Models for Practical Applications. arXiv preprint arXiv:1605.07678 (2016).
[4] Tiffany Yu-Han Chen, Lenin Ravindranath, Shuo Deng, Paramvir Bahl, and Hari Balakrishnan. 2015. Glimpse: Continuous, real-time object recognition on mobile devices. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems. ACM, 155–168.
[5] Carlos Cuevas and Narciso Garcia. 2013. Efficient moving object detection for lightweight applications on smart cameras. IEEE Transactions on Circuits and Systems for Video Technology 23, 1 (2013), 1–14.
[6] Ross Girshick. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[7] Pushkar Gorur and Bharadwaj Amrutur. 2014. Skip decision and reference frame selection for low-complexity H.264/AVC surveillance video coding. IEEE Transactions on Circuits and Systems for Video Technology 24, 7 (2014), 1156–1169.
[8] Hong Han, Jianfei Zhu, Shengcai Liao, Zhen Lei, and Stan Z. Li. 2015. Moving object detection revisited: Speed and robustness. IEEE Transactions on Circuits and Systems for Video Technology 25, 6 (2015), 910–921.
[9] Shih-Chia Huang and Bo-Hao Chen. 2013. Highly accurate moving object detection in variable bit rate video-based traffic monitoring systems. IEEE Transactions on Neural Networks and Learning Systems 24, 12 (2013), 1920–1931.
[10] Chanho Kim, Fuxin Li, Arridhana Ciptadi, and James M. Rehg. 2015. Multiple hypothesis tracking revisited. In Proceedings of the IEEE International Conference on Computer Vision. 4696–4704.
[11] Marwa Meddeb, Marco Cagnazzo, and Beatrice Pesquet-Popescu. 2014. Region-of-interest-based rate control scheme for high-efficiency video coding. APSIPA Transactions on Signal and Information Processing 3 (2014).
[12] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 779–788.
[13] Hongtao Wang, Dong Zhang, and Houqiang Li. 2015. A rate-distortion optimized coding method for region of interest in scalable video coding. Advances in Multimedia 2015 (2015), 1.
[14] Ching-Yu Wu and Po-Chyi Su. 2009. A Region of Interest Rate-Control Scheme for Encoding Traffic Surveillance Videos. In International Conference on Intelligent Information Hiding and Multimedia Signal Processing (2009).
[15] Zoran Zivkovic. 2004. Improved adaptive Gaussian mixture model for background subtraction. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, Vol. 2. IEEE, 28–31.