
Faster Robot Perception Using Salient Depth Partitioning

Darren M. Chan, Angelique Taylor, and Laurel D. Riek

Abstract— This paper introduces Salient Depth Partitioning (SDP), a depth-based region cropping algorithm devised to be easily adapted to existing detection algorithms. SDP is designed to give robots a better sense of visual attention, and to reduce the processing time of pedestrian detectors. In contrast to proposal generators, our algorithm generates sparse regions, to combat image degradation caused by robot motion, making them more suitable for real-world operation. Furthermore, SDP is able to achieve real-time performance (77 frames per second) on a single processor without a GPU. Our algorithm requires no training, and is designed to work with any pedestrian detection algorithm, provided that the input is in the form of a calibrated RGB-D image. We tested our algorithm with four state-of-the-art pedestrian detectors (HOG and SVM [1], Aggregate Channel Features [2], Checkerboards [3], and R-CNN [4]), and show that it improves computation time by up to 30%, with no discernible change in accuracy.

I. INTRODUCTION

As robots become more prevalent in human life, it is important that they gain sufficient understanding of the people, objects, and spaces around them [5], [6]. One major challenge is giving robots the ability to detect people in unpredictable human-centric environments [7], and to enable them to respond and adapt to dynamic situations quickly. These challenges are even greater when humans and robots are both in motion [8], and when they perform time-sensitive tasks [9], [10].

However, pedestrian and object detection algorithms are computationally expensive. The most common methods incorporate a multi-scaled sliding window approach, followed by a classifier to determine whether or not an object in question exists in an image. This equates to searching each pixel for detection candidates, which leads to computationally demanding loads that are counter-productive for robots, which should ideally process images in real-time.

One solution to speed up detection algorithms is to use more powerful hardware (e.g., more GPUs). However, this is not always feasible because compact and power-efficient computing systems are needed for mobile robots.

Rather than have a robot’s attention divided among every possible location in an image, it can be given guidance about where to look. For example, autonomous vehicles require fast detection algorithms to ensure the safety of pedestrians and other inhabitants of the road. Vision-based algorithms (in contrast to LIDAR and RADAR techniques [11]–[13])

Some research reported in this article is based upon work supported by the National Science Foundation under Grant IIS-1527759.

The authors are with the Department of Computer Science and Engineering, UC San Diego, La Jolla, CA 92093, USA (email: [email protected], [email protected], [email protected])

typically exploit road geometry for ground-plane and sky estimation to increase detection speed [14], [15].

In more general applications where ground-plane estimation is not always tractable, proposal generators provide a better alternative. Proposal generators operate to suggest regions of an image that are most likely to contain objects and people, to alleviate the high computational demands of object detectors [16]–[19]. This idea stems from neurologically-inspired vision [20], where salient characteristics are detected and derived from low-level image features. However, proposal generators are typically trained and designed for static images that do not represent real-world, naturalistic data (i.e., data containing motion blur, rotation, and distortion).

In robotics, the prominence of RGB-D cameras has aided in the development of faster object segmentation and simultaneous localization and mapping (SLAM) algorithms [21], [22]. RGB-D methods are favored over solely RGB methods because they allow geometric and structural information to be extracted from a scene without incurring high computational overhead [23]. Thus, the processing times of region proposal generators can be reduced by exploiting depth information.

In this paper, we introduce a new algorithm, Salient Depth Partitioning (SDP), that extracts regions of interest from RGB-D images. Our algorithm serves as a basis for “where a robot should look” and is a preliminary step to object and people detection, to reduce computation time while minimally affecting detection accuracy. In addition, our algorithm requires no a priori knowledge of the environment or reliance on geometric constraints. It can work with virtually any detection algorithm given any calibrated RGB-D image. To validate SDP’s effectiveness, we integrated it with four state-of-the-art pedestrian detectors (HOG and SVM [1], Aggregate Channel Features [2], Checkerboards [3], and R-CNN [4]), and found that our algorithm decreases their runtimes by up to 30% with little to no degradation in detector accuracy.

II. RELATED WORK

Object detection¹ is a classification problem where an image must be extensively evaluated to determine where an object is located. However, computation time is wasted when a detector performs classification on background regions, or regions that do not contain objects. Proposal generators have emerged in the computer vision community as a method of pruning, to eliminate background regions and improve detector efficiency [24].

¹To eliminate ambiguity, we note here that object and pedestrian detectors follow similar algorithmic processes and in many cases, only differ in the way that they are trained. For this reason, when we refer to “objects”, we also include pedestrians.

2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), September 24–28, 2017, Vancouver, BC, Canada

U.S. Government work not protected by U.S. copyright

4152


Fig. 1. The steps of the SDP algorithm. (1) depicts the depth image in its unmodified form, and (2) shows the result of downsampling and Sobel edge-filtering. (3) removes noise via grid partitioning and edge density thresholding. (4) selects a region in the image using the remaining edges filtered by (3). The image is subpartitioned in (5), and transformed to accommodate the size of the original image. The transformed points are used in (6), where a detector is performed on subpartitions of the RGB image.

Object proposal generators typically employ computationally inexpensive algorithms to exploit low-level image features. These features are then leveraged to suggest candidate regions that are most likely to contain objects. For example, Zitnick and Dollár introduced EdgeBoxes to predict object locations by finding regions that contain dense edge features [25], [26], and Cheng et al. [27] used a sliding window approach along with gradient features to determine region candidates. Other methods rely on merging adjacent, similar pixels to segment object proposals (c.f. [28]–[30]).

These algorithms can produce on the order of several thousand image regions – many of which need to be removed, or computation time can be wasted. One common method is to prune the number of proposals using a cascade approach [31]. While cascade approaches have been shown to excel in object recognition tasks, they rely on using specific features from known object instances and cannot be used to detect non-learned object classes. Thus, they are not suitable for tasks where robots need the ability to detect and interact with objects that they are not fully familiar with, or objects outside of their trained recognition pipelines.

For object proposal algorithms to be effective in robotics, they must exhibit three characteristics. First, they must have low computational overhead, or processing times that are much lower than those of detectors.

Second, object and region proposal algorithms must prune as much of the background from the image as possible without removing any actual objects. Thus, it is desirable that they produce a minimally optimal number of proposals.

Third, object and region proposals should not overlap, to avoid having a detector redundantly process the same region multiple times. Although this problem can be somewhat mitigated with cascade approaches, there are no existing algorithms that directly address this last characteristic.

While there has been much progress on object proposal generators in computer vision, they are unsuitable for robots. Current top-performing algorithms are designed for, and evaluated on, static images (e.g., the PASCAL VOC dataset [32]). However, robots are often in situations where several objects (including the robot itself) may simultaneously be in motion. Non-linear motion blur, random orientation, and diversification of object scale make conditions for tightly bounded proposals challenging [3]. Furthermore, current top-performing algorithms are computationally expensive and are inadequate for achieving real-time performance [24].

One important consideration that current object proposal methods fail to address is how they actually perform when integrated with detectors. The standard metric evaluates proposal algorithms by comparing generated regions with labeled data using intersection over union (IoU) criteria to derive recall and area under the curve (AUC) scores [33]. However, we argue that these scores do not provide substantial evidence about real-world performance when applied to state-of-the-art detectors and naturalistic, robot-centric data (c.f. Torralba and Efros [34]).

The literature on object proposal algorithms with regard to robotics is still fairly recent. Consequently, little work has been done to include modalities beyond RGB [35], [36].

To this end, we approach the design of a novel region cropping algorithm with several observations. To combat problems with motion blur and occlusion, sparser and larger region candidates may allow for greater flexibility than tightly bounded regions. We also posit that regions containing several objects at once will effectively eliminate overlapping proposals and redundant detector processes. Lastly, the addition of a depth modality, in contrast to solely RGB-based methods, will allow for faster region-of-interest generation.

III. SALIENT DEPTH PARTITIONING

SDP applies a multimodal approach to RGB-D images with the observation that adjacent pixels do not drastically vary in depth when those pixels belong to the same object.

The main advantage of utilizing depth images for segmentation is that they do not contain textural information about people and objects, but instead are adept at relaying structural details about the scene [23]. This allows edges to establish clear delineations between objects of varying depth. Using edge detection, we are able to recover ROIs that contain strongly formed edges, or locations in the image that are most likely to contain people and objects. In contrast, large regions with uniform and gradual depths (e.g., walls, floors, buildings, sky) do not contain strong edges.

SDP does not require previous knowledge about the environment, nor does it require training. Instead, it adapts to what the robot sees. Furthermore, our method can be applied to virtually any detection algorithm, provided that the input is a calibrated RGB-D image. In the remainder of this section, we describe our algorithm (depicted in Figure 1).

A. Faster Edge Detection

Edges describe abrupt variations in depth and can be used as predictors of object locations. Thus, we can also leverage detected edges to filter out large non-object regions such as roads, floors, walls, and the sky. However, edge detection and filtering add considerable overhead to processing time, especially when they are applied via convolution. To reduce the number of computations, the depth map is first downsampled to a lower resolution.

Given an image Irgb and its corresponding depth map, Id, we downscale Id by a factor, γ. The amount of downscaling largely depends on the resolution of the image. In the extreme case, downscaling the image by too much will yield a poor quality edge map because the image will be too small to be operated on by the convolution filter.

For images of 1280 × 720 resolution, we systematically tested the scale size to select γ = 4, which we found to provide an optimal balance of image preservation and mathematical simplicity (the dimensions remain wholly divisible).

Finally, we apply an edge detector to I′d. In our implementation, we adopted the 3 × 3 Sobel kernel for its low computation cost and adequate performance in comparison to some state-of-the-art edge detectors [37]. We also tested our algorithm with better performing edge detectors (e.g., Canny [38] and Laplacian of Gaussian), but found that they negatively affected the runtime of SDP.
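To make the downscale-then-Sobel step concrete, the following sketch applies the same two operations with NumPy. It is a hypothetical illustration (the paper's implementation is in MATLAB): `downscale` and `sobel_edges` are our own names, decimation stands in for whatever resampling the authors used, and the example image is synthetic.

```python
import numpy as np

def downscale(depth, gamma=4):
    """Downscale a depth map by an integer factor via simple decimation."""
    return depth[::gamma, ::gamma]

def sobel_edges(img):
    """3x3 Sobel gradient magnitude via explicit valid-mode correlation."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    return np.hypot(gx, gy)

# Synthetic 1280 x 720 depth map with a depth step (an object boundary).
depth = np.zeros((720, 1280))
depth[:, 640:] = 5.0
small = downscale(depth)   # gamma = 4 -> 180 x 320
edges = sobel_edges(small)  # edges appear only near the depth step
```

Only the step boundary produces edge responses; the large uniform-depth regions stay empty, which is exactly what the subsequent filtering stage exploits.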

B. Edge Filtering

After edge detection, the resultant edge map may still contain weakly formed edges due to noise. To remedy this, I′d is divided into patches of size 5 × 5 pixels. The density of each patch is then found by computing the sum of its contents. If the density of any patch exceeds the edge sensitivity threshold, ρ, the contents of that patch are not modified. Conversely, if the density of the patch does not exceed ρ, the contents of the partition are erased to produce a filtered edge map, P_{I′d}. Edge filtering is given by Equation 1.

$$
P_{I'_d}(n, m) =
\begin{cases}
I'_d[\alpha, \beta] & \text{if } \sum I'_d[\alpha, \beta] > \rho \\
\varnothing & \text{otherwise}
\end{cases}
\tag{1}
$$

where $I'_d$ has size $N \times M$; $\alpha = (5n - 4, \dots, 5n)$, $\beta = (5m - 4, \dots, 5m)$, $\forall n = 1, \dots, \frac{N}{5}$, $\forall m = 1, \dots, \frac{M}{5}$.

P_{I′d}(n, m) is a 5 × 5 patch of I′d.
I′d[α, β] is a submatrix of I′d, bounded by rows α and columns β.
ρ is the edge sensitivity threshold.

This method allows us to achieve localized image filtering with low computational cost. ρ describes the threshold ratio of edge pixels to background pixels of the depth image, and its value is selected to remove weakly formed edges. The impact of the parameter ρ is discussed in Section V.
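The patch-wise filtering of Equation 1 can be sketched compactly with a blocked reshape. This is a hypothetical NumPy illustration, not the authors' code; `filter_edges` is our own name, and we assume the edge map dimensions are divisible by 5.

```python
import numpy as np

def filter_edges(edge_map, rho):
    """Erase every 5x5 patch whose summed edge content does not exceed rho
    (a sketch of Equation 1; dimensions assumed divisible by 5)."""
    n, m = edge_map.shape
    out = edge_map.astype(float)
    # View the map as a grid of 5x5 patches: (n/5, 5, m/5, 5).
    patches = out.reshape(n // 5, 5, m // 5, 5)
    density = patches.sum(axis=(1, 3))             # per-patch edge density
    # Keep a patch only when its density exceeds rho (in-place on the view).
    patches *= (density > rho)[:, None, :, None]
    return out

# A dense patch survives; an isolated noise pixel is erased.
em = np.zeros((10, 10))
em[:5, :5] = 1.0   # strong 5x5 patch, density 25
em[7, 7] = 1.0     # lone noise pixel, density 1
filtered = filter_edges(em, rho=2)
```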

C. ROI Partitioning

Next, we find the minimum and maximum rows and columns of P containing non-zero elements, to generate a bounded subregion, ROI. ROI is given by Equation 2.

$$ROI = P[R, C] \tag{2}$$

ROI is a subregion of P (of size N × M) bounded by the minimum and maximum indices of the row and column vectors R and C, respectively.
R is a vector of row indices of P that contain non-zero values.
C is a vector of column indices of P that contain non-zero values.
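The bounding operation of Equation 2 amounts to cropping to the extremal non-zero rows and columns. A hypothetical NumPy sketch (`bound_roi` is our own name; the empty-map behavior is our own assumption):

```python
import numpy as np

def bound_roi(P):
    """Crop the filtered edge map to the minimal box containing all
    non-zero elements (a sketch of Equation 2)."""
    rows, cols = np.nonzero(P)
    if rows.size == 0:
        return P[0:0, 0:0]  # no edges -> empty region (our convention)
    return P[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

P = np.zeros((8, 8))
P[2, 3] = 1
P[5, 6] = 1
roi = bound_roi(P)  # rows 2..5, cols 3..6 -> a 4x4 subregion
```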

ROI may still contain large subregions containing no edges, which may be removed by creating smaller partitions.

To accommodate a detector’s ability to detect objects of minimum sizes, smaller partitions that cannot possibly contain detectable objects are removed. For example, if a detector is only able to detect objects of minimum size 128 × 64 pixels, we remove partitions measuring a height or width less than 128 and 64 pixels, respectively.

We achieve subpartitioning using a simple plane-sweep technique to detect the sparsity of empty row or column vectors in partition ROI, and use an efficient data structure to dynamically store region boundaries. The subpartitioning subroutine is presented in Algorithm 1.

Finally, the subpartitions are upscaled by γ to generate proposals that are consistent with the original image size.

IV. EXPERIMENTAL EVALUATION

SDP is designed to decrease the runtime of any object detection algorithm while preserving its accuracy. However, it would be impractical to evaluate the effects of SDP on all of them. Instead, we test SDP within the context of pedestrian detection, which we argue is a suitably noisy testbed that allows us to evaluate it for robustness.

In our experiment, we selected four state-of-the-art pedestrian detection algorithms: Histogram of Oriented Gradients (HOG) and SVM, Aggregate Channel Features (ACF), Checkerboards, and R-CNN.

Despite its age, Dalal and Triggs’s HOG [1] is still widely used in modern detection algorithms, and is regarded by many as the best standalone feature descriptor [39]. The HOG and SVM detector exploits gradient features, which are discretized to enable greater detector immunity when objects are arbitrarily oriented.

Stemming from gradient-based approaches, Dollár et al. [2] proposed ACF, which runs significantly faster than, and as accurately as, state-of-the-art object detectors. Their method derives feature pyramids to be automatically extrapolated


Algorithm 1: Subpartitioning(ROI, T)
Create partitions from a binary image by detecting the sparsity of empty regions.

Input:  ROI, a binary image mask of size N × M.
        T, the detector's minimum detection dimension.
Output: S, a list containing the row or column indices of subpartitions.

sweep_list = {}          // indices of consecutive zero-valued columns (or rows)
flag_start_count = false // if true, counting of consecutive zero-valued columns (or rows) has begun
empty_count = 0          // number of consecutive zero-valued columns (or rows)

for m = 1 to M do
    if empty_count > T then
        if flag_start_count = true then
            S.append(min(sweep_list))
            flag_start_count = false
        end
        if sum(ROI[:, m]) > 0 then
            S.append(max(sweep_list))
            flag_start_count = true
            sweep_list = {}
        end
    end
    if sum(ROI[:, m]) = 0 then
        sweep_list.append(m)
        empty_count = empty_count + 1
    else
        empty_count = 0
        sweep_list = {}
    end
end
return S
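One way to realize this plane sweep in Python is sketched below. This is a hedged re-expression of Algorithm 1's idea, not the authors' code: `subpartition` is our own name, and it returns (start, end) column spans directly rather than the flat boundary-index list the pseudocode builds.

```python
import numpy as np

def subpartition(roi, T):
    """Split a binary mask into column spans separated by runs of more than
    T consecutive empty columns (a re-expression of Algorithm 1's sweep)."""
    occupied = roi.sum(axis=0) > 0  # which columns contain any edges
    spans, start, gap = [], None, 0
    for m, occ in enumerate(occupied):
        if occ:
            if start is None:
                start = m  # open a new span at the first occupied column
            gap = 0
        elif start is not None:
            gap += 1
            if gap > T:  # gap too wide to contain an object: close the span
                spans.append((start, m - gap))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(occupied) - 1 - gap))
    return spans

# Two edge clusters separated by 8 empty columns.
roi = np.zeros((5, 13))
roi[:, 0:3] = 1
roi[:, 11:13] = 1
```

With T = 3 the wide gap splits the mask into two spans; with T = 20 the gap is tolerated and a single span covers both clusters. Spans narrower than the detector's minimum width would then be discarded, and the survivors upscaled by γ.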

from nearby scales. This allows for quick feature approximations, rather than having explicit, time-consuming computations performed at every scale.

Zhang et al. [3], [39] introduced the Checkerboards algorithm, using Filtered Channel Features as an extension of ACF. Using features such as an alternate color space (CIELUV) and HOG, they were able to show that their algorithm produced improved precision and recall scores over the current top-performing pedestrian detectors.

Lately, Convolutional Neural Networks (CNNs) have gained prominence in the computer vision and robotics communities due to their adeptness at deriving highly discriminant features. As a result, they have been shown to achieve high recall and precision scores when used for purposes like object and pedestrian detection. To increase the breadth of our study, we tested SDP with the Region-based Convolutional Network (R-CNN) proposed by Girshick et al. [4], along with its pedestrian-trained variant, made public by Tome et al. [40].

A. Data Acquisition and Experimental Methodology

We required depth images of formidable quality, calibrated to the RGB images, to test our algorithm. However, existing datasets (such as PASCAL VOC [32], INRIA [1], ETH [41], TUD-Brussels [42], and Caltech [43]) do not contain spatiotemporal RGB-D images of pedestrians.

Therefore, we created our own RGB-D pedestrian dataset. Video was recorded with the ZED stereo camera system [44] at 1280 × 720 resolution at approximately 8 frames per second. We note here that we used the ZED camera as an alternative to the Microsoft Kinect, because it was the most economical camera with the farthest depth-ranging ability (up to 30 meters) at the time of writing.

Additionally, we wanted to strictly control for consistent motion and speed across both indoor and outdoor data. To simulate mobile robot vision, we attached the ZED to a hand-wheeled cart which was pushed at a constant walking pace. The camera was also panned to mimic active vision.

Our dataset consists of images recorded in two hallways and at two outdoor locations. Each location consists of unique geometric landscapes, corridors, occlusive objects, lighting, and variable foot traffic to simulate environments that an autonomous mobile robot may traverse. In total, our dataset contains approximately 3400 RGB-D images.

To obtain ground truth labels, every video frame was annotated with bounding boxes around the location of all visible upright humans in the scene.

For our experiments, we used a laptop equipped with an Intel i7-6700 CPU and 8 GB RAM. We selected this platform for its compact size, which could reasonably be used on a mobile robotic platform. All experiments were conducted in MATLAB solely using CPU resources.

V. RESULTS

Here, we present our results for SDP with regard to execution time, detection accuracy, and computational cost, and discuss their implications in Section VI.

A. Execution Time

We measured the cumulative execution time of SDP fused with each of the baseline algorithms (HOG and SVM, ACF, Checkerboards, R-CNN). The threshold parameter, ρ, was incrementally varied to show the performance cutoff as edge sensitivity is decreased. Each of the baseline algorithms was also tested independently of SDP (shown in Figure 2).

We found that SDP reduces the execution time of HOG and SVM, ACF, and Checkerboards by up to 30%. Our results show that execution time is inversely proportional to the value of ρ, which demonstrates that SDP was able to speed up these detectors without modification.

Interestingly, when compared with the other detectors, we found that SDP produced the opposite effect for R-CNN, where execution time is proportional to ρ.

B. Detection Accuracy

Fig. 2. Cumulative execution times of several state-of-the-art detection algorithms integrated with SDP, for varying values of ρ; lower execution time is better. Note that we only select ρ values that depict interesting changes in detection behavior. The baseline algorithms (detection algorithms without SDP integration) are depicted in red, and do not have a ρ value.

Fig. 3. Detection accuracy vs. intersection over union of several state-of-the-art detection algorithms integrated with SDP, for varying values of ρ. These curves correspond to the execution times shown in Figure 2. The baseline algorithms (detection algorithms without SDP integration) are depicted in red, and do not have a ρ value. A ρ curve that more closely matches the baseline curve means lower degradation in detection performance.

We evaluated the detection accuracy of SDP to account for the detectors’ ability to both generate correct predictions and minimally generate false positives. Here, we measure the detection accuracy of SDP fused with each of the baseline algorithms on our dataset, using the IoU bounding-box criteria found in [25], [32]. We also apply the same evaluation to each of the baseline algorithms independently of SDP. To account for variability in human poses (e.g., outstretched arms, differences in gait), we select a base δ value of 0.5, the standard value in the pedestrian tracking literature [3]. The IoU metric is given by Equation 3.

$$\frac{|P \cap GT|}{|P \cup GT|} \geq \delta, \qquad 0 \leq \delta \leq 1 \tag{3}$$

P is a rectangular bounding box predicted by a detector.
GT is a ground truth rectangular bounding box.
δ is the IoU threshold.

In IoU, predicted and labeled bounding boxes are paired according to best-fit linear assignment, solved using the Jonker-Volgenant algorithm [45]. If the IoU between a pair of bounding boxes equals or exceeds δ, the prediction qualifies as a true positive; otherwise, an IoU score less than the threshold designates the prediction a false positive. False negatives correspond to labeled bounding boxes that do not have prediction pairings, and are otherwise missed detections. A true negative corresponds to no pedestrian predictions in the case that no ground truth labels exist; in this case, no pedestrians are in the image, and the detector does not make a prediction. Using these criteria, we define detection accuracy in Equation 4.

$$ACC = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

TP is the total number of true positives.
TN is the total number of true negatives.
FP is the total number of false positives.
FN is the total number of false negatives.
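Equations 3 and 4 can be sketched together as follows. This is a hypothetical Python illustration: `scipy.optimize.linear_sum_assignment` stands in for the Jonker-Volgenant solver, and true negatives are simplified to the single empty-image case rather than being accumulated over a whole dataset.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def accuracy(preds, gts, delta=0.5):
    """Match predictions to ground truth by maximum total IoU, then score
    with Equation 4 (single-image sketch: TN only when both lists are empty)."""
    if not preds and not gts:
        return 1.0  # one true negative, nothing else
    tp = 0
    if preds and gts:
        cost = np.array([[-iou(p, g) for g in gts] for p in preds])
        rows, cols = linear_sum_assignment(cost)  # best-fit linear assignment
        tp = sum(1 for r, c in zip(rows, cols) if -cost[r, c] >= delta)
    fp, fn = len(preds) - tp, len(gts) - tp
    return tp / (tp + fp + fn)
```

For example, one prediction overlapping a label with IoU 0.81 plus one spurious prediction yields TP = 1, FP = 1, FN = 0, and hence an accuracy of 0.5 at δ = 0.5.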

SDP differs from proposal algorithms, and is instead characterized as a region cropping algorithm. Rather than generate a number of object proposals, our algorithm generates sparse regions containing multiple objects.

Consequently, we cannot apply the same metrics (i.e., recall/AUC/detection accuracy vs. number of proposals) as those of proposal algorithms. Instead, we measure detection accuracy as a function of ρ, the edge threshold sensitivity (previously discussed in Section III), which directly correlates to the quality of proposed regions.

Furthermore, our algorithm relies extensively on depth information, so the performance of SDP also depends on the camera’s ability to capture depth at longer ranges. Therefore, to test the performance of our algorithm, we remove predictions and ground truth labels of pedestrians captured beyond the range limitations of our camera. From our measurements, we estimate that the ZED can capture the depth of pedestrians within a 13-meter distance at 1280 × 720 resolution, at which distance a pedestrian measures approximately 154 × 64 pixels. Figure 4 depicts the depth limitations of the ZED camera.

Fig. 4. RGB-D images depicting the limitations of the ZED camera at various ranges. A pedestrian is observed from the ZED stereo camera at 5 m (left), 10 m (center), and 15 m (right); the pedestrian is no longer visible in the depth map at 15 m.

Figure 3 presents the detection accuracy of SDP in conjunction with each of the selected algorithms; the baseline detection accuracies are also shown. Across all algorithms, there was minimal degradation in performance for ρ = 5, which suggests that SDP does not affect detection accuracy for ρ < 6. There is significant degradation for all algorithms at ρ > 5, which is to be expected, because the amount of image cropping (and the number of cropped pedestrians) is directly proportional to ρ.

C. Computational Throughput

To measure the computational throughput of SDP, we executed it on our dataset without an object detector; in this test, regions are computed, but no classification is performed. Furthermore, we do not take into account the camera frame rate, which we consider to be independent of the algorithm.

We report an average throughput of approximately 77 frames per second (approximately 13 ms per image), which is suitable for robots and real-time purposes.
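A minimal timing harness of this kind might look like the following; the `process` callable stands in for SDP's region computation and is an assumption:

```python
import time

def mean_throughput(process, frames):
    """Average frames per second of `process` over `frames`,
    excluding camera capture time (treated as independent of
    the algorithm under test)."""
    start = time.perf_counter()
    for frame in frames:
        process(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

Averaging over the full dataset, rather than a single frame, smooths out per-frame variation in region complexity.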

VI. DISCUSSION

In this paper, we introduced SDP, and showed that it substantially decreases the execution time of several state-of-the-art pedestrian detectors, without affecting detection accuracy. Thus, we expect that other detectors can also be optimized using region segmentation strategies.

When combined with SDP, the execution times of HOG and SVM, ACF, and Checkerboards are significantly reduced. We discovered that the proportional difference in execution time depends on the number of sliding window computations that each algorithm performs. For example, HOG and SVM uses the most sliding window operations of all the algorithms in our evaluation, so its execution time benefits most when more of the image is cropped out.

Contrary to what we expected, the execution time of R-CNN increases proportionally with ρ. We believe this to be caused by R-CNN's built-in proposal generator, which produces extraneous proposals when the original image is partitioned. Inevitably, the computation time increases because classification is performed on each proposal.

However, it is interesting to see that R-CNN gained a slight increase in detection performance. This strongly suggests that a greater number of false positives are removed when more of the image is cropped (without sacrificing the integrity of the detector), which supports the hypothesis found in [24].

Detection accuracy degradation was highest for HOG and SVM. We postulate that this effect occurs when gradient features neighboring pedestrian locations are removed by region cropping. Although actual pedestrians are not removed from the image, region cropping can affect classification when localized histograms are altered.

Combined with SDP, ACF yielded lower degradation in detection accuracy at ρ = 5 than at ρ = 1. This is contrary to intuition, where a larger amount of pruning would be expected to yield lower detection accuracy. However, in the ACF test we found a lower number of false positives for ρ = 5 than for ρ = 1. We attribute this to the same effect (false positives removed by region cropping) that was observed in the R-CNN and SDP evaluation.

This suggests that the image features extracted by Checkerboards are more invariant to region cropping, and that SDP decreased the runtime of Checkerboards while preserving the optimal number of detectable pedestrians.

VII. LIMITATIONS

One limitation of our work is that our experiments were only studied within the context of pedestrian detection, which we believe to be a noisy testbed for evaluating the robustness of our algorithm. However, our results are encouraging, which motivates us to explore how SDP can be integrated into general object detection algorithms in our future work.

Furthermore, our experimental results showed that SDP does not perform well with detectors that use an integrated object proposal generator (i.e., R-CNN). However, our results could differ greatly if we had direct control over the number of proposals that R-CNN generates, which we also plan to investigate in a future study.

VIII. FUTURE WORK

In summary, we introduced SDP and demonstrated that detection algorithms can be substantially improved by decreasing computation time while sacrificing little to no detection performance. We observed that the computational overhead of SDP is negligibly low (it sustains approximately 77 frames per second without classification), and that it can be easily adapted to any RGB-D vision pipeline.

In our future work, we intend to test SDP on new RGB-D cameras as they become commercially available, to see how well our algorithm will scale to more precise depth data.



Additionally, we plan to use SDP as a pre-processing step for existing low-power, low-computation algorithms for mobile robots, such as UV sky segmentation [46], RatSLAM [47], and SeqSLAM [48]. By studying our algorithm in conjunction with these other algorithms, we can gain a better sense of performance when robots exhibit different types of motion.

Other possible areas for exploration include incorporating a probabilistic framework that combines saliency techniques [20] with our algorithm. This will improve our method by combining RGB and depth modalities to formulate stronger hypotheses for candidate regions of interest. Optimizations for pedestrian detection can be further enhanced using trajectory prediction and motion priors, which can reduce the image search space for previously detected humans and provide greater immunity to occlusion.

REFERENCES

[1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2005.

[2] P. Dollar, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids for object detection,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2014.

[3] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele, “How far are we from solving pedestrian detection?” arXiv preprint arXiv:1602.01237, 2016.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional networks for accurate object detection and segmentation,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2016.

[5] H. Kjellstrom, D. Kragic, and M. J. Black, “Tracking people interacting with objects,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2010.

[6] L. C. Jain and M. N. Favorskaya, “Practical matters in computer vision,” in Comp. Vis. in Control Sys.-2, 2015.

[7] A. Nigam and L. D. Riek, “Social context perception for mobile robots,” in IEEE/RSJ Intern. Conf. on Intell. Rob. and Sys. (IROS), 2015.

[8] W. Choi, C. Pantofaru, and S. Savarese, “A general framework for tracking multiple people from a moving camera,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2013.

[9] T. Iqbal, S. Rack, and L. D. Riek, “Movement coordination in human–robot teams: A dynamical systems approach,” IEEE Trans. on Rob., 2016.

[10] T. Iqbal and L. D. Riek, “Coordination dynamics in multihuman multirobot teams,” IEEE Rob. and Auto. Letters, 2017.

[11] G. Ros, A. Sappa, D. Ponsa, and A. M. Lopez, “Visual slam for driverless cars: A brief survey,” in Intell. Vehicles Symp. (IV) Workshops, 2012.

[12] M. Enzweiler and D. M. Gavrila, “Monocular pedestrian detection: Survey and experiments,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2009.

[13] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, “Survey of pedestrian detection for advanced driver assistance systems,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2010.

[14] M. Gabb, O. Lohlein, R. Wagner, A. Westenberger, M. Fritzsche, and K. Dietmayer, “High-performance on-road vehicle detection in monocular images,” in IEEE 16th Intern. Conf. on Intell. Transportation Sys. (ITSC), 2013.

[15] R. Brehar and S. Nedevschi, “Scan window based pedestrian recognition methods improvement by search space and scale reduction,” in IEEE Intell. Vehicles Sym. Proc., 2014.

[16] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2010.

[17] J. Carreira and C. Sminchisescu, “Constrained parametric min-cuts for automatic object segmentation,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2010.

[18] I. Endres and D. Hoiem, “Category independent object proposals,” in Euro. Conf. on Comp. Vis., 2010.

[19] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders, “Segmentation as selective search for object recognition,” in Intern. Conf. on Comp. Vis., 2011.

[20] L. Itti, C. Koch, E. Niebur et al., “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. on Patt. Anal. and Mach. Intell., 1998.

[21] L. Bose and A. Richards, “Fast depth edge detection and edge based rgb-d slam,” in IEEE Intern. Conf. on Rob. and Auto. (ICRA), 2016.

[22] M. Ghafarianzadeh, M. B. Blaschko, and G. Sibley, “Efficient, dense, object-based segmentation from rgbd video,” in IEEE Intern. Conf. on Rob. and Auto. (ICRA), 2016.

[23] G. Alenya, S. Foix, and C. Torras, “Using tof and rgbd cameras for 3d robot perception and manipulation in human environments,” Intell. Service Rob., 2014.

[24] J. Hosang, R. Benenson, and B. Schiele, “How good are detection proposals, really?” arXiv preprint arXiv:1406.6962, 2014.

[25] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object proposals from edges,” in Euro. Conf. on Comp. Vis., 2014.

[26] P. Dollar and C. L. Zitnick, “Fast edge detection using structured forests,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2015.

[27] M. Cheng, Z. Zhang, W. Lin, and P. Torr, “Bing: Binarized normed gradients for objectness estimation at 300fps,” in IEEE Conf. on Comp. Vis. and Patt. Rec., 2014.

[28] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” Intern. Journal of Comp. Vis., 2013.

[29] A. Humayun, F. Li, and J. M. Rehg, “Rigor: Reusing inference in graph cuts for generating object regions,” in IEEE Conf. on Comp. Vis. and Patt. Rec., 2014.

[30] I. Endres and D. Hoiem, “Category-independent object proposals with diverse ranking,” IEEE Trans. on Patt. Anal. and Mach. Intell., 2014.

[31] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2001.

[32] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Intern. Journal of Comp. Vis., 2010.

[33] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool, “A comparison of affine region detectors,” Intern. Journal of Comp. Vis., 2005.

[34] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2011.

[35] A. Karpathy, S. Miller, and L. Fei-Fei, “Object discovery in 3d scenes via shape analysis,” in IEEE Intern. Conf. on Rob. and Auto. (ICRA), 2013.

[36] A. K. Mishra, A. Shrivastava, and Y. Aloimonos, “Segmenting simple objects using rgb-d,” in IEEE Intern. Conf. on Rob. and Auto. (ICRA), 2012.

[37] J. Su, “Comparison of edge detectors,” in Intern. Conf. on Medical Imaging, Health and Emerging Comm. Systems (MedCom), 2001.

[38] J. Canny, “A computational approach to edge detection,” IEEE Trans. on Patt. Anal. and Mach. Intell., 1986.

[39] S. Zhang, R. Benenson, and B. Schiele, “Filtered channel features for pedestrian detection,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2015.

[40] D. Tome, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, and S. Tubaro, “Deep convolutional neural networks for pedestrian detection,” Signal Proc.: Image Comm., 2016.

[41] A. Ess, B. Leibe, K. Schindler, and L. V. Gool, “A mobile vision system for robust multi-person tracking,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2008.

[42] C. Wojek, S. Walk, and B. Schiele, “Multi-cue onboard pedestrian detection,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2009.

[43] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A benchmark,” in IEEE Conf. on Comp. Vis. and Patt. Rec. (CVPR), 2009.

[44] “Zed - stereo camera for depth sensing,” https://www.stereolabs.com/, accessed: 2016-1-12.

[45] R. Jonker and A. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” Computing, 1987.

[46] T. Stone, M. Mangan, P. Ardin, and B. Webb, “Sky segmentation with ultraviolet images can be used for navigation,” in Proc. Rob.: Sci. and Sys., 2014.

[47] M. J. Milford, G. F. Wyeth, and D. Rasser, “Ratslam: a hippocampal model for simultaneous localization and mapping,” in IEEE Intern. Conf. on Rob. and Auto. (ICRA), 2004.

[48] M. J. Milford and G. F. Wyeth, “Seqslam: Visual route-based navigation for sunny summer days and stormy winter nights,” in IEEE Intern. Conf. on Rob. and Auto. (ICRA), 2012.
