Download pdf - A Semantic Occlusion Model for Human Pose … Semantic Occlusion Model for Human Pose Estimation from a Single Depth Image Umer Raﬁ RWTH Aachen University Germany [email protected]

A Semantic Occlusion Model for Human Pose Estimation from a Single DepthImage

Umer RafiRWTH Aachen University

[email protected]

Juergen GallUniversity of Bonn

[email protected]

Bastian LeibeRWTH Aachen University

[email protected]

Abstract

Human pose estimation from depth data has made sig-nificant progress in recent years and commercial sensorsestimate human poses in real-time. However, state-of-the-art methods fail in many situations when the humans arepartially occluded by objects. In this work, we introduce asemantic occlusion model that is incorporated into a regres-sion forest approach for human pose estimation from depthdata. The approach exploits the context information of oc-cluding objects like a table to predict the locations of oc-cluded joints. In our experiments on synthetic and real data,we show that our occlusion model increases the joint esti-mation accuracy and outperforms the commercial Kinect 2SDK for occluded joints.

1. Introduction

Human pose estimation from depth data has made sig-nificant progress in recent years. One success story is thecommercially available Kinect system [16], which is basedon [20] and provides high-quality body joints predictions inreal time. However, the system works under the assumptionthat the humans can be well segmented. This assumptionis valid for gaming application for which the device wasdeveloped. Many computer vision applications, however,required human pose estimation in more general environ-ments where objects often occlude some body parts. In thiscase, the current SDK for Kinect 2 [14] fails to estimate thepartially occluded body parts. An example is shown in Fig-ure 1(a) where the joints of the left leg of the person arewrongly estimated.

In this work, we address the problem of estimating hu-man pose in the context of occlusions. Since for most ap-plications it is more practical to have always the entire poseand not only the visible joints, we aim to predict the loca-tions of all joints even if they are occluded. To this end, webuild on the work from [11], which estimates human pose

from depth data using a regression forest. Similar to theSDK, [11] works well for visible body parts but it fails toestimate the joint locations of partially occluded parts sinceit does not model occlusions. We therefore extend the ap-proach by an occlusion model. Objects, however, not onlyocclude body parts but they also provide some informationabout the expected pose. For instance, when a person is sit-ting at a table as in Figure 1, some joints are occluded buthumans can estimate the locations of the occluded joints.The same is true if the hands are occluded by a laptop ormonitor. In this case, humans can infer whether the per-son is using the occluded keyboard and they can predict thelocations of the occluded hands.

We therefore introduce a semantic occlusion model thatexploits the semantic context of occluding objects to im-prove human pose estimation from depth data. The model istrained on synthetic data where we use motion capture dataand 3D models of furniture. For evaluation, we recorded adataset of poses with occlusions1. The dataset was recordedfrom 7 subjects in 7 different rooms using the Kinect 2 sen-sor. In our experiments, we evaluate the impact of syntheticand real training data and show that our approach outper-forms [11]. We also compare our approach with the com-mercial pose estimation system that is part of Kinect 2. Al-though the commercial system achieves a higher detectionrate for visible joints since it uses much more training dataand highly engineered post-processing, the detection accu-racy for occluded joints of our approach is twice as high asthe accuracy of the Kinect 2 SDK.

2. Related Work3D Human Pose Estimation. Human Pose Estimation isa challenging problem in computer vision. It has applica-tions in gaming, human computer interaction and securityscenarios. It has generated a lot of research surveyed in[18]. There are variety of approaches that predict 3D human

1The dataset is available at http://www.vision.rwth-aachen.de/data.

1

http://www.vision.rwth-aachen.de/data

http://www.vision.rwth-aachen.de/data

(a) (b)

Figure 1: Comparison of Kinect 2 SDK and our approach. The SDK estimates the upper body correctly but the left leg iswrongly estimated due to the occluding table (a). Given the bounding box, our approach exploits the context of the table andpredicts the left leg correctly (b).

pose from monocular RGB images [1, 3, 15]. However, amajor caveat of using RGB data for inferring 3D pose is thatthe depth information is not available.

In recent years the availability of fast depth sensors hassignificantly reduced the depth information problem andfurther spurred the progress. Researchers have proposedtracking based approaches [13, 17, 8, 9]. These approacheswork in real time but operate by tracking 3D pose fromframe to frame. They cannot re-initialize quickly and areprone to tracking failures. Recently random forest based ap-proaches [11, 20, 21, 23] have been proposed. They predictthe 3D pose in super real time from a single depth imagecaptured by Kinect. These approaches work on monocu-lar depth images and therefore are more robust to trackingfailures. Some model based approaches [2, 26] have alsobeen proposed that fit a 3D mesh to image data to predictthe 3D pose. However, all these approaches require a well-segmented person to predict the 3D pose and will fail foroccluded joints when the person is partially occluded byobjects. The closest to our method is approach from [11]that uses regression forest to vote for the 3D pose and canhandle self-occlusions. Our approach on the other hand canalso handle occlusion from other objects.

Occlusion Handling. Explicit occluder handling hasbeen used in recent years to solve different problems incomputer vision. Girshick et al. [12] use grammar modelswith explicit occluder templates to reason about occludedpeople. The occlusion patterns needs to be specially de-signed in the grammer. Ghiasi et al. [10] automaticallylearn the different part-level occlusion patterns from datato reason about occlusion in people-people interactions by

using flexible mixture of parts [25]. Wang et al. [24] usepatch based Hough forests to learn object-object occlusionspatterns. In facial landmarks localization, recently someregression based approaches have been proposed that alsoincorporate occlusion handling for localizing occluded fa-cial landmarks [4, 27]. However, these approaches only usethe information from non-occluded landmarks in contrast toour approach that also uses the information from occludingobjects. Similar to [10, 24] our approach also learns the oc-clusions from data, however our approach learns occlusionsfor people-objects interactions in contrast to [10] that learnsocclusions for people-people interactions and the occludingobjects in our approach are classified purely on appearanceat test time and does not incorporate any knowledge aboutdistance from the object they are occluding as in [24].

3. Semantic Occlusion Model

In this work, we propose to integrate additional knowl-edge about occluding objects into a 3D pose estimationframework from depth data to improve the pose estimationaccuracy for occluded joints. To this end, we build on a re-gression framework for pose estimation that is based on re-gression forests [11]. We briefly discuss regression forestsfor pose estimation in Section 3.1. In Section 3.2 we extendthe approach for explicitly handling occlusions. In particu-lar, we predict the semantic label of an occluding object attest time and then use it as context to predict the position ofan invisible joint.

3.1. Regression Forests for Human Pose Estimation

Random Forests have been used in recent years for vari-ous regression tasks, e.g., for regressing the 3D human posefrom a single depth image [11, 20, 21, 23], estimating the2D human pose from a single image [7], or for predictingfacial features [6]. In this section, we briefly describe theapproach [11], which will be our baseline.

Regression forests belong to the family of random forestsand are ensembles of T regression trees. In the context ofhuman pose estimation from a depth image, they take aninput pixel q in a depth image D and predict the probabil-ity distribution over the locations of all joints in the image.Each tree t in the forest consists of split and leaf nodes. Ateach split node a binary split function φγ(q,D) 7→ {0, 1} isstored, which is parametrized by γ and evaluates a pixel lo-cation q in a depth image D. In this work, we use the depthcomparison features from [20]:

φγ(q,D) =

{1 if D

(q + u

D(q)

)−D

(q + v

D(q)

)> τ

0 otherwise.(1)

where the parameters γ = (u, v, τ) denote the offsets u, vfrom pixel q, which are scaled by the depth at pixel q tomake the features robust to depth changes. The thresholdconverts the depth difference into a binary value. Depend-ing on the value of φγ(q,D), (q,D) is send to the left orright child of the node.

During training each such split function is selected froma randomly generated pool of splitting functions. Thisset is evaluated on the set of training samples Q ={(q,D, c, {Vj})}, each consisting of a sampled pixel q ina sampled training image D, a class label c for the limb thepixel belongs to, and for each joint j the 3D offset vectorsVj = qj − q between pixel q and the joint position qj in theimage D.

Each sampled splitting function φ, partitions the train-ing data at the current node into the two subsets Q0(φ) andQ1(φ). The quality of a splitting function is then measuredby the information gain:

φ∗ = argmaxφ

g(φ), (2)

g(φ) = H(Q)−∑

s∈{0,1}

|Qs(φ)||Q|

H(Qs(φ)), (3)

H(Q) = −∑c

p(c|Q) log(p(c|Q)), (4)

whereH(Q) is the Shannon entropy and p(c|Q) the empiri-cal distribution of the class probabilities computed from theset Q. The training procedure continues recursively untilthe maximum allowed depth for a tree is reached.

At each leaf node l, the probabilities over 3D offset vec-tors Vj to each joint j, i.e., pj(V |l) are computed from the

training samples Q arriving at l. To this end, the vectorsVj are clustered by mean-shift with a Gaussian Kernel withbandwidth b and for each joint only the two largest clustersare kept for efficiency. If Vljk is the mode of the kth clus-ter for joint j at leaf node l, then the probability pj(V |l) isapproximated by

pj(V |l) ∝∑k∈K

wljk · exp

(−∥∥∥∥V − Vljkb

∥∥∥∥22

)(5)

where the weight of a cluster wljk is determined by thenumber of offset vectors that ended in the cluster. Clustercenters with ‖Vljk‖ > λj are removed since these corre-spond often to noise [11].

For pose estimation, pixels q from a depth image D aresampled and pushed through each tree in the forest untilthey reach a leaf node l. For each pixel q, votes for theabsolute location of a joint j are computed by xj = q+Vljk.In addition a confidence value that takes the depth of thepixel q into account is computed by wj = wljk · D2(q).The weighted votes for a joint j are collected for all pixelsq and form the set Xj = {(xj , wj)}. The probability of thelocation of a joint j in image D is then approximated by

pj(x|D) ∝∑

(xj ,wj)∈Xj

wj · exp

(−∥∥∥∥x− xjbj

∥∥∥∥22

)(6)

where bj is the bandwidth of the Gaussian Kernel learnedseparately for each joint j. As for training, the votes areclustered and only the clusters with the highest summedweights wj are used to predict the joint location.

3.2. Occlusion Aware Regression Forests (OARF)

In order to handle occlusions, we propose OcclusionAware Regression Forests (OARF) that build on the re-gression forest framework described in Section 3.1. Theypredict additionally the class label of an occluding objectat test time and then use this semantic knowledge aboutthe occluding object as context to improve the pose es-timation of occluded joints. To this end, we use an ex-tended set of training samples Qext = Q ∪ Qocc, whereQocc = {(qocc, D, cocc, {Vjocc})}, is a set of occluding ob-ject pixels where each pixel qocc is sampled from a trainingimage D, has a class label cocc and a set of of offset vectors{Vjocc} to each joint of interest j.

During training we use the depth comparison featuresdescribed in Section 3.1 for selecting a binary split func-tion φγ(q,D) at each split node in each tree. To select bi-nary split functions that can distinguish between occludingobjects and body parts we minimize the Shannon entropyH(Q) over extended set of class labels cext = c ∪ cocc :

H(Q) = −∑cext

p(cext|Q) log(p(cext|Q)), (7)

To use the semantic knowledge about occluding object asan additional clue for prediction of occluded joints, we alsostore at a leaf node l the probabilities over 3D offset vectorsVjocc to each joint j, i.e., pj(Vocc|l) that are computed fromthe training samples Qocc arriving at l by using the meanshift procedure described in Section 3.1. The probabilitypj(Vocc|l) is approximated by

pj(Vocc|l) ∝∑k∈K

wljk · exp

(−∥∥∥∥Vocc − Vljocckb

∥∥∥∥22

)(8)

The inference procedure is similar to standard regressionforest inference. At test time pixels qocc from occludingobjects in a test image D are also pushed through each treein the forest until they reach a leaf node l and cast a votexjocc = qocc + Vljocck with confidence wjocc for each jointj. The weighted votes for each joint j form the set Xj ={(xj , wj) ∪ (xjocc, wjocc)}. The final joint position is thenpredicted by using the mean shift procedure described inSection 3.1.

4. Training DataGathering real labeled training data for 3D human pose

estimation from depth images is expensive. To overcomethis, [20] generated a large synthetic database of depth im-ages of people covering a wide variety of poses togetherwith pixel annotations of body parts. For generating such adatabase, a large motion capture corpus of general humanactivities has been recorded. The body poses of the cor-pus were then retargeted to textured body meshes of vary-ing sizes. Since the dataset is not publicly available, wecaptured our own dataset.

Synthetic Data. We follow the same procedure. To thisend, we use the motion capture data for sitting and stand-ing poses from the publicly available CMU motion cap-ture database [5]. We retarget the poses to 6 textured bodymeshes using Poser [19], a commercially available anima-tion package, and generate a synthetic dataset of 1’110depth images of humans in different sitting and standingposes from different viewpoints. For each depth image, wealso have a pixel-wise body part labeling and the 3D jointpositions. The depth maps and body part labels are shownin Figure 2(a-b). For the occluding objects, we use the pub-licly available 3D models of tables and laptops from theSweet Home 3D Furniture Library [22]. We render the ob-ject together with the humans as shown in Figure 2(c). Thecompositions are randomized under the constraints that thetables and feet are on the ground plane, the laptops on thetables and the distance between the humans and the objectsis between 3-5 cm. For each composition, we compute theoccluded body parts (Figure 2(d)) and the depth and classlabels with occluding object classes Figure 2(e-f).

Real Data. We also recorded a dataset using the Kinect2 sensor. The dataset contains depth images of humans indifferent sitting and standing poses without occlusions asshown in Figure 2(a). The 3D poses are obtained by theKinect SDK and we discarded images were the SDK failed.This resulted in 2’552 images. The pixel-wise segmenta-tion of the body parts is obtained by the closest geodesicdistance of a surface point to the skeleton as shown in Fig-ure 2(b). The composition with synthetic 3D objects is doneas for the synthetic data.

5. Experiments and ResultsIn this section we describe in detail the experimental set-

tings used to evaluate our method and report quantitativeand qualitative results. For comparison, we consider threeapproaches:

1. The approach [11] described in Section 3.1 is our base-line. It is trained on the training data without occludingobjects shown in Figure 2(a-b).

2. The occlusion aware regression forest (OARF W se-mantics) described in Section 3.2 is trained with thesemantic labels of occluding objects shown in Fig-ure 2(e-f).

3. To show the impact of the semantic information of theoccluding objects, we also train an OARF by assigningall occluding objects a single label and not the labelsof the object classes (OARF W/O semantics).

Training. We train all forests with the same parameters.For each training depth image, we sample 1000 pixel loca-tions. The other parameters are used as in [11], i.e., the re-gression forests consist of 3 trees with maximum depth 20.For each splitting node, we sample 2000 depth comparisonfeatures and use b = 0.05m.

Testing. For testing our approach, we use synthetic andreal data. The real dataset is recorded in different indoor en-vironments, offices and living rooms, and consists of sevensequences of different subjects in different sitting and stand-ing poses. From each sequence, we select a set of uniqueposes thus providing us with a total of 1000 images. Wesplit the 1000 images into test and validation set with 800and 200 images, respectively. While bj were set as pro-posed in [11], we observed that the values for λj proposedin [11] are not optimal for our baseline. We therefore esti-mate them on the validation set. The synthetic test set con-sists of 440 images of 2 subjects. The subjects, occludingobjects and the poses in both test sets are different from theones in the training set. We report quantitative results onreal and synthetic test sets for the 15 body joint positions ofour skeleton.

(a) (b) (c) (d) (e) (f)

Figure 2: Procedure for generating depth images with pixel level ground truth body parts labels and occluding objects maskswith class labels. Synthetic data (top row) and real data (bottom row): (a) Rendered or captured and segmented depth image.(b) Body part labels at pixel level. (c) Depth data with object. (d) Occluded body parts. (e) Segmented depth map withoccluding objects. (f) Combined labels of body parts and occluding objects.

Method Mean Average Precision

Baseline 44%OARF W/O semantics 48%OARF W semantics 54%

Table 1: Mean average precision of the 3D body joints pre-dictions on the synthetic test set by using the evaluationmeasure from [20] with a distance threshold of 10 cm. Theresults show that integrating semantic knowledge about oc-cluding objects provides a significant improvement over thebaseline.

Synthetic Test Data. For quantitative evaluation on syn-thetic data, we use the evaluation measure from [20] witha distance threshold of 10 cm and report the mean averageprecision over the 3D body joints predictions in Table 1. In-tegrating the additional knowledge about occluding objectswithout object class labels alone provides 4% improvementover the baseline. When we also integrate semantic knowl-edge about occluding objects then this provides a significantimprovement of 10% over the baseline. This shows that oc-clusion handling is beneficial for pose estimation, but alsothat the semantic context of occluding objects contains im-portant information about joint locations.

Real Test Data. For real test data, it is difficult to get 3Dground truth body joint positions. For the quantitative eval-uation, we therefore manually labeled the 2D body jointpositions. Since manual annotations of the 2D positionsof body joints, especially occluded joints, in depth imagesare sometimes noisy, we used the mean annotations of threedifferent annotators as ground truth. For the quantitativeevaluation on the test data, we use the evaluation measure

Setting Average Detection Accuracy(%)

Occluded Non Occluded AllJoints Joints Joints

Synthetic Data

Baseline 22.17 50.51 44.54OARF W/O semantics 24.36 52.57 46.62OARF W semantics 25.42 52.43 46.73

Real Data


Real + Synthetic Data


Kinect SDK 18.13 66.36 56.94

Table 2: Average detection accuracy of 2D body joint pre-dictions on the real test set measured by the evaluation mea-sure from [7] with an error threshold of 0.1 of upper bodysize.

from [7] with an error threshold of 0.1 of upper body size toreport average detection accuracy over 2D body joint pre-dictions.

In Table 2, the results for the baseline, OARF with andwithout semantics are reported. We also evaluate the impactof the synthetic and real training data. The results show thatthe synthetic training data is not as good as the real trainingdata. However, if we combine real and synthetic data weget another boost of performance. For all three training set-tings, the numbers support the results of the synthetic test

set. The baseline is improved by occlusion handling andsemantic occlusion handling achieves the best result of thethree methods. The semantic occlusion model mainly im-proves the accuracy of occluded joints, but there is also aslight improvement of non-occluded joints. Without occlu-sion handling, objects close to joints introduce some noisyvotes that can result in wrong estimates. The occlusion han-dling reduces this effect. We also compared our results tothe Kinect SDK. The Kinect SDK achieves a higher over-all accuracy since it is trained on much more training dataand the SDK includes additional post-processing, which isnot part of our baseline. However, our approach achieves amuch better accuracy for the occluded joints.

Table 3 presents detailed quantitative results for 11 bodyjoints that are occluded in the real test set. The results arereported for OARF trained on real and synthetic data and forthe Kinect SDK. The results show that for most joints ourmodel achieves a better accuracy than the Kinect SDK andadding semantic knowledge provides further improvements.There are only three joints, namely the right elbow and theankles, with a low accuracy. This can be explained by thetraining data that contains only few examples where thesejoints are occluded.

Qualitative Results. We show some qualitative results onour real test set in Figure 3 and a few failure cases in Figure4. The top row shows the body joints predicted by OARFwith the semantic occlusion model, the middle row showsthe body joints predicted by OARF without the semanticocclusion model and the bottom row shows the body jointspredicted by the Kinect SDK.

6. Conclusion

In this paper, we have proposed an approach that inte-grates additional knowledge about occluding objects intoan existing 3D pose estimation framework from depth data.We have shown that occluding objects not only need to bedetected to avoid noisy estimates, but also that the seman-tic information of occluding objects is a valuable source forpredicting occluded joints. Although our experiments al-ready indicate the potential of the approach and outperforma commercial SDK already for occluded joints, the overallperformance can still be boosted by increasing the varietyof objects and poses in the training data.

Acknowledgment: The work in this paper was fundedby the EU projects STRANDS (ICT-2011-600623) andSPENCER (ICT-2011-600877). Juergen Gall was sup-ported by the DFG Emmy Noether program (GA 1927/1-1).

Joint Per Joint Detection Accuracy(%)

OARF OARF Kinect SDKW semantics W/O semantics

Spine 77.51 73.9 42.57Left Elbow 76.47 73.53 47.06Right Elbow 17.5 10.53 71.93Left Hand 47.59 36.55 28.28Right Hand 59.38 39.58 18.23Left Hip 50.53 48.76 10.60Right Hip 75.42 71.25 22.08Left Knee 31.64 31.64 22.60Right Knee 27.27 29.09 12.73Left Ankle 3.87 4.9 5.15Right Ankle 5.44 5.07 7.88

Average 35.77 32.60 18.13

Table 3: Per joint detection accuracy of 11 occluded bodyjoints in our real test set by using the evaluation measurefrom [7] with an error threshold of 0.1 of upper body size.The results are reported for OARF with and without seman-tics and for the Kinect SDK.

References[1] A. Agarwal and B. Triggs. Recovering 3d Human Pose from

Monocular Image. PAMI, 28(1):44–58, 2006. 2[2] A. Baak, M. Muller, G. Bharaj, H. P. Seidel, and C. Theobal.

A data-driven approach for real-time full body pose recon-struction from a depth camera. In ICCV, 2011. 2

[3] L. Bo, C. Sminchisescu, A. Kanaujia, and D. Metaxas. FastAlgorithms for Large Scale Conditional 3d Prediction. InCVPR, 2008. 2

[4] X. P. Burgos-Artizzu, P. Perona, and P. Dollar. Robust facelandmark estimation under occlusion. In ICCV, 2013. 2

[5] CMU Mocap. http://mocap.cs.cmu.edu/. 4[6] M. Dantone, J. Gall, G. Fanelli, and L. van Gool. Real-

time Facial Feature Detection using Conditional RegressionForests. In CVPR, 2012. 3

[7] M. Dantone, J. Gall, C. Leistner, and L. van Gool. HumanPose Estimation using Body Parts Dependent Joint Regres-sors. In CVPR, 2013. 3, 5, 6

[8] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. Realtime motion capture using a single time-of-flight camera. InCVPR, 2010. 2

[9] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun. RealTime Human Pose Tracking from Range Data. In ECCV,2012. 2

[10] G. Ghiasi, Y. Yang, D. Ramanan, and C. C. Fowlkes. ParsingOccluded People. In CVPR, 2014. 2

[11] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, andA. Fitzgibbon. Efficient Regression of General-Activity Hu-man Poses from Depth Images. In ICCV, 2011. 1, 2, 3, 4

[12] R. B. Girshick, P. F. Felzenszwalb, and D. A. Mcallester.Object Detection with Grammar Models. In NIPS, 2011. 2

http://mocap.cs.cmu.edu/

Figure 3: Qualitative results for some sample images taken from our real test set. Top row shows the body joints predictedby OARF with semantics. Middle row shows the body joint predictions from OARF without semantics. Bottom row showsthe body joints predicted by the Kinect SDK. The results show that integrating knowledge about occluding objects helps inimproving the body joint prediction performance for the occluded joints.

[13] D. Grest, J. Woetzel, and R. Koch. Nonlinear Body PoseEstimation from Depth Images. In DAGM, 2005. 2

[14] Kinect For Windows SDK 2.0. http://www.microsoft.com/en-us/kinectforwindows/develop/. 1

[15] I. Kostrikov and J. Gall. Depth Sweep Regression Forests forEstimating 3D Human Pose from Images. In BMVC, 2014.2

[16] Microsoft Corp.Redmond WA. Kinect for Xbox 360. 1[17] C. Plagemann, V. Ganapathi, D. Koller, and S. Thrun. Real-

Time Identification and Localization of Body Parts fromDepth Images. In ICRA, 2010. 2

[18] R. Poppe. Vision-based Human Motion Analysis. CVIU,108(1-2):4–18, 2007. 1

[19] Poser. http://my.smithmicro.com/. 4[20] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio,

R. Moore, A. Kipman, and A. Blake. Real-time Human Pose

Recognition in Parts from Single Depth Images. In CVPR,2011. 1, 2, 3, 4, 5

[21] M. Sun, P. Kohli, and J. Shotton. Conditional RegressionForests for Human Pose Estimation. In CVPR, 2012. 2, 3

[22] Sweet Home 3D. www.sweethome3d.com. 4

[23] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The Vit-ruvian Manifold: Inferring Dense Correspondences for One-Shot Human Pose Estimation. In CVPR, 2012. 2, 3

[24] T. Wang, X. He, and N. Barnes. Learning Structured HoughVoting for Joint Object Detection and Occlusion Reasoning.In CVPR, 2013. 2

[25] Y. Yang and D. Ramanan. Articulated Pose Estimation usingFlexible Mixtures of Parts. In CVPR, 2011. 2

[26] M. Ye, X. Wang, R. Yang, L. Ren, and M. Pollefey. Accurate3D pose estimation from a single depth image. In ICCV,2011. 2

http://www.microsoft.com/en-us/kinectforwindows/develop/



http://my.smithmicro.com/

www.sweethome3d.com

Figure 4: Example failure cases for occluded joints from our Real Test Set. OARF with semantics (top row), OARF withoutsemantics (middle row) and Kinect SDK (bottom row).

[27] X. Yu, Z. Lin, J. Brandt, and D. N. Metaxas. Consensusof Regression for Occlusion-Robust Facial Feature Localiza-tion. In ECCV, 2014. 2