Robot-Centric Human Group Detection
Angelique Taylor and Laurel D. Riek
University of California San Diego

{[email protected], [email protected]}

ABSTRACT
As robots enter human-occupied environments, it is important that they work effectively with groups of people. To achieve this goal, robots need the ability to detect groups. This ability requires robots to have ego-centric (robot-centric) perception, because placing external sensors in the environment is impractical. Additionally, robots need learning algorithms which do not require extensive training, as a priori knowledge of an environment is difficult to acquire. We introduce a new algorithm that addresses these needs. It detects moving groups in real-world, ego-centric RGB-D data from a mobile robot. It also uses unsupervised learning, which leverages the underlying structure of the data. This paper provides an overview of our approach and discusses future experimental plans. This work will enable robots to work with human teams in the general public using their own on-board sensors with minimal training.

KEYWORDS
human-robot interaction, egocentric vision, group detection

1 INTRODUCTION
Humans and robots work together in close proximity in many domains, such as rehabilitation, manufacturing, healthcare, and transportation [11, 16, 18–21]. In many situations, robots rarely interact with one person alone; instead, they work with multiple people at once [5–9, 23]. For instance, robots may encounter groups of people in public settings, such as airports, and will be expected to interact with multiple people simultaneously. Additionally, robots which perform joint tasks with multiple people in manufacturing settings must coordinate their actions accordingly [18]. Thus, robots need perception algorithms that enable them to detect human groups [17, 22].

Recent work has drawn inspiration from social science to detect groups. For instance, several methods in human-robot interaction (HRI) enable robots to detect F-formations, which describe groups in terms of the position and orientation of their members [12, 25]. Other methods draw on proxemics [4] to detect and track groups [14, 15].

While this work provides a foundation for human group detection, a major gap remains: joint ego-centric group perception using unsupervised models. Ego-centric perception is vital for robots, as they are expected to perform tasks with human groups in a wide range of settings [26]. It is not practical to install monitoring systems in a robot's environment, either to ensure that it functions properly or to serve as an extension of its vision system. Robot perception algorithms must therefore handle diverse environments without relying on external sensors; robots need to work effectively with human groups using only their on-board sensors.

Although some prior work, such as Linder et al. [14], uses ego-centric perception, it employs supervised learning.

Figure 1: This image shows bounding boxes around groups from a third-person viewpoint and from the first-person viewpoint.

Supervised learning requires a priori knowledge of the problem space, obtained through manual annotation of large datasets, which is a laborious and time-consuming task. This warrants exploration of unsupervised methods for human-robot group interaction. These methods have shown great potential to robustly solve problems in HRI such as manipulation, object recognition, and localization [1, 2, 24]. This provides inspiration for our work: to develop robust algorithms for ego-centric robot group perception that work in-the-wild with no a priori knowledge.

2 GROUP DETECTION METHOD
We are designing a novel algorithm for robots to detect human groups in real-world environments. To our knowledge, we are the first to jointly address this problem from an ego-centric perspective while detecting groups using unsupervised methods.

Our algorithm has three main parts: i) ego-centric pedestrian detection, ii) pedestrian motion estimation, and iii) group detection using joint motion and proximity estimation.

For ego-centric pedestrian detection, we explored several off-the-shelf pedestrian detectors to find the one best suited to our algorithmic pipeline. Our method requires a pedestrian detector with a reasonable frame rate (about 20 frames per second (fps)) and state-of-the-art accuracy (>75% average precision). Also, as our algorithm must work in real-world settings, the pedestrian detector must handle a range of image challenges such as illumination changes, shadows, and occlusion.

Using RGB data, the pedestrian detector generates bounding boxes (BB), denoted $b^n = \langle b^n_x, b^n_y, b^n_w, b^n_h \rangle$. Our method estimates two feature descriptors: the position of each pedestrian (the centroid of their BB) and the size of the BB (its aspect ratio). The centroid of the BB, denoted $b^n_x$ and $b^n_y$, gives the center $x$ and $y$ pixel coordinates of the $n$-th pedestrian at time $t$, where $n = 1, 2, \ldots, N$.

The second feature uses the intuition that people further away from the robot should have smaller BBs than people who are closer. Therefore, we estimate the aspect ratio (or scale), $s_t = \log(b^n_w / b^n_h)$.
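
As an illustrative sketch (not the authors' implementation), these two per-detection features can be computed as follows; the snippet assumes the detector reports boxes in (x, y, w, h) form with (x, y) the top-left corner, and the function name `bb_features` is ours.

```python
import numpy as np

def bb_features(bb):
    """Position and scale features for one detected pedestrian.

    bb: (x, y, w, h) bounding box in pixels, with (x, y) the
    top-left corner (an assumed detector convention).
    Returns the BB centroid (b_x, b_y) and the scale s_t = log(w / h).
    """
    x, y, w, h = bb
    b_x = x + w / 2.0            # centroid x coordinate
    b_y = y + h / 2.0            # centroid y coordinate
    s_t = np.log(w / float(h))   # aspect-ratio (scale) feature
    return b_x, b_y, s_t

# Example: a 60 x 180 px detection whose top-left corner is at (200, 120)
print(bb_features((200, 120, 60, 180)))  # (230.0, 210.0, ~-1.10)
```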

We use optical flow, a popular motion feature in computer vision, to estimate motion between consecutive image frames [10]. This feature estimates a velocity $v^t = \langle v^t_x, v^t_y \rangle$, where $v^t_x$ and $v^t_y$ are the $x$ and $y$ components of the velocity vector from time $t-1$ to $t$. There have been many recently published extensions to optical flow which generate dense flow vectors, and we are currently exploring these methods for our algorithm.
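
For concreteness, one way to obtain such dense per-pixel velocities is OpenCV's Farneback flow; this is only a stand-in for the dense-flow methods still under evaluation, not the method chosen in this work, and the parameter values shown are generic defaults.

```python
import cv2

def dense_flow(prev_gray, curr_gray):
    """Dense optical flow between consecutive grayscale frames.

    Returns an H x W x 2 array whose two channels are the per-pixel
    x and y velocity components (v_x, v_y) from time t-1 to t.
    Farneback flow is used here purely as an illustrative stand-in.
    """
    # Positional parameters: flow, pyr_scale, levels, winsize,
    # iterations, poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
```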

In order to detect groups, our method uses joint pedestrian proximity and motion estimation. In our work, we use depth data to estimate the ego-centric proximity from the robot to people in its environment. Given a depth image $d_t$, we convert it to grayscale and extract a sub-image of the depth pixels corresponding to the $n$-th pedestrian. We use the BB to extract this sub-image, which we denote $d_t(b^n)$. Then, we use the mean of the pixels in this sub-image as the proximity measure, denoted $b^n_z$.
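
A minimal sketch of this proximity estimate is shown below; masking out zero or non-finite depth readings is our addition (common with stereo depth), not something specified above.

```python
import numpy as np

def proximity(depth_img, bb):
    """Mean depth inside one pedestrian's bounding box (b_z).

    depth_img: H x W depth image d_t.
    bb: (x, y, w, h) bounding box, top-left-corner convention assumed.
    """
    x, y, w, h = bb
    sub = depth_img[y:y + h, x:x + w]           # sub-image d_t(b^n)
    valid = sub[np.isfinite(sub) & (sub > 0)]   # drop missing readings (our assumption)
    return float(valid.mean()) if valid.size else float('nan')
```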

Using optical flow, our method estimates the velocity of each detected person by performing a similar operation as for proximity estimation. We extract the velocity vectors in the BB generated by the pedestrian detector, denoted $v^t(b^n)$. Then, we estimate the motion direction using Eq. 1:

$$v^t_\theta \leftarrow \tan^{-1}\!\left(\frac{v^t_y}{v^t_x}\right) \qquad (1)$$

where $v^t_x$ is the $x$ component of the optical flow at time $t$ and $v^t_y$ is the $y$ component of the optical flow at time $t$.

This generates a matrix of orientation estimates, $v^t_\theta$. Then, we compute the mean of $v^t_\theta$, which we use to construct a histogram $h^t_\theta$. To estimate the most likely direction of each pedestrian, we take the maximum bin in $h^t_\theta$, denoted $\theta^t_{max}$. After estimating the aforementioned features, we concatenate them into a single feature vector, $f^n_t = \langle b_x, b_y, b_z, \theta^t_{max}, s_t \rangle$.
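
The sketch below illustrates the direction estimate. It histograms the per-pixel orientations inside the BB directly and returns the center of the maximum bin, which simplifies the mean-then-histogram step described above; the bin count is an illustrative choice, not a value from this work.

```python
import numpy as np

def dominant_direction(flow_bb, n_bins=16):
    """Most likely motion direction theta_max inside one pedestrian's BB.

    flow_bb: h x w x 2 optical-flow patch v_t(b^n) cropped with the BB.
    n_bins: number of orientation bins (illustrative value).
    """
    v_x, v_y = flow_bb[..., 0], flow_bb[..., 1]
    theta = np.arctan2(v_y, v_x)                        # orientation matrix v_theta (Eq. 1)
    hist, edges = np.histogram(theta, bins=n_bins,
                               range=(-np.pi, np.pi))   # histogram h_theta
    k = int(np.argmax(hist))                            # maximum bin
    return 0.5 * (edges[k] + edges[k + 1])              # theta_max (bin center)
```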

Now that we have features to characterize groups, the next step of our algorithm is to cluster people into groups. Each person detected at time $t$ is assigned a group identifier $C_t = \langle 1, 2, \ldots, G \rangle$, where $G$ is the total number of groups. Given $f^n_t \in F$, we use a weighting scheme where $w^n_i$ is the weight of the $i$-th feature for the $n$-th person. We are currently exploring optimization techniques to automatically tune these weights based on experimentation.
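
Since the weights are still being tuned, the sketch below simply applies placeholder per-feature weights before clustering; the values shown are illustrative assumptions, not results from this work.

```python
import numpy as np

def weight_features(f_n, weights):
    """Scale one person's feature vector <b_x, b_y, b_z, theta_max, s_t>
    by per-feature weights w_i before clustering."""
    return np.asarray(weights, dtype=float) * np.asarray(f_n, dtype=float)

# Placeholder weights (illustrative only): emphasize position and depth over motion and scale.
w = [1.0, 1.0, 2.0, 0.5, 0.5]
```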

Once the weights are optimized, we use agglomerative hierarchical clustering to detect groups. First, we estimate the L2-norm distance between each pair of detected pedestrians. Then, we compute the average linkage $d(C^j_t, C^k_t) = \sum_{b^j \in C^j_t} \sum_{b^k \in C^k_t} d(b^j, b^k)$, where $C^j_t$ and $C^k_t$ are clusters and $b^j$ and $b^k$ are a pair of pedestrians' BBs. This pairs up people based on their respective distances from each other and forms a hierarchical clustering tree. Using this tree, we apply a threshold $\alpha$ to the height of each level in the tree to assign $C_t$.
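
The clustering step could be prototyped with SciPy's hierarchical-clustering routines, as in the sketch below; SciPy is our choice of stand-in implementation, and the threshold value in the example is illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def detect_groups(features, alpha):
    """Assign group identifiers C_t via average-linkage clustering.

    features: N x 5 array of (weighted) feature vectors f^n_t.
    alpha: threshold applied to the height of the clustering tree.
    Returns integer group labels 1..G for the N detected people.
    """
    Z = linkage(features, method='average', metric='euclidean')  # L2 distances, average linkage
    return fcluster(Z, t=alpha, criterion='distance')            # cut the tree at height alpha

# Example: three nearby people form one group, a distant person another.
feats = np.array([[10.0, 12.0, 2.0, 0.1, 0.4],
                  [11.0, 13.0, 2.1, 0.1, 0.4],
                  [12.0, 11.0, 2.0, 0.2, 0.4],
                  [90.0, 80.0, 8.0, 3.0, 0.3]])
print(detect_groups(feats, alpha=10.0))  # e.g. [1 1 1 2]
```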

3 DATA ACQUISITION
We performed data acquisition in a large outdoor public park, which captured real-world computer vision and robotics challenges such as variable lighting, occlusion, and motion blur. Data acquisition took place during the daytime over a span of 1.5 hours. The data captures people walking, eating, and viewing nearby landmarks.

We used a ZED stereo camera and mounted it at human height on a Double Telepresence Robot. The ZED sensor acquired 360x640-resolution RGB-D images at about 20 fps with a maximum range of 20 meters. The robot continually collected data while being teleoperated using the Double mobile application. It roamed around the park, moving through corridors, on sidewalks, and through large crowds. In total, our dataset consists of 16,827 RGB-D images.

4 EXPERIMENTAL PLAN
We use two types of evaluation metrics: technical and social. The technical evaluation metrics are commonly used for object detection in the literature [2]. These include Intersection over Union (IoU) vs. accuracy (ACC) and log-average miss rate (LMR) vs. IoU. IoU measures how closely our algorithm's detections match the ground truth. LMR vs. IoU is similar to average precision in that it summarizes detection performance on a log scale as the detection confidence is varied.
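
As a reference point for the technical metrics, a minimal IoU computation for two boxes in (x, y, w, h) form is sketched below; the full evaluation would additionally sweep IoU thresholds against ACC and LMR.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # ~0.143
```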

The social evaluation metric is based on a social navigation evaluation paradigm. Diffusion distance has been used as a similarity measure between paths for autonomous navigation tasks [3]. We apply this metric to groups as a measure of connectivity or affiliation between pairwise individuals in groups [13]. We are interested in how the robot's motion affects human group diffusion distance, as this may provide insight for navigation tasks for mobile robots in public settings. We plan to evaluate our algorithm on a dataset collected from a real-world robot platform in real-world environments (see Figure 1).

5 DISCUSSION
We are continuing to implement our group detection algorithm. Upon completion, we plan to perform a set of experiments that investigate various feature weighting schemes and $\alpha$ parameters. Then, we plan to evaluate our method on our dataset.

Human-robot group interaction is an exciting area of exploration and can greatly impact people's lives as robots begin working alongside us. Our work will enable roboticists to leave the lab and explore group perception problems in-the-wild. No longer limited to stationary sensing systems, mobile robots can detect human groups while the robots themselves are moving. By taking advantage of on-board, ego-centric sensing, robots can assist teams and perform tasks independently, without human intervention. To this end, our approach will provide inspiration for the HRI community to investigate how they, too, can use unsupervised methods to approach robotics challenges.

6 ACKNOWLEDGEMENTS
Some research reported in this article is based upon work supported by the National Science Foundation under grant nos. IIS-1527759 and DGE-1650112.

REFERENCES
[1] A. Alahi, J. Wilson, L. Fei-Fei, and S. Savarese. 2017. Unsupervised camera localization in crowded spaces. In IEEE International Conference on Robotics and Automation (ICRA).


[2] D. Chan, A. Taylor, and L. D. Riek. 2017. Faster Robot Perception Using Salient Depth Partitioning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[3] Y. F. Chen, S. Liu, M. Liu, J. Miller, and J. P. How. 2016. Motion planning with diffusion maps. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[4] E. T. Hall. 1966. The Hidden Dimension.
[5] T. Iqbal, M. J. Gonzales, and L. D. Riek. 2015. Joint action perception to enable fluent human-robot teamwork. In 24th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 400–406.
[6] T. Iqbal, M. Moosaei, and L. D. Riek. 2016. Tempo adaptation and anticipation methods for human-robot teams. In Proceedings of the Robotics: Science and Systems (RSS), Planning for Human-Robot Interaction: Shared Autonomy and Collaborative Robotics Workshop.
[7] T. Iqbal, S. Rack, and L. D. Riek. 2016. Movement Coordination in Human-Robot Teams: A Dynamical Systems Approach. IEEE Transactions on Robotics (2016).
[8] T. Iqbal and L. D. Riek. 2017. Coordination Dynamics in Multihuman Multirobot Teams. IEEE Robotics and Automation Letters 2, 3 (2017), 1712–1717.
[9] T. Iqbal and L. D. Riek. 2017. Human Robot Teaming: Approaches from Joint Action and Dynamical Systems. Invited chapter in Springer Handbook of Humanoid Robotics (2017).
[10] J. J. Yu, A. W. Harley, and K. G. Derpanis. 2016. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In Computer Vision – ECCV Workshops.
[11] M. Joosse and V. Evers. 2017. A Guide Robot at the Airport: First Impressions. In Proceedings of the Companion of the ACM/IEEE International Conference on Human-Robot Interaction.
[12] A. Kendon. 1990. Conducting Interaction: Patterns of Behavior in Focused Encounters.
[13] S. Lafon and A. B. Lee. 2006. Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence (2006).
[14] T. Linder, S. Breuers, B. Leibe, and K. O. Arras. 2016. On multi-modal people tracking from mobile platforms in very crowded and dynamic environments. In IEEE International Conference on Robotics and Automation (ICRA).
[15] R. Mead and M. Mataric. 2016. Robots Have Needs Too: How and Why People Adapt Their Proxemic Behavior to Improve Robot Social Signal Understanding. Journal of Human-Robot Interaction (2016).
[16] M. Moosaei, S. K. Das, D. O. Popa, and L. D. Riek. 2017. Using Facially Expressive Robots to Calibrate Clinical Pain Perception. In Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction. ACM, 32–41.
[17] A. Nigam and L. D. Riek. 2015. Social context perception for mobile robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 3621–3627.
[18] S. Nikolaidis, P. Lasota, G. Rossano, C. Martinez, T. Fuhlbrigge, and J. Shah. 2013. Human-robot collaboration in manufacturing: Quantitative evaluation of predictable, convergent joint action. In 44th International Symposium on Robotics (ISR). IEEE.
[19] L. D. Riek. 2015. Robotics technology in mental health care. Artificial Intelligence in Behavioral and Mental Health Care (2015), 185.
[20] L. D. Riek. 2017. Healthcare robotics. Commun. ACM (2017), 68–78.
[21] S. Schneider and F. Kümmert. 2016. Exercising with a humanoid companion is more effective than exercising alone. In 16th IEEE-RAS International Conference on Humanoid Robots (Humanoids).
[22] A. Taylor and L. D. Riek. 2016. Robot Perception of Human Groups in the Real World: State of the Art. In Proceedings of the AAAI Fall Symposium on Artificial Intelligence in Human-Robot Interaction (AI-HRI).
[23] A. Taylor and L. D. Riek. 2017. Robot Affiliation Perception for Social Interaction. In Proceedings of the Robots in Groups and Teams Workshop at CSCW (2017), 1–4.
[24] R. Toris, D. Kent, and S. Chernova. 2015. Unsupervised learning of multi-hypothesized pick-and-place task templates via crowdsourcing. In IEEE International Conference on Robotics and Automation (ICRA).
[25] M. Vázquez, A. Steinfeld, and S. E. Hudson. 2015. Parallel detection of conversational groups of free-standing people and tracking of their lower-body orientation. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[26] L. Xia, I. Gori, J. K. Aggarwal, and M. S. Ryoo. 2015. Robot-centric activity recognition from first-person RGB-D videos. In IEEE Winter Conference on Applications of Computer Vision (WACV).
