Beyond Controllers - Human Segmentation, Pose, And Depth Estimation as Game Input Mechanisms



    Beyond Controllers

Human Segmentation, Pose, and Depth Estimation as Game Input Mechanisms

    Glenn Sheasby

Thesis submitted in partial fulfilment of the requirements for the award of

    Doctor of Philosophy

    Oxford Brookes University

    in collaboration with Sony Computer Entertainment Europe

    December 2012


    Abstract

Over the past few years, video game developers have begun moving away from the traditional methods of user input through physical hardware interactions such as controllers or joysticks, and towards acquiring input via an optical interface, such as an infrared depth camera (e.g. the Microsoft Kinect) or a standard RGB camera (e.g. the PlayStation Eye).

Computer vision techniques form the backbone of both input devices, and in this thesis, the latter method of input will be the main focus.

In this thesis, the problem of human understanding is considered, combining segmentation and pose estimation. While focussing on these tasks, we examine the stringent

challenges associated with the implementation of these techniques in games, noting particularly the speed required for any algorithm to be usable in computer games. We also

    keep in mind the desire to retain information wherever possible: algorithms which put

    segmentation and pose estimation into a pipeline, where the results of one task are used

    to help solve the other, are prone to discarding potentially useful information at an early

    stage, and by sharing information between the two problems and depth estimation, we

    show that the results of each individual problem can be improved.

We adapt Wang and Koller's dual decomposition technique to take stereo information

    into account, and tackle the problems of stereo, segmentation and human pose estimation

    simultaneously. In order to evaluate this approach, we introduce a novel, large dataset

    featuring nearly 9,000 frames of fully annotated humans in stereo.

    Our approach is extended by the addition of a robust stereo prior for segmenta-

    tion, which improves information sharing between the stereo correspondence and human

    segmentation parts of the framework. This produces an improvement in segmentation

    results. Finally, we increase the speed of our framework by a factor of 20, using a highly

    efficient filter-based mean field inference approach. The results of this approach compare

    favourably to the state of the art in segmentation and pose estimation, improving on the

    best results in these tasks by 6.5% and 7% respectively.


    Acknowledgements

    Okay... now what?

    (Mike Slackenerny, PhD comic #844)

    It is finished. Although the PhD thesis is a beast that must be tamed in solitude, I

don't believe it's something that can be done entirely alone, and there are many people

    to whom I owe a debt of gratitude.

    My supervisor, Phil Torr, made it possible for me to get started in the first place,

and gave me immeasurable help along the way. While we're talking about how I came to

    be doing a PhD, I should also thank my old boss, Andrew Stoddart, who recommended

    that I apply, and the recession for costing me the software job I was doing after leaving

student life for the first time. I guess my escape velocity wasn't high enough, moving

only two miles from my first alma mater. I'm about 770 miles away now, so that should

    be enough!

    My colleagues at Brookes also helped immensely, from those who helped me settle

    in: David Jarzebowski, Jon Rihan, Chris Russell, Lubor Ladicky, Karteek Alahari, Sam

    Hare, Greg Rogez, and Paul Sturgess; to those who saw me off at the end of it: Paul

    Sturgess, Sunando Sengupta, Michael Sapienza, Ziming Zhang, Kyle Zheng, and Ming-

Ming Cheng. Special thanks are due to Morten Lindegaard, who proof-read large chunks of this thesis, and to my co-authors: Julien Valentin, Vibhav Vineet, Jonathan Warrell,

    and my second supervisor, Nigel Crook.

    Financial support from the EPSRC partnership with Sony is gratefully acknowledged,

    and weekly meetings and regular feedback from Diarmid Campbell helped to guide and

    focus my research. Furthermore, Amir Saffari and the rest of the crew at SCEE London

    Studio provided a dataset, as well as feedback from a professional perspective.

I'd also like to thank my examiners, Teo de Campos, Mark Bishop, and David Duce,

    for taking the time to read my thesis, and for providing useful feedback and engaging

    discussion during the viva.

While struggling through my PhD years, I was kept sane in Oxford by a variety of groups, including the prayer group at St. Mary Magdalene's, Brookes Ultimate Frisbee,

and of course, the Oxford University Bridge Club, where I spent many Monday evenings exercising my mind (and liver), and where I met my wonderful fiancée, the future Dr. Mrs. Dr. Sheasby, Aleksandra: all our successes are shared, but this one I owe entirely to you. You believed in me even when I did not believe in myself, and for that I will be grateful to you for the rest of my life.

Lastly, but most importantly, I'd like to thank my parents, for raising me, for supporting me in all of my endeavours, and for teaching me to question everything.


    Contents

List of Figures

List of Tables

List of Algorithms

1 Introduction
  1.1 Contributions
  1.2 Outline of the Thesis
  1.3 Publications

2 Vision in Computer Games: A Brief History
  2.1 Motion Sensors: Nintendo Wii
    2.1.1 Impact
  2.2 RGB Cameras: EyeToy and Playstation Eye
    2.2.1 Early Games
    2.2.2 Antigrav
    2.2.3 Eye of Judgment
    2.2.4 EyePet
    2.2.5 Wonderbook
  2.3 Depth Sensors: Microsoft Kinect
    2.3.1 Technical Details
    2.3.2 Games
    2.3.3 Vision Applications
    2.3.4 Overall Impact
  2.4 Discussion

3 State of the Art in Selected Vision Algorithms
  3.1 Inference on Graphs: Energy Minimisation
    3.1.1 Conditional Random Fields
    3.1.2 Submodular Terms
    3.1.3 The st-Mincut Problem
    3.1.4 Application to Image Segmentation
  3.2 Inference on Trees: Belief Propagation
    3.2.1 Message Passing
    3.2.2 Belief Propagation
  3.3 Human Pose Estimation
    3.3.1 Pictorial Structures
    3.3.2 Flexible Mixtures of Parts
    3.3.3 Unifying Segmentation and Pose Estimation
  3.4 Stereo Vision
    3.4.1 Humans in Stereo
  3.5 Discussion

4 3D Human Pose Estimation in a Stereo Pair of Images
  4.1 Joint Inference via Dual Decomposition
    4.1.1 Introduction to Dual Decomposition
    4.1.2 Related Approaches
  4.2 Humans in Two Views (H2view) Dataset
    4.2.1 Evaluation Metrics Used
  4.3 Inference Framework
    4.3.1 Segmentation Term
    4.3.2 Pose Estimation Term
    4.3.3 Stereo Term
    4.3.4 Joint Estimation of Pose and Segmentation
    4.3.5 Joint Estimation of Segmentation and Stereo
  4.4 Dual Decomposition
    4.4.1 Binarisation of Energy Functions
    4.4.2 Optimisation
    4.4.3 Solving Sub-Problem L1
    4.4.4 Solving Sub-Problem L2
    4.4.5 Solving Sub-Problem L3
  4.5 Weight Learning
    4.5.1 Assumption
    4.5.2 Method
  4.6 Experiments
    4.6.1 Performance
    4.6.2 Runtime
  4.7 Discussion

5 A Robust Stereo Prior for Human Segmentation
  5.1 Related Work
    5.1.1 Range Move Formulation
  5.2 Flood Fill Prior
  5.3 Application: Human Segmentation
    5.3.1 Original Formulation
    5.3.2 Stereo Term fD
    5.3.3 Segmentation Terms fS and fSD
    5.3.4 Pose Estimation Terms fP and fPS
    5.3.5 Energy Minimisation
    5.3.6 Modifications to D Vector
  5.4 Experiments
    5.4.1 Segmentation
    5.4.2 Pose Estimation
    5.4.3 Runtime
  5.5 Conclusion

6 An Efficient Mean Field Based Method for Joint Estimation of Human Pose, Segmentation, and Depth
  6.1 Mean Field Inference
    6.1.1 Introduction to Mean-Field Inference
    6.1.2 Simple Illustration
    6.1.3 Performance Comparison: Mean Field vs Graph Cuts
  6.2 Model Formulation
    6.2.1 Joint Energy Function
  6.3 Inference in the Joint Model
  6.4 Experiments
    6.4.1 Segmentation Performance
    6.4.2 Pose Estimation Performance
  6.5 Discussion

7 Conclusions and Future Work
  7.1 Summary of Contributions
  7.2 Directions for Future Research

Bibliography


    List of Figures

2.1 Duck Hunt screenshot
2.2 Wii Sensor Bar
2.3 Putting action from Wii Sports
2.4 EyeToy and Playstation Eye cameras
2.5 EyeToy: Play
2.6 Antigrav
2.7 Eye of Judgment
2.8 EyePet
2.9 Wonderbook design
2.10 A Wonderbook scene
2.11 Kinect games
2.12 Furniture removal guidelines in Kinect instruction manual
3.1 Image for our toy example
3.2 Segmentation results on the toy image
3.3 Skeleton Model
3.4 Part models
3.5 Yang-Ramanan skeleton model
3.6 Stereo example
4.1 Subgradient example
4.2 Dual functions versus the cost variable
4.3 Values of the dual function g()
4.4 Accuracy of Part Proposals
4.5 CRF Formulation
4.6 Foreground weightings on a cluttered image from the Parse dataset
4.7 Results using just fS
4.8 Part selection
4.9 Limb recovery due to J1 term
4.10 Master-slave update process
4.11 Decision tree: parameter optimisation
4.12 Sample stereo and segmentation results
4.13 Segmentation results on H2view
4.14 Results from H2view dataset
5.1 Flood fill example
5.2 Three successive range expansion iterations
5.3 The new master-slave update process
5.4 Segmentation results on H2View
5.5 Comparison of segmentation results on H2View
5.6 Failure cases of segmentation
6.1 Segmentation of the Tree image
6.2 Basic 6-part skeleton model
6.3 Segmentation results on H2view compared to other methods
6.4 Further segmentation results
6.5 Qualitative results on H2View dataset


    List of Tables

4.1 Table of Notation
4.2 Evaluation of fS only
4.3 Evaluation of fS combined with fPS
4.4 List of weights learned
4.5 Evaluation of segmentation performance
4.6 Dual Decomposition results
5.1 Segmentation results on the H2View dataset
5.2 Results (given in % PCP) on the H2view test sequence
6.1 Evaluation of mean field on the MSRC-21 dataset
6.2 Quantitative segmentation results on the H2View dataset
6.3 Pose estimation results on the H2View dataset


    List of Algorithms

4.1 Parameter optimisation algorithm for dual decomposition framework
5.1 Generic flood fill algorithm for an image I of size W × H
5.2 doLinearFill: perform a linear fill from the seed point (sx, sy)
6.1 Naïve mean field algorithm for fully connected CRFs


Chapter 1

    Introduction

    Over the past several years, a wide range of commercial applications of computer vision

    have begun to emerge, such as face detection in cameras, augmented reality (AR) in shop

    displays, and the automatic construction of image panoramas. Another key application

    of computer vision that has become popular recently is computer games, with commercial

products such as Sony's Playstation Eye and Microsoft's Kinect selling millions of units

    [98].

    In creating these products, video game developers have been able to partially expand

    the demographic of players. They have done this by moving away from the traditional

    controller pad method of user input, and enabling the player to control the game using

    other objects, such as books or AR markers, and even their own bodies. Some of the

    most popular games that are either partially or completely driven using human motion

    include sports games such as Wii Sports and Kinect Sports, and party games such as

the EyeToy: Play series. More recent games, such as EyePet and Wonderbook: Book of

    Spells, combine motion information with object detection. A more thorough description

    of these games can be found in Chapter 2.

    Three main computer vision techniques are used to obtain input instructions for these

    games: motion detection, object detection, and human pose estimation. The first of these,

    motion detection, involves detecting changes in image intensity across several frames; in

    video games, motion detection is used in particular areas of the screen as the player


attempts to complete tasks. Secondly, object detection involves determining the presence,

    position and orientation of particular objects in the frame. The object can be a simple

    shape (e.g. a quadrilateral) or a complex articulated object, such as a cat. In certain

    video games, the detection of AR markers is used to add computer graphics to an image

of the player's surroundings. Finally, the goal of human pose estimation is to determine

the position and orientation of each of a person's body parts. Using images obtained via

an infrared depth sensor, Kinect games can track human poses over several frames, in

    order to detect actions [110].

Theoretically, an image contains a lot more information than a controller can supply. However, the player can only provide information via a relatively limited set of

    actions, either with their own body, or using some kind of peripheral object which can

    be recognised.

    The main aim of this thesis is to explore and expand the applicability of human

    pose estimation to video games. After an analysis of the techniques that have already

    been used, and of the current state of these techniques in research, our main application

    will be presented. Using a stereo pair of cameras, we will develop a system that unifies

    human segmentation, pose estimation, and depth estimation, solving the three tasks

    simultaneously. In order to evaluate this system, we will present a large dataset containing

    stereo images of humans in indoor environments where video games might be played.

    1.1 Contributions

    In summary, the principal contributions of this thesis are as follows:

• A system for the simultaneous segmentation and pose estimation of humans, as well as depth estimation of the entire scene. This system is further developed by the introduction of a stereo-based prior; the speed of the system is subsequently improved by applying a state-of-the-art approximate inference technique.

• The introduction of a novel, 9,000-image dataset of humans in two views.


Throughout the thesis, the pronoun "we" is used instead of "I". This is done to follow

    scientific convention; the contents of this thesis are the work of the author. Where others

have contributed towards the work, their contributions will be acknowledged in a short

    section at the end of each chapter.

    1.2 Outline of the Thesis

Chapter 2 contains a description of some of the various attempts that games developers

    have made to provide alternatives to controllers, and the impact that these games have

had on the video games community. Starting with the accelerometer and infrared detection-based solutions provided by the Nintendo Wii, we observe the increasing amount

    of integration of vision techniques, with this trend demonstrated by the methods used

by Sony's EyeToy and PlayStation Eye-based games over the past several years. Finally,

    we consider the impact that depth information can have in enabling the software to

determine the pose of the player's body, as shown by the Microsoft Kinect.

Following on from that, Chapter 3 contains an appraisal of related work in computer

    vision that might be applied in computer games. We consider the different approaches

    commonly used to solve the problems of segmentation and human pose estimation, and

    give an overview of some of the approaches that have been used to provide 3D information

    given a pair of images from a stereo camera.

    Chapter 4 describes a novel framework for the simultaneous depth estimation of a

    scene, and segmentation and pose estimation of the humans within that scene. Using a

    stereo pair of images as input provides us with the ability to compute the distance of each

    pixel from the camera; additionally, we can use standard approaches to find the pixels

    occupied by the human, and predict its pose. In order to share information between

    these three approaches, we employ a dual decomposition framework [62,127]. Finally, to

    evaluate the results obtained by our method, we introduce a new dataset, called Humans

    in Two Views, which contains almost 9,000 stereo pairs of images of humans.

In Chapter 5, we extend this approach to improve the quality of information shared


    between the segmentation and depth estimation parts of the algorithm. Observing that

the human occupies a continuous region of the camera's field of view, we infer that the

    distance of human pixels from the camera will vary only in certain ways, without sharp

boundaries (we say that the depth is smooth). Therefore, starting from pixels that we are

    very confident lie within the human, we can extract a reliable initial segmentation from

    the depth map, significantly improving the overall segmentation results.
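The idea of growing a segmentation outwards from confident pixels over a smooth depth map can be sketched as a flood fill that stops at depth discontinuities. This is a minimal illustration of the principle only, assuming a dense depth map and a single confident seed pixel; the 4-connectivity and the smoothness threshold are arbitrary choices for the sketch, not the formulation developed in Chapter 5.

```python
from collections import deque
import numpy as np

def smooth_region(depth, seed, max_step=0.05):
    """Grow a region from `seed` over a depth map (in metres), adding a
    neighbouring pixel whenever its depth differs from the current pixel's
    by less than `max_step`. Illustrative 4-connected flood fill."""
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(depth[ny, nx] - depth[y, x]) < max_step):
                mask[ny, nx] = True      # depth is smooth here: same surface
                queue.append((ny, nx))
    return mask

# A person at ~2 m in front of a wall at ~4 m: the fill stops at the depth jump.
depth = np.full((10, 10), 4.0)
depth[2:8, 3:7] = 2.0                    # smooth foreground region
person = smooth_region(depth, seed=(5, 5))
print(person.sum())                      # 24 pixels: the 6x4 foreground block
```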

    The drawback of the dual decomposition-based approach, however, is that it is much

    too slow to be used in computer games. In Chapter 6, we adapt our framework in order

to apply an approximate, but very fast, inference approach based on mean field [64]. Our use of this new inference approach enables us to improve the information sharing between

    the three parts of the framework, providing an improvement in accuracy, as well as an

    order-of-magnitude speed improvement.

    While the mean-field inference approach is much quicker than the dual decomposition-

based approach, its speed (close to 1 fps) is still not fast enough for real-time applications

    such as computer games. In Chapter 7, the thesis concludes with some suggestions for

    how to further improve the speed, as well as some other promising possible directions for

    future research. The concluding chapter also contains a summary of the work presented

    and contributions made.

    1.3 Publications

    Several chapters of this thesis first appeared as conference publications, as follows:

• G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous human segmentation, depth and pose estimation via dual decomposition. In British Machine Vision Conference, Student Workshop, 2012. (Chapter 4, [108])

• G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human segmentation. In Asian Conference on Computer Vision (ACCV), 2012. (Chapter 5, [107])


    The contributions of co-authors are acknowledged in the corresponding chapters. The first

paper [108] received the best student paper award at the BMVC workshop. Additionally, some sections of Chapter 6 form part of a paper that is currently under submission

    at a major computer vision conference.


Chapter 2

    Vision in Computer Games: A Brief History

    Figure 2.1: A screenshot [84] from Duck Hunt, an early example of a game that used

    sensing technology.

The purpose of a game controller is to convey the user's intentions to the game. A wide variety of input methods, for instance a mouse and keyboard, a handheld controller,

    or a joystick, have been employed for this purpose. Video games using some sort of

    sensing technology (instead of, or in addition to, those listed above) have been available

for several decades. In 1984, Nintendo released a light gun, which detects light emitted by

CRT monitors; this release was made popular by the game Duck Hunt for the Nintendo


Figure 2.2: The sensor bar, which emits infrared light that is detected by Wii remotes. The picture [81] was taken with a camera sensitive to infrared light; the LEDs are not visible to the human eye.

Entertainment System (NES), in which the player aims the gun at ducks that appear

on the screen (Figure 2.1). When the trigger is pulled, the screen is turned black for one

frame, and then the target area is turned white in the next frame. If the gun is pointed at the

correct place, it detects this change in intensity and registers a hit.
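The two-frame test described above can be illustrated as follows. This is a hypothetical sketch of the dark-then-bright check, with made-up names and values; it is not NES code, and a real light gun samples an analogue photodiode rather than pixel arrays.

```python
def light_gun_hit(frame_black, frame_target, aim_sample, threshold=128):
    """Duck Hunt-style hit test: the screen is blanked for one frame, then the
    target area alone is lit white. The gun's sensor samples brightness at the
    aimed point in both frames; a dark-then-bright reading means a hit.

    frame_black, frame_target: 2-D greyscale frames (values 0-255).
    aim_sample: (row, col) of the pixel the gun is pointed at.
    """
    y, x = aim_sample
    # Hit only if the point was dark during the blank frame
    # and bright during the target frame.
    return frame_black[y][x] < threshold <= frame_target[y][x]

black = [[0] * 4 for _ in range(4)]
target = [[0] * 4 for _ in range(4)]
target[1][2] = 255                            # the duck's target area, lit white
print(light_gun_hit(black, target, (1, 2)))   # True: aimed at the duck
print(light_gun_hit(black, target, (0, 0)))   # False: aimed elsewhere
```

The blank frame is what makes the test robust: any point that reads bright in both frames (e.g. a lamp) fails the dark-frame condition.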

    Over the past few years, technological developments have made it easier for video

    game developers to incorporate sensing devices to augment, or in some cases replace, the

    traditional controller pad method of user input. These devices include motion sensors,

    RGB cameras, and depth sensors. The following sections give a brief summary of the

    applications of each in turn.

    2.1 Motion Sensors: Nintendo Wii

The Wii is a seventh-generation games console that was released by Nintendo in late 2006.

Unlike previous consoles, the unique selling point of the Wii was a new form of player

    interaction, rather than greater power or graphics capability. This new form of interaction

    was theWii Remote, a wireless controller with motion sensing capabilities. The controller

    contains an accelerometer, enabling it to sense acceleration in three dimensions, and an

    infrared sensor, which is used to determine where the remote is pointing [49].

    Unlike light guns, which sense light from CRT screens, the remote detects light from the console's sensor bar, which features ten infrared LEDs (Figure 2.2). The light from each end of the bar is detected by the remote's optical sensor as two bright lights. Triangulation is used to determine the distance between the remote and the sensor bar, given


    Figure 2.3: An example of the use of the Wii Remote's motion sensing capabilities to control game input. Here, the player moves the remote as he would move a putter when playing golf. The power of the putt is determined by the magnitude of the swing [74].

    the observed distance between the two bright lights and the known distance between the

    LED arrays.
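    This similar-triangles calculation can be sketched in a few lines. The bar width and focal length below are illustrative assumptions, not Wii specifications:

```python
def estimate_distance(pixel_separation, bar_width_mm=200.0, focal_length_px=1300.0):
    """Estimate the remote-to-bar distance (in mm) from the observed separation
    (in pixels) of the two bright IR blobs, via the pinhole-camera relation
    bar_width / distance = pixel_separation / focal_length."""
    if pixel_separation <= 0:
        raise ValueError("the two blobs must be separated by at least one pixel")
    return bar_width_mm * focal_length_px / pixel_separation

# The closer the remote is to the bar, the further apart the blobs appear:
# estimate_distance(130) -> 2000.0 mm; estimate_distance(260) -> 1000.0 mm.
```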

    The capability of the Wii to track position and motion enables the player to mimic actual game actions, such as swinging a sword or tennis racket. This capability is demonstrated by games such as Wii Sports, which was included with the games console in the first few years after its release. The remote can be used to mimic the action of bowling a ball, or swung like a tennis racket, a baseball bat, or a golf club (Figure 2.3).

    2.1.1 Impact

    The player still uses a controller, although in some games, like Wii Sports, it is now

    the position and movement of the remote that is used to influence events in-game. This


    (a) EyeToy [114] (b) PlayStation Eye [32]

    Figure 2.4: The two webcam peripherals released by Sony.

    makes playing the games more tiring than before, especially if they are played with vigour. However, a positive effect is that the control system is more intuitive, meaning that people who don't normally play traditional video games might still be interested in owning a Wii console [103].

    2.2 RGB Cameras: EyeToy and Playstation Eye

    While Nintendo's approach uses the position and motion of the controller to enhance gameplay, other games developers have made use of the RGB images provided by cameras. The first camera released as a games console peripheral and used as an input device for a computer game was the EyeToy, which was released for the PlayStation 2 (PS2) in 2003. This was followed in 2007 by the PlayStation Eye (PS Eye) for the PlayStation 3 (PS3).

    Some of Sony's recent games have used the PlayStation Move (PS Move) in addition to the PS Eye. The Move is a handheld plastic controller which has a large, bright ball on the top; the hue of this ball can be altered by the software. During gameplay, the ball is easily detectable by the software, and is used as a basis for determining the position, orientation and motion of the PS Move [113].

    The degree to which vision techniques have been applied to EyeToy games has varied widely. Some games only use the camera to allow the user to see themselves, whereas

    others require significant levels of image processing. The following sections contain descriptions of some of the games that have used image processing to enhance gameplay.

    (a) Ghost Catcher [48] (b) Keep Up [117] (c) Kung Foo [42]

    Figure 2.5: Screenshots of three mini-games from EyeToy: Play.

    2.2.1 Early Games

    At its original release, the EyeToy was bundled with EyeToy: Play, which features twelve mini-games. The gameplay is simplistic, as is common with party-oriented video games. Many of the mini-games rely on motion detection; for instance, the object of Ghost Catcher is to fill ghosts with air and then pop them, and this is done by repeatedly waving your hands over them. Others, such as Keep Up, use human detection; the player is required to keep a ball in the air, so the game needs to determine whether there is a person in the area where the ball is.

    A third use of vision in this game occurs in Kung Foo; in this mini-game, the player stands in the middle of the camera's field of view, and is instructed to hit ninjas that fly onto the screen from various directions. Again, motion detection can be used to determine whether a hit has been registered, as it doesn't matter which body part was used to perform the hit.

    Impact

    As the mini-games in EyeToy: Play only require simplistic image understanding techniques, specifically the detection of motion within a small portion of the camera's field of view, the underlying techniques seemed to work well. As with Wii Sports, the game was aimed at casual gamers rather than traditional, or hardcore, gamers.


    (a) Antigrav screenshot (b) Close-up of user display

    Figure 2.6: A screenshot [91] from Antigrav, where the player has extended their right arm to grab an object (the first of three) and thus score some points. The user display shows where the game has detected the player's hands to be.

    2.2.2 Antigrav

    Antigrav, a PS2 game that utilises the EyeToy, is a futuristic trick-based snowboarding game, and was brought out by Harmonix in late 2004. The player takes control of a character in the game, and guides them down a linear track. The game uses face tracking to control the character's movements, enabling the player to increase the character's speed by ducking, and change direction by leaning. In addition, the player's hands are tracked, and their hand position is used to infer a pose, enabling the player to literally grab for collectible objects on-screen. The player can see what the computer calculates their head and hand positions to be in the form of a small diagram in the corner of the screen, as shown in Figure 2.6. A GameSpot review [24] points out:

    this is good for letting you know when the EyeToy is misreading your movements, which takes place more often than it ought to.

    The review, like other reviews of PS2 EyeToy releases, hints at further technological limitations impairing the enjoyment of the game:

    Harmonix pushes the limits of what you should expect from an EyeToy entry... unfortunately, EyeToy pushes back, and its occasional inconsistency hobbles an otherwise bold and enjoyable experience.


    Impact

    The reviews above imply that the head and hand detection techniques employed by the game were not completely effective, meaning that users are often frustrated by their actions not being recognised by the game due to failure of the tracking system. This highlights the importance of accuracy when developing vision algorithms for video games: if your tracking algorithm fails around 5% of the time, then the 95% accuracy is, quantitatively, extremely good. However, during a 3-minute run down a track in Antigrav, this could result in a failure of the tracking system that takes several seconds to recover from, which would be clearly noticeable to gamers.

    2.2.3 Eye of Judgment

    Figure 2.7: An image [3] showing the set-up of Eye of Judgment. The camera is pointed at a cloth, on which several cards are placed. These cards are recognised by the game, and the on-screen display shows the objects or creatures that the cards represent.

    In 2007, Sony released Eye of Judgment, a role-playing card-game simulation that can be compared to the popular card game Magic: The Gathering. The PS3 game comes with a cloth, and a set of cards with patterns on them that the computer can easily recognise


    (Figure 2.7). It can recognise the orientation as well as the identity of the cards, enabling

    them to have different functions when oriented differently. A review reported very

    few hardware-related issues, principally because of the pattern-based card recognition

    system [122].

    Since then, the PS3 saw very little PS Eye-related development before the release of EyePet in October 2009; in the two years between the releases of Eye of Judgment and EyePet, the use of the PS Eye was generally limited to uploading personalised images for game characters.

    2.2.4 EyePet

    (a) EyePet's AR marker (b) EyePet with trampoline

    Figure 2.8: An example of augmented reality being used in EyePet [4].

    EyePet features a virtual pet, which interacts with people and objects in the real world using fairly crude motion sensing. For example, if the player rolls a ball towards the pet, it will jump out of the way. Another major feature of the game is the use of augmented reality: a card with a specific pattern is detected in the camera's field of view, and a "magic toy" (a virtual object that the pet can interact with, such as a trampoline, a bubble-blowing monkey, or a tennis player) is shown on top of the card (see Figure 2.8).


    Impact

    Again, EyePet uses fairly simplistic vision techniques, with marker detection and a motion buffer being used throughout the game. This prevents it from receiving the sort of criticism that was associated with Antigrav.

    Although it was generally well-received, even EyePet did not escape criticism for the limitations of its technology, which "can't help but creak at times" according to a review published in Eurogamer [129]. The review goes on to say that performance is "robust under strong natural light, but patchy under electric light in the evening".

    This sort of comment shows the unforgiving nature of video gamers, or at least of video game reviewers: for a vision technique to be useful in a game, it needs to be able to work under a very wide variety of environments and lighting conditions.

    2.2.5 Wonderbook

    (a) (b)

    Figure 2.9: The Wonderbook (a) [82] is used with the PlayStation Move controller. The interior (b) [115] features AR markers, as well as markings on the border to identify the edge of the book, and ones near the edge of the page, which help to identify the page quickly.

    Wonderbook: Book of Spells, released by Sony in November 2012, is the first in an upcoming series of games that will use computer vision methods to enhance gameplay. The games will be centred upon a book whose pages contain augmented reality markers and other patterns (Figure 2.9). These are detected by various pattern recognition techniques, in order to determine where the book is, and which pages are currently visible. Once


    Figure 2.10: After the Wonderbook is detected, gameplay objects can be overlaid on-screen. In this image [116], a 3D stage is superimposed onto the book.

    this is known, augmented reality can be used to replace the image of the book with, for

    example, a burning stage (Figure 2.10).

    In Book of Spells, the book becomes a spell book, and through the gameplay, spells from the Harry Potter series are introduced [53]. At various points in the game, the player must interact with the book, for example to put out fires by patting the book. Skin detection algorithms are used to ensure that the player's hands appear to occlude the spellbook, rather than going through it.

    The generality of the book enables it to be used in multiple different kinds of games. The BBC's Walking with Dinosaurs will be made into an interactive documentary, with the player first excavating and completing dinosaur skeletons, and then feeding the dinosaurs using the PS Move [55]. It remains to be seen how the final versions of these games will be appraised by reviewers and customers, and thus whether the Wonderbook franchise will have a significant impact on the video gaming market.

    2.3 Depth Sensors: Microsoft Kinect

    While RGB cameras can be useful in enhancing gameplay with vision techniques, the

    extra information provided by depth cameras makes it significantly easier to determine

    the structure of a scene. This enables games developers to provide a new way of playing.

    With Kinect, which was released in November 2010, Microsoft offer a system where you


    (a) (b) (c)

    Figure 2.11: A selection of different games available for Kinect. (a) Kinect Sports [78]: two players compete against each other at football. (b) Dance Central [7]: players perform dance moves, which are tracked and judged by the game. (c) Kinect Star Wars [99]: players swing a lightsabre by making sweeping movements with their arm.

    are the controller. Using an infrared depth sensor to track 3D movement, they generate

    a detailed map of the scene, significantly simplifying the task of, for example, tracking

    the movement of a person.

    2.3.1 Technical Details

    The Kinect provides a 320 × 240 16-bit depth image, and a 640 × 480 32-bit RGB image, both running at 30 frames per second (fps); the depth sensor has an active range of 1.2 to 3.5 metres [93]. The skeletal detection system, used for detecting the human body in each frame, is based on random forest classifiers [110], and is capable of tracking a twenty-link skeleton of up to two active players in real-time.¹ The software also provides an object-specific segmentation of the people in the scene (i.e. different people are segmented separately), and further enhances the player's experience by using person recognition to provide greetings and content.

    2.3.2 Games

    As with the Nintendo Wii, the Kinect was launched along with a sports game, namely Kinect Sports. The controls are intuitive: the player makes a kicking motion in order

    ¹ In an articulated body, a link is defined as an inflexible part of the body. For example, if the flexibility of fingers and thumbs is ignored, each arm could be treated as three links, with one link each for hand, forearm and upper arm.


    Figure 2.12: The furniture removal guidelines in the Kinect instruction manual [79] advise the player to move tables etc. that might block the camera's view, which may cause a problem for some users.

    to kick a football, or runs on the spot in the athletics mini-games. No controllers or

    buttons are required, which makes the games very easy to adapt to, although some of

    the movements need to be exaggerated in order for the game to recognise them [18].

    Another intuitive game is Dance Central, which uses the Kinect's full body tracking capabilities to compare the player's dance moves to those shown by an on-screen instructor. The object of the game is to imitate these moves in time with the music. This can be compared to classic games like Dance Dance Revolution, with the difference that the player's whole body is now used, enabling a greater variety of moves and adding an element of realism [128].

    Up until now, games developers have struggled to produce a game that uses the Kinect's capabilities, yet still appeals to the serious gamer. One attempt was made in the 2011 release Kinect Star Wars, in which the player uses their arms to control a lightsaber, making sweeping or chopping motions to remove obstacles, and to defeat enemies. However, this game was criticised due to its inability to keep up with fast and frantic arm motions [126].

    A common problem with the Kinect model of gaming is that it is necessary to stand a reasonable distance away from the camera (2 to 3 metres is the recommended range),


    which makes gaming very difficult in small rooms, especially as any furniture will need

    to be moved away (Figure 2.12).

    2.3.3 Vision Applications

    Since its release, and the subsequent release of an open-source software development kit [83], the Kinect has been used in a wide variety of non-gaming related work by computer vision researchers. Oikonomidis et al. [87] developed a hand-tracking system capable of running at 15 fps, while Izadi et al. [54] perform real-time 3D reconstruction of indoor scenes by slowly moving the Kinect camera around the room. The Kinect has also been shown to be a useful tool for easily collecting large amounts of training data [47]. However, due to IR interference, the depth sensor does not work in direct sunlight, making it unsuitable for outdoor applications such as pedestrian detection [39].

    2.3.4 Overall Impact

    The Kinect has had a huge impact, selling 19 million units worldwide in its first eighteen months. This has helped Microsoft improve sales of the Xbox 360 year-on-year, despite the console now being in its seventh year; this is the reverse of the trend shown by competing consoles [98]. The method of controlling games using the human body rather than a controller is revolutionary, and the technology has also had a significant effect on vision research, as mentioned in Section 2.3.3 above.

    2.4 Discussion

    To date, a number of vision methods that use RGB cameras have been introduced to

    the video gaming community. However, these tend to be low-level (motion detection or

    marker detection) rather than high-level: if the only information given is an RGB signal,

    unconstrained object detection and human pose estimation are neither accurate nor fast

    enough to be useful in video games.


    The depth camera used in the Microsoft Kinect has provided a huge leap forward in this area, although the cost of this peripheral (which had a recommended retail price of £129.99 at release, around four times more than the PS Eye) means that an improvement in the RGB-based techniques would be desirable. The next chapter contains an appraisal of related work in computer vision that might be of interest to games developers, and provides background for this thesis.


    Chapter 3

    State of the Art in Selected Vision Algorithms

    While we have seen in Chapter 2 that computer vision techniques are beginning to have a

    profound effect on computer games, there are a number of research areas which could be

    applied to further transform the gaming industry. Accurate object segmentation would

    allow actual objects, or even people, to be taken directly from the player's surroundings

    and put into the virtual environment of the game. Human motion tracking could be

    used to allow the player to navigate a virtual world, for instance by steering a vehicle.

    Finally, human pose estimation could be used to allow the player to control an avatar in

    a platform or role-playing game. In this chapter, we will discuss the current state of the

    art in energy minimisation, human pose estimation, segmentation, and stereo vision.

    In order for computer vision techniques like localisation and pose estimation to be

    suitable for use in computer games, the algorithm that applies the technique needs to

    respond in real time as well as being accurate. A fast algorithm is necessary because

    the results (e.g. pose estimates) need to be used in real-time so that they can affect

    the game in-play; very high accuracy is a requirement because mistakes made by the

    game will undoubtedly frustrate the user (see [24] and Section 2.2.2). The problem is

    to find a suitable balance between these two requirements (a faster algorithm might

    involve approximate solutions, and hence could be less accurate). This may involve


    tweaking existing algorithms to produce significant speed increases without any loss in

    accuracy, or developing novel and significantly more accurate algorithms that still have

    speed comparable to current state-of-the-art algorithms.

    3.1 Inference on Graphs: Energy Minimisation

    Many of the most popular problems in computer vision can be framed as energy minim-

    isation problems. This requires the definition of a function, known as an energy function,

    which expresses the suitability of a particular solution to the problem. Solutions that are

    more probable should give the energy function a lower value; hence, we wish to find the

    solution that gives the lowest value.

    3.1.1 Conditional Random Fields

    Suppose we have a finite set V of random variables, to which we wish to assign labels from a label set L. If all the variables are independent, then this problem is easily solvable - just find the best label for each variable. However, in general we have relationships between variables. Let E be the set of pairs of variables {v1, v2} ⊆ V which are related to one another.

    We can then construct a graph G = (V, E) which specifies both the set of variables, and the relationships between those variables. G is a directed graph if the pairs in E are ordered; this enables us to construct graphs where, for some v1, v2, we have (v1, v2) ∈ E, but (v2, v1) ∉ E.

    Given some observed data X, we can assign a set {yi : vi ∈ V} of values to the variables in V. Let f denote a function that assigns a label f(vi) = yi to each vi ∈ V. Now, suppose that we also have a probability function p that gives us the probability of


    a particular labelling {f(vi) : vi ∈ V} given observed data X. Then:

    Definition 3.1

    (X, V) is a conditional random field if, when conditioned on X, the variables V obey the Markov property with respect to G:

    p(f(vi) = yi | X, {f(vj) : j ≠ i}) = p(f(vi) = yi | X, {f(vj) : (vi, vj) ∈ E}).  (3.1)

    In other words, each output variable yi only depends on its neighbours [72].
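    The Markov property of Definition 3.1 can be checked numerically on a tiny chain-structured model, where p is a Gibbs distribution over three binary variables. The potentials below are illustrative choices, not taken from any particular model:

```python
from itertools import product
from math import exp

unary = [[0.3, 0.9], [0.5, 0.1], [0.7, 0.2]]  # cost of labels 0/1 at v1, v2, v3

def pairwise(a, b):
    return 0.0 if a == b else 1.2  # smoothness cost on the chain edges

def joint(y):
    """Unnormalised Gibbs probability of a labelling y = (y1, y2, y3)
    on the chain v1 - v2 - v3."""
    energy = (sum(unary[i][y[i]] for i in range(3))
              + pairwise(y[0], y[1]) + pairwise(y[1], y[2]))
    return exp(-energy)

def p_y1_given(y2, y3=None):
    """p(y1 = 1 | y2) or p(y1 = 1 | y2, y3), by summing the joint."""
    total = ones = 0.0
    for y in product((0, 1), repeat=3):
        if y[1] == y2 and (y3 is None or y[2] == y3):
            total += joint(y)
            ones += joint(y) if y[0] == 1 else 0.0
    return ones / total

# v1 and v3 are not neighbours, so additionally conditioning on y3
# leaves the conditional distribution of y1 unchanged.
```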

    3.1.2 Submodular Terms

    Now we consider set functions, which are functions whose input is a set. For example, suppose we have a set Y of possible variable values, and a set V, with size ℓ = |V|, of variables vi which each take a value yi ∈ Y. A function f which takes as input an assignment of these variables {yi : vi ∈ V} is a set function.

    Energy functions are set functions f : Y^ℓ → R+ ∪ {0}, which take as input the variable values {yi : vi ∈ V}, and output some non-negative real number. If the variable values are binary, then this f is a binary set function f : 2^ℓ → R+ ∪ {0}.

    Definition 3.2

    A binary set function f : 2^ℓ → R+ ∪ {0} is submodular if and only if for every pair of sets S, T ⊆ V we have that:

    f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T).  (3.2)


    For example, if ℓ = 2, S = [1, 0] and T = [0, 1], a submodular function will satisfy the following inequality [104]:

    f([1, 0]) + f([0, 1]) ≥ f([1, 1]) + f([0, 0]).  (3.3)

    From Schrijver [104], we also have the following proposition:

    Proposition 3.1 The sum of submodular functions is submodular.

    Proof It is sufficient to prove that, given two submodular functions f : A → R+ ∪ {0} and g : B → R+ ∪ {0}, h = f + g : A ∪ B → R+ ∪ {0} is submodular.

    h(S) + h(T) = (f + g)(S) + (f + g)(T)
    = f(S|A) + g(S|B) + f(T|A) + g(T|B)
    = (f(S|A) + f(T|A)) + (g(S|B) + g(T|B))
    ≥ (f((S ∪ T)|A) + f((S ∩ T)|A)) + (g((S ∪ T)|B) + g((S ∩ T)|B))
    = f((S ∪ T)|A) + g((S ∪ T)|B) + f((S ∩ T)|A) + g((S ∩ T)|B)
    = h(S ∪ T) + h(S ∩ T). ∎
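    For a small number of variables, both Definition 3.2 and Proposition 3.1 can be verified by brute force, taking unions and intersections elementwise on binary indicator vectors. The example functions below are illustrative:

```python
from itertools import product

def is_submodular(f, n):
    """Check f(S) + f(T) >= f(S | T) + f(S & T) over all pairs of binary
    vectors of length n, with union/intersection taken elementwise."""
    for S in product((0, 1), repeat=n):
        for T in product((0, 1), repeat=n):
            union = tuple(max(a, b) for a, b in zip(S, T))
            inter = tuple(min(a, b) for a, b in zip(S, T))
            if f(S) + f(T) < f(union) + f(inter):
                return False
    return True

# A cut-counting pairwise term (the kind used in segmentation) is submodular,
f = lambda z: sum(z[i] != z[i + 1] for i in range(len(z) - 1))
g = lambda z: 2 * f(z)
h = lambda z: f(z) + g(z)  # ...and so is a sum of such terms (Proposition 3.1),
# whereas a "reward both ends" term such as z -> z[0] * z[1] is not.
```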

    As shown by Kolmogorov and Zabih [61], one way of minimising energy functions, particularly submodular energy functions, is via graph cuts, which we will now introduce.

    3.1.3 The st-Mincut Problem

    In this section, we will consider directed graphs G = (V, E) that have special nodes s, t ∈ V such that for all vi ∈ V \ {s, t}, we have (s, vi) ∈ E, (vi, t) ∈ E, (vi, s) ∉ E, and (t, vi) ∉ E. We say that s is the source node and t is the sink node of the graph. Such a graph is also known as a flow network. Let c be a function c : E → R+ ∪ {0}, where for each (v1, v2) ∈ E, c(v1, v2) represents the capacity, or maximum amount of flow, of the edge.


    Max Flow

    Definition 3.3

    A flow function is a function f : E → R+ ∪ {0} which satisfies the following constraints:

    1. f(v1, v2) ≤ c(v1, v2) ∀ (v1, v2) ∈ E;

    2. Σ_{v1 : (v1, v) ∈ E} f(v1, v) = Σ_{v2 : (v, v2) ∈ E} f(v, v2) ∀ v ∈ V \ {s, t}.

    The definition given above gives us two guarantees: first, that the flow passing along a particular edge does not exceed that edge's capacity; and second, that the flow entering a vertex is equal to the flow leaving that vertex. From this second constraint, we can derive the following:

    Definition 3.4

    The flow of a flow function is the total amount passing from the source to the sink, and is equal to Σ_{(s, v) ∈ E} f(s, v).

    The objective of the max flow problem is to maximise the flow of a network, i.e. to find a flow function f with the highest flow.

    Min Cut

    Definition 3.5

    An s-t cut C = (S, T) is a partition of the variables v ∈ V into two disjoint sets S and T, with s ∈ S and t ∈ T.


    Let E′ be the set of edges that connect a variable v1 ∈ S to a variable v2 ∈ T. Formally:

    E′ = {(v1, v2) ∈ E : v1 ∈ S, v2 ∈ T}.  (3.4)

    Note that there are at least |V| − 2 edges in E′, as if v ∈ S \ {s}, then (v, t) ∈ E′, and if v ∈ T \ {t}, then (s, v) ∈ E′. Depending on the connectivity of G, there may be up to (|S| − 1) × (|T| − 1) additional edges.

    Definition 3.6

    The capacity of an s-t cut is the sum of the capacities of the edges connecting S to T, and is equal to Σ_{(v1, v2) ∈ E′} c(v1, v2).

    The objective of the min cut problem is to find an s-t cut which has minimal capacity (there may be more than one solution).

    In 1956, it was shown independently by Ford and Fulkerson [41] and by Elias et al. [30] that the two problems above are equivalent. Therefore, to find a flow function that has maximal flow, one needs only to find an s-t cut with minimal capacity. Algorithms that seek to obtain such an s-t cut are known as graph cut algorithms. Submodular functions can be efficiently minimised via graph cuts [15, 61]; C++ code is available that performs this minimisation using an augmenting path algorithm [14, 58, 61]. This code is often used as a basis for image segmentation algorithms, for example [9, 16, 71, 100, 101, 130].
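    The augmenting path idea can be sketched with a minimal Edmonds-Karp implementation (breadth-first search for the shortest augmenting path). This is a toy sketch for illustration, not the optimised code of [14, 58]:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: repeatedly push flow along a shortest augmenting path.
    `capacity` maps directed edges (u, v) to non-negative capacities; the
    returned value equals both the maximum flow and the minimum cut capacity."""
    residual, nodes = {}, set()
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge, used to undo flow
        nodes.update((u, v))
    flow = 0
    while True:
        # Breadth-first search for a path with spare residual capacity.
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in nodes:
                if v not in parent and residual.get((u, v), 0) > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path left, so the flow is maximal
        # Recover the path, find its bottleneck capacity, and augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck

# Toy network: the tightest cut is {s, a, b} | {t}, with capacity 4 + 9 = 13.
capacities = {('s', 'a'): 10, ('s', 'b'): 10, ('a', 'b'): 2,
              ('a', 't'): 4, ('b', 't'): 9}
```

    By the equivalence above, `max_flow(capacities, 's', 't')` returns 13, the capacity of the minimal s-t cut.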

    3.1.4 Application to Image Segmentation

    To illustrate the use of energy minimisation in image segmentation, consider the following example. We have an image, shown in Figure 3.1, with just 9 pixels (3 × 3). To construct a graph, we create a set of vertices V = {s, t, v1, v2, . . . , v9}, and a set of edges E, with (s, vi) and (vi, t) ∈ E for i = 1 to 9, and (vi, vj) ∈ E if vi and vj are adjacent in the image, as shown in Figure 3.1. The vertices vi have pixel values pi between 0 and 255 inclusive (i.e. the image is 8-bit greyscale, with 0 corresponding to black, and 255 to white).

    Figure 3.1: Image for our toy example.

    Our objective is to separate the pixels into foreground and background sets, i.e. to define a labelling z = {z1, z2, . . . , z9}, where zi = 1 if and only if vi is assigned to the foreground set.

    We wish to separate the light pixels in the image from the dark ones, with the light pixels in the foreground, so we create foreground and background penalties F and B respectively for pixels vi as follows:

    F(vi) = 255 − pi;  (3.5)

    B(vi) = pi.  (3.6)

    These are known as unary pixel costs. The total unary cost of a labelling z is:

    ψ(z) = Σ_{i=1}^{9} (zi · F(vi) + (1 − zi) · B(vi)).  (3.7)

    We also want the boundary of the foreground set to align with edges in the image. Therefore, we wish to penalise cases where adjacent pixels have similar values, but different labels. This is done by including a pairwise cost:

    φ(z) = Σ_{(vi, vj) ∈ E} 1(zi ≠ zj) · exp(−|pi − pj|),  (3.8)

    where 1 is the indicator function, which has a value of 1 if the statement within the brackets is true, and zero otherwise. The overall energy function is:

    f(z) = ψ(z) + λ · φ(z),  (3.9)

    where λ is a weight parameter; higher values of λ will make it more likely that adjacent pixels have similar labels.

    (a) λ = 0.1 (b) λ = 0.2 (c) λ = 1

    Figure 3.2: Segmentation results for different values of λ. A higher value punishes segmentations with large boundaries; a high enough value (as in (c)) will make the result either all foreground or all background.

    The energy function in (3.9) is submodular, and can therefore be minimised efficiently using the max flow code available at [58]. The segmentation results obtained for different values of λ are shown in Figure 3.2. The ratio between the unary and pairwise weights influences the segmentation result produced.
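    For a graph this small, the energy can even be minimised by exhaustive search over all 2^9 labellings, which makes the effect of the pairwise weight easy to see. The sketch below uses intensities scaled to [0, 1]; the image and constants are illustrative, not those of Figure 3.1:

```python
from itertools import product
from math import exp

# An illustrative 3x3 image with intensities in [0, 1]; the light pixels
# on the left should end up in the foreground.
pixels = [0.9, 0.8, 0.2,
          0.9, 0.7, 0.1,
          0.2, 0.1, 0.1]
W = 3  # image width

# 4-connected neighbours, mirroring the pairwise edges of the graph.
edges = [(i, i + 1) for i in range(9) if i % W != W - 1] + \
        [(i, i + W) for i in range(9 - W)]

def energy(z, lam):
    """Unary cost plus weighted pairwise cost, in the style of (3.5)-(3.9)."""
    unary = sum(zi * (1 - p) + (1 - zi) * p for zi, p in zip(z, pixels))
    pairwise = sum(exp(-abs(pixels[i] - pixels[j]))
                   for i, j in edges if z[i] != z[j])
    return unary + lam * pairwise

def segment(lam):
    """Exact minimisation by enumerating all 512 labellings; real images
    need graph cuts, but the objective being minimised is the same."""
    return min(product((0, 1), repeat=9), key=lambda z: energy(z, lam))
```

    A tiny pairwise weight simply thresholds each pixel independently, while a very large weight forces a single label everywhere (here all background, the cheaper of the two uniform labellings), matching the behaviour described for Figure 3.2.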

    3.2 Inference on Trees: Belief Propagation

    While vision problems such as segmentation require a large number of variables (one per

image pixel), others, such as pose estimation, require only a small number of variables,

    and hence a smaller graph. An important type of graph that is useful for this problem is


E) in the form of a message. Here, a message can be as simple as a scalar value, or a

    matrix of values. This message is then combined with information relevant to the vertex

    itself, to form a new message for the next vertex.

    Messages are passed between vertices in a series of pre-defined updates. The number

    of updates required to find an overall solution depends on the complexity of the graph.

    If the graph has a simple structure, such as a chain (where each vertex is connected to

    at most two other vertices, and the graph is not a cycle), then only one set of updates is

    required to find the optimal set of values. This set can be found by an algorithm such as

    the Viterbi algorithm [125].

    3.2.2 Belief Propagation

    Belief propagation can be viewed as a variation of the Viterbi algorithm that is applicable

    to trees. To use this process to perform inference on a tree T, we must choose a vertex

    of the tree to be the root vertex, denoted v0.

Since T is a tree, v0 is connected to each of the other vertices by exactly one path. We can therefore re-order the vertices such that, for any vertex vi, the path from v0 to vi

proceeds via vertices with indices in ascending order.[1] Once we have done this, we can

    introduce the notions of parent-child relations between vertices, defined here for clarity.

    Definition 3.9

We say that a vertex vi is the parent of vj if (vi, vj) ∈ E and i < j. If this is
the case, we say that vj is a child of vi.

    Note that the root node has no parents, and each other vertex has exactly one parent,

    since if a vertex vj had two parents, then there would be more than one path from v0

    to vj, which contradicts the definition of a tree. However, a vertex may have multiple

[1] There will typically be multiple ways to do this.


    children, or none at all.

    Definition 3.10

A vertex vi with no children is known as a leaf vertex.

    We now describe the general form of belief propagation on our tree T. The vertices

    in V are considered in two passes: a down pass, where the vertices are processed in

    descending order, so that each vertex is processed after its children, but before its parent,

    and an up pass, where the order is reversed.

For each leaf vertex $v_i$, we have a set $L_i = \{l_i^1, l_i^2, \dots, l_i^{K_i}\}$ of possible labels, where $K_i$ is the number of labels in the set $L_i$. Then the score associated with assigning a particular label $l_i^p$ to vertex $v_i$ is:

$$\text{score}_i(l_i^p) = \phi(v_i = l_i^p). \tag{3.10}$$

This score is the message that is passed to the parent of $v_i$.

For a vertex $v_j$ with at least one child, we need to combine these messages with the unary and pairwise energies, in order to produce a message for the parent of $v_j$. Again, we have a finite set $L_j = \{l_j^1, l_j^2, \dots, l_j^{K_j}\}$ of possible labels for $v_j$. The score associated with assigning a particular label $l_j^q$ is:

$$\text{score}_j(l_j^q) = \phi(v_j = l_j^q) + \sum_{i > j : (v_i, v_j) \in E} m_i(l_j^q), \tag{3.11}$$

where:

$$m_i(l_j^q) = \max_{l_i^p} \bigl( \psi(v_i = l_i^p, v_j = l_j^q) + \text{score}_i(l_i^p) \bigr). \tag{3.12}$$

When the root vertex is reached, the optimal label for $v_0$ can be found by maximising $\text{score}_0$,

    defined in (3.11). Finally, the globally optimal configuration can be found by keeping

    track of the arg max indices, and then tracing back through the tree on the up pass


    to collect them. The up pass can be avoided if the arg max indices are recorded along

    with the messages during the down pass.
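To make the two passes concrete, here is a small, self-contained Python sketch of max-sum belief propagation on a tree whose vertices are ordered as above (each parent has a smaller index than its children). The routine, its argument names, and the three-vertex example with a Potts-style smoothness score are all illustrative assumptions, not code from any cited work.

```python
def tree_map(n, edges, phi, psi):
    """Exact MAP labelling of a tree by max-sum belief propagation.

    Vertices 0..n-1 are ordered so each edge (i, j) in `edges` has i < j,
    with i the parent.  phi[j][q] is the unary score of label q at vertex
    j; psi(j, q, c, p) is the pairwise score of label q at vertex j and
    label p at its child c.  Returns the optimal label per vertex.
    """
    children = {i: [] for i in range(n)}
    for i, j in edges:
        children[i].append(j)

    score = [list(u) for u in phi]       # start from the unary scores
    best = {}                            # best[(c, q)] = argmax label of child c
    # Down pass: each vertex is processed after its children.
    for j in range(n - 1, -1, -1):
        for c in children[j]:
            for q in range(len(phi[j])):
                msgs = [psi(j, q, c, p) + score[c][p]
                        for p in range(len(phi[c]))]
                p_star = max(range(len(msgs)), key=msgs.__getitem__)
                score[j][q] += msgs[p_star]
                best[(c, q)] = p_star
    # Decide the root label, then trace the argmax indices back down,
    # which avoids an explicit up pass.
    labels = [None] * n
    labels[0] = max(range(len(phi[0])), key=score[0].__getitem__)
    for j in range(n):                   # parents are visited before children
        for c in children[j]:
            labels[c] = best[(c, labels[j])]
    return labels

# Root 0 with two leaf children, binary labels, and a smoothness term
# that rewards a child for agreeing with its parent.
phi = [[0.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
potts = lambda j, q, c, p: 0.5 if q == p else 0.0
print(tree_map(3, [(0, 1), (0, 2)], phi, potts))   # -> [1, 1, 0]
```

Note that the smoothness reward pulls vertex 1 to agree with the root, while the strong unary score at vertex 2 overrides it.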

    One of the vision problems that is both interesting for computer games and suitable

    for the application of belief propagation is human pose estimation, and it is this problem

    that is described in the next section.

    3.3 Human Pose Estimation

In layman's terms, the problem of human pose estimation can be stated as follows: given an image containing a person, the objective is to correctly classify the person's pose. The pose can either be selected from a constrained list, or estimated freely via the locations of the person's limbs and the angles of their joints (for example, the location of their left arm, and the angle at which their elbow is bent). This is often formalised

    by defining a skeleton model, which is to be fitted to the image. It is quite common to

    describe the human body as an articulated object, i.e. one formed of a connected set of

    rigid parts. Such a formalisation gives rise to a family of parts-based models known as

    pictorial structure models.

    These models typically consist of six parts (if the objective is restricted to upper body

pose estimation) or ten parts (full body) [10, 35–37]. The upper body model consists of

    head, torso, and upper and lower arms; to extend this to the full body, upper and lower

    leg parts are added. Having divided the human body into parts, one can then learn a

    separate detector for each part, taking advantage of the fact that the parts have both a

    simpler shape (not being articulated), and a simpler colour distribution.

    Indeed, pose estimation can be formulated as an energy minimisation problem. In

    contrast to segmentation problems, which require a different variable for each pixel, the

    number of variables required is equal to the number of parts in the skeleton model.

    However, the number of possible values that the variables can take is large (a part can,

    in theory, occupy any position in the image, with any orientation and extent).


Figure 3.3: A depiction of the ten-part skeleton model used by Felzenszwalb and Huttenlocher [35].

    3.3.1 Pictorial Structures

    A pictorial structure model can be expressed as a graph G = (V, E), with the vertices

V = {v1, v2, . . . , vn} corresponding to the parts, and the edges E specifying which pairs
of parts (vi, vj) are connected. A typical graph is shown in Figure 3.3.

In Felzenszwalb and Huttenlocher's pictorial structure model [35], a particular labelling of the graph is given by a configuration $L = \{l_1, l_2, \dots, l_n\}$, where each $l_i$ specifies the location $(x_i, y_i)$ of part $v_i$, together with its orientation, and degree of foreshortening (i.e. the degree to which the limb appears to be shorter than it actually is, due to its

    angle relative to the camera). The energy of this labelling is then given by:

$$E(L) = \sum_{i=1}^{n} \phi(l_i) + \sum_{(v_i, v_j) \in E} \psi(l_i, l_j), \tag{3.13}$$

where, as in the previous section, $\phi$ represents the unary energy on part configuration, and $\psi$ the pairwise energy. These energies relate to the likelihood of a configuration, given


    the image data, and given prior knowledge of the parts; more realistic configurations

    will have lower energy. Despite the large number of possible configurations, a globally

optimal configuration can be found efficiently. This can be done by using simple appearance models for each part, explained in the following section, and then applying belief

    propagation.

    Appearance Model

    For each part, appearance models can be learned from training data, and can be based on

    edges [95], colour-invariant features such as HOG [22,37], or the position of the part within

    the image [38]. Another approach for video sequences is to apply background subtraction,

    and define a unary potential based on the number of foreground pixels around the object

    location [35].

    Given an image, this appearance model can be evaluated over a dense grid [1]; to

    speed this process up, a feature pyramid can be defined, so that a number of promising

    locations are found from a coarse grid, and then higher resolution part filters produce

    more precise matching scores [34,36]. In order to reduce the time taken by the inference

    process, it might be desirable to reduce the set of possible part locations. Two ways to

    do this are:

1. Thresholding, where part locations with a score that is worse than some predefined
value, or with a score outside the top N values for some N, are discarded.

    2. Non-maximal suppression, which involves the removal of part locations that are

    similar, but inferior, to other part locations.
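The second of these is commonly implemented greedily: sort the candidates by score and discard any candidate that overlaps an already-kept one too much. The sketch below is an illustrative Python version; the box representation, the intersection-over-union overlap measure, and the 0.5 threshold are assumptions, not details from the cited methods.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def non_max_suppression(detections, overlap=0.5):
    """Greedy NMS over a list of (score, x, y, w, h) tuples.

    Keeps the best-scoring detections, removing any candidate that is
    similar (high overlap) but inferior (lower score) to a kept one.
    """
    keep = []
    for det in sorted(detections, reverse=True):   # highest score first
        if all(iou(det[1:], k[1:]) < overlap for k in keep):
            keep.append(det)
    return keep

dets = [(0.9, 0, 0, 10, 10),     # best detection
        (0.8, 1, 1, 10, 10),     # near-duplicate of the first: suppressed
        (0.7, 20, 20, 10, 10)]   # a separate location: kept
print([d[0] for d in non_max_suppression(dets)])   # -> [0.9, 0.7]
```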

    Optimisation

After applying these techniques, we now have a small set of possible locations $\{l_i^1, l_i^2, \dots, l_i^k\}$ for each vertex $v_i$. For a leaf vertex $v_i$, the score of each location $l_i^p$ is:

$$\text{score}_i(l_i^p) = \phi(l_i^p). \tag{3.14}$$


    Now, for a vertex vj with at least one child, the score is defined in terms of the children:

$$\text{score}_j(l_j^p) = \phi(l_j^p) + \sum_{v_i : (v_i, v_j) \in E} m_i(l_j^p), \tag{3.15}$$

where:

$$m_i(l_j^p) = \max_{l_i^q} \bigl( \psi(l_i^q, l_j^p) + \text{score}_i(l_i^q) \bigr). \tag{3.16}$$

    Finally, the top-scoring part configuration is found by finding the root location with the

    highest score, and then tracing back through the tree, keeping track of the arg max indices.

    Multiple detections can be generated by thresholding this score and using non-maximal

    suppression.

    3.3.2 Flexible Mixtures of Parts

    Yang and Ramanan [131] extend these approaches by introducing a flexible mixture of

    parts model, allowing for greater intra-limb variation.

Rather than using a classical articulated limb model such as that of Marr and Nishihara [75], they introduce a new representation: a mixture of non-orientable pictorial structures. Instead of having ten rigid parts, as the methods described in Section 3.3.1 do, their model has twenty-six rigid parts, which can be combined to form limbs and produce an estimate for the ten parts, as shown in Figure 3.4. Each part has a number T of possible types, learned from training data. Types may include orientations of a part (e.g. horizontal or vertical hand), and may also span semantic classes (e.g. open versus closed hand).

    Model

Let us denote an image by $I$, the location of part $i$ by $p_i = (x, y)$, and its mixture component by $t_i$, with $i \in \{1, \dots, K\}$, $p_i \in \{1, \dots, L\}$, and $t_i \in \{1, \dots, T\}$, where $K$ is the number of parts, $L$ is the number of possible part locations, and $T$ is the number of mixture components per part.


where $dx$ and $dy$ represent the relative location of part $i$ with respect to part $j$. The parameter $w_{(i,j)}^{(t_i, t_j)}$ encodes the expected values for $dx$ and $dy$, tailored for types $t_i$ and $t_j$. So if $i$ is the elbow and $j$ is the forearm, with $t_i$ and $t_j$ specifying vertically-oriented parts (i.e. the arm is at the person's side), we would expect $p_j$ to be below $p_i$ on the image.

    Inference

To perform inference on this model, Yang and Ramanan maximise $S(I, p, t)$ over $p$ and $t$. Since the graph $G$ in Figure 3.5 is a tree, belief propagation (see Section 3.2.2) can again

be used. The score of a particular leaf node $p_i$ with mixture $t_i$ is:

$$\text{score}_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i), \tag{3.19}$$

and for all other nodes, we take into account the messages passed from the node's children:

$$\text{score}_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i) + \sum_{c \in \text{children}(i)} m_c(t_i, p_i), \tag{3.20}$$

where:

$$m_c(t_i, p_i) = \max_{t_c} \Bigl( b_{(i,c)}^{t_i, t_c} + \max_{p_c} \bigl( w_{(i,c)}^{(t_i, t_c)} \cdot \psi(p_i - p_c) + \text{score}_c(t_c, p_c) \bigr) \Bigr). \tag{3.21}$$

Once the messages passed reach the root part ($i = 1$), $\text{score}_1(t_1, p_1)$ contains the best-scoring skeleton model given the root part location $p_1$. Multiple detections can be generated by thresholding this score and using non-maximal suppression.

    3.3.3 Unifying Segmentation and Pose Estimation

    So far, a number of methods for solving either human segmentation or pose estimation

    have been discussed. Some recent work has also been done that attempts to solve both

    tasks together. In this section, we discuss PoseCut [16].


    PoseCut

    Bray et al. [16] tackle the segmentation problem by introducing a pose-specific Markov

    random field (MRF), which encourages the segmentation result to look human-like.

    This prior differs from image to image, as it depends on which pose the human is in.

Given an image, they find the best pose prior $\Theta_{\text{opt}}$ by solving:

$$\Theta_{\text{opt}} = \arg\min_{\Theta} \Bigl( \min_x \Psi_3(x, \Theta) \Bigr), \tag{3.22}$$

where $x$ specifies the segmentation result, and $\Psi_3$ is the Object Category Specific MRF from [65], which defines how well a pose prior $\Theta$ fits a segmentation result $x$. It is defined as follows:

$$\Psi_3(x, \Theta) = \sum_i \Bigl( \phi(I|x_i) + \phi(x_i|\Theta) + \sum_j \bigl( \phi(I|x_i, x_j) + \psi(x_i, x_j) \bigr) \Bigr), \tag{3.23}$$

where $I$ is the observed (image) data, $\phi(I|x_i)$ is the unary segmentation energy, $\phi(x_i|\Theta)$ is the cost of the segmentation given the pose prior (penalising pixels near to the shape being labelled background, and pixels far from the shape being labelled foreground), and the $\psi$ term is a pairwise energy. Finally, $\phi(I|x_i, x_j)$ is a contrast-sensitive term, defined as:

$$\phi(I|x_i, x_j) = \begin{cases} \gamma(i, j), & \text{if } x_i \neq x_j; \\ 0, & \text{if } x_i = x_j, \end{cases} \tag{3.24}$$

where $\gamma(i, j)$ decreases with the difference in RGB values of pixels $i$ and $j$; pixels with similar values will have a high value for $\gamma(i, j)$, since we wish to encourage these pixels to have the same label.

Given a particular pose prior $\Theta$, the optimal configuration $x^* = \arg\min_x \Psi_3(x, \Theta)$ can be found using a single graph cut. The final solution $\arg\min_x \Psi_3(x, \Theta_{\text{opt}})$ is found using the Powell minimisation algorithm [94].


    3.4 Stereo Vision

Stereo correspondence algorithms typically denote one image as the reference image and

    the other as the target image. A dense set of patches is extracted from the reference

    image, and for each of these patches, the best match is found in the target image. The

    displacement between the two patches is known as the disparity; the disparity for each

    pixel in the reference image is stored in a disparity map. It can easily be shown that

    the disparity of a pixel is inversely proportional to its distance from the camera, or its

    depth [105]. A typical disparity map is shown in Figure 3.6.
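To make the inverse relation between disparity and depth concrete: for a rectified (frontoparallel) camera pair with focal length $f$ and baseline $B$ (standard quantities of the rectified setup, not symbols defined elsewhere in this chapter), a scene point at depth $Z$ projects to the left and right images at horizontal positions $x_L$ and $x_R$ with

$$d = x_L - x_R = \frac{fB}{Z}, \qquad \text{equivalently} \qquad Z = \frac{fB}{d},$$

so doubling the distance to the camera halves the disparity.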

    A plethora of stereo correspondence algorithms have been developed over the years.

    Scharstein and Szeliski [102] note that earlier methods can typically be divided into four

    stages: (i) matching cost computation; (ii) cost aggregation; (iii) disparity computation;

and (iv) disparity refinement; later methods can be described in a similar fashion [20, 60, 76, 92, 97].

It is quite common to use the sum of absolute differences measure when finding the matching cost for each pixel. A patch with a height and width of $2n + 1$ pixels for some $n \geq 0$ is extracted from the reference image. Then, for each disparity value $d$, a patch

    is extracted from the target image, and the pixelwise intensity values for each patch are

    compared.

With $L$ and $R$ representing the reference (left) and target (right) images respectively, the cost of assigning disparity $d$ to a pixel $(x, y)$ in $L$ is as follows:

$$C(x, y, d) = \sum_{\Delta x = -n}^{n} \sum_{\Delta y = -n}^{n} \bigl| L(x + \Delta x, y + \Delta y) - R(x + \Delta x - d, y + \Delta y) \bigr|. \tag{3.25}$$

    Evaluating this cost over all pixels and disparities provides a cost volume, on which

    aggregation methods such as smoothing can be applied in order to reduce noise. Disparity

    values for each pixel can then be computed. The simplest method for doing this is just

to find, for each pixel $(x, y)$, the disparity value $d$ which minimises $C(x, y, d)$.
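This simple winner-take-all scheme can be sketched as follows. The Python below is illustrative only (brute-force loops on tiny images; border pixels and disparities that would push the target window outside the image are skipped; the synthetic image pair is an assumption):

```python
import numpy as np

def sad_disparity(left, right, max_disp, n=1):
    """Winner-take-all SAD block matching over (2n+1) x (2n+1) windows.

    `left` is the reference image, `right` the target.  Brute-force
    loops, so this sketch is only suitable for tiny images.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(n, h - n):
        for x in range(n, w - n):
            ref = left[y - n:y + n + 1, x - n:x + n + 1]
            best_cost, best_d = np.inf, 0
            # Only disparities keeping the target window inside the image.
            for d in range(min(max_disp, x - n) + 1):
                tgt = right[y - n:y + n + 1, x - d - n:x - d + n + 1]
                cost = np.abs(ref - tgt).sum()   # the SAD cost (3.25)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

# Synthetic 9x9 pair: a bright square that sits 2 pixels further left in
# the target image, i.e. a true disparity of 2 on the square.
left = np.zeros((9, 9)); left[3:6, 5:8] = 1.0
right = np.zeros((9, 9)); right[3:6, 3:6] = 1.0
print(sad_disparity(left, right, max_disp=3)[4, 6])   # -> 2
```

Note that the textureless background pixels match equally well at every disparity, foreshadowing the problems discussed next.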

    However, such a method is likely to result in a high degree of noise. Additionally, pixels

    immediately outside a foreground object are often given disparities that are higher than


Figure 3.6: An example disparity map produced by a stereo correspondence algorithm: (a) left image; (b) right image; (c) disparity map. Pixels that are closer to the camera appear brighter in colour. Note the dark gaps produced in areas with little or no texture.


    the possibility of controlling video games with the human body. To this end, in the next

    chapter, we will begin exploring the possibility of providing a human scene understanding

    framework, combining pose estimation with segmentation and depth estimation.


Chapter 4

    3D Human Pose Estimation in a Stereo Pair

    of Images

    The problem of human pose estimation has been widely studied in the computer vision

    literature; a survey of recent work is provided in Section 3.3. Despite the large body of

    research focussing on 2D human pose estimation, relatively little work has been done to

    estimate pose in 3D, and in particular, annotated datasets featuring frontoparallel stereo

    views of humans are non-existent.

    In recent years, some research has focussed on combining segmentation and pose

    estimation to produce a richer understanding of a scene [10, 16, 66, 89]. Many of these

    approaches simply put the algorithms into a pipeline, where the result of one algorithm

    is used to drive the other [10, 16, 89]. The problem with this is that it often proves

    impossible to recover from errors made in the early stages of the process. Therefore,

    a joint inference framework, as proposed by Wang and Koller [127] for 2D human pose

    estimation, is desired.

This chapter describes a new algorithm for estimating human pose in 3D, while simultaneously solving the problems of stereo matching and human segmentation. The algorithm uses an optimisation method known as dual decomposition, of which we give an

    overview in Section 4.1.

Following that, a new dataset for two-view human segmentation and pose estimation,
