Beyond Controllers - Human Segmentation, Pose, And Depth Estimation as Game Input Mechanisms



    Beyond Controllers

Human Segmentation, Pose, and Depth Estimation as Game Input Mechanisms

    Glenn Sheasby

Thesis submitted in partial fulfilment of the requirements for the award of

    Doctor of Philosophy

    Oxford Brookes University

    in collaboration with Sony Computer Entertainment Europe

    December 2012


    Abstract

Over the past few years, video game developers have begun moving away from the traditional methods of user input through physical hardware interactions such as controllers or joysticks, and towards acquiring input via an optical interface, such as an infrared depth camera (e.g. the Microsoft Kinect) or a standard RGB camera (e.g. the PlayStation Eye).

Computer vision techniques form the backbone of both input devices, and in this thesis, the latter method of input will be the main focus.

In this thesis, the problem of human understanding is considered, combining segmentation and pose estimation. While focussing on these tasks, we examine the stringent

challenges associated with the implementation of these techniques in games, noting particularly the speed required for any algorithm to be usable in computer games. We also

    keep in mind the desire to retain information wherever possible: algorithms which put

    segmentation and pose estimation into a pipeline, where the results of one task are used

    to help solve the other, are prone to discarding potentially useful information at an early

    stage, and by sharing information between the two problems and depth estimation, we

    show that the results of each individual problem can be improved.

We adapt Wang and Koller's dual decomposition technique to take stereo information

    into account, and tackle the problems of stereo, segmentation and human pose estimation

    simultaneously. In order to evaluate this approach, we introduce a novel, large dataset

    featuring nearly 9,000 frames of fully annotated humans in stereo.

    Our approach is extended by the addition of a robust stereo prior for segmenta-

    tion, which improves information sharing between the stereo correspondence and human

    segmentation parts of the framework. This produces an improvement in segmentation

    results. Finally, we increase the speed of our framework by a factor of 20, using a highly

    efficient filter-based mean field inference approach. The results of this approach compare

    favourably to the state of the art in segmentation and pose estimation, improving on the

    best results in these tasks by 6.5% and 7% respectively.


    Acknowledgements

    Okay... now what?

    (Mike Slackenerny, PhD comic #844)

    It is finished. Although the PhD thesis is a beast that must be tamed in solitude, I

don't believe it's something that can be done entirely alone, and there are many people

    to whom I owe a debt of gratitude.

    My supervisor, Phil Torr, made it possible for me to get started in the first place,

and gave me immeasurable help along the way. While we're talking about how I came to

    be doing a PhD, I should also thank my old boss, Andrew Stoddart, who recommended

    that I apply, and the recession for costing me the software job I was doing after leaving

student life for the first time. I guess my escape velocity wasn't high enough, moving

only two miles from my first alma mater. I'm about 770 miles away now, so that should

    be enough!

    My colleagues at Brookes also helped immensely, from those who helped me settle

    in: David Jarzebowski, Jon Rihan, Chris Russell, Lubor Ladicky, Karteek Alahari, Sam

    Hare, Greg Rogez, and Paul Sturgess; to those who saw me off at the end of it: Paul

    Sturgess, Sunando Sengupta, Michael Sapienza, Ziming Zhang, Kyle Zheng, and Ming-

Ming Cheng. Special thanks are due to Morten Lindegaard, who proof-read large chunks of this thesis, and to my co-authors: Julien Valentin, Vibhav Vineet, Jonathan Warrell,

    and my second supervisor, Nigel Crook.

    Financial support from the EPSRC partnership with Sony is gratefully acknowledged,

    and weekly meetings and regular feedback from Diarmid Campbell helped to guide and

    focus my research. Furthermore, Amir Saffari and the rest of the crew at SCEE London

    Studio provided a dataset, as well as feedback from a professional perspective.

I'd also like to thank my examiners, Teo de Campos, Mark Bishop, and David Duce,

    for taking the time to read my thesis, and for providing useful feedback and engaging

    discussion during the viva.

While struggling through my PhD years, I was kept sane in Oxford by a variety of groups, including the prayer group at St. Mary Magdalene's, Brookes Ultimate Frisbee,

and of course, the Oxford University Bridge Club, where I spent many Monday evenings exercising my mind (and liver), and where I met my wonderful fiancée, the future Dr. Mrs. Dr. Sheasby, Aleksandra: all our successes are shared, but this one I owe entirely to you. You believed in me even when I did not believe in myself, and for that I will be grateful to you for the rest of my life.

Lastly, but most importantly, I'd like to thank my parents, for raising me, for supporting me in all of my endeavours, and for teaching me to question everything.


    Contents

List of Figures

List of Tables

List of Algorithms

1 Introduction
  1.1 Contributions
  1.2 Outline of the Thesis
  1.3 Publications

2 Vision in Computer Games: A Brief History
  2.1 Motion Sensors: Nintendo Wii
    2.1.1 Impact
  2.2 RGB Cameras: EyeToy and Playstation Eye
    2.2.1 Early Games
    2.2.2 Antigrav
    2.2.3 Eye of Judgment
    2.2.4 EyePet
    2.2.5 Wonderbook
  2.3 Depth Sensors: Microsoft Kinect
    2.3.1 Technical Details
    2.3.2 Games
    2.3.3 Vision Applications
    2.3.4 Overall Impact
  2.4 Discussion

3 State of the Art in Selected Vision Algorithms
  3.1 Inference on Graphs: Energy Minimisation
    3.1.1 Conditional Random Fields
    3.1.2 Submodular Terms
    3.1.3 The st-Mincut Problem
    3.1.4 Application to Image Segmentation
  3.2 Inference on Trees: Belief Propagation
    3.2.1 Message Passing
    3.2.2 Belief Propagation
  3.3 Human Pose Estimation
    3.3.1 Pictorial Structures
    3.3.2 Flexible Mixtures of Parts
    3.3.3 Unifying Segmentation and Pose Estimation
  3.4 Stereo Vision
    3.4.1 Humans in Stereo
  3.5 Discussion

4 3D Human Pose Estimation in a Stereo Pair of Images
  4.1 Joint Inference via Dual Decomposition
    4.1.1 Introduction to Dual Decomposition
    4.1.2 Related Approaches
  4.2 Humans in Two Views (H2view) Dataset
    4.2.1 Evaluation Metrics Used
  4.3 Inference Framework
    4.3.1 Segmentation Term
    4.3.2 Pose Estimation Term
    4.3.3 Stereo Term
    4.3.4 Joint Estimation of Pose and Segmentation
    4.3.5 Joint Estimation of Segmentation and Stereo
  4.4 Dual Decomposition
    4.4.1 Binarisation of Energy Functions
    4.4.2 Optimisation
    4.4.3 Solving Sub-Problem L1
    4.4.4 Solving Sub-Problem L2
    4.4.5 Solving Sub-Problem L3
  4.5 Weight Learning
    4.5.1 Assumption
    4.5.2 Method
  4.6 Experiments
    4.6.1 Performance
    4.6.2 Runtime
  4.7 Discussion

5 A Robust Stereo Prior for Human Segmentation
  5.1 Related Work
    5.1.1 Range Move Formulation
  5.2 Flood Fill Prior
  5.3 Application: Human Segmentation
    5.3.1 Original Formulation
    5.3.2 Stereo Term fD
    5.3.3 Segmentation Terms fS and fSD
    5.3.4 Pose Estimation Terms fP and fPS
    5.3.5 Energy Minimisation
    5.3.6 Modifications to D Vector
  5.4 Experiments
    5.4.1 Segmentation
    5.4.2 Pose Estimation
    5.4.3 Runtime
  5.5 Conclusion

6 An Efficient Mean Field Based Method for Joint Estimation of Human Pose, Segmentation, and Depth
  6.1 Mean Field Inference
    6.1.1 Introduction to Mean-Field Inference
    6.1.2 Simple Illustration
    6.1.3 Performance Comparison: Mean Field vs Graph Cuts
  6.2 Model Formulation
    6.2.1 Joint Energy Function
  6.3 Inference in the Joint Model
  6.4 Experiments
    6.4.1 Segmentation Performance
    6.4.2 Pose Estimation Performance
  6.5 Discussion

7 Conclusions and Future Work
  7.1 Summary of Contributions
  7.2 Directions for Future Research

Bibliography


    List of Figures

2.1 Duck Hunt screenshot
2.2 Wii Sensor Bar
2.3 Putting action from Wii Sports
2.4 EyeToy and Playstation Eye cameras
2.5 EyeToy: Play
2.6 Antigrav
2.7 Eye of Judgment
2.8 EyePet
2.9 Wonderbook design
2.10 A Wonderbook scene
2.11 Kinect games
2.12 Furniture removal guidelines in Kinect instruction manual
3.1 Image for our toy example
3.2 Segmentation results on the toy image
3.3 Skeleton Model
3.4 Part models
3.5 Yang-Ramanan skeleton model
3.6 Stereo example
4.1 Subgradient example
4.2 Dual functions versus the cost variable
4.3 Values of the dual function g()
4.4 Accuracy of Part Proposals
4.5 CRF Formulation
4.6 Foreground weightings on a cluttered image from the Parse dataset
4.7 Results using just fS
4.8 Part selection
4.9 Limb recovery due to J1 term
4.10 Master-slave update process
4.11 Decision tree: parameter optimisation
4.12 Sample stereo and segmentation results
4.13 Segmentation results on H2view
4.14 Results from H2view dataset
5.1 Flood fill example
5.2 Three successive range expansion iterations
5.3 The new master-slave update process
5.4 Segmentation results on H2View
5.5 Comparison of segmentation results on H2View
5.6 Failure cases of segmentation
6.1 Segmentation of the Tree image
6.2 Basic 6-part skeleton model
6.3 Segmentation results on H2view compared to other methods
6.4 Further segmentation results
6.5 Qualitative results on H2View dataset


    List of Tables

4.1 Table of Notation
4.2 Evaluation of fS only
4.3 Evaluation of fS combined with fPS
4.4 List of weights learned
4.5 Evaluation of segmentation performance
4.6 Dual Decomposition results
5.1 Segmentation results on the H2View dataset
5.2 Results (given in % PCP) on the H2view test sequence
6.1 Evaluation of mean field on the MSRC-21 dataset
6.2 Quantitative segmentation results on the H2View dataset
6.3 Pose estimation results on the H2View dataset


    List of Algorithms

4.1 Parameter optimisation algorithm for dual decomposition framework
5.1 Generic flood fill algorithm for an image I of size W × H
5.2 doLinearFill: perform a linear fill from the seed point (sx, sy)
6.1 Naïve mean field algorithm for fully connected CRFs


Chapter 1

    Introduction

    Over the past several years, a wide range of commercial applications of computer vision

    have begun to emerge, such as face detection in cameras, augmented reality (AR) in shop

    displays, and the automatic construction of image panoramas. Another key application

    of computer vision that has become popular recently is computer games, with commercial

products such as Sony's Playstation Eye and Microsoft's Kinect selling millions of units

    [98].

    In creating these products, video game developers have been able to partially expand

    the demographic of players. They have done this by moving away from the traditional

    controller pad method of user input, and enabling the player to control the game using

    other objects, such as books or AR markers, and even their own bodies. Some of the

    most popular games that are either partially or completely driven using human motion

    include sports games such as Wii Sports and Kinect Sports, and party games such as

the EyeToy: Play series. More recent games, such as EyePet and Wonderbook: Book of

    Spells, combine motion information with object detection. A more thorough description

    of these games can be found in Chapter 2.

    Three main computer vision techniques are used to obtain input instructions for these

    games: motion detection, object detection, and human pose estimation. The first of these,

    motion detection, involves detecting changes in image intensity across several frames; in

    video games, motion detection is used in particular areas of the screen as the player


attempts to complete tasks. Secondly, object detection involves determining the presence,

    position and orientation of particular objects in the frame. The object can be a simple

    shape (e.g. a quadrilateral) or a complex articulated object, such as a cat. In certain

    video games, the detection of AR markers is used to add computer graphics to an image

of the player's surroundings. Finally, the goal of human pose estimation is to determine

the position and orientation of each of a person's body parts. Using images obtained via

an infrared depth sensor, Kinect games can track human poses over several frames, in

    order to detect actions [110].

Theoretically, an image contains a lot more information than a controller can supply. However, the player can only provide information via a relatively limited set of

    actions, either with their own body, or using some kind of peripheral object which can

    be recognised.

    The main aim of this thesis is to explore and expand the applicability of human

    pose estimation to video games. After an analysis of the techniques that have already

    been used, and of the current state of these techniques in research, our main application

    will be presented. Using a stereo pair of cameras, we will develop a system that unifies

    human segmentation, pose estimation, and depth estimation, solving the three tasks

    simultaneously. In order to evaluate this system, we will present a large dataset containing

    stereo images of humans in indoor environments where video games might be played.

    1.1 Contributions

    In summary, the principal contributions of this thesis are as follows:

• A system for the simultaneous segmentation and pose estimation of humans, as well as depth estimation of the entire scene. This system is further developed by the introduction of a stereo-based prior; the speed of the system is subsequently improved by applying a state-of-the-art approximate inference technique.

• The introduction of a novel, 9,000-image dataset of humans in two views.


Throughout the thesis, the pronoun "we" is used instead of "I". This is done to follow

    scientific convention; the contents of this thesis are the work of the author. Where others

have contributed towards the work, their contributions will be acknowledged in a short

    section at the end of each chapter.

    1.2 Outline of the Thesis

Chapter 2 contains a description of some of the various attempts that games developers

    have made to provide alternatives to controllers, and the impact that these games have

had on the video games community. Starting with the accelerometer and infrared detection-based solutions provided by the Nintendo Wii, we observe the increasing amount

    of integration of vision techniques, with this trend demonstrated by the methods used

by Sony's EyeToy and PlayStation Eye-based games over the past several years. Finally,

    we consider the impact that depth information can have in enabling the software to

determine the pose of the player's body, as shown by the Microsoft Kinect.

Following on from that, Chapter 3 contains an appraisal of related work in computer

    vision that might be applied in computer games. We consider the different approaches

    commonly used to solve the problems of segmentation and human pose estimation, and

    give an overview of some of the approaches that have been used to provide 3D information

    given a pair of images from a stereo camera.

    Chapter 4 describes a novel framework for the simultaneous depth estimation of a

    scene, and segmentation and pose estimation of the humans within that scene. Using a

    stereo pair of images as input provides us with the ability to compute the distance of each

    pixel from the camera; additionally, we can use standard approaches to find the pixels

    occupied by the human, and predict its pose. In order to share information between

    these three approaches, we employ a dual decomposition framework [62,127]. Finally, to

    evaluate the results obtained by our method, we introduce a new dataset, called Humans

    in Two Views, which contains almost 9,000 stereo pairs of images of humans.

In Chapter 5, we extend this approach to improve the quality of information shared


    between the segmentation and depth estimation parts of the algorithm. Observing that

the human occupies a continuous region of the camera's field of view, we infer that the

    distance of human pixels from the camera will vary only in certain ways, without sharp

boundaries (we say that the depth is smooth). Therefore, starting from pixels that we are

    very confident lie within the human, we can extract a reliable initial segmentation from

    the depth map, significantly improving the overall segmentation results.
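The idea of growing a segmentation outwards from confident pixels over a smooth depth map can be sketched as a flood fill that stops at depth discontinuities. This is a minimal illustration of the principle only, assuming a dense depth map and a single confident seed pixel; the 4-connectivity and the smoothness threshold are arbitrary choices for the sketch, not the formulation developed in Chapter 5.

```python
from collections import deque
import numpy as np

def smooth_region(depth, seed, max_step=0.05):
    """Grow a region from `seed` over a depth map (in metres), adding a
    neighbouring pixel whenever its depth differs from the current pixel's
    by less than `max_step`. Illustrative 4-connected flood fill."""
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(depth[ny, nx] - depth[y, x]) < max_step):
                mask[ny, nx] = True      # depth is smooth here: same surface
                queue.append((ny, nx))
    return mask

# A person at ~2 m in front of a wall at ~4 m: the fill stops at the depth jump.
depth = np.full((10, 10), 4.0)
depth[2:8, 3:7] = 2.0                    # smooth foreground region
person = smooth_region(depth, seed=(5, 5))
print(person.sum())                      # 24 pixels: the 6x4 foreground block
```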

    The drawback of the dual decomposition-based approach, however, is that it is much

    too slow to be used in computer games. In Chapter 6, we adapt our framework in order

to apply an approximate, but very fast, inference approach based on mean field [64]. Our use of this new inference approach enables us to improve the information sharing between

    the three parts of the framework, providing an improvement in accuracy, as well as an

    order-of-magnitude speed improvement.

    While the mean-field inference approach is much quicker than the dual decomposition-

based approach, its speed (close to 1 fps) is still not fast enough for real-time applications

    such as computer games. In Chapter 7, the thesis concludes with some suggestions for

    how to further improve the speed, as well as some other promising possible directions for

    future research. The concluding chapter also contains a summary of the work presented

    and contributions made.

    1.3 Publications

    Several chapters of this thesis first appeared as conference publications, as follows:

• G. Sheasby, J. Warrell, Y. Zhang, N. Crook, and P.H.S. Torr. Simultaneous human segmentation, depth and pose estimation via dual decomposition. In British Machine Vision Conference, Student Workshop, 2012. (Chapter 4, [108])

• G. Sheasby, J. Valentin, N. Crook, and P.H.S. Torr. A robust stereo prior for human segmentation. In Asian Conference on Computer Vision (ACCV), 2012. (Chapter 5, [107])


    The contributions of co-authors are acknowledged in the corresponding chapters. The first

paper [108] received the best student paper award at the BMVC workshop. Additionally, some sections of Chapter 6 form part of a paper that is currently under submission

    at a major computer vision conference.


Chapter 2

    Vision in Computer Games: A Brief History

    Figure 2.1: A screenshot [84] from Duck Hunt, an early example of a game that used

    sensing technology.

The purpose of a game controller is to convey the user's intentions to the game. A wide variety of input methods, for instance a mouse and keyboard, a handheld controller,

    or a joystick, have been employed for this purpose. Video games using some sort of

    sensing technology (instead of, or in addition to, those listed above) have been available

for several decades. In 1984, Nintendo released a light gun, which detects light emitted by

CRT monitors; this release was made popular by the game Duck Hunt for the Nintendo


Figure 2.2: The sensor bar, which emits infrared light that is detected by Wii remotes. The picture [81] was taken with a camera sensitive to infrared light; the LEDs are not visible to the human eye.

Entertainment System (NES), in which the player aims the gun at ducks that appear

on the screen (Figure 2.1). When the trigger is pulled, the screen is turned black for one

frame, and then the target area is turned white in the next frame. If the gun is pointed at the

correct place, it detects this change in intensity and registers a hit.
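The two-frame test described above can be illustrated as follows. This is a hypothetical sketch of the dark-then-bright check, with made-up names and values; it is not NES code, and a real light gun samples an analogue photodiode rather than pixel arrays.

```python
def light_gun_hit(frame_black, frame_target, aim_sample, threshold=128):
    """Duck Hunt-style hit test: the screen is blanked for one frame, then the
    target area alone is lit white. The gun's sensor samples brightness at the
    aimed point in both frames; a dark-then-bright reading means a hit.

    frame_black, frame_target: 2-D greyscale frames (values 0-255).
    aim_sample: (row, col) of the pixel the gun is pointed at.
    """
    y, x = aim_sample
    # Hit only if the point was dark during the blank frame
    # and bright during the target frame.
    return frame_black[y][x] < threshold <= frame_target[y][x]

black = [[0] * 4 for _ in range(4)]
target = [[0] * 4 for _ in range(4)]
target[1][2] = 255                            # the duck's target area, lit white
print(light_gun_hit(black, target, (1, 2)))   # True: aimed at the duck
print(light_gun_hit(black, target, (0, 0)))   # False: aimed elsewhere
```

The blank frame is what makes the test robust: any point that reads bright in both frames (e.g. a lamp) fails the dark-frame condition.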

    Over the past few years, technological developments have made it easier for video

    game developers to incorporate sensing devices to augment, or in some cases replace, the

    traditional controller pad method of user input. These devices include motion sensors,

    RGB cameras, and depth sensors. The following sections give a brief summary of the

    applications of each in turn.

    2.1 Motion Sensors: Nintendo Wii

The Wii is a seventh-generation games console that was released by Nintendo in late 2006.

Unlike previous consoles, the unique selling point of the Wii was a new form of player

    interaction, rather than greater power or graphics capability. This new form of interaction

    was theWii Remote, a wireless controller with motion sensing capabilities. The controller

    contains an accelerometer, enabling it to sense acceleration in three dimensions, and an

    infrared sensor, which is used to determine where the remote is pointing [49].

    Unlike light guns, which sense light from CRT screens, the remote detects light from the console's sensor bar, which features ten infrared LEDs (Figure 2.2). The light from each end of the bar is detected by the remote's optical sensor as two bright lights. Triangulation is used to determine the distance between the remote and the sensor bar, given


    Figure 2.3: An example of the use of the Wii Remote's motion sensing capabilities to control game input. Here, the player moves the remote as he would move a putter when playing golf. The power of the putt is determined by the magnitude of the swing [74].

    the observed distance between the two bright lights and the known distance between the

    LED arrays.
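    This similar-triangles calculation can be sketched in a few lines. The bar width and focal length below are illustrative assumptions, not Wii specifications:

```python
def estimate_distance(pixel_separation, bar_width_mm=200.0, focal_length_px=1300.0):
    """Estimate the remote-to-bar distance (in mm) from the observed separation
    (in pixels) of the two bright IR blobs, via the pinhole-camera relation
    bar_width / distance = pixel_separation / focal_length."""
    if pixel_separation <= 0:
        raise ValueError("the two blobs must be separated by at least one pixel")
    return bar_width_mm * focal_length_px / pixel_separation

# The closer the remote is to the bar, the further apart the blobs appear:
# estimate_distance(130) -> 2000.0 mm; estimate_distance(260) -> 1000.0 mm.
```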

    The capability of the Wii to track position and motion enables the player to mimic actual game actions, such as swinging a sword or tennis racket. This capability is demonstrated by games such as Wii Sports, which was included with the games console in the first few years after its release. The remote can be used to mimic the action of bowling a ball, or swung like a tennis racket, a baseball bat, or a golf club (Figure 2.3).

    2.1.1 Impact

    The player still uses a controller, although in some games, like Wii Sports, it is now

    the position and movement of the remote that is used to influence events in-game. This


    (a) EyeToy [114] (b) PlayStation Eye [32]

    Figure 2.4: The two webcam peripherals released by Sony.

    makes playing the games more tiring than before, especially if they are played with vigour. However, a positive effect is that the control system is more intuitive, meaning that people who don't normally play traditional video games might still be interested in owning a Wii console [103].

    2.2 RGB Cameras: EyeToy and Playstation Eye

    While Nintendo's approach uses the position and motion of the controller to enhance gameplay, other games developers have made use of the RGB images provided by cameras. The first camera released as a games console peripheral and used as an input device for a computer game was the EyeToy, which was released for the PlayStation 2 (PS2) in 2003. This was followed in 2007 by the PlayStation Eye (PS Eye) for the PlayStation 3 (PS3).

    Some of Sony's recent games have used the PlayStation Move (PS Move) in addition to the PS Eye. The Move is a handheld plastic controller which has a large, bright ball on the top; the hue of this ball can be altered by the software. During gameplay, the ball is easily detectable by the software, and is used as a basis for determining the position, orientation and motion of the PS Move [113].

    The degree to which vision techniques have been applied to EyeToy games has varied widely. Some games only use the camera to allow the user to see themselves, whereas

    others require significant levels of image processing. The following sections contain descriptions of some of the games that have used image processing to enhance gameplay.

    (a) Ghost Catcher [48] (b) Keep Up [117] (c) Kung Foo [42]

    Figure 2.5: Screenshots of three mini-games from EyeToy: Play.

    2.2.1 Early Games

    At its original release, the EyeToy was bundled with EyeToy: Play, which features twelve mini-games. The gameplay is simplistic, as is common with party-oriented video games. Many of the mini-games rely on motion detection; for instance, the object of Ghost Catcher is to fill ghosts with air and then pop them, and this is done by repeatedly waving your hands over them. Others, such as Keep Up, use human detection; the player is required to keep a ball in the air, so the game needs to determine whether there is a person in the area where the ball is.

    A third use of vision in this game occurs in Kung Foo; in this mini-game, the player stands in the middle of the camera's field of view, and is instructed to hit ninjas that fly onto the screen from various directions. Again, motion detection can be used to determine whether a hit has been registered, as it doesn't matter which body part was used to perform the hit.

    Impact

    As the mini-games in EyeToy: Play only require simplistic image understanding techniques, specifically the detection of motion within a small portion of the camera's field of view, the underlying techniques seemed to work well. As with Wii Sports, the game was aimed at casual gamers rather than traditional, or hardcore, gamers.


    (a) Antigrav screenshot (b) Close-up of user display

    Figure 2.6: A screenshot [91] from Antigrav, where the player has extended their right arm to grab an object (the first of three) and thus score some points. The user display shows where the game has detected the player's hands to be.

    2.2.2 Antigrav

    Antigrav, a PS2 game that utilises the EyeToy, is a futuristic trick-based snowboarding game, and was brought out by Harmonix in late 2004. The player takes control of a character in the game, and guides them down a linear track. The game uses face tracking to control the character's movements, enabling the player to increase the character's speed by ducking, and change direction by leaning. In addition, the player's hands are tracked, and their hand position is used to infer a pose, enabling the player to literally grab for collectible objects on-screen. The player can see what the computer calculates their head and hand positions to be in the form of a small diagram in the corner of the screen, as shown in Figure 2.6. A GameSpot review [24] points out:

    this is good for letting you know when the EyeToy is misreading your movements, which takes place more often than it ought to.

    The review, like other reviews of PS2 EyeToy releases, hints at further technological limitations impairing the enjoyment of the game:

    Harmonix pushes the limits of what you should expect from an EyeToy entry... unfortunately, EyeToy pushes back, and its occasional inconsistency hobbles an otherwise bold and enjoyable experience.


    Impact

    The reviews above imply that the head and hand detection techniques employed by the game were not completely effective, meaning that users are often frustrated by their actions not being recognised by the game due to failure of the tracking system. This highlights the importance of accuracy when developing vision algorithms for video games: if your tracking algorithm fails around 5% of the time, then the 95% accuracy is, quantitatively, extremely good. However, during a 3-minute run down a track in Antigrav, this could result in a failure of the tracking system that takes several seconds to recover from, which would be clearly noticeable to gamers.

    2.2.3 Eye of Judgment

    Figure 2.7: An image [3] showing the set-up of Eye of Judgment. The camera is pointed at a cloth, on which several cards are placed. These cards are recognised by the game, and the on-screen display shows the objects or creatures that the cards represent.

    In 2007, Sony released Eye of Judgment, a role-playing card-game simulation that can be compared to the popular card game Magic: The Gathering. The PS3 game comes with a cloth, and a set of cards with patterns on them that the computer can easily recognise


    (Figure 2.7). It can recognise the orientation as well as the identity of the cards, enabling

    them to have different functions when oriented differently. A review reported very

    few hardware-related issues, principally because of the pattern-based card recognition

    system [122].

    Since then, the PS3 saw very little PS Eye-related development before the release of EyePet in October 2009; in the two years between the releases of Eye of Judgment and EyePet, the use of the PS Eye was generally limited to uploading personalised images for game characters.

    2.2.4 EyePet

    (a) EyePet's AR marker (b) EyePet with trampoline

    Figure 2.8: An example of augmented reality being used in EyePet [4].

    EyePet features a virtual pet, which interacts with people and objects in the real world using fairly crude motion sensing. For example, if the player rolls a ball towards the pet, it will jump out of the way. Another major feature of the game is the use of augmented reality: a card with a specific pattern is detected in the camera's field of view, and a "magic toy" (a virtual object that the pet can interact with, such as a trampoline, a bubble-blowing monkey, or a tennis player) is shown on top of the card (see Figure 2.8).


    Impact

    Again, EyePet uses fairly simplistic vision techniques, with marker detection and a motion buffer being used throughout the game. This prevents it from receiving the sort of criticism that was associated with Antigrav.

    Although it was generally well-received, even EyePet did not escape criticism for the limitations of its technology, which "can't help but creak at times" according to a review published in Eurogamer [129]. The review goes on to say that performance is "robust under strong natural light, but patchy under electric light in the evening".

    This sort of comment shows the unforgiving nature of video gamers, or at least of video game reviewers: for a vision technique to be useful in a game, it needs to be able to work under a very wide variety of environments and lighting conditions.

    2.2.5 Wonderbook

    (a) (b)

    Figure 2.9: The Wonderbook (a) [82] is used with the PlayStation Move controller. The interior (b) [115] features AR markers, as well as markings on the border to identify the edge of the book, and ones near the edge of the page, which help to identify the page quickly.

    Wonderbook: Book of Spells, released by Sony in November 2012, is the first in an upcoming series of games that will use computer vision methods to enhance gameplay. The games will be centred upon a book whose pages contain augmented reality markers and other patterns (Figure 2.9). These are detected by various pattern recognition techniques, in order to determine where the book is, and which pages are currently visible. Once


    Figure 2.10: After the Wonderbook is detected, gameplay objects can be overlaid on-screen. In this image [116], a 3D stage is superimposed onto the book.

    this is known, augmented reality can be used to replace the image of the book with, for

    example, a burning stage (Figure 2.10).

    In Book of Spells, the book becomes a spell book, and through the gameplay, spells from the Harry Potter series are introduced [53]. At various points in the game, the player must interact with the book, for example to put out fires by patting the book. Skin detection algorithms are used to ensure that the player's hands appear to occlude the spellbook, rather than going through it.

    The generality of the book enables it to be used in multiple different kinds of games. The BBC's Walking with Dinosaurs will be made into an interactive documentary, with the player first excavating and completing dinosaur skeletons, and then feeding the dinosaurs using the PS Move [55]. It remains to be seen how the final versions of these games will be appraised by reviewers and customers, and thus whether the Wonderbook franchise will have a significant impact on the video gaming market.

    2.3 Depth Sensors: Microsoft Kinect

    While RGB cameras can be useful in enhancing gameplay with vision techniques, the

    extra information provided by depth cameras makes it significantly easier to determine

    the structure of a scene. This enables games developers to provide a new way of playing.

    With Kinect, which was released in November 2010, Microsoft offer a system where you


    (a) (b) (c)

    Figure 2.11: A selection of different games available for Kinect. (a) Kinect Sports [78]: two players compete against each other at football. (b) Dance Central [7]: players perform dance moves, which are tracked and judged by the game. (c) Kinect Star Wars [99]: players swing a lightsabre by making sweeping movements with their arm.

    are the controller. Using an infrared depth sensor to track 3D movement, they generate

    a detailed map of the scene, significantly simplifying the task of, for example, tracking

    the movement of a person.

    2.3.1 Technical Details

    The Kinect provides a 320 × 240 16-bit depth image, and a 640 × 480 32-bit RGB image, both running at 30 frames per second (fps); the depth sensor has an active range of 1.2 to 3.5 metres [93]. The skeletal detection system, used for detecting the human body in each frame, is based on random forest classifiers [110], and is capable of tracking a twenty-link skeleton of up to two active players in real-time.¹ The software also provides an object-specific segmentation of the people in the scene (i.e. different people are segmented separately), and further enhances the player's experience by using person recognition to provide greetings and content.

    2.3.2 Games

    As with the Nintendo Wii, the Kinect was launched along with a sports game, namely Kinect Sports. The controls are intuitive: the player makes a kicking motion in order

    ¹ In an articulated body, a link is defined as an inflexible part of the body. For example, if the flexibility of fingers and thumbs is ignored, each arm could be treated as three links, with one link each for hand, forearm and upper arm.


    Figure 2.12: The furniture removal guidelines in the Kinect instruction manual [79] advise the player to move tables etc. that might block the camera's view, which may cause a problem for some users.

    to kick a football, or runs on the spot in the athletics mini-games. No controllers or

    buttons are required, which makes the games very easy to adapt to, although some of

    the movements need to be exaggerated in order for the game to recognise them [18].

    Another intuitive game is Dance Central, which uses the Kinect's full body tracking capabilities to compare the player's dance moves to those shown by an on-screen instructor. The object of the game is to imitate these moves in time with the music. This can be compared to classic games like Dance Dance Revolution, with the difference that the player's whole body is now used, enabling a greater variety of moves and adding an element of realism [128].

    Up until now, games developers have struggled to produce a game that uses the Kinect's capabilities, yet still appeals to the serious gamer. One attempt was made in the 2011 release Kinect Star Wars, in which the player uses their arms to control a lightsaber, making sweeping or chopping motions to remove obstacles, and to defeat enemies. However, this game was criticised due to its inability to keep up with fast and frantic arm motions [126].

    A common problem with the Kinect model of gaming is that it is necessary to stand a reasonable distance away from the camera (2 to 3 metres is the recommended range),


    which makes gaming very difficult in small rooms, especially as any furniture will need

    to be moved away (Figure 2.12).

    2.3.3 Vision Applications

    Since its release, and the subsequent release of an open-source software development kit [83], the Kinect has been used in a wide variety of non-gaming related work by computer vision researchers. Oikonomidis et al. [87] developed a hand-tracking system capable of running at 15 fps, while Izadi et al. [54] perform real-time 3D reconstruction of indoor scenes by slowly moving the Kinect camera around the room. The Kinect has also been shown to be a useful tool for easily collecting large amounts of training data [47]. However, due to IR interference, the depth sensor does not work in direct sunlight, making it unsuitable for outdoor applications such as pedestrian detection [39].

    2.3.4 Overall Impact

    The Kinect has had a huge impact, selling 19 million units worldwide in its first eighteen months. This has helped Microsoft improve sales of the Xbox 360 year-on-year, despite the console now being in its seventh year; this is the reverse of the trend shown by competing consoles [98]. The method of controlling games using the human body rather than a controller is revolutionary, and the technology has also had a significant effect on vision research, as mentioned in Section 2.3.3 above.

    2.4 Discussion

    To date, a number of vision methods that use RGB cameras have been introduced to

    the video gaming community. However, these tend to be low-level (motion detection or

    marker detection) rather than high-level: if the only information given is an RGB signal,

    unconstrained object detection and human pose estimation are neither accurate nor fast

    enough to be useful in video games.


    The depth camera used in the Microsoft Kinect has provided a huge leap forward in this area, although the cost of this peripheral (which had a recommended retail price of £129.99 at release, around four times more than the PS Eye) means that an improvement in the RGB-based techniques would be desirable. The next chapter contains an appraisal of related work in computer vision that might be of interest to games developers, and provides background for this thesis.


    Chapter 3

    State of the Art in Selected Vision Algorithms

    While we have seen in Chapter 2 that computer vision techniques are beginning to have a

    profound effect on computer games, there are a number of research areas which could be

    applied to further transform the gaming industry. Accurate object segmentation would

    allow actual objects, or even people, to be taken directly from the player's surroundings

    and put into the virtual environment of the game. Human motion tracking could be

    used to allow the player to navigate a virtual world, for instance by steering a vehicle.

    Finally, human pose estimation could be used to allow the player to control an avatar in

    a platform or role-playing game. In this chapter, we will discuss the current state of the

    art in energy minimisation, human pose estimation, segmentation, and stereo vision.

    In order for computer vision techniques like localisation and pose estimation to be

    suitable for use in computer games, the algorithm that applies the technique needs to

    respond in real time as well as being accurate. A fast algorithm is necessary because

    the results (e.g. pose estimates) need to be used in real-time so that they can affect

    the game in-play; very high accuracy is a requirement because mistakes made by the

    game will undoubtedly frustrate the user (see [24] and Section 2.2.2). The problem is

    to find a suitable balance between these two requirements (a faster algorithm might

    involve approximate solutions, and hence could be less accurate). This may involve


    tweaking existing algorithms to produce significant speed increases without any loss in

    accuracy, or developing novel and significantly more accurate algorithms that still have

    speed comparable to current state-of-the-art algorithms.

    3.1 Inference on Graphs: Energy Minimisation

    Many of the most popular problems in computer vision can be framed as energy minim-

    isation problems. This requires the definition of a function, known as an energy function,

    which expresses the suitability of a particular solution to the problem. Solutions that are

    more probable should give the energy function a lower value; hence, we wish to find the

    solution that gives the lowest value.

    3.1.1 Conditional Random Fields

    Suppose we have a finite set V of random variables, to which we wish to assign labels from a label set L. If all the variables are independent, then this problem is easily solvable - just find the best label for each variable. However, in general we have relationships between variables. Let E be the set of pairs of variables {v1, v2} ⊆ V which are related to one another.

    We can then construct a graph G = (V, E) which specifies both the set of variables, and the relationships between those variables. G is a directed graph if the pairs in E are ordered; this enables us to construct graphs where, for some v1, v2, we have (v1, v2) ∈ E, but (v2, v1) ∉ E.

    Given some observed data X, we can assign a set {yi : vi ∈ V} of values to the variables in V. Let f denote a function that assigns a label f(vi) = yi to each vi ∈ V. Now, suppose that we also have a probability function p that gives us the probability of


    a particular labelling {f(vi) : vi ∈ V} given observed data X. Then:

    Definition 3.1

    (X, V) is a conditional random field if, when conditioned on X, the variables V obey the Markov property with respect to G:

    p(f(vi) = yi | X, {f(vj) : j ≠ i}) = p(f(vi) = yi | X, {f(vj) : (vi, vj) ∈ E}).  (3.1)

    In other words, each output variable yi only depends on its neighbours [72].
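    The Markov property of Definition 3.1 can be checked numerically on a tiny chain-structured model, where p is a Gibbs distribution over three binary variables. The potentials below are illustrative choices, not taken from any particular model:

```python
from itertools import product
from math import exp

unary = [[0.3, 0.9], [0.5, 0.1], [0.7, 0.2]]  # cost of labels 0/1 at v1, v2, v3

def pairwise(a, b):
    return 0.0 if a == b else 1.2  # smoothness cost on the chain edges

def joint(y):
    """Unnormalised Gibbs probability of a labelling y = (y1, y2, y3)
    on the chain v1 - v2 - v3."""
    energy = (sum(unary[i][y[i]] for i in range(3))
              + pairwise(y[0], y[1]) + pairwise(y[1], y[2]))
    return exp(-energy)

def p_y1_given(y2, y3=None):
    """p(y1 = 1 | y2) or p(y1 = 1 | y2, y3), by summing the joint."""
    total = ones = 0.0
    for y in product((0, 1), repeat=3):
        if y[1] == y2 and (y3 is None or y[2] == y3):
            total += joint(y)
            ones += joint(y) if y[0] == 1 else 0.0
    return ones / total

# v1 and v3 are not neighbours, so additionally conditioning on y3
# leaves the conditional distribution of y1 unchanged.
```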

    3.1.2 Submodular Terms

    Now we consider set functions, which are functions whose input is a set. For example, suppose we have a set Y of possible variable values, and a set V, with size ℓ = |V|, of variables vi which each take a value yi ∈ Y. A function f which takes as input an assignment of these variables {yi : vi ∈ V} is a set function.

    Energy functions are set functions f : Y^ℓ → R+ ∪ {0}, which take as input the variable values {yi : vi ∈ V}, and output some non-negative real number. If the variable values are binary, then this f is a binary set function f : 2^ℓ → R+ ∪ {0}.

    Definition 3.2

    A binary set function f : 2^ℓ → R+ ∪ {0} is submodular if and only if for every pair of sets S, T ⊆ V we have that:

    f(S) + f(T) ≥ f(S ∪ T) + f(S ∩ T).  (3.2)


    For example, if ℓ = 2, S = [1, 0] and T = [0, 1], a submodular function will satisfy the following inequality [104]:

    f([1, 0]) + f([0, 1]) ≥ f([1, 1]) + f([0, 0]).  (3.3)

    From Schrijver [104], we also have the following proposition:

    Proposition 3.1 The sum of submodular functions is submodular.

    Proof It is sufficient to prove that, given two submodular functions f : A → R+ ∪ {0} and g : B → R+ ∪ {0}, h = f + g : A ∪ B → R+ ∪ {0} is submodular.

    h(S) + h(T) = (f + g)(S) + (f + g)(T)
    = f(S|A) + g(S|B) + f(T|A) + g(T|B)
    = (f(S|A) + f(T|A)) + (g(S|B) + g(T|B))
    ≥ (f((S ∪ T)|A) + f((S ∩ T)|A)) + (g((S ∪ T)|B) + g((S ∩ T)|B))
    = f((S ∪ T)|A) + g((S ∪ T)|B) + f((S ∩ T)|A) + g((S ∩ T)|B)
    = h(S ∪ T) + h(S ∩ T). ∎
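    For a small number of variables, both Definition 3.2 and Proposition 3.1 can be verified by brute force, taking unions and intersections elementwise on binary indicator vectors. The example functions below are illustrative:

```python
from itertools import product

def is_submodular(f, n):
    """Check f(S) + f(T) >= f(S | T) + f(S & T) over all pairs of binary
    vectors of length n, with union/intersection taken elementwise."""
    for S in product((0, 1), repeat=n):
        for T in product((0, 1), repeat=n):
            union = tuple(max(a, b) for a, b in zip(S, T))
            inter = tuple(min(a, b) for a, b in zip(S, T))
            if f(S) + f(T) < f(union) + f(inter):
                return False
    return True

# A cut-counting pairwise term (the kind used in segmentation) is submodular,
f = lambda z: sum(z[i] != z[i + 1] for i in range(len(z) - 1))
g = lambda z: 2 * f(z)
h = lambda z: f(z) + g(z)  # ...and so is a sum of such terms (Proposition 3.1),
# whereas a "reward both ends" term such as z -> z[0] * z[1] is not.
```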

    As shown by Kolmogorov and Zabih [61], one way of minimising energy functions, particularly submodular energy functions, is via graph cuts, which we will now introduce.

    3.1.3 The st-Mincut Problem

    In this section, we will consider directed graphs G = (V, E) that have special nodes s, t ∈ V such that for all vi ∈ V \ {s, t}, we have (s, vi) ∈ E, (vi, t) ∈ E, (vi, s) ∉ E, and (t, vi) ∉ E. We say that s is the source node and t is the sink node of the graph. Such a graph is also known as a flow network. Let c be a function c : E → R+ ∪ {0}, where for each (v1, v2) ∈ E, c(v1, v2) represents the capacity, or maximum amount of flow, of the edge.


    Max Flow

    Definition 3.3

    A flow function is a function f : E → R+ ∪ {0} which satisfies the following constraints:

    1. f(v1, v2) ≤ c(v1, v2) ∀ (v1, v2) ∈ E;

    2. Σ_{v1 : (v1, v) ∈ E} f(v1, v) = Σ_{v2 : (v, v2) ∈ E} f(v, v2) ∀ v ∈ V \ {s, t}.

    The definition given above gives us two guarantees: first, that the flow passing along a particular edge does not exceed that edge's capacity; and second, that the flow entering a vertex is equal to the flow leaving that vertex. From this second constraint, we can derive the following:

    Definition 3.4

    The flow of a flow function is the total amount passing from the source to the sink, and is equal to Σ_{(s, v) ∈ E} f(s, v).

    The objective of the max flow problem is to maximise the flow of a network, i.e. to find a flow function f with the highest flow.

    Min Cut

    Definition 3.5

    An s-t cut C = (S, T) is a partition of the variables v ∈ V into two disjoint sets S and T, with s ∈ S and t ∈ T.


    Let E′ be the set of edges that connect a variable v1 ∈ S to a variable v2 ∈ T. Formally:

    E′ = {(v1, v2) ∈ E : v1 ∈ S, v2 ∈ T}.  (3.4)

    Note that there are at least |V| − 2 edges in E′, as if v ∈ S \ {s}, then (v, t) ∈ E′, and if v ∈ T \ {t}, then (s, v) ∈ E′. Depending on the connectivity of G, there may be up to (|S| − 1) × (|T| − 1) additional edges.

    Definition 3.6

    The capacity of an s-t cut is the sum of the capacities of the edges connecting S to T, and is equal to Σ_{(v1, v2) ∈ E′} c(v1, v2).

    The objective of the min cut problem is to find an s-t cut which has minimal capacity (there may be more than one solution).

    In 1956, it was shown independently by Ford and Fulkerson [41] and by Elias et al. [30] that the two problems above are equivalent. Therefore, to find a flow function that has maximal flow, one needs only to find an s-t cut with minimal capacity. Algorithms that seek to obtain such an s-t cut are known as graph cut algorithms. Submodular functions can be efficiently minimised via graph cuts [15, 61]; C++ code is available that performs this minimisation using an augmenting path algorithm [14, 58, 61]. This code is often used as a basis for image segmentation algorithms, for example [9, 16, 71, 100, 101, 130].
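    The augmenting path idea can be sketched with a minimal Edmonds-Karp implementation (breadth-first search for the shortest augmenting path). This is a toy sketch for illustration, not the optimised code of [14, 58]:

```python
from collections import deque

def max_flow(capacity, s, t):
    """Edmonds-Karp: repeatedly push flow along a shortest augmenting path.
    `capacity` maps directed edges (u, v) to non-negative capacities; the
    returned value equals both the maximum flow and the minimum cut capacity."""
    residual, nodes = {}, set()
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge, used to undo flow
        nodes.update((u, v))
    flow = 0
    while True:
        # Breadth-first search for a path with spare residual capacity.
        parent, queue = {s: None}, deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in nodes:
                if v not in parent and residual.get((u, v), 0) > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path left, so the flow is maximal
        # Recover the path, find its bottleneck capacity, and augment.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for (u, v) in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck

# Toy network: the tightest cut is {s, a, b} | {t}, with capacity 4 + 9 = 13.
capacities = {('s', 'a'): 10, ('s', 'b'): 10, ('a', 'b'): 2,
              ('a', 't'): 4, ('b', 't'): 9}
```

    By the equivalence above, `max_flow(capacities, 's', 't')` returns 13, the capacity of the minimal s-t cut.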

    3.1.4 Application to Image Segmentation

    To illustrate the use of energy minimisation in image segmentation, consider the following example. We have an image, shown in Figure 3.1, with just 9 pixels (3 × 3). To construct a graph, we create a set of vertices V = {s, t, v1, v2, . . . , v9}, and a set of edges E, with (s, vi) and (vi, t) ∈ E for i = 1 to 9, and (vi, vj) ∈ E if vi and vj are adjacent in the image, as shown in Figure 3.1. The vertices vi have pixel values pi between 0 and 255 inclusive (i.e. the image is 8-bit greyscale, with 0 corresponding to black, and 255 to white).

    Figure 3.1: Image for our toy example.

    Our objective is to separate the pixels into foreground and background sets, i.e. to define a labelling z = {z1, z2, . . . , z9}, where zi = 1 if and only if vi is assigned to the foreground set.

    We wish to separate the light pixels in the image from the dark ones, with the light pixels in the foreground, so we create foreground and background penalties F and B respectively for pixels vi as follows:

    F(vi) = 255 − pi;  (3.5)

    B(vi) = pi.  (3.6)

    These are known as unary pixel costs. The total unary cost of a labelling z is:

    ψ(z) = Σ_{i=1}^{9} (zi · F(vi) + (1 − zi) · B(vi)).  (3.7)

    We also want the boundary of the foreground set to align with edges in the image. Therefore, we wish to penalise cases where adjacent pixels have similar values, but different labels. This is done by including a pairwise cost:

    φ(z) = Σ_{(vi, vj) ∈ E} 1(zi ≠ zj) · exp(−|pi − pj|),  (3.8)

    where 1 is the indicator function, which has a value of 1 if the statement within the brackets is true, and zero otherwise. The overall energy function is:

    f(z) = ψ(z) + λ · φ(z),  (3.9)

    where λ is a weight parameter; higher values of λ will make it more likely that adjacent pixels have similar labels.

    (a) λ = 0.1 (b) λ = 0.2 (c) λ = 1

    Figure 3.2: Segmentation results for different values of λ. A higher value punishes segmentations with large boundaries; a high enough value (as in (c)) will make the result either all foreground or all background.

    The energy function in (3.9) is submodular, and can therefore be minimised efficiently using the max flow code available at [58]. The segmentation results obtained for different values of λ are shown in Figure 3.2. The ratio between the unary and pairwise weights influences the segmentation result produced.
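    For a graph this small, the energy can even be minimised by exhaustive search over all 2^9 labellings, which makes the effect of the pairwise weight easy to see. The sketch below uses intensities scaled to [0, 1]; the image and constants are illustrative, not those of Figure 3.1:

```python
from itertools import product
from math import exp

# An illustrative 3x3 image with intensities in [0, 1]; the light pixels
# on the left should end up in the foreground.
pixels = [0.9, 0.8, 0.2,
          0.9, 0.7, 0.1,
          0.2, 0.1, 0.1]
W = 3  # image width

# 4-connected neighbours, mirroring the pairwise edges of the graph.
edges = [(i, i + 1) for i in range(9) if i % W != W - 1] + \
        [(i, i + W) for i in range(9 - W)]

def energy(z, lam):
    """Unary cost plus weighted pairwise cost, in the style of (3.5)-(3.9)."""
    unary = sum(zi * (1 - p) + (1 - zi) * p for zi, p in zip(z, pixels))
    pairwise = sum(exp(-abs(pixels[i] - pixels[j]))
                   for i, j in edges if z[i] != z[j])
    return unary + lam * pairwise

def segment(lam):
    """Exact minimisation by enumerating all 512 labellings; real images
    need graph cuts, but the objective being minimised is the same."""
    return min(product((0, 1), repeat=9), key=lambda z: energy(z, lam))
```

    A tiny pairwise weight simply thresholds each pixel independently, while a very large weight forces a single label everywhere (here all background, the cheaper of the two uniform labellings), matching the behaviour described for Figure 3.2.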

    3.2 Inference on Trees: Belief Propagation

    While vision problems such as segmentation require a large number of variables (one per

image pixel), others, such as pose estimation, require only a small number of variables,

    and hence a smaller graph. An important type of graph that is useful for this problem is


E) in the form of a message. Here, a message can be as simple as a scalar value, or a

    matrix of values. This message is then combined with information relevant to the vertex

    itself, to form a new message for the next vertex.

    Messages are passed between vertices in a series of pre-defined updates. The number

    of updates required to find an overall solution depends on the complexity of the graph.

    If the graph has a simple structure, such as a chain (where each vertex is connected to

    at most two other vertices, and the graph is not a cycle), then only one set of updates is

    required to find the optimal set of values. This set can be found by an algorithm such as

    the Viterbi algorithm [125].

    3.2.2 Belief Propagation

    Belief propagation can be viewed as a variation of the Viterbi algorithm that is applicable

    to trees. To use this process to perform inference on a tree T, we must choose a vertex

    of the tree to be the root vertex, denoted v0.

Since T is a tree, v0 is connected to each of the other vertices by exactly one path. We can therefore re-order the vertices such that, for any vertex vi, the path from v0 to vi

proceeds via vertices with indices in ascending order.[1] Once we have done this, we can

    introduce the notions of parent-child relations between vertices, defined here for clarity.

    Definition 3.9

We say that a vertex vi is the parent of vj if (vi, vj) ∈ E and i < j. If this is
the case, we say that vj is a child of vi.

    Note that the root node has no parents, and each other vertex has exactly one parent,

    since if a vertex vj had two parents, then there would be more than one path from v0

    to vj, which contradicts the definition of a tree. However, a vertex may have multiple

[1] There will typically be multiple ways to do this.


    children, or none at all.

    Definition 3.10

A vertex vi with no children is known as a leaf vertex.

    We now describe the general form of belief propagation on our tree T. The vertices

    in V are considered in two passes: a down pass, where the vertices are processed in

    descending order, so that each vertex is processed after its children, but before its parent,

    and an up pass, where the order is reversed.

For each leaf vertex $v_i$, we have a set $L_i = \{l_i^1, l_i^2, \dots, l_i^{K_i}\}$ of possible labels, where $K_i$ is the number of labels in the set $L_i$. Then the score associated with assigning a particular label $l_i^p$ to vertex $v_i$ is:

$$\text{score}_i(l_i^p) = \phi(v_i = l_i^p). \tag{3.10}$$

This score is the message that is passed to the parent of $v_i$.

For a vertex $v_j$ with at least one child, we need to combine these messages with the unary and pairwise energies, in order to produce a message for the parent of $v_j$. Again, we have a finite set $L_j = \{l_j^1, l_j^2, \dots, l_j^{K_j}\}$ of possible labels for $v_j$. The score associated with assigning a particular label $l_j^q$ is:

$$\text{score}_j(l_j^q) = \phi(v_j = l_j^q) + \sum_{i > j : (v_i, v_j) \in E} m_i(l_j^q), \tag{3.11}$$

where:

$$m_i(l_j^q) = \max_{l_i^p} \bigl( \psi(v_i = l_i^p, v_j = l_j^q) + \text{score}_i(l_i^p) \bigr). \tag{3.12}$$

When the root vertex is reached, the optimal label for $v_0$ can be found by maximising $\text{score}_0$,

    defined in (3.11). Finally, the globally optimal configuration can be found by keeping

    track of the arg max indices, and then tracing back through the tree on the up pass


    to collect them. The up pass can be avoided if the arg max indices are recorded along

    with the messages during the down pass.
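To make the two passes concrete, here is a small, self-contained Python sketch of max-sum belief propagation on a tree whose vertices are ordered as above (each parent has a smaller index than its children). The routine, its argument names, and the three-vertex example with a Potts-style smoothness score are all illustrative assumptions, not code from any cited work.

```python
def tree_map(n, edges, phi, psi):
    """Exact MAP labelling of a tree by max-sum belief propagation.

    Vertices 0..n-1 are ordered so each edge (i, j) in `edges` has i < j,
    with i the parent.  phi[j][q] is the unary score of label q at vertex
    j; psi(j, q, c, p) is the pairwise score of label q at vertex j and
    label p at its child c.  Returns the optimal label per vertex.
    """
    children = {i: [] for i in range(n)}
    for i, j in edges:
        children[i].append(j)

    score = [list(u) for u in phi]       # start from the unary scores
    best = {}                            # best[(c, q)] = argmax label of child c
    # Down pass: each vertex is processed after its children.
    for j in range(n - 1, -1, -1):
        for c in children[j]:
            for q in range(len(phi[j])):
                msgs = [psi(j, q, c, p) + score[c][p]
                        for p in range(len(phi[c]))]
                p_star = max(range(len(msgs)), key=msgs.__getitem__)
                score[j][q] += msgs[p_star]
                best[(c, q)] = p_star
    # Decide the root label, then trace the argmax indices back down,
    # which avoids an explicit up pass.
    labels = [None] * n
    labels[0] = max(range(len(phi[0])), key=score[0].__getitem__)
    for j in range(n):                   # parents are visited before children
        for c in children[j]:
            labels[c] = best[(c, labels[j])]
    return labels

# Root 0 with two leaf children, binary labels, and a smoothness term
# that rewards a child for agreeing with its parent.
phi = [[0.0, 1.0], [0.0, 2.0], [3.0, 0.0]]
potts = lambda j, q, c, p: 0.5 if q == p else 0.0
print(tree_map(3, [(0, 1), (0, 2)], phi, potts))   # -> [1, 1, 0]
```

Note that the smoothness reward pulls vertex 1 to agree with the root, while the strong unary score at vertex 2 overrides it.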

    One of the vision problems that is both interesting for computer games and suitable

    for the application of belief propagation is human pose estimation, and it is this problem

    that is described in the next section.

    3.3 Human Pose Estimation

In layman's terms, the problem of human pose estimation can be stated as follows: given an image containing a person, the objective is to correctly classify the person's pose. The pose can either be selected from a constrained list, or estimated freely via the locations of the person's limbs and the angles of their joints (for example, the location of their left arm, and the angle at which their elbow is bent). This is often formalised

    by defining a skeleton model, which is to be fitted to the image. It is quite common to

    describe the human body as an articulated object, i.e. one formed of a connected set of

    rigid parts. Such a formalisation gives rise to a family of parts-based models known as

    pictorial structure models.

    These models typically consist of six parts (if the objective is restricted to upper body

pose estimation) or ten parts (full body) [10, 35–37]. The upper body model consists of

    head, torso, and upper and lower arms; to extend this to the full body, upper and lower

    leg parts are added. Having divided the human body into parts, one can then learn a

    separate detector for each part, taking advantage of the fact that the parts have both a

    simpler shape (not being articulated), and a simpler colour distribution.

    Indeed, pose estimation can be formulated as an energy minimisation problem. In

    contrast to segmentation problems, which require a different variable for each pixel, the

    number of variables required is equal to the number of parts in the skeleton model.

    However, the number of possible values that the variables can take is large (a part can,

    in theory, occupy any position in the image, with any orientation and extent).


Figure 3.3: A depiction of the ten-part skeleton model used by Felzenszwalb and Huttenlocher [35].

    3.3.1 Pictorial Structures

    A pictorial structure model can be expressed as a graph G = (V, E), with the vertices

V = {v1, v2, . . . , vn} corresponding to the parts, and the edges E specifying which pairs
of parts (vi, vj) are connected. A typical graph is shown in Figure 3.3.

In Felzenszwalb and Huttenlocher's pictorial structure model [35], a particular labelling of the graph is given by a configuration $L = \{l_1, l_2, \dots, l_n\}$, where each $l_i$ specifies the location $(x_i, y_i)$ of part $v_i$, together with its orientation, and degree of foreshortening (i.e. the degree to which the limb appears to be shorter than it actually is, due to its

    angle relative to the camera). The energy of this labelling is then given by:

$$E(L) = \sum_{i=1}^{n} \phi(l_i) + \sum_{(v_i, v_j) \in E} \psi(l_i, l_j), \tag{3.13}$$

where, as in the previous section, $\phi$ represents the unary energy on part configuration, and $\psi$ the pairwise energy. These energies relate to the likelihood of a configuration, given


    the image data, and given prior knowledge of the parts; more realistic configurations

    will have lower energy. Despite the large number of possible configurations, a globally

optimal configuration can be found efficiently. This can be done by using simple appearance models for each part, explained in the following section, and then applying belief

    propagation.

    Appearance Model

    For each part, appearance models can be learned from training data, and can be based on

    edges [95], colour-invariant features such as HOG [22,37], or the position of the part within

    the image [38]. Another approach for video sequences is to apply background subtraction,

    and define a unary potential based on the number of foreground pixels around the object

    location [35].

    Given an image, this appearance model can be evaluated over a dense grid [1]; to

    speed this process up, a feature pyramid can be defined, so that a number of promising

    locations are found from a coarse grid, and then higher resolution part filters produce

    more precise matching scores [34,36]. In order to reduce the time taken by the inference

    process, it might be desirable to reduce the set of possible part locations. Two ways to

    do this are:

1. Thresholding, where part locations with a score that is worse than some predefined
value, or with a score outside the top N values for some N, are discarded.

    2. Non-maximal suppression, which involves the removal of part locations that are

    similar, but inferior, to other part locations.
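The second of these is commonly implemented greedily: sort the candidates by score and discard any candidate that overlaps an already-kept one too much. The sketch below is an illustrative Python version; the box representation, the intersection-over-union overlap measure, and the 0.5 threshold are assumptions, not details from the cited methods.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    return inter / (aw * ah + bw * bh - inter)

def non_max_suppression(detections, overlap=0.5):
    """Greedy NMS over a list of (score, x, y, w, h) tuples.

    Keeps the best-scoring detections, removing any candidate that is
    similar (high overlap) but inferior (lower score) to a kept one.
    """
    keep = []
    for det in sorted(detections, reverse=True):   # highest score first
        if all(iou(det[1:], k[1:]) < overlap for k in keep):
            keep.append(det)
    return keep

dets = [(0.9, 0, 0, 10, 10),     # best detection
        (0.8, 1, 1, 10, 10),     # near-duplicate of the first: suppressed
        (0.7, 20, 20, 10, 10)]   # a separate location: kept
print([d[0] for d in non_max_suppression(dets)])   # -> [0.9, 0.7]
```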

    Optimisation

After applying these techniques, we now have a small set of possible locations $\{l_i^1, l_i^2, \dots, l_i^k\}$ for each vertex $v_i$. For a leaf vertex $v_i$, the score of each location $l_i^p$ is:

$$\text{score}_i(l_i^p) = \phi(l_i^p). \tag{3.14}$$


    Now, for a vertex vj with at least one child, the score is defined in terms of the children:

$$\text{score}_j(l_j^p) = \phi(l_j^p) + \sum_{v_i : (v_i, v_j) \in E} m_i(l_j^p), \tag{3.15}$$

where:

$$m_i(l_j^p) = \max_{l_i^q} \bigl( \psi(l_i^q, l_j^p) + \text{score}_i(l_i^q) \bigr). \tag{3.16}$$

    Finally, the top-scoring part configuration is found by finding the root location with the

    highest score, and then tracing back through the tree, keeping track of the arg max indices.

    Multiple detections can be generated by thresholding this score and using non-maximal

    suppression.

    3.3.2 Flexible Mixtures of Parts

    Yang and Ramanan [131] extend these approaches by introducing a flexible mixture of

    parts model, allowing for greater intra-limb variation.

Rather than using a classical articulated limb model such as that of Marr and Nishihara [75], they introduce a new representation: a mixture of non-orientable pictorial structures. Instead of having ten rigid parts, as the methods described in Section 3.3.1 do, their model has twenty-six rigid parts, which can be combined to form limbs and produce an estimate for the ten parts, as shown in Figure 3.4. Each part has a number T of possible types, learned from training data. Types may include orientations of a part (e.g. horizontal or vertical hand), and may also span semantic classes (e.g. open versus closed hand).

    Model

Let us denote an image by $I$, the location of part $i$ by $p_i = (x, y)$, and its mixture component by $t_i$, with $i \in \{1, \dots, K\}$, $p_i \in \{1, \dots, L\}$, and $t_i \in \{1, \dots, T\}$, where $K$ is the number of parts, $L$ is the number of possible part locations, and $T$ is the number of mixture components per part.


where $dx$ and $dy$ represent the relative location of part $i$ with respect to part $j$. The parameter $w_{(i,j)}^{(t_i, t_j)}$ encodes the expected values for $dx$ and $dy$, tailored for types $t_i$ and $t_j$. So if $i$ is the elbow and $j$ is the forearm, with $t_i$ and $t_j$ specifying vertically-oriented parts (i.e. the arm is at the person's side), we would expect $p_j$ to be below $p_i$ on the image.

    Inference

To perform inference on this model, Yang and Ramanan maximise $S(I, p, t)$ over $p$ and $t$. Since the graph $G$ in Figure 3.5 is a tree, belief propagation (see Section 3.2.2) can again

be used. The score of a particular leaf node $p_i$ with mixture $t_i$ is:

$$\text{score}_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i), \tag{3.19}$$

and for all other nodes, we take into account the messages passed from the node's children:

$$\text{score}_i(t_i, p_i) = b_i^{t_i} + w_i^{t_i} \cdot \phi(I, p_i) + \sum_{c \in \text{children}(i)} m_c(t_i, p_i), \tag{3.20}$$

where:

$$m_c(t_i, p_i) = \max_{t_c} \Bigl( b_{(i,c)}^{t_i, t_c} + \max_{p_c} \bigl( w_{(i,c)}^{(t_i, t_c)} \cdot \psi(p_i - p_c) + \text{score}_c(t_c, p_c) \bigr) \Bigr). \tag{3.21}$$

Once the messages passed reach the root part ($i = 1$), $\text{score}_1(t_1, p_1)$ contains the best-scoring skeleton model given the root part location $p_1$. Multiple detections can be generated by thresholding this score and using non-maximal suppression.

    3.3.3 Unifying Segmentation and Pose Estimation

    So far, a number of methods for solving either human segmentation or pose estimation

    have been discussed. Some recent work has also been done that attempts to solve both

    tasks together. In this section, we discuss PoseCut [16].


    PoseCut

    Bray et al. [16] tackle the segmentation problem by introducing a pose-specific Markov

    random field (MRF), which encourages the segmentation result to look human-like.

    This prior differs from image to image, as it depends on which pose the human is in.

Given an image, they find the best pose prior $\Theta_{\text{opt}}$ by solving:

$$\Theta_{\text{opt}} = \arg\min_{\Theta} \Bigl( \min_x \Psi_3(x, \Theta) \Bigr), \tag{3.22}$$

where $x$ specifies the segmentation result, and $\Psi_3$ is the Object Category Specific MRF from [65], which defines how well a pose prior $\Theta$ fits a segmentation result $x$. It is defined as follows:

$$\Psi_3(x, \Theta) = \sum_i \Bigl( \phi(I|x_i) + \phi(x_i|\Theta) + \sum_j \bigl( \phi(I|x_i, x_j) + \psi(x_i, x_j) \bigr) \Bigr), \tag{3.23}$$

where $I$ is the observed (image) data, $\phi(I|x_i)$ is the unary segmentation energy, $\phi(x_i|\Theta)$ is the cost of the segmentation given the pose prior (penalising pixels near to the shape being labelled background, and pixels far from the shape being labelled foreground), and the $\psi$ term is a pairwise energy. Finally, $\phi(I|x_i, x_j)$ is a contrast-sensitive term, defined as:

$$\phi(I|x_i, x_j) = \begin{cases} \gamma(i, j), & \text{if } x_i \neq x_j; \\ 0, & \text{if } x_i = x_j, \end{cases} \tag{3.24}$$

where $\gamma(i, j)$ decreases with the difference in RGB values of pixels $i$ and $j$; pixels with similar values will have a high value for $\gamma(i, j)$, since we wish to encourage these pixels to have the same label.

Given a particular pose prior $\Theta$, the optimal configuration $x^* = \arg\min_x \Psi_3(x, \Theta)$ can be found using a single graph cut. The final solution $\arg\min_x \Psi_3(x, \Theta_{\text{opt}})$ is found using the Powell minimisation algorithm [94].


    3.4 Stereo Vision

Stereo correspondence algorithms typically denote one image as the reference image and

    the other as the target image. A dense set of patches is extracted from the reference

    image, and for each of these patches, the best match is found in the target image. The

    displacement between the two patches is known as the disparity; the disparity for each

    pixel in the reference image is stored in a disparity map. It can easily be shown that

    the disparity of a pixel is inversely proportional to its distance from the camera, or its

    depth [105]. A typical disparity map is shown in Figure 3.6.
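To make the inverse relation between disparity and depth concrete: for a rectified (frontoparallel) camera pair with focal length $f$ and baseline $B$ (standard quantities of the rectified setup, not symbols defined elsewhere in this chapter), a scene point at depth $Z$ projects to the left and right images at horizontal positions $x_L$ and $x_R$ with

$$d = x_L - x_R = \frac{fB}{Z}, \qquad \text{equivalently} \qquad Z = \frac{fB}{d},$$

so doubling the distance to the camera halves the disparity.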

    A plethora of stereo correspondence algorithms have been developed over the years.

    Scharstein and Szeliski [102] note that earlier methods can typically be divided into four

    stages: (i) matching cost computation; (ii) cost aggregation; (iii) disparity computation;

and (iv) disparity refinement; later methods can be described in a similar fashion [20, 60, 76, 92, 97].

It is quite common to use the sum of absolute differences measure when finding the matching cost for each pixel. A patch with a height and width of $2n + 1$ pixels for some $n \geq 0$ is extracted from the reference image. Then, for each disparity value $d$, a patch

    is extracted from the target image, and the pixelwise intensity values for each patch are

    compared.

With $L$ and $R$ representing the reference (left) and target (right) images respectively, the cost of assigning disparity $d$ to a pixel $(x, y)$ in $L$ is as follows:

$$C(x, y, d) = \sum_{\Delta x = -n}^{n} \sum_{\Delta y = -n}^{n} \bigl| L(x + \Delta x, y + \Delta y) - R(x + \Delta x - d, y + \Delta y) \bigr|. \tag{3.25}$$

    Evaluating this cost over all pixels and disparities provides a cost volume, on which

    aggregation methods such as smoothing can be applied in order to reduce noise. Disparity

    values for each pixel can then be computed. The simplest method for doing this is just

to find, for each pixel $(x, y)$, the disparity value $d$ which minimises $C(x, y, d)$.
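This simple winner-take-all scheme can be sketched as follows. The Python below is illustrative only (brute-force loops on tiny images; border pixels and disparities that would push the target window outside the image are skipped; the synthetic image pair is an assumption):

```python
import numpy as np

def sad_disparity(left, right, max_disp, n=1):
    """Winner-take-all SAD block matching over (2n+1) x (2n+1) windows.

    `left` is the reference image, `right` the target.  Brute-force
    loops, so this sketch is only suitable for tiny images.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(n, h - n):
        for x in range(n, w - n):
            ref = left[y - n:y + n + 1, x - n:x + n + 1]
            best_cost, best_d = np.inf, 0
            # Only disparities keeping the target window inside the image.
            for d in range(min(max_disp, x - n) + 1):
                tgt = right[y - n:y + n + 1, x - d - n:x - d + n + 1]
                cost = np.abs(ref - tgt).sum()   # the SAD cost (3.25)
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

# Synthetic 9x9 pair: a bright square that sits 2 pixels further left in
# the target image, i.e. a true disparity of 2 on the square.
left = np.zeros((9, 9)); left[3:6, 5:8] = 1.0
right = np.zeros((9, 9)); right[3:6, 3:6] = 1.0
print(sad_disparity(left, right, max_disp=3)[4, 6])   # -> 2
```

Note that the textureless background pixels match equally well at every disparity, foreshadowing the problems discussed next.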

    However, such a method is likely to result in a high degree of noise. Additionally, pixels

    immediately outside a foreground object are often given disparities that are higher than


Figure 3.6: An example disparity map produced by a stereo correspondence algorithm: (a) left image; (b) right image; (c) disparity map. Pixels that are closer to the camera appear brighter in colour. Note the dark gaps produced in areas with little or no texture.


    the possibility of controlling video games with the human body. To this end, in the next

    chapter, we will begin exploring the possibility of providing a human scene understanding

    framework, combining pose estimation with segmentation and depth estimation.


Chapter 4

    3D Human Pose Estimation in a Stereo Pair

    of Images

    The problem of human pose estimation has been widely studied in the computer vision

    literature; a survey of recent work is provided in Section 3.3. Despite the large body of

    research focussing on 2D human pose estimation, relatively little work has been done to

    estimate pose in 3D, and in particular, annotated datasets featuring frontoparallel stereo

    views of humans are non-existent.

    In recent years, some research has focussed on combining segmentation and pose

    estimation to produce a richer understanding of a scene [10, 16, 66, 89]. Many of these

    approaches simply put the algorithms into a pipeline, where the result of one algorithm

    is used to drive the other [10, 16, 89]. The problem with this is that it often proves

    impossible to recover from errors made in the early stages of the process. Therefore,

    a joint inference framework, as proposed by Wang and Koller [127] for 2D human pose

    estimation, is desired.

This chapter describes a new algorithm for estimating human pose in 3D, while simultaneously solving the problems of stereo matching and human segmentation. The algorithm uses an optimisation method known as dual decomposition, of which we give an

    overview in Section 4.1.

Following that, a new dataset for two-view human segmentation and pose estimation,
