
3D Gesture Recognition and Tracking for Next Generation of Smart Devices

Theories, Concepts, and Implementations

SHAHROUZ YOUSEFI

Department of Media Technology and Interaction Design

School of Computer Science and Communication

KTH Royal Institute of Technology

Doctoral Thesis in Media Technology

Stockholm, February 2014


3D Gesture Recognition and Tracking for Next Generation of Smart Devices: Theories, Concepts, and Implementations

Shahrouz Yousefi

Department of Media Technology and Interaction Design (MID)

School of Computer Science and Communication (CSC)

KTH Royal Institute of Technology

SE-100 44, Stockholm, Sweden

Author’s e-mail: [email protected]

Academic dissertation which, with the permission of KTH Royal Institute of Technology (Kungliga Tekniska Högskolan), is submitted for public examination for the degree of Doctor of Technology in Media Technology, on Monday, 17 March 2014, at 13:15 in Hall F3, Lindstedtsvägen 26, KTH Royal Institute of Technology, Stockholm.

TRITA-CSC-A-2014-02

ISSN-1653-5723

ISRN-KTH/CSC/A–14/02-SE

ISBN-978-91-7595-031-0

Copyright © 2014 by Shahrouz Yousefi. All rights reserved.

Typeset in LaTeX by Shahrouz Yousefi

E-version available at http://kth.diva-portal.org

Printed by E-print AB, Stockholm, Sweden, 2014

Distributor: KTH School of Computer Science and Communication


Abstract

The rapid development of mobile devices during the recent decade has been greatly driven by interaction and visualization technologies. Although touchscreens have significantly enhanced interaction technology, it is predictable that with future mobile devices, e.g., augmented reality glasses and smart watches, users will demand more intuitive inputs such as free-hand interaction in 3D space. Specifically, for manipulation of digital content in augmented environments, 3D hand/body gestures will be essential. Therefore, 3D gesture recognition and tracking are highly desired features for interaction design in future smart environments. Due to the complexity of hand/body motions and the limited computational resources of mobile devices, 3D gesture analysis remains an extremely difficult problem to solve.

This thesis aims to introduce new concepts, theories, and technologies for natural and intuitive interaction in future augmented environments. The contributions of this thesis support the concept of bare-hand 3D gestural interaction and interactive visualization on future smart devices. The introduced technical solutions enable effective interaction in the 3D space around the smart device. Highly accurate and robust 3D motion analysis of hand/body gestures is performed to facilitate 3D interaction in various application scenarios. The proposed technologies enable users to control, manipulate, and organize digital content in 3D space.

Keywords: 3D gestural interaction, gesture recognition, gesture tracking, 3D visualization, 3D motion analysis, augmented environments.

Shahrouz Yousefi

February 2014


Sammanfattning (Swedish abstract, translated)

The rapid development of mobile devices during the recent decade has largely been driven by interaction and visualization technology. Although touchscreens have considerably improved interaction technology, it is predictable that with future mobile devices, e.g., augmented reality glasses and smart watches, users will demand more intuitive ways to interact, such as free-hand interaction in 3D space. This becomes especially important for manipulation of digital content in augmented environments, where 3D hand/body gestures will be absolutely necessary. Therefore, 3D gesture recognition and tracking are highly desired capabilities for interaction design in future smart environments. Due to the complexity of hand/body motions and the limitations of mobile devices for expensive computations, 3D gesture analysis is still a very difficult problem to solve.

The thesis aims to introduce new concepts, theories, and techniques for natural and intuitive interaction in future augmented environments. The contributions of this thesis support the concept of bare-hand 3D gestural interaction and interactive visualization on future smart devices. The introduced technical solutions enable effective interaction in the 3D space around the smart device. Highly accurate and robust 3D motion analysis of hand/body gestures is performed to facilitate 3D interaction in various application scenarios. The proposed techniques enable users to control, manipulate, and organize digital content in 3D space.

Keywords: 3D gestural interaction, gesture recognition, gesture tracking, 3D visualization, 3D motion analysis, augmented environments.

Shahrouz Yousefi

February 2014


Acknowledgements

First of all, I wish to express my sincere gratitude to my main advisor, Prof. Haibo Li, for providing me with this research opportunity. Thank you for your motivation, enthusiasm, and support during these years. Without your supervision and mentoring, this thesis would not have been possible. You inspired me to be more adventurous in research.

I would like to thank my second advisor, Dr. Li Liu, for all the motivational and fruitful discussions. Special thanks to my dear friend and colleague, Farid Kondori. We had many collaborations, interesting discussions, and enjoyable moments during these years. I would like to thank my former colleagues at the Digital Media Lab, Umeå University, for their helpful suggestions and comments on my research projects. Special thanks to Annemaj Nilsson, Mona-Lisa Gunnarsson, and the friendly staff of the Department of Applied Physics and Electronics, Umeå University.

My time at KTH was truly enjoyable thanks to the friendly colleagues of the Department of Media Technology and Interaction Design. I am grateful for the time spent with them at work meetings, seminars, and social events. I must especially thank Prof. Ann Lantz for providing an excellent research environment at the MID department. Thanks for your support, encouragement, and kindness. I would also like to thank Henrik Artman, Cristian Bogdan, Ambjörn Naeve, Olle Bälter, Eva-Lotta Sallnäs, and the other senior researchers at the MID department for their support and guidance. Many thanks go to Dr. Roberto Bresin and Prof. Yngve Sundblad for reviewing my thesis. Your constructive ideas, insightful comments, and suggestions greatly improved the quality of my PhD thesis.

Winning first prize in the KTH Innovation Idea Competition, best project work in the Uminova Academic Business Challenge, and being selected as one of the top PhD works at the ACM Multimedia Doctoral Symposium motivated me to work harder on the development of my research ideas. I would especially like to thank Håkan Borg and Cecilia Sandell from KTH Innovation for their great support on patentability analysis, business development, and commercialization of my research results.

Finally, and most importantly, I am grateful to my loving parents, my brother, and his family for giving me endless intellectual support and encouragement to pursue my studies during these years. I would especially like to thank my best friend and companion, Shora. Thanks for the wonderful and precious moments we shared together.

Shahrouz Yousefi

February 2014

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Problem
    1.2.1 Future Mobile Devices
    1.2.2 Experience Design
    1.2.3 Limitations in Interaction Facilities
    1.2.4 Limitations in Visualization
    1.2.5 Technical Challenges in 3D Gestural Interaction
  1.3 Future Trends in Multimedia Context
    1.3.1 3D Interaction Technology
    1.3.2 3D Visualization
    1.3.3 Passive Vision to Active/Interactive Vision
    1.3.4 Gesture Analysis: from Computer Vision Methods to Image-based Search Methods
  1.4 Research Strategy

2 Related Work
  2.1 Terminology
  2.2 Related Work
    2.2.1 3D Motion Capture Technologies in Available Interactive Systems
      2.2.1.1 Passive Motion Tracking and Its Applications
      2.2.1.2 Active Motion Tracking and Its Applications
      2.2.1.3 Comparison Between Active and Passive Methods
    2.2.2 3D Motion Estimation for Mobile Interaction
    2.2.3 3D Gesture Recognition and Tracking
    2.2.4 3D Visualization on Mobile Devices

3 General Concept and Methodology
  3.1 General Concept
    3.1.1 Interaction/Visualization Space
    3.1.2 Sharing the Interaction/Visualization Space
      3.1.2.1 Single-user, Single-device
      3.1.2.2 Multi-user, Multi-device with Shared Interaction Space
      3.1.2.3 Multi-user, Single-device with Shared Visualization Space
      3.1.2.4 Interaction from Different Locations for Multi-user, Multi-device
  3.2 Evolution of Interaction/Visualization Spaces
  3.3 Enabling Media Technologies
    3.3.1 Vision-based Motion Tracking in 3D Space
    3.3.2 3D Visualization
  3.4 Methodology Overview
  3.5 Gesture Analysis through the Pattern Recognition Methods
  3.6 Gesture Analysis through the Large-scale Image Retrieval

4 Enabling Media Technologies
  4.1 Gesture Detection and Tracking Based on Low-level Pattern Recognition
    4.1.1 3D Motion Analysis
  4.2 Gesture Detection and Tracking Based on Gesture Search Engine
    4.2.1 Providing the Database of Gesture Images
    4.2.2 Query Processing and Matching
    4.2.3 Scoring System
    4.2.4 Quality of Hand Gesture Database
  4.3 Interactive 3D Visualization
  4.4 Methods for 3D Visualization
    4.4.1 Depth Recovery and 3D Visualization from a Single View
    4.4.2 3D Visualization from Multiple 2D Views
  4.5 3D Channel Coding

5 Experimental Results
  5.1 Experiments on Gesture Detection, Tracking and 3D Motion Analysis
    5.1.1 Camera and Experiment Condition
    5.1.2 Algorithm
    5.1.3 Programming Environment and Results
  5.2 Experiments on Gesture Search Framework
    5.2.1 Constructing the Database
    5.2.2 Forming the Vocabulary Table
    5.2.3 Gesture Search Engine and Neighborhood Analysis
    5.2.4 Gesture Search Results
  5.3 Technical Comparison between the Prior Art and the Proposed Solutions
  5.4 3D Rendering and Graphical Interface
  5.5 Research Scenarios
    5.5.1 Implementation of the 3D Gestural Interaction on Mobile Platform
    5.5.2 Implementation of the Interactive 3D Vision on a Wall-sized Display
    5.5.3 3D Rendering and Visualization of 2D Content
  5.6 Potential Applications
    5.6.1 3D Photo Browsing
    5.6.2 Virtual/Augmented Reality
    5.6.3 Interactive 3D Display
    5.6.4 Medical Applications
    5.6.5 3D Games
    5.6.6 3D Modeling and Reconstruction
    5.6.7 Wearable AR Displays
  5.7 Usability Analysis in Object Manipulation: Touchscreen Interaction vs. 3D Gestural Interaction
    5.7.1 User Test
    5.7.2 Usability Results

6 Concluding Remarks and Future Direction
  6.1 Contributions
    6.1.1 Conceptual Models for Future Human Mobile Device Interaction
    6.1.2 Technical Contributions for 3D Gestural Interaction and 3D Interactive Visualization
    6.1.3 Implementations
  6.2 Concluding Remarks and Future Direction
    6.2.1 Technical Challenges
      6.2.1.1 Active vs. Passive Motion Capture
      6.2.1.2 Gesture Detection and Tracking without Intelligence
      6.2.1.3 Adaptability of the Contributions to Future Hardware Evolution
      6.2.1.4 Contributions of other Research Areas to Computer Vision
    6.2.2 Further Development
      6.2.2.1 Concept of Collaborative 3D Interaction
      6.2.2.2 Concept of Interaction in the Space using Body Gestures
      6.2.2.3 Extension of the Gesture Search Framework to Extremely Large Scale
    6.2.3 Future of Mobile Interaction and Visualization

7 Summary of the Selected Articles
  7.1 List of Publications

8 Paper I: Experiencing Real 3D Gestural Interaction with Mobile Devices
  8.1 Abstract
  8.2 Introduction
  8.3 Related Work
  8.4 System Description
    8.4.1 Gesture Detection and Tracking
    8.4.2 Local Orientation and Double-angle Representation
    8.4.3 Rotational Symmetries Detection
    8.4.4 3D Structure from Motion
    8.4.5 Finger Detection and Tracking
      8.4.5.1 Fingertip Detection
      8.4.5.2 Localization by Clustering
      8.4.5.3 Finger Tracking
    8.4.6 3D Coding and Visualization
  8.5 Experimental Results
  8.6 Usability of the Proposed System
  8.7 Conclusion

9 Paper II: 3D Photo Browsing for Future Mobile Devices
  9.1 Abstract
  9.2 Motivation
  9.3 Challenges
  9.4 Enabling Media Technologies
    9.4.1 Vision-based Motion Tracking in 3D Space
    9.4.2 3D Visualization
  9.5 Design of the 3D Photo Browser
  9.6 Technical Contributions
    9.6.1 Gesture Detection and Tracking
    9.6.2 3D Motion Analysis
    9.6.3 Methods for 3D Visualization
  9.7 Concluding Remarks
  9.8 Future Work

10 Paper III: Bare-hand Gesture Recognition and Tracking through the Large-scale Image Retrieval
  10.1 Abstract
  10.2 Introduction
  10.3 Related Work
  10.4 System Description
    10.4.1 Pre-processing on the Database
      10.4.1.1 Position/Orientation Tagging to the Database
      10.4.1.2 Defining and Filling the Edge-orientation Table
    10.4.2 Query Processing and Matching
      10.4.2.1 Direct Scoring
      10.4.2.2 Reverse Scoring
      10.4.2.3 Weighting the Second Level Top Matches
      10.4.2.4 Dimensionality Reduction for Motion Path Analysis
      10.4.2.5 Motion Averaging
  10.5 Experimental Results
    10.5.1 Dimensionality Reduction for Selective Search
  10.6 Conclusion and Future Work

11 Paper IV: Interactive 3D Visualization on a 4K Wall-Sized Display
  11.1 Abstract
  11.2 Introduction and Related Work
  11.3 3D Motion Analysis
  11.4 Error Analysis in 3D Motion Estimation
  11.5 Experimental Results
    11.5.1 Visualization on a 4K Wall-sized Display
  11.6 Conclusion and Future Work

12 Paper V: 3D Visualization of Single Images Using Patch Level Depth
  12.1 Abstract
  12.2 Introduction
  12.3 Related Work
  12.4 Monocular Features for Depth Estimation
  12.5 Feature Vector
  12.6 MRF and Depth Map Recovery
  12.7 Depth Normalization and Pixel Level Translation
  12.8 Anaglyph 3D Coding
  12.9 Experimental Results
  12.10 Conclusion

13 Paper VI: Stereoscopic Visualization of Monocular Images in Photo Collections
  13.1 Abstract
  13.2 Introduction
  13.3 Related Work
  13.4 System Description
    13.4.1 SIFT Feature Detection and Matching
    13.4.2 Image Transformation
    13.4.3 Image Projection and Stereoscopic Adjustment
    13.4.4 3D Coding and Visualization
  13.5 Experimental Results
  13.6 Conclusion

14 Paper VII: Robust Correction of 3D Geo-Metadata in Photo Collections by Forming a Photo Grid
  14.1 Abstract
  14.2 Introduction
  14.3 Related Work
  14.4 System Overview
  14.5 System Description
    14.5.1 Pre-processing
    14.5.2 Structure from Motion
    14.5.3 Uncertainty Analysis
    14.5.4 Signal Model
    14.5.5 Measurement Model
    14.5.6 Data Fusion
  14.6 Experimental Results
  14.7 Discussion and Conclusion

Bibliography

Chapter 1

Introduction

1.1 Motivation

Mobile devices play an important role in the modern world. Beyond ordinary daily use, they serve various advanced purposes in science, entertainment, education, medical applications, communication, gaming, etc. The fast-growing market for mobile devices shows that their sales are overtaking those of PCs. Recent statistics on the mobile device market indicate that total smartphone sales reached 490 million units in 2011 and 700 million units in 2012 across the globe [1, 2]. At the current rate of growth, smartphone sales will exceed 1.5 billion in 2017 [3]. In addition to this enormous number, we should take into account the other types of mobile devices, such as tablets, advanced portable gaming devices, digital cameras, camcorders, multimedia players, smart watches, and augmented reality glasses.

The capability of mobile devices to capture, store, process, and visualize multimedia content has increased significantly in recent years. In addition to high-resolution cameras, embedded sensors such as GPS, accelerometers, gyroscopes, and magnetometers make it possible to collect extra metadata and integrate it into a wide range of application scenarios. Moreover, this variety of sensors can serve as alternative input facilities for interaction between users and their mobile devices.

The introduction of smartphones has changed the way we interact with mobile phones. Nowadays, people interact with their mobile devices through touchscreen displays. The current technology offers single- or multi-touch gestural interaction on 2D touchscreens, an approach designed to provide more natural interaction when users operate their mobile devices. On the touchscreen of a smartphone, users can manipulate a soft keyboard and virtual objects, and perform actions simply by moving their fingers. Although this technology has removed many limitations in human-mobile-device interaction, the recent trend in the digital world reveals that people always prefer intuitive experiences with their digital devices. For instance, the popularity of the Microsoft Kinect demonstrates that people enjoy experiences that give them the freedom to act as they would in the real world.

The rapid development and wide adoption of smartphones have greatly changed our lives. Nowadays, we rely more and more on our smartphones, and there is a strong trend toward the smartphone becoming a part of our body. An indicative example is Google Glass, which can be seen as a version of the next generation of smartphones. Most probably, users of next-generation smartphones will no longer be satisfied with interaction over a 2D touchscreen; they will demand more natural interactions performed with the bare hands in 3D free space, for instance at the back of the phone or in front of the smart device. Thus, the next generation of smart devices will need a gesture interface that lets the bare hands manipulate digital objects directly, for instance to play Spotify, scan photo collections, or read emails.

Given these strong indications and current trends, mobile devices will be an essential, inseparable part of our lives in the near future. In fact, smartphones, tablets, and wearable augmented reality glasses will not be just ordinary devices: they will bring any experience from the huge sea of information into a personalized visualization space. For instance, a mobile device might serve as a guitar, fitness trainer, home theater, shopping center, navigation system, game console, or in thousands of other possible scenarios.

Currently, the major discussion is how we interact with the mobile device, while in the near future we should also consider how we interact through the mobile device with the physical space, objects, information, etc. When we discuss the next generation of mobile devices, we should therefore consider the next generation of interaction facilities too. The important question is: in which space, and how, will we interact with and through our future mobile devices?

1.2 Research Problem

Designing the interaction experience for future mobile devices involves many challenging problems. The rapid growth of mobile device technology shows that in the near future we will have extremely powerful handheld and wearable devices. Although it is hard to predict the exact hardware capabilities and features of future mobile devices, current trends in multimedia technology indicate that interaction with future devices should happen in a more intuitive and natural manner. Here, some important scientific questions arise. First, in which space, and how, should the intuitive interaction happen? Will touchscreens and track pads be replaced by other input facilities? And how should we design a new space for intuitive interaction with future mobile devices?

Intuitive interaction is closely tied to the mental connection humans have to their natural experiences. Since humans interact with their environment through physical gestures, 3D hand/body gestures might be an effective alternative to existing interaction facilities.

Assume that the new interaction space has been designed and introduced. The main challenge is then how to support this concept technically. What types of technologies are required to perform this significant change? What are the limitations of media technologies in performing 3D gestural interaction? How can we detect, recognize, and track complex hand gestures, head motions, and body movements in 3D space? And how can enabling media technologies solve the technical problems?

From both design and technical perspectives, introducing new ways to interact with future mobile devices is an extremely challenging task. This thesis aims to tackle these challenges and introduce new concepts, designs, and technical solutions for the issues mentioned above. These challenges are discussed in detail in the following sections.

1.2.1 Future Mobile Devices

In discussions of future mobile devices, we have to consider some important points. Five to ten years from now, we will most probably face substantial changes in mobile technology. From a hardware point of view, future mobile devices will feature more advanced and powerful components, such as various types of sensors, high-speed processors, 3D displays, and huge memories. User experience in interaction with mobile devices is likely to be quite different from today's; therefore, designing any interactive application for future mobile devices needs extensive investigation and research. From a design point of view, the interaction environment and visualization quality will change substantially in the near future. Here, the main challenge is how to design a usable system that enhances the user experience in interaction with future mobile devices.

1.2.2 Experience Design

In the multimedia context, experience is defined as the sensation of interaction with a product, service, or event [4]. Therefore, experience design for mobile users should account for the quality and sensation of interaction at the physiological, affective, and cognitive levels. For a more convenient and desirable experience, interaction between user and device should happen in a natural and effective manner. Unlike interaction with the physical world, where people use their body gestures, the best available technologies in smartphones and tablets confine interaction to limited 2D touchscreens. This limitation prevents natural interaction in the wide range of applications where physical gestures for 3D manipulation are unavoidable: picking, placing, grabbing, moving, pushing, zooming, and, in general, manipulating virtual menus, objects, and graphics in 3D environments all require physical hand gestures. In addition, because the interaction happens on the display, users' fingers or hands in practice cover a large area or some parts of the screen while they operate the device; as a result, they lose visibility of the display during the interaction. Since the hardware capability of mobile devices is increasing rapidly, the complexity of applications will increase as well, which means that in the near future we will interact with our digital devices in a quite different manner.

Another important point to consider is visualization quality. High-quality user perception requires realistic visualization; this is the main idea behind the development of 3D display technologies such as 3D cinemas and TVs. It is predictable that in the future, multimedia content will be displayed in 3D format. Therefore, adapting old content to future visualization technologies should be considered; for instance, we need to find an effective way to convert our old multimedia collections, such as 2D photo albums and videos, to 3D, which is quite a challenging problem. Overall, experience design for future mobile devices is a difficult task from both the interaction and the visualization perspectives.

Quality of user experience is a difficult concept to define, measure, and evaluate. Although substantial research has been done on this subject, finding a straightforward method to measure the quality of experience (QoE) is still challenging. Usability is an important criterion to consider when investigating QoE; it can be perceived from three angles: efficiency, effectiveness, and user satisfaction [5]. From a technical point of view, these three factors have been found to be more practical to measure and evaluate. Therefore, improving the usability factors can significantly enhance the quality of user experience in the multimedia context.

1.2.3 Limitations in Interaction Facilities

Designing interactive applications for mobile devices is still a challenging problem. Although new devices are quite powerful in terms of processing, the limitations in size and weight imposed by portability leave many problems unsolved. One major problem is how users can effectively communicate with their devices at the hardware level. The current technology provides several solutions: the commonly used hardware facilities for communicating with mobile devices are miniature keyboards, tiny joysticks, and touchscreen displays [6].

Keyboards allow users to perform tasks through menus, and to type, search, and navigate, but in reality even a small keyboard occupies a large space and limits the display area. Moreover, the usability of such keyboards is questionable for users with large fingers selecting tiny buttons. Substantial research has been done to reduce the size of keyboards, for example by mapping the QWERTY layout to other formats or by using a few, or even a single, button [7, 8]. Swype is another well-known technique for enhancing interaction with a mobile device through the virtual keyboard: the user enters a word by sliding a finger from its first letter to its last, lifting only between words.

Although joysticks are useful in some applications for scrolling and selecting menus, they are very limited and difficult to work with, especially on small screens. Nowadays, touchscreen displays are used by most smartphones and tablet PCs, and the trend shows that buttonless devices are becoming more popular. This indicates that users prefer to work on larger screens, and designers allocate the whole device surface to the touchscreen display. On the other hand, touchscreen displays have several drawbacks. First, in typing scenarios a virtual keyboard is rendered on the screen and occupies a large space for user convenience. Second, in most applications at least one hand works on the surface, which causes occlusion, and in many cases both hands are involved. The occlusion limits the visualization, and the quality of experience is degraded.

Some contributions present a novel approach based on touching the back of the device [9]. Although this solution may work in limited scenarios, users generally lose the correspondence between visual perception and touch.

The technology commonly used in interaction design for mobile devices is a 2D touchscreen display with a single physical button or none at all. Since humans interact with the physical world in 3D space, the quality of interaction is degraded when it is mapped onto 2D surfaces. Technically speaking, motion in 3D space is represented by six degrees of freedom (three rotation parameters and three translation parameters), while in the mapping from the real world to 2D screens the motion parameters are reduced to two. On single-touch displays, motion is limited to translation in 2D (x and y) coordinates; newer products use multi-touch gestures to simulate rotation and translation along the z-axis. Even so, with the best interactive devices on the market, the motion parameters remain limited on 2D screens. This means that without extra buttons, 2D gestures, or the aid of embedded orientation sensors, manipulation around the x and y axes on 2D screens is not possible. And because in magnetometer-aided applications the device itself must move, visual content such as graphics, photos, and videos may fall out of the user's sight, so this approach is not applicable in most cases.
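To make the parameter count concrete, the following minimal Python sketch (an illustration for this discussion, not software from the thesis) builds a rigid 3D motion from its six degrees of freedom and shows that a single-touch drag on a 2D screen constrains only two of them.

```python
import numpy as np

def rigid_motion(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 rigid transform from six degrees of freedom:
    three rotation angles (radians) and three translations."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx      # rotation: 3 of the 6 DOF
    T[:3, 3] = [tx, ty, tz]       # translation: the other 3 DOF
    return T

# A single-touch drag fixes only (tx, ty); the rotations and the
# z-translation remain unconstrained, which is exactly the gap that
# 3D gestures can fill.
drag_only = rigid_motion(0.0, 0.0, 0.0, tx=0.02, ty=-0.01, tz=0.0)
```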

Another important group of mobile devices is the forthcoming augmented reality glasses, such as Google Glass. Since this type of gadget may be the future of smartphones, it is quite important to investigate how conveniently users can interact with them. Voice commands are one effective solution for instructing the device to take a set of actions; for dialing a number, searching for a phrase, or capturing a photo, they can be really useful. For more complex tasks, such as writing text, skimming emails, or browsing photos, users definitely need more input facilities. Google has introduced a touch bar on the side frame of the Glass to address this problem. Although this small surface provides more capability for user interaction, it is clearly weaker than current smartphone touchscreens due to its size and its invisibility to the user's sight.

In short, designing usable and convenient input facilities for future mobile devices requires deep research and investigation. 3D gestural interaction in free space might be an effective alternative to the current interaction technology, and enabling media technologies can support user-device interaction and enhance the user experience.

1.2.4 Limitations in Visualization

In order to improve the user experience in multimedia applications, high-quality, realistic visualization is required in addition to effective interaction. The main reason for manufacturing mobile devices with larger displays is to enhance the visual output and the quality of experience. Size and mobility have been traded off in design for many years, but due to the importance of visual interaction, mobile devices have grown in screen size. Substantial experiments have been done to find the optimal and most effective size for different mobile devices [6]. A mobile device, as its name implies, should be portable and easy for its user to hold, and this criterion makes balancing portability against size the most challenging task, besides the power-consumption issue, which is outside the scope of this discussion [10]. Today's smartphones offer only a limited surface for visualization. Wearable smart glasses that provide high-quality visualization might significantly enhance the visual experience, and 3D display technology is another feature that could improve perception quality in future mobile devices.

1.2.5 Technical Challenges in 3D Gestural Interaction

3D gestural interaction is a rather new trend in the multimedia context, and substantial effort has been devoted to this area. Specifically, 3D gestural interfaces are used in gaming and entertainment applications. One of the enabling technologies for building such gesture interfaces is hand tracking and gesture recognition. The major technology bottleneck lies in the difficulty of capturing and analyzing articulated hand motions. One existing solution is to employ glove-based devices, which directly measure the finger joint angles and spatial positions of the hand using a set of sensors (e.g., electromagnetic or fiber-optic sensors). Although such applications exist in human-computer interaction, virtual reality, and 3D games, glove-based solutions are too intrusive, cumbersome, and expensive for natural interaction with mobile devices. To overcome this, vision-based hand motion capturing and tracking solutions need to be developed. Capturing hand and finger motions in video sequences is a highly challenging task due to the large number of degrees of freedom (DOF) of the hand kinematics; tracking articulated objects through sequences of images is one of the grand challenges in computer vision. Recently, Microsoft demonstrated how to capture full-body motion by means of its newly developed depth camera, the Kinect. The question is whether the problem of 3D hand tracking and gesture recognition can be solved by using 3D depth cameras. This problem has certainly been greatly simplified by the introduction of real-time depth cameras. However, technologies based on 3D depth information for hand tracking and gesture recognition still face major challenges for mobile applications.

Mobile applications have at least two critical requirements: computational efficiency and robustness. Mobile applications assume feedback and interaction in a timely fashion; no latency should be perceived as unnatural by the human participant. Therefore, the maximum time between the completion of a person's gestural action and the response from the device must be no longer than 100 ms (a real-time vision-based system should process at least 10 frames per second). This requires an extremely fast solution for hand tracking and gesture recognition, and it is doubtful whether most existing technical approaches, including the one used in the Kinect body tracking system, would lead the technical development for future mobile devices, given their inherently resource-intensive nature. The other issue is robustness: solutions for mobile applications should always work, indoors or outdoors. This may well exclude Kinect-type depth sensors from the next generation of mobile devices. We therefore come back to our original problem: how to solve hand tracking and gesture recognition with ordinary video cameras.
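As a rough illustration of this budget (a hypothetical sketch, not an implementation from the thesis), a vision pipeline can be checked frame by frame against the 100 ms end-to-end limit:

```python
import time

FRAME_BUDGET_S = 0.100  # 100 ms response limit, i.e. at least 10 fps

def process_frame(frame):
    """Placeholder for detection, tracking, and pose estimation."""
    time.sleep(0.030)  # pretend the whole pipeline takes 30 ms
    return frame

start = time.perf_counter()
process_frame(frame=None)
elapsed = time.perf_counter() - start
print(f"per-frame latency: {elapsed * 1000:.0f} ms "
      f"({'within' if elapsed <= FRAME_BUDGET_S else 'over'} the 100 ms budget)")
```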

A critical question is whether we could develop alternative video-based solutions to hand tracking and gesture recognition that fit future mobile applications better. This question is significant to address: it is not only one of the fundamental problems in computer vision, but it would also have a potential impact on the mobile industry and, above all, on interaction with mobile devices in the future.

1.3 Future Trends in Multimedia Context

Contributions of this thesis are highly inspired by the following key trends in the multimedia context.

1.3.1 3D Interaction Technology

The major direction of interaction technology is toward more intuitive and natural interaction between users and digital devices. This is the main reason that keyboards, joysticks, and other traditional input facilities have largely been replaced by track pads and touchscreens. The rapid development of new sensors for 3D interaction, such as the Microsoft Kinect and Leap Motion, is another indication of this trend.

1.3.2 3D Visualization

Visualization quality has improved significantly during the recent decade. The major trend indicates that 2D display devices are being replaced by 3D technology. Realistic perception and quality of experience are important features introduced by 3D display technology.

1.3.3 Passive Vision to Active/Interactive Vision

The introduction of wearable smart displays such as Google Glass reveals a new trend in the multimedia world. Augmented reality glasses and similar products will change user perception from passive to active or interactive vision: users may interact with the environment through the wearable display, receive information from various channels, and command the display.

1.3.4 Gesture Analysis: from Computer Vision Methods to Image-based Search Methods

Gesture detection, recognition, and tracking are mainly treated as classical computer vision and pattern recognition problems. The capability of new devices to store and process large databases motivates the idea of solving these problems with image-based search approaches. Therefore, the development of search methods for visual content might be the future approach to gesture analysis.
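To suggest what such an image-based search approach can look like, here is a minimal sketch (with assumed design choices: a 128-dimensional global descriptor, Euclidean nearest-neighbor matching, and a pre-annotated database; it is not the gesture search engine developed in this thesis) that matches a query frame against tagged gesture images:

```python
import numpy as np

# Hypothetical database: each entry pairs a feature vector with a
# gesture label and a tagged 6-DOF pose (annotations assumed given).
db_features = np.random.rand(10_000, 128)  # stand-ins for real descriptors
db_labels = ["grab"] * 5_000 + ["point"] * 5_000
db_poses = np.random.rand(10_000, 6)       # (rx, ry, rz, tx, ty, tz) tags

def describe(frame):
    """Stand-in global descriptor; a real system might use, e.g.,
    edge-orientation histograms over the segmented hand region."""
    return np.random.rand(128)

def query(frame, k=5):
    f = describe(frame)
    dists = np.linalg.norm(db_features - f, axis=1)
    top = np.argsort(dists)[:k]                     # k nearest images
    labels_k = [db_labels[int(i)] for i in top]
    label = max(set(labels_k), key=labels_k.count)  # majority vote
    pose = db_poses[top].mean(axis=0)               # average tagged pose
    return label, pose
```

Under this framing, recognition reduces to a database lookup: the query inherits the label of its nearest annotated neighbors, and averaging their pose tags yields a rough 3D pose without per-frame model fitting.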

1.4 Research Strategy

The main objective of this research is to develop concepts and technologies for effective interaction with future mobile devices. To fulfill this objective, challenges from both the design and the technical perspectives must be considered. This thesis covers the concept of human-mobile-device interaction, future challenges, and possible new frameworks for improving the quality of user experience; it then introduces technical solutions to overcome these challenges and demonstrates experimental results. The main research strategies toward these goals can be summarized in the following items:

• Concepts of interaction and visualization spaces have been deeply investigated during this work. The idea of extending the interaction and visualization spaces to real 3D space, sharing the interaction/visualization spaces, and potential application scenarios are introduced.

• In order to support the future interaction/visualization concept, enabling media technologies have been deeply studied and widely used during this research.

• 3D gestural interaction is suggested as a powerful tool for future interaction technology. Different methods for gesture detection, recognition, tracking, and 3D motion analysis have been studied, and new algorithms supporting this concept have been developed during this research.

• Concepts of active and passive vision have been investigated during this work. An interactive framework for 3D displays has been introduced, and new methods for 3D visualization of multimedia content have been developed.

• Various implementations of gestural interaction and 3D visualization, and various experiments, have been conducted on stationary and mobile platforms. Experimental results have been compared and final conclusions drawn.

• The future direction of mobile multimedia, potential application scenarios, enabling technologies, and new frameworks have been investigated.


Chapter 2

Related Work

2.1 Terminology

Nowadays, gesture-based interaction is a strong trend in the multimedia context. In general, effective 3D gestural interaction can be achieved by combining technical solutions for gesture analysis with usable application design and an efficient user interface. Since the main focus of this thesis is the technical aspects of gesture analysis, it is crucial to define the technical keywords and expressions used frequently in the discussions.

Gesture recognition: gesture recognition is the process of interpreting human gestures using mathematical models or computer vision algorithms. It is widely considered for communication between users and computers using sign language, and various hand gestures might be used to command digital devices in different tasks. In this context, gesture recognition is the process of differentiating between various hand gestures and assigning different labels to them. For instance, all variations of the grab gesture, in different poses and orientations, should be recognized as the grab gesture.

Gesture detection: the process of detecting the presence of a gesture pattern in an image frame is known as gesture detection. In this context, for a specific hand gesture, the detection output indicates the presence or absence of the gesture pattern in an image frame.

Gesture localization: the process of returning the estimated position of the detected gesture in an image frame is known as gesture localization. The location of the gesture might be returned through different parameters, such as a bounding box, ellipse axes, or center of mass.

Gesture tracking: the process of gesture localization in a video sequence is known as gesture tracking. Gesture tracking might be performed by localizing the gesture in each single frame of an image sequence; alternatively, following the motion of the localized gesture across consecutive frames might be considered gesture tracking.

Gesture pose estimation: estimating the position and orientation of the detected gesture with respect to the camera origin is pose estimation. In the discussions of this thesis, 3D pose refers to position (three parameters) and orientation (three parameters) with respect to the camera coordinate system.

3D gestural interaction: interaction between users and digital devices employing hand/body gestures in 3D space is regarded as 3D gestural interaction. Gesture detection, localization, recognition, tracking, and 3D pose estimation are the essential components of gestural interaction.
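The terminology maps naturally onto simple data structures. The following Python sketch (illustrative only; the field choices are assumptions, not definitions from the thesis) shows one way the outputs of these stages could be represented:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Detection:
    present: bool                    # gesture detection: present or absent

@dataclass
class Localization:
    bbox: Tuple[int, int, int, int]  # gesture localization: x, y, w, h

@dataclass
class Recognition:
    label: str                       # gesture recognition: e.g. "grab"

@dataclass
class Pose:
    rotation: Tuple[float, float, float]     # 3 orientation parameters
    translation: Tuple[float, float, float]  # 3 position parameters
    # both expressed in the camera coordinate system

@dataclass
class Track:
    # gesture tracking: a localization (and optionally a pose) per frame
    history: List[Tuple[int, Localization, Optional[Pose]]]
```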

2.2 Related Work

3D technology and the research area around it have developed rapidly in recent years. Although substantial efforts have been made to equip devices with 3D technology, numerous problems and challenges remain. Nowadays, the main focus of 3D research is on 3D visualization: in 3D cinemas, 3D TVs, 3D digital cameras, and even 3D mobile phones, the main goal is to add 3D features to the visualization side. In this context, 3D technology is considered from two aspects: first, how the interaction between user and device happens in 3D space; and second, the visualization technology, where digital output should be displayed in 3D format.

In the following sections, current motion capture and tracking technologies in interactive products and environments are reviewed first. Afterwards, the applicability of current technologies to mobile applications, the available solutions, and related work are discussed.

2.2.1 3D Motion Capture Technologies in Available Interactive Systems

In recent years, 3D motion analysis has been found useful in various scenarios such as entertainment, virtual/augmented reality, and medical applications [11]. Different methods have been studied and introduced to retrieve the motion parameters effectively. Since the aim is to interact with the mobile device through the user's gestures in 3D space, different techniques for 3D motion tracking and analysis must be studied: if we can retrieve the 3D motion parameters with high accuracy, we will be able to design an effective interaction environment for manipulating digital content. In the following sections, different approaches for analyzing 3D motion, captured by various sensors in different setups, are introduced and compared.

2.2.1.1 Passive Motion Tracking and Its Applications

The most common method to analyze motion is to use static cameras. This tracking method is known as passive, where the cameras are static and the subjects move. The Microsoft Kinect is one example of passive motion analysis, where the sensor is mounted somewhere in the room and users move in front of it. Kinect features an RGB camera and a depth sensor that provide full-body 3D motion capture and facial recognition [12]. Sony has also added motion


capture systems to its game console [13]. PlayStation Move performs motion capture through a hand-held controller. This controller features a glowing sphere that can shine in the full range of RGB colors. Based on the size and position of the glowing sphere, captured by the PlayStation camera in 3D space, the motion is accurately estimated [14, 15, 16]. The passive approach is widely used in medical applications to analyze patients' motions for diagnosing different types of physical disorders. Such systems usually use several expensive cameras, mounted at different positions in the room, and wearable markers or special clothes with visible markers on the body joints to be detected by the cameras from a distance [17]. In systems that work without any marker or wearable device, additional sensors such as 3D cameras, depth or distance sensors are usually added to the installation or capturing device [17].

2.2.1.2 Active Motion Tracking and Its Applications

Active motion capture configurations estimate the 3D motion by using wearable devices. Wearable devices might be different types of sensors that measure 3D motion parameters such as orientation and acceleration changes, and transmit the information to a base station for processing. The Wii MotionPlus game controller performs motion analysis with an active configuration. The device incorporates a gyroscope and an accelerometer to accurately capture and report the 3D motion [18]. In high-accuracy virtual reality and medical applications, similar types of sensors are extensively used on body joints [17].

2.2.1.3 Comparison Between Active and Passive Methods

Active motion analysis usually provides more accurate results due to its higher

resolution in measuring the motion parameters. Since sensors are mounted on

body parts, they can measure the body motions with a higher resolution compared with the passive installation, where motion is captured from a distance.

Based on the research conducted in [19], the accuracy of active motion capture is about 10 times higher than that of passive motion capture where a single RGB camera is used for capturing and measuring the 3D motion.

Figure 2.1: Passive and active motion tracking in gaming consoles; Left: The Microsoft Kinect; Middle: PlayStation 3 Move; Right: Wii MotionPlus.

Although the

measurement is highly dependent on the motion analysis techniques, in general the substantial difference reveals that, for accuracy reasons, body-mounted sensors are preferred to the passive configuration. A major drawback

with active motion capture is that wearable devices are usually uncomfortable

for users. Moreover, many active motion capture systems use specific instal-

lations, expensive materials and many sensors that will substantially increase

the total cost. On the other hand, passive systems suffer from lower accuracy due to the error caused by distant motion estimation. Apart from passive systems with wearable markers, marker-less systems use natural gesture analysis and are convenient for users.

Due to the fact that mobile devices are equipped with different types of sen-

sors, it is possible to make use of them in the active or passive configurations.

On one hand, when they are in motion, they can be used as an active sensor

(for instance, by reading the orientation sensor or analyzing the video input).

On the other hand, in static setup, they might be considered as passive sen-

sors. Motion analysis of a moving object from the video input, captured by

the device’s camera, is an example of this configuration. This thesis is focused

on the gestural interaction behind the mobile device’s camera. Basically, this

scenario is similar to passive motion tracking using a vision sensor, but due to


the close distance between the vision sensor and the user’s gesture, it can also

exhibit the features of active motion tracking.

2.2.2 3D Motion Estimation for Mobile Interaction

Currently, the most popular way to interact with mobile devices is through the

2D touchscreens. As mentioned before, touchscreen displays have limitations

in 3D application scenarios, and the idea of employing other types of sensors such as

orientation or vision sensors provides an opportunity to enhance the quality

of interaction. Generally, in HCI applications, different solutions have been

used to analyze the human body or gesture motion. The retrieved information

from motion analysis might be used to facilitate the interaction. Many solutions have been developed based on marked gloves or markers on body joints (see Fig. 2.2) [20, 21, 22, 23, 24, 25, 26]. Some of them perform gesture analysis using depth sensors [27, 28, 29]. Model-based approaches have been used in many applications [30, 31]. Other solutions analyze the motion by means of shape or temperature sensors [32, 33, 34], etc. Almost all these solutions are

developed for stationary systems with powerful components. Due to the limi-

tations on mobile devices, most of the proposed solutions will not be practical.

Limited power resources, cost, mobility, and size are important features that

make the design process for 3D interaction really difficult. New devices are

equipped with different types of integrated sensors (orientation sensor, optical

cameras, GPS, etc.). Now the question is whether it is possible to use them in an effective way to analyze 3D motion. Generally, the answer is yes.

In many virtual reality and augmented reality applications, integrated sensors

are used to control the motion. In [35], rendered graphics are controlled by the orientation sensor. In [36, 37], vision sensors are employed to detect hands, gestures, or different types of objects. The major weakness of all current technologies

is their limited capability in 3D motion analysis. Most of them are limited to object de-

tection algorithms to augment graphics or manipulate virtual objects. The

problem to be tackled is to analyze the six DOF motion in 3D space. There-

fore, when real 3D interaction with mobile devices is discussed, it means that all motion parameters in 3D space must be considered.

Figure 2.2: Motion-based interaction using wearable markers and gloves. Left: visual markers; Middle: T(ether), motion tracking glove; Right: ShapeHand, motion capture device.

2.2.3 3D Gesture Recognition and Tracking

Existing algorithms of hand tracking and gesture recognition can be grouped

into two categories: appearance-based approaches and 3D hand model-based

approaches. Appearance-based approaches are based on a direct comparison

of hand gestures with 2D image features. The popular image features used to

detect human hands and recognize gestures include hand colors and shapes,

local hand features, optical flow and so on. The earlier proposed works on

hand tracking belong to this type of approaches [38, 39]. The drawback of

these feature-based approaches is that clean image segmentation is gen-

erally required in order to extract the hand features. This is not a trivial task

when the background is cluttered, for instance. Furthermore, human hands

are highly articulated. It is often difficult to find local hand features due to

self-occlusion, and some kind of heuristics is needed to handle the large

variety of hand gestures. Instead of employing 2D image features to repre-

sent the hand directly, 3D hand model-based approaches use a 3D kinematic

hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is

employed to recover the hand motion parameters by aligning the appearance

projected by the 3D hand model with the observed image from the camera,

and minimizing the discrepancy between them.

Generally, it is easier to achieve real-time performance with appearance-based


approaches, owing to the simpler 2D image features. However, this type of approach can only handle simple hand gestures, like detection and tracking

of fingertips. In contrast, 3D hand model based approaches offer a rich de-

scription that potentially allows a wide class of hand gestures. The bad news

is that the 3D hand model is a complex articulated deformable object with

27 DOF. To cover all the characteristic hand images under different views, a

very large image database is required. Matching the query images from the

video input with all hand images in the database is time-consuming and com-

putationally expensive. This is why most existing 3D hand model-based

approaches focus on real-time tracking for global hand motions with restricted

lighting and background conditions.

For general mobile applications, we need to cover the full range of hand gestures, so 3D hand model-based approaches seem more promising. To handle the challenging exhaustive search problem in the high-dimensional space of human hands, the efficient indexing technologies used in the information retrieval field have been tested. Zhou et al. proposed an approach that integrates powerful text re-

trieval tools with computer vision techniques in order to improve the efficiency

of hand image retrieval [40]. An Okapi-Chamfer matching algorithm is used

in their work based on the inverted index technique. Athitsos et al. proposed

a method that can generate a ranked list of three-dimensional hand configura-

tions that best match an input image [41]. Hand pose estimation is achieved

by searching for the closest matches for an input hand image from a large

database of synthetic hand images. The novelty of their system is the ability

to handle the presence of clutter. Imai et al. proposed a 2D appearance-based

method by using hand contours to estimate 3D hand posture [42]. In their

method, the variations of possible hand contours around the registered typical

appearances are trained from a number of graphical images generated from a

3D hand model. A low-dimensional embedded manifold is created to overcome

the high computation cost of the large number of appearance variations.

Although the methods based on text retrieval are very promising, they are too few to be visible in the field. The reason might be that the approach is too preliminary, or that the results are not impressive because the tests were run over databases of very limited size. Moreover, it might also be a consequence of the suc-

cess of Kinect in real-time human body gesture recognition and tracking. The

statistical approaches (random forests, for example) adopted in Kinect have started to dominate mainstream gesture recognition approaches. This effect is

enhanced by the introduction of a new type of depth sensor from the Leap

Motion company. This type of depth sensor can run at interactive rates (processing at least 10 frames per second) on consumer hardware and interact with moving objects in real-time. Despite its impressive demo, the Leap Motion sensor cannot handle the full range of human hand shapes and sizes. The main reason is that such sensors usually detect and track only the presence of fingertips or points in free space when the user's hands enter the sensor's field of view. In effect, they can only be used for general hand motion tracking.

Regarding the special requirements for mobile applications such as real-time

processing, low complexity, and robustness, it seems that a promising approach

to handle the problem of hand tracking and hand gesture recognition is to use

text retrieval technologies for search. In order to apply this technology to the next

generation of mobile devices, a systematic study is needed regarding how text

retrieval tools should be applied to handle gesture recognition, particularly,

how to integrate the advanced image search technologies [43]. Surely, there

exist many powerful tools to overcome this problem. The key issue is how

to relate the vision-based gesture analysis to the large-scale search framework

and define the right problem. Once the right problem is defined, we can identify and integrate the right tools to form a powerful solution.

2.2.4 3D Visualization on Mobile Devices

3D visualization or 3D imaging refers to the techniques for conveying the illu-

sion of depth to the viewer's eyes. The first efforts on 3D imaging started around the mid-1800s [44, 45]. Most 3D vision systems are based on stereoscopic vision. Stereoscopic vision, or stereopsis, is the process of conveying the 3D illusion

by making stereo images. Various techniques have been introduced to this end.

Figure 2.3: Examples of available 3D mobile devices; Left: HTC EVO 3D; Right: LG Optimus 3D.

Old techniques such as parallel or crossed-eye viewing [46] without eye-

glasses, color anaglyphs and color codes using eye-glasses are widely used for

3D photography [47, 48]. During the recent years, 3D technology has become

popular in the cinema industry and TV production. Passive technology using polarized glasses [49] and active shutter glasses [50] are common approaches in 3D cinemas and 3D TVs, respectively. Autostereoscopic 3D is another tech-

nology that requires no glasses. In this method, stereo images are transmitted

separately to each eye from the light source. Some advanced 3D displays also provide a limited number of views of a scene for more realistic 3D perception while the user is moving his/her head [51]. The popularity of 3D displays has increased after 3D cinemas and 3D TVs attracted the public's attention.

The current trend in the manufacturing of 3D devices shows that we should expect tremendous growth in this market over the coming few years. Mobile device manufacturers have started releasing smartphones with 3D capabilities, and a few 3D mobile phones are already available in the market (see Fig. 2.3).

Among different companies, two famous mobile phone manufacturers, LG and

HTC, have introduced their 3D smartphones. Both devices use autostereo-

scopic technology and dual cameras for recording and displaying stereoscopic

images and videos [52, 53].


Chapter 3

General Concept and Methodology

3.1 General Concept

Intuitive interaction between multimedia users and digital devices is a desired

feature of future technology. Although the introduction of breakthrough technologies such as the iPhone dramatically changed the manner in which users interact with their

phones, users always demand more realistic ways to communicate with their

devices.

Currently, limitations of 2D touchscreens stop users from having a natural

interaction in a wide range of applications where using physical gestures is

unavoidable. For instance, 3D manipulation of graphical content such as 3D

rotations, zooming in/out, grabbing, pushing, moving, etc., requires physical

3D space for intuitive interaction. Even the latest technologies of smartphones

and tablets are limited to touch capabilities for interaction on a limited 2D sur-

face. Moreover, occlusion problems caused by fingers and hands might degrade

the quality of interaction and visualization. Moreover, various entertainment

and gaming applications are limited to using the rendered virtual buttons on

the touchscreen while they need real 3D manipulation in free space. For in-

stance, playing musical instruments such as virtual piano, guitar and drums

requires pushing, moving, and tapping in 3D space. Obviously, limiting the


same tasks to a 2D surface degrades the natural interaction. Rendered virtual controllers on the 2D surface limit the visualization space and affect the quality of user experience. 3D gestural interaction might be even more vital for forth-

coming augmented reality glasses since the dedicated surface for interaction is

shrunk and shifted to the frame.

This thesis aims to introduce a new space for interaction between the user and the mobile device. The main idea is to shift the interaction space from the 2D surface

to real 3D space around the device where the vision sensor can capture the

hand/body gestures. In other words, performing gestural interaction in 3D

space is proposed to solve the limitations of 2D interaction technology.

Delivering the experience of free-hand interaction to the mobile device users

could fundamentally affect the mobile industry. Bare-hand 3D interaction enables users

to communicate with their mobile devices exactly in the same manner as they

do in the physical world with people, objects, etc. (in the same way they push a button or pick up and rotate an object in physical space). Analysis of

the users’ gestures from the video input might be used to control the ongoing

operation on the device. Users might perform a set of actions using different

hand gestures or they might control and manipulate virtual objects and but-

tons by their hand movements.

From the design point of view, the main goal is to enhance the quality of user

experience by designing a new interactive environment. Since the quality of

user experience is highly affected by the interaction design, by introducing

a new interaction space the main goal is to solve the current limitations of

interaction technology.

From the technical point of view, the aim is to introduce enabling technologies

for gesture recognition, tracking, and 3D motion analysis that can effectively

facilitate the interaction design in mobile devices and multimedia applications.

In addition, novel techniques in 3D visualization of multimedia collections such

as single images and photo albums are considered. Finally, other related imple-

mented techniques that might support the main contributions are included in

this study.


Figure 3.1: For a better experience design, interaction and visualization spaces can be extended to 3D space.

From the practical point of view, this work aims to demonstrate the con-

ducted experiments and implementations of the main contributions in real

applications. Different application scenarios such as photo browsing, graphi-

cal manipulation and 3D motion control are included in this work.

3.1.1 Interaction/Visualization Space

Miniature keyboards, tiny joysticks, and, specifically, touchscreen surfaces are various input facilities designed for mobile devices. Considering today's mobile devices, it is clearly observable that the interaction and visualization spaces are located on one side of the device. Since the physical buttons have been

gradually removed from mobile devices, the allocated surface for the display has been significantly increased.

Figure 3.2: Bare-hand 3D interaction with a mobile device in the extended interaction/visualization space.

The major concern about today's mobile

devices is the overlap between the interaction surface and the display, due to the common surface designed for the input and output modules.

Due to the fact that users prefer to keep their visual contact with input fa-

cilities, it makes sense to design them in this way. For instance, users want

to see which button they push or where they touch. On the other hand, this

configuration might cause some problems in visualization. Obviously, when we work on touchscreen displays, the occlusion problem might happen: users lose visibility, and the quality of experience is degraded. A novel solution

for this problem is to extend the interaction and visualization spaces from 2D

surface to 3D space (see Fig. 3.1). This extension should be done in a way that

preserves the visual contact of the user with the ongoing operation in interaction

with the device. Since mobile devices have at least one embedded camera at

the back side, it is possible to see the space behind the device from that vision


sensor. If we manage to interact with the device in 3D space behind the cam-

era, we can successfully extend the interaction space. Furthermore, interaction

in 3D space features substantial capabilities that can facilitate a wide range

of applications. For effective interaction in 3D space, advanced technologies

in 3D motion analysis should be developed. From the technical perspective,

interaction in 3D space can handle the limitations of 2D interaction on touch-

screen displays. On the other hand, 3D visualization technology such as 3D

coding and 3D displays might help to extend the visualization space from 2D

surface to 3D. 3D visualization technology conveys the illusion of depth and

3D perception to users (see Fig. 3.2).

3.1.2 Sharing the Interaction/Visualization Space

One of the great advantages that 3D interaction offers is the possibility of shar-

ing the physical space for collaboration. In fact, by turning the interaction

space from limited surface to free space, users will be able to collaborate within

the provided common physical space between the mobile devices. Therefore, the concept of single-user, single-device might be extended to collaborative multi-user, multi-device using the shared space. In general, different configurations for interaction between users and mobile devices might be considered. Depending on the desired purpose, the number of users, the number of devices, and the interaction/visualization spaces might vary. Here, several possible scenarios for single

and shared interactive applications are introduced (see Fig. 3.3, 3.4).

3.1.2.1 Single-user, Single-device

In this scenario, the user holds the mobile device in one hand, and the other hand controls the interaction in the 3D space behind the device. The 3D space between the display and the user's eyes belongs to the 3D visualization. The user can control and manipulate the content within the allocated interaction spaces.


3.1.2.2 Multi-user, Multi-device with Shared Interaction Space

In this configuration, more than one user share a common interaction space

to manipulate the content. They might sit in front of or next to each other. Each one holds a device in one hand and interacts with the content with the other hand. Interaction happens in the common space between the devices, where users might share the content, pass it around, or manipulate it together.

3.1.2.3 Multi-user, Single-device with Shared Visualization Space

In this setup, users share a single device for collaboration. They use the space behind the device for 3D interaction, and visualization happens on a single display. This configuration is suitable for the collaboration of two users.

3.1.2.4 Interaction from Different Locations for Multi-user Multi-device

In this configuration, each user has his/her own location, device, and space for interaction with the digital content, but they share the same virtual space. In other words, they interact with common content from different locations through the network connection. This model might be extended to the case

that the visualization space of one user is affected by the interaction of the

other users and vice versa.

3.2 Evolution of Interaction/Visualization Spaces

Mobile phones have evolved substantially since the 1980s, both in design and in functionality. Before smartphones came to the market, the challenge was to reduce the size for portability purposes. After smartphones attracted the users' attention, devices became larger in display size with fewer physical buttons, for a better quality of experience. During this evolution, many devices have hit the market with their special features. Generally, if we consider the evolution of mobile devices during the recent decade, we can distinguish a

gradual change from both interaction and visualization aspects.

Figure 3.3: Different configurations of single and collaborative interaction. 1: Single-user, single-device; 2: Multi-user, multi-device with shared interaction space; 3: Multi-user, single-device with shared visualization space; 4: Interaction from different locations for multi-user, multi-device.

In the earlier generation of mobile phones, the device's surface was allocated to both inter-

action and visualization facilities (keypad and display). In that configuration,

visualization quality was quite weak due to the very limited area for display.

Afterwards, some manufacturers proposed an innovative solution to allocate

the whole surface to the display and design a physical keypad layer under the

display layer. During recent years, most smartphone manufacturers have introduced their products with touchscreen displays. In this configuration, both interaction and visualization spaces are located in the same area.

With the introduction of wearable displays such as Google Glass, the way users interact with the device might totally change due to the removal of the hand-held module. This significant change enables users to benefit from 3D interaction

using both hands. It means that if the technical solution for bare-hand inter-

action is provided, users may perform different actions in 3D instead of using

weaker input facilities such as touch frames or voice commands.

In this thesis, the proposed solutions to the limitations of today's technology extend the interaction to the physical 3D space.

Figure 3.4: In 3D interaction, users might share the 3D space for collaborative tasks in different applications.

Interaction happens in the 3D space behind the mobile device, and 3D visualization shows its effect

in the 3D space between the user and the display. If we take into account other body parts, such as the head or foot, as interaction facilities, the interaction space can be extended to the 3D space around the device. This conceptual

model might be considered for future interactive smart devices (see Fig. 3.5).

However, the evolution trend in mobile interaction is towards designing sim-

pler and more intuitive input facilities. Clearly, advanced media technologies

combined with powerful hardware are required to make this evolution happen.


Figure 3.5: Evolution of the interaction/visualization spaces in mobile devices.

3.3 Enabling Media Technologies

As discussed before, the main objective behind this thesis is to provide techni-

cal solutions that enable users to experience a realistic interaction with future

smart devices in entertainment, communication, and information contexts. In

order to support the main concept, this thesis focuses on two major problems:

first, interaction design with mobile devices based on motion analysis in 3D

space [38, 39, 54, 55, 56, 57], and second, 3D visualization on ordinary 2D

displays [58, 59, 60, 61]. The proposed interactive systems are based on the

detection, tracking, and analysis of the user’s 3D motion from the visual input.

This visual input might be received from the mobile device's camera, a body-mounted camera, a webcam, and, in general, from any type of vision sensor.

In this thesis, the main focus is on the analysis of the user’s gestures, captured

by the vision sensor, in real-time. Specifically, hand gestures are considered

due to the direct connection of the hands to real-life gestural activities.


Figure 3.6: Enabling media technologies support the concept of 3D interaction and 3D visualization.

The retrieved 3D motion parameters from the detected gestures are used to

drive the real-time interaction in various applications.

Other technical contributions of this thesis are focused on providing realistic visualization. They mainly include the technical solutions for converting

the 2D content to 3D and interactive visualization of the content based on the

user’s head motion (see Fig. 3.6).

3.3.1 Vision-based Motion Tracking in 3D Space

In this thesis, 3D gestural interaction and its significant advantages in comparison with the current 2D technology are introduced. As technically discussed in paper I [57], the efficiency and effectiveness of the interaction with mobile devices in 3D space are substantially higher than in 2D space. This interaction might happen by detecting and tracking specific hand gestures that play important roles in interactive applications. Alternatively, other body parts such as the head or foot might be employed to perform the 3D interaction. However, the main idea is to use physical space for 3D manipulation on 2D devices. By using six-DOF motion analysis, the problems of 2D interaction will be handled in

most cases [38, 39, 54, 55]. The important features of the proposed systems

are bare-hand, marker-less gesture detection, recognition and tracking from

2D video input. This technology enables users to efficiently interact with their

devices in real-time applications (see Fig. 3.7).

Figure 3.7: 3D gestural interaction with mobile device.

Since the vision-based interaction with hand-held mobile devices happens in the space close to the camera, the distance between the moving subject and the capturing sensor has some physical limitations. For instance, the user cannot move his/her hand more than 35-40

cm away from the body. This limited space for interaction preserves high-resolution motion analysis. When the vision sensor and moving subject are

relatively close to each other, the configuration will be more accurate for mea-

suring the 3D motion parameters. Therefore, the resolution in motion analysis

will be increased.

Another proposed configuration for 3D spatial interaction is interactive vision.

This setup is proposed for interactive 3D displays, where the user interacts with the content of the display based on head motion. Head-mounted or static vision sensors might be used to measure and report the head movements. Using the measured 3D motion parameters, users control the angle and viewpoint of the digital content in real-time.

3.3.2 3D Visualization

The amount of multimedia content on digital devices has increased enormously during recent years.

Figure 3.8: 3D processing on 2D query images.

Due to the substantial improvement in the quality of cameras in smart devices, users can capture large amounts of photos

and videos using their smartphones. Besides the challenges in ordering, orga-

nizing and interacting with huge collections, visualization quality is another

issue that should be taken into account. Since the majority of devices have captured and stored visual content using 2D technology, 3D visualization of 2D content becomes a challenging task. As today's 3D technology has attracted the users' attention, it is quite important to find an effective way

to visualize the content in a more realistic fashion. Fig. 3.8 shows the system

overview for 3D processing and visualization of 2D content in mobile devices.

This thesis aims to tackle several challenges in visualization and improve the

quality of visual perception in interaction with the multimedia content. Specif-

ically, the following items are considered in the discussions:

• First, is there any way to convey the experience of real 3D visualization (similar to what people experience in watching a real-world scene) to users by measuring the users' dynamic position/orientation in real-time (paper IV [61])?

• Second, is it possible to display stereoscopic 3D content on normal 2D displays (papers I, II, IV, V, VI [56, 57, 58, 59, 62])?

• Third, how can we recover the 3D information from a single 2D image and visualize it in 3D format on an ordinary 2D display (paper V [59])?

• Fourth, is it possible to make use of photo/video collections, captured by 2D devices, to convert and display the content in 3D (paper VI [58])?

• Fifth, is there any efficient way to correct 3D modeling, positioning and localization errors by integrating the metadata from position/orientation sensors with computer vision techniques (paper VII [60])?

Figure 3.9: Active 3D vision for head motion-based user-device interaction.

The main focus of all 3D display technologies is to convey the illusion of

depth to the viewers’ eyes, while 3D visualization might be seen from different

perspectives as well. In fact, the real 3D perception, in a way we observe the

real world, is 3D manipulation based on motion, plus the depth perception.

An example might clarify this idea. Imagine a box in front of a user. Each


side has a different color and pattern, and from the front view the user can only see the top and front sides. In a natural manner, if the user wants to see the left or right side, he/she will move in the left or right direction, and in the same way users observe any scene by moving in different directions. In another approach, users can pick up the box and rotate it to see any side they desire. One way to observe the 3D space is therefore to manipulate the scene by the users' motion. In other words, they should be able to control what they like to see. This

type of visualization might happen by analysis of the user’s motion in front of

the vision sensor and transmission of the motion information to the rendering

system for 3D visualization. Of course, this process should be performed in a

real-time fashion, without any noticeable delay, to deliver a realistic experience to the user's eyes. Moreover, the output might be rendered by stereoscopic techniques to convey the illusion of depth. Therefore, the concept of interactive vision or interactive 3D displays can be formed based on this idea (see Fig. 3.9).

3.4 Methodology Overview

Technical contributions of this thesis are mainly focused on the development

of the enabling technologies for 3D gestural interaction. Therefore, 3D ges-

ture detection, recognition, and tracking are technical features intended to be extensively used in the proposed solutions. Generally, gesture analysis

is considered as a classical computer vision and pattern recognition problem.

Thus, a substantial part of the technical discussion of this work is allocated to these challenges from the classical approach. Low-level feature/pattern detection, global model-based detection, estimating the motion from tracking robust features, and other computer vision methods are employed to find novel solutions for solving the challenges of gesture analysis.

In addition to the common computer vision methods, a new framework for

gesture analysis is introduced. Given that the capability of modern computers to store and process extremely large databases has increased substantially, shifting the complexity of the methods from pattern recognition algorithms to a large-scale retrieval approach might be the new trend for tackling


the gesture analysis problems. Therefore, the introduced method is based on

collecting an extremely large database of gesture images and retrieving the

best match from the provided data. In the ideal scenario for gesture analysis, the database should include all possible articulated hand gestures and the

corresponding metadata including the relative spatial position and orientation

with respect to the camera. The major methodology is based on direct re-

trieval of the best match for any query gesture. The retrieval process should

be performed in a way that preserves the smooth motion in a continuous ges-

tural interaction. This step might be done by analyzing the gesture patterns in a high-dimensional space.

The main methodology towards improving the visual perception is based on

3D visualization of today's multimedia content on current display devices.

The whole process might be divided into two steps. In the first step, 3D motion

analysis of the user’s head for real-time manipulation of the content should

be performed. In this step, a vision sensor is used to track visual features from the environment, and motion analysis over consecutive frames is used to measure the 3D motion parameters. The parameters, measured in real-time, help users interact with the content and manipulate it in a natural manner.

The methodology for visualization of the content is based on the processing

of the captured images and videos by the current 2D devices. The conversion

methods from 2D to 3D are based on the direct analysis of the single views or

multiple view analysis in photo collections. The main strategy is to convert

the 2D multimedia to 3D and use stereoscopic coding. This approach adds

additional value to the visual experience, while it does not require extra hard-

ware. In other words, besides the user-manipulated content, the output might

be visualized by stereoscopic techniques to convey the illusion of depth to the

user’s eyes.


3.5 Gesture Analysis through Pattern Recognition Methods

Basically, a common vision-based system for real-time gestural interaction is

composed of four main elements: user, vision sensor, gesture analysis compo-

nent, and visualization component. The real-time query input from the user is a continuous set of hand/body gestures. In this context, bare-hand gestural performance in free space is considered for most of the proposed scenarios, and in a few cases head movements are used as query input. Ordinary vision sensors

can be divided into two groups: 2D cameras such as normal RGB webcams

and 3D depth sensors such as Microsoft Kinect. Since most of the ordinary

devices are equipped with normal RGB cameras, the main focus of this work

is to use that type of sensor for different research scenarios (embedding depth

sensors in mobile devices does not seem to be possible in the near future).

The gesture analysis step usually includes feature extraction, gesture detec-

tion, motion analysis and tracking parts. Pattern recognition methods for

detecting and analyzing the hand gestures are mainly based on local or global

image features. Simple features such as edges, corners, lines, and more com-

plex features such as symmetry patterns, SIFT, SURF, and FAST features are widely used in computer vision applications [63, 64]. If the desired goal

is to detect a specific pattern, a combination of image features might be used.

For dynamic hand gestures, it is quite challenging to define a single pattern

for detection due to the complex combination of the hand joints. Therefore,

a combination of local/global image features might be useful to detect and lo-

calize the hand gestures. Distinctive features are extremely useful for robust

tracking and 3D motion analysis. If the hand gesture is correctly detected and

localized, robust features such as SIFT or SURF might be used to analyze

the 3D motion parameters in a sequence of image frames. If the main goal is

to track the gesture in consecutive frames, the detection algorithm might be

conducted on single frames in a sequence. Another way to track the gesture

is to detect and localize the gesture in a single frame and follow the detected pattern in the subsequent frames using common tracking methods such as optical flow.

Figure 3.10: Overview of the 3D gesture analysis process based on computer vision methods.

However, depending on the application scenario, if the recognition

of different types of gestures is required, different gesture patterns should be

analyzed. If the goal is to track a special gesture, the specific pattern might

be detected in consecutive frames, and if the 3D motion of the hand gesture

is required, gesture localization and 3D motion analysis from the sequence of

frames should be performed.

Finally, the gesture analysis output might provide the required information

about the type of gesture and the position/orientation of the detected gesture

with respect to the vision sensor. The retrieved information will be sent to

the real-time applications. The final output might be rendered in 2D/3D for

visualization on the display. Fig. 3.10 demonstrates the block diagram of the

3D gesture analysis based on computer vision methods.


3.6 Gesture Analysis through Large-scale Image Retrieval

In addition to the computer vision methods for gesture analysis, this thesis

introduces a new framework and methodology for tracking articulated hand

motions in video sequences based on search technologies. The innovative so-

lution is to define the problem of hand tracking and gesture recognition as a

general image search problem. The idea is to build a large database that con-

tains at least thousands of hand gesture images. Ideally, these images should

emulate all possible hand gestures. Furthermore, these images are tagged with

hand motion parameters including 3D position and orientation of the gestures.

When the hand of a mobile device user is captured by the video camera around

the mobile device, the captured hand image is used to retrieve the most sim-

ilar hand gesture image stored in the database. Then the motion parameters

tagged with the matched image are given to the captured hand image. Thus,

3D hand tracking and gesture recognition can be achieved. The key to this approach is how to quickly find the best match from a database. The proposed solution is to treat each image as a document, convert shape features into a huge visual vocabulary table, and employ inverted indexing as a powerful re-

trieval tool to perform the search. The developed framework might have a big

impact on gesture analysis, where high-resolution hand/gesture tracking is re-

quired. In fact, unlike the classical pattern recognition methods, in the search

framework, entries of the database will not be analyzed by shape-based or

model-based methods. The main idea is to include every possible hand ges-

ture image regardless of its shape or model. The entries of the database might

be real images of articulated hand gestures or computer-generated graphics.

Here, the important point is to annotate the database entries with the position

and orientation information of the recorded hand gestures. The vocabulary of

hand gestures integrates the information from visual features of the gesture

images and their pose information in an extremely large table.

Figure 3.11: Overview of the 3D gesture analysis system based on the large-scale image search method.

On the other hand, the query frame, captured by the vision sensor, will be pre-processed and its visual features will be extracted for analysis in the gesture search block. The core of the system is the gesture search engine that

analyzes the similarity of the query input with the database entries in several

steps and retrieves the best match. The output of the system is the most

similar gesture image to the query input, which in the ideal case is identical to

the query. Finally, the retrieved image and its annotated pose information

will be employed in the application. Fig. 3.11 shows the block diagram of the

3D gesture analysis system based on large-scale image retrieval.


Chapter 4

Enabling Media Technologies

From a technical point of view, in order to enhance the usability of an interac-

tive system, numerous challenges must be considered. Specifically, interaction

design for mobile devices using hand gestures incorporates technical issues in

computer vision techniques such as detection, tracking, 3D motion estimation

and visualization. Basically, technical discussions around the proposed meth-

ods can be divided into the following categories: low-level pattern recognition

for gesture analysis, search-based gesture analysis, and interactive 3D visual-

ization.

In order to implement a gesture-based interactive system, various hand ges-

tures should be considered. Fig. 4.1 demonstrates the most common hand

gestures for 3D interaction and manipulation of objects in different digital en-

vironments. Although the collected gestures can be used for different actions

such as pick, place, move, grab, zoom, rotate, etc., they all might be seen

as variations of basic hand poses such as Grab or Pinch gestures. This is

the main reason that, in this context, gesture detection and recognition based

on computer vision methods are mainly focused on the Grab gesture and its

variations such as deformations, scaling, and rotations. Clearly, these gestures

can cover the majority of the required actions in 3D interaction. Moreover, the proposed search-based method for gesture analysis can be used for an extremely large number of hand gestures.


Figure 4.1: Most common hand gestures in 3D interaction scenarios.

4.1 Gesture Detection and Tracking Based on Low-level Pattern Recognition

Low-level pattern recognition algorithms might be extremely useful in gesture

analysis. Although low-level features do not represent complex patterns independently, due to their extremely fast processing and low complexity, they are highly recommended for real-time applications. Here, the main challenge

is how to combine low-level features in an effective way to retrieve a global

meaning such as detecting a gesture pattern from a video sequence.

In the contributions of this thesis towards hand gesture detection and track-

ing, low-level features are extensively used [38, 39, 54]. Specifically, gesture

tracking based on low-level operators known as rotational symmetry patterns

is considered. As discussed in paper I [57], rotational symmetries are specific

curvature patterns derived from the local orientation image [65]. The main

idea behind rotational symmetry is to use local orientation to detect complex

curvatures in double-angle representation. The double-angle representation,


z, of an orientation with direction θ is defined as a complex number whose argument (angle) is double the local orientation, z = ce^{i2θ}, where the magnitude c represents the signal energy or confidence.

Rotational symmetries can be categorized in different orders and phases. By

using a set of specific filters on the orientation image, it is possible to detect

different members of rotational symmetry patterns such as curvatures, circular

and star patterns. The idea of taking advantage of the rotational symmetries

in gesture detection seems to be rather general and complex, but modeling the

gesture by the choice of the rotational symmetry patterns of different classes

could lead us to differentiate between them and other features even in clut-

tered backgrounds. Theories and mathematical definitions of local orientation,

rotational symmetries, detection of symmetry patterns, etc., are fully discussed

in paper I [57].

Through the experiments, it can be demonstrated that hand gestures or finger-

tips show high responses if we search for specific group members of rotational

symmetry patterns in the orientation image. For example, fingertips are respon-

sive to the group of first order rotational symmetries (curvature patterns) [38],

or the grab gesture is responsive to the group of second-order rotational symmetries

(circular patterns) [39, 54]. Therefore, depending on the application scenario,

the proper detector for different hand gestures can be introduced. By this

approach, the hand gesture can be localized in a sequence of frames captured by

the device’s camera. Increasing the selectivity of the desired patterns can be

achieved by applying the removal process on the noisy responses caused by

complex backgrounds [38].

For instance, if detecting the fingertips is desired, first-order symmetry pattern

detection will return the position of the fingertips as well as the noisy features

from the background. During the further processing, magnitude, phase, and

color properties of the responses can be used to differentiate between the cor-

rect detections and noisy points [66] (see Fig. 4.2).

Second-order rotational symmetries return more specific patterns. For instance, the grab gesture responds to the circular pattern from this group of symmetries.

Figure 4.2: Gesture detection, tracking, and 3D motion analysis based on rotational symmetry patterns.

It is possible to enhance the detection by controlling the phase of

the pattern using a simple threshold. Thus, the grab gesture can be localized

properly in a video sequence [57].

However, the mentioned processing results in proper detection and rejects the noisy responses. From a technical perspective, gesture detection and

tracking are the first steps in the 3D gesture-based interaction. The core of

the system is 3D motion analysis where the 3D motion parameters will be

recovered from the video sequence (see Fig. 4.2).

In many interactive applications, 2D gesture detection and tracking are enough

to perform the task and further 3D motion analysis is not required. For real

3D (six DOF) interaction, extra information about the 3D position and ori-

entation must be recovered.

4.1.1 3D Motion Analysis

In computer vision and image processing discussions, a common way to re-

trieve and estimate the motion between image frames is to analyze the motion

between the extracted feature points. 3D structure can be studied by finding

and matching the corresponding feature points in consecutive frames [67]. In

computer vision algorithms, various types of feature detectors and descriptors

have been introduced. Generally, feature detectors can be divided into edge, corner, and blob detectors, or any combination of them [68].

Figure 4.3: 3D motion analysis steps. Retrieving the 3D structure from motion between image frames.

In applications where robustness and accuracy have higher priority, more com-

plex feature descriptors are required. SIFT, SURF and ChoG [63, 64, 69] are

examples of robust feature descriptors which have been found to be useful in

many multimedia applications.

In the contributions of this thesis, scale-invariant feature transform (SIFT)

is widely used as a robust scale/rotation-invariant feature descriptor. Once

the hand gesture is localized, the SIFT features will be extracted on the de-

sired region (user’s gesture) in the image frame. The extracted features will

be tracked in consecutive frames and the structure of the 3D motion can be

derived by finding the transformation between two frames. This transforma-

tion might be in the form of a planar homography [67], as discussed in [58], or a fundamental and essential matrix [67], as suggested in [39, 54, 60]. In order to remove the outliers in the matching between feature points and find the best transformation matrix, consistent with the true matches, robust iterative methods such as RANSAC [70, 71] are performed. As a result, the best motion transformation between two frames will be estimated (see Fig. 4.3). Paper I [57] extensively explains how the 3D motion parameters can be retrieved

by decomposing the estimated transformation [39, 54]. In paper I [57] gesture

detection, tracking and 3D motion analysis from rotational symmetry patterns

are explained in detail. Moreover, the effect of applying 3D motion parameters

to different applications is demonstrated [57, 62].
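To make the pipeline concrete, the following is a minimal sketch using OpenCV (an illustration under assumed parameters, not the exact thesis code): SIFT features are matched between two frames, RANSAC rejects outlier matches while fitting the essential matrix, and the matrix is decomposed into rotation and translation. The camera matrix K and the Lowe ratio of 0.75 are assumed values.

```python
import cv2
import numpy as np

def relative_motion(frame1, frame2, K):
    """Estimate the relative rotation R and translation direction t
    between two frames from SIFT correspondences."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(frame1, None)
    k2, d2 = sift.detectAndCompute(frame2, None)

    # ratio-test matching to keep distinctive correspondences
    good = [m for m, n in cv2.BFMatcher().knnMatch(d1, d2, k=2)
            if m.distance < 0.75 * n.distance]
    p1 = np.float32([k1[m.queryIdx].pt for m in good])
    p2 = np.float32([k2[m.trainIdx].pt for m in good])

    # RANSAC-fitted essential matrix, decomposed into R and t
    E, mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=mask)
    return R, t
```

For planar scenes, cv2.findHomography with cv2.RANSAC plays the same role as the essential matrix estimation above.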


4.2 Gesture Detection and Tracking Based on Gesture Search Engine

In this thesis, a new framework and algorithms for tracking hands in cluttered images and recognizing the underlying gestures are introduced. To better specify hand gestures and hand motions, two concepts should be distinguished:

Hand Posture: a static hand pose at its current position, without any movement involved.

Hand Gesture: a sequence of hand postures connected by continuous hand or finger movements over a short period of time.

For real-world hand tracking applications, the problems of initialization and recovery have to be addressed. In order to develop robust solutions, we can adopt a static approach, that is, localize and recognize the hand posture from individual frames. Thus, hand gesture recognition can be achieved by reading individual posture images. The goal is that the new framework and algorithms lead to solutions with such high tracking and recognition accuracy that they can be used as a stand-alone module for 3D hand tracking. Such solutions will also be useful in providing a single-frame estimate to a 3D hand tracker, consequently achieving automatic initialization and error recovery.

The proposed technical approach is to redefine the problem of hand tracking and gesture recognition as a text search problem. The framework is based on the idea of building a large database which, in the best case, emulates all possible articulated hand motions. Furthermore, these images are tagged with 3D hand motion parameters, including the joint angles of the articulated fingers. When the hand of a user is captured by the device's camera, the captured hand image is used to retrieve the most similar image from the database. The ground-truth labels of the retrieved matches are used as hand pose estimates for the input. The approach works even under poor segmentation conditions: the only required input is a bounding box around the hand gesture, and the bounding box is allowed to include arbitrary amounts of clutter in addition to the hand region.


The key issue in this approach is how to quickly find the best match in a database of gesture images. The proposed solution is based on treating each image as a document, converting shape features into words, and employing a powerful text retrieval tool, inverted indexing, to perform the fast search.

4.2.1 Providing the Database of Gesture Images

The core of the gesture search system is how to represent gesture contours. To enable the formulation of the gestural interaction problem as a search framework, two particular properties should be considered: first, shape sensitivity, meaning that the matched hand gesture shape should be as close as possible to the one in the input frame; second, position sensitivity, meaning that the matched gesture should be at a similar position to the input gesture. In this work, a new type of shape vocabulary is defined. The introduced technique is based on dividing the contour into segments, or edge features, and an individual segment is considered as a word for forming the search table.

In order to form the search table, all database images are normalized and their corresponding edge images computed. Each edge pixel is represented by its position and orientation. In order to obtain a global structure for the low-level edge orientation features, we can form a large table representing all possible cases in which each edge feature might occur. Considering the whole database with respect to the position and orientation of the edges, an extremely large table can represent the whole vocabulary of hand gestures in edge-pixel format. For instance, for an image size of 640x480 with eight orientation bins and a database of 10,000 hand gesture images, the gesture vocabulary table has dimensions 2,457,600 x 10,000 (since 640 x 480 x 8 = 2,457,600). After forming this huge table, each cell is filled with the indices of all database images that have a feature at that specific point. This table therefore collects the required information from the whole database, which is essential for the online gesture search.
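Since most cells of such a table are empty, it can be stored sparsely. The sketch below (illustrative names and an assumed orientation quantization, not the thesis code) builds the inverted index that maps each (position, orientation) word to the database images containing it.

```python
import numpy as np
from collections import defaultdict

N_THETA = 8  # number of orientation bins, as in the text

def build_vocabulary_table(edge_images, orientation_maps):
    """Inverted index: (x, y, orientation bin) -> indices of database
    images with an edge feature at that word."""
    table = defaultdict(list)
    for idx, (edges, theta) in enumerate(zip(edge_images, orientation_maps)):
        ys, xs = np.nonzero(edges)
        for x, y in zip(xs, ys):
            o = int(theta[y, x] // (np.pi / N_THETA)) % N_THETA
            table[(x, y, o)].append(idx)
    return table
```

At query time, looking up a single edge word is a constant-time dictionary access, which is what makes the search scale to large databases.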


In addition to processing the database images to form the search table, for each single gesture image in the database, the 3D motion parameters are calculated and tagged to that specific image. This is done by mounting a motion capture sensor on the hand while the database images are being recorded. In the database, an active vision sensor (a hand-mounted camera) is used to measure the gesture movements and annotate the gesture images.

4.2.2 Query Processing and Matching

A query hand gesture is any type of hand gesture with its specific position and orientation. The first step in the retrieval and matching process is edge detection. This process is the same as the edge detection in the database processing, but the result will be quite different: for the query gesture, the presence of edge features from the cluttered background and other irrelevant objects is expected.

4.2.3 Scoring System

Assume that each query edge image, Q_i, contains a set of edge points represented by their row/column positions and specific directions. During the first step of the scoring process, for every query edge pixel Q_i|(x_u, y_v), a similarity function to the database images at that specific position is computed: Sim(Q_i, D_j). If the condition is satisfied for the edge pixel in the query image and the corresponding database images, the first level of scoring starts, and all database images that have an edge with a similar direction at that specific coordinate receive +3 points in the scoring table. The same process is performed for all edge pixels in the query image, and the corresponding database images receive their +3 points. Here, an important issue in the scoring system should be considered: the first scoring step covers the case where two edge patterns from the query and database images exactly cover each other, whereas in most real cases two similar patterns are extremely close in position without a large overlap. For such cases, which occur regularly, first- and second-level neighbor scoring are introduced.



A very probable case is that two extremely similar patterns do not overlap but fall on neighboring pixels of each other. In order to handle these cases, besides the first scoring step, for every pixel the first-level 8 neighboring pixels and the second-level 16 neighboring pixels in the database images are also checked. All database images that have an edge with a similar direction in the first-level and second-level neighborhoods receive +2 and +1 points, respectively.

In short, the scoring is performed for all edge pixels in the query with respect to the similarity to the database images, on three levels with different weights. The accumulated score of each database image is calculated and normalized, and the maximum scores are selected as the top matches. The proposed algorithm selects the top ten matches from the database. In order to find the closest match among them, a reverse comparison is required. Reverse scoring means that besides computing the similarity of the query gesture to the database images, Sim(Q_i, D), the reverse similarity of the selected top database images to the query gesture is computed as well. Combining the direct and reverse similarity functions results in much higher accuracy in finding the closest match from the database. The final scoring function is computed as S = [Sim(Q_i, D) x Sim(D, Q_i)]^{1/2}, i.e., the geometric mean of the two similarities. The highest value of this function returns the best match from the database images for the given query gesture. Afterwards, the motion parameters tagged to the best match can be used immediately to facilitate various application scenarios.
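The following sketch summarizes the scoring scheme (a simplified outline; normalization and the exact neighbor definitions are assumptions based on the description above). The +3/+2/+1 weights and the geometric-mean combination follow the text.

```python
import numpy as np

def direct_scores(query_words, table, n_images):
    """Three-level scoring: +3 for exact hits, +2 for the 8 first-level
    neighbors, +1 for the 16 second-level neighbors."""
    ring1 = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
             if (dx, dy) != (0, 0)]
    ring2 = [(dx, dy) for dx in range(-2, 3) for dy in range(-2, 3)
             if max(abs(dx), abs(dy)) == 2]
    scores = np.zeros(n_images)
    for (x, y, o) in query_words:                 # query edge pixels
        for idx in table.get((x, y, o), []):
            scores[idx] += 3
        for dx, dy in ring1:
            for idx in table.get((x + dx, y + dy, o), []):
                scores[idx] += 2
        for dx, dy in ring2:
            for idx in table.get((x + dx, y + dy, o), []):
                scores[idx] += 1
    return scores

def combined_score(sim_qd, sim_dq):
    """Final score S = [Sim(Q, D) * Sim(D, Q)]^(1/2) for a shortlisted match."""
    return np.sqrt(sim_qd * sim_dq)
```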

An additional consideration in a sequence of gestural interaction is the smoothness of the gesture search: the retrieved best matches in a sequence should represent a smooth motion. In order to perform smooth retrieval, the database gesture images are analyzed in high-dimensional space to detect motion maps. Motion maps indicate which gestures are close to each other and fall in the same neighborhood in the high-dimensional space. Therefore, for a query gesture image in a sequence, after the top-ten selection, the reverse similarity is computed and the top four matches are selected.


Figure 4.4: Overview of the gesture search engine.

Afterwards, the algorithm searches the motion paths to check which of these top matches is closest to the match of the previous frame, and that image is selected as the final best match. Fig. 4.4 shows the block diagram of the gesture search engine. In paper III [56], the whole process is explained in detail.

4.2.4 Quality of Hand Gesture Database

In general, two main issues have to be considered for the database: how large should it be, and how can it be built? The human hand is a complex articulated structure consisting of many connected links and joints. Including 6 DOF for orientation and position, there are 27 DOF for the human hand in total [72]. Rendering all possible combinations of joints and poses would generate a huge number of hand images, and it is impossible to store all of them on a mobile device. Fortunately, there is a strong correlation between joint angles, and the state space of the joints has substantially lower dimension. In [73], Wu et al. show that the state space of the joints can be approximated with 7 DOF; thus, 7 is a rather good estimate of the embedded dimension of hand postures.


Figure 4.5: Interactive 3D vision overview.

If we quantize each DOF and represent it with 3 bits, we obtain a total of 8^7 ≈ 2 million states. This gives a rough estimate of the required size of the hand gesture image database: at least around 2 million images.

The second issue is how to build such a database. One solution is to use a 3D hand model to render all possible hand postures with computer graphics technology and convert the generated gesture images into binary shape images through edge and boundary detection. The major problem with this approach is that the extracted edges are not natural, which directly affects the search for the best-matched hand shape. In this thesis, the bare hand is used against a uniform background to perform all sorts of gestures. The hand gestures are recorded and converted into binary hand shape images. Motion sensors or video cameras are attached to the hand to measure the exact position/orientation of the gestures; thus, the ground-truth hand motion parameters are tagged to the gesture images.


4.3 Interactive 3D Visualization

The main idea behind interactive 3D visualization is to enable users to interact with the content of the display based on their motion in free space. In fact, this technology helps them perceive the content in a realistic manner by controlling the angle and viewpoint in real time, turning a normal screen into an interactive digital window.

For accurate 3D motion tracking, active technology requires mounting the vision sensor on the user's body. Since the aim is to manipulate the content based on the user's viewpoint, the sensor is mounted on the user's head, and the video sequence is captured in real time. In order to estimate the head motion parameters from the visual input, the image frames extracted from the video sequence are processed in the motion analysis step.

The proposed motion analysis technology is based on the analysis of the 3D head motion between consecutive frames captured by the camera. For each pair of consecutive image frames, a robust feature detector can be employed to extract and track important feature points from the environment. In most cases, due to its robustness and scale-invariance properties, the SIFT feature detector is used. Afterwards, the relation between the two sets of corresponding feature points can be represented by a transformation matrix: a planar homography or, for a more accurate representation, the fundamental and essential matrices. The transformation matrix contains the information about the motion between the two image planes. In the next step, a decomposition process is applied to the transformation matrix to retrieve the 3D motion parameters. This process is performed on each pair of consecutive frames, and the relative 3D position/orientation is estimated. The motion analysis block provides six outputs for the rendering block: three representing the orientation parameters and three representing the position parameters in the x, y, z coordinate system.

Note that SIFT feature detection at every single frame requires rather heavy processing, which is not a problem in stationary systems. For faster processing, especially on mobile platforms, faster detectors such as SURF or FAST


Figure 4.6: Real-time interaction with the graphical content using the interactive 3D vision system.

features can be used. Another approach is to perform the feature detection in the first frame and track the detected features with common tracking methods, such as optical flow, in consecutive frames. The feature detection can be repeated when the number of tracked features drops below a certain value.
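A minimal sketch of this detect-then-track strategy with OpenCV is shown below (parameter values are illustrative; the re-detection threshold of 35 points echoes the value reported later in Section 5.1.2).

```python
import cv2

MIN_FEATURES = 35  # re-detect when the tracked set shrinks below this

def track_step(prev_gray, gray, points):
    """One tracking step: pyramidal Lucas-Kanade optical flow, with
    re-detection of good features when too many are lost."""
    if points is None or len(points) < MIN_FEATURES:
        points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                         qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None)
    ok = status.ravel() == 1
    return points[ok], nxt[ok]  # matched pairs for the motion estimation step
```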

The rendering block generates and updates the scene based on the provided motion information at each moment. The rendered scene is based on a pre-defined graphical model or an augmented reality environment. The output is displayed on a screen, while the user has the chance to interact with and manipulate the content in real time. Fig. 4.5 demonstrates the system overview of the interactive 3D vision. Fig. 4.6 shows how the user controls the viewpoint and position in the rendered scene by moving in 3D space. For capturing and measuring the 3D position and orientation of the user, an ordinary webcam is mounted on the user's head. The graphical content is updated according to the translation and rotation of the head at each moment. The perceptual effect is similar to looking at a real scene through a window: the view of the real scene is adjusted based on the angle and position of the viewer.


In paper IV [61], interactive 3D visualization is discussed in detail.

4.4 Methods for 3D Visualization

As mentioned before, in order to enhance the quality of experience in multimedia applications, the aim is to visualize the output in 3D format. Here, the following scenarios might be considered:

• First, 3D visualization of a graphical model with known geometry [39, 54];
• Second, 3D visualization of single images using the image itself [59];
• Third, 3D visualization of monocular images by analysis of multiple views in 2D digital photo collections [58].

A common way to visualize content in 3D is to produce stereo views. As fully discussed in [54, 57, 58, 59], stereoscopic systems transmit two views of a scene captured from viewpoints with a slight horizontal translation. For rendering graphical models in different applications, the geometry of the scene is known; it is therefore rather simple to render a second view that satisfies the required geometry for stereoscopic viewing. The task of stereoscopic visualization becomes more challenging when the content is not recorded with stereo cameras and no prior knowledge about the geometry or structure of the 3D scene is available. Considering single views or randomly captured views of a scene, an efficient way to generate stereo views must be found. 3D visualization from single and multiple 2D views is briefly described in the following sections; in papers V and VI [58, 59], the whole process is explained in detail.

4.4.1 Depth Recovery and 3D Visualization from a Single View

Making stereo views from a single monocular image is one of the most challenging tasks in computer vision. The first step in making 3D from single images is to recover the depth map.


This is performed by applying supervised learning algorithms to a set of images and the corresponding ground-truth depth maps. Statistical image modeling and estimation techniques such as Markov Random Fields (MRF) are used for training the system [74]. After the training process, the depth map for a query image can be recovered. Once the depth map is estimated, the information required for generating stereo views is calculated, as suggested in [59].
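A very simple depth-image-based rendering sketch is shown below (a toy illustration with an assumed linear disparity model; hole filling and occlusion handling are omitted). Each pixel is shifted horizontally by a disparity that grows as the depth shrinks.

```python
import numpy as np

def stereo_pair_from_depth(image, depth, max_disp=16):
    """Synthesize left/right views from one image and its depth map by
    horizontal pixel shifts; nearer pixels receive larger disparity."""
    h, w = depth.shape
    disp = (max_disp * (1.0 - depth / depth.max()) / 2.0).astype(int)
    cols = np.arange(w)
    left, right = np.zeros_like(image), np.zeros_like(image)
    for y in range(h):
        left[y, np.clip(cols + disp[y], 0, w - 1)] = image[y, cols]
        right[y, np.clip(cols - disp[y], 0, w - 1)] = image[y, cols]
    return left, right
```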

4.4.2 3D Visualization from Multiple 2D Views

Although 2D digital photo galleries and collections do not contain any 3D information, interesting 3D images and videos can be generated from them using computer vision techniques. Basically, many photo collections contain hidden connections between images, which can be represented by transformation matrices. In fact, any two, three, or more unstructured photos of a scene might capture overlapping areas. This means that by finding the geometric transformation between the overlapping images, 3D information about the real scene can be inferred. In paper VI [58], the process of generating stereo views for 3D visualization, by matching feature points and finding the homography between overlapping frames, is discussed in detail.

4.5 3D Channel Coding

The final step in visualizing the content in 3D is to encode the stereo channels. The coding techniques vary according to the display technology. For instance, in passive 3D systems with polarized glasses, the stereoscopic output should be transmitted in channels with different polarities [75], while with active shutter glasses, stereo frames are transmitted at twice the original rate (60 x 2 = 120 frames/sec) [75]. In the implementations, ordinary 2D displays are considered for rendering the 3D output. A common group of stereoscopic techniques that do not require 3D displays are color anaglyphs [76, 77].


Figure 4.7: Contributions in 3D visualization.

In anaglyph methods, the stereo frames are encoded into two different colors for the left and right eyes. The color-coded stereo frames are merged and displayed as a single layer on the display. Depending on the coding method, appropriate low-cost glasses are used to decode the displayed output: the glasses feature two different color filters for the left and right lenses, each filtering the corresponding layer from the output image. In the implementations, two enhanced techniques for generating more realistic outputs are used, known as Optimized Anaglyph and Color-code 3D [47, 76].
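As a concrete illustration, the basic red/cyan anaglyph coding can be sketched as follows (the Optimized Anaglyph and Color-code 3D variants apply different channel-mixing matrices, which are not reproduced here):

```python
import numpy as np

def red_cyan_anaglyph(left_rgb, right_rgb):
    """Merge a stereo pair into one frame: the red channel comes from
    the left view, green and blue from the right view."""
    out = np.empty_like(left_rgb)
    out[..., 0] = left_rgb[..., 0]      # R <- left eye
    out[..., 1:] = right_rgb[..., 1:]   # G, B <- right eye
    return out
```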


Chapter 5

Experimental Results

5.1 Experiments on Gesture Detection, Tracking and 3D Motion Analysis

Basically, the implemented system for gesture detection and tracking based on low-level patterns consists of the gesture input from the user, a vision sensor, and the algorithms for 3D gesture analysis.

The target gesture for the implementation of the gesture detector using low-level patterns is the grab gesture. The grab gesture was selected based on studies of intuitive human hand gestures for daily tasks such as picking, placing, and object manipulation [57]. In the experiments, the grab gesture is not treated as a rigid object; the implemented system is designed to tolerate deformations and rotations of the gesture up to the limit at which the global shape is still preserved in the captured frames.

5.1.1 Camera and Experiment Condition

Experiments on gesture detection using rotational symmetry patterns are generally conducted in the lab environment with normal lighting conditions and different backgrounds. For all experiments, a single RGB webcam is used. The camera is used in both static and semi-dynamic (held in one hand) setups to simulate stationary and mobile configurations. The distance between the camera and the user's gesture is normally between 15 and 40 cm.


Figure 5.1: Sample variations of the grab gesture.

For testing the robustness of the system, various backgrounds with different colors and patterns are used. In addition, a number of users with different skin colors and hand sizes are included in the tests.

5.1.2 Algorithm

In the experiments of this thesis, rotational symmetry patterns are used with two different approaches for detecting the grab gesture. The first approach is based on the first-order symmetry patterns. This group represents curvature patterns with different orientations, and observations reveal that fingertips respond to this group of symmetry patterns. On the other hand, curvature patterns are rather general, and noisy points from the background might show a similar response to the first-order symmetry detector. Therefore, in order to differentiate between the noisy points and fingertips, more features should be integrated into the algorithm. The first criterion is the magnitude of the responses: normally, the responses of the fingertips are much stronger than those of noisy points. Another feature is the phase: since an intuitive hand gesture will not rotate fully to an arbitrary angle, setting a threshold on the phase limits the responses to natural observations. The third criterion is skin color: another threshold on the color of the responses helps remove further noise. Finally, by combining all these conditions, the best responses representing the fingertips are detected. Although this approach requires further processing for detecting the fingertips, it provides more flexibility for detecting deformed gesture patterns.


Figure 5.2: 3D model manipulation using second-order rotational symmetry patterns. The graphical model follows the exact motion of the user's hand gesture in 3D space.

By detecting the fingertips and measuring the distance between them, it is possible to model various hand gestures.
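The cascaded thresholding described above can be sketched as follows (the magnitude, phase, and skin-color thresholds are illustrative assumptions, not the values used in the thesis):

```python
import cv2
import numpy as np

def fingertip_candidates(mag, phase, frame_bgr,
                         mag_thr=0.6, phase_lim=np.pi / 4):
    """Keep symmetry responses that are strong, have a natural phase,
    and fall on skin-colored pixels."""
    strong = mag > mag_thr * mag.max()                         # magnitude test
    natural = np.abs(phase) < phase_lim                        # phase test
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    skin = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255)) > 0   # assumed range
    ys, xs = np.nonzero(strong & natural & skin)
    return list(zip(xs, ys))                                   # candidate pixels
```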

Another developed method for detecting the grab gesture is based on the second-order symmetry patterns, the group representing circular patterns. With some constraints on the phase of the detected patterns, the circular form of the grab gesture can be detected properly. The phase constraints can be set based on the restriction of the wrist joints to rotation within a limited angle. Therefore, the grab gesture can be detected by searching for circular patterns with a phase variation between +/- 45 degrees (see Fig. 5.1). In order to improve the robustness of the system, after the first detections, a region of interest is defined around the localized gesture to secure the correct detection in consecutive frames and automatically remove noisy points. Since the second-order rotational symmetries represent more complex patterns, the noisy points are significantly fewer than in the previous case; on the other hand, the flexibility of the user's gesture is lower than with fingertip detection. The center of the patterns detected using second-order symmetries returns a point around the center of the grab gesture.

In order to retrieve the 3D motion of the localized gesture between the image


frames, SIFT feature detection and tracking are performed. For faster processing, the feature points are detected in the first frame and tracked in consecutive frames to retrieve the 3D motion parameters. In the PC implementation, SIFT feature matching between all frames was tested as well. In the former case, when the number of features drops below 35 points, feature detection is restarted to guarantee robust motion analysis.

5.1.3 Programming Environment and Results

The first version of the gesture analysis system based on rotational symmetries was developed in Matlab. After the preliminary results were approved, the program was implemented in a C/C++ environment, which significantly improved the efficiency of the system in processing the video sequence in real time.

Since rotational symmetries are low-level patterns, the detection process is, from a computational perspective, extremely fast. This is the major advantage that improves the quality of interaction in real-time applications. Even with the further processing for retrieving the 3D motion parameters, the performance required for efficient interaction can be achieved. As reflected in [57, 66], the measured detection accuracy shows the effectiveness of the algorithm. The implemented system based on the first-order rotational symmetry detector returns the fingertip positions; in order to localize the grab gesture, the middle position between the detected thumb and index finger is considered as the output.

The implemented system based on the second-order symmetry detector returns the center of the circular pattern; thus, for the grab gesture, the position of the response is always a point close to the gesture center. In both cases, the detected gesture points are used for the manipulation of graphical objects. The tested scenarios are based on 3D tracking and rotation with six-DOF motion analysis (see Fig. 5.2).


Figure 5.3: Capturing the motion parameters for tagging the pose information to the database images.

5.2 Experiments on Gesture Search Framework

Experiments on the gesture search framework can be divided into two steps: first, an offline step that includes constructing the database entries, tagging the motion parameters, and forming the vocabulary table of the gestures; second, the online gesture search process for a query input, which includes the scoring process and the neighborhood analysis for finding the best match.

5.2.1 Constructing the Database

The main strategy behind constructing the database is to record and store all possible hand gestures, including deformation, scaling, and translation variations. Moreover, the stored gesture frames should contain the 3D motion information for instant retrieval after the matching step. For this reason, an active vision system is used to immediately retrieve and tag the 3D motion parameters (six parameters comprising the 3D position/orientation information) to each image frame during the recording of the database images.


Figure 5.4: Active motion analysis for tagging the orientation information to the database images. The vision sensor is attached to the back side of the hand for measuring the 3D motion parameters. The retrieved motion parameters are applied to the 3D model to validate the accuracy.


The whole database is recorded in the lab environment with a stable lighting condition and a plain green background. In order to easily obtain a clear image of the gesture and eliminate the rest of the image, extra green paper covering the arm and the hand-mounted camera is used. The active camera is mounted on the back side of the hand, while a second camera captures the video sequence as the user performs different gestures. Therefore, the active camera captures frames from the environment for online 3D motion analysis, while the second camera captures the gesture sequence simultaneously. Finally, the retrieved 3D orientation, based on the hand motion, is tagged to the synchronized frame from the second camera, and this process continues until the construction of the database is complete (see Fig. 5.3 and Fig. 5.4).



The process of generating the database images and retrieving the orientation parameters is implemented in a C++ environment and runs in real time.

Another reason to cover the arm with the background color and provide a clear image of the hand is to calculate the 3D position of the gesture in each database image. During this step, the database images are first converted to edge images. Afterwards, the average position of the edges in the image coordinate system is calculated for each frame. Moreover, the bounding box around each gesture is defined; the size of the bounding box reflects the scaling factor, or depth, of the gesture with respect to the camera position. Finally, these three parameters representing the 3D gesture position are retrieved.
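A sketch of this position-tagging step (with assumed Canny thresholds) is given below: the mean edge position provides the x and y coordinates, and the bounding-box size serves as the scale/depth proxy.

```python
import cv2
import numpy as np

def gesture_position_3d(gray):
    """Return (x, y, scale) for a database frame: mean edge position
    plus bounding-box size as a proxy for depth."""
    edges = cv2.Canny(gray, 100, 200)                  # thresholds assumed
    ys, xs = np.nonzero(edges)
    cx, cy = xs.mean(), ys.mean()                      # average edge position
    bx, by, bw, bh = cv2.boundingRect(
        np.column_stack((xs, ys)).astype(np.int32))
    return cx, cy, max(bw, bh)                         # larger box = closer hand
```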

At this point, the hand gesture database is constructed, including the gesture images, the converted gesture edge images, and the corresponding text files containing the six motion parameters.

5.2.2 Forming the Vocabulary Table

The implemented algorithm for finding the best match for query frames is based on the low-level edge orientation features. Thus, the vocabulary table contains the indices of the relevant database images at different locations. The vocabulary table has the number of database images as its row size and m x n x n_theta as its column size, where m and n represent the width and height of each image and n_theta is the number of angle intervals. For most of the conducted tests, with an image size of 320x240, eight angle intervals, and 6000 images in the database, the vocabulary table has 6000 rows and 614,400 columns. Each cell in the vocabulary table stores the indices of the database images that have an edge at that position with a similar orientation. The conducted experiments reveal that with a database size of around 6000, the maximum number of indices per cell does not exceed 100. The whole process of forming the vocabulary table is performed in Matlab, and the final table is stored in text format for the online retrieval step.


5.2.3 Gesture Search Engine and Neighborhood Analysis

The online search system is implemented in a C++ environment for efficient real-time interaction. First, the vocabulary table is loaded into memory for fast retrieval. Afterwards, each frame from the real-time video input is sent to the gesture search engine. After the direct and reverse scoring steps, the top four matches are sent to the neighborhood analysis step, and the best match is selected.

Different methods for the analysis and mapping of the gesture images from the high-dimensional space to 3D space are introduced in paper III [56]. The main idea is to analyze the distance between the gesture patterns and construct a meaningful pattern for the neighborhood search. Since gestural interaction represents a smooth motion in 3D space, neighborhood analysis for selecting or predicting the closest match from the database for query inputs is quite important. In the implementations, the Laplacian method is selected for mapping the gesture vectors from the high-dimensional space to 3D space. The Laplacian method is chosen over other methods, such as PCA and LLE, because of the clearly visible pattern in the 3D representation of the image vectors. As demonstrated in Fig. 5.5, each branch in the graph indicates a clear change in the positioning of the gesture patterns within the database images: the dense center mostly represents gestures around the center point of the image, and each branch shows a direction towards the corners of the image frames. In the process of selecting the best match from the top matches, neighborhood analysis is used to return the closest gesture match based on the previously selected match in the video sequence. This step smooths the motion of the retrieved sequence.
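The Laplacian mapping can be reproduced with off-the-shelf tools; below is a brief sketch using scikit-learn's SpectralEmbedding (Laplacian eigenmaps) on vectorized gesture edge images, with an assumed neighborhood size.

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def embed_gestures_3d(edge_images, n_neighbors=10):
    """Map vectorized gesture edge images from the high-dimensional
    pixel space down to 3D with Laplacian eigenmaps."""
    X = np.stack([img.ravel().astype(float) for img in edge_images])
    emb = SpectralEmbedding(n_components=3, n_neighbors=n_neighbors)
    return emb.fit_transform(X)   # one 3D point per database gesture
```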

5.2.4 Gesture Search Results

During the experiments, various databases were built for testing the performance of the system. The earliest database contained about 1500 images of the grab gesture. Later, another database with 3000 images, including the grab gesture and other types of hand gestures, was captured. Afterwards, the database was extended to more than 6000 images.


Figure 5.5: Left: gesture images mapped to the three-dimensional space by the Laplacian method. Right: gesture and non-gesture images mapped to the 3D space by PCA.

At this step, non-gesture images were also included to analyze the performance on a larger database with noisy entries. At each step, system tests were also conducted on resized images. In general, both 320x240 and 160x120 images show quite promising results. Most of the tests are based on 320x240 images, but if the database grows to more than 10,000 entries, the image size might be set to 160x120 to improve the efficiency of the retrieval.

Among the steps of the online retrieval system, the reverse scoring consumes most of the processing time. This is the major reason why the reverse scoring is conducted only for the top ten matches retrieved by the direct scoring step. For instance, with 6000 images in the database and an image size of 320x240, the retrieval system can process 25 frames/second with the direct scoring step alone, while applying the reverse scoring reduces this to 15 frames/second. Thus, the reverse scoring has a stronger effect on the processing time than the database size. Fig. 5.6 shows sample gesture inputs and the corresponding best matches from the database of hand gestures.


Figure 5.6: Output of the gesture search engine for a number of sample query gestures.

5.3 Technical Comparison between the Prior Art and the Proposed Solutions

Basically, the majority of hand gesture recognition and tracking systems employ vision-based approaches to handle the technical challenges. RGB and depth cameras, or combinations of them (e.g., Kinect), are the widely used hardware for capturing body gestures. Since the contributions of this thesis target current and future mobile devices, the introduced methods are based on ordinary RGB cameras such as webcams and mobile cameras. Although various algorithms have been introduced in the computer vision and pattern recognition literature, the proposed solutions of this thesis can be compared to the common approaches for hand detection, gesture recognition, gesture tracking, and 3D motion analysis. 2D and 3D features, 3D models, and skeletal, appearance, color, and depth information are among the best-known properties that have been used for gesture detection and tracking; the majority of the prior art can be grouped into these categories or combinations of them.

As discussed before, the proposed gesture analysis system based on rotational symmetry patterns can be considered a combination of the mentioned computer vision approaches.


On the other hand, the introduced gesture analysis system based on large-scale search is not a classical computer vision approach. Nevertheless, the technical contributions of this thesis can be compared with the prior art from different aspects. Table 5.1 provides a comprehensive comparison between the prior art and the proposed solutions, where Method 1 and Method 2 represent the rotational symmetries and the gesture search method, respectively. The ratings are estimated based on reviews and surveys of the current vision-based technologies [78].

5.4 3D Rendering and Graphical Interface

3D rendering is the process of generating a graphical view based on three-dimensional models, which may contain various properties such as geometry and texture. The 3D rendering process depicts the 3D scene as a picture taken from a particular perspective angle, and the view can be changed based on the desired viewpoint in a continuous sequence. Various features such as lighting, shadows, atmosphere, refraction of light, or motion blur on moving objects can enhance the realistic perception of the rendering. With the development of modern computers, 3D rendering has become a major step in many applications, such as video games, simulators, movies, augmented reality, and virtual reality.

In order to convey a realistic 3D experience to users, two possible approaches, or a combination of them, might be considered. First, the generated graphical view might be understood through various noticeable cues such as perspective, shading, texture mapping, reflection, depth of field, transparency, translucency, and refraction. Second, the generated scene might be rendered using stereoscopic techniques to convey the illusion of depth, and the final result can be visualized on 3D displays. Nowadays, due to the popularity of 3D displays, both techniques are combined to enhance the quality of user experience. Since the interaction between users and digital devices happens at the interface level, the effect of the provided technical solutions might be visualized in a graphical interface. Basically, two scenarios are considered for graphical interface design in this thesis.


Property                                 Shape-based  Color-based  Depth-based  3D model-based  Method 1 (Rot. sym.)  Method 2 (Gesture search)
Efficiency of detection                       3            5            5             4                  5                      5
Accuracy of detection                         4            3            5             4                  4                      5
Tracking quality                              4            3            5             4                  4                      5
Gesture recognition                           4            3            4             4                  4                      5
Robustness to environmental conditions        3            1            2             3                  3                      5
3D motion                                     3            2            4             4                  4                      4
Large-scale gesture                           2            3            4             4                  3                      5
Cluttered background                          3            2            5             4                  4                      5
Occlusion                                     2            1            1             3                  1                      3
Scale-invariance                              2            3            4             3                  3                      5
Rotation-invariance                           2            3            4             3                  3                      5
Deformation-invariance                        2            3            3             3                  3                      5
Mobile platform                               3            3            0             2                  4                      5
Multi-gesture                                 4            3            5             4                  3                      5

Table 5.1: Properties of the different methods in gesture analysis, compared with the proposed solutions of this thesis. Method 1 and Method 2 represent the discussed methods based on rotational symmetries and the gesture search engine, respectively. The quality of each property is scaled from zero to five. 0: not applicable; 1: very weak; 2: weak; 3: average; 4: strong; 5: very strong.


First, manipulation of graphical objects using hand gestures; second, manipulation of the graphical scene in interactive 3D vision. In both cases, the graphics are rendered in an OpenGL environment.

Perspective projection, lighting, color, reflection, and other features are used to provide a realistic 3D experience. Moreover, in order to convey the illusion of depth, the rendered output is provided in color-code stereoscopic 3D. With this technology, users can experience the illusion of depth on any 2D screen using simple color-code glasses. In most of the designed environments for 3D gestural interaction, manipulation of graphical objects in an augmented environment is considered. Normally, the rendered objects are shown on the live camera view, while the user can pick, rotate, move, zoom in/out, or even reshape the objects in real time.

For interactive 3D vision, the main goal is to place the users in a virtual reality environment, enabling them to move and perceive the rendered scene in an interactive manner. Thus, the recommended setup for this scenario is a rather large, possibly wall-sized, screen.

5.5 Research Scenarios

Conceptual and technical contributions of this thesis have been tested and

used for implementation in different research scenarios. The major research

scenarios can be summarized in the following items.

5.5.1 Implementation of the 3D Gestural Interaction on Mobile Platform

Since one of the main target areas for the proposed technologies is future mobile devices, implementation on mobile platforms is an essential part of this work. The Android platform is selected for the mobile implementation of the gesture-based interaction. The core of the system for detecting, tracking, and analyzing the gestures is developed in native C/C++ in the OpenCV environment. The graphical part is mainly handled by OpenGL (Open Graphics Library [79]). In some earlier versions, Min3D (a 3D library for Android using Java and OpenGL ES) was used for rendering different graphical objects (see Fig. 5.7).


Figure 5.7: Graphical interface in the mobile application. Implementation of the proposed systems in photo browsing and 3D manipulation.


5.5.2 Implementation of the Interactive 3D Vision on a Wall-sized Display

The proposed interactive 3D vision is tested in three different setups. The first test is performed on a normal computer display with both 2D and stereoscopic 3D rendering. In the second test, the output is displayed on the wall using a video projector. The third test is performed on the 4K wall-sized display in the KTH VIC lab (visualization studio). In all three cases, the graphical scene is rendered in an OpenGL environment. In the stereoscopic case, passive 3D glasses are used for depth perception (see Fig. 5.8).

An important point to mention here is that interactive 3D vision on personal devices might be set up in both active and passive configurations. As discussed before, in the active configuration, where the vision sensor is mounted on the user's head, the resolution of the 3D motion analysis is significantly higher than in the passive configuration.


The accuracy level is highly dependent on the relative distance between the moving subject and the vision sensor. Thus, if we remove the body-mounted camera and simply use the device's camera for motion tracking, decreasing the distance between the user and the device might compensate for the accuracy to a proper level. This is usually the case when users interact with their devices at closer range, for instance when operating laptops, smartphones, and tablets. In these cases, due to the simplicity and comfort for the users, the passive configuration is more practical. Although it does not provide the same level of accuracy as active vision, it is generally acceptable for natural interaction (see Fig. 5.9).

In the conducted experiments on a MacBook Pro, the device's camera is used for tracking the head motion. In order to improve the quality of tracking, face detection is applied in the first step to immediately separate the moving part from the rest of the image. Afterwards, the discussed technology for tracking and estimating the 3D head motion is used to provide the required data for 3D interaction with the content.

In larger interaction spaces, such as visualization on wall-sized displays, active motion estimation is unavoidable for accurate and high-resolution interaction. The quality of motion tracking with a passive installation is quite weak when the distance between the sensor and the moving subject is large.

5.5.3 3D Rendering and Visualization of 2D Content

Unlike 3D graphics, where the geometry of the scene and objects is known, 3D visualization of 2D content such as images and videos is quite a challenging task. Single-view and multiple-view analysis for retrieving 3D information from 2D content are introduced in the contributions of papers V and VI [58, 59]. The main idea is to provide a supportive technology that enhances the user experience in interactive applications, enabling users to see the content in 3D while they operate the application or manipulate the content. The 3D visualization is based on stereoscopic techniques using passive glasses. Due to the simplicity of the tests and the applicability to any type of display, anaglyph glasses are used in all the conducted tests [58, 59], and the 3D channel coding is performed based on the selected glasses.


Figure 5.8: Active 3D vision tests in different setups.


5.6 Potential Applications

The contributions of this thesis in 3D motion analysis and visualization can be used in a wide range of multimedia applications on mobile devices and stationary systems. Virtual reality, augmented reality, medical imaging, motion-based interactive systems, 3D games, 3D displays, motion-based localization and positioning systems, visual search, and many other applications might take advantage of the proposed methods. Here, several implemented and potential applications based on the contributions of this thesis are briefly explained.


Figure 5.9: The passive configuration shows a similar effect to the active one in close-range interaction.

5.6.1 3D Photo Browsing

The interactive photo browser enables users to manipulate their photo collections in 3D space. Unlike 2D interaction, where only one user can operate the device (due to the limited interaction area), in 3D interaction two or more users can share both the interaction and visualization spaces for collaborative tasks. Users might sit together to share their photo collections and manipulate them in 3D space while keeping their own devices; they can use only one device and share the interaction and visualization spaces; or they might share a virtual space if they are present at different locations.

5.6.2 Virtual/Augmented Reality

In [39, 54], gestural interaction techniques are applied to render graphical objects in augmented environments. The analysis of the hand gesture motion behind the mobile phone's camera is used to manipulate graphical models.


The six-DOF motion control with a high level of accuracy enables users to experience realistic interaction with their mobile devices. The proposed gestural interaction for the manipulation of graphical models is also implemented on the Android platform. The efficiency and performance of the system are tested and validated on different devices, such as Samsung and HTC smartphones. The visual outputs are rendered in both 2D and 3D formats.

5.6.3 Interactive 3D Display

Human motion tracking might be used to interact with the display device. In the implemented interactive systems introduced in [57, 61], the user controls the content of the display using head or gesture motion. The retrieved motion parameters (rotations and translations along three axes) between consecutive frames captured by the device's camera apply the motion control to the application.

5.6.4 Medical Applications

The proposed technologies might be widely used in medical applications. 3D motion tracking and analysis of patients helps physicians diagnose and treat physical disorders in various types of diseases. Furthermore, 3D imaging and visualization of body organs, together with interactive 3D manipulation on display devices, help experts analyze and diagnose physical problems and select the required treatment.

5.6.5 3D Games

One of the most exciting areas that can benefit from efficient 3D motion analysis is 3D gaming. Bare-hand, marker-less gesture analysis using ordinary 2D cameras provides a great opportunity for experiencing realistic interaction with the graphical environment in 3D games. Head and gesture detection and tracking, using the techniques discussed in the previous chapters, provide an effective way of playing in 3D environments.


5.6.6 3D Modeling and Reconstruction

Many digital photo/video capturing devices feature, in addition to a vision sensor, other types of embedded sensors such as GPS and orientation sensors. Therefore, extra information such as position and orientation is tagged to the captured photos. This geo-tagging has been found useful in many applications, such as 3D digital photo albums, photo-tagged maps, and visual navigation. In many cases, however, the geo-tagged metadata are corrupted by noise or missing due to the unavailability of the GPS signal or magnetic sensors. In paper VII [60], we discuss how 3D motion analysis can help to form a signal model and significantly correct this noisy data.

5.6.7 Wearable AR Displays

The contributions of this thesis fit the area of mobile augmented reality perfectly. AR glasses, such as Google Glass, that integrate information through augmented environments require intuitive interaction technology. Since in wearable AR glasses the touchscreen is removed or reduced to a smaller scale, convenient 3D gestural interaction can definitely enhance the interaction experience.

5.7 Usability Analysis in Object Manipulation: Touchscreen Interaction vs. 3D Gestural Interaction

In order to evaluate the user experience in 3D gestural interaction, a comparative user study was conducted. In this study, the manipulation of graphical objects in 3D space using bare-hand gestures is considered. Learnability, user experience, and interaction quality are evaluated and compared with the same task in 2D touchscreen interaction. Four students from the course Evaluation Methods in HCI (DH2408) assisted this study by selecting this case as their course project. In order to provide a comparative scenario for evaluating the 3D gestural interaction, two sets of designed interfaces and tasks for 2D touchscreen and 3D gesture-based interaction, the required usability tests, and questionnaires for the user interviews were provided to the students.


In this task, they were asked to invite users, test the learnability and usability of both systems, and collect and report the required information based on the given instructions. The whole process is explained in detail below.

Touchscreen interaction: Two smartphones are used for this case, positioned side by side on a table. Smartphone 1 plays a pre-recorded video of the rendered graphical model. On smartphone 2, the same graphical model is rendered, and the user can manipulate it through the touchscreen, controlling the position, zoom, and viewpoint in the x, y, and z coordinates. During the task, the user should follow and mimic the exact motion of the graphical model on smartphone 1 through real-time manipulation of the model on smartphone 2 using touchscreen interaction. A webcam mounted above both smartphones records the touchscreen interaction for further study.

3D gestural interaction: In this case, the user can control and manipulate the same graphical model in 3D space using bare-hand interaction. A Kinect depth sensor is used to detect and measure the user's hand motion in 3D space. As in the previous case, the same pre-recorded motion tasks are displayed on the computer screen. The user should follow and mimic the motion of the graphical model in free space through real-time 3D interaction. A camera records the whole task for further study.

Task: Both the 2D and 3D interaction tasks are divided into different parts. In each part, the graphical model moves with a specific motion sequence to reach a certain position/orientation; afterwards, the user should follow the same motion to reach a similar position/orientation. These pre-recorded tasks are divided into 10 parts. The first two videos are used for the learnability step, where new users learn how to work with both the 2D and 3D tasks. In the main part, 8 videos (2 easy, 4 normal, and 2 hard) are considered (see Fig. 5.10).


Figure 5.10: User test in 2D touchscreen and 3D gestural interaction.

5.7.1 User Test

For this study, ten users were selected: one pilot user, seven in the primary target group (experienced in using touchscreens) and two in the secondary target group (very little or no experience in using touchscreens). According to Nielsen [80, 81], five users find 85% of the problems; therefore, ten users (mostly between the ages of 20 and 30) were enough to provide proper results.

As mentioned before, the goals are to test the learnability of the 3D gestural interaction system and to compare the user experience of 3D gestural interaction with the touchscreen interface. For the comparative analysis, efficiency, effectiveness and user satisfaction are considered the main criteria. In order to increase the reliability of the tests, subjective data based on the experience of the participants during and after the test sessions were gathered, through scale-based forms and predefined interview questions.

On the other hand, user performance is observed ("seeing as doing") during the comparative tests and the manual data are collected. Eventually, these two steps provide access to both quantitative and qualitative data for the final evaluation.


Figure 5.11: Average score of the 2D vs. 3D user performance.

5.7.2 Usability Results

Since the tests are based on following the movements in a video, the following method, motivated by the MUSiC user performance method [82], is used for the quantitative measurement of user performance. All observers in the student group watched the recordings of each and every user test and scored them from 1 to 7 according to how well the user performed based on the instructions in the video (1 = no coherence at all, 7 = no difference between video and performance). After all four students had scored the user tests, the average scores were calculated. Fig. 5.11 demonstrates the measured comparative performance between the 2D and 3D scenarios.
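As a small illustration of this aggregation step (the numbers below are invented, not the study's data), the per-task averages behind such a chart can be computed as:

import numpy as np

# Hypothetical ratings: rows = the four observers, columns = tasks 3-10,
# each entry a 1-7 coherence score for one user's performance on one task.
scores = np.array([[4, 5, 3, 6, 4, 5, 3, 4],
                   [5, 5, 4, 6, 4, 4, 3, 5],
                   [4, 6, 3, 5, 5, 5, 4, 4],
                   [5, 5, 4, 6, 4, 5, 3, 4]])

# Averaging across observers gives one score per task, as plotted in Fig. 5.11.
print(scores.mean(axis=0))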

Since tasks 1 and 2 are used for the learnability step, they are not included in the chart. Analysis of the collected data reveals some important points. Firstly, it is clear that the touchscreen interface works fine as long as the task is limited to spinning/turning around the x and y axes without any translation or zooming. As soon as movements in 3D or zooming come into play, the 3D gestural interface clearly shows its strength; the scores on tasks 3, 4, 6, 8 and 9 support this observation (task 3 was primarily limited to turning the object). This is also confirmed by the users' comments


in the interview: five users mentioned that turning objects was easier than turning and moving on the touchscreen, whereas combinations of rotations, translations and zooming were simpler with the 3D gestural interaction. It may be too early to tell without further studies with larger user groups, but according to the collected data in the chart, the 3D gestural interaction appears to be the preferred system overall, since all users (except in one case) scored higher on it than on the touchscreen interface.

During the test sessions, two users had little to no experience in using touchscreens (two male teachers in their sixties). These users formed the secondary group in the usability evaluations. Although neither of them owned a smartphone or regularly used touchscreens, one of them had a little experience with a Nintendo Wii, which might be why he scored substantially higher in the 3D interaction part, whereas the second user scored only a few points higher than on the touchscreen interface. It would seem that 3D gestural interaction is the preferred system for people who learn both systems at the same time: both of them scored higher on the 3D gestural interaction, and during the interviews both said that they strongly prefer the 3D interface over the touchscreen. However, they both mentioned that large hands and fingers might cause problems on touchscreens.

Findings from the interviews reveal that users truly believe 3D gestural interaction will become a standard interaction tool for future applications, although opinions differ on how widely it will be used in future interactive scenarios. During the interviews, the users also had to respond to a few statements on a scale of 1-5 (1 = I do not agree at all, and 5 = I fully agree); the results are reflected in Fig. 5.12. In the learnability step, the main idea was to let users watch the videos and intuitively start following the recorded motions instead of giving them specific instructions. Based on the responses gathered from these questions, it is clear that users find the 3D system easy to learn and think that most people would learn to use a similar interface quite easily. They mainly believe that with a bit more time in front of the 3D interface they would master it.


Figure 5.12: Qualitative results of 3D gestural interaction.

3D gestural interaction has a quicker learning curve than the touchscreen interface and is clearly better for performing more complex movements. However, interfaces using 3D gestural interaction must be specifically designed for gesture-based input, in the same way that applications for touchscreens are developed differently from those for a desktop computer where a mouse and keyboard are available. This is already done in games designed for the Kinect and similar products.


Chapter 6

Concluding Remarks and Future Direction

6.1 Contributions

Today's multimedia technology is highly inspired by two strong trends: technologies towards intuitive interaction and technologies towards augmented visualization. The former trend provides natural interaction technology for effective communication between users and smart devices. The latter trend, which can be considered the direction towards the fifth screen, or augmented reality visualization, combines the interactive experience on the personalized screen with augmented information from the Internet. The technical contributions of this thesis can support the development of both trends. In fact, 3D interaction through intuitive gestures is an unavoidable part of future AR applications; defining new frameworks for effective interaction in augmented environments will therefore improve the quality of user experience in future mobile applications. Basically, the contributions of this thesis can be divided into three main categories: conceptual models for future human mobile device interaction; technical contributions towards 3D interaction design and interactive visualization; and implementation of the proposed concepts and methods for different application scenarios.


6.1.1 Conceptual Models for Future Human Mobile Device Interaction

This thesis proposes new concepts and frameworks for future human mobile device interaction. The main features of the proposed ideas can be summarized as follows.

• Current and future trends in multimedia technology, especially mobile multimedia, together with future demands, challenges, limitations and directions, are discussed in detail.
• The evolution of interaction and visualization facilities on mobile devices and future trends are investigated.
• The concept of extending the interaction and visualization spaces to 3D on mobile devices is introduced and its advantages are discussed.
• The concept of 3D gestural interaction on mobile devices is introduced and its significant impacts are discussed.
• The concept of collaborative tasks on mobile devices using bare-hand interaction in 3D, sharing the interaction and visualization spaces, is discussed.
• Potential application scenarios based on 3D gestural interaction are introduced.
• The concepts of user-manipulated content and interactive 3D vision are introduced and discussed.

6.1.2 Technical Contributions for 3D Gestural Interaction and 3D Interactive Visualization

Technical contributions are the main focus of this thesis. New methods and frameworks for 3D gesture analysis and interactive visualization have been introduced. Specifically, the technical contributions address two major problems: first, interaction with mobile devices based on motion analysis in 3D space [38, 39, 54, 57], and second, 3D visualization on ordinary 2D displays [58, 59, 60, 61]. The introduced interactive systems are based on detection, tracking, and analysis of the users' 3D motion from the visual input. This visual input might be received from the mobile device's camera,


a body-mounted camera, a webcam, or, in general, any type of vision sensor. The technical contributions can be listed as follows.

• The concept of 3D gesture recognition and tracking, with new methods and algorithms based on low-level operators, is discussed.
• Novel methods and algorithms for gesture recognition and tracking based on a large-scale search framework are introduced and examined.
• The proposed methods for 3D gesture analysis are compared and evaluated.
• Technical solutions for the motion-based interactive 3D display are introduced and compared in different configurations.
• Different configurations for interaction between users and multimedia content in various scenarios and platforms are discussed.
• New methods for 3D visualization of monocular images, photo collections, and videos are investigated and discussed.

6.1.3 Implementations

The implemented scenarios based on the conceptual and technical contributions can be summarized in the following items.

• New methods for gesture analysis based on low-level patterns are implemented.
• A new framework for 3D gesture analysis based on large-scale retrieval and search methods is implemented.
• 3D gesture detection and tracking are implemented on different platforms (Windows, Mac OS X, Android).
• Interactive 3D vision is implemented and tested in different scenarios, from personal smart devices to wall-sized displays.
• 3D visualization of monocular images, photo collections and videos is implemented and tested.


6.2 Concluding Remarks and Future Direction

Although today's media industry is highly inspired by 3D technology, realistic interaction and visualization are still at an early stage of development. Realistic visualization has attracted a lot of attention during the recent decade. The introduction of 3D display technology in TVs, projectors and even mobile devices is an indication of the fast-growing 3D market. Strong efforts towards moving from stereoscopic 3D to glasses-free 3D displays are another indication of the general trend towards intuitive and realistic visualization. However, the current technology of 3D displays is quite different from real human observation of the 3D world, and significant improvements are required to fulfill the objective of realistic visualization. The contributions of this thesis in 3D visualization, especially the introduced concept and technology for interactive 3D display, support realistic and intuitive visualization. In fact, the main idea behind interactive visualization is to enable users to observe the content and control the angle and viewpoint in a manner similar to real-world observation.

The introduction of 3D interaction facilities such as the Microsoft Kinect has significantly changed the way people interact with digital content, especially in the entertainment area. Because real 3D interaction requires extremely high accuracy in 3D motion estimation and tracking of the body joints, there are still many unsolved issues and challenges. However, strong indications suggest that future human mobile interaction will be highly affected by intuitive 3D interaction. The contributions of this thesis aim to tackle the fundamental issues and to propose novel ideas towards solving them.

In this thesis, 3D gestural interaction is deeply investigated as an effective tool for future human mobile device interaction. Computer vision, pattern recognition, and machine learning methods are widely used in this area. The observations and experimental results of this thesis indicate that although these methods can be extremely useful for solving different challenges in 3D gesture recognition, 3D motion analysis and so on, they are not adequate for the generalized problem formulation.


Figure 6.1: Different approaches for solving the technical challenges in media technology. The current trend shows the gradual move from low-level features and high-level algorithms towards meta-data retrieval from large databases.

Therefore, new methods for 3D gesture analysis through a large-scale retrieval system have been introduced. Given the possibility of storing and processing extremely large databases and the corresponding metadata, future methodologies for solving the discussed problems will mostly center on metadata retrieval and search methods instead of processing the low-level data. Thus, the preparation of rich and comprehensive databases can formulate the classical problems in a totally new way. For instance, the challenges of gesture recognition and tracking can be shifted from the signal processing level to large-scale search and matching frameworks. Although image-based retrieval and template matching are well-known concepts in media technology, a large-scale search framework for gesture analysis is a rather new concept and needs further development. This thesis has introduced and investigated this framework for high-accuracy 3D motion retrieval and gesture tracking. Experimental results indicate that the search framework is extremely powerful, especially when recognition, tracking and 3D motion retrieval are all required together, at large scale and in


real time. On the other hand, if we target specific patterns and models for recognition and tracking, computer vision methods can handle the complexity of the problems.
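As a minimal sketch of such a retrieval-based formulation (the descriptor, the random stand-in database and the pose annotations below are illustrative assumptions, not the thesis's actual engine), a query gesture image is reduced to a feature vector and matched against an annotated database, and the pose tagged to the best match is returned directly:

import numpy as np

rng = np.random.default_rng(0)

def descriptor(image):
    # Illustrative global descriptor: a normalized edge-orientation histogram.
    gy, gx = np.gradient(image.astype(float))
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=36,
                           range=(-np.pi, np.pi), weights=np.hypot(gx, gy))
    return hist / (np.linalg.norm(hist) + 1e-9)

# Stand-in database: one unit-length descriptor per gesture entry, each
# annotated with its 3D pose parameters (e.g. position + orientation).
db_desc = rng.random((10000, 36))
db_desc /= np.linalg.norm(db_desc, axis=1, keepdims=True)
db_pose = rng.random((10000, 6))

def retrieve_pose(query_image):
    q = descriptor(query_image)
    best = int(np.argmax(db_desc @ q))   # cosine similarity on unit vectors
    return db_pose[best]                 # annotated pose of the best match

Recognition, tracking and 3D motion retrieval then all reduce to the same lookup: the answer is read off the annotations of the retrieved entry rather than computed from the pixels.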

6.2.1 Technical Challenges

During this research, various methodologies, algorithms and approaches to solving the current and future challenges in media technology have been considered. Some of the important technical challenges and findings tackled during this research work are highlighted in the following items.

6.2.1.1 Active vs. Passive Motion Capture

A common discussion in human motion analysis is where the motion capture sensor should be mounted. In order to enhance the tracking accuracy and the convenience of the users, various configurations have been introduced in different application scenarios. As discussed before, for more intuitive and natural interaction design, marker-less bare-hand solutions are preferred to wearable sensors such as motion capture gloves or body-mounted devices.

Although the current motion analysis setups can be divided into passive and active systems, on mobile devices or augmented reality glasses motion analysis might be performed in both passive and active configurations. In fact, the mobile sensor can be used in static or dynamic modes. This possibility provides a great chance to exploit the technical advantages of both configurations. For instance, the camera of AR glasses offers the advantages of active motion analysis for a moving head, while it can be used as a passive sensor for hand gesture tracking. This thesis has demonstrated practical scenarios where each configuration shows its advantages. For instance, the proposed interactive 3D visualization employs active vision for manipulation of the content from larger distances (wall-sized displays and projections), while the same system is introduced in a passive configuration for close-range interaction with mobile devices or laptops. As discussed before, in order to design a realistic experience for hand gesture interaction, the passive configuration is preferred: intuitive interaction should happen with bare hands in free space. However, the proposed technical solutions provide flexible designs for different application scenarios.

6.2.1.2 Gesture Detection and Tracking without Intelligence

The majority of the available computer vision and pattern recognition methods employ complex algorithms for gesture detection, recognition, and tracking. These types of solutions usually involve heavy computation and large training sets. Obviously, for mobile systems with hardware and power limitations, the majority of the common solutions are not applicable. The idea of employing low-level operators for detecting and tracking hand gestures is to ensure efficient detection without intelligence. Although implementing an effective gesture analysis system without high-level detection algorithms is quite challenging, for efficiency reasons this important goal should be achieved. Employing rotational symmetries for detecting and tracking bare-hand gestures is based on this idea.
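As a hedged illustration of this idea, the sketch below computes the magnitude response of an n-th order rotational symmetry filter on a grayscale image. It is a minimal approximation in the spirit of such operators, not the exact filters of the included papers; the window radius and the choice of double-angle encoding are assumptions.

import numpy as np
from scipy.signal import convolve2d

def rotational_symmetry_response(image, order=1, radius=8):
    # Encode gradient orientation in double-angle form, z = (gx + i*gy)^2,
    # so that opposite gradient directions reinforce instead of cancelling.
    gy, gx = np.gradient(image.astype(float))
    z = (gx + 1j * gy) ** 2

    # n-th order rotational symmetry template over a local disc.
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    phi = np.arctan2(y, x)
    disc = (x**2 + y**2) <= radius**2
    basis = np.exp(1j * order * phi) * disc

    # High magnitudes mark symmetry centres, e.g. fingertip-like curvature
    # patterns for low orders.
    return np.abs(convolve2d(z, np.conj(basis), mode='same'))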

6.2.1.3 Adaptability of the Contributions to Future Hardware Evolution

Obviously, with the current rate of technology development, new types of sensors will be introduced and embedded in smart devices. Although the proposed solutions of this thesis are mainly designed and tested with the current technology, they can fit future environments well. The development of new sensors and extra hardware-related features can further support the contributions and enhance the quality of the achieved results. For instance, the release of the Kinect sensor brought more flexibility to the proposed concepts, designs and technologies thanks to its capability to provide additional depth information. Clearly, the integration of RGB images and depth information can substantially improve the quality of detection, tracking, noise removal, and so on.

Another example is the development of wearable AR glasses. Although presenting information through AR glasses is not a new concept in media technology, the technical developments of recent years have made this concept feasible to implement. The combination of a lightweight wearable display and different types of sensors is the ideal scenario for gesture-based interaction technology, and the technical contributions of this thesis fit this area well.

6.2.1.4 Contributions of Other Research Areas to Computer Vision

The rapid development of other research areas can contribute strongly to the computer vision field; solving gesture analysis problems through search methods is based on this idea. Since search algorithms are extensively used for text and document retrieval, modeling the gesture recognition and tracking problems with common search methods such as indexing can effectively improve the research results. These findings reveal that breakthrough technologies from other research areas can be successfully adapted to similar concepts with totally different application scenarios. Basically, retrieving the best gesture entry from a huge database of images is similar in concept to finding the document most relevant to a searched text phrase. Thus, the integration of classical computer vision and pattern recognition methods with enabling technologies from other research fields can provide extremely powerful tools for solving the technical challenges.

6.2.2 Further Development

There are quite a large number of application scenarios and configurations that might benefit from the proposed technologies of this thesis. Because this research was conducted during a limited period of time, it was not possible to investigate all aspects of the proposed methods in depth, such as user studies and design features. The evaluations and experiments are mainly performed on the technical aspects of the contributions. However, for the systems implemented based on the proposed methods, user experience and design aspects are considered and studied in most cases. Some interesting directions for further research and development are mentioned here.

6.2.2.1 Concept of Collaborative 3D Interaction

The development of 3D interaction in collaborative scenarios is an interesting line for further research. From a technical perspective, collaborative 3D interaction can be implemented based on the proposed solutions of this thesis. As discussed in the previous chapters, sharing the interaction/visualization spaces among several users opens up numerous application scenarios. Exchanging digital information such as documents, photos, and audio and video tracks between different users is an example of collaborative sharing based on 3D gestural interaction. In fact, users might grab, move and pass multimedia content in a shared 3D space using physical hand gestures.

6.2.2.2 Concept of Interaction in the Space using Body Gestures

Interaction between humans and future smart devices can be extended to the whole space. Specifically, with the introduction of AR glasses, hand-held devices will be removed and the whole space in front of the user can be dedicated to interaction. The contributions of this thesis have focused on hand gesture technology for interaction in front of the smart device and on 3D head motion estimation for interactive displays. Since the interaction space can be extended further, whole-body motion for action recognition, and other body parts such as the feet, might be employed to design interactive application scenarios.

6.2.2.3 Extension of the Gesture Search Framework to Extremely Large Scale

The proposed search framework for gesture recognition and tracking has been implemented and tested with different databases, the largest containing 10,000 gesture entries. Although this number covers quite a large set of gesture poses for handling the gesture analysis problem, according to our estimations the database should be extended for extremely high-resolution tracking. One important line of further development is therefore to generalize the retrieval system to an extremely large database; the real-time matching process for such an extended database will certainly be a challenging problem.
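As a sketch of one way such scaling might be approached (an illustration on our part, not a method from the thesis; the descriptor length, database size and library choice are assumptions), a precomputed nearest-neighbour index can replace exhaustive linear matching, here using scikit-learn's ball tree over fixed-length gesture descriptors:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical extended database: one fixed-length descriptor per entry.
db = rng.random((100_000, 64)).astype(np.float32)

# Build the index once, offline; each query then costs far less than a
# full scan of the database.
index = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(db)

def best_match(query_descriptor):
    dist, idx = index.kneighbors(query_descriptor.reshape(1, -1))
    return int(idx[0, 0]), float(dist[0, 0])   # entry id and its distance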

6.2.3 Future of Mobile Interaction and Visualization

It is quite difficult to predict the evolution of smart devices over the next ten years. From a hardware point of view, the current trend suggests that displays may use 3D technology and come in transparent, flexible, or wearable formats. Future devices will certainly be equipped with numerous sensors, larger storage, and faster processors. High-speed mobile network connections might totally change the design of future smartphones: storage and processing can be shifted to the infrastructure, and smart devices might act as a set of sensors and a screen for visualization.

A huge network of connected devices will provide the chance to share digital content in a virtual space. Collaborative interaction might become an important part of future mobile multimedia.

The concept of personalized environments and screens can totally change the future of visualization. In fact, with the introduction of AR glasses, any space can be dedicated to a personalized environment, and the whole space and the augmented information can be designed based on the user's demands.

The future of interaction technology on mobile devices might be highly affected by multimodal inputs. 3D technology for intuitive interaction will certainly be an essential part of that: bare-hand interaction in 3D space can perform various tasks, while other input modalities such as voice, motion or orientation will be complementary.

An important point to mention here is that natural interaction requires the sense of touch; this is probably the essential feature that free-space interaction lacks. Ultrasound 3D rendering is an enabling technology that might be useful for implementing the sense of touch in free space, offering the rendering of virtual objects in mid-air. However, given the early stage of this research [83], the availability of such technology for mobile devices in the near future is quite questionable at this moment.


Chapter 7

Summary of the Selected Articles

This thesis reflects the results of the research conducted by Shahrouz Yousefi during his PhD studies. The first part of this thesis (the introduction) is based on the results of more than 15 papers published in international conferences and journals; the publication list is included at the end of this chapter. In the second part of this thesis, 7 papers are included. In all selected papers, Shahrouz Yousefi (the author of this thesis) is the first/corresponding author, and the major contributions of these seven papers, including concepts, theories, experiments, implementations and writing, are his. Prof. Haibo Li supervised Shahrouz Yousefi as the main supervisor during the PhD studies. The third author assisted Shahrouz Yousefi in some experiments or participated in the discussions.

Chapter 8 introduces gesture detection, tracking and 3D motion analysis based on first-order and second-order rotational symmetry patterns. Rotational symmetries have been used for gesture localization and fingertip detection. Feature detection, feature tracking, and 3D motion retrieval have been performed, and the computed motion parameters have been used to control and manipulate virtual objects on the screen. Various application scenarios that might benefit from the proposed technology have been introduced. The content of this chapter has been published as a journal article in Pattern Recognition Letters (PRL).

The content of chapter 9 is reprinted from the paper published at the ACM International Conference on Multimedia (ACMMM12). This work was presented at the conference in both oral and poster sessions and was selected as one of the top eight papers for the Doctoral Symposium track; its content was evaluated by an opponent and committee members at the conference. This paper reflects a substantial part of this PhD thesis in brief: it introduces the concept of 3D gestural interaction, potential applications, the enabling media technologies that support this concept, and the implemented photo browsing system.

Chapter 10 introduces the concept of gesture analysis based on a large-scale gesture retrieval and search engine. The introduced technology is based on a database of annotated gesture images with the corresponding 3D pose information, and a search engine for similarity analysis between the query gesture and the database entries. The output provides the best match from the database, and the annotated motion parameters are used in real-time interaction.

This paper has been accepted for publication at the 9th International Conference on Computer Vision Theory and Applications (VISAPP 2014). The work successfully passed the novelty analysis step through KTH Innovation. Due to patent application restrictions, the full version of this work with technical details has not been submitted for publication in conferences or journals; the extended version is filed as a U.S. patent application.

Chapter 11 introduces interactive 3D visualization, the proposed technology for real-time interaction between users and the content of the display. This technology enables users to control and manipulate the content of the screen based on their position/orientation in 3D space. A head-mounted vision sensor is employed to measure and report the 3D motion parameters, and the real-time motion parameters are sent to the rendering block for visualization on the screen. This paper has been submitted to the International Conference on Image Processing (ICIP 2014).
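As a hedged sketch of the rendering side of such a pipeline (the pose convention, function and commented-out sensor interface are illustrative assumptions, not the paper's implementation), the reported head pose can be converted into a view matrix every frame:

import numpy as np

def view_matrix(yaw, pitch, position):
    # Head orientation: yaw about the y axis, then pitch about the x axis.
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    r = (ry @ rx).T                      # view rotation = inverse orientation
    m = np.eye(4)
    m[:3, :3] = r
    m[:3, 3] = -r @ np.asarray(position, dtype=float)
    return m

# Per frame (hypothetical interface): read the tracker's pose and hand the
# matrix to the renderer, e.g.
#   pose = tracker.read()
#   renderer.set_view(view_matrix(pose.yaw, pose.pitch, pose.position))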

The content of chapter 12 is reprinted from the paper published at the International Conference on Signal Processing and Multimedia Applications (SIGMAP 2011). This paper introduces a technology for 3D visualization of monocular images based on patch-level depth retrieval; stereoscopic techniques are used for 3D visualization on a normal 2D display.

The content of chapter 13 is a reprint of the paper published at the IEEE International Conference on Wireless Communications and Signal Processing (WCSP 2011). This paper discusses a technology for converting 2D monocular photo and video collections to 3D and visualizing them on 2D displays using stereoscopic technology.

Chapter 14 introduces a vision-based technique for robust correction of 3D geo-metadata in photo collections. The proposed technology efficiently improves the accuracy of the position/orientation information in photo collections and consequently enhances the 3D visualization, navigation, and exploration of large data sets. The content of chapter 14 is reprinted from the paper published at the IEEE International Conference on Wireless Communications and Signal Processing (WCSP 2011).

7.1 List of Publications

The content of this thesis is based on the contributions of the following articles, though not all of them are included:


Journal articles:

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Experiencing Real 3D Gestural Interaction with Mobile Devices, published in Pattern Recognition Letters (PRLetters), 2013.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Gesture Tracking for Real 3D Interaction Behind Mobile Devices, published in the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2013.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Direct Head Pose Estimation Using Kinect-type Sensors, published in Electronics Letters, 2014.

Licentiate thesis:

• Shahrouz Yousefi, Enabling Media Technologies for Mobile Photo Browsing, Licentiate Thesis, Digital Media Lab (DML), Department of Applied Physics and Electronics, Umea University, SE-901 87, Umea, Sweden, ISSN: 1652-6295:16, ISBN: 978-91-7459-426-3, 2012.

Conference papers:

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Bare-hand Gesture Recognition and Tracking through the Large-scale Image Retrieval, accepted for publication at the 9th International Conference on Computer Vision Theory and Applications (VISAPP), January 2014.

• Shahrouz Yousefi, 3D Photo Browsing for Future Mobile Devices, in Proceedings of the 20th ACM International Conference on Multimedia (ACMMM12), October 29-November 2, Nara, Japan, 2012.


• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Real 3D Interaction Behind Mobile Phones for Augmented Environments, in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Barcelona, Spain, July 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Gestural Interaction for Stereoscopic Visualization on Mobile Devices, in Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns (CAIP), Seville, Spain, CAIP (2), Vol. 6855, Springer (2011), pp. 555-562, 29-31 August 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Visualization of Single Images using Patch Level Depth, in Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP), Seville, Spain, 18-21 July 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Stereoscopic Visualization of Monocular Images in Photo Collections, in Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, pp. 1-5, 9-11 Nov. 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Robust Correction of 3D Geo-Metadata in Photo Collections by Forming a Photo Grid, in Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, pp. 1-5, 9-11 Nov. 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Tracking Fingers in 3D Space for Mobile Interaction, in Proceedings of the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 2010.


Under-review articles:

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Hand Gesture Recognition and Tracking through the Large-scale Gesture Search Engine, submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Interactive 3D Visualization on a 4K Wall-Sized Display, submitted to the IEEE International Conference on Image Processing (ICIP 2014), Paris, France, 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li, Direct Hand Pose Estimation for Immersive Gestural Interaction, submitted to Pattern Recognition Letters (PRLetters), 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Ahmad Ostovar, Li Liu, Haibo Li, A Direct Method for 3D Hand Pose Recovery, submitted to the 22nd International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden, 2014.

Other related publications:

• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li, Head Operated Electric Wheelchair, accepted for publication at the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2014), San Diego, USA, 2014.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Samuel Sonning, Sabina Sonning, 3D Head Pose Estimation Using the Kinect, in Proceedings of the 2011 IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, Nov. 2011.

• Farid Abedan Kondori, Shahrouz Yousefi, Smart Baggage In Aviation, in Proceedings of the 2011 IEEE International Conference on Internet of Things (iThings-11), Dalian, China, 2011.

• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Visualization of Monocular Images in Photo Collections, in Proceedings of the Swedish Symposium on Image Analysis (SSBA), Linkoping, Sweden, 2011.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Gesture Tracking for 3D Interaction in Augmented Environments, in Proceedings of the Swedish Symposium on Image Analysis (SSBA), Linkoping, Sweden, 2011.

• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, in Proceedings of the Swedish Symposium on Image Analysis (SSBA), Gothenburg, Sweden, 2013.

Patents:

• Shahrouz Yousefi, Haibo Li, Farid Abedan Kondori, Real-time 3D Gesture Recognition and Tracking System for Mobile Devices, U.S. Patent Application, filed January 2014. Patent pending.
