Real-time Hand Gesture Recognition Using Range Cameras

Embed Size (px)

Citation preview

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    1/7

    Real time Hand Gesture Recognition using a Range CameraZhi Li, Ray Jarvis

    Monash University

    Wellington Road Clayton, Victoria AUSTRALIA

    [email protected], [email protected]

    Abstract

    This paper proposes a real time hand gesturerecognition system. The approach uses a range

    camera to capture the depth data. After some pre-processing procedures, the depth data is used tosegment the hand and then locate the hand in 3Dspace. The hand shape is classified into knowncategories using a chamfer matching method tomeasure the similarities between the candidatehand image and the hand templates in the database.The 3D hand trajectory is recognized by a FiniteState Machine (FSM) method. Each gestureconsists of several states. The 3D hand positiondetermines the state transition of each gesturerecognizer. Experiments show that the systemperforms reliably for recognizing both static handshapes and spatial-temporal trajectories in realtime.

    1 Introduction

    Fast and robust analysis of hand gestures has receivedincreasingly more attention in the last two decades. Gesturerecognition from a single view is important in the Human-Robot Interaction (HRI) scenario. Both static hand shapeand dynamic hand trajectories are expressive in our dailylife. These kinds of hand gestures are naturally preferred inHRI applications.

    Various methods have been proposed to segmentand track human hands. Many systems [Imagawa et al.,1998] [Hong et. al., 2000] have achieved promisingperformance; however, these methods only operate understrong restrictions of the environment, because they rely onsimple clothing, static background, and constant lightingconditions.

    We aim to achieve a real time hand gesturerecognition system in a natural environment. It means we donot need homogenous clothing and should be able toperform reliably with cluttered and even non-staticbackgrounds.

    We investigated the usefulness of a 3D rangecamera for our hand gesture recognition system. Thishardware exhibits significant advantages over traditionalcameras in the aspect of unambiguously capturing the depthdata at a high frame rate, which makes the segmentation andtracking the hand in 3D space easy. Hand shapes arerecognized by the Chamfer Matching method [Borgefors,

    1988], and 3D trajectories are recognized using a FiniteState Machine (FSM) method. Gestures are recognized in

    real time. This gesture recognition method has wideapplications including human robot interaction, intelligentrooms, virtual reality and game control.

    The remainder of this paper is organized as

    follows: after a brief review of the related work in section 2,we investigate the property of the range camera in section 3.Hand segmentation and its 3D position are obtained insection 4, which is the input data for the hand shape andtrajectory recognition algorithms in section 5. In section 6we present the experiments and the result. Finally,discussion, conclusion and future work are in section 7.

    2 Related work

    Hand detection and segmentation is an essential componentfor gesture recognition. Many approaches [Mo et. al., 2005][Kjeldsen and Kender, 1996] use color information, sincethe skin color is a salient feature different from thebackground in most cases. However, these methods are notreliable under unstable illumination conditions, where,obviously, it is more challenging to extract complete handshapes. Some methods use special colored gloves or amagnetic sensing device (data gloves) [Sturman and Zeltzer,1994] to simplify the task, but they hinder the naturalness ofdaily use. The intentions of the user should be recognizedeffortlessly and non-invasively.

    Most of the hand trajectory recognition systemsdeal with 2D image data. [Starner and Pentland, 1995]tracked the hand by colored glove and natural skin tone forAmerican Sign Language recognition. Some methods usestereo vision to achieve the hand tracking in 3D space.[Nickel et al., 2004] [Stieflhagen et al., 2004] used stereocameras to locate the hands position in 3D in order to find

    the pointing direction. [Abe et al., 2000] proposed a 3Ddrawing system using a top-view camera and a side-viewcamera. Stereo vision is a popular choice for depth sensing.However, it is highly dependent on the textures of the objectto find the correspondence between images, and becomeserroneous for texture-insufficient surfaces.

    A time-of-flight range camera has become popularin the recent years. Although the technology is still in itsearly days, resulting in low resolution, noisy data etc, it hasalready been applied in a number of applications such asgame controlling [Wang et al., 2008], upper body gesturerecognition [Holte et al., 2008] [Grest et al., 2007], robotnavigation [Prusak et al., 2007], and mobile human-robotteaming [Loper et al., 2009].

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    2/7

    3 Range Camera

    A 3-D range camera is employed as our apparatus. Itdelivers the depth data of objects in its view at every pixel ata high frame rate. The distance of 3D points is determined

    by a Time-of-Flight (TOF) approach using modulatedinfrared light. The phase shift between the reference andreflected signal is determined by a sampling and correlatingmethod for each pixel, and then the distance is calculated bythe phase shift. See [Linder and Kolb, 2007] for moredetails about the operational principle of the range cameratechnology. Figure 1 shows the range camera we adopted,which is developed by PMD Technologies GmbH.

    Figure 1 PMD Range Camera

    As illustrated in [Kahlmann and Ingensand, 2005],with d as the measured distance, the coordinate of a 3Dpoint (see figure 2) can be calculated by equation (1).

    Figure 2 Geometrical illustration for coordinate evaluation[Kahlmann and Ingensand, 2005]

    0

    (1)P(X,Y,Z) is the coordinate of the 3D point, s(x,y,0) iscalculated from image coordinate to Euclidean coordinate, dis the measured distance, andfis the focal length.

    Assuming the camera is a simple pinhole model, inwhich the optical center is at the image center. We can alsouse equation (2) to retrieve the 3D coordinates, if we knowthe Field-of-View (FOV) of the camera, and ignore thedifference between the distance d and Z coordinate in thedepth direction, which is reasonably correct since thedistance in our application is always over 1m and the FOVof the range camera is 28 degrees.

    ( )tan / 2 (2 ) /

    tan( / 2) ( 2 )/

    FOV Z x W W X

    Y FOV Z H y H

    Z d

    =

    p(2)

    In equation (2), we assume the Z coordinate equal themeasured distance d, W is the width of the image, H is theheight of the image. W,H,x andy are image coordinates inpixel units. In this way, we avoid the burden of calibratingthe range camera. Actually, it is troublesome to calibrate therange camera, because of low-resolution and an unevenlyilluminated image with a large amount of noise.

    Compared to stereo vision systems, which are thetraditional visual methods to capture the depth data, thisdevice has the following advantages:1. Illumination-invariant and color-invariant. Because it

    actively emits modulated infrared light, it is not

    influenced much by the ambient illumination condition

    and color of the objects in the environment.

    2. Texture-independent. Because stereo vision methodsare highly dependent on the texture on the objects to

    find the correspondence between multiple images, the

    range camera significantly outperforms the stereo

    methods in texture-insufficient regions.

    3. Depth resolution is approximately 5~15mm in thedistance range of 0.5 to 3 meters, variable withexposure time and reflectiveness of the surface. A

    comparison of the PMD range camera and stereo-vision

    for the task of surface reconstruction has been

    investigated by [Beder et al., 2007]. Their experiments

    show that the PMD range camera system outperforms

    the stereo system in terms of achievable accuracy for

    distance measurements.

    4. Both depth and grey scale data is captured at high framerate (15 fps).

    Although having the above advantages, this rangecamera has some drawbacks, such as a narrow Field-of-View(28 degrees), low resolution (160x120), and the distance datacontains a large amount of noise, especially near the edges ofthe objects, which is often referred to a jump boundaryeffect.

    3.1 Depth data processing

    There is a significant amount of noise in both depth andintensity data from the sensor. Pre-processing of the depthdata is essential. The standard deviation of the range depthdata is found to be reciprocal to the signal amplitude [M.Frank, 2009]. We remove the bad-pixels first, whoseamplitude of reflection is below a specified threshold. Lowamplitude indicates that their depth data is inaccuratebecause of low signal/noise ratio. The depth values at these

    removed points are assigned using a linear interpolationmethod. Then we apply a Median filter. Speckle noise can byreduced effectively by the Median Filter. However, the mostannoying noise which is known as jump boundary effectcan not be removed in this way. The points near the edges ofthe objects tend to merge into the background. Note thedistribution of the points on the edge of the hand in figure 3.One reason for this error is the limited resolution of thesensor chip [Breuer et al., 2007].

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    3/7

    Figure 3the Range Cam

    Oveerrors in theaverage inteintegration ti125~150.

    Fig.depth data u

    grey image fmiddle imagright image ilthe right oneinformation o

    R=(d/Maxd

    G=(Maxdd

    B=0

    d is the deptpoints in the

    Fig

    3.2 Error

    Several studiparameters aand Ingensanconsider statiperformance

    The

    range camer

    necessary fo

    2005]. If the

    slot, the dist

    serious at the

    could be a pi

    background

    blur effect is

    person move

    of the movin

    the depth dat

    the right im

    grey scale imera

    exposure anddepth data.

    nsity valuee to maintai

    4 shows theing the abov

    om the rangis rendered

    lustrates theis less nois

    f objects: the

    )255

    )/Maxd255

    h value, acene or a spe

    ure 4 The effec

    in regions

    s have beend depth valud, 2005] [Rc scenes. Verf the range c1-tap dista

    requires th

    a single di

    distance at a

    nce calculati

    edges of an o

    xel on an ob

    or the last t

    illustrated in

    a flat board

    objects is s

    a when the o

    ge (c). Not

    age (left) and

    underexposuhus, we fre

    of the iman an average

    ffect of the pe methods. T

    camera, shosing the origrocessing ef. The colorsRGB color v

    d is the maified value.

    t of depth data

    ith high ve

    proposed to cof the rang

    ulke, 2006].y few researcamera withnce acquisiti

    at the four c

    tance calcul

    pixel senses

    n is falsifie

    bject, becaus

    ect for the fi

    o taps. Con

    figure 5. In

    from left to r

    own in the

    ject is alrea

    that this m

    3D points (rig

    re will also ruently calcules and adjgrey value b

    re-processinghe left image

    wing the sceinal depth daect. We can srepresent thlue is assigne

    (3)

    imum depth

    processing

    locity

    alibrate the icamera [Ka

    However, thehes investigaoving objectson method

    onsecutive t

    tion [Oggier

    change in th

    . This error i

    a distance

    rst two taps

    cretely, this

    this experim

    ight. The dep

    iddle image

    y static is sh

    otion blur e

    t) from

    sult inate thest theetween

    on theis the

    e. Theta. Theee thatdepth

    d as:

    of the

    trinsiclmanny onlyted the.of the

    ps are

    et al.,

    is time

    s most

    easure

    nd the

    motion

    nt, the

    th data

    b) and

    own in

    fect is

    di

    te

    b

    pr

    rere

    co

    th

    p

    4

    R

    p

    se

    d

    M

    ca

    ca

    in

    lovprbreteli

    sispptecoco

    cooascabth

    hitoacW1

    fferent from

    nds to be la

    erging into

    ard, the dept

    obably becau

    flected signaliable depth

    nsider the de

    e depth value

    rts of the obj

    (a)igure 5 Comp

    Hand Se

    Extractio

    liable hand

    rts of body i

    gmentation,

    composition

    atching [Bor

    ndidate han

    tegories. Th

    fluences the r

    Skincate and segriable illumiojected to othightness and. However,

    iminate the eTracki

    mple. Howevace. Stereorpose, but t

    xture on thmputation rmputation ti

    In color informatily depth datasume that ameras view.hand gestur

    e body.Now

    stogram metthe camera,count. As she put all thecm, and sel

    the jump bo

    rger because

    background.

    h data on th

    se of the ins

    l. Therefore,ata of a mo

    pth value at t

    at the centroi

    ct.

    (b)rison of the de

    stati

    gmentatio

    n

    egmentation

    crucial for f

    many techn

    [Jacobs, 1

    efors, 1988],

    d shape, cl

    e fineness

    ecognition re

    olor informment hand rnation condier color spacchromatic chthis kind offect of the illng the haner, it is diffvision techniey often suf

    e hands. Fquires highe.trast to then, hand seg

    . In the humasingle persoWhen the p

    s, the hand is

    e can simplod. Althougwe have to twn in figuredepth data i

    lect the bin

    ndary effect

    the pixels

    However, o

    edge tends t

    tability of th

    in order toving object, i

    he centroid, s

    id varies less

    pth data on a

    c object

    and Traj

    from the bac

    rther analysi

    iques, such

    95] and C

    can be appli

    assifying it

    of segmenta

    sult correctne

    ation is tradgion. To re

    tions, the Rs, which sep

    annels, suchmethods domination dis

    in 2D imicult to trackques are widfer from therthermore,resolution i

    traditionalentation can

    n-robot interan is the neaerson is indiusually at a

    segment ththe hand is

    ke the noise6, we use a

    nto N binswhich indic

    . The depth

    n the borde

    this movin

    o be smaller.

    amplitude o

    find a relatt is reasonab

    ince in most

    compared to

    (c)oving object a

    ctory

    ground and

    s. After succe

    as Haar-wa

    hamfer Dis

    d to recogniz

    into pre-de

    tion signific

    ss.

    itionally useuce the effeB color israte the colo

    as HSV, CIEs not compl

    turbance.ges is relatthe hand i

    ely used forlack of sufficcurate dispmages and

    ethods basebe achievedction scenarirest object iating instrucistance in fro

    hand by athe nearest oin depth dataistogram meith an intervates the sm

    alue

    r are

    flat

    It is

    f the

    ivelyle to

    ases

    other

    d a

    other

    ssful

    velet

    tance

    e the

    fined

    antly

    d toct ofoften

    intoLab

    etely

    ively3Dthis

    cientaritymore

    d onsing, wethe

    tionsnt of

    epthbjectinto

    thod.al ofallest

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    4/7

    distance andthat the bindistance valuthat it consist

    Aftecomposed bybetween [dh-and d1 andin image cooThe depth vcheck the depall fall in thedepth data inthe small wiIn this way, t

    Morapplied to elnormalized tofrom the bina

    Figure 6 An

    5 Hand g

    5.1 Hand s

    For hand

    established.

    patterns are rFigure 7 sh

    currently int

    used for gest

    O

    also containsin a red cir, but contain

    s of noise.r the hand dithe points wd1, dh+d2]d2 are specifrdinate (x,y)lue of the hth value in aforeground,the window.dow first une 3D hand tr

    phological opliminate the

    a uniform siry image.

    example of ha

    d

    sture Rec

    ape analysi

    shape anal

    large set o

    corded withw the sampl

    rested in. B

    ring, but onl

    ne

    sufficient nule in figures only 13 poi

    tance is founhose depth v, where dh iied thresholdis the centernd is furthersmall windothen use theOtherwise, a

    til they all falajectory canerations i.e.noise. Thenze and the ed

    d gesture and

    epth data

    gnition

    s

    lysis, a dat

    images cont

    known labelsles of hand

    th left and ri

    left hand sh

    Two

    ber of point6 has the snts, which in

    d, the hand salues are in

    the hand dis. The hand pof the handrefined as fof the hand,verage value

    djust the posill in the foree obtained.rode and dilthe hand sh

    ges are easily

    the histogram

    base needs

    aining variou

    in the traininshapes that

    ight hands co

    pes are show

    Three

    s. Notemallestdicates

    ape isrange

    stance,ositionregion.llows:if theyof the

    tion ofround.

    te, areape isfound

    f the

    to be

    s hand

    g stage.e are

    uld be

    n here.

    d

    1

    te

    st

    di

    an

    pi

    E

    ch

    tr

    b

    n

    is

    t

    i

    ca

    a

    Four

    Turn R

    To m

    tabase, the c

    88] is emploDistan

    mplate imag

    age. DT is a

    stance. In a b

    d non-edge

    xel at positio

    uation 4 [Bo

    ,

    ,

    minki j

    i

    i

    v

    v

    v

    =

    The itanges. Figu

    nsformation.

    ckground (l

    rmalized (mi

    calculated (ri

    Figur

    The mo edge ima

    age I(i,j)

    lculated as E

    ,

    1(

    i j

    sn

    =

    [Borg

    verages for

    Fiv

    ight (b) Tu

    Stop (a)Figure 7 h

    atch a candi

    amfer distan

    ed.ce transform

    s are calcul

    reasonable a

    inary edge i

    pixels are se

    n (i,j) at itera

    rgefors, 19881

    1, 1

    1 1

    , 1 ,

    1 1

    1, 1,

    ( 4,

    3, ,

    3,

    k

    i j

    k

    j i j

    k

    j i j

    v

    v

    v

    +

    +

    +

    +

    erations conre 8 show

    The hand

    eft image).

    ddle image),

    ght image).

    e 8 example of

    atching proces. The mat

    nd a dista

    uation 5.

    ( , ) (I i j DT

    fors, 1984]

    the matchin

    Tur

    rn Left (a) T

    Stop (band shape patte

    date hand i

    e matching

    ation (DT) o

    ted in advan

    pproximation

    age, edge pi

    t to infinity.

    tion step kis

    ].1

    1, 1

    1

    , 1

    1

    3,

    3,

    4)

    k k

    i j i

    k k

    i j i

    v

    v v

    + +

    +

    +

    +

    +

    tinue until ths an exam

    is first se

    The edges

    then the dista

    distance transf

    dure measurching score

    ce transfor

    2, ))j

    compared

    measure:

    n Right (a)

    rn Left(b)

    rns

    age against

    ethod [Borg

    f the hand s

    ce before th

    of the Eucli

    els are set to

    The value o

    then comput

    1

    , 1

    1

    1, 1

    4,

    4,

    j

    j

    +

    +

    +

    ere are nople of dis

    gmented ou

    are found

    ce transform

    ormation

    s the similariof a binary

    ation DT(i,

    four diff

    edian, arith

    our

    fors,

    hape

    test

    dean

    zero

    f the

    d as

    (4)

    aluetance

    t of

    and

    ation

    ty ofedge

    ) is

    (5)

    erent

    etic,

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    5/7

    root mean s

    found to giv

    the matching

    result in zero.

    In o

    images, e.g.B (the dista

    already obtai

    normalized a

    matching sc

    distance tran

    DTA is deriv

    template IB,

    equation 5. T

    similarity me

    the template

    candidate and

    and the small

    5.2 3D han

    A hand trajec

    intentions ar

    static hand sh

    The

    method in s

    obvious paus

    take it as th

    hand 3D posi

    way, the ges

    fixed position

    TheMachine (FS

    sequence of s

    A st

    u is the cente

    in thex, y, z

    cells represe

    recognizer ha

    3D position

    while in thejtWhe

    recognizer de

    enters the n

    gesture is rec[Hong et al

    different nu

    simple move

    complicatedThis

    transition wdifferent frodata before a

    uare (r.m.s)

    fewer false

    is, the higher

    rder to mea

    candidate ice transfor

    ned in advan

    nd extracted t

    re is calcul

    formation of

    ed as well a

    then another

    hese two sco

    asurement be

    B. The simil

    all the templ

    est score indi

    trajectory

    tory is also e

    naturally ex

    apes.

    hand trajecto

    ection 4. Si

    between a

    starting poi

    ition is relati

    turer does no

    every time.

    trajectoriesM) method.

    tate transition

    te S is define

    r of the state

    irections. Th

    nting differe

    s its own wa

    (x,y,z) may

    h state of ges

    n a new h

    termines whe

    xt state bas

    ognized if a., 2000]. Ea

    ber of total

    ments to be

    estures by monline recogen a new

    the approacrecognition p

    and maxim

    minima than

    the similarity

    ure the simi

    ageA and a tation of th

    ce), first, the

    o a binary e

    ted by equa

    the normali

    s the binary

    matching sc

    es are added

    tween the can

    arity measur

    ates in the da

    ates the reco

    recognition

    xpressive in

    pressed by m

    ry in 3D spac

    nce the han

    ew gesture a

    nt of a gest

    e to the start

    t need to sta

    re recognizeEach gesture

    s in the spati

    d as a simple

    space,dis t

    e 3D space is

    nt states. N

    y of splitting

    e in the ith

    ture two.

    nd position

    ther to stay a

    d on the sp

    recognizer rch FSM re

    states, thus

    represented

    re states.nition methoand positionhes that requocedure begi

    um. The r.m

    others. The

    is. A perfect

    ilarity betwe

    emplate handtemplate,

    candidate i

    ge imageIA,

    tion 5; seco

    ed candidate

    edge image

    re is calcula

    together as t

    didate image

    ments betw

    tabase are cal

    gnized gestur

    n interaction

    ovement rath

    e is captured

    normally

    nd previous o

    re. The subs

    ing position.

    rt from a pa

    d by a Finitis defined t

    l-temporal sp

    vector ,

    e distance th

    divided into

    te that each

    the spatial sp

    state of gestu

    arrives, each

    t the current

    atial paramet

    aches its finognizer ma

    this method

    y fewer stat

    determinesis provided

    ire completens.

    .s was

    maller

    fit will

    n two

    imageTB, is

    age is

    then a

    d, the

    image

    of the

    ted by

    e final

    A and

    en the

    ulated

    .

    . Some

    er than

    by the

    as an

    ne, we

    equent

    In this

    rticular

    Stateo be a

    ace.

    where

    eshold

    several

    FSM

    ace. A

    re one,

    FSM

    tate or

    ers. A

    l statehave

    allows

    es and

    a state. It isesture

    6

    T

    m

    6.

    Wshcainit

    arapchcanAlpthohrehromth

    o

    te

    m

    re

    co

    11

    o

    st

    re

    g

    st

    th

    se

    th

    6.

    Fi

    th

    d

    Experim

    e hand gestu

    eters from th

    1 Hand sha

    e tested theapes shown iptured and stanother day.

    will be testedWe n

    ound horizopearances oamfer distanses, the angrrowly separlthough it isrformance be slight rotatitput. Therefond shape tcognition mrizontal rotatation of -1ultiple fingere various app

    igure 9 Examp

    Durin

    e gesture to

    sting purpos

    anually dele

    cognized ev

    nfusion matri

    in the table

    erall recogni

    For th

    ability count

    peats several

    sture. The o

    ability counteLike

    e training set

    t covers most

    en the error r

    2 Hand tra

    ve types of h

    e FSM meth

    raw a triangl

    nt and res

    res are perfo

    range camer

    pe recogniti

    ability of thein figure 7.ored in advaTo further tewith differenticed that t

    ntal and vf the handce matchingular separatiated) may alsdifficult tocause of selon in space sre, in the traimplates forre robust. Etions of -10o

    , 10, andare involved

    earances of th

    les of differen

    the period

    another, the s

    , we extrac

    te the hand

    en by hu

    ix is shown in

    are shown in

    tion error rate

    e real time,

    r. Only wh

    times it will

    line recognit

    r.ther training

    is vital to re

    of the gestu

    te is relativel

    ectory reco

    and trajectori

    od in sectio

    , pick up,

    ult

    rmed at a dis

    a in an indoo

    on

    system to rhe hand shace, and the t

    st the robustnt people in thhe rotationrtical axes

    different, whresult. Furtn (the finge

    o influence tachieve view-occlusion frhould not affning stage, w

    each gestuach gesture i, -5o, 0o, 5oseveral angul. Figure 9 gie same gestu

    e appearances

    hen the hand

    hape is not r

    ted images

    images w

    an. The o

    table 1. The

    figure 7 in th

    is under 3%.

    online reco

    n the same

    be confirme

    ion rate is 98

    -test method

    ognition res

    res which app

    y low.

    gnition

    ies are traine

    5.2, includi

    put down

    tance of abou

    environment

    cognize thee templatesst was perfo

    ess of the menear future.

    f the handmay make

    ich wouldermore, inrs are sharple matching s

    point indepem a singlect the recogne collected 1re to makes conductedand 10o, ve

    lar separatioes a rough ide.

    of the same ge

    is changing

    cognizable, s

    from videos,

    ich can no

    fline recogn

    gestures fro

    e same order

    nition, we

    recognition

    d as a recog

    %, because o

    , the selecti

    lts. If the tra

    ear in the tes

    and tested

    ing wave h

    nd come he

    t 1.5

    .

    handwerermedthod,

    boththe

    ffectsomey orcore.dentiew,ition0~18

    thewith

    rticals if

    ea of

    ture

    from

    o for

    and

    t be

    ition

    1 to

    . The

    se a

    esult

    ized

    f the

    n of

    ining

    t set,

    sing

    nd,

    e.

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    6/7

    Sev

    video are sh

    from left to ri

    the states tra

    hand position

    all its states,

    1 2

    1 82 0

    2 76

    3 3

    4

    5

    6

    7

    8

    9

    10

    11

    Table

    gestures

    Wave handDraw triangle

    Come here

    Pick up

    Put downTable

    Figure 11 s

    person move

    come hereare shown in

    its states and

    ral key fra

    wn in figure

    ight and then

    nsition of e

    in figure 10.

    o it is recogn

    3 4 5

    1 0 2

    3 2

    64 3

    57 3

    59

    3 1 1

    1 hand shape r

    Figure 10 ex

    1 2 3

    1 1 2

    1

    1

    1states transiti

    ows some f

    his hand for

    gesture. Thetable 3. Only

    it is recogniz

    Figure 11 ex

    es of the

    10. The per

    from right to

    ch gesture a

    Only wave

    ized.

    6 7 8

    2 0 0

    45

    51

    47

    1

    2

    ecognition con

    mple of wavin

    3 4 5

    n of each gest

    rames of th

    wards and ba

    states transitithe third gest

    d as come h

    mple of come

    ovement fr

    son waved hi

    left. Table 2

    ccording to t

    hand went t

    9 10 11

    0 0 0

    59

    68

    70

    fusion matrix

    hand

    States

    finished

    St

    le

    5 0

    2 2

    1 4

    1 2

    1 2re in figure 10

    video in w

    kwards to ex

    ions of eachure went thro

    ere.

    here

    m the

    s hand

    shows

    he 3D

    rough

    total

    87

    81

    70

    60

    59

    50

    51

    47

    60

    70

    70

    ates

    ft

    hich a

    press a

    estureugh all

    T

    8

    U

    m

    It

    au

    p

    m

    7

    A

    d

    se

    to

    th

    an

    ac

    dthdsiw

    se

    coinus

    ra

    b

    th

    co

    ar

    fr

    gestures

    Wave hand

    Draw triang

    Come herePick up

    Put downTable 3 s

    e overall rec

    %. Details fo

    Gestures

    names

    Wave handDraw triangl

    Come here

    Pick upPut down

    Tabl

    sing the met

    ovements ca

    can detect

    tomatically,

    rticular posit

    ay affect the

    Discussio

    real time

    scribed. Tak

    gmentation a

    the changes

    e chamfer m

    d the FSM

    hieve high reThe m

    ta in sectione same deptta is not abl

    mply. In thisich are calc

    gmented regiAn onjunction wiformation aneful for face

    First, i

    nge camera

    ckground w

    reshold is r

    rresponding

    e removed. L

    m the web c

    1

    le 1

    1 21

    1tates transition

    gnition rate

    r each gestur

    Performe

    times

    10

    10

    10

    1010

    e 4 hand traje

    od in sectio

    have differe

    the start

    and a gestur

    ion. Howev

    esult.

    n and Con

    and gesture

    ing advanta

    nd 3D tracki

    in the envir

    tching meas

    method for r

    cognition ratain drawback4 is that, wh

    range, thee to furthersituation, th

    ulated by fi

    n will also brdinary weth the range

    higher resoletection and

    mage coordi

    is aligned.

    hose depth

    moved in t

    egions of the

    ast, the hand

    amera using a

    S

    fi

    4 5

    of each gesture

    f the hand tr

    are shown i

    d Recogni

    times

    9

    8

    10

    710

    tory recognitio

    n 5.2, simpl

    nt numbers o

    nd end poi

    e is not req

    r, the scale

    clusion

    recognition

    e of a ran

    ng becomes

    nment. Expe

    rement for h

    ecognizing t

    s.of the metho

    en the handsegmentationdistinguish h

    hand positiding the cen

    e erroneous.camera ccamera to prlution, whichrecognition.ates of the w

    Second, th

    values are

    he range ca

    background

    is tracked by

    Particle Filte

    tates

    ished

    Stat

    left

    1 4

    1 3

    5 01 2

    1 2in figure 11

    jectories is a

    table 4.

    zed Recogn

    Rate

    90%

    80%

    100%

    70%100%

    n result

    and compli

    f state transit

    nt of a ge

    ired to start

    of the move

    system has

    ge camera,

    asy and inv

    riments show

    nd shape an

    e hand traje

    d using onlynd forearm ausing only

    and from forns in the i

    tral points o

    an be usevide usefulis also poten

    eb camera an

    objects in

    over a spec

    mera. Third,

    in the web ca

    color inform

    r method.

    s

    out

    ition

    ated

    ions.

    sture

    at a

    ment

    been

    hand

    riant

    that

    lysis

    ctory

    epthre inepth

    earmagesf the

    incolortially

    d the

    the

    ified

    the

    mera

    ation

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia

  • 8/3/2019 Real-time Hand Gesture Recognition Using Range Cameras

    7/7

    We then find the corresponding hand position in

    the range camera in order to find its depth value. The hand

    is segmented from a cube whose center is the 3D hand

    position.

    This extra web camera helps to improve the

    accuracy of the 3D hand position, especially in thesituations mentioned above. However, it also imposes extra

    burden of alignment between the web camera and the range

    camera. Furthermore, when the person is wearing a short

    sleeve clothes, the color based method may locate the hand

    position incorrectly.

    In the future, the recognition of more types of hand

    shapes will be included. For testing robustness, the

    templates will be created by one person, while hand images

    from several persons will be used in the test stage. The hand

    trajectory recognition method should be improved so that it

    is invariant to the scale of the movements. Gestures

    performed in an outdoor environment will also be tested to

    evaluate the capability of the range camera and therobustness of the proposed methods.

    References

    [Beder et al., 2007] C. Beder, B. Bartczak and R. Koch. Acomparison of PMD- camera and stereo-vision for thetask of surface reconstruction using Patchlets. ComputerVision and Pattern Recognition, pp 1-8, 2007.

    [Borgefors, 1988] Gunilla Borgefors. Hierarchical ChamferMatching: a Parametric Edge Matching Algorithm. IEEEtransactions on systems, man, and cybernetics. Pp 849-865. 1988.

    [Breuer et al., 2007] Pia Breuer, Christian Eckes, and StefanMuller. Hand Gesture Recognition with a novel IR Time-of-Flight Range Camera A pilot study. LNCS pp 247-260, 2007

    [Frank et al., 2009] M. Frank, M. Plause, H.Rapp, U. Kothe,B. Janhne, and F.A. Hamprehct. Theoretical andexperimental error analysis of continuous-wave time-of-flight range cameras. Opt. Eng. 48, 013602, 2009.

    [Grest et al., 2007] Daniel Grest, Volker Krger andReinhard Koch. Single View Motion Tracking by Depthand Silhouette Information. Lecture Notes in ComputerScience. pp.719-729, 2007

    [Hong et al., 2000] Pengyu Hong, Matthew Turk andThomas S. Huang. Gesture Modeling and Recognitionusing Finite State Machines. IEEE conference on Face

    and Gesture Recognition, 2000.[Holte et al., 2008] M.B. Holte, T.B. Moeslund and P. Fihl.Fusion of Range and Intensity Information for ViewInvariant Gesture Recognition. IEEE Computer SocietyConference on Computer Vision and Pattern RecognitionWorkshops, 2008.

    [Imagawa et al., 1998] Kazuyuki Imagawa, Shan Lu, SeijiIgi. Color-Based Hands Tracking System for SignLanguage Recognition. Proceedings of the 3rd.International Conference on Face & Gesture Recognition.pp 462-468, 1998

    [Jacobs, 1995] C. E. Jacobs, A. Finkelstein, and D. H.Salesin. Fast Multiresolution Image Querying.

    International conference on computer graphics andinteractive techniques, pp 277286, 1995.

    [Kahlmann and Ingensand, 2005] T. Kahlmann, H.Ingensand, Calibration and improvements of the high-resolution range-imaging camera SwissRanger,Proceedings of the SPIE, Vol. 5665, pp. 144-155,February, 2005.

    [Kjeldsen and Kender, 1996] Rick Kjeldsen and JohnKender .Toward the use of gesture in traditional userinterfaces. International Conference on Automatic Faceand Gesture recognition. 1996, pp 1257-1261.

    [Linder and Kolb, 2007] M. Lindner and A. Kolb.Calibration of the Intensity-Related Distance Error of thePMD TOF-Camera. Proceedings of Intelligent Robotsand Computer Vision XXV: Algorithms, Techniques, andActive Vision, September, 2007.

    [Loper et. al.2009] Matthew M. Loper, Nathan P. Koening,Sonia H. Chernova, Chris V. Jones, Odest C. Jenkins.Mobile human-robot teaming with environmentaltolerance. Proceedings of the 4th ACM/IEEE

    international conference on Human robot interaction.Pages 157-164 . 2009[Nickel et al., 2004] Kai Nickel, Edgar Scemann, and

    Rainer Stiefelhagen. 3D-Tracking of Head and Hands forPointing Gesture Recognition in a Human-RobotInteraction Scenario." Proceedings of the Sixth IEEEInternational Conference on Automatic Face and GestureRecognition, 2004.

    [Oggier et al., 2005] Thierry Oggier, Mike Stamm, MatthiasSchweizer, Jorn Pedersen. User manual SwissRanger 2rev. b. Version 1.02, 2005.

    [Reulke, 2006] R. Reulke. Combination of distance datawith high resolution images, IEVM06, 2006

    [Starner and Pentland, 1995] Thad Starner, and AlexPentland. Real-time American Sign Language recognition

    from video using hidden markov models. IEEETransactions on Pattern Analysis and MachineIntelligence, vol 20, pp. 1371-1375, 1995.

    [Stiefelhagen et al., 2004] R. Stiefelhagen, C. Fugen, P.Gieselmann, H. Holzapfel, K. Nickel and A. Waibel.Natural human-robot interaction using speech, head poseand gestures. Proceedings of 2004 IEEE/RSJinternational conference on Intelligent Robots andSystems, 2004

    [Sturman and Zeltzer, 1994] D.J. Sturman, and D. Zeltzer,A Survey of Glove-Based Input, IEEE ComputerGraphics and Applications, Vol. 14, pp-30-39, 1994.

    [Prusak et al., 2007] A. Prusak, O. Melnychuk, H. Roth, I.Schiller, R. Koch. Pose estimation and map building witha Time-Of-Flight-camera for robot navigation. Dynamic

    3D imaging workshop, Heidelberg, Germany, 2007[Wang et al., 2008] Using human body gestures as input for

    gaming via depth analysis.[Zhenyao Mo, J. P. Lewis, Ulrich Neumann, 2005]

    SmartCanvas: A Gesture-Driven Intelligent DrawingDesk. IUI 05: 10th International Conference onIntelligent User Interfaces, ACM, 2005

    Australasian Conference on Robotics and Automation (ACRA), December 2-4, 2009, Sydney, Australia