

Interpretation of Group Behaviour in Visually Mediated Interaction

Jamie Sherrah and Shaogang Gong
Department of Computer Science, Queen Mary and Westfield College, London E1 4NS, UK

[jamie|sgg]@dcs.qmw.ac.uk

A. Jonathan Howell and Hilary Buxton
School of Cognitive and Computing Sciences, University of Sussex, Brighton BN1 9QH, UK

[jonh|hilaryb]@cogs.susx.ac.uk

Abstract

While full computer understanding of dynamic visual scenes containing several people may be currently unattainable, we propose a computationally efficient approach to determine areas of interest in such scenes. We present methods for modelling and interpretation of multi-person human behaviour in real time to control video cameras for visually mediated interaction.

1 Introduction

Machine understanding of human motion and behaviour is currently a key research area in computer vision, and has many real-world applications. Visually Mediated Interaction (VMI) requires intelligent interpretation of a dynamic visual scene in order to determine areas of interest for fast and effective communication to a remote observer.

Recent work at the MIT Media Lab has shown some initial progress in the modelling and interpretation of human body activity [7, 9]. Computationally simple view-based approaches to action recognition have also been proposed [1, 2]. However, these systems do not attempt intentional tracking and modelling to control active camera views for VMI. Previous work on vision-based camera control has been based on off-line execution of pre-written scripts of a set of defined camera actions [8]. Here we propose to model and exploit head pose and a set of "interaction-relevant" gestures for reactive on-line visual control. These will be interpreted as user intentions for live control of an active camera with adaptive view direction and attentional focus. In particular, pointing with head pose as evidence for direction and waving for attention are important for deliberative camera control. The reactive camera movements should provide the necessary visual context for applications such as group video-conferencing and automated studio direction.

2 Modelling Human Behaviour for VMI

For our purposes, human behaviour can be considered to be any temporal sequence of body movements or configurations, such as a change in head pose, walking or waving. When attempting to model human behaviour, one must select the set of behaviours to be modelled for the application at hand. Further, the level of complexity of the modelling should match its purpose. For real-time systems in particular, it is essential to compute only the minimum amount of information needed to adequately model human subjects for the task at hand. In this section, some salient behaviours are defined for visually mediated interaction tasks.

Implicit Behaviour

Our system needs to identify regions of interest in a visual scene for communication to a remote subject. In the case where the scene contains human subjects involved in a video conference, the subject(s) currently involved in communication will usually constitute the appropriate focus of attention. Visual cues that indicate a switch in the chief communicator, or turn-taking, are therefore the most important. Gaze is a significant cue for determining the focus of communication, and is approximated here by head pose. Gaze and other uses of body language that indicate turn-taking are generally performed unconsciously by the subject. We define implicit behaviour as a body movement sequence that is performed subconsciously by the subject.

We adopt head pose as our primary source of implicit behaviour in VMI tasks. Head pose at each time instant is represented by a pair of angles, yaw (azimuth) and tilt (elevation). Our previous work shows that yaw and tilt can be computed robustly in real time from 2D images of limited resolution [6].
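As a concrete illustration (not part of the original system), the per-frame implicit-behaviour measurement can be represented as a simple (yaw, tilt) record, and a subject's implicit behaviour as the sequence of such records over time. The Python names and values below are ours, purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class HeadPose:
    """Per-frame implicit behaviour: head pose as two angles in degrees."""
    yaw: float   # azimuth, rotation about the vertical axis
    tilt: float  # elevation, rotation about the horizontal axis

# A subject's implicit behaviour over time is simply the sequence of poses.
implicit_behaviour = [HeadPose(yaw=95.0, tilt=80.0), HeadPose(yaw=92.5, tilt=81.0)]
```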

Explicit Behaviour

Head pose information is insufficient to determine a subject's focus of attention from a single 2D view, due to loss of 3D information. Therefore it is necessary to have the user communicate explicitly with our VMI system through a set of pre-defined behaviours with vague semantics attached to them. Let us define explicit behaviour as a sequence of body movements that are performed consciously by a subject in order to highlight regions of interest in the scene. We use a set of pointing and waving gestures as explicit behaviours for control of the current focus of attention. We have previously shown that these gestures can be reliably detected and classified in real time [5, 4]. Specifically, a model m_i is maintained for each of the G gestures under consideration, i = 1, ..., G, and at time t a likelihood p(x(t) | m_i) that the given gesture has just been completed is generated for each model. These G likelihood values are either thresholded to detect a gesture, or are themselves used as model outputs for explicit behaviour.
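The detection step described above amounts to thresholding the G per-model likelihoods at each time step. A minimal sketch, assuming the likelihoods have already been produced by the gesture models of [5, 4]; the threshold value is illustrative, not taken from the paper:

```python
import numpy as np

def detect_gesture(likelihoods, threshold=0.1):
    """Threshold the G per-model likelihoods p(x(t) | m_i) at one time step.

    Returns the index of the most likely gesture if its likelihood exceeds
    the threshold, otherwise None (no explicit behaviour detected).
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    best = int(np.argmax(likelihoods))
    return best if likelihoods[best] > threshold else None

# e.g. two gesture models, "point" and "wave":
print(detect_gesture([0.02, 0.17]))  # -> 1 (wave detected)
```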

Human Behaviour

Given that both implicit and explicit behaviours are measured from human subjects in a scene, these sources of information can be combined to form a temporal model of human behaviour. Let us define b(t), the behaviour vector of a subject at time t, to be the concatenation of measured implicit and explicit behaviours. For our purposes, the behaviour vector is the concatenation of gesture model likelihoods and head pose angles:

b(t) = [ p(x(t) | m_1), ..., p(x(t) | m_G), yaw(t), tilt(t) ]^T        (1)
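In code, Eq. (1) is a straightforward concatenation. The sketch below assumes the gesture likelihoods and pose angles have already been computed by the low-level modules; function and argument names are ours:

```python
import numpy as np

def behaviour_vector(gesture_likelihoods, yaw, tilt):
    """Eq. (1): concatenate the G gesture-model likelihoods with the
    head-pose angles into a single behaviour vector b(t)."""
    return np.concatenate([np.asarray(gesture_likelihoods, dtype=float),
                           [float(yaw), float(tilt)]])

b_t = behaviour_vector([0.02, 0.17], yaw=95.0, tilt=80.0)  # shape (G + 2,)
```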

3 Interpretation of Group Activities

Although the individual interpretation of behaviours is possible by attaching pre-defined semantics in the form of camera control commands, the case of multiple subjects is not so simple due to the combinatorial explosion of possibilities. These possibilities include not only variations in which behaviours occur simultaneously, but also in their timing and duration. Clearly the range of possibilities and ambiguities presents a problem for any single visual cue, no matter how accurately it can be computed. It is only by fusing different visual cues that we may successfully interpret the scene.

3.1 High-Level Group Behaviour Interpretation

Now we describe a methodology for machine understanding of group behaviours. Given the complexities of the unconstrained multi-person environment described above, we examine a more constrained situation. We assume a fixed number K of people who remain in the scene at all times. Let us define the group vector to be the concatenation of the K behaviour vectors of these people at time t:

g(t) = [ b_1(t)^T, b_2(t)^T, ..., b_K(t)^T ]^T        (2)

The group vector is an overall description of the scene at a given time instant. Let us define a group behaviour as a temporal sequence of group vectors, { g(t_1), g(t_2), ..., g(t_T) }.
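Eq. (2) and the group-behaviour definition translate directly into code. The sketch below assumes each person's behaviour vector b_k(t) is available per frame; stacking the sequence into a 2D array is our convenience choice, not the paper's:

```python
import numpy as np

def group_vector(behaviour_vectors):
    """Eq. (2): concatenate the K per-person behaviour vectors b_k(t)
    into a single group vector g(t) describing the scene at time t."""
    return np.concatenate(behaviour_vectors)

def group_behaviour(per_frame_behaviour_vectors):
    """A group behaviour is the temporal sequence {g(t_1), ..., g(t_T)},
    stacked here as a (T, K * (G + 2)) array."""
    return np.stack([group_vector(bvs) for bvs in per_frame_behaviour_vectors])
```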

Given a group behaviour, we introduce a high-level interpretation model to determine the current area of focus. Since the region of interest is almost always a person, and we track the head of each individual, the output need only indicate which of the K people are currently attended to. Therefore we define the output of the high-level system to be the camera position vector:

c(t) = [ c_1, c_2, ..., c_K ]^T        (3)

where c_i is a boolean value (0 or 1) indicating whether person i is currently attended to. An interpretation can then be placed upon c(t) to control the movable camera. Examples are given in Table 1. Such a scheme would require at least two cameras, one to frame the whole scene for tracking of all individuals, and the other for taking close-up shots. Here we use a "virtual camera" by cropping focal regions from the global image. The high-level interpretation model must transform a recent history of group vectors into a camera vector for the current scene. However, without feedback to retain the previous focus of attention, the system will lack the context to correctly interpret behaviour. For instance, if a subject waves to gain the focus of attention, the camera vector must remain on the subject until another subject attracts attention. Without feedback, the subject would lose the focus of attention as soon as the gesture has ended. Therefore the general form of the high-level interpretation system F(.) is:

c(t) = F(g(t), c(t-1)) = F(s(t))        (4)

where s(t) is the scene vector at time t, defined as the concatenation of the current group vector and the previous camera vector, s(t) = [ g(t)^T, c(t-1)^T ]^T.

Given this model, the high-level interpretation system must perform the translation from behaviours to focus of attention based on a fusion of external semantic definition and statistics of behaviours and their timings. The semantics may come from a set of rules, but an exhaustive specification of the system would be infeasible due to the multiplicity of possible co-occurring behaviours and their timings. We take a supervised learning approach: the system is trained on a set of example group behaviours, with the aim of generalising to new group behaviours. To learn the transformation from scene vector to camera position vector, we used a Time-Delay RBF network [3], trained on half of our sequence database and tested on the other half.
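The closed loop of Eq. (4) can be sketched as below. A generic Gaussian RBF regressor trained by least squares on fixed centres stands in for the Time-Delay RBF network of [3], and the 0.5 decision threshold is an assumption; the essential point is that the previous camera vector c(t-1) is fed back into the scene vector s(t):

```python
import numpy as np

class SimpleRBFNet:
    """Generic Gaussian RBF regressor standing in for the time-delay /
    recurrent RBF network of [3]; this training rule (least squares on
    fixed centres) is our assumption, not the paper's exact procedure."""

    def __init__(self, centres, width):
        self.centres = np.asarray(centres, dtype=float)  # (M, D) basis centres
        self.width = float(width)                        # shared Gaussian width
        self.weights = None                              # (M, K) output weights

    def _activations(self, X):
        # Gaussian response of each centre to each input row of X (N, D).
        d2 = ((X[:, None, :] - self.centres[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * self.width ** 2))     # (N, M)

    def fit(self, scene_vectors, camera_vectors):
        Phi = self._activations(np.asarray(scene_vectors, dtype=float))
        self.weights, *_ = np.linalg.lstsq(
            Phi, np.asarray(camera_vectors, dtype=float), rcond=None)

    def predict(self, scene_vector):
        Phi = self._activations(np.asarray(scene_vector, dtype=float)[None, :])
        return (Phi @ self.weights)[0]                   # (K,) continuous scores

def run_interpreter(net, group_vectors, num_people):
    """Eq. (4): c(t) = F(s(t)) with s(t) = [g(t), c(t-1)]; feeding back the
    previous camera vector lets the focus persist between gestures."""
    c_prev = np.zeros(num_people)                 # start by framing the whole scene
    outputs = []
    for g_t in group_vectors:
        s_t = np.concatenate([g_t, c_prev])       # scene vector s(t)
        c_t = (net.predict(s_t) > 0.5).astype(float)  # boolean camera position vector
        outputs.append(c_t)
        c_prev = c_t                              # feedback of focus of attention
    return np.stack(outputs)
```

During training, the scene vectors would be built with the hand-labelled camera vector of the previous frame (teacher forcing); this detail is our assumption about how the labelled data described below would be used.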

We constrain the complexity of the task by restricting the group behaviours to certain fixed scenarios. The exact timing of the events will vary between different instances of the same scenario, but the focus of attention should switch to the same regions at approximately the same times. Descriptions of example scenarios involving three subjects are given in Table 2. Several examples of each scenario were collected, and training examples were labelled by hand with a camera position vector for each scene vector. A high-level system consisting of a recurrent RBF network was trained on these examples and then tested on a different set of test instances of the same scenarios.

Figures 1–4 show examples of the system output for two example scenarios: wave-look and point. Figures 1 and 3 show temporally-ordered frames with boxes framing the head, face and hands being tracked. In each frame, head pose is shown above the head with an intuitive dial box. Figures 2 and 4 show the head pose angles (top) and gesture likelihoods (middle) for persons A, B and C (from left to right). One can see the correspondence of peaks in the gesture likelihoods with gesture events in the scenario.

c(t)       interpretation
[0,0,0]    frame whole scene
[1,0,0]    focus on subject A
[0,1,1]    focus on subjects B and C using a split-screen effect

Table 1. Example of possible interpretations of camera position vectors for three people.
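A camera position vector can be mapped to a concrete camera command in the spirit of Table 1. The wording of the commands in this sketch is ours:

```python
def camera_command(c, names=("A", "B", "C")):
    """Turn a boolean camera position vector c(t) into a camera action,
    following the pattern of Table 1 for three people."""
    focused = [name for name, flag in zip(names, c) if flag]
    if not focused:
        return "frame whole scene"
    if len(focused) == 1:
        return f"focus on subject {focused[0]}"
    return "focus on subjects " + " and ".join(focused) + " (split-screen)"

print(camera_command([0, 1, 1]))  # -> focus on subjects B and C (split-screen)
```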

scenario     description
wave-look    C waves and speaks, A waves and speaks, B waves and speaks. Each time someone is speaking, the other two subjects look at him.
point        C waves and speaks, A and B look at C, C points to A, C and B look at A, A looks at the camera and speaks.

Table 2. The example scenarios described in temporal order of their behaviours. All subjects are looking at the camera (forward) unless stated otherwise.

The bottom sections of Figures 2 and 4 show the training signal, or target camera vectors, traced above the actual output camera vectors obtained during tests with the trained RBF network. It can be seen that the network follows the general interpretation of group behaviour, though the exact points of transition from one focus of attention to another do not always coincide. These transitions are highly subjective and difficult to determine, even to the human eye.

4 Conclusion

Some key issues in visual interpretation of group behaviours in single views have been explored, and a framework has been presented for tracking people and recognising their correlated group behaviours in VMI contexts. Pre-defined gestures and the head pose of several individuals in the scene can be simultaneously recognised for scene interpretation. In the presence of multiple people, ambiguities arise and a high-level interpretation of the combined behaviours of the individuals becomes essential.

We have shown examples of how multi-person activity scenarios can be learned from training examples and interpolated to obtain the same interpretation for different instances of the same scenario. However, for the approach to scale up to more general application, it must be able to cope with a whole range of scenarios, and extrapolate to novel situations in the same way as a person. A significant issue raised in this paper and for future work is the feasibility of learning correlated temporal structures and default behaviours from sparse data.

Since this system relies on several independent components, the overall probability of failure of at least one component is always quite high. Therefore the high-level interpretation system must be able to cope with missing or noisy inputs. The system outputs may be fed back to the low-level sub-systems to guide their processing.

References

[1] A. F. Bobick. Movement, activity, and action: The role of knowledge in the perception of motion. Proc. Royal Society London, Series B, 352:1257–1265, 1997.

[2] R. Cutler and M. Turk. View-based interpretation of real-time optical flow for gesture recognition. In Proc. FG'98, pp. 416–421, Nara, Japan, 1998.

[3] A. J. Howell and H. Buxton. Recognising simple behaviours using time-delay RBF networks. Neural Processing Letters, 5:97–104, 1997.

[4] A. J. Howell and H. Buxton. Learning gestures for visually mediated interaction. In Proc. BMVC, pp. 508–517, Southampton, UK, 1998. BMVA Press.

[5] S. J. McKenna and S. Gong. Gesture recognition for visually mediated interaction using probabilistic event trajectories. In Proc. BMVC, pp. 498–507, Southampton, UK, 1998.

[6] E. Ong, S. McKenna and S. Gong. Tracking head pose for inferring intention. In European Workshop on Perception of Human Action, Freiburg, June 1998.

[7] A. Pentland. Smart rooms. Scientific American, 274(4):68–76, 1996.

[8] C. Pinhanez and A. F. Bobick. Approximate world models: Incorporating qualitative and linguistic information into vision systems. In AAAI'96, Portland, Oregon, 1996.

[9] C. R. Wren and A. P. Pentland. Dynamic models of human motion. In Proc. FG'98, pp. 22–27, Nara, Japan, 1998.

Figure 1. Frames from the wave-look sequence. Individuals are labelled A, B and C from left to right.

[Plots: head pose angle (degrees) and gesture likelihoods p(point), p(wave) against time (seconds), with target and output camera position vectors below.]

Figure 2. Results for the wave-look scenario. Plots show pose angles (top) for persons A, B and C from left to right, gesture likelihoods (middle) and target/output camera position vectors (bottom).

Figure 3. Frames from the point sequence. Individuals are labelled A, B and C from left to right.

[Plots: head pose angle (degrees) and gesture likelihoods p(point), p(wave) against time (seconds), with target and output camera position vectors below.]

Figure 4. Results for the point scenario. Plots show pose angles (top) for persons A, B and C from left to right, gesture likelihoods (middle) and target/output camera position vectors (bottom).