
Interpretation of Group Behaviour in Visually Mediated Interaction

Jamie Sherrah and Shaogang Gong
Department of Computer Science, Queen Mary and Westfield College, London E1 4NS, UK

[jamie|sgg]@dcs.qmw.ac.uk

A. Jonathan Howell and Hilary Buxton
School of Cognitive and Computing Sciences, University of Sussex, Brighton BN1 9QH, UK

[jonh|hilaryb]@cogs.susx.ac.uk

Abstract

While full computer understanding of dynamic visual scenes containing several people may be currently unattainable, we propose a computationally efficient approach to determine areas of interest in such scenes. We present methods for modelling and interpretation of multi-person human behaviour in real time to control video cameras for visually mediated interaction.

1 Introduction

Machine understanding of human motion and behaviour is currently a key research area in computer vision, and has many real-world applications. Visually Mediated Interaction (VMI) requires intelligent interpretation of a dynamic visual scene in order to determine areas of interest for fast and effective communication to a remote observer.

Recent work at the MIT Media Lab has shown some initial progress in the modelling and interpretation of human body activity [7, 9]. Computationally simple view-based approaches to action recognition have also been proposed [1, 2]. However, these systems do not attempt intentional tracking and modelling to control active camera views for VMI. Previous work on vision-based camera control has been based on off-line execution of pre-written scripts of a set of defined camera actions [8]. Here we propose to model and exploit head pose and a set of "interaction-relevant" gestures for reactive on-line visual control. These will be interpreted as user intentions for live control of an active camera with adaptive view direction and attentional focus. In particular, pointing with head pose as evidence for direction and waving for attention are important for deliberative camera control. The reactive camera movements should provide the necessary visual context for applications such as group video-conferencing and automated studio direction.

2 Modelling Human Behaviour for VMI

For our purposes, human behaviour can be considered to be any temporal sequence of body movements or configurations, such as a change in head pose, walking or waving. When attempting to model human behaviour, one must select the set of behaviours to be modelled for the application at hand. Further, the level of complexity of the modelling should be commensurate with its purpose. For the implementation of real-time systems, it is of paramount importance that only the minimum amount of information is computed to adequately model human subjects for the task at hand. In this section, some salient behaviours are defined for visually mediated interaction tasks.

Implicit Behaviour

Our system needs to identify regions of interest in a visual scene for communication to a remote subject. Examining the case in which the scene contains human subjects involved in a video conference, the subject(s) currently involved in communication will usually constitute the appropriate focus of attention. Therefore visual cues that indicate a switch in the chief communicator, or turn-taking, are most important. Gaze is quite a significant cue for determining the focus of communication, and is approximated by head pose. Gaze and other uses of body language that indicate turn-taking are generally performed unconsciously by the subject. We define implicit behaviour as a body movement sequence that is performed subconsciously by the subject.

We adopt head pose as our primary source of implicit behaviour in VMI tasks. Head pose at each time instant is represented by a pair of angles, yaw (azimuth) $\theta$ and tilt (elevation) $\phi$. Our previous work shows that yaw and tilt can be computed robustly in real time from 2D images of limited resolution [6].

Explicit Behaviour

Head pose information is insufficient to determine a subject's focus of attention from a single 2D view, due to the loss of 3D information. Therefore it is necessary to have the user communicate explicitly with our VMI system through a set of pre-defined behaviours with semantics attached to them. Let us define explicit behaviour as a sequence of body movements that are performed consciously by a subject in order to highlight regions of interest in the scene. We use a set of pointing and waving gestures as explicit behaviours for control of the current focus of attention. We have previously shown that these gestures can be reliably detected and classified in real time [5, 4]. Specifically, a model $m_i$ is maintained for each of the $G$ gestures under consideration, $i = 1, \ldots, G$, and at time $t$ a likelihood $P(\mathbf{x}(t) \mid m_i)$ is generated for each model that the given gesture has just been completed. These $G$ likelihood values are thresholded to detect a gesture, or are in themselves considered as model outputs for explicit behaviour.
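To make the detection step concrete, the following is a minimal sketch assuming the point/wave gesture set used in our examples and a hypothetical threshold value; it is not the implementation of the gesture models in [5, 4].

```python
import numpy as np
from typing import Optional

# Illustrative sketch only: given per-gesture likelihoods P(x(t) | m_i) at one
# time step, report a gesture as detected if its likelihood exceeds a threshold.
GESTURES = ["point", "wave"]      # gesture set used in the paper's examples
THRESHOLD = 0.1                   # hypothetical detection threshold

def detect_gesture(likelihoods: np.ndarray) -> Optional[str]:
    """Return the most likely gesture if its likelihood exceeds the threshold."""
    best = int(np.argmax(likelihoods))
    if likelihoods[best] > THRESHOLD:
        return GESTURES[best]
    return None

# A strong "wave" likelihood triggers a detection; weak values yield None.
print(detect_gesture(np.array([0.02, 0.18])))   # -> wave
print(detect_gesture(np.array([0.02, 0.03])))   # -> None
```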

Human Behaviour

Given that both implicit and explicit behaviours are measured from human subjects in a scene, these sources of information can be combined to form a temporal model of human behaviour. Let us define $\mathbf{b}(t)$, the behaviour vector of a subject at time $t$, to be the concatenation of measured implicit and explicit behaviours. For our purposes, the behaviour vector is the concatenation of gesture model likelihoods and head pose angles:

$$\mathbf{b}(t) = \big[\, P(\mathbf{x}(t) \mid m_1), \ldots, P(\mathbf{x}(t) \mid m_G),\; \theta(t),\; \phi(t) \,\big]^{\mathsf T} \qquad (1)$$
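As an illustration of equation (1), the sketch below assembles a behaviour vector from two assumed gesture likelihoods and one head pose measurement; the names and numerical values are placeholders, not measured data.

```python
import numpy as np

# Minimal sketch of equation (1), assuming G = 2 gesture models (point, wave).
def behaviour_vector(gesture_likelihoods, yaw_deg, tilt_deg):
    """Concatenate the gesture model likelihoods with the head pose angles."""
    return np.concatenate([np.asarray(gesture_likelihoods), [yaw_deg, tilt_deg]])

b_t = behaviour_vector([0.02, 0.18], yaw_deg=95.0, tilt_deg=80.0)
# b_t = [P(x(t)|m_point), P(x(t)|m_wave), theta(t), phi(t)]
```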

3 Interpretation of Group Activities

Although the individual interpretation of behaviours is possible by attaching pre-defined semantics in the form of camera control commands, the case of multiple subjects is not so simple due to the combinatorial explosion of possibilities. These possibilities not only include variations in which behaviours occur simultaneously, but also in their timing and duration. Clearly the range of possibilities and ambiguities present a problem for any single visual cue, no matter how accurately it can be computed. It is only by fusing different visual cues that we may successfully interpret the scene.

3.1 High-Level Group Behaviour Interpretation

Now we describe a methodology for machine understanding of group behaviours. Given the complexities of the unconstrained multi-person environment described above, we examine a more constrained situation. We assume a fixed number $N$ of people who remain in the scene at all times. Let us define the group vector to be the concatenation of the $N$ behaviour vectors of these people at time $t$:

$$\mathbf{g}(t) = \big[\, \mathbf{b}_1(t)^{\mathsf T}, \mathbf{b}_2(t)^{\mathsf T}, \ldots, \mathbf{b}_N(t)^{\mathsf T} \,\big]^{\mathsf T} \qquad (2)$$

The group vector is an overall description of the scene at a given time instant. Let us define a group behaviour as a temporal sequence of group vectors, $\{\mathbf{g}(t_1), \mathbf{g}(t_2), \ldots, \mathbf{g}(t_T)\}$.
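A minimal sketch of equation (2) and of a group behaviour sequence, assuming $N = 3$ subjects and placeholder behaviour vectors:

```python
import numpy as np

# Sketch of equation (2); the inputs are assumed to come from equation (1).
def group_vector(behaviour_vectors):
    """Concatenate the N per-subject behaviour vectors b_1(t), ..., b_N(t)."""
    return np.concatenate(behaviour_vectors)

def group_behaviour(per_frame_behaviours):
    """Stack group vectors over time: rows are g(t_1), ..., g(t_T)."""
    return np.stack([group_vector(frame) for frame in per_frame_behaviours])

# Example: T = 2 frames, each with three 4-dimensional behaviour vectors.
frames = [[np.zeros(4), np.zeros(4), np.zeros(4)] for _ in range(2)]
print(group_behaviour(frames).shape)   # -> (2, 12)
```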

Given a group behaviour, we introduce a high-level interpretation model to determine the current area of focus. Since the region of interest is almost always a person and we track the head of each individual, the output need only give an indication of which of the $N$ people are currently attended to. Therefore we define the output of the high-level system to be the camera position vector:

$$\mathbf{c}(t) = \big[\, c_1, c_2, \ldots, c_N \,\big]^{\mathsf T} \qquad (3)$$

where $c_i$ is a boolean value (0 or 1) indicating whether person $i$ is currently attended to. An interpretation can then be placed upon $\mathbf{c}(t)$ to control the movable camera. Examples are given in Table 1. Such a scheme would require at least two cameras, one to frame the whole scene for tracking of all individuals, and the other for taking close-up shots. Here we use a "virtual camera" by cropping focal regions from the global image. The high-level interpretation model must transform a recent history of group vectors into a camera vector for the current scene. However, without feedback to retain the previous focus of attention, the system will lack the context to correctly interpret behaviour. For instance, if a subject waves to gain the focus of attention, the camera vector must remain on the subject until another subject attracts attention. Without feedback, the subject would lose the focus of attention as soon as the gesture has ended. Therefore the general form of the high-level interpretation system $F(\cdot)$ is:

$$\mathbf{c}(t) = F\big(\mathbf{g}(t), \mathbf{c}(t-1)\big) = F\big(\mathbf{s}(t)\big) \qquad (4)$$

where $\mathbf{s}(t)$ is the scene vector at time $t$, defined as the concatenation of the current group vector and previous camera vector, $\mathbf{s}(t) = \big[\, \mathbf{g}(t)^{\mathsf T}, \mathbf{c}(t-1)^{\mathsf T} \,\big]^{\mathsf T}$.

Given this model, the high-level interpretation system must perform the translation from behaviours to focus of attention based on a fusion of external semantic definition and statistics of behaviours and their timings. The semantics may come from a set of rules, but an exhaustive specification of the system would be infeasible due to the multiplicity of possible co-occurring behaviours and their timings. We take a supervised learning approach: the system is trained on a set of example group behaviours, with the aim of generalising to new group behaviours. To learn the transformation from scene vector to camera position vector, we used a Time-Delay RBF Network [3], trained on half of our sequence database and tested on the other half.
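The sketch below illustrates the recurrent form of equation (4); `predict` is a stand-in for the trained time-delay RBF network $F$, and the dummy predictor in the usage example is purely illustrative. The point of the example is the feedback of $\mathbf{c}(t-1)$ through the scene vector, which retains the previous focus of attention.

```python
import numpy as np

def interpret_sequence(group_vectors, predict, n_people=3):
    """Run the recurrent interpretation c(t) = F([g(t); c(t-1)]) over a sequence.

    `predict` maps a scene vector s(t) to n_people real-valued outputs,
    thresholded here into a boolean camera position vector.
    """
    c_prev = np.zeros(n_people)                 # initially frame the whole scene
    outputs = []
    for g_t in group_vectors:
        s_t = np.concatenate([g_t, c_prev])     # scene vector [g(t); c(t-1)]
        c_t = (np.asarray(predict(s_t)) > 0.5).astype(float)
        outputs.append(c_t)
        c_prev = c_t                            # feedback retains focus of attention
    return np.stack(outputs)

# Usage with a dummy predictor that always attends to subject A.
dummy = lambda s: np.array([0.9, 0.1, 0.1])
print(interpret_sequence([np.zeros(12), np.zeros(12)], dummy))
```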

We constrain the complexity of the task by restricting the group behaviours to certain fixed scenarios. The exact timing of the events will vary between different instances of the same scenario, but the focus of attention should switch to the same regions at approximately the same times. Descriptions of example scenarios involving three subjects are given in Table 2. Several examples of each scenario were collected, and training examples were labelled by hand with a camera position vector for each scene vector. A high-level system consisting of a recurrent RBF network was trained on these examples and then tested on a different set of test instances of the same scenarios.

Figures 1–4 show examples of the system output for two example scenarios: wave-look and point. Figures 1 and 3 show temporally-ordered frames with boxes framing the head, face and hands being tracked. In each frame, head pose is shown above the head with an intuitive dial box. Figures 2 and 4 show the head pose angles (top) and gesture likelihoods (middle) for persons A, B and C (from left to right). One can see the correspondence of peaks in the gesture likelihoods with gesture events in the scenario.

c(t)       interpretation
[0,0,0]    frame whole scene
[1,0,0]    focus on subject A
[0,1,1]    focus on subjects B and C using a split-screen effect

Table 1. Example of possible interpretations of camera position vectors for three people.
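As one possible rendering of the interpretations in Table 1, the following sketch maps a camera position vector to a virtual-camera action; the `head_boxes` argument (one bounding box per subject) is a hypothetical tracker output introduced only for illustration and is not an interface stated in this paper.

```python
# Sketch: crop close-ups from the global image for attended subjects,
# otherwise frame the whole scene, as in the "virtual camera" described above.
def camera_action(c, head_boxes):
    attended = [i for i, on in enumerate(c) if on]
    if not attended:
        return "frame whole scene"
    if len(attended) == 1:
        return f"close-up crop of subject {attended[0]}: {head_boxes[attended[0]]}"
    return "split-screen crops: " + ", ".join(str(head_boxes[i]) for i in attended)

boxes = [(10, 20, 40, 40), (120, 22, 40, 40), (230, 25, 40, 40)]   # A, B, C
print(camera_action([0, 0, 0], boxes))   # frame whole scene
print(camera_action([0, 1, 1], boxes))   # split-screen of B and C
```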

scenario    description
wave-look   C waves and speaks, A waves and speaks, B waves and speaks. Each time someone is speaking, the other two subjects look at him.
point       C waves and speaks, A and B look at C, C points to A, C and B look at A, A looks at the camera and speaks.

Table 2. The example scenarios described in temporal order of their behaviours. All subjects are looking at the camera (forward) unless stated otherwise.

The bottom sections of Figures 2 and 4 show the training signal, or target camera vectors, traced above the actual output camera vectors obtained during tests with the trained RBF network. It can be seen that the network follows the general interpretation of group behaviour, though the exact points of transition from one focus of attention to another do not always coincide. These transitions are highly subjective and difficult to determine, even to the human eye.

4 Conclusion

Some key issues in visual interpretation of group behaviours in single views have been explored, and a framework has been presented for tracking people and recognising their correlated group behaviours in VMI contexts. Pre-defined gestures and the head pose of several individuals in the scene can be simultaneously recognised for scene interpretation. In the presence of multiple people, ambiguities arise and a high-level interpretation of the combined behaviours of the individuals becomes essential.

We have shown examples of how multi-person activity scenarios can be learned from training examples and interpolated to obtain the same interpretation for different instances of the same scenario. However, for the approach to scale up to more general applications, it must be able to cope with a whole range of scenarios, and extrapolate to novel situations in the same way as a person. A significant issue raised in this paper and for future work is the feasibility of learning correlated temporal structures and default behaviours from sparse data.

Since this system relies on several independent components, the overall probability of failure of at least one component is always quite high. Therefore the high-level interpretation system must be able to cope with missing or noisy inputs. The system outputs may be fed back to the low-level sub-systems to guide their processing.

References

[1] A. F. Bobick. Movement, activity, and action: The role of knowledge in the perception of motion. Proc. Royal Society London, Series B, 352:1257–1265, 1997.

[2] R. Cutler and M. Turk. View-based interpretation of real-time optical flow for gesture recognition. In Proc. FG'98, pp. 416–421, Nara, Japan, 1998.

[3] A. J. Howell and H. Buxton. Recognising simple behaviours using time-delay RBF networks. Neural Processing Letters, 5:97–104, 1997.

[4] A. J. Howell and H. Buxton. Learning gestures for visually mediated interaction. In Proc. BMVC, pp. 508–517, Southampton, UK, 1998. BMVA Press.

[5] S. J. McKenna and S. Gong. Gesture recognition for visually mediated interaction using probabilistic event trajectories. In Proc. BMVC, pp. 498–507, Southampton, UK, 1998.

[6] E. Ong, S. McKenna and S. Gong. Tracking head pose for inferring intention. In European Workshop on Perception of Human Action, Freiburg, June 1998.

[7] A. Pentland. Smart rooms. Scientific American, 274(4):68–76, 1996.

[8] C. Pinhanez and A. F. Bobick. Approximate world models: Incorporating qualitative and linguistic information into vision systems. In AAAI'96, Portland, Oregon, 1996.

[9] C. R. Wren and A. P. Pentland. Dynamic models of human motion. In Proc. FG'98, pp. 22–27, Nara, Japan, 1998.

Figure 1. Frames from wave-look sequence. Individuals are labelled A, B and C from left to right.

[Plots omitted: per-person yaw/tilt head pose angles (degrees) and gesture likelihoods p(point), p(wave) against time (seconds), with target and output camera position vectors below.]

Figure 2. Results for wave-look scenario. Plots show pose angles (top) for persons A, B and C from left to right, gesture likelihoods (middle) and target/output camera position vectors (bottom).

Figure 3. Frames from point sequence. Individuals are labelled A, B and C from left to right.

[Plots omitted: per-person yaw/tilt head pose angles (degrees) and gesture likelihoods p(point), p(wave) against time (seconds), with target and output camera position vectors below.]

Figure 4. Results for point scenario. Plots show pose angles (top) for persons A, B and C from left to right, gesture likelihoods (middle) and target/output camera position vectors (bottom).
