
Interpretation of Group Behaviour in Visually Mediated Interaction

Jamie Sherrah and Shaogang Gong
Department of Computer Science, Queen Mary and Westfield College, London E1 4NS, UK

[jamie|sgg]@dcs.qmw.ac.uk

A. Jonathan Howell and Hilary Buxton
School of Cognitive and Computing Sciences, University of Sussex, Brighton BN1 9QH, UK

[jonh|hilaryb]@cogs.susx.ac.uk

Abstract

While full computer understanding of dynamic visual scenes containing several people may be currently unattainable, we propose a computationally efficient approach to determine areas of interest in such scenes. We present methods for modelling and interpretation of multi-person human behaviour in real time to control video cameras for visually mediated interaction.

1 Introduction

Machine understanding of human motion and behaviour is currently a key research area in computer vision, and has many real-world applications. Visually Mediated Interaction (VMI) requires intelligent interpretation of a dynamic visual scene in order to determine areas of interest for fast and effective communication to a remote observer.

Recent work at the MIT Media Lab has shown some initial progress in the modelling and interpretation of human body activity [7, 9]. Computationally simple view-based approaches to action recognition have also been proposed [1, 2]. However, these systems do not attempt intentional tracking and modelling to control active camera views for VMI. Previous work on vision-based camera control has been based on off-line execution of pre-written scripts of a set of defined camera actions [8]. Here we propose to model and exploit head pose and a set of "interaction-relevant" gestures for reactive on-line visual control. These will be interpreted as user intentions for live control of an active camera with adaptive view direction and attentional focus. In particular, pointing with head pose as evidence for direction and waving for attention are important for deliberative camera control. The reactive camera movements should provide the necessary visual context for applications such as group video-conferencing and automated studio direction.

2 Modelling Human Behaviour for VMI

For our purposes, human behaviour can be considered to be any temporal sequence of body movements or configurations, such as a change in head pose, walking or waving. When attempting to model human behaviour, one must select the set of behaviours to be modelled for the application at hand. Further, the level of complexity of the modelling should be commensurate with its purpose. For the implementation of real-time systems, it is of paramount importance that only the minimum amount of information is computed to adequately model human subjects for the task at hand. In this section, some salient behaviours are defined for visually mediated interaction tasks.

Implicit Behaviour

Our system needs to identify regions of interest in a visual scene for communication to a remote subject. Examining the case in which the scene contains human subjects involved in a video conference, the subject(s) currently involved in communication will usually constitute the appropriate focus of attention. Therefore visual cues that indicate a switch in the chief communicator, or turn-taking, are most important. Gaze is quite a significant cue for determining the focus of communication, and is approximated by head pose. Gaze and other uses of body language that indicate turn-taking are generally performed unconsciously by the subject. We define implicit behaviour as a body movement sequence that is performed subconsciously by the subject.

We adopt head pose as our primary source of implicit behaviour in VMI tasks. Head pose at each time instant is represented by a pair of angles, yaw (azimuth) $\theta$ and tilt (elevation) $\phi$. Our previous work shows that yaw and tilt can be computed robustly in real time from 2D images of limited resolution [6].

Explicit Behaviour

Head pose information is insufficient to determine a subject's focus of attention from a single 2D view, due to the loss of 3D information. Therefore it is necessary to have the user communicate explicitly with our VMI system through a set of pre-defined behaviours with semantics attached to them. Let us define explicit behaviour as a sequence of body movements that are performed consciously by a subject in order to highlight regions of interest in the scene. We use a set of pointing and waving gestures as explicit behaviours for control of the current focus of attention. We have previously shown that these gestures can be reliably detected and classified in real time [5, 4]. Specifically, a model $m_i$ is maintained for each of the $G$ gestures under consideration, $i = 1, \ldots, G$, and at time $t$ a likelihood $P(\mathbf{x}(t) \mid m_i)$ is generated for each model that the given gesture has just been completed. These $G$ likelihood values are thresholded to detect a gesture, or are in themselves considered as model outputs for explicit behaviour.
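To make the detection step concrete, the following is a minimal sketch assuming the point/wave gesture set used in our examples and a hypothetical threshold value; it is not the implementation of the gesture models in [5, 4].

```python
import numpy as np
from typing import Optional

# Illustrative sketch only: given per-gesture likelihoods P(x(t) | m_i) at one
# time step, report a gesture as detected if its likelihood exceeds a threshold.
GESTURES = ["point", "wave"]      # gesture set used in the paper's examples
THRESHOLD = 0.1                   # hypothetical detection threshold

def detect_gesture(likelihoods: np.ndarray) -> Optional[str]:
    """Return the most likely gesture if its likelihood exceeds the threshold."""
    best = int(np.argmax(likelihoods))
    if likelihoods[best] > THRESHOLD:
        return GESTURES[best]
    return None

# A strong "wave" likelihood triggers a detection; weak values yield None.
print(detect_gesture(np.array([0.02, 0.18])))   # -> wave
print(detect_gesture(np.array([0.02, 0.03])))   # -> None
```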

Human Behaviour

Given that both implicit and explicit behaviours are measured from human subjects in a scene, these sources of information can be combined to form a temporal model of human behaviour. Let us define $\mathbf{b}(t)$, the behaviour vector of a subject at time $t$, to be the concatenation of measured implicit and explicit behaviours. For our purposes, the behaviour vector is the concatenation of gesture model likelihoods and head pose angles:

$$\mathbf{b}(t) = \big[\, P(\mathbf{x}(t) \mid m_1), \ldots, P(\mathbf{x}(t) \mid m_G),\; \theta(t),\; \phi(t) \,\big]^{\mathsf T} \qquad (1)$$
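As an illustration of equation (1), the sketch below assembles a behaviour vector from two assumed gesture likelihoods and one head pose measurement; the names and numerical values are placeholders, not measured data.

```python
import numpy as np

# Minimal sketch of equation (1), assuming G = 2 gesture models (point, wave).
def behaviour_vector(gesture_likelihoods, yaw_deg, tilt_deg):
    """Concatenate the gesture model likelihoods with the head pose angles."""
    return np.concatenate([np.asarray(gesture_likelihoods), [yaw_deg, tilt_deg]])

b_t = behaviour_vector([0.02, 0.18], yaw_deg=95.0, tilt_deg=80.0)
# b_t = [P(x(t)|m_point), P(x(t)|m_wave), theta(t), phi(t)]
```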

3 Interpretation of Group Activities

Although the individual interpretation of behaviours is possible by attaching pre-defined semantics in the form of camera control commands, the case of multiple subjects is not so simple due to the combinatorial explosion of possibilities. These possibilities not only include variations in which behaviours occur simultaneously, but also in their timing and duration. Clearly the range of possibilities and ambiguities present a problem for any single visual cue, no matter how accurately it can be computed. It is only by fusing different visual cues that we may successfully interpret the scene.

3.1 High-Level Group Behaviour Interpretation

Now we describe a methodology for machine understanding of group behaviours. Given the complexities of the unconstrained multi-person environment described above, we examine a more constrained situation. We assume a fixed number $N$ of people who remain in the scene at all times. Let us define the group vector to be the concatenation of the $N$ behaviour vectors of these people at time $t$:

$$\mathbf{g}(t) = \big[\, \mathbf{b}_1(t)^{\mathsf T}, \mathbf{b}_2(t)^{\mathsf T}, \ldots, \mathbf{b}_N(t)^{\mathsf T} \,\big]^{\mathsf T} \qquad (2)$$

The group vector is an overall description of the scene at a given time instant. Let us define a group behaviour as a temporal sequence of group vectors, $\{\mathbf{g}(t_1), \mathbf{g}(t_2), \ldots, \mathbf{g}(t_T)\}$.
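A minimal sketch of equation (2) and of a group behaviour sequence, assuming $N = 3$ subjects and placeholder behaviour vectors:

```python
import numpy as np

# Sketch of equation (2); the inputs are assumed to come from equation (1).
def group_vector(behaviour_vectors):
    """Concatenate the N per-subject behaviour vectors b_1(t), ..., b_N(t)."""
    return np.concatenate(behaviour_vectors)

def group_behaviour(per_frame_behaviours):
    """Stack group vectors over time: rows are g(t_1), ..., g(t_T)."""
    return np.stack([group_vector(frame) for frame in per_frame_behaviours])

# Example: T = 2 frames, each with three 4-dimensional behaviour vectors.
frames = [[np.zeros(4), np.zeros(4), np.zeros(4)] for _ in range(2)]
print(group_behaviour(frames).shape)   # -> (2, 12)
```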

Given a group behaviour, we introduce a high-level interpretation model to determine the current area of focus. Since the region of interest is almost always a person and we track the head of each individual, the output need only give an indication of which of the $N$ people are currently attended to. Therefore we define the output of the high-level system to be the camera position vector:

$$\mathbf{c}(t) = \big[\, c_1, c_2, \ldots, c_N \,\big]^{\mathsf T} \qquad (3)$$

where $c_i$ is a boolean value (0 or 1) indicating whether person $i$ is currently attended to. An interpretation can then be placed upon $\mathbf{c}(t)$ to control the movable camera. Examples are given in Table 1. Such a scheme would require at least two cameras, one to frame the whole scene for tracking of all individuals, and the other for taking close-up shots. Here we use a "virtual camera" by cropping focal regions from the global image. The high-level interpretation model must transform a recent history of group vectors into a camera vector for the current scene. However, without feedback to retain the previous focus of attention, the system will lack the context to correctly interpret behaviour. For instance, if a subject waves to gain the focus of attention, the camera vector must remain on the subject until another subject attracts attention. Without feedback, the subject would lose the focus of attention as soon as the gesture has ended. Therefore the general form of the high-level interpretation system $F(\cdot)$ is:

$$\mathbf{c}(t) = F\big(\mathbf{g}(t), \mathbf{c}(t-1)\big) = F\big(\mathbf{s}(t)\big) \qquad (4)$$

where $\mathbf{s}(t)$ is the scene vector at time $t$, defined as the concatenation of the current group vector and previous camera vector, $\mathbf{s}(t) = \big[\, \mathbf{g}(t)^{\mathsf T}, \mathbf{c}(t-1)^{\mathsf T} \,\big]^{\mathsf T}$.

Given this model, the high-level interpretation system must perform the translation from behaviours to focus of attention based on a fusion of external semantic definition and statistics of behaviours and their timings. The semantics may come from a set of rules, but an exhaustive specification of the system would be infeasible due to the multiplicity of possible co-occurring behaviours and their timings. We take a supervised learning approach: the system is trained on a set of example group behaviours, with the aim of generalising to new group behaviours. To learn the transformation from scene vector to camera position vector, we used a Time-Delay RBF Network [3], trained on half of our sequence database and tested on the other half.
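The sketch below illustrates the recurrent form of equation (4); `predict` is a stand-in for the trained time-delay RBF network $F$, and the dummy predictor in the usage example is purely illustrative. The point of the example is the feedback of $\mathbf{c}(t-1)$ through the scene vector, which retains the previous focus of attention.

```python
import numpy as np

def interpret_sequence(group_vectors, predict, n_people=3):
    """Run the recurrent interpretation c(t) = F([g(t); c(t-1)]) over a sequence.

    `predict` maps a scene vector s(t) to n_people real-valued outputs,
    thresholded here into a boolean camera position vector.
    """
    c_prev = np.zeros(n_people)                 # initially frame the whole scene
    outputs = []
    for g_t in group_vectors:
        s_t = np.concatenate([g_t, c_prev])     # scene vector [g(t); c(t-1)]
        c_t = (np.asarray(predict(s_t)) > 0.5).astype(float)
        outputs.append(c_t)
        c_prev = c_t                            # feedback retains focus of attention
    return np.stack(outputs)

# Usage with a dummy predictor that always attends to subject A.
dummy = lambda s: np.array([0.9, 0.1, 0.1])
print(interpret_sequence([np.zeros(12), np.zeros(12)], dummy))
```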

We constrain the complexity of the task by restricting the group behaviours to certain fixed scenarios. The exact timing of the events will vary between different instances of the same scenario, but the focus of attention should switch to the same regions at approximately the same times. Descriptions of example scenarios involving three subjects are given in Table 2. Several examples of each scenario were collected, and training examples were labelled by hand with a camera position vector for each scene vector. A high-level system consisting of a recurrent RBF network was trained on these examples and then tested on a different set of test instances of the same scenarios.

Figures 1–4 show examples of the system output for two example scenarios: wave-look and point. Figures 1 and 3 show temporally-ordered frames with boxes framing the head, face and hands being tracked. In each frame, head pose is shown above the head with an intuitive dial box. Figures 2 and 4 show the head pose angles (top) and gesture likelihoods (middle) for persons A, B and C (from left to right). One can see the correspondence of peaks in the gesture likelihoods with gesture events in the scenario.

c(t)       interpretation
[0,0,0]    frame whole scene
[1,0,0]    focus on subject A
[0,1,1]    focus on subjects B and C using a split-screen effect

Table 1. Example of possible interpretations of camera position vectors for three people.
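As one possible rendering of the interpretations in Table 1, the following sketch maps a camera position vector to a virtual-camera action; the `head_boxes` argument (one bounding box per subject) is a hypothetical tracker output introduced only for illustration and is not an interface stated in this paper.

```python
# Sketch: crop close-ups from the global image for attended subjects,
# otherwise frame the whole scene, as in the "virtual camera" described above.
def camera_action(c, head_boxes):
    attended = [i for i, on in enumerate(c) if on]
    if not attended:
        return "frame whole scene"
    if len(attended) == 1:
        return f"close-up crop of subject {attended[0]}: {head_boxes[attended[0]]}"
    return "split-screen crops: " + ", ".join(str(head_boxes[i]) for i in attended)

boxes = [(10, 20, 40, 40), (120, 22, 40, 40), (230, 25, 40, 40)]   # A, B, C
print(camera_action([0, 0, 0], boxes))   # frame whole scene
print(camera_action([0, 1, 1], boxes))   # split-screen of B and C
```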

scenario    description
wave-look   C waves and speaks, A waves and speaks, B waves and speaks. Each time someone is speaking, the other two subjects look at him.
point       C waves and speaks, A and B look at C, C points to A, C and B look at A, A looks at the camera and speaks.

Table 2. The example scenarios described in temporal order of their behaviours. All subjects are looking at the camera (forward) unless stated otherwise.

The bottom sections of Figures 2 and 4 show the training signal, or target camera vectors, traced above the actual output camera vectors obtained during tests with the trained RBF network. It can be seen that the network follows the general interpretation of group behaviour, though the exact points of transition from one focus of attention to another do not always coincide. These transitions are highly subjective and difficult to determine, even to the human eye.

4 Conclusion

Some key issues in visual interpretation of group behaviours in single views have been explored, and a framework has been presented for tracking people and recognising their correlated group behaviours in VMI contexts. Pre-defined gestures and the head pose of several individuals in the scene can be simultaneously recognised for scene interpretation. In the presence of multiple people, ambiguities arise and a high-level interpretation of the combined behaviours of the individuals becomes essential.

We have shown examples of how multi-person activity scenarios can be learned from training examples and interpolated to obtain the same interpretation for different instances of the same scenario. However, for the approach to scale up to more general applications, it must be able to cope with a whole range of scenarios, and extrapolate to novel situations in the same way as a person. A significant issue raised in this paper and for future work is the feasibility of learning correlated temporal structures and default behaviours from sparse data.

Since this system relies on several independent components, the overall probability of failure of at least one component is always quite high. Therefore the high-level interpretation system must be able to cope with missing or noisy inputs. The system outputs may be fed back to the low-level sub-systems to guide their processing.

References

[1] A. F. Bobick. Movement, activity, and action: The role of knowledge in the perception of motion. Proc. Royal Society London, Series B, 352:1257–1265, 1997.

[2] R. Cutler and M. Turk. View-based interpretation of real-time optical flow for gesture recognition. In Proc. FG'98, pp. 416–421, Nara, Japan, 1998.

[3] A. J. Howell and H. Buxton. Recognising simple behaviours using time-delay RBF networks. Neural Processing Letters, 5:97–104, 1997.

[4] A. J. Howell and H. Buxton. Learning gestures for visually mediated interaction. In Proc. BMVC, pp. 508–517, Southampton, UK, 1998. BMVA Press.

[5] S. J. McKenna and S. Gong. Gesture recognition for visually mediated interaction using probabilistic event trajectories. In Proc. BMVC, pp. 498–507, Southampton, UK, 1998.

[6] E. Ong, S. McKenna and S. Gong. Tracking head pose for inferring intention. In European Workshop on Perception of Human Action, Freiburg, June 1998.

[7] A. Pentland. Smart rooms. Scientific American, 274(4):68–76, 1996.

[8] C. Pinhanez and A. F. Bobick. Approximate world models: Incorporating qualitative and linguistic information into vision systems. In AAAI'96, Portland, Oregon, 1996.

[9] C. R. Wren and A. P. Pentland. Dynamic models of human motion. In Proc. FG'98, pp. 22–27, Nara, Japan, 1998.

Figure 1. Frames from wave-look sequence. Individuals are labelled A, B and C from left to right.

[Plots omitted: per-person yaw/tilt head pose angles (degrees) and gesture likelihoods p(point), p(wave) against time (seconds), with target and output camera position vectors below.]

Figure 2. Results for wave-look scenario. Plots show pose angles (top) for persons A, B and C from left to right, gesture likelihoods (middle) and target/output camera position vectors (bottom).

Figure 3. Frames from point sequence. Individuals are labelled A, B and C from left to right.

[Plots omitted: per-person yaw/tilt head pose angles (degrees) and gesture likelihoods p(point), p(wave) against time (seconds), with target and output camera position vectors below.]

Figure 4. Results for point scenario. Plots show pose angles (top) for persons A, B and C from left to right, gesture likelihoods (middle) and target/output camera position vectors (bottom).
