A Multimodal Analysis of Floor Control in Meetings

Lei Chen, Mary Harper, Amy Franklin, R. Travis Rose, Irene Kimbara, Zhongqiang Huang, Francis Quek

Video Analysis and Content Extraction (VACE)

MLMI, May 2, 2006

Multimodal Study: Floor Control

An underlying mechanism is employed to control the floor distribution among participants; understanding floor control in dialogs and meetings is helpful for discerning their structure.

In a meeting for which the current floor holder is known, it is interesting to predict whether floor control will change and who the next floor holder will be.

We investigate multimodal cues for floor control in two VACE meetings

Importance of Floor Control

The floor holder represents a primary thread in summarizing meetings, so identifying the primary channels (audio and visual) is important for:

Camera focus

Special-purpose signal processing

More natural, human-like conversational agents: using human conversational principles related to the distribution of floor control

Automatic meeting analysis: floor control information can contribute to revealing the topic flow and interaction patterns that emerge during meetings.

Prior Work

Conversation Analysis Research: Sacks et al. (1974) posited that a conversation is built on turn constructional units (TCUs), complete units with respect to intonation contours, syntax, and semantics. A transition relevance place (TRP) raises the likelihood that another speaker can take over the floor and start speaking. Many cues are used by participants to predict the end of TCUs (Duncan, 1972; Argyle and Cook, 1974).

Dialog-based Research: prosody (Caspers, 2000; Wichmann and Caspers, 2001), gaze (Novick et al., 1996), and dialog acts (Shriberg et al., 2004)

Multiparty Meetings: Padilha and Carletta (2002, 2003), Novick (2005), Vertegaal et al. (2001)

Meeting Collections: ISL audio corpus, ICSI audio corpus, NIST audio-visual corpus, and MM4 audio-visual corpus

VACE2 Meeting Room Data

Data and Annotations

Two VACE meetings:

Jan. 07: foreign weapon testing (41.6 minutes, 5 participants, 9,871 words)

March 18: scholarship selection (44.4 minutes, 5 participants, 7,547 words)

Multimodal annotation
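As a quick sanity check on the meeting statistics above, the average word rate of each meeting can be derived from the reported durations and word counts. This is only a minimal sketch: the dictionary layout and the function name `words_per_minute` are illustrative and not part of the original study.

```python
# Meeting statistics as reported on the slide above.
meetings = {
    "Jan 07": {"minutes": 41.6, "participants": 5, "words": 9871},
    "March 18": {"minutes": 44.4, "participants": 5, "words": 7547},
}


def words_per_minute(words: int, minutes: float) -> float:
    """Average number of transcribed words per minute of meeting time."""
    return words / minutes


# Word rate per meeting (hypothetical helper structure, for illustration only).
rates = {
    name: words_per_minute(stats["words"], stats["minutes"])
    for name, stats in meetings.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} words/min")
```

Under these figures the January meeting is noticeably denser in speech than the March meeting, although both are of similar length.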

January 7th Excerpt

March 18th Excerpt

Word and SU Annotations

Words (Purdue and U Chicago):

Segment into speech and non-speech chunks (IHM)

Transcribe speech chunks using LDC Quick Transcription (QTR) guidelines

Obtain time alignments given pronunciations for all words using ASR

SUs (Purdue):

Use EARS MDE annotation specification V6.2

SU segmentation

Type: statement, question, backchannel, incomplete

A hidden event LM was used to provide automatic SU hypotheses, which were then hand corrected

The Anvil interface allowed us to view time-aligned transcripts while consulting audio and video cues when annotating the sentences in the meetings.

Gesture and Gaze Annotations

Gesture and gaze coding was done on MacVissta under Mac OS X. The display and annotation tool supports the simultaneous display of multiple MPEG-4 videos (representing different camera angles) and enables the annotator to select an appropriate view from any of the videos to produce more accurate gaze/gesture coding.

Ten cameras were used to record the meeting participants from different viewing angles, supporting the annotation of each participant’s gaze direction and gestures.

Annotators had access to time-aligned word transcriptions and all of the videos when producing gaze and gesture annotations.

Gaze Annotations

Gaze was annotated by researchers in the McNeill Lab at U. Chicago

Gaze target plus start and end times were marked:

Based on markup of major saccades (intervals between fixations)

~3 frames of video (insufficient for micro saccades)

Segmentation of space into areas and objects, which we collapsed into: each participant, paper, table, whiteboard, neutral space, and other
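The collapsing of gaze targets described above could be represented roughly as follows. This is a sketch under assumptions: the record type `GazeInterval`, the helper `collapse_target`, and the raw label strings are hypothetical and do not reflect the actual MacVissta annotation format.

```python
from dataclasses import dataclass

# Non-participant categories named on the slide; anything else becomes "other".
FIXED_TARGETS = {"paper", "table", "whiteboard", "neutral space"}


@dataclass(frozen=True)
class GazeInterval:
    """One annotated gaze event: who looks at what, and when (in seconds)."""
    gazer: str
    target: str
    start: float
    end: float


def collapse_target(raw_label: str, participants: set) -> str:
    """Collapse a raw target label into a participant ID or a fixed category."""
    label = raw_label.strip()
    if label in participants:
        return label  # each participant is kept as a separate target
    lowered = label.lower()
    if lowered in FIXED_TARGETS:
        return lowered
    return "other"


# Hypothetical usage with the participant letters used in the charts.
participants = {"C", "D", "E", "F", "G"}
interval = GazeInterval("E", collapse_target("Whiteboard", participants), 12.0, 13.4)
print(interval)
```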

Gesture Annotations

Gesture was annotated by researchers in the McNeill Lab at U. Chicago

Gesture annotations that were annotated and used in our investigations:

Emblematic gestures: e.g., “thumb up” means “good” in some cultures.

Four gesticulation types were annotated and used in our investigations:

Metaphoric: e.g., gestures containing smooth, continuous motions (such as sweeping, arcing, or dragging) for continuous change.

Iconic: e.g., “and he bends it way back” while making an iconic gesture of appearing to grip something and pull it back.

Deictic: gestures used to point to entities during a communication.

Beat: simple rhythmical hand motions.

Note that fidget and instrumental movements are excluded.

Floor Annotations

Six types of floor annotations (Purdue):

Control: Who has control of the floor and which participants comprise the floor.

Sidebar: Used to represent sub-floors that have split off from the main thread of the meeting. Again, we record who has control and which participants are involved.

Backchannel: An SU type involving utterances like "yeah" that are spoken when another participant controls the floor.

Challenge: An attempt to grab the floor.

Cooperative: An utterance inserted into the middle of the floor controller's utterance (like a backchannel but with propositional content).

Other: Other vocalizations, e.g., self talk, that do not contribute to any current floor thread.

The Anvil interface allowed us to view time-aligned transcripts and SU annotations while consulting audio and video cues when annotating the floor events in the meetings.
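One way to hold the six floor annotation types in code, and to accumulate per-participant durations of the kind plotted later in the talk, is sketched below. The class names, field names, and example events are assumptions made for illustration; they are not the Anvil export format or the authors' tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class FloorEventType(Enum):
    CONTROL = "Control"
    SIDEBAR = "Sidebar"
    BACKCHANNEL = "Backchannel"
    CHALLENGE = "Challenge"
    COOPERATIVE = "Cooperative"
    OTHER = "Other"


@dataclass(frozen=True)
class FloorEvent:
    """A single floor annotation with times in seconds (hypothetical layout)."""
    event_type: FloorEventType
    speaker: str
    start: float
    end: float
    # Participants comprising the floor; relevant for Control and Sidebar.
    floor_members: frozenset = frozenset()

    @property
    def duration(self) -> float:
        return self.end - self.start


def cumulative_duration(events, event_type):
    """Total duration per speaker for one event type."""
    totals = defaultdict(float)
    for event in events:
        if event.event_type is event_type:
            totals[event.speaker] += event.duration
    return dict(totals)


# Made-up events, only to show the bookkeeping.
everyone = frozenset({"C", "D", "E", "F", "G"})
events = [
    FloorEvent(FloorEventType.CONTROL, "E", 0.0, 12.5, everyone),
    FloorEvent(FloorEventType.BACKCHANNEL, "C", 4.0, 4.4),
    FloorEvent(FloorEventType.CONTROL, "F", 12.5, 20.0, everyone),
    FloorEvent(FloorEventType.CONTROL, "E", 20.0, 23.0, everyone),
]
control_totals = cumulative_duration(events, FloorEventType.CONTROL)
print(control_totals)
```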

Cooperative Example

Challenge Example

Questions

Audio:

How frequently do verbal backchannels occur in meetings?

Are discourse markers (e.g., right, so, well) used more frequently in the beginning, middle, or end of a control event?

Gaze:

When a holder finishes his/her turn, does he/she gaze at the next floor holder more often than at other potential targets?

When a holder takes control of the floor, does he/she gaze at the previous floor holder more often than at other potential targets?

Do we observe frequent mutual gaze breaks between two adjacent floor holders during floor change?

Gesture:

How frequently does the previous floor holder make floor yielding gestures such as pointing to the next floor holder?

How frequently does the next floor holder make floor grabbing gestures to gain control of the floor?
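The first gaze question amounts to tallying where the current holder is looking at each floor transition. The sketch below uses the four target categories that appear in the gaze charts later in the talk (next holder, manager, others, no one). The function name, the precedence of "Next Holder" over "Manager" when the two coincide, and the example transitions are all assumptions, not the authors' procedure.

```python
from collections import Counter


def gaze_category(gaze_target, next_holder, manager):
    """Map the current holder's gaze target at a transition to a category."""
    if gaze_target is None:
        return "No one"
    if gaze_target == next_holder:
        return "Next Holder"  # assumed to take precedence over "Manager"
    if gaze_target == manager:
        return "Manager"
    return "Others"


# Made-up transitions: (gaze target of the current holder, next holder).
transitions = [("D", "D"), ("E", "F"), (None, "C"), ("G", "G"), ("C", "D")]

tally = Counter(
    gaze_category(target, next_holder, manager="E")
    for target, next_holder in transitions
)
print(tally)
```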

Measurement Study

Goals:

To gain insight into mechanisms governing floor control in meetings

To identify useful multimodal cues for an automatic floor control identification system

Measurements:

Basic meeting statistics

Speech events: verbal backchannels, discourse markers (DM)

Gaze events: gaze distribution at floor transitions, meeting manager’s gaze

Gesture events

[Figure: Jan 7 Meeting: Control Event Duration by Participant. Bar chart of cumulative duration (sec) for participants C, D, E, F, and G, broken down by event type: Control, Challenge, Backchannel, Sidebar-Control, Cooperative.]

[Figure: March 18 Meeting: Control Event Duration by Participant. Bar chart of cumulative duration (sec) for participants C, D, E, F, and G, broken down by event type: Control, Challenge, Backchannel, Sidebar-Control, Cooperative.]

[Figure: Jan 7 Meeting: Control Event Count and Participant. Bar chart of counts per control event type (Control, Challenge, Backchannel, Sidebar, Cooperative) for participants C, D, E, F, and G.]

[Figure: March 18 Meeting: Control Event Count and Participant. Bar chart of counts per control event for participants C, D, E, F, and G.]

Floor Transition Types

Change: there is a clear floor transition between two adjacent floor holders with some gap between adjacent floors.

Overlap: there is a clear floor transition between two adjacent floor holders, but the next holder begins talking before the previous holder stops speaking.

Stop: the previous floor holder clearly gives up the floor, and there is no intended next holder, so the floor is open to all participants.

Self-select: without being explicitly yielded the floor by the previous holder, a participant takes control of the floor.
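Of the four transition types, only Change and Overlap can be told apart from timing alone; Stop and Self-select depend on an annotator's judgment about yielding and intent. A minimal sketch of the timing part follows. The function name is hypothetical, and treating a zero-length gap as Change is an assumption, since the definition above only mentions "some gap".

```python
def timing_transition_type(prev_end: float, next_start: float) -> str:
    """Distinguish Change from Overlap using floor boundary times in seconds.

    Stop and Self-select are not derived here because they require a
    judgment about whether the floor was yielded and to whom.
    """
    if next_start < prev_end:
        return "Overlap"  # next holder starts before the previous one stops
    return "Change"       # assumption: a zero-length gap also counts as Change


# Made-up boundary times for two transitions.
print(timing_transition_type(prev_end=10.0, next_start=10.4))  # gap of 0.4 s
print(timing_transition_type(prev_end=10.0, next_start=9.7))   # 0.3 s of overlap
```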

Distribution of Floor Transition Types

[Figure: Distribution of Floor Transition Types. Bar chart of counts per transition type (Change, Overlap, Stop, Self-Select) for the Jan 7 and March 18 meetings.]

Verbal Backchannels and Nods

[Figure: SU Type Distribution and Nod Frequency. Bar chart of counts per event type (Statement, Question, Incomplete, Backchannel, Nods) for the Jan 7 and March 18 meetings.]

Discourse Markers

Event (plus Portion)                 Meeting     # DMs   Total Duration (sec)   DM/sec
Challenge                            Jan 7          22                  26.79     0.82
Challenge                            March 18       12                  18.65     0.64
Short Control (< 2 sec)              Jan 7          20                  52.41     0.38
Short Control (< 2 sec)              March 18       42                 111.04     0.38
Control Beginning (first 0.5 sec)    Jan 7          58                  70        0.83
Control Beginning (first 0.5 sec)    March 18       73                  82.5      0.88
Control Ending (last 0.5 sec)        Jan 7          13                  70        0.19
Control Ending (last 0.5 sec)        March 18       13                  82.5      0.16
Control Middle                       Jan 7         304                2155.5      0.14
Control Middle                       March 18      184                2092.67     0.09
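The DM/sec column in the table above is the number of discourse markers divided by the total duration in seconds. The sketch below recomputes it from the reported counts and durations, in the order the rows appear in the source; the variable and function names are illustrative only.

```python
# (meeting, number of DMs, total duration in seconds, reported DM/sec),
# listed in the order the values appear in the table.
rows = [
    ("Jan 7", 22, 26.79, 0.82),
    ("March 18", 12, 18.65, 0.64),
    ("Jan 7", 20, 52.41, 0.38),
    ("March 18", 42, 111.04, 0.38),
    ("Jan 7", 58, 70.0, 0.83),
    ("March 18", 73, 82.5, 0.88),
    ("Jan 7", 13, 70.0, 0.19),
    ("March 18", 13, 82.5, 0.16),
    ("Jan 7", 304, 2155.5, 0.14),
    ("March 18", 184, 2092.67, 0.09),
]


def dm_rate(count: int, duration: float) -> float:
    """Discourse markers per second over the given total duration."""
    return count / duration


recomputed = [round(dm_rate(count, duration), 2) for _, count, duration, _ in rows]
reported = [rate for *_, rate in rows]

for (meeting, count, duration, rate), value in zip(rows, recomputed):
    print(f"{meeting:9s} {count:4d} / {duration:8.2f} s = {value:.2f} (reported {rate:.2f})")
```

Each recomputed rate agrees with the reported value to two decimal places.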

Gaze Patterns: Current to Next Holder

[Figure: Eye Gaze of Current Floor Holder at Floor Transition. Bar chart of counts by meeting (Jan 7, March 18) and transition type (Change, Overlap, Stop); gaze targets: Next Holder, Manager, Others, No one.]

[Figure: Eye Gaze of Next Floor Holder at Floor Transition. Bar chart of counts by meeting (Jan 7, March 18) and transition type (Change, Overlap, Self-Select); gaze targets: Prior Holder, Manager, Others, No one.]

Mutual Gaze Break

Meeting Manager Role

The ostensible meeting manager for each meeting is participant E; however, participant E in the March 18 meeting does not appear to embrace that role.

In the Jan07 meeting, there were 53 floor exchanges (Change and Overlap only) in which E was neither the previous nor the next floor holder. In these 53 cases, E gazes at the next floor holder 21 times. If we rule out the cases in which other participants also look at the next floor holder, E still gazes at the next floor holder 11 times (20.75%), suggesting that the gaze of the meeting manager plays a role in predicting the next floor holder.

In the Mar18 meeting, there are 100 cases in which E is not a floor holder. In these 100 cases, E gazes at the next floor holder only 6 times. In fact, E tends to gaze largely at his papers or the whiteboard.
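The proportions behind the manager-gaze counts can be checked with a few lines of arithmetic. This is a minimal sketch using only the counts quoted above; the helper name `percent` and the variable names are illustrative.

```python
def percent(count: int, total: int) -> float:
    """Share of cases, expressed as a percentage."""
    return 100.0 * count / total


# Jan07: 53 exchanges in which E is neither the previous nor the next holder.
jan07_any = percent(21, 53)              # E gazes at the next holder
jan07_after_exclusion = percent(11, 53)  # after ruling out cases where others also look

# Mar18: 100 cases in which E is not a floor holder.
mar18_any = percent(6, 100)

print(f"Jan07, E gazes at next holder: {jan07_any:.2f}%")
print(f"Jan07, after exclusion:        {jan07_after_exclusion:.2f}%")
print(f"Mar18, E gazes at next holder: {mar18_any:.2f}%")
```

The second figure reproduces the 20.75% quoted on the slide, and the contrast with the Mar18 figure is consistent with the observation that E did not take on the manager role in that meeting.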

Gestures for Yielding and Grabbing the Floor

[Figure: Distribution of Floor Yielding and Grabbing Gestures Given Meeting and Transition Type. Bar chart of counts of Yield and Grab gestures for the Jan 7 and March 18 meetings.]

Conclusions

Presented a floor control annotation specification and conducted an analysis of two VACE meetings

Identified some multimodal cues that will be helpful for predicting floor control events:

DMs occur frequently at the beginning of a floor

The previous holder often gazes at the next floor holder and vice versa during floor transitions

The mutual gaze break patterns previously observed in dialogs are also found in the Jan07 meeting

An active meeting manager plays a role in floor transitions

Gestures, especially floor capturing gestures, play a role in floor transitions

Acknowledgements

Discussions with David McNeill and Susan Duncan at U Chicago, Liz Shriberg at ICSI/SRI, and Felicia Roberts at Purdue University

This work was supported by ARDA VACE II and by DARPA EARS and GALE
