Recording Meetings with the CMU Meeting Recorder Architecture
Satanjeev Banerjee et al.
School of Computer Science
Carnegie Mellon University
Goals
- End goal: build conversational agents that "understand" meetings (e.g., identify action items) and that make contributions to meetings (e.g., confirm details of action items)
  - Part of Project CALO: Cognitive Agent that Learns and Organizes
- First goal: create a corpus of human meetings
  - Capture the data that we expect agents to use, e.g., speech, video, whiteboard markings, etc.
Desirable Properties of the Recorder
- Need to record meetings anywhere
  - Emphasis on instrumenting the user, not the room
- Assume low network bandwidth
  - Should still be able to record in the extreme situation where there is no network access!
- Should be easy to add new data streams
  - "Easy" = low time to incorporate a new stream
- Should be able to support the major OSes
The Recorder Architecture
- An information stream is discretized into events: either a sequence of events (e.g., utterances) or one long event (e.g., video data)
- Each event is given start/end time stamps, which coincide for instantaneous events (e.g., a keystroke)
- Events are stored on local disks (laptops, shuttle PCs, etc.)
- Events are (slowly) uploaded to a central server when there is network access
Event Identification and Logging
- Each recorded event has the following identifying information associated with it (see the sketch below):
  - Start and stop time stamps
  - Name of the meeting and of the user
  - Modality (speech, video, handwriting, etc.)
- After recording an event, its identifying information is sent to a logging server
  - The server creates a list of all the events in a meeting
  - Good for bookkeeping (but not essential)
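As a concrete illustration of such an event record, here is a minimal Python sketch modeled on the example record shown in the architecture diagram on the next slide. The field names match that example, but the JSON wire format and the logging server's host and port are assumptions:

```python
from dataclasses import dataclass, asdict
import json
import socket

@dataclass
class MeetingEvent:
    session: str   # meeting name, e.g. "OTTER"
    user: str      # participant, e.g. "arudnicky"
    datatype: str  # modality: SPEECH, VIDEO, NOTES, ...
    file: str      # location of the recorded data on local disk
    start: str     # server-time start stamp
    end: str       # server-time end stamp (== start for instantaneous events)

def log_event(event: MeetingEvent,
              host: str = "logserver.example.edu", port: int = 9999) -> None:
    """Send the event's identifying information to the logging server.

    The JSON wire format and the host/port here are assumptions, not the
    protocol described in the deck.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(json.dumps(asdict(event)).encode("utf-8"))
```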
Architecture of Meeting Recorder

[Figure: Participants 1, 2, and 3 each run a recording client, synchronized against a time server; a [master] list of events from P1, P2, and P3 supports browsing the meeting. Example event record from the figure:]

{DATA_BLOCK
  session: OTTER
  user: arudnicky
  datatype: SPEECH
  file: \\spot\data\u1.raw
  Start: 20030917::18:27.600
  End: 20030917::18:35.357}
Synchronizing the Time Stamps
- All event time stamps must be synchronized
- We use the Simple Network Time Protocol (SNTP), sketched below:
  - Query a central NTP server for the time
  - Use the reply and the round-trip time to estimate the time difference between the local machine and the server
  - Use this offset to create server-time time stamps
- Rough experiments reveal about 10 ms variance
  - Caveat: these experiments were done on a high-speed network
- What if there is *no* network access?
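The offset estimate is the standard SNTP calculation: if t0 and t3 are the local send and receive times, and t1 and t2 are the server's receive and transmit times, the offset is ((t1 - t0) + (t2 - t3)) / 2. A minimal self-contained sketch follows; the server name is a placeholder, and the packet layout follows the SNTP specification (RFC 4330):

```python
import socket
import struct
import time

NTP_DELTA = 2208988800  # seconds between the NTP epoch (1900) and Unix epoch (1970)

def sntp_offset(server: str = "ntp.example.edu", port: int = 123) -> float:
    """Estimate the local clock's offset from an NTP server, in seconds."""
    packet = b"\x23" + 47 * b"\x00"  # LI=0, version=4, mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(2.0)
        t0 = time.time()                  # local send time
        sock.sendto(packet, (server, port))
        data, _ = sock.recvfrom(512)
        t3 = time.time()                  # local receive time
    sec1, frac1 = struct.unpack("!II", data[32:40])  # server receive stamp
    sec2, frac2 = struct.unpack("!II", data[40:48])  # server transmit stamp
    t1 = sec1 - NTP_DELTA + frac1 / 2**32
    t2 = sec2 - NTP_DELTA + frac2 / 2**32
    return ((t1 - t0) + (t2 - t3)) / 2    # round-trip delay cancels out

def server_time(offset: float) -> float:
    """Server-time time stamp: local clock plus a cached offset estimate."""
    return time.time() + offset
```

In practice the offset would be estimated once (or periodically) and cached, rather than queried for every time stamp.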
Aggregating the Data
- When network access is available, data is transferred from all sites to a central location
  - Current recording sites: CMU and Stanford
- Implemented a cross-platform version of the Microsoft Background Intelligent Transfer Service (sketched below); it
  - uploads files in a transparent background process,
  - throttles bandwidth use as the user's activity goes up,
  - pauses if the network connection is lost, and
  - resumes once network access is restored
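The deck does not describe the transfer protocol, so the following only sketches the pause/resume/throttle behavior over a hypothetical HTTP endpoint. The X-Received-Bytes header and the fixed inter-chunk delay are invented for illustration; real BITS-style clients adapt the delay to user activity:

```python
import os
import time
import requests  # third-party HTTP client; any would do

CHUNK = 64 * 1024  # bytes per upload chunk

def background_upload(path: str, url: str, delay: float = 0.05) -> None:
    """Throttled, resumable upload loop (hypothetical server API)."""
    size = os.path.getsize(path)
    while True:
        try:
            # Ask the server how much it already has (invented header)
            done = int(requests.head(url).headers.get("X-Received-Bytes", 0))
            with open(path, "rb") as f:
                f.seek(done)
                while done < size:
                    chunk = f.read(CHUNK)
                    hdr = {"Content-Range":
                           f"bytes {done}-{done + len(chunk) - 1}/{size}"}
                    requests.put(url, data=chunk, headers=hdr)
                    done += len(chunk)
                    time.sleep(delay)  # crude throttle; BITS adapts to user activity
            return  # whole file transferred
        except requests.ConnectionError:
            time.sleep(30)  # network lost: pause, then resume where we left off
```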
Data Collection Process (proposed)

[Figure: data flow. Independent cross-site collection → background data transmission → MEETING DATABASE (preparation) → transcription and annotation (integration) → learning and analysis → CALO (research).]
Capturing Close-Talking Speech
- Implemented Meeting Recorder Cross Platform (MRCP) to record speech and notes
- Speech is recorded using head-mounted mics
- An 11.025 kHz sampling rate is used for portability
- Endpointing is done using the CMU Sphinx 3 ASR system (an illustrative sketch follows)
- Each endpointed utterance is an event:
  - The utterance is recorded to local disk (wav format)
  - Time stamps are generated using Simple NTP
  - The utterance's identifying information is sent to the logging server, and the utterance is queued for upload
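Endpointing here is credited to CMU Sphinx 3; purely to illustrate what an endpointer does, below is a minimal energy-threshold sketch over 16-bit samples at 11.025 kHz. The threshold and frame sizes are arbitrary assumptions, not the Sphinx algorithm:

```python
import numpy as np

FRAME = 256        # samples per frame (~23 ms at 11.025 kHz)
THRESH = 500.0     # RMS energy threshold -- an arbitrary assumption
MIN_SILENCE = 15   # silent frames that close an utterance (~350 ms)

def endpoint(samples: np.ndarray):
    """Yield (start, end) sample indices of utterances in 16-bit audio.

    Illustrative energy-based endpointing only; the actual system used
    the endpointer in CMU Sphinx 3.
    """
    in_speech, start, silent = False, 0, 0
    n_frames = len(samples) // FRAME
    for i in range(n_frames):
        frame = samples[i * FRAME:(i + 1) * FRAME].astype(np.float64)
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms >= THRESH:
            if not in_speech:
                in_speech, start = True, i * FRAME
            silent = 0
        elif in_speech:
            silent += 1
            if silent >= MIN_SILENCE:
                yield start, (i + 1) * FRAME  # one utterance = one event
                in_speech = False
    if in_speech:
        yield start, n_frames * FRAME
```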
Capturing Typed Notes
- Users type notes in the client's note-taking area
- "Snapshots" of the notes are taken at each carriage return (see the sketch below)
- Each snapshot is an event
- Each snapshot is saved to disk, time-stamped, logged, and queued for upload
[Demonstration of MRCP]
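A minimal sketch of the snapshot-per-carriage-return idea, with the console standing in for the GUI note-taking area; the file naming and the event plumbing are assumptions:

```python
import sys
import time

def capture_notes(out_prefix: str = "notes") -> None:
    """Snapshot the full note buffer at every carriage return.

    Console stand-in for the GUI client: each input line triggers one
    instantaneous event (start == end).
    """
    buffer = []
    for n, line in enumerate(sys.stdin):
        buffer.append(line)
        stamp = time.time()  # in practice, local clock + SNTP offset
        path = f"{out_prefix}.{n:04d}.txt"
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(buffer)  # the snapshot is the entire note so far
        # here the snapshot would be time-stamped with `stamp`, logged,
        # and queued for upload like any other event
```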
More Details about MRCP
- Implemented using cross-platform libraries:
  - wxWidgets for the GUI, file access, and networking
  - PortAudio for the audio libraries
- Currently compiles on the Windows, Macintosh OS X, and Linux operating systems
- The Windows version has been distributed to other Project CALO sites
- Macintosh and Linux versions are in beta testing
- A WinCE version is in development
Capturing Whiteboard Pen Strokes
- We use Mimio to capture whiteboard pen strokes
- A "stroke" consists of all the x-y coordinates between pen-down and pen-up (see the sketch below)
- Each stroke is an event: it is recorded, time-stamped, logged, and queued for upload
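A sketch of how pen samples might be accumulated into one stroke event; the pen_down/pen_move/pen_up callbacks and coordinate units are stand-ins for whatever the Mimio driver actually delivers:

```python
import time

class StrokeCapture:
    """Accumulate x-y samples between pen-down and pen-up into one event."""

    def __init__(self):
        self.points, self.start = [], None

    def pen_down(self, x: float, y: float) -> None:
        self.start = time.time()  # stroke start stamp (+ SNTP offset in practice)
        self.points = [(x, y)]

    def pen_move(self, x: float, y: float) -> None:
        if self.start is not None:
            self.points.append((x, y))

    def pen_up(self, x: float, y: float) -> dict:
        self.points.append((x, y))
        event = {"datatype": "STROKE", "start": self.start,
                 "end": time.time(), "points": self.points}
        self.start = None
        return event  # caller logs the event and queues it for upload
```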
Capturing PowerPoint Slide Information
- We use Microsoft's PowerPoint API to capture slide-change timing information and slide contents (sketched below)
- Events = slide changes
- Event data = the content of the new slide
  - Content takes the form of all the text and all the "shapes" on the slide
- Events are instantaneous: start and stop time stamps coincide
- Events are processed as before
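On Windows, slide-change notifications can be received from the PowerPoint object model via COM. A hedged sketch using pywin32: the SlideShowNextSlide event is part of the PowerPoint API, but the event-record shape and the logging plumbing are assumptions, and this may well differ from the authors' implementation:

```python
# Windows-only sketch using pywin32 COM events.
import time
import pythoncom
import win32com.client

class SlideEvents:
    def OnSlideShowNextSlide(self, Wn):
        slide = Wn.View.Slide
        stamp = time.time()  # instantaneous event: start == end
        texts = [shape.TextFrame.TextRange.Text
                 for shape in slide.Shapes if shape.HasTextFrame]
        event = {"datatype": "SLIDE", "start": stamp, "end": stamp,
                 "slide": slide.SlideIndex, "text": texts}
        print(event)  # here: log the event and queue the content for upload

app = win32com.client.DispatchWithEvents("PowerPoint.Application", SlideEvents)
while True:
    pythoncom.PumpWaitingMessages()  # deliver pending COM events
    time.sleep(0.1)
```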
Capturing Panoramic Video
- We capture panoramic video using the 4-camera CAMEO device, developed by the Physical Awareness group at CMU
- Video recording is done in MPEG-4 format
- One long event is produced and uploaded
Current Status of Data Collection
- Recorded meetings vary widely in size...
  - From 2-person to 10-person meetings
- ...in meeting type
  - Scheduling meetings, presentations, brainstorms
- ...and in content
  - Speech group, dialog group, and physical awareness group meetings
- We currently have a total of more than 11,000 utterances (including cross-talk)
Using the Data: Some Initial Research
- Question: can we detect the state of a meeting, and the roles of its participants, from simple speech data?
- We introduced a taxonomy of meeting states and participant roles:

  Meeting State | Participant Roles
  ------------- | ------------------------------
  Presentation  | Presenter, Observer
  Briefing      | Information producer/consumer
  Discussion    | Participator, Observer
Detection Methods and Initial Results
- Used Anvil to hand-annotate 45 minutes of meeting video with states and roles
- Trained a decision-tree classifier on 30 minutes of that data (a sketch follows)
- Input features: the number of speakers, the lengths of utterances, and the pauses and interruptions within a short history of the meeting
- Initial results: about 50% detection accuracy on the separate 15 minutes of test data
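The deck does not name the decision-tree toolkit, so as one way to reproduce the setup, here is a minimal scikit-learn sketch. The feature rows are made up, but follow the feature list above (number of speakers, utterance lengths, pauses, interruptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row summarizes a short window of the meeting:
# [n_speakers, mean_utterance_len_s, n_pauses, n_interruptions]
X_train = np.array([[1, 12.0, 2, 0],   # presentation-like window
                    [3,  2.5, 5, 3],   # discussion-like window
                    [2,  6.0, 3, 1]])  # briefing-like window
y_train = np.array(["presentation", "discussion", "briefing"])

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

X_test = np.array([[4, 1.8, 6, 4]])    # held-out window
print(clf.predict(X_test))             # e.g. ['discussion']
```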
Questions?
Thanks to DARPA grant NBCH-D-02-0010