Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop July 13, 2005 Royal College of Physicians Edinburgh, UK


Page 1

Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop

July 13, 2005

Royal College of Physicians Edinburgh, UK

Page 2

Today’s Agenda

Page 3

Administrative Points

• Participants:
  – Pick up the hard-copy proceedings at the front desk
• Presenters:
  – The agenda will be strictly followed
    • Time slots include Q&A time
  – Presenters should either:
    • Load their presentations on the computer at the front, or
    • Test their laptops during the breaks prior to making their presentation
• We’d like to thank:
  – The MLMI-05 organizing committee for hosting this workshop
  – Caroline Hastings for the workshop’s administration
  – All the volunteers: evaluation participants, data providers, transcribers, annotators, paper authors, presenters, and other contributors

Page 4

The Rich Transcription 2005 Spring Meeting Recognition Evaluation

Jonathan Fiscus, Nicolas Radde, John Garofolo, Audrey Le, Jerome Ajot, Christophe Laprun

July 13, 2005

Rich Transcription 2005 Spring Meeting Recognition Workshop at MLMI 2005

http://www.nist.gov/speech/tests/rt/rt2005/spring/

Page 5

Overview

• Rich Transcription Evaluation Series

• Research opportunities in the Meeting Domain

• RT-05S Evaluation
  – Audio input conditions
  – Corpora
  – Evaluation tasks and results

• Conclusion/Future

Page 6

The Rich Transcription Task

[Diagram: Human-to-Human Speech → Component Recognition Technologies → RICH TRANSCRIPTION (Speech-To-Text + Metadata) → Multiple Applications: Smart Meeting Rooms, Translation, Extraction, Retrieval, Summarization, Readable Transcripts]

Page 7

Rich Transcription Evaluation Series

• Goal:
  – Develop recognition technologies that produce transcripts which are understandable by humans and useful for downstream processes
• Domains:
  – Broadcast News (BN)
  – Conversational Telephone Speech (CTS)
  – Meeting Room speech
• Parameterized “Black Box” evaluations
  – Evaluations control input conditions to investigate weaknesses/strengths
  – Sub-test scoring provides finer-grained diagnostics

Page 8

Research Opportunities in the Meeting Domain

• Provide a fertile environment to advance the state of the art in technologies for understanding human interaction
• Many potential applications
  – Meeting archives, interactive meeting rooms, remote collaborative systems
• Important Human Language Technology challenges not posed by other domains
  – Varied forums and vocabularies
  – Highly interactive and overlapping spontaneous speech
  – Far-field speech effects
    • Ambient noise
    • Reverberation
    • Participant movement
  – Varied room configurations
    • Many microphone conditions
    • Many camera views
  – Multimedia information integration
    • Person, face, and head detection/tracking

Page 9

RT-05S Evaluation Tasks

• Focus on core speech technologies
  – Speech-to-Text Transcription
  – Diarization “Who Spoke When”
  – Diarization “Speech Activity Detection”
  – Diarization “Source Localization”

Page 10

Five System Input Conditions

• Distant microphone conditions
  – Multiple Distant Microphones (MDM)
    • Three or more centrally located table mics
  – Multiple Source Localization Arrays (MSLA)
    • Inverted “T” topology, 4-channel digital microphone array
  – Multiple Mark III digital microphone Arrays (MM3A)
    • Linear topology, 64-channel digital microphone array
• Contrastive microphone conditions
  – Single Distant Microphone (SDM)
    • Center-most MDM microphone
    • Gauges the performance benefit of using multiple table mics
  – Individual Head Microphones (IHM)
    • Performance on clean speech
    • Similar to Conversational Telephone Speech
      – One speaker per channel, conversational speech

Page 11

Training/Development Corpora

• Corpora provided at no cost to participants
  – ICSI Meeting Corpus
  – ISL Meeting Corpus
  – NIST Meeting Pilot Corpus
  – Rich Transcription 2004 Spring (RT-04S) Development & Evaluation Data
  – Topic Detection and Tracking Phase 4 (TDT4) corpus
  – Fisher English conversational telephone speech corpus
  – CHIL development test set
  – AMI development test set and training set
• Thanks to ELDA and LDC for making this possible

Page 12

RT-05S Evaluation Test Corpora: Conference Room Test Set

• Goal-oriented small conference room meetings
  – Group meetings and decision-making exercises
  – Meetings involved 4-10 participants
• 120 minutes
  – Ten excerpts, each twelve minutes in duration
  – Five sites donated two meetings each:
    • Augmented Multiparty Interaction (AMI) Program, Carnegie Mellon University (CMU), International Computer Science Institute (ICSI), NIST, and Virginia Tech (VT)
  – No VT data was available for system development
  – Similar test set construction was used for the RT-04S evaluation
• Microphones:
  – Participants wore head microphones
  – Microphones were placed on the table among participants
  – AMI meetings included an 8-channel circular microphone array on the table
  – NIST meetings included 3 Mark III digital microphone arrays

Page 13

RT-05S Evaluation Test Corpora: Lecture Room Test Set

• Technical lectures in small meeting rooms
  – Educational events where a single lecturer briefs an audience on a particular topic
  – Meeting excerpts involve one lecturer and up to five participating audience members
• 150 minutes
  – 29 excerpts from 16 lectures
  – Two types of excerpts, selected by CMU:
    • Lecturer excerpts – 89 minutes, 17 excerpts
    • Question & Answer (Q&A) excerpts – 61 minutes, 12 excerpts
• All data collected at Karlsruhe University
• Sensors:
  – The lecturer and at most two other participants wore head microphones
  – Microphones were placed on the table among participants
  – A source localization array was mounted on each of the room’s four walls
  – A Mark III array was mounted on the wall opposite the lecturer

Page 14

RT-05S Evaluation Participants

Site ID   | Site Name                                                       | Evaluation Task(s)
AMI       | Augmented Multiparty Interaction Program                        | STT
ICSI/SRI  | International Computer Science Institute and SRI International  | STT, SPKR
ITC-irst  | Center for Scientific and Technological Research                | SLOC
KU        | Karlsruhe University                                            | SLOC
ELISA     | ELISA Consortium (LIA, CLIPS, LIUM)                             | SPKR, SAD
MQU       | Macquarie University                                            | SPKR
Purdue    | Purdue University                                               | SAD
TNO       | The Netherlands Organisation for Applied Scientific Research    | SPKR, SAD
TUT       | Tampere University of Technology                                | SLOC

ELISA Consortium: Laboratoire Informatique d'Avignon (LIA), Communication Langagière et Interaction Personne-Système (CLIPS), and LIUM

Page 15

Diarization “Who Spoke When” (SPKR) Task

• Task definition
  – Identify the number of participants in each meeting and create a list of speech time intervals for each such participant
• Several input conditions:
  – Primary: MDM
  – Contrast: SDM, MSLA

• Four participating sites: ICSI/SRI, ELISA, MQU, TNO

Page 16

SPKR System Evaluation Method

• Primary Metric
  – Diarization Error Rate (DER) – the ratio of incorrectly detected speaker time to total speaker time
    • System output speaker segment sets are mapped to reference speaker segment sets so as to minimize the total error
    • Errors consist of:
      – Speaker assignment errors (i.e., detected speech but not assigned to the right speaker)
      – False alarm detections
      – Missed detections
• Systems were scored using the mdeval tool
  – Forgiveness collar of +/- 250 ms around reference segment boundaries
• DER on non-overlapping speech is the primary metric
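The error arithmetic above can be sketched in a few lines. This is an illustrative sketch, not NIST's mdeval tool: it assumes the audio has already been discretized onto a common frame grid and that the error-minimizing system-to-reference speaker mapping has already been applied (mdeval handles both internally, along with the +/- 250 ms forgiveness collar, which is omitted here).

```python
# Illustrative sketch of the DER arithmetic (not NIST's mdeval tool).
# Assumes a common frame grid and an already-applied optimal speaker
# mapping; the +/- 250 ms forgiveness collar is omitted.

def der(ref_frames, hyp_frames):
    """Per-frame speaker labels (None = silence).
    Returns DER as a fraction of total reference speaker time."""
    missed = false_alarm = speaker_error = scored = 0
    for r, h in zip(ref_frames, hyp_frames):
        if r is not None:
            scored += 1
            if h is None:
                missed += 1          # missed detection
            elif h != r:
                speaker_error += 1   # speaker assignment error
        elif h is not None:
            false_alarm += 1         # false alarm detection
    return (missed + false_alarm + speaker_error) / scored

# 10 ms frames: speaker A, silence, then speaker B
ref = ['A'] * 50 + [None] * 10 + ['B'] * 40
hyp = ['A'] * 45 + [None] * 15 + ['B'] * 30 + ['A'] * 10
print(der(ref, hyp))  # (5 missed + 10 speaker-error) / 90 ≈ 0.167
```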

Page 17

RT-05S SPKR Results: Primary Systems, Non-Overlapping Speech

• Conference room SDM DER is less than MDM DER
  – A sign test indicates the differences are not significant
• The primary ICSI/SRI Lecture Room system attributed the entire duration of each test excerpt to a single speaker
  – The ICSI/SRI contrastive system had a lower DER

[Bar chart: DER by microphone condition (Conference Room: MDM, SDM; Lecture Room: MDM, MSLA, SDM) for the ICSI/SRI, ELISA, MQU, and TNO systems; y-axis 0-60]

Page 18

Lecture Room Results: Broken Down by Excerpt Type

• Lecturer excerpt DERs are lower than Q&A excerpt DERs

[Bar chart: DER (%) for Lecturer Excerpts, Q&A Excerpts, and All Data for the ICSI/SRI Contrastive, ELISA, and ICSI/SRI Primary systems; y-axis 0-35]

Page 19

Historical Best System SPKR Performance on Conference Data

• 20% relative reduction for MDM

• 43% relative reduction for SDM

DER (%)  | MDM  | SDM
RT-04S   | 23.3 | 27.5
RT-05S   | 18.6 | 15.3

Page 20

Diarization “Speech Activity Detection” (SAD) Task

• Task definition
  – Create a list of speech time intervals where at least one person is talking
• Dry run evaluation for RT-05S
  – Proposed by CHIL
• Several input conditions:
  – Primary: MDM
  – Contrast: SDM, MSLA, IHM
• Systems designed for the IHM condition must detect speech and also reject cross-talk speech and breath noises; therefore, IHM systems are not directly comparable to MDM or SDM systems
• Three participating sites: ELISA, Purdue, TNO

Page 21

SAD System Evaluation Method

• Primary metric
  – Diarization Error Rate (DER)
    • Same formula and software as used for the SPKR task
    • Reduced to a two-class problem: speech vs. non-speech
    • No speaker assignment errors, just false alarms and missed detections
  – Forgiveness collar of +/- 250 ms around reference segment boundaries
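Reduced to two classes, the DER computation simplifies to missed-speech time plus false-alarm time over total reference speech time. A minimal frame-based sketch (omitting the forgiveness collar that the real scorer applies):

```python
# Minimal frame-based sketch of two-class SAD scoring (speech vs.
# non-speech). The +/- 250 ms forgiveness collar is omitted.

def sad_der(ref, hyp, step=0.01):
    """ref, hyp: lists of (start, end) speech intervals in seconds."""
    end_time = max(e for _, e in ref + hyp)
    covered = lambda t, ivs: any(s <= t < e for s, e in ivs)
    missed = fa = speech = 0
    for i in range(int(round(end_time / step))):
        t = (i + 0.5) * step                    # frame midpoint
        r, h = covered(t, ref), covered(t, hyp)
        speech += r
        missed += r and not h
        fa += h and not r
    return (missed + fa) / speech

ref = [(0.0, 2.0), (3.0, 5.0)]   # 4 s of reference speech
hyp = [(0.0, 1.5), (2.5, 5.0)]   # misses 0.5 s, false-alarms 0.5 s
print(sad_der(ref, hyp))  # (0.5 + 0.5) / 4.0 = 0.25
```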

Page 22

RT-05S SAD Results: Primary Systems

• DERs for conference and lecture room MDM data are similar
• Purdue didn’t compensate for breath noise and crosstalk

[Bar chart: SAD DER by condition (Conference Room: MDM, IHM; Lecture Room: MDM) for the ELISA, Purdue, and TNO systems; values shown include 7.42%, 6.59%, 26.97%, and 5.04%; y-axis 0-30%]

Page 23

Speech-To-Text (STT) Task

• Task definition
  – Systems output a single stream of time-tagged word tokens
• Several input conditions:
  – Primary: MDM
  – Contrast: SDM, MSLA, IHM

• Two participating sites: AMI and ICSI/SRI

Page 24

STT System Evaluation Method

• Primary metric
  – Word Error Rate (WER) – the ratio of inserted, deleted, and substituted words to the total number of words in the reference
    • System and reference words are normalized to a common form
    • System words are mapped to reference words using a word-mediated dynamic programming string alignment program
• Systems were scored using the NIST Scoring Toolkit (SCTK) version 2.1
  – A Spring 2005 update to the SCTK alignment tool can now score most of the overlapping speech in the distant microphone test material
    • Can now handle up to 5 simultaneous speakers
      – 98% of the Conference Room test set can be scored
      – 100% of the Lecture Room test set can be scored
  – Greatly improved over the Spring 2004 prototype
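The core of the WER metric is standard dynamic-programming string alignment over words. A minimal sketch of that arithmetic (the real SCTK scorer additionally normalizes word forms and handles overlapping speakers):

```python
# Sketch of WER via word-level dynamic-programming alignment
# (Levenshtein distance); only the core arithmetic of the
# SCTK-style scoring is shown.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = min edits turning the first i ref words
    # into the first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                  # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                  # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub / match
                d[i - 1][j] + 1,                           # deletion
                d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / len(r)

# one substitution ("the" -> "a") + one deletion ("at"): 2 / 5 words
print(wer("the meeting starts at noon", "a meeting starts noon"))  # 0.4
```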

Page 25

RT-05S STT Results: Primary Systems (Incl. Overlaps)

• First evaluation for the AMI team
• IHM error rates for conference and lecture room data are comparable
• ICSI/SRI lecture room MSLA WER is lower than MDM/SDM WER

[Bar chart: WER by microphone condition (Conference Room: MDM, SDM, IHM; Lecture Room: MDM, SDM, IHM, MSLA) for the AMI and ICSI/SRI systems; y-axis 0-60%]

Page 26

Historical STT Performance in the Meeting Domain

• Performance for ICSI/SRI has dramatically improved for all conditions

[Bar chart: WER for RT-04S vs. RT-05S by condition (MDM and SDM: 1 speaker and <= 5 speakers; IHM: all); y-axis 0-60]

Page 27

Diarization “Source Localization” (SLOC) Task

• Task definition
  – Systems track the three-dimensional position of the lecturer (using audio input only)
  – Constrained to the lecturer subset of the Lecture Room test set
  – Evaluation protocol and metrics defined in the CHIL “Speaker Localization and Tracking – Evaluation Criteria” document
• Dry run pilot evaluation for RT-05S
  – Proposed by CHIL
  – CHIL provided the scoring software and annotated the evaluation data
• One evaluation condition
  – Multiple source localization arrays
    • Required calibration of source localization microphone positions and video cameras
• Three participating sites: ITC-irst, KU, TUT

Page 28

SLOC System Evaluation Method

• Primary Metric:
  – Root Mean Squared Error (RMSE) – a measure of the average Euclidean distance between the reference speaker position and the system-determined speaker position
    • Measured in millimeters at 667 ms intervals
• IRST SLOC scoring software

► Maurizio Omologo will give further details this afternoon
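The RMSE metric can be sketched directly from its definition. This simplified version treats every scoring instant alike, whereas the official IRST scorer additionally separates fine from gross localization errors:

```python
# Direct sketch of the SLOC metric: root mean squared Euclidean
# distance in millimeters between reference and hypothesized 3-D
# speaker positions, one pair per scoring instant (every 667 ms).
# The fine/gross error split of the official IRST scorer is omitted.
import math

def rmse_mm(ref_pts, hyp_pts):
    """ref_pts, hyp_pts: lists of (x, y, z) positions in mm."""
    sq = [sum((a - b) ** 2 for a, b in zip(r, h))
          for r, h in zip(ref_pts, hyp_pts)]
    return math.sqrt(sum(sq) / len(sq))

ref = [(0, 0, 0), (1000, 0, 0)]
hyp = [(300, 400, 0), (1000, 0, 500)]  # 500 mm off at each instant
print(rmse_mm(ref, hyp))  # 500.0
```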

Page 29

RT-05S SLOC Results: Primary Systems

• Issues:
  – What accuracy and resolution are needed for successful beamforming?
  – What will performance be for multiple speakers?

[Bar chart: RMSE (fine + gross), in mm, for the ITC-irst, KU, and TUT systems; y-axis 0-900]

Page 30

Summary

• Nine sites participated in the RT-05S evaluation
  – Up from six in RT-04S
• Four evaluation tasks were supported across two meeting sub-domains:
  – Two experimental tasks, SAD and SLOC, were successfully completed
  – Dramatically lower STT and SPKR error rates for RT-05S

Page 31

Issues for the RT-06 Meeting Evaluation

• Domain
  – Sub-domains
• Tasks
  – Require at least three sites per task
  – Agreed-upon primary condition for each task
• Data contributions
  – Source data and annotations
• Participation intent
• Participation commitment
• Decision-making process
  – Only sites with intent to participate will have input to the task definition