
Phonetic and Lexical Dissection of the

Hub-5 Speech Recognition Evaluation

Steven Greenberg, Shawn Chang and Joy Hollenback
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704

Speech Transcription Workshop, College Park, MD - May 18, 2000

OR…

An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements


Acknowledgements and Thanks

• EVALUATION DESIGN
– George Doddington
– Jack Godfrey

• ANALYSIS
– George Doddington
– Hollis Fitch
– Joe Kupin
– Rosaria Silipo

• SC-LITE SUPPORT
– Jon Fiscus

• DATA SUBMISSION
– AT&T
– BBN
– DRAGON SYSTEMS
– CAMBRIDGE UNIVERSITY
– JOHNS HOPKINS UNIVERSITY
– MISSISSIPPI STATE UNIVERSITY
– SRI
– UNIVERSITY OF WASHINGTON

• SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES ARE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION USING BOTH FREE AND FORCED-ALIGNMENT-BASED RECOGNITION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL

• THIS EVALUATION REQUIRED THE CONVERSION OF THE ORIGINAL SUBMISSION MATERIAL TO A COMMON REFERENCE FORMAT
– The common format is required for scoring with SC-Lite and for performing certain types of statistical analyses
– Key to the conversion is the use of TIME-MEDIATED parsing, which provides the capability of assigning different outputs to the same reference unit (be it word, phone or other)

• THE ASR SYSTEM RESULTS HAVE BEEN SCORED USING A 1-HOUR SUBSET OF THE SWITCHBOARD TRANSCRIPTION CORPUS
– This corpus has been hand-labeled at the phone, syllable, word and prosodic-stress levels and hand-segmented at the syllabic and word levels. 25% of the material was hand-segmented at the phone level and the remainder quasi-automatically segmented into phonetic segments (and verified)
– The phonetic segments of each site have been mapped to a common reference set, enabling a detailed analysis of the phone and word errors for each site that would otherwise be difficult to perform

Take Home Messages - 1

• FORCED-ALIGNMENT RECOGNITION RESULTS FROM SIX SITES HAVE ALSO BEEN ANALYZED
– Possessing both the free and forced-alignment-based recognition results provides the possibility of performing analyses focused on the relation of the acoustic models to those pertaining to pronunciation and to statistical co-occurrence patterns of lexical units

• THE RECOGNITION MATERIAL HAS BEEN TAGGED AT THE PHONE AND WORD LEVELS WITH ~40 SEPARATE LINGUISTIC PARAMETERS
– This information pertains to the acoustic, phonetic, lexical, utterance and speaker characteristics of the material and is formatted into “BIG LISTS”

• ALL OF THE RECOGNITION MATERIAL AND CONVERTED FILES, AS WELL AS THE SUMMARY (“BIG”) LISTS, ARE AVAILABLE ON THE WORLD WIDE WEB:

http://www.icsi.berkeley.edu/real/phoneval

Take Home Messages - 2

• IT MAY BE POSSIBLE TO TEASE OUT FACTORS ASSOCIATED WITH PHONETIC CLASSIFICATION FROM THOSE ASSOCIATED WITH PRONUNCIATION AND LANGUAGE MODELS
– Via comparison of the free and forced recognition scores for each site

• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Decision-tree analyses of the big lists support this hypothesis
– Additional analyses (some performed by George Doddington) are also consistent with this conclusion

• DETAILED ANALYSIS OF THE BIG LISTS IS LIKELY TO YIELD ADDITIONAL INSIGHTS OVER THE NEXT SEVERAL MONTHS
– This material is available on the PHONEVAL web site for members of the ASR community to analyze

Take Home Messages - 3

• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Nearly one hour's material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material

• THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE:

http://www.icsi.berkeley.edu/real/phoneval

• THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL IS AVAILABLE AT:

http://www.icsi.berkeley.edu/real/stp

Evaluation Materials

Evaluation Utterance Statistics

Statistic   N-Words  N-Syls  N-Phns  Duration  MRate  SylRate  logEng
Minimum        2.00    5.00   11.00      1.63   2.15     2.40    9.63
25%           12.25   16.00   33.00      3.28   3.51     4.35   12.45
50%           16.00   20.00   44.00      4.08   3.86     4.87   13.33
Mean          18.53   23.25   50.49      4.76   3.87     4.88   13.31
75%           23.00   29.00   63.00      5.90   4.22     5.38   14.15
Maximum       64.00   81.00  186.00     17.43   5.81     7.81   17.01
Std. Dev.      9.10   11.35   25.73      2.15   0.55     0.80    1.20

• THE 674 UTTERANCE FILES VARY WIDELY IN TERMS OF SUCH PARAMETERS AS DURATION, NUMBER OF WORDS AND PHONES, ENERGY AND SPEAKING RATE
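The summary statistics in the table can be recomputed from per-utterance records along the following lines; this is only a minimal sketch, and the two records shown are placeholders rather than actual evaluation files.

```python
import numpy as np

# Hypothetical per-utterance records:
# (n_words, n_syls, n_phns, duration_sec, mrate, sylrate, log_energy)
utterances = np.array([
    (16, 20, 44, 4.08, 3.86, 4.87, 13.33),
    (23, 29, 63, 5.90, 4.22, 5.38, 14.15),
    # ... one row per evaluation utterance (674 in the diagnostic set)
])

labels = ["N-Words", "N-Syls", "N-Phns", "Duration", "MRate", "SylRate", "logEng"]
for i, name in enumerate(labels):
    col = utterances[:, i]
    print(f"{name:9s} min={col.min():6.2f} 25%={np.percentile(col, 25):6.2f} "
          f"median={np.median(col):6.2f} mean={col.mean():6.2f} "
          f"75%={np.percentile(col, 75):6.2f} max={col.max():6.2f} sd={col.std(ddof=1):6.2f}")
```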

Evaluation Material Characteristics

[Histograms: number of utterances by subjective difficulty (V_Easy, Easy, Medium, Hard, V_Hard) and by dialect region (S_Mid, N_Mid, N_East, West, South, NYC, Other)]

• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS

• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
– 2-4 sec: 40%, 4-8 sec: 50%, 8-17 sec: 10%

• COVERAGE OF ALL (7) DIALECT REGIONS IN SWITCHBOARD

• A WIDE RANGE OF DISCUSSION TOPICS

• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)

• EIGHT SITES PARTICIPATED IN THE EVALUATION
– All eight provided material for the free-recognition phase
– Six sites also provided forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance)

• AT&T (transcript-based recognition incomplete, not analyzed)

• Bolt, Beranek and Newman

• Cambridge University

• Dragon (transcript-based recognition incomplete, not analyzed)

• Johns Hopkins University

• Mississippi State University

• SRI

• University of Washington

Evaluation Sites

Parameter Key

START - Begin time (in seconds) of phone

DUR - Duration (in sec) of phone

PHN - Hypothesized phone ID

WORD - Hypothesized Word ID

The format is the same for all 674 files in the evaluation set. Forced-alignment files conform to a similar structure. Two sites provided confidence scores associated with either word or phone recognition.

(Example courtesy of MSU)

Initial Recognition File - Example

UTT-ID CH Start DUR PHN WORD

2001_0016 B 0 0.1 sil !SENT_START

2001_0016 B 0.1 0.06 l

2001_0016 B 0.16 0.05 ay

2001_0016 B 0.21 0.07 k LIKE

2001_0016 B 0.28 0.04 ih

2001_0016 B 0.32 0.05 n IN

2001_0016 B 0.37 0.21 ao

2001_0016 B 0.58 0.08 g

2001_0016 B 0.66 0.08 ax

2001_0016 B 0.74 0.03 s

2001_0016 B 0.77 0.04 t

2001_0016 B 0.81 0.01 sp AUGUST

2001_0016 B 0.82 0.03 w

2001_0016 B 0.85 0.03 eh

2001_0016 B 0.88 0.04 n WHEN

2001_0016 B 0.92 0.05 eh

2001_0016 B 0.97 0.03 v

2001_0016 B 1 0.03 r

2001_0016 B 1.03 0.05 iy

2001_0016 B 1.08 0.06 b

2001_0016 B 1.14 0.05 aa

2001_0016 B 1.19 0.03 d

2001_0016 B 1.22 0.03 iy EVERYBODY

Generation of Evaluation Data - 1

• EACH SUBMISSION SITE USED A (QUASI-)CUSTOM PHONE SET
– The phone sets are available on the PHONEVAL web site

• THE SITES' PHONE SETS WERE MAPPED TO A “REFERENCE” PHONE SET
– The reference phone set is based on the ICSI Switchboard Transcription materials (STP), but is adapted to match the less granular symbol sets used by the submission sites
– The mapping conventions relating the STP and reference sets are also available on the PHONEVAL web site

• THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SITE PHONE SETS
– This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure
– For example, [em] (syllabic nasal) is mapped to [ix] + [m], and the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set

Phone Mapping Procedure

Generation of Evaluation Data - 2
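To make the mapping step concrete, here is a minimal sketch in Python. The mapping entries and equivalence sets below are small hypothetical excerpts for illustration only; the actual site-specific mapping conventions are the ones posted on the PHONEVAL web site.

```python
# Hypothetical excerpt of a site-to-reference phone mapping.
# A single site phone may expand to several reference phones ([em] -> [ix] [m]),
# and a reference phone such as [ix] may be credited as [ih] or [ax] when scoring.
SITE_TO_REF = {
    "em": ["ix", "m"],   # syllabic nasal expands to vowel + nasal
    "en": ["ix", "n"],
    "ax": ["ax"],
}

# Reference phones that should receive "credit" for several variants.
REF_EQUIVALENTS = {
    "ix": {"ix", "ih", "ax"},
}

def map_to_reference(site_phones):
    """Expand a site phone sequence into the reference phone set."""
    ref = []
    for p in site_phones:
        ref.extend(SITE_TO_REF.get(p, [p]))  # pass through unmapped phones
    return ref

def phones_match(ref_phone, hyp_phone):
    """True if the hypothesized phone counts as correct for the reference phone."""
    return hyp_phone in REF_EQUIVALENTS.get(ref_phone, {ref_phone})

print(map_to_reference(["s", "ah", "m", "th", "em"]))  # ['s', 'ah', 'm', 'th', 'ix', 'm']
print(phones_match("ix", "ih"))                        # True
```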

• EACH SITE’S SUBMISSION WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)

CTM File Format for Word Scoring

SOURCE     UTID        SIDE START DUR  WORD         ERTYP
REFERENCE  2001-B-0016 B    0     0.11 ?            N
HYPOTHESIS 2001-B-0016 B    ***   ***  ***          N
R          2001-B-0016 B    0.11  0.18 LIKE         C
H          2001-B-0016 B    0.1   0.18 LIKE         C
R          2001-B-0016 B    0.29  0.08 IN           C
H          2001-B-0016 B    0.28  0.09 IN           C
R          2001-B-0016 B    0.37  0.48 AUGUST       C
H          2001-B-0016 B    0.37  0.45 AUGUST       C
R          2001-B-0016 B    0.85  0.07 WHEN         C
H          2001-B-0016 B    0.82  0.1  WHEN         C
R          2001-B-0016 B    0.92  0.44 EVERYBODY_IS S
H          2001-B-0016 B    0.92  0.33 EVERYBODY    S
R          2001-B-0016 B    ***   ***  ***          I
H          2001-B-0016 B    1.25  0.1  IS           I
R          2001-B-0016 B    1.36  0.15 ON           C
H          2001-B-0016 B    1.35  0.15 ON           C
…          …           …    …     …    …            …

ERROR KEY: C = CORRECT, I = INSERTION, N = NULL ERROR, S = SUBSTITUTION

Generation of Evaluation Data - 3
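The following is a rough, overlap-based sketch of what time-mediated matching does with records like those above. It only illustrates the core idea (pairing each reference token with the hypothesis token that overlaps it most in time); it is not the SC-Lite scoring algorithm, and deletions and insertions are handled only crudely.

```python
# Minimal sketch of time-mediated matching over (start, duration, word) records.

def overlap(a_start, a_dur, b_start, b_dur):
    """Temporal overlap (in seconds) of two intervals."""
    return max(0.0, min(a_start + a_dur, b_start + b_dur) - max(a_start, b_start))

def match_by_time(reference, hypothesis):
    """reference/hypothesis: lists of (start, dur, word). Returns (ref, hyp, error_type)."""
    results = []
    for r_start, r_dur, r_word in reference:
        best = max(hypothesis, key=lambda h: overlap(r_start, r_dur, h[0], h[1]), default=None)
        if best is None or overlap(r_start, r_dur, best[0], best[1]) == 0.0:
            results.append((r_word, None, "D"))      # nothing overlaps: treat as deletion
        else:
            err = "C" if best[2] == r_word else "S"  # correct or substitution
            results.append((r_word, best[2], err))
    return results

ref = [(0.11, 0.18, "LIKE"), (0.29, 0.08, "IN"), (0.37, 0.48, "AUGUST")]
hyp = [(0.10, 0.18, "LIKE"), (0.28, 0.09, "IN"), (0.37, 0.45, "AUGUST")]
print(match_by_time(ref, hyp))  # [('LIKE', 'LIKE', 'C'), ('IN', 'IN', 'C'), ('AUGUST', 'AUGUST', 'C')]
```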

• UTTERANCE LEVEL – Utterance ID

– Number of Words in Utterance

– Utterance Duration

– Utterance Energy (Abnormally Low or High Amplitude)

– Utterance Difficulty (Very Easy, Easy, Medium, Hard, Very Hard)

– Speaking Rate - Syllables per Second

– Speaking Rate - Acoustic Measure (MRATE)

Speech Parameters Analyzed - 1

• LEXICAL LEVEL
– Lexical Identity
– Word Error Type - Substitution, Deletion, Insertion, Null
– Word Error Type Context (Preceding/Following)
– Unigram Frequency (in Switchboard Corpus)
– Number of Syllables in Word (Canonical)
– Number of Phones in Word (Canonical)
– Number of Phones Incorrect at Word Level (and Error Types)
– Phonetic Feature Distance Between Hypothesized/Reference Word
– Position of the Word in the Utterance
– Lexical Compound Status (Part of a Compound or Not)
– Word Duration
– Word Energy
– Prosodic Prominence (Maximum and Average Stress)
– Prosodic Context - Maximum/Average Stress (Preceding/Following)
– Temporal Alignment Between Reference and Hypothesized Word

Speech Parameters Analyzed - 2

• PHONE LEVEL – Phone ID (Reference and Hypothesized)

– Phone Duration (Reference and Hypothesized)

– Phone Position within the Word

– Phone Frequency (Switchboard Transcription Corpus)

– Phone Error Type (Substitution, Deletion, Insertion, Null)

– Phone Error Context (Preceding/Following Phone)

– Temporal Alignment Between Reference and Hypothesized Phone

– Phonetic Feature Distance Between Reference/Hypothesized Phone

– Phonetic Feature Analysis Between Reference/Hypothesized Phone
+ Manner of Articulation
+ Place of Articulation
+ Voicing
+ Lip Rounding

Speech Parameters Analyzed - 3
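For the feature-based parameters, the phonetic-feature distance can be read as the number of articulatory dimensions (manner, place, voicing, rounding) on which the reference and hypothesized phones differ. The sketch below illustrates the idea with a small hypothetical feature table; it is not the STP phone-to-articulatory-feature map from the web site.

```python
# Hypothetical fragment of a phone -> articulatory-feature table
# (manner, place, voicing, rounding).
FEATURES = {
    "s":  {"manner": "fricative", "place": "alveolar", "voicing": "unvoiced", "rounding": "none"},
    "t":  {"manner": "stop",      "place": "alveolar", "voicing": "unvoiced", "rounding": "none"},
    "eh": {"manner": "vowel",     "place": "front",    "voicing": "voiced",   "rounding": "unrounded"},
    "ax": {"manner": "vowel",     "place": "central",  "voicing": "voiced",   "rounding": "unrounded"},
}

def feature_distance(ref_phone, hyp_phone):
    """Number of articulatory features on which the two phones differ."""
    ref, hyp = FEATURES[ref_phone], FEATURES[hyp_phone]
    return sum(ref[f] != hyp[f] for f in ref)

print(feature_distance("s", "t"))    # 1 (manner differs)
print(feature_distance("eh", "ax"))  # 1 (place differs)
```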

• SPEAKER CHARACTERISTICS – Dialect Region

– Gender

– Recognition Difficulty (Very Easy, Easy, Medium, Hard, Very Hard)

– Speaking Rate - Syllables per Second and Acoustic (MRATE)

Speech Parameters Analyzed - 4

• ALL 674 FILES IN DIAGNOSTIC EVALUATION WERE LABELED

• LABELERS WERE TWO UC-BERKELEY LINGUISTICS STUDENTS

• ALL SYLLABLES WERE MARKED WITH RESPECT TO:
– Primary Stress

– Unstressed (no explicit label)

– Intermediate Stress

• INTERLABELER AGREEMENT IS HIGH
– 95% Agreement with Respect to Stress (78% for Primary Stress)
– 85% Agreement for Unstressed Syllables

• THE PROSODIC TRANSCRIPTION MATERIAL IS AVAILABLE AT: http://www.icsi.berkeley.edu/~steveng/prosody

Prosodic Material

Word-Centric “Big List”

ERR REFWORD HYPWORD UTID WORDPOS WORDFREQ WRDENG MRATE SYLRATE ETC.

N ? *** 2001-B-0016 0 -6.02 0.92 5.05 6.56 …
C LIKE LIKE 2001-B-0016 0.06 -2.1522 1.04 5.05 6.56 …
C IN IN 2001-B-0016 0.11 -1.9295 0.97 5.05 6.56 …
C AUGUST AUGUST 2001-B-0016 0.17 -4.6678 1.1 5.05 6.56 …
C WHEN WHEN 2001-B-0016 0.22 -2.5432 0.97 5.05 6.56 …
C EVERYBODY'S EVERYBODY'S 2001-B-0016 0.28 -4.3253 1.02 5.05 6.56 …
C ON ON 2001-B-0016 0.33 -2.3138 0.97 5.05 6.56 …
C VACATION VACATION 2001-B-0016 0.39 -3.9967 0.95 5.05 6.56 …
C OR OR 2001-B-0016 0.44 -2.3202 0.84 5.05 6.56 …
C SOMETHING SOMETHING 2001-B-0016 0.5 -2.7438 0.81 5.05 6.56 …
C WE WE 2001-B-0016 0.56 -2.1082 0.88 5.05 6.56 …
C CAN CAN 2001-B-0016 0.61 -2.611 0.75 5.05 6.56 …
C DRESS DRESS 2001-B-0016 0.67 -4.0399 0.9 5.05 6.56 …
C A A 2001-B-0016 0.72 -1.6723 0.85 5.05 6.56 …
C LITTLE LITTLE 2001-B-0016 0.78 -2.7814 0.91 5.05 6.56 …
C MORE MORE 2001-B-0016 0.83 -2.7027 0.85 5.05 6.56 …
C CASUAL CASUAL 2001-B-0016 0.89 -4.6678 0.94 5.05 6.56 …
I *** !SILENCE 2001-B-0016 0.94 -6.02 0.6 5.05 6.56 …
N H# *** 2005-B-0077 0 -6.02 0.6 4.44 7 …
N ? *** 2005-B-0077 0.06 -6.02 0.92 4.44 7 …
C YEAH YEAH 2005-B-0077 0.12 -1.9361 0.99 4.44 7 …
C JUST JUST 2005-B-0077 0.18 -2.1809 0.94 4.44 7 …
C BECAUSE BECAUSE 2005-B-0077 0.24 -2.4782 1.09 4.44 7 …
… … … … … … … … … …

• THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE (NEXT SLIDE)
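Because each big-list record is a single whitespace-delimited line, summaries such as the distribution of word error types are easy to script. The sketch below uses a few records from the example above in place of an actual big-list file.

```python
import io
from collections import Counter

# A few word-centric big-list records (taken from the example above); in practice
# the records would be read from one of the big-list files on the PHONEVAL site.
SAMPLE = io.StringIO(
    "N ? *** 2001-B-0016 0 -6.02 0.92 5.05 6.56\n"
    "C LIKE LIKE 2001-B-0016 0.06 -2.1522 1.04 5.05 6.56\n"
    "C IN IN 2001-B-0016 0.11 -1.9295 0.97 5.05 6.56\n"
    "I *** !SILENCE 2001-B-0016 0.94 -6.02 0.6 5.05 6.56\n"
)

# The first whitespace-delimited field of each record is the word error type.
counts = Counter(line.split()[0] for line in SAMPLE if line.strip())
total = sum(counts.values())
for err, n in counts.most_common():
    print(f"{err}: {n} ({n / total:.0%})")
```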

Phone-Centric “Big List”

PhERR WDERR REFPHN HYPPHN REFWORD HYPWORD PHNPOS RDUR HDUR RVOI HVOI AFDIST MSTRESS

S N ? H# ? *** 0 0.11 0.08 2 2 0 0
C C L L LIKE LIKE 0 0.05 0.08 0 0 0 0.5
C C AY AY LIKE LIKE 0.333 0.07 0.05 0 0 0 0.5
C C K K LIKE LIKE 0.667 0.06 0.07 1 1 0 0.5
C C IH IH IN IN 0 0.04 0.04 0 0 0 0
C C N N IN IN 0.5 0.04 0.08 0 0 0 0
C C AO AO AUGUST AUGUST 0 0.24 0.18 0 0 0 1
C C G G AUGUST AUGUST 0.167 0.05 0.09 0 0 0 1
S C IX AH AUGUST AUGUST 0.333 0.08 0.05 0 0 2 1
I C *** S AUGUST AUGUST 0.5 0 0.04 2 1 5 1
S C S T AUGUST AUGUST 0.667 0.07 0.06 1 1 1 1
C C W W AUGUST AUGUST 0.833 0.04 0.03 0 0 0 1
C C W W WHEN WHEN 0 0.04 0.03 0 0 0 0
S C EH AX WHEN WHEN 0.333 0.04 0.04 0 0 1 0
C C N N WHEN WHEN 0.667 0.04 0.03 0 0 0 0
C C EH EH EVERYBODY'S EVERYBODY'S 0 0.04 0.05 0 0 0 0.5
S C R V EVERYBODY'S EVERYBODY'S 0.111 0.04 0.03 0 0 2 0.5
C C R R EVERYBODY'S EVERYBODY'S 0.222 0.03 0.03 0 0 0 0.5
C C IY IY EVERYBODY'S EVERYBODY'S 0.333 0.04 0.05 0 0 0 0.5
C C B B EVERYBODY'S EVERYBODY'S 0.444 0.07 0.05 0 0 0 0.5
S C AA AH EVERYBODY'S EVERYBODY'S 0.556 0.06 0.05 0 0 1 0.5
C C D D EVERYBODY'S EVERYBODY'S 0.667 0.03 0.07 0 0 0 0.5
C C IY IY EVERYBODY'S EVERYBODY'S 0.778 0.04 0.03 0 0 0 0.5
C C Z Z EVERYBODY'S EVERYBODY'S 0.889 0.07 0.08 1 0 1 0.5
C C AA AA ON ON 0 0.09 0.07 0 0 0 0
C C N N ON ON 0.5 0.07 0.07 0 0 0 0
C C V V VACATION VACATION 0 0.04 0.03 0 0 0 1
C C EY EY VACATION VACATION 0.143 0.09 0.11 0 0 0 1
C C K K VACATION VACATION 0.286 0.1 0.1 1 1 0 1
C C EY EY VACATION VACATION 0.429 0.13 0.12 0 0 0 1
C C SH SH VACATION VACATION 0.571 0.08 0.1 1 1 0 1
S C IH AX VACATION VACATION 0.714 0.04 0.04 1 0 3 1
C C N N VACATION VACATION 0.857 0.06 0.04 0 0 0 1
C C ER ER OR OR 0 0.06 0.07 0 0 0 0
C C S S SOMETHING SOMETHING 0 0.12 0.11 1 1 0 0.5
C C AH AH SOMETHING SOMETHING 0.143 0.06 0.04 0 0 0 0.5
C C M M SOMETHING SOMETHING 0.286 0.04 0.07 0 0 0 0.5
C C TH TH SOMETHING SOMETHING 0.429 0.04 0.04 1 1 0 0.5
C C IH IH SOMETHING SOMETHING 0.571 0.05 0.05 0 0 0 0.5
D C NG *** SOMETHING SOMETHING 0.714 0.01 0 0 2 5 0.5
S C K NG SOMETHING SOMETHING 0.857 0.04 0.06 1 0 3 0.5

• THE PHONE “BIG LISTS” CONTAIN INFORMATION PERTAINING TO THE PHONETIC-FEATURE DISTANCE BETWEEN THE HYPOTHESIZED AND REFERENCE (STP) PHONE SEQUENCES, AS WELL AS MANY OTHER PARAMETERS
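The same phone-centric records can be tallied into the kind of phonetic confusion matrices distributed on the web site. A minimal sketch, again using a few records from the example above rather than a real big-list file:

```python
import io
from collections import defaultdict

# A few phone-centric big-list records (from the example above); REFPHN and HYPPHN
# are the third and fourth whitespace-delimited fields.
SAMPLE = io.StringIO(
    "S C IX AH AUGUST AUGUST 0.333 0.08 0.05 0 0 2 1\n"
    "S C S T AUGUST AUGUST 0.667 0.07 0.06 1 1 1 1\n"
    "S C EH AX WHEN WHEN 0.333 0.04 0.04 0 0 1 0\n"
    "S C IH AX VACATION VACATION 0.714 0.04 0.04 1 0 3 1\n"
)

confusions = defaultdict(lambda: defaultdict(int))
for line in SAMPLE:
    fields = line.split()
    if len(fields) >= 4:
        confusions[fields[2]][fields[3]] += 1   # reference phone -> hypothesized phone

for ref, row in sorted(confusions.items()):
    hyp = max(row, key=row.get)
    print(f"{ref} most often recognized as {hyp} ({row[hyp]} token(s))")
```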

The Phonetic Evaluation Web Site

RECOGNITION FILES
• Converted Submissions: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH
• Word Level Recognition Errors: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH
• Phone Error (Free Recognition): ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH
• Word Recognition Phone Mapping: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH

BIG LISTS (free recognition)
• Word-Centric: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH
• Phone-Centric: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH
• Phonetic Confusion Matrices: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH

FORCED ALIGNMENT FILES
• Forced Alignment Files: CU, BBN, JHU, MSU, SRI, WASH
• Word-Level Alignment Errors: CU, BBN, JHU, MSU, SRI, WASH
• Phone Error (Forced Alignment): CU, BBN, JHU, MSU, SRI, WASH
• Alignment Word-Phone Mapping: CU, BBN, JHU, MSU, SRI, WASH

BIG LISTS (forced alignment)
• Word-Centric: CU, BBN, JHU, MSU, SRI, WASH
• Phone-Centric: CU, BBN, JHU, MSU, SRI, WASH
• Phonetic Confusion Matrices: CU, BBN, JHU, MSU, SRI, WASH

• Description of the STP Phone Set
• STP Transcription Material
• Phone-Word Reference
• Syllable-Word Reference
• Phone Mapping for Each Site: ATT, CU, BBN, DRAG, JHU, MSU, SRI, WASH
• STP-to-Reference Map
• STP Phone-to-Articulatory-Feature Map

http://www.icsi.berkeley.edu/real/phoneval

Word Recognition Error

[Bar chart: word error rate (0-0.6) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]

• WORD ERROR RATES VARY BETWEEN 27% AND 43%
– Substitutions are the major source of word errors

Phone Recognition Error

[Bar chart: phone error rate (0-0.6) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]

• PHONE ERROR RATES VARY BETWEEN 39% AND 55%
– Substitutions are the major source of phone errors

Word Error - Forced Alignment (a)

• WORD ERROR RATES VARY FROM 8% TO 19% (SRI = 71%)
– Insertions as well as substitutions are a major source of errors

[Bar chart: word error rate (0-0.8) by error type (TOTAL, SUB, DEL, INS) for BBN, CU, JHU, MSU, SRI, WASH; AT&T and Dragon did not provide forced alignments; SRI uses many lexical compounds]

Word Error - Forced Alignment (b) - Minus SRI

[Bar chart: word error rate (0-0.20) by error type (TOTAL, SUB, DEL, INS) for BBN, CU, JHU, MSU, WASH; AT&T and Dragon did not provide forced alignments]

• WORD ERROR RATES VARY FROM 8% TO 19%
– Insertions as well as substitutions are a major source of errors

Phone Error - Forced Alignment

[Bar chart: phone error rate (0-0.5) by error type (TOTAL, SUB, DEL, INS) for BBN, CU, JHU, MSU, SRI, WASH; AT&T and Dragon did not provide forced alignments]

• PHONE ERROR RATES VARY BETWEEN 35% AND 49%
– Insertions as well as substitutions are a major source of errors

Are Word and Phone Errors Related?

• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
– The correlation between the two parameters is r = 0.78

[Chart: phone and word error rates (0-0.6, free recognition) for each submission site (JHU, ATT, CU, SRI, DRAG, MSU, UW, BBN); the differential error rate is probably related to the use of either pronunciation or language models (or both)]
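The reported correlation is an ordinary Pearson correlation computed over the sites' free-recognition error rates. A minimal sketch of the computation is shown below; the eight values are illustrative placeholders, not the actual site scores.

```python
import numpy as np

# Pearson correlation between per-site phone and word error rates (free recognition).
# The numbers below are placeholders rather than the actual per-site results.
phone_error = np.array([0.39, 0.42, 0.44, 0.46, 0.48, 0.50, 0.53, 0.55])
word_error  = np.array([0.27, 0.30, 0.31, 0.34, 0.36, 0.38, 0.41, 0.43])

r = np.corrcoef(phone_error, word_error)[0, 1]
print(f"r = {r:.2f}")
```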

Phone/Word Error - Forced Alignment

[Chart: phone and word error rates (0-0.8, forced alignment) for each submission site (JHU, MSU, CU, WASH, SRI, BBN); SRI's high word error reflects lexical compounds; several sites show ~20% “null” errors]

• THE DIFFERENTIAL ERROR RATE BETWEEN WORDS AND PHONES PROBABLY REFLECTS PRONUNCIATION MODELS
– SRI's performance is affected by lexical compounds in the lexicon

Phone/Word Error - Forced Alignment - Minus SRI

[Chart: phone and word error rates (0-0.5, forced alignment) for JHU, MSU, CU, WASH, BBN; ~20% “null” errors]

• THE DIFFERENTIAL ERROR RATE BETWEEN WORDS AND PHONES PROBABLY ALSO REFLECTS THE LARGE NUMBER OF “NULL” ERRORS (NO PENALTY FOR BEING WRONG) IN SEVERAL OF THE SITE SUBMISSIONS

Decision Tree Analysis Example

• DELETIONS VS. EVERYTHING ELSE - WORD LEVEL (MSU), N = 5243 (tree levels 1-6)

PROPORTIONS

0.8873 0.1127
AVGAFDIST < 4.835: 0.9069 0.09306
| PHNCOR < 1.5: 0.7811 0.2189
| | PHNINS < 0.5: 0.6763 0.3237
| | | PHNSUB < 1.5: 0.5519 0.4481
| | | | POSTWORDERR in S,D: 0.3223 0.6777
| | | | | PHNCOR < 0.5: 0.1296 0.8704
| | | | | PHNCOR >= 0.5: 0.4485 0.5515
| | | | | | WORDENGY < 0.995: 0.3448 0.6552
| | | | | | WORDENGY >= 0.995: 0.6939 0.3061
| | | | POSTWORDERR in C,I,N: 0.7204 0.2796
| | | | | AVGAFDIST < 3.165: 0.7994 0.2006
| | | | | | CANO-PHNCNT < 2.5: 0.8462 0.1538
| | | | | | CANO-PHNCNT >= 2.5: 0.3214 0.6786
| | | | | AVGAFDIST >= 3.165: 0.2931 0.7069
| | | | | | REFDUR < 0.125: 0.08108 0.9189
| | | | | | REFDUR >= 0.125: 0.6667 0.3333
| | | PHNSUB >= 1.5: 0.8394 0.1606
| | PHNINS >= 0.5: 0.9283 0.07169
| PHNCOR >= 1.5: 0.9842 0.01578
AVGAFDIST >= 4.835: 0.1016 0.8984
| PHNINS < 0.5: 0.05785 0.9421
| PHNINS >= 0.5: 0.8571 0.1429

INSTANCES

+4652 +591
AVGAFDIST < 4.835: +4639 +476
| PHNCOR < 1.5: +1520 +426
| | PHNINS < 0.5: +769 +368
| | | PHNSUB < 1.5: +356 +289
| | | | POSTWORDERR in S,D: +88 +185
| | | | | PHNCOR < 0.5: +14 +94
| | | | | PHNCOR >= 0.5: +74 +91
| | | | | | WORDENGY < 0.995: +40 +76
| | | | | | WORDENGY >= 0.995: +34 +15
| | | | POSTWORDERR in C,I,N: +268 +104
| | | | | AVGAFDIST < 3.165: +251 +63
| | | | | | CANO-PHNCNT < 2.5: +242 +44
| | | | | | CANO-PHNCNT >= 2.5: +9 +19
| | | | | AVGAFDIST >= 3.165: +17 +41
| | | | | | REFDUR < 0.125: +3 +34
| | | | | | REFDUR >= 0.125: +14 +7
| | | PHNSUB >= 1.5: +413 +79
| | PHNINS >= 0.5: +751 +58
| PHNCOR >= 1.5: +3119 +50
AVGAFDIST >= 4.835: +13 +115
| PHNINS < 0.5: +7 +114
| PHNINS >= 0.5: +6 +1


Number of Instances at a Given Level

FEATURE TOTAL LEVEL 1 LEVEL 2 LEVEL 3 LEVEL 4 LEVEL 5 LEVEL 6 LEVEL 7

AVGAFDIST 5615 5243 0 0 0 0 372 0

PHNCOR 5388 0 5115 0 0 0 273 0

PHNINS 2074 0 128 1946 0 0 0 0

PHNSUB 1137 0 0 0 1137 0 0 0

POSTWORDERR 645 0 0 0 0 645 0 0

CANO-PHNCNT 314 0 0 0 0 0 0 314

WORDENGY 165 0 0 0 0 0 0 165

REFDUR 58 0 0 0 0 0 0 58

Number of Nodes at a Given Level

FEATURE TOTAL LEVEL 1 LEVEL 2 LEVEL 3 LEVEL 4 LEVEL 5 LEVEL 6 LEVEL 7

AVGAFDIST 4 2 0 0 0 0 2 0

PHNCOR 4 0 2 0 0 0 2 0

PHNINS 4 0 2 2 0 0 0 0

CANO-PHNCNT 2 0 0 0 0 0 0 2

POSTWORDERR 2 0 0 0 0 2 0 0

WORDENGY 2 0 0 0 0 0 0 2

PHNSUB 2 0 0 0 2 0 0 0

REFDUR 2 0 0 0 0 0 0 2

• PHONE-BASED PARAMETERS DOMINATE THE TREES ….

• WORD SUBSTITUTIONS VERSUS EVERYTHING ELSE
– ATT - phnsub, wordfreq, avgAFdist, beginoff, endoff, postworderr
– BBN - postworderr, preworderr, avgAFdist, phnsub, wordfreq, hypdur
– CU - preworderr, phnsub, wordfreq
– Dragon - phnsub, preworderr, postworderr
– MSU - phnsub, avgAFdist, hypdur, postworderr, beginoff
– JHU - phnsub, wordfreq, cano-sylcnt
– SRI - postworderr, phnsub, avgAFdist, wordfreq, hypdur
– WASH - phnsub, wordfreq, avgAFdist, postworderr, avgphnfreq

• WORD DELETIONS VERSUS EVERYTHING ELSE
– ATT - phncor, avgAFdist, postworderr
– BBN - avgAFdist, refdur, wordengy, preworderr
– CU - avgAFdist, phnins, phncor
– Dragon - phncor, preworderr, phnsub
– MSU - avgAFdist, phncor, phnins, phnsub
– JHU - avgAFdist, preworderr, refdur
– SRI - phncor, phnsub, phnins, wordfreq, hypdur
– WASH - avgAFdist, refdur, preworderr

Decision Tree Analysis of Errors - 1

• DURATION IS IMPORTANT IN DISTINGUISHING AMONG ERROR TYPES IN THE TREES ….

• WORD SUBSTITUTIONS VERSUS DELETIONS
– ATT - refdur, phnsub, wordengy, postworderr, phncor
– BBN - phnsub, phncor, phnins
– CU - hypdur, phnsub, avgAFdist, phncor
– Dragon - refdur, phnsub, avgAFdist, phnins
– MSU - refdur, phnsub, avgAFdist, phncor, phnins
– JHU - refdur, phnsub, phncor, phnins, postworderr
– SRI - refdur, wordengy, phnsub, wordfreq, phnins, phncor
– WASH - refdur, phnsub, phnins, phncor

• WORD SUBSTITUTIONS VERSUS INSERTIONS
– ATT - hypdur, avgAFdist, preworderr
– BBN - hypdur, phnsub, avgphnfreq, refdur, preworderr
– CU - hypdur, avgphnfreq, postworderr
– Dragon - hypdur
– JHU - hypdur, phnsub
– MSU - avgphnfreq, hypdur, preworderr, phnsub
– SRI - hypdur, phnsub
– WASH - hypdur, phnsub, avgphnfreq, phndel, preworderr, phncor

Decision Tree Analysis of Errors - 2
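The decision-tree analyses summarized above can be approximated with standard tools. The sketch below trains a tree to separate word deletions from everything else using the same kinds of big-list predictors (phncor, phnsub, phnins, avgAFdist, etc.); the feature values are random placeholders, so it only indicates the style of analysis, not the trees shown earlier.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Predictors drawn from a word-centric big list (placeholder random values here;
# in practice they would be read from the big-list columns).
feature_names = ["phncor", "phnsub", "phnins", "avgAFdist", "wordfreq", "hypdur"]
rng = np.random.default_rng(0)
X = rng.random((500, len(feature_names)))
y = rng.integers(0, 2, 500)          # 1 = word deletion, 0 = everything else

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20)
tree.fit(X, y)
print(export_text(tree, feature_names=feature_names))
```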

Different Measures of Speaking Rate

Statistic   N-Words  N-Syls  N-Phns  Duration  MRate  SylRate  logEng
Minimum        2.00    5.00   11.00      1.63   2.15     2.40    9.63
25%           12.25   16.00   33.00      3.28   3.51     4.35   12.45
50%           16.00   20.00   44.00      4.08   3.86     4.87   13.33
Mean          18.53   23.25   50.49      4.76   3.87     4.88   13.31
75%           23.00   29.00   63.00      5.90   4.22     5.38   14.15
Maximum       64.00   81.00  186.00     17.43   5.81     7.81   17.01
Std. Dev.      9.10   11.35   25.73      2.15   0.55     0.80    1.20

• MRATE IS AN ACOUSTIC MEASURE BASED ON THE MODULATION PROPERTIES OF THE SIGNAL’S ENERGY ACROSS THE SPECTRUM

• SYLLABLES/SEC IS A LINGUISTIC MEASURE OF SPEAKING RATE

• CORRELATION BETWEEN THE TWO MEASURES (R) = 0.56

• MRATE GENERALLY UNDERESTIMATES THE SYLLABLE RATE
– Non-speech, filled pauses, etc. are included in the computation of MRATE but not of the syllable rate
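The relation between the two rate measures can be checked directly from per-utterance values; a sketch (with placeholder numbers rather than the actual 674 utterances) is shown below.

```python
import numpy as np

# Per-utterance speaking-rate measures (placeholder values, not the actual data).
mrate    = np.array([3.9, 3.5, 4.2, 3.8, 4.4, 3.6])   # acoustic estimate
syl_rate = np.array([4.9, 4.4, 5.4, 4.6, 5.3, 4.8])   # syllables/sec from the transcription

r = np.corrcoef(mrate, syl_rate)[0, 1]
print(f"correlation = {r:.2f}")
print(f"MRATE below syllable rate in {np.mean(mrate < syl_rate):.0%} of utterances")
```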

Summary of Corpus Statistical Analyses

• LEXICAL PROPERTIES
– Lexical Identity
– Unigram Frequency
– Number of Syllables in Word
– Number of Phones in Word
– Word Duration
– Speaking Rate
– Prosodic Prominence
– Energy Level
– Lexical Compounds
– Non-Words
– Word Position in Utterance

• SYLLABLE PROPERTIES
– Syllable Structure
– Syllable Duration
– Syllable Energy
– Prosodic Prominence
– Prosodic Context

• PHONE PROPERTIES
– Phonetic Identity
– Phone Frequency
– Position within the Word
– Position within the Syllable
– Phone Duration
– Speaking Rate
– Phonetic Context
– Contiguous Phones Correct
– Contiguous Phones Wrong
– Phone Segmentation
– Articulatory Features
– Articulatory Feature Distance
– Phone Confusion Matrices

• OTHER PROPERTIES
– Speaker (Dialect, Gender)
– Utterance Difficulty
– Utterance Energy
– Utterance Duration

• HOW ACCURATE IS THE PHONETIC SEGMENTATION PROVIDED BY FORCED-ALIGNMENT-BASED RECOGNITION?
– The average disparity between the phone durations of the reference (STP) corpus and the durations of the forced-alignment phones is substantial (ca. 40% of the mean duration of a phone in the corpus)

• AUTOMATIC ALIGNERS ARE NOT RELIABLE PHONE SEGMENTERS

Precision of Forced Alignment Segmentation

MEAN DISPARITY

SITE            in msec   rel. to mean phone duration

BBN 31 0.39

CAMBRIDGE 30 0.38

JOHNS HOPKINS 37 0.47

MISSISSIPPI 31 0.39

SRI 32 0.40

WASHINGTON 31 0.39

MEAN of 6 SITES 32 0.40

Mean phone duration in corpus = 79.3 ms

There is virtually no skew in disparity between beginning and ending portions of the phones (i.e., no bias in segmentation)
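The disparity values in the table correspond to the mean absolute difference between reference (STP) and forced-alignment phone durations, expressed in milliseconds and relative to the mean reference phone duration. A minimal sketch, assuming the RDUR and HDUR columns of a phone-centric big list have been loaded (placeholder values shown), is below.

```python
import numpy as np

# RDUR/HDUR: reference and forced-alignment phone durations in seconds,
# taken from a phone-centric big list (placeholder values shown here).
rdur = np.array([0.11, 0.05, 0.07, 0.06, 0.04, 0.24])
hdur = np.array([0.08, 0.08, 0.05, 0.07, 0.04, 0.18])

disparity_ms = 1000 * np.mean(np.abs(rdur - hdur))
relative = disparity_ms / (1000 * rdur.mean())   # relative to mean reference phone duration
print(f"mean disparity = {disparity_ms:.0f} ms ({relative:.2f} of mean phone duration)")
```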

• RELATION OF THE NUMBER OF PHONES IN A WORD TO WORD ERROR
– Done by George Doddington of NIST using both the free and forced-alignment recognition results (from the “Big Lists”)
– Reveals an interesting relationship between the number of phones correctly (or incorrectly) classified and the probability of a word being correctly (or incorrectly) labeled
– Also shows the extent to which decoders are tolerant of phone classification errors
– George's analysis is consistent with the D-Tree analyses suggesting that phone classification is the controlling variable for word error
– George will discuss this material directly following this presentation

• ANALYSIS OF PHONETIC CONFUSIONS IN THE CORPUS MATERIAL
– Performed by Joe Kupin and Hollis Fitch of the Institute for Defense Analyses
– The output of their scripts is available on the PHONEVAL web site
– Hollis will discuss some of their results directly after George's presentation

Analyses Performed By Others

• THE DIAGNOSTIC MATERIAL MAY NOT BE TRULY REPRESENTATIVE OF THE SWITCHBOARD RECOGNITION TASK

– The competitive evaluation is based on entire conversations, whereas the current diagnostic material contains only relatively small amounts of material from any single speaker

– This strategy was intended to provide a broad coverage of different speaker qualities (gender, dialect, age, voice quality, topic, etc.), but …

– Was also designed to foil recognition based largely on speaker adaptation algorithms

• THE TIME-MEDIATED SCORING TECHNIQUE IS NOT “PERFECT” AND MAY HAVE INTRODUCED CERTAIN ERRORS NOT PRESENT IN THE COMPETITIVE EVALUATION

• LEXICAL COMPOUNDS ARE PROBLEMATIC FOR THE CURRENT ANALYSIS

– Certain sites (e.g., SRI) were unduly penalized in the forced alignment component of the analysis due to the high proportion of compounds

• THE STP TRANSCRIPTION (REFERENCE) MATERIAL IS NOT “PERFECT” AND THEREFORE THE ANALYSES COULD UNDERESTIMATE A SITE'S PERFORMANCE ON BOTH FREE AND FORCED-ALIGNMENT-BASED RECOGNITION

Caveats

• SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES ARE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION USING BOTH FREE AND FORCED-ALIGNMENT-BASED RECOGNITION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL

• IT MAY BE POSSIBLE TO TEASE OUT FACTORS ASSOCIATED WITH PHONETIC CLASSIFICATION FROM THOSE ASSOCIATED WITH PRONUNCIATION AND LANGUAGE MODELS
– Via comparison of the free and forced recognition scores for each site

• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Decision-tree analyses of the big lists support this hypothesis
– Additional analyses (some performed by George Doddington) are also consistent with this conclusion

• DETAILED ANALYSIS OF THE BIG LISTS IS LIKELY TO YIELD ADDITIONAL INSIGHTS OVER THE NEXT SEVERAL MONTHS
– This material is available on the PHONEVAL web site for members of the ASR community to analyze

Summary and Conclusions

• STRUCTURED QUERY LANGUAGE (SQL) DATABASE VERSION (9/2000)
– Will provide quick and ready access to the entire set of recognition and forced-alignment material
– Will enable accurate selection of specific subsets of the material for detailed, intensive analysis and graphing without much scripting
– Will accelerate analysis of the evaluation material, which will be …

• POSTED ON THE PHONEVAL WEB SITE FOR WIDE DISSEMINATION

• DEVELOPMENT OF A HIGH-FIDELITY AUTOMATIC PHONETIC TRANSCRIPTION SYSTEM TO LABEL AND SEGMENT (IN PROGRESS)
– This automatic system will enable accurate labeling and segmentation of the remainder of the Switchboard corpus, thus enabling …

• PHONETIC AND LEXICAL DISSECTION OF THE COMPETITIVE EVALUATION SUBMISSIONS IN THE SPRING OF 2001
– Hopefully providing further insight into ways in which ASR systems can be improved

Into the Future …
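As an illustration of the kind of selection the planned SQL database would support, the sketch below runs a simple query against a hypothetical SQLite table of word-centric big-list records; the table layout and column names are assumptions, not the schema of the eventual PHONEVAL database.

```python
import sqlite3

# Hypothetical schema: one row per word token from the word-centric big lists.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE word_tokens
               (site TEXT, err TEXT, refword TEXT, hypword TEXT, mrate REAL, sylrate REAL)""")
con.executemany("INSERT INTO word_tokens VALUES (?, ?, ?, ?, ?, ?)",
                [("MSU", "C", "LIKE", "LIKE", 5.05, 6.56),
                 ("MSU", "S", "WHEN", "ONE", 5.05, 6.56),
                 ("SRI", "D", "IN", None, 4.44, 7.0)])

# Example selection: substitution rate per site for fast-speech utterances.
for site, n_sub, n_total in con.execute(
        """SELECT site,
                  SUM(err = 'S') AS n_sub,
                  COUNT(*)       AS n_total
           FROM word_tokens
           WHERE sylrate > 5.0
           GROUP BY site"""):
    print(site, n_sub / n_total)
```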

[Supplementary charts; only the titles, axis labels and category values are recoverable from the original slides:]

• Error rate (mean and standard deviation) by syllable structure: CVC, CV, VC, V, CVCC, VCC, CVCVC, CCVC, CVCV, CVCCVC, VCVC
• Error rate by syllable structure for each site (Mean, ATT, BBN, CU, DRAG, JHU, MSU, SRI, UW)
• Distribution of syllable types in the evaluation material: CVC 29%, CV 22%, VC 14%, V 10%, CVCC 8%, VCC 6%, CVCVC 4%, CCVC 2%, CVCV 2%, CVCCVC 2%, VCVC 1%
• Word error rate as a function of average stress and of maximum stress, by site (MEAN, ATT, BBN, CU, DRAG, JHU, MSU, SRI, UW)
• Deletion and substitution error rates as a function of maximum stress in the word (0, 0.5, 1), overall and by site
• Deletion error rate as a function of average stress, by site
• Word error rate in relation to MRATE and to syllable rate, by site
• Number of phone errors per word (total, substitutions, deletions, insertions) by word error type and number of phones per word
• Phone error rate and articulatory-feature error (total, substitutions, deletions) by position within the syllable
• Total articulatory-feature errors and feature-distance error by phone position in the word (onset, mid, final) and by site, for correct versus incorrect words