Phonetic and Lexical Dissection of the Hub-5 Speech Recognition Evaluation Steven Greenberg, Shawn Chang and Joy Hollenback International Computer Science

  • View
    212

  • Download
    0

Embed Size (px)

Text of Phonetic and Lexical Dissection of the Hub-5 Speech Recognition Evaluation Steven Greenberg, Shawn...

  • Slide 1
  • Phonetic and Lexical Dissection of the Hub-5 Speech Recognition Evaluation Steven Greenberg, Shawn Chang and Joy Hollenback International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 Speech Transcription Workshop, College Park, MD - May 18, 2000
  • Slide 2
  • OR
  • Slide 3
  • An Introduction to Error Analysis The Study of Uncertainties in Physical Measurements Speech Transcription Workshop, College Park, MD - May 18, 2000
  • Slide 4
  • Acknowledgements and Thanks EVALUATION DESIGN George Doddington Jack Godfrey ANALYSIS George Doddington Hollis Fitch Joe Kupin Rosaria Silipo SC-LITE SUPPORT Jon Fiscus DATA SUBMISSION AT&T BBN DRAGON SYSTEMS CAMBRIDGE UNIVERSITY JOHNS HOPKINS UNIVERSITY MISSISSIPPI STATE UNIVERSITY SRI UNIVERSITY OF WASHINGTON
  • Slide 5
  • SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES ARE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION USING BOTH FREE AND FORCED-ALIGNMENT-BASED RECOGNITION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL THIS EVALUATION REQUIRED THE CONVERSION OF THE ORIGINAL SUBMISSION MATERIAL TO A COMMON REFERENCE FORMAT The common format is required for scoring using SC-Lite and to perform certain types of statistical analyses Key to the conversion is the use of TIME-MEDIATED parsing which provides the capability of assigning different outputs to the same reference unit (be it word, phone or other) THE ASR SYSTEM RESULTS HAVE BEEN SCORED USING A 1-HOUR SUBSET OF THE SWITCHBOARD TRANSCRIPTION CORPUS This corpus has been hand-labeled at the phone, syllable, word and prosodic stress levels and hand-segmented at the syllabic and word levels. 25% of the material was hand-segmented at the phone level and the remainder quasi- automatically segmented into phonetic segments (and verified) The phonetic segments of each site have been mapped to a common reference set, enabling a detailed analysis of the phone and word errors for each site that would otherwise be difficult to perform Take Home Messages - 1
  • Slide 6
  • FORCED-ALIGNMENT RECOGNITION RESULTS FROM SIX SITES HAVE ALSO BEEN ANALYZED Possessing both the free and forced-alignment-based recognition results provide the possibility of performing analyses focused on the relation of the acoustic models to those pertaining to pronunciation and statistical co- occurrence patterns of lexical units THE RECOGNITION MATERIAL HAS BEEN TAGGED AT THE PHONE AND WORD LEVELS WITH ~ 40 SEPARATE LINGUISTIC PARAMETERS This information pertains to the acoustic, phonetic, lexical, utterance and speaker characteristics of the material and are formatted into BIG LISTS ALL OF THE RECOGNITION MATERIAL AND CONVERTED FILES, AS WELL AS THE SUMMARY (BIG) LISTS ARE AVAILABLE ON THE WORLD WIDE WEB: http://www.ices.berkeley.edu/real/phoneval Take Home Messages - 2
  • Slide 7
  • IT MAY BE POSSIBLE TO TEASE OUT FACTORS ASSOCIATED WITH PHONETIC CLASSIFICATION FROM THOSE ASSOCIATED WITH PRONUNCIATION AND LANGUAGE MODELS Via comparison of the free and forced recognition scores for each site PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS Decision-tree analyses of the big lists support this hypothesis Additional analyses (some performed by George Doddington) are also consistent with this conclusion DETAILED ANALYSIS OF THE BIG LISTS IS LIKELY TO YIELD ADDITIONAL INSIGHTS OVER THE NEXT SEVERAL MONTHS This material is available on the PHONEVAL web site for members of the ASR community to analyze Take Home Messages - 3
  • Slide 8
  • SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS Nearly one hours material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley) All of this material was hand-segmented at either the phonetic- segment or syllabic level by the transcribers The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72-minutes of hand-segmented Switchboard material THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT IS AVAILABLE ON THE PHONEVAL WEB SITE: http://www.icsi.berkeley.edu/real/phoneval THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL IS AVAILABLE AT: http://www.icsi.berkeley.edu/real/stp Evaluation Materials
  • Slide 9
  • Evaluation Utterance Statistics THE 674 UTTERANCE FILES VARY A LOT IN TERMS OF SUCH PARAMETERS AS DURATION, NUMBER OF WORDS AND PHONES, ENERGY AND SPEAKING RATE
  • Slide 10
  • Evaluation Material Characteristics Subjective Difficulty By Subjective Difficulty Dialect Region Number of Utterances By Dialect Region AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS BROAD DISTRIBUTION OF UTTERANCE DURATIONS 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10% COVERAGE OF ALL (7) DIALECT REGIONS IN SWITCHBOARD A WIDE RANGE OF DISCUSSION TOPICS VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
  • Slide 11
  • EIGHT SITES PARTICIPATED IN THE EVALUATION All eight provided material for the free-recognition phase Six sites also provided forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) AT&T (transcript-based recognition incomplete, not analyzed ) Bolt, Beranek and Newman Cambridge University Dragon (transcript-based recognition incomplete, not analyzed ) Johns Hopkins University Mississippi State University SRI University of Washington Evaluation Sites
  • Slide 12
  • Parameter Key START - Begin time (in seconds) of phone DUR - Duration (in sec) of phone PHN - Hypothesized phone ID WORD - Hypothesized Word ID Format is for all 674 files in the evaluation set Forced-alignment files conform to a similar structure. Two sites provided confidence scores associated with either word or phone recognition (Example courtesy of MSU) Initial Recognition File - Example
  • Slide 13
  • Generation of Evaluation Data - 1
  • Slide 14
  • EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET The phone sets are available on the PHONEVAL web site THE SITES PHONE SETS WERE MAPPED TO A REFERENCE PHONE SET The reference phone set is based on the ICSI Switchboard Transcription materials (STP), but is adapted to match the less granular symbol sets used by the submission sites The set of mapping conventions relating the STP (and reference) sets are also available on the PHONEVAL web site THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SITE PHONE SETS This reverse mapping was done in order to insure that variants of a phone were given due credit in the scoring procedure For example - [em] (syllabic nasal) is mapped to [ix] + [m], the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set Phone Mapping Procedure
  • Slide 15
  • Generation of Evaluation Data - 2
  • Slide 16
  • EACH SITES SUBMISSION WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE) CTM File Format for Word Scoring ERROR KEY C = CORRECT I = INSERTION N = NULL ERROR S = SUBSTITUTION
  • Slide 17
  • Generation of Evaluation Data - 3
  • Slide 18
  • UTTERANCE LEVEL Utterance ID Number of Words in Utterance Utterance Duration Utterance Energy (Abnormally Low or High Amplitude) Utterance Difficulty (Very Easy, Easy, Medium, Hard, Very Hard) Speaking Rate - Syllables per Second Speaking Rate - Acoustic Measure (MRATE) Speech Parameters Analyzed - 1
  • Slide 19
  • LEXICAL LEVEL Lexical Identity Word Error Type - Substitution, Deletion, Insertion, Null Word Error Type Context (Preceding/Following) Unigram Frequency (in Switchboard Corpus) Number of Syllables in Word (Canonical) Number of Phones in Word (Canonical) Number of Phones Incorrect at Word Level (and Error Types) Phonetic Feature Distance Between Hypothesized/Reference Word Position of the Word in the Utterance Lexical Compound Status (Part of a Compound or Not) Word Duration Word Energy Prosodic Prominence (Maximum and Average Stress) Prosodic Context -Maximum/Average Stress (Preceding/Following) Temporal Alignment Between Reference and Hypothesized Word Speech Parameters Analyzed - 2
  • Slide 20
  • PHONE LEVEL Phone ID (Reference and Hypothesized) Phone Duration (Reference and Hypothesized) Phone Position within the Word Phone Frequency (Switchboard Transcription Corpus) Phone Error Type (Substitution, Deletion, Insertion, Null) Phone Error Context (Preceding/Following Phone) Temporal Alignment Between Reference and Hypothesized Phone Phonetic Feature Distance Between Reference/Hypothesized Phone Phonetic Feature Analysis Between Reference/Hypothesized Phone + Manner of Articulation + Place of Articulation + Voicing + Lip Rounding Speech Parameters Analyzed - 3
  • Slide 21
  • SPEAKER CHARACTERISTICS Dialect Region Gender Recognition Difficulty (Very Easy, Easy, Medium, Hard, Very Hard) Speaking Rate - Syllables per Second and Acoustic (MRATE) Speech Parameters Analyzed - 4
  • Slide 22
  • ALL 674 FILES IN DIAGNOSTIC EVALUATION WERE LABELED LABELERS WERE TWO UC-BERKELEY LINGUISTICS STUDENTS ALL SYLLABLES WERE MARKED WITH RESPECT TO: Primary Stress Unstressed (no explicit label) Intermediate Stress INTERLABELER AGREEMENT IS HIGH 95% Agreement with Respect to Stress (78% for Primary Stress) 85% Agreement for Unstressed Syllables THE PROSODIC TRANSCRIPTI