+ Phonetics and Language Resources for Speech Recognition

CS 136 Speech Recognition
September 26, 2017
Professor Meteer

Thanks to Dan Jurafsky for many of these slides




+ Just a bit more on Kaldi

• Next you will be using the Kaldi Resource Management (rm) recipe
  • As you go through the steps, we’ll be talking about the algorithms in class
  • I will be putting up questions in the Latte quiz format
• The “recipes” are in the “egs” directory
• Note for those running on department machines:
  • The Kaldi folder on montera is read-only, so you need to copy /opt/kaldi/egs to your home folder and run the scripts from there.
  • There are 12 different directories you need to add to your LD_LIBRARY_PATH. Ken will send out a list.


+ Phonetics

• Phonemes and the ARPAbet
  • An alphabet for transcribing American English phonetic sounds.
• Articulatory Phonetics
  • How speech sounds are made by articulators (moving organs) in the mouth.
• Language resources and WFSTs

1/5/07


+ From speech to phonemes

• Phonemes are the minimal set of sounds needed to distinguish meaning
  • Pat – bat, tab – dab
  • Fat – chat – that
  • Pack – pick – puck – pike
• Transcription uses the alphabet, but is not isomorphic to spelling (especially in English)
• The standard used in speech recognition is the “ARPAbet”
  • 46 total (17 vowels, 29 consonants) + 13 “extras”
  • In practice there are many variations, but all are close
  • http://www.stanford.edu/class/cs224s/arpabet.html
  • NOTE: These are for English only; each language has its own set of phonemes
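The minimal-pair idea can be sketched in a few lines of Python. This is only an illustration: the transcriptions in `LEXICON` are hand-written ARPAbet guesses for a few of the example words, and `is_minimal_pair` is a hypothetical helper name, not part of any toolkit.

```python
def is_minimal_pair(pron_a, pron_b):
    """Return True if two phoneme sequences differ in exactly one position."""
    if len(pron_a) != len(pron_b):
        return False
    differences = sum(1 for a, b in zip(pron_a, pron_b) if a != b)
    return differences == 1


# Hand-written ARPAbet transcriptions (assumed for illustration only).
LEXICON = {
    "pat": ["p", "ae", "t"],
    "bat": ["b", "ae", "t"],
    "tab": ["t", "ae", "b"],
    "dab": ["d", "ae", "b"],
    "pack": ["p", "ae", "k"],
    "pick": ["p", "ih", "k"],
}

print(is_minimal_pair(LEXICON["pat"], LEXICON["bat"]))    # True: only p/b differ
print(is_minimal_pair(LEXICON["pack"], LEXICON["pick"]))  # True: only ae/ih differ
print(is_minimal_pair(LEXICON["pat"], LEXICON["tab"]))    # False: two positions differ
```

Because the comparison works on phoneme sequences rather than letters, it is unaffected by the spelling mismatches noted above.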


+ ARPAbet Vowels

     b_d         ARPA         b_d     ARPA
 1   bead        iy       9   bode    ow
 2   bid         ih      10   booed   uw
 3   bayed       ey      11   bud     ah
 4   bed         eh      12   bird    er
 5   bad         ae      13   bide    ay
 6   bod(y)      aa      14   bowed   aw
 7   bawd        ao      15   Boyd    oy
 8   Budd(hist)  uh

Sounds from Ladefoged

Note: Many speakers pronounce “Buddhist” with the vowel [uw] as in “booed”, so for them [uh] is instead the vowel in “put” or “book”.


+ Speech Spectrogram for “I’d like to order”


+ The Speech Chain (Denes and Pinson)

[Figure: the speech chain, linking SPEAKER and HEARER]

Articulatory Phonetics


+ George Miller figure

• Articulation and Resonance
  • Shape of vocal tract
• Phonation
  • Airstream sets vocal folds in motion. Vibration of vocal folds produces sounds.
• Respiration
  • We (normally) speak while breathing out. Respiration provides airflow: a “pulmonic egressive airstream”.

Recognizing speech: separating the filter from the source


+ Phonation: Larynx and Vocal Folds

• The Larynx (voice box)
  • A structure made of cartilage and muscle
  • Located above the trachea (windpipe) and below the pharynx (throat)
  • Contains the vocal folds
  • (adjective for larynx: laryngeal)
• Vocal Folds (older term: vocal cords)
  • Two bands of muscle and tissue in the larynx
  • Can be set in motion to produce sound (voicing)

Text from slides by Sharon Rose, UCSD LING 111 handout


+ Voicing

• Air comes up from the lungs
• It forces its way through the vocal folds, pushing them open (2, 3, 4)
• This causes air pressure in the glottis to fall, since:
  • when gas runs through a constricted passage, its velocity increases (Venturi tube effect)
  • this increase in velocity results in a drop in pressure (Bernoulli principle)
• Because of the drop in pressure, the vocal cords snap together again (6-10)
• Single cycle: ~1/100 of a second.

Figure & text from John Coleman’s web site
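The two physical steps in the voicing cycle can be written out as a rough worked equation. This is an idealized sketch (steady, incompressible flow), and the symbols are assumptions introduced here: $A$ is cross-sectional area, $v$ air velocity, $p$ pressure, $\rho$ air density, with subscript 1 for the trachea below the glottis and subscript 2 for the glottal constriction.

```latex
\begin{align*}
A_1 v_1 &= A_2 v_2
  &&\Rightarrow\quad v_2 = v_1 \frac{A_1}{A_2} > v_1 \quad \text{when } A_2 < A_1
  && \text{(Venturi effect)} \\
p_1 + \tfrac{1}{2}\rho v_1^2 &= p_2 + \tfrac{1}{2}\rho v_2^2
  &&\Rightarrow\quad p_2 = p_1 - \tfrac{1}{2}\rho\,\bigl(v_2^2 - v_1^2\bigr) < p_1
  && \text{(Bernoulli principle)} \\
f_0 &= \frac{1}{T} \approx \frac{1}{0.01\ \text{s}} = 100\ \text{Hz}
  && && \text{(one cycle per 1/100 s)}
\end{align*}
```

The last line simply restates the cycle length as a fundamental frequency of roughly 100 Hz.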


+ Voicelessness

• When the vocal cords are open, air passes through unobstructed
• Voiceless sounds: p/t/k/s/f/sh/th/ch
• If the air moves very quickly, the turbulence causes a different kind of phonation: whisper


+ Articulators and resonance

From Mark Liberman’s web site, from Language Files (7th ed.)


+ Consonants and Vowels

• Consonants: phonetically, sounds with audible noise produced by a constriction
• Vowels: phonetically, sounds with no audible noise produced by a constriction

Text adapted from John Coleman


+ Place of articulation

• Coronal (tip of the tongue)
  • Dental: th/dh
  • Alveolar: t/d/s/z/l
  • Post-alveolar/palatal: sh/zh/y
• Dorsal (back of the tongue)
  • Velar: k/g/ng
• Lips
  • Bilabial: p/b/m
  • Labiodental: f/v
• Glottis
  • Glottal stop, as in Cockney “bottle”

[Figure: vocal tract labeled with places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal]

Figure thanks to Jennifer Venditti


+ Manner of Articulation

• Stop: complete closure of articulators, so no air escapes through the mouth
  • Oral stop: palate is raised, no air escapes through the nose. Air pressure builds up behind the closure and explodes when released
    • p, t, k, b, d, g
  • Nasal stop: oral closure, but palate is lowered, so air escapes through the nose
    • m, n, ng
• Fricative
  • Close approximation of two articulators, resulting in turbulent airflow between them
    • f, v, s, z, th, dh
• Affricate
• Approximant
• Tap or flap

[Figure: two vocal tract panels, labeled Oral and Nasal]


+ Articulatory parameters for English consonants (in ARPAbet)

MANNER OF       PLACE OF ARTICULATION
ARTICULATION    bilabial  labiodental  interdental  alveolar  palatal  velar  glottal
stop            p b                                 t d                k g    q
fric.                     f v          th dh        s z       sh zh           h
affric.                                                       ch jh
nasal           m                                   n                  ng
approx.         w                                   l/r       y
flap                                                dx

VOICING: within each pair, the voiceless sound is on the left and the voiced sound on the right

Table from Jennifer Venditti
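One way to make the place/manner/voicing parameters concrete is a small lookup table. The sketch below covers only a subset of the consonants; the dictionary layout and the helper names `find` and `voicing_partner` are illustrative choices, not a standard API.

```python
# (place, manner, voiced) for a subset of ARPAbet consonants.
CONSONANTS = {
    "p": ("bilabial", "stop", False),
    "b": ("bilabial", "stop", True),
    "t": ("alveolar", "stop", False),
    "d": ("alveolar", "stop", True),
    "k": ("velar", "stop", False),
    "g": ("velar", "stop", True),
    "f": ("labiodental", "fricative", False),
    "v": ("labiodental", "fricative", True),
    "s": ("alveolar", "fricative", False),
    "z": ("alveolar", "fricative", True),
    "m": ("bilabial", "nasal", True),
    "n": ("alveolar", "nasal", True),
    "ng": ("velar", "nasal", True),
}


def find(place=None, manner=None, voiced=None):
    """Return the sorted phones that match every feature that is not None."""
    matches = []
    for phone, (p, m, v) in CONSONANTS.items():
        if place is not None and p != place:
            continue
        if manner is not None and m != manner:
            continue
        if voiced is not None and v != voiced:
            continue
        matches.append(phone)
    return sorted(matches)


def voicing_partner(phone):
    """Return the phone with the same place and manner but opposite voicing, or None."""
    place, manner, voiced = CONSONANTS[phone]
    partners = find(place=place, manner=manner, voiced=not voiced)
    return partners[0] if partners else None


print(find(place="alveolar", manner="stop"))  # ['d', 't']
print(voicing_partner("s"))                   # z
print(voicing_partner("m"))                   # None
```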


+ Vowels

[Figure: spectra of the vowels IY, AA, and UW; the peaks are the formants]

Fig. from Eric Keller


+ Vowels

• Characterized by “formants”: bands of energy
• Each vowel has 2 characteristic pitches
  • lower is the 1st formant
  • higher is the 2nd formant
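As a rough illustration of how two formants can identify a vowel, the sketch below picks the nearest reference vowel in (F1, F2) space. The reference values are approximate adult-male averages of the kind commonly quoted from Peterson and Barney (1952); they are assumptions for this example and should be checked before being relied on. `nearest_vowel` is a hypothetical helper name.

```python
import math

# Approximate (F1, F2) in Hz for three vowels; illustrative values only.
REFERENCE_FORMANTS_HZ = {
    "iy": (270, 2290),
    "aa": (730, 1090),
    "uw": (300, 870),
}


def nearest_vowel(f1, f2):
    """Return the reference vowel whose (F1, F2) is closest in Euclidean distance."""
    return min(
        REFERENCE_FORMANTS_HZ,
        key=lambda vowel: math.dist((f1, f2), REFERENCE_FORMANTS_HZ[vowel]),
    )


print(nearest_vowel(310, 2200))  # iy
print(nearest_vowel(700, 1150))  # aa
print(nearest_vowel(320, 900))   # uw
```

Real recognizers do not classify on raw formants like this; the point is only that low F1 with high F2 separates [iy] from [uw], which has low F1 and low F2.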


+ [iy] vs. [uw]

Figure from Jennifer Venditti, from a lecture given by Rochelle Newman


+ American English Vowel Space

[Figure: vowel chart with axes FRONT to BACK and HIGH to LOW, showing iy, ih, eh, ae, aa, ao, ah, ax, ix, ux, uh, uw, ow, aw, oy]

Figure from Jennifer Venditti


+ More phonetic structure

• Syllables
  • Composed of vowels and consonants. Not well defined. Something like a “vowel nucleus with some of its surrounding consonants”.


+ More phonetic structure

• Stress
  • Some syllables have more energy than others
  • Stressed syllables versus unstressed syllables
  • (an) ‘INsult vs. (to) in’SULT
  • (an) ‘OBject vs. (to) ob’JECT
• Simple model: every multi-syllabic word has one syllable with “primary stress”
  • We can represent this by using the number “1” on the vowel (and implicitly leaving the other vowels unmarked)
  • “table”: t ey1 b ax l
  • “machine”: m ax sh iy1 n
• Also possible: “secondary stress”, marked with a “2”
  • “information”: ih2 n f axr m ey1 sh ax n
• Third category: reduced (schwa)
  • ax
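The stress-digit notation is easy to process mechanically. The following is a minimal sketch, assuming phones are lowercase and a stress digit may be attached either directly (`ey1`) or with a hyphen (`ey-1`); `split_stress` and `primary_stress_index` are hypothetical helper names.

```python
import re

PHONE_PATTERN = re.compile(r"([a-z]+)-?([12])?")


def split_stress(phone):
    """Split 'ey1' or 'ey-1' into ('ey', 1); unmarked phones give (phone, None)."""
    match = PHONE_PATTERN.fullmatch(phone)
    if match is None:
        raise ValueError("unrecognized phone: %r" % phone)
    base, digit = match.groups()
    return base, int(digit) if digit else None


def primary_stress_index(pronunciation):
    """Return the position of the phone carrying primary stress, or None."""
    for index, phone in enumerate(pronunciation.split()):
        if split_stress(phone)[1] == 1:
            return index
    return None


print(split_stress("ey1"))                                # ('ey', 1)
print(primary_stress_index("t ey1 b ax l"))               # 1
print(primary_stress_index("ih2 n f axr m ey1 sh ax n"))  # 5
```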


+ Multi syllable words


+ She came back and started again

1. SH: lots of high-freq energy
3. closure for K in “came”
4. burst of aspiration for K
5. EY vowel; faint 1100 Hz formant is nasalization
6. bilabial nasal
8. AE; note upward transitions after bilabial stop at beginning
9. note F2 and F3 coming together for “K”
10. D is lost between N and S

SH-IY-K-EY-M-B-AE-K-AX-N-D-S-T-AA-R-T-DX-IX-D-AX-G-EH-N

© MM Consulting 2015. From Ladefoged, “A Course in Phonetics”


+ Resource Management data

• DARPA Resource Management Continuous Speech Database (RM1) is a two-CDROM set:
  • rm1_audio1 corresponds to the merged
    • NIST Corpus 2-1.1 and 2-2.1 (Speaker-Dependent Training Data)
    • NIST Corpus 2-3.1 (Speaker-Independent Training Data)
  • rm1_audio2 corresponds to two separate sets:
    • NIST Corpus 2-4.2 (Development Test and Evaluation Test Data and Scoring Software)
    • NIST Corpus 2-5.1 (Isolated- and Spelled-Word Data)
• DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) is a one-CDROM set:
  • rm2_audio corresponds to the merged
    • NIST Corpus 3-1.2 and 3-2.2 (Training, Extended Training, Development Test and Evaluation Test Data)


+ Kaldi RM Data Prep

• Takes the dictionary, text (snor format), bigram grammar, speaker info …
  • Resources supplied by research groups
  • You get what you get …
• “rm_data_prep” rewrites the information in the Kaldi format (as in the Y/N tutorial)


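+ Dictionary formats

• SRI format
    Aberdeen q ae+1 bcl b axr dcl d iy+1 n
    aboard ax bcl b ao+1 r dcl
    above ax bcl b ah+1 v
    add q ae+1 dcl
• ARPAbet format
    ABERDEEN ae b er d iy n
    ABOARD ax b ao r dd
    ABOVE ax b ah v
    ADD ae dd

Perl script

    # @LineArray is presumed to hold the whitespace-split fields of $line
    # (index 0 is the word itself, so the loop starts at 1).
    $line =~ s/\+1//g;    # strip the primary-stress marker
    for ($i = 1; $i < @LineArray; $i++) {
        if ($LineArray[$i] eq 'bcl') {
            # a closure followed by its release is dropped; otherwise emit "b"
            if ($LineArray[$i+1] ne 'b') { print "b "; }
        } elsif ($LineArray[$i] eq 'dcl') {
            # a d-closure with no release becomes "dd"
            if ($LineArray[$i+1] ne 'd') { print "dd "; }
        } …
        else { print "$LineArray[$i] "; }
    }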
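For comparison, here is a Python sketch of the same conversion. The `bcl`/`dcl` handling mirrors the Perl fragment. The slide elides the remaining rules, so the two extra cases here (dropping `q` and mapping `axr` to `er`) are assumptions inferred only from the example input and output shown, and `sri_to_arpabet` is a hypothetical function name.

```python
def sri_to_arpabet(entry):
    """Convert one SRI-format dictionary line to the ARPAbet-style line."""
    word, *phones = entry.split()
    phones = [phone.replace("+1", "") for phone in phones]  # strip stress marker
    output = []
    for index, phone in enumerate(phones):
        following = phones[index + 1] if index + 1 < len(phones) else None
        if phone == "bcl":
            # Closure followed by its release is dropped; otherwise emit "b".
            if following != "b":
                output.append("b")
        elif phone == "dcl":
            # Unreleased d-closure becomes "dd".
            if following != "d":
                output.append("dd")
        elif phone == "q":
            # ASSUMPTION: glottal stop is dropped (inferred from the examples).
            continue
        elif phone == "axr":
            # ASSUMPTION: axr maps to er (inferred from the examples).
            output.append("er")
        else:
            output.append(phone)
    return word.upper() + " " + " ".join(output)


for line in [
    "Aberdeen q ae+1 bcl b axr dcl d iy+1 n",
    "aboard ax bcl b ao+1 r dcl",
    "above ax bcl b ah+1 v",
    "add q ae+1 dcl",
]:
    print(sri_to_arpabet(line))
```

Run on the four SRI entries from the slide, this reproduces the four ARPAbet lines listed there.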


+ Context dependency

• Monophones
  • dx eh el en er ey f g hh ih iy jh
• Position-dependent phones
  • aa_B aa_E aa_I aa_S ae_B ae_E ae_I ae_S ah_B ah_E ah_I ah_S ao_B ao_E ao_I ao_S aw_B aw_E
• Triphones
• Xxxphones (quin-, sep-, …)
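The position suffixes and the triphone idea can be sketched as follows. The suffixes are read here as B = word-beginning, I = word-internal, E = word-end, S = single-phone word. Padding the triphone context with `sil` at the edges is an assumption made for this example, and both function names are hypothetical.

```python
def add_word_positions(phones):
    """Tag each phone of one word with _B, _I, _E, or _S."""
    if not phones:
        return []
    if len(phones) == 1:
        return [phones[0] + "_S"]
    tagged = [phones[0] + "_B"]
    tagged += [phone + "_I" for phone in phones[1:-1]]
    tagged.append(phones[-1] + "_E")
    return tagged


def triphones(phones, boundary="sil"):
    """Return (left, center, right) contexts; edges are padded with `boundary`."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


print(add_word_positions(["ax"]))                  # ['ax_S']
print(add_word_positions(["ax", "b", "ah", "v"]))  # ['ax_B', 'b_I', 'ah_I', 'v_E']
print(triphones(["ae", "dd"]))                     # [('sil', 'ae', 'dd'), ('ae', 'dd', 'sil')]
```

Widening the window in `triphones` from one neighbor to two or three on each side gives the quin- and sep-phone variants.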


+ From dictionary to graph

• Create an FST based on the dictionary

fstprint --isymbols=data/lang/phones.txt --osymbols=data/lang/words.txt data/lang/L.fst | head -12
0 1 <eps> <eps> 0.693147182
0 1 sil <eps> 0.693147182
1 2 sil_S !SIL 0.693147182
1 1 sil_S !SIL 0.693147182
1 1 ax_S A 0.693147182
1 2 ax_S A 0.693147182
1 3 ey_B A42128
1 15 ey_B AAW
1 21 ae_B ABERDEEN
1 26 ax_B ABOARD
1 30 ax_B ABOVE
1 33 ae_B ADD
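To read the arc listing, it helps to know that each arc line is `source destination input-symbol output-symbol [weight]`, and that an omitted weight means 0. Assuming the weights are negative natural-log probabilities (the usual setup for this kind of lexicon FST), 0.693147182 corresponds to a probability of 0.5. The sketch below parses a single arc line; `parse_arc` is a hypothetical helper, not part of OpenFst or Kaldi.

```python
import math


def parse_arc(line):
    """Parse one arc line of fstprint text output into a dict."""
    fields = line.split()
    if len(fields) < 4:
        # Lines with fewer fields describe final states, not arcs.
        raise ValueError("not an arc line: %r" % line)
    weight = float(fields[4]) if len(fields) > 4 else 0.0
    return {
        "src": int(fields[0]),
        "dst": int(fields[1]),
        "in": fields[2],
        "out": fields[3],
        "weight": weight,
        # ASSUMPTION: weight is a negative natural-log probability.
        "prob": math.exp(-weight),
    }


arc = parse_arc("0 1 sil <eps> 0.693147182")
print(arc["in"], arc["out"], round(arc["prob"], 3))  # sil <eps> 0.5
print(parse_arc("1 21 ae_B ABERDEEN")["weight"])     # 0.0
```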


+ Disambiguation symbols

• Marks the end of phoneme sequences that are ambiguous
• Required so the WFST is determinizable
• From L_disambig.fst:

Do
1 1170 d_B Do
1170 1171 uw_E <eps>
1171 1 #1 <eps>
1171 3 #1 <eps>

Due
1 1226 d_B DUE
1226 1227 uw_E <eps>
1227 1 #2 <eps>
1227 3 #2 <eps>
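The idea behind the `#1` and `#2` markers can be sketched as follows. This is a simplified illustration, not Kaldi's actual script: a pronunciation gets a marker when it is shared by several words or is a proper prefix of another pronunciation, and markers are numbered per phone sequence. Position suffixes are left out for brevity, and `add_disambig` is a hypothetical name.

```python
from collections import Counter, defaultdict


def add_disambig(lexicon):
    """lexicon: list of (word, phones). Append '#k' to ambiguous pronunciations."""
    pronunciations = [tuple(phones) for _, phones in lexicon]
    counts = Counter(pronunciations)

    # Collect every proper prefix of every pronunciation.
    prefixes = set()
    for pron in pronunciations:
        for length in range(1, len(pron)):
            prefixes.add(pron[:length])

    next_index = defaultdict(int)
    result = []
    for word, phones in lexicon:
        key = tuple(phones)
        if counts[key] > 1 or key in prefixes:
            next_index[key] += 1
            result.append((word, list(phones) + ["#%d" % next_index[key]]))
        else:
            result.append((word, list(phones)))
    return result


LEXICON = [
    ("DO", ["d", "uw"]),
    ("DUE", ["d", "uw"]),
    ("ADD", ["ae", "dd"]),
]
for word, phones in add_disambig(LEXICON):
    print(word, " ".join(phones))
```

With this toy lexicon the two homophones end in `#1` and `#2`, matching the pattern in the listing, while the unambiguous entry is left unchanged.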