16
1 HCS 7367 Speech Perception Dr. Peter Assmann Fall 2014 Phonetics Anatomical, acoustic, neural properties Production and perception Focuses on continuous/physical properties Basic unit: phone (=speech sound) Narrow transcription: [ p h ] Phonology Abstract representations mediate between the speech stream and its interpretation as meaningful words Discrete entities (phonemes, syllables, ...) Basic unit: phoneme (smallest contrastive unit of a language) Broad transcription: / p / Phonology Phonemes: minimum meaning- differentiating units in a language Phonemic inventory: list of the phonemes of a given language American English has ~12 vowel phonemes, 3 diphthongs, ~24 consonants Phonology Minimal pairs test: criterion for deciding whether two sounds belong to the same phonemic category. If two words or phrases in a given language differ only in one phonological segment (e.g. English “pin” and “bin”) but are perceived as distinct words, then that sound contrast contains two distinct phonemes (/p/ and /b/). Phonology Allophones: variants (pronunciations) of a given phoneme In English, the “p” sound in “pot” is aspirated (pronounced [ p h ɔt ]) while the “p” in “spot” is unaspirated (pronounced [ spɔt ]). [p h ] and [p] are allophones of the phoneme /p/

Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

  • Upload
    others

  • View
    10

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

1

HCS 7367Speech Perception

Dr. Peter AssmannFall 2014

Phonetics

Anatomical, acoustic, neural properties

Production and perception

Focuses on continuous/physical properties

Basic unit: phone (=speech sound)

Narrow transcription: [ ph ]

Phonology

Abstract representations mediate between the speech stream and its interpretation as meaningful words

Discrete entities (phonemes, syllables, ...)

Basic unit: phoneme (smallest contrastive unit of a language)

Broad transcription: / p /

Phonology

Phonemes: minimum meaning-differentiating units in a language

Phonemic inventory: list of the phonemes of a given language

American English has ~12 vowel phonemes, 3 diphthongs, ~24 consonants

Phonology

Minimal pairs test: criterion for deciding whether two sounds belong to the same phonemic category. If two words or phrases in a given language

differ only in one phonological segment (e.g. English “pin” and “bin”) but are perceived as distinct words, then that sound contrast contains two distinct phonemes (/p/ and /b/).

Phonology

Allophones: variants (pronunciations) of a given phoneme In English, the “p” sound in “pot” is aspirated

(pronounced [ phɔt ]) while the “p” in “spot” is unaspirated (pronounced [ spɔt ]).

[ph] and [p] are allophones of the phoneme /p/

Page 2: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

2

Phonology

Jakobson, Fant and Halle (1961). Preliminaries to Speech Analysis. Distinctive feature theory: universal set of (binary)

features, combining articulation and perception Chomsky and Halle (1968). The Sound Pattern of

English Phonological representations as sequences of segments

(phonemes) composed of distinctive features. Two levels: underlying abstract representation and

surface phonetic representation

Consonant Perception

What is a consonant?

Consonants are sounds produced with a major constriction somewhere in the vocal tract.

English Consonants 

Ladefoged, P. (1993). A Course in Phonetics. Harcourt Brace: Fort Worth. 3rd Ed.

Man

ner of articulation

Place of articulation  BilabialLabio‐dental

Dental Alveolar Palatal Velar Glottal

Voicing  – + – + – + – + – + – + –

Stop p b t d k g

Nasal m n ŋFricative f v θ ð s z ∫ ʒ h

Affricate t∫ ʤApproximant (w) r j w

Lateral l

Consonant production

American English has 24 consonant phonemes

The standard phonetic classification system uses three main features to classify the consonants:

Place of articulation

Manner of articulation

Voicing

Consonant classification

Place of articulation

bilabial, labiodental, dental, alveolar, palatal, 

velar, glottal

Manner of articulation

stops, nasal, fricatives, affricates, approximant, 

lateral

Voicing

voiced, voiceless

Stop consonants in English

/ba/ /da/ /ga/

/pa/ /ta/ /ka/

bilabial alveolar velar

Place of articulation

Vo

icin

g voic

edvo

icel

ess

Manner of articulation = stop consonants

Page 3: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

3

Consonant production

Speech production theory

Source‐filter concept still applies

Relatively narrow constriction or closure

Secondary sound source in the upper vocal tract

Source‐filter theory for consonants

Consonants: sounds produced with a major 

constriction somewhere in the vocal tract.

Source‐filter theory still applies, but there are 

complications:

1. secondary sound source within the vocal tract

2. secondary source is aperiodic, turbulent noise (frication) rather 

than quasi‐periodic like the glottal source

3. secondary source may combine with the glottal source to produce 

mixed excitation (voiced fricatives)

Acoustic tube perturbation and vowel formantsSource: Stevens (1999, p. 286)

Filter properties of stop consonants

2.0 2.5 3.01.0

1.5

2.0

F3 frequency (kHz)

F2

fre

qu

en

cy (

kHz)

After Stevens (1997)

alveolar

labial

velar

glottislips

Area functions: cylindrical tube approximations of the vocal tract

Source‐filter theory for consonants

Obstruents (fricatives, affricates, stops)

Sonorants (nasals, liquids, glides)

In obstruents, the constriction can divide the resonating vocal tract into two separate cavities.

coupled vs. uncoupled

interactions with the source

Page 4: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

4

Schematic model for fricative /s/

After Kent, Dembowski & Lass (1996)

Glottis

Back Cavity Front Cavity

Obstruction(maxillary incisors)

Lips

Anteriorconstriction

Jet

Eddies

Time (ms)

Voiceless unaspirated stop consonant

[pɑ]

Place of articulation

What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?

Place of articulation

What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?

1. Frequency of burst 

2. Formant transitions (F2 and F3) from the burst to the vowel

Acoustic cues for place of articulation in stops

Burst cues

Release burst, frication noise, aspiration noise

Spectral shape

Static vs. dynamic

Formant transition cues

F2, F3 onset vs. target (vowel) frequency 

Formant transitions

During the closure for a stop consonant the vocal tract is completely closed, and no sound is emitted. At the moment of release the resonances change rapidly. These changes are called formant transitions.

Page 5: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

5

Formant transitions

The first formant (F1) always increases in frequency following the release of a stop closure. F2 and F3 increases or decreases, depending on the place of constriction and the flanking vowels.

Burst cues

The burst is a brief acoustic transient that occurs at the moment of consonant release. 

Burst intensity is higher for voiceless stops than for voiced stops

Burst frequency varies as a function of place

burst voicing onset

Burst cues

Is consonant place information available when the burst is removed?

Stop consonants(from Kent and Read, 1992)

Syllable-initialprevocalic stop

Closure(stop gap)

Release(noise burst)

Transition(formant transitions)

Aspirated

Unaspirated

/ pɑ /

ata

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 400 5000

1

2

3

4

Alveolar stop

Closure(stop gap)

Release(noise burst)

Formant transitions

F1

F2

F3

/ ada /

• high F2• high F3

aka

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 400 5000

1

2

3

4

Velar stop

Closure(stop gap)

Release(noise burst)

Formant transitions

F1

F2

F3

/aga/

• high F2• low F3

Page 6: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

6

apa

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 400 5000

1

2

3

4

Bilabial stop

Closure(stop gap)

Release(noise burst)

Formant transitions

F1

F2

F3

/aba/

• low F2• low F3

Stop consonants(from Kent and Read, 1992)

Syllable-finalpostvocalic stop

Transition(formant transitions)

Closure(stop gap)

Release (noise burst)or

No release (no burst)

/ /

Alveolar stop

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 4000

1

2

3

4

/ad/

• high F2• high F3

F2

F3

ap

Time (ms)

Fre

qu

ency

(kH

z)

0 50 100 150 200 2500

1

2

3

4

F2

F3

/ab/

• low F2• low F3

Bilabial stop

Burst cues:  /ba/ and /pa/

Time (ms)

Fre

quen

cy (

kHz)

0 50 100 150 200 250 300 350 4000

1

2

3

4

/ba/

Time (ms)0 50 100 150 200 250 300 350 400

0

1

2

3

4

/pa/

• Bilabials: Burst energy is concentrated at low frequencies

Burst cues:  /da/ and /ta/

• Alveolars: Burst energy is concentrated at high frequencies

Time (ms)

Fre

quen

cy (

kHz)

0 50 100 150 200 250 3000

1

2

3

4

/da/

Time (ms)0 50 100 150 200 250 300 350

0

1

2

3

4

/ta/

Page 7: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

7

Burst cues: /ga/ and /ka/

• Velars: Burst energy is concentrated in the mid range

Time (ms)

Fre

quen

cy (

kHz)

0 50 100 150 200 250 300 3500

1

2

3

4

/ka/

Time (ms)

Fre

quen

cy (

kHz)

0 50 100 150 200 250 3000

1

2

3

4

/ga/

Place of articulation

What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?

Place of articulation

What are the acoustic cues for place of articulation (i.e., what acoustic properties do listeners use to distinguish /p/, /t/, /k/ (and /b/, /d/, /g/) from each other)?

1. Frequency of burst 

2. Formant transitions (F2 and F3) from the burst to the vowel

Acoustic cues for place of articulation in stops

Burst cues

Release burst, frication noise, aspiration noise

Spectral shape

Static vs. dynamic

Formant transition cues

F2, F3 onset vs. target (vowel) frequency 

Formant transitions

During the closure for a stop consonant the vocal tract is completely closed, and no sound is emitted. At the moment of release the resonances change rapidly. These changes are called formant transitions.

Formant transitions

The first formant (F1) always increases in frequency following the release of a stop closure. F2 and F3 increases or decreases, depending on the place of constriction and the flanking vowels.

Page 8: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

8

Burst cues

The burst is a brief acoustic transient that occurs at the moment of consonant release. 

Burst intensity is higher for voiceless stops than for voiced stops

Burst frequency varies as a function of place

burst voicing onset

Burst cues

Is consonant place information available when the burst is removed?

Invariance problem for stop consonants

A

NOISEBURSTS VOWELS

i

1K

2K

3K

4K

e ucB C

Cooper, Delattre, Liberman, Borst & Gerstman (1952) JASA 24, 597-606

burst + transitionless vowels0.5

1.0

1.5

2.0

2.5

3.0

3.5

i e a c o u

Majority responses

Bur

st f

req

uenc

y (k

Hz)

t

p k p

p

Liberman, Delattre, and Cooper (1952)

t

k pp

p

Invariance problem for stop consonants

Invariance problem for stop consonants

Transition + vowel without burst

Invariance problem for stop consonants

Page 9: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

9

Categorical Perceptionof stop consonantsEqual acoustic changes  unequal auditory percepts

place of articulation of stops: /b/ vs /d/ vs /g/

Liberman, Harris, Hoffman, and Griffith (1957)Journal of Experimental Psychology 54, 358-368

b d g

Categorical PerceptionPhoneme boundary (50%)

Identification(labeling) function

Discrimination function

Categorical Perception

1. Identification function shows steep slope (cross‐over) between categories

2. Good between‐category discrimination (near perfect) when the members of the pair belong to different categories

3. Poor within‐category discrimination (near chance) when they are perceived to belong to the same category

time

"bag"

Where are the segments?

F1

F2

F3

/bæg/

Segmentation problem for stops

Same noise ‐ different consonant

pea ka

1400 Hz

F 2

F 1

Different transition ‐ same consonant

dee da

1400 Hz

Page 10: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

10

Motor Theory

Invariance problem

Segmentation problem

Categorical perception

Mapping from continuous to discrete

Encodedness of speech

Units of perception (segments, syllables..)

Linguistic units, acoustics and articulation

K.N. Stevens

Quantal theory Certain sounds are favored in the

languages of the world because their acoustic properties can be produced with a wide range of articulations.

Articulatoryvariations

Acousticconsequences

K.N. Stevens

Distinctive feature theory syllables /bi/, /du/, … segments /b/, /d/, … features [+voiced], [-rounded], …

K.N. Stevens JASA 2002

Acoustic Landmarks An early stage in the processing of speech

identifies acoustic landmarks in the signal such as syllable nuclei and acoustic discontinuities corresponding to consonantal closures and releases.

Landmarks point to regions of the signal where a more detailed phonetic analysis is carried out.

K.N. Stevens JASA 2002

Lexical access from features The acoustic signal in the vicinity of these

landmarks is processed by a set of modules, each of which identifies a phonetic feature.

K.N. Stevens JASA 2002

Lexical access from features From these landmarks and features, lexical

(word) hypotheses are evaluated.

This is done using analysis-by-synthesis, in which a word sequence is hypothesized, a possible pattern of features from this sequence is internally synthesized, and the synthesized pattern is tested for a match against an acoustically derived pattern.

Page 11: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

11

Burst spectrum

Blumstein and Stevens (1979, 1980) invariance hypothesis (static spectral shape model)

The shape of the spectrum – sampled at the time of burst release – provides invariant cues specifying the place of articulation for the English stop consonants.

Static spectral shape model

Step 1: Find consonant burst onset

Temporal window

Step 2: Position a 25.6 ms half-Hamming window

0 0.5 1 1.5 2 2.5 3 3.5-50

-40

-30

-20

-10

0

10

Frequency (kHz)

Am

plit

ud

e i

n d

B

Acoustic landmarks

Step 3: Compute spectrum of windowed segment

LPC spectrum

FFT spectrum

Static spectral shape model

0 0.5 1 1.5 2 2.5 3 3.5-50

-40

-30

-20

-10

0

10

Frequency (kHz)

Am

plitu

de i

n dB

0 0.5 1 1.5 2 2.5 3 3.5-50

-40

-30

-20

-10

0

10

Frequency (kHz)

Am

plitu

de i

n dB

0 0.5 1 1.5 2 2.5 3 3.5-50

-40

-30

-20

-10

0

10

Frequency (kHz)

Am

plitu

de i

n dB

Diffuse falling Diffuse rising Compact

/ ba / / da / / ga /

Burst spectra of syllable-initial stop consonants

Static spectral shape model

Predictions: 

1) spectral shape for a given place of articulation is the same for different talkers and in different phonetic contexts. 

2) acoustic modifications that distort other aspects of the syllable but preserve the spectral shape near the burst do not affect consonant identity.

Page 12: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

12

Dynamic properties

Kewley‐Port (1983) hypothesized that the 

identification of stop consonants is based on 

time‐varying changes in the spectrum from the 

onset of the burst into the transition region.

Locus Equations

Delattre et al. (1954) described the locus as the frequency location of F2 extrapolated back in time to the consonant release.

locus

Onset of voiced formant transition

Onset of vowel target

F2 of /i/

F2 of /u/

Problems with the locus concept

No physical energy exists at the time/frequency position of the locus; actual extrapolation of the formant transition may lead to a change in consonant identity. 

locus

Onset of voiced formant transition

Onset of vowel target

Locus equations

Sussman et al. (1991, 1993) proposed that locus equations provide invariantrelational cues for the perception of place of articulation in stop consonants.

Locus equations describe the relationship between the frequency of F2 at burst onset and F2 in the vowel.

ba

Time (ms)

Fre

que

ncy

(kH

z)

0 100 200 300 4000

1

2

3

4

Locus equationsda

Time (ms)

Fre

que

ncy

(kH

z)

0 50 100 150 200 250 3000

1

2

3

4

Locus equations

Page 13: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

13

ga

Time (ms)

Fre

que

ncy

(kH

z)

0 50 100 150 200 250 3000

1

2

3

4

Locus equations Locus equations

Sussman (1991)

Smits, ten Bosch, and Collier (1996)

Gross cues 

spectral features that are distributed across frequency or time

overall spectral shape or its relative change over time

Detailed cues

features that are narrowly localized in frequency or time

center frequency of prominent spectral peaks and their dynamic transitions.

Cues for voicing

• VOT – voice onset timeShort lag (0-20 ms) = voiced sounds /b,d,g/

Long lag (80-100 ms) = unvoiced sounds /p,t,k/

bi

Time (ms)

Fre

qu

ency

(kH

z)

0 50 100 150 200 250 3000

1

2

3

4

short lagpi

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 4000

1

2

3

4

long lag

Page 14: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

14

di

Time (ms)

Fre

qu

ency

(kH

z)

0 50 100 150 200 250 3000

1

2

3

4

short lagti

Time (ms)

Fre

qu

ency

(kH

z)

0 50 100 150 200 250 300 3500

1

2

3

4

long lag

gi

Time (ms)

Fre

qu

ency

(kH

z)

0 50 100 150 200 250 300 3500

1

2

3

4

short lagki

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 4000

1

2

3

4

long lag

aga

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 400 5000

1

2

3

4

aka

Time (ms)

Fre

qu

ency

(kH

z)

0 100 200 300 400 5000

1

2

3

4

Page 15: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

15

Voicing• Lisker and Abramson (1970)

Studied voice onset time (VOT) in several languages, some with 2 voicing categories (e.g., English, Spanish, Cantonese) and others with 3 or 4 voicing categories (e.g. Thai, Hindi, Korean)

• Prevoiced

• Voiced

• Voiceless aspirated

• Voiceless unaspirated

VOT in Dutch

/bɛn/ /pɛn/

Prevoiced and unaspirated stops

VOT in an English-Dutch bilingual child

E. Simon (2010).Child L2 development: A longitudinal case study on Voice Onset Times in word-initial stops.J. Child Lang. 37 159–173.

Voicing

• Identification functions for English VOT continuum.

• Frequency histograms of measured VOT in English stops.

Voicing

• Identification functions for Thai VOT continuum.

• Frequency histograms of measured VOT in Thai stops.

Identification and discrimination of English VOT continuum

-150 -100 -50 0 50 100 1500

20

40

60

80

100

Voice onset time (ms)

Per

cent

/b

a/ r

espo

nses

Identification function

Discriminationfunction

Hypothetical data

Page 16: Phonetics HCS 7367 Speech Perceptionassmann/hcs6367/lec3.pdfThe standard phonetic classification system uses three main features to classify the consonants: Place of articulation Manner

16

Categorical Perception

• Relationship between VOT and identification functions is nonlinear, with steep transitions and a clearly defined boundary.

• Discrimination functions show a peak near the phoneme boundary; near chance levels elsewhere.

/ a d a /

Time (ms)

Fre

que

ncy

(kH

z)

1. VOICE ONSET TIME

3. BURST AMPLITUDE

2. F1 CUTBACK

6. PREVOICING5. VOWEL

DURATION

4. FUNDAMENTALFREQUENCY

0 100 200 300 400 5000

1

2

3

4

Voicing cues in stop consonants

Voicing cues in stop consonants/ a t a /

Time (ms)

Fre

que

ncy

(kH

z)

1. VOICE ONSET TIME

3. BURST AMPLITUDE

2. F1 CUTBACK

6. PREVOICING5. VOWEL

DURATION

4. FUNDAMENTALFREQUENCY

0 100 200 300 400 5000

1

2

3

4