Phonetic features in ASR: a linguistic solution to acoustic variation?

Jacques Koreman koreman@coli.uni-sb.de

Bistra Andreeva andreeva@coli.uni-sb.de

Attilio Erriquez [email protected]

William J. Barry [email protected]

Institute of Phonetics, University of the Saarland

Germany

LabPhon 7, Nijmegen, The Netherlands
Thursday 29 June - Saturday 1 July 2000

Overview of this talk

• Modelling acoustic variation in ASR

• A phonetic representation of the signal

• Comparison with “standard” ASR

• Outlook

Modelling variation in ASR

Variation which leads to the crossing of a phonemic boundary (epenthesis, deletion and assimilation) is best modelled in the lexicon by adding pronunciation variants. Context-dependent variation can also be modelled in the lexicon by using allophones to define the lexical entries (cf. Pols, ICPhS’99).

[Diagram labels: "Distance of lexical entry to realisation" = "Confusability with other lexical entries"]

Solution 1: Pronunciation variants are only added to the lexicon for the most frequent (function) words.

Solution 2: Pronunciation variants are modelled by underspecified lexical entries in combination with ternary logic (Lahiri & Reetz).

in the lexicon

Modelling variation in ASR

Variation which does not lead to the crossing of a phonemic (or allophonic) boundary is best handled by the acoustic models:

Multiple mixtures per state can handle variation in the signal.

in the acoustic models

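[Diagram: MFCC’s + energy + delta parameters → hidden Markov modelling → phone → lexicon + language model. The phone HMM is drawn as a left-to-right chain of states from S to E, with transition probabilities p1, p2, p3 and 1-p1, 1-p2, 1-p3.]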
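A minimal sketch of why several mixtures per state can help, assuming diagonal-covariance Gaussians. The one-dimensional frames, cluster centres and variances are invented for illustration and are not taken from the experiments in this talk.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture(x, weights, means, variances):
    """Log density of a Gaussian mixture (log-sum-exp over components)."""
    terms = [
        math.log(w) + log_gauss(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

# Hypothetical frames of one HMM state: two "allophone" clusters at -2 and +2.
frames = [[-2.1], [-1.9], [-2.0], [2.0], [1.9], [2.1]]

# One mixture: a single broad Gaussian fitted to all frames.
mean1 = sum(f[0] for f in frames) / len(frames)
var1 = sum((f[0] - mean1) ** 2 for f in frames) / len(frames)
one_mix = sum(log_mixture(f, [1.0], [[mean1]], [[var1]]) for f in frames)

# Two mixtures: one narrow Gaussian per cluster (parameters set by hand).
two_mix = sum(
    log_mixture(f, [0.5, 0.5], [[-2.0], [2.0]], [[0.01], [0.01]])
    for f in frames
)

print(f"1 mixture: {one_mix:.2f}   2 mixtures: {two_mix:.2f}")
```

The total log-likelihood is higher for the two-mixture state, because the single Gaussian has to spread its probability mass over the empty region between the clusters.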

Modelling variation in ASR

Eurom0
Data: read texts for English, German, Italian, Dutch = 31 min. of speech data (train = test)
Param.: 12 MFCC’s + energy + delta’s; 15-ms Hamming window; pre-emphasis: 0.97; 5-ms step size
Speakers: 2 male + 2 female per language
Models: 3-state phone HMMs (5 states for diphthongs)
Task: HMM identification of pre-segmented phones (no lexicon or language model)

TIMIT
Data: read sentences for 8 dialects of Am.E. = training: 202 min.; test: 74 min. of speech data
Param.: id.
Speakers: 630 speakers
Models: id.
Task: HMM recognition of pre-segmented phones (no lexicon or language model)
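The front-end settings listed above (pre-emphasis 0.97, 15-ms Hamming window, 5-ms step) can be sketched as follows. The 16 kHz sampling rate and the test tone are assumptions; the slides do not state the sampling rate. The MFCC computation itself is omitted.

```python
import math

SAMPLE_RATE = 16000  # assumption: not stated on the slide
WINDOW_MS, STEP_MS, PRE_EMPHASIS = 15, 5, 0.97

def pre_emphasise(signal, coeff=PRE_EMPHASIS):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1]."""
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]

def hamming(length):
    """Hamming window coefficients."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]

def frames(signal, rate=SAMPLE_RATE):
    """Cut the pre-emphasised signal into overlapping, windowed frames."""
    win = rate * WINDOW_MS // 1000    # samples per window
    step = rate * STEP_MS // 1000     # samples per step
    window = hamming(win)
    emphasised = pre_emphasise(signal)
    return [
        [s * w for s, w in zip(emphasised[start:start + win], window)]
        for start in range(0, len(emphasised) - win + 1, step)
    ]

# 100 ms of a 440 Hz tone as a stand-in for speech.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(1600)]
framed = frames(tone)

# 12 MFCC's + energy, assuming one delta per static parameter.
FEATURE_DIM = (12 + 1) * 2
print(len(framed), "frames of", len(framed[0]), "samples;", FEATURE_DIM, "parameters per frame")
```

With a 5-ms step and a 15-ms window, neighbouring frames overlap by two thirds, so each stretch of signal contributes to three consecutive parameter vectors.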

in the acoustic models

Modelling variation in ASR

Results               Eurom0 (identification)   TIMIT (recognition)
1 mixt.:  phone       15.6%                     48.9%
          phoneme     15.6%                     53.2%
8 mixt.:  phone       63.7%                     59.6%
          phoneme     63.7%                     63.8%

in the acoustic models

• Better results for 8 mixtures than for 1, because allophonic variation is handled better.

• Difference phone vs. phoneme identification in Eurom0 experiment small, because of almost phonemic labelling (except /r/, /t/).

• Difference between 1 and 8 mixtures not so great in TIMIT exp.:

a) less variation for different dialects than for different languages

b) many allophones modelled in separate phone models
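The difference between phone and phoneme scores can be illustrated with a small scoring sketch. The label inventory, the allophone-to-phoneme mapping and the label sequences are hypothetical and only serve to show the principle: phoneme scoring collapses allophone labels before comparing.

```python
# Hypothetical allophone labels collapsed onto phoneme labels.
PHONE_TO_PHONEME = {"r": "r", "R": "r", "6": "r", "t": "t", "t_h": "t"}

def accuracy(reference, hypothesis, mapping=None):
    """Percentage of segments labelled correctly, optionally after mapping."""
    if mapping:
        reference = [mapping.get(p, p) for p in reference]
        hypothesis = [mapping.get(p, p) for p in hypothesis]
    correct = sum(r == h for r, h in zip(reference, hypothesis))
    return 100.0 * correct / len(reference)

ref = ["R", "a", "t_h", "r", "t"]   # hand labels (invented)
hyp = ["r", "a", "t", "r", "t"]     # model output (invented)

phone_score = accuracy(ref, hyp)                      # allophones must match
phoneme_score = accuracy(ref, hyp, PHONE_TO_PHONEME)  # allophones collapsed
print(phone_score, phoneme_score)
```

When the labelling is almost phonemic, few labels are affected by the mapping and the two scores stay close together, as in the Eurom0 results.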

A phonetic representation of the signal

in the acoustic models

When variation in the signal is modelled by 8-mixture phone HMMs, a possible disadvantage of the approach is that the phone models waste modelling power (by estimating too many mixtures) if there is little variation in the acoustic realisations of a phone.

Alternatively, a Kohonen map can be used to reduce the variation in the signal (similar to vector quantisation in an HMM system) before HMM modelling. In this way, variation in the signal of phones will be modelled only if necessary. This is the first theoretical advantage of the Kohonen network.

[Speaker note: mention the location of different [l]-ophones in the phonotopic map here.]

The Kohonen network has another (so far equally theoretical) advantage: it can additionally map the acoustic parameters onto phonetic features. These represent the phonologically distinctive properties of the phones. As such, they bridge the gap between the acoustic and the phonemic representations of the signal.
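A minimal sketch of the self-organisation of a Kohonen map, to make the quantisation step concrete. This is a generic textbook version on a tiny 4x4 map with invented two-dimensional data; the map size, learning rate and neighbourhood schedule are assumptions and not the settings used for the experiments in this talk.

```python
import math
import random

def best_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(
        weights,
        key=lambda n: sum((a - b) ** 2 for a, b in zip(weights[n], x)),
    )

def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a small Kohonen map; returns a dict (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    radius0 = max(rows, cols) / 2
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):        # shuffled pass
            frac = step / total
            lr = lr0 * (1 - frac)                    # decaying learning rate
            radius = max(radius0 * (1 - frac), 0.5)  # shrinking neighbourhood
            win = best_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - win[0]) ** 2 + (node[1] - win[1]) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights

# Two invented clusters standing in for two acoustically distinct realisations.
rng = random.Random(1)
data = [[0.1 + rng.uniform(-0.05, 0.05), 0.1 + rng.uniform(-0.05, 0.05)]
        for _ in range(20)]
data += [[0.9 + rng.uniform(-0.05, 0.05), 0.9 + rng.uniform(-0.05, 0.05)]
         for _ in range(20)]

som = train_som(data, rows=4, cols=4)
print(best_unit(som, [0.1, 0.1]), best_unit(som, [0.9, 0.9]))
```

After training, each frame is represented by its best-matching neuron, so acoustically similar frames end up in the same region of the map and dissimilar realisations in different regions.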

A phonetic representation of the signal

in the acoustic models

(For the sordid details re. the Kohonen network, please see article on web.)

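[Diagram: MFCC’s + energy + delta parameters → 50x50 Kohonen map → phonetic features → hidden Markov modelling → phone → lexicon + language model. The phone HMM is drawn as a left-to-right chain of states from S to E, with transition probabilities p1, p2, p3 and 1-p1, 1-p2, 1-p3.]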

A phonetic representation of the signal

in the acoustic models

The acoustic parameters (MFCC’s, energy + delta’s) are mapped onto several different phonetic feature sets:

• IPA

• SPE

• SPEus = underspecified SPE

• ArtFeat = Articulatory features (cf. Deng & Sun, JASA 1994)

Different feature sets have different implications for the similarity between phones, esp. between consonants and vowels (cf. Koreman, Andreeva & Strik, Proc. ICPhS, 1999).

Phone HMMs using a single mixture were then trained and used for identification/recognition, as in the previous experiments.

Comparison with “standard” ASR (1)

Results                        Eurom0 (identification)   TIMIT (recognition)
IPA (1 mixt.):      phone      42.3%                     27.6%
                    phoneme    42.6%                     31.9%
SPE (1 mixt.):      phone      35.6%                     30.9%
                    phoneme    36.2%                     35.4%
SPEus (1 mixt.):    phone      46.0%                     32.4%
                    phoneme    46.1%                     37.1%
ArtFeat (1 mixt.):  phone      -                         27.8%
                    phoneme    -                         32.1%
AcPar (8 mixt.):    phone      63.7%                     59.6%
                    phoneme    63.7%                     63.8%

results

Comparison with “standard” ASR (1)

• Underspecified SPE features lead to better results than any other feature set, because the lack of redundancy leads to greater distinctiveness of the phones.
  - Statistical methods like PCA or LDA can also reduce the redundancy in the signal. They perform a global decorrelation across the data, which leads to optimal phone(me) recognition.
  - This optimum is only obtained at the cost of less frequent phones, which have a minor effect on the overall correlations between input parameters.
  - Underspecification is in this respect a more interesting way of decorrelating data, as it preserves the distinctiveness between all phone(me)s – which should be a long-term goal of ASR.

• Results for phonetic features of any type are lower than when the HMMs use acoustic parameters to model phones.
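The global decorrelation performed by PCA can be sketched for the simplest case of two correlated input parameters, where the principal axes follow from a single rotation angle. The data values are invented. Note that the covariance is computed over all frames together, which is why frequent phones dominate the result.

```python
import math

def covariance(xs, ys):
    """Population covariance entries (cxx, cyy, cxy) of two parameters."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / n
    cyy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    return cxx, cyy, cxy

def decorrelate(xs, ys):
    """Rotate 2-D data onto its principal axes (PCA for two parameters)."""
    cxx, cyy, cxy = covariance(xs, ys)
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)  # angle of the main axis
    c, s = math.cos(theta), math.sin(theta)
    us = [c * x + s * y for x, y in zip(xs, ys)]
    vs = [-s * x + c * y for x, y in zip(xs, ys)]
    return us, vs

# Two strongly correlated (invented) input parameters.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]

us, vs = decorrelate(xs, ys)
print("covariance before:", covariance(xs, ys)[2])
print("covariance after: ", covariance(us, vs)[2])
```

The rotated parameters are uncorrelated across the whole data set, but nothing in the procedure guarantees that a rare phone remains well separated from its neighbours.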

discussion

Comparison with “standard” ASR (1)

The best results for phonetic features are considerably lower than for acoustic parameters. Two possible explanations are:

• The Kohonen network does not model the variation in the signal appropriately.

• Phonetic features do not perform well in the ASR system we use. (We shall return to this in the outlook at the end of this talk.)

discussion

Comparison with “standard” ASR (1)

Why should the Kohonen network not model variation in the signal well?

The Kohonen network organises the acoustic data phonotopically. Different (allophonic) realisations of a phoneme may therefore be modelled in different parts of the phonotopic map.

When the neurons in the Kohonen network are calibrated with phonetic features, a weighted phonetic feature vector is computed across all the frames (from different phones) that activated each neuron. Since a neuron is often activated by different phonemes located near to each other in the phonotopic map, the actual phonetic feature vector can be different for different realisations of a phoneme.

If this happens, HMMs using only 1 mixture are inappropriate to model the phonetic features.
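The calibration step described above can be sketched as follows. This is a simplified version in which the weighting is a plain average over the frames that activated a neuron; the feature inventory, the phone labels and the hit list are hypothetical.

```python
from collections import defaultdict

# Hypothetical binary feature vectors: [voiced, nasal, labial]
PHONE_FEATURES = {
    "m": [1.0, 1.0, 1.0],
    "b": [1.0, 0.0, 1.0],
    "p": [0.0, 0.0, 1.0],
}

def calibrate(hits):
    """Average the feature vectors of all frames that activated each neuron.

    hits: list of (neuron, phone_label) pairs, one per training frame.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for neuron, phone in hits:
        for i, value in enumerate(PHONE_FEATURES[phone]):
            sums[neuron][i] += value
        counts[neuron] += 1
    return {n: [v / counts[n] for v in sums[n]] for n in sums}

# Invented activations: [b] frames land on two different neurons.
hits = [((0, 0), "m"), ((0, 0), "m"), ((0, 0), "b"),
        ((0, 1), "b"), ((0, 1), "p")]
features = calibrate(hits)
print(features[(0, 0)])  # [b] here looks partly nasal
print(features[(0, 1)])  # [b] here looks partly voiceless
```

The two realisations of /b/ receive different feature vectors depending on which neighbouring phones share their neuron, which is the situation a single-mixture HMM cannot represent.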

Comparison with “standard” ASR (2)

In order to better reflect the possible variation in phonetic features, HMMs using 8 mixtures instead of 1 were used to model the phonetic features. Results for this second set of experiments using phonetic features are shown on the next slide.

Comparison with “standard” ASR (2)

Results                        Eurom0 (identification)   TIMIT (recognition)
IPA (8 mixt.):      phone      54.1%                     32.8%
                    phoneme    54.2%                     37.3%
SPE (8 mixt.):      phone      54.0%                     33.8%
                    phoneme    54.1%                     37.8%
SPEus (8 mixt.):    phone      58.3%                     35.4%
                    phoneme    58.4%                     39.9%
ArtFeat (8 mixt.):  phone      -                         33.9%
                    phoneme    -                         37.8%
AcPar (8 mixt.):    phone      63.7%                     59.6%
                    phoneme    63.7%                     63.8%

results

Comparison with “standard” ASR (2)

• When phonetic features are modelled by 8 mixtures instead of 1, we find an increase of
  - 11.6 – 18.4 percentage points for the Eurom0 phone identification exp.
  - 2.3 – 6.1 percentage points for the TIMIT phone recognition exp.
  Although the Kohonen network captures a large part of the variation, there still is some non-random variation left in the phonetic features.

• As in the previous experiment, underspecified SPE features lead to better results than any other feature set.

• Phone identification with the Eurom0 data is only 5.4 percentage points below the result obtained with acoustic parameters.
  Phone recognition with the TIMIT data is still 23.84 percentage points below the result obtained with acoustic parameters.
  This linguistic approach to modelling variation deserves further work. We shall explain why in the next two slides.

discussion

Outlook

• If we want to evaluate the potential of the method, the Kohonen network must be optimised for the TIMIT recognition task:
  - size of the Kohonen network
  - training parameters ( and )
  - equal number of frames per phone in self-organisation of the map, to better represent less frequent phones
  - equal number of frames per phone in calibration of the map, to better represent less frequent phones

• Also, the input to the Kohonen network may not be optimal. MFCC’s, energy and their corresponding delta parameters may not be best suited to estimate phonetic features from. From a purely technical point of view, the preprocessing in the Kohonen network can at best preserve all the information, but is likely to discard some. Acoustic cues derived from the signal may be better suited for our aims.
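One way to obtain an equal number of frames per phone, as proposed in the first point above, is to subsample every phone down to the size of the rarest one. This is only a sketch of that idea; the function name, the data layout and the toy data are assumptions.

```python
import random
from collections import Counter, defaultdict

def balance_frames(frames, seed=0):
    """Sample the same number of frames for every phone label.

    frames: list of (phone_label, feature_vector) pairs.
    """
    rng = random.Random(seed)
    by_phone = defaultdict(list)
    for phone, vector in frames:
        by_phone[phone].append(vector)
    quota = min(len(vectors) for vectors in by_phone.values())
    balanced = []
    for phone, vectors in by_phone.items():
        for vector in rng.sample(vectors, quota):  # without replacement
            balanced.append((phone, vector))
    rng.shuffle(balanced)
    return balanced

# Invented data: a frequent phone "a" and a rare phone "Z".
frames = [("a", [float(i)]) for i in range(50)]
frames += [("Z", [float(i)]) for i in range(5)]

balanced = balance_frames(frames)
counts = Counter(phone for phone, _ in balanced)
print(counts)
```

Subsampling discards data for the frequent phones; repeating frames of the rare phones would be an alternative with the opposite trade-off.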

short-term

word recognition

Outlook

• Even if phone(me) recognition on the basis of phonetic features does not attain the level that was reached on the basis of acoustic parameters, it may have other advantages.
  - Underspecified phonetic features can provide a linguistically optimal decorrelation of phonetic features.
  - Underspecification can be exploited when we access the lexicon with a phonetic feature representation (cf. Lahiri & Reetz).
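How underspecification and ternary logic might interact during lexical access can be sketched with a toy matcher. This is written in the spirit of the cited approach, not as a reproduction of it: the three outcomes, the feature names and the lexical entries are assumptions chosen for illustration.

```python
MATCH, NO_MISMATCH, MISMATCH = "match", "no-mismatch", "mismatch"

def compare(lexical, signal):
    """Ternary comparison of one feature: lexical vs. extracted value.

    None means the feature is unspecified (in the lexicon or the signal).
    """
    if lexical is None or signal is None:
        return NO_MISMATCH
    return MATCH if lexical == signal else MISMATCH

def accepts(entry, observed):
    """A lexical segment stays a candidate unless some feature mismatches."""
    return all(
        compare(value, observed.get(name)) != MISMATCH
        for name, value in entry.items()
    )

# Toy lexicon: place of /n/ is left unspecified, place of /m/ is specified.
LEXICON = {
    "n": {"nasal": True, "place": None},
    "m": {"nasal": True, "place": "labial"},
}

heard_m = {"nasal": True, "place": "labial"}   # features extracted from [m]
heard_n = {"nasal": True, "place": "coronal"}  # features extracted from [n]

print({seg: accepts(entry, heard_m) for seg, entry in LEXICON.items()})
print({seg: accepts(entry, heard_n) for seg, entry in LEXICON.items()})
```

In this toy setting an assimilated realisation such as [m] for /n/ does not rule out the underspecified entry, so no separate pronunciation variant is needed in the lexicon, whereas [n] is rejected for /m/.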

short-term

Outlook

• Phonetic knowledge has been a driving force for innovations in ASR, e.g. the use of generalised triphones or multiple-mixture HMMs.

• The application of phonetic knowledge may not lead to immediate results within state-of-the-art stochastic approaches and may require more basic changes in the ASR system architecture.

• The attempt to use “traditional” phonetic knowledge in ASR, which has often been acquired from controlled studies using (carefully) read speech, at the same time provides us with a means of scrutinizing phonetic theory, esp. in terms of the generalisability to continuous and spontaneous speech.

long-term

Phonetic features in ASR: a linguistic solution to acoustic variation?

Data available under “Publications” on my homepage:

http://www.coli.uni-sb.de/~koreman

If you have any questions or suggestions, mail me:

koreman@coli.uni-sb.de


Outlook

• Blind modelling systems (direct modelling of the signal by very powerful statistical techniques) are very effective, but it seems a ceiling has been reached.

• Systems in which the causes of linguistic variability are modelled in separate models (cf. Pols, ICPhS’99) would be ideal, since the knowledge that is obtained from them can be used for further linguistic processing. In practice, however,
  - we do not have enough data to train models for many different linguistic conditions,
  - we do not know all the causes of linguistic, non-random variation in the signal.
  Such systems have both eyes open for variation and allow an in-depth linguistic analysis.

long-term

Outlook

• Systems “with one eye blindfolded” (i.e. systems in which the architecture is defined to capture linguistic effects without modelling them explicitly) can be a first approach to such a system. We can model linguistic properties of the signal, even if we do not know exactly what effects they have in continuous, spontaneous speech. If the incorporation of linguistic knowledge is helpful, we can attempt to devise a new system in which it is used explicitly.

long-term