Language Identification


Oldřich Plchot, Pavel Matějka
Speech@FIT, Brno University of Technology, Czech Republic


IKR, Brno, 2012


Outline

• Why do we need LID?
• Evaluations
• Acoustic LID
• Phonotactic LID
• Fusion
• Conclusion



Why do we need language identification?

1) Route phone calls to human operators.

- Emergency (112, 155, 911)
- Call centers
- Fire brigade (150)
- Police (158)




2) Pre-select a suitable recognition system.

[Diagram: Language Identification selects among downstream systems: Translate SPA, KWS CHN, Speech2Text ENG, Translate CZE, Translate VIE, Connect]




3) Security applications: narrow the search space.



Two main approaches to LID

• Acoustic – Gaussian Mixture Model

• Phonotactic – Phoneme Recognition followed by Language Model



Acoustic approach

• Gaussian Mixture Model

- good for short speech segments and dialect recognition
- relies on the sounds
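A minimal sketch of how GMM-based acoustic scoring could work, assuming one diagonal-covariance GMM per language has already been trained. The function names and the toy two-language models are illustrative assumptions, not the system described in these slides.

```python
import numpy as np

def gmm_frame_loglik(X, weights, means, variances):
    """Log-likelihood of each frame in X (T x D) under a diagonal-covariance GMM."""
    # diff has shape (T, M, D): every frame against every Gaussian mean
    diff = X[:, None, :] - means[None, :, :]
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)  # shape (M,)
    log_comp = log_norm[None, :] - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2)
    log_comp = log_comp + np.log(weights)[None, :]
    # Numerically stable log-sum-exp over the mixture components
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def identify_language(X, models):
    """Pick the language whose GMM gives the highest average frame log-likelihood."""
    scores = {lang: gmm_frame_loglik(X, *params).mean() for lang, params in models.items()}
    return max(scores, key=scores.get), scores

# Toy example: two hypothetical 2-component models in a 2-D feature space
rng = np.random.default_rng(0)
models = {
    "lang_a": (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "lang_b": (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [6.0, 6.0]]), np.ones((2, 2))),
}
X = rng.normal(loc=5.5, scale=1.0, size=(200, 2))  # frames drawn near lang_b
best, scores = identify_language(X, models)
```

Averaging over frames keeps scores comparable across segments of different length; real systems would use many more Gaussians and real features.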



Spectral features - MFCC

[Figure: MFCC extraction pipeline. 20 ms window, 10 ms shift -> short-time FFT -> Mel filter bank -> log() -> discrete cosine transform -> one vector of cepstral coefficients per frame]
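The MFCC pipeline (framing, short-time FFT, Mel filter bank, log, DCT) could be sketched as follows. The 20 ms window and 10 ms shift come from the slide; the 8 kHz sample rate, the Hamming window, the 23 filters and the choice of 7 output coefficients are assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters equally spaced on the Mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate=8000, win_ms=20, shift_ms=10, n_filters=23, n_ceps=7):
    """Return an (n_frames, n_ceps) array of cepstral coefficients (c0 included)."""
    win = int(sample_rate * win_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_fft = 1
    while n_fft < win:                     # next power of two for the FFT
        n_fft *= 2
    n_frames = 1 + (len(signal) - win) // shift
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    log_energies = np.log(power @ mel_filterbank(n_filters, n_fft, sample_rate).T + 1e-10)
    # DCT-II basis, written out by hand to avoid extra dependencies
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n[None, :] + 1) / (2 * n_filters))
    return log_energies @ basis.T

# One second of noise at 8 kHz stands in for speech
signal = np.random.default_rng(0).standard_normal(8000)
feats = mfcc(signal)
```

With these settings one second of audio yields 99 frames of 7 coefficients each.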



Shifted delta cepstra

• Shifted delta cepstra (SDC) capture how the speech evolves around the current frame (± 0.1 s)

• Size of the final feature vector: 7 MFCC + 7 × 7 SDC = 56
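A minimal sketch of SDC stacking, assuming the widely used 7-1-3-7 configuration (N=7 coefficients, delta spread d=1, block shift P=3, k=7 blocks). This configuration reproduces the 7 + 7 × 7 = 56 dimension count but is otherwise not stated on the slide; the edge padding is also an assumption.

```python
import numpy as np

def shifted_delta_cepstra(cep, d=1, P=3, k=7):
    """cep: (T, N) cepstra. Returns (T, N + N*k): static features followed by k shifted deltas."""
    T, N = cep.shape
    # Repeat edge frames so that every index t + i*P +/- d is valid
    pad_left, pad_right = d, (k - 1) * P + d
    padded = np.pad(cep, ((pad_left, pad_right), (0, 0)), mode="edge")
    blocks = [cep]
    for i in range(k):
        start = pad_left + i * P
        ahead = padded[start + d : start + d + T]    # c(t + i*P + d)
        behind = padded[start - d : start - d + T]   # c(t + i*P - d)
        blocks.append(ahead - behind)
    return np.hstack(blocks)

# A linear ramp makes the deltas easy to check by hand
cep = np.arange(50 * 7, dtype=float).reshape(50, 7)
sdc = shifted_delta_cepstra(cep)
```

For the ramp, each coefficient grows by 7 per frame, so every delta away from the edges equals 7 × 2d = 14.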



Acoustic systems – GMM based

• Maximum likelihood (generative)
- Objective function to maximize is the likelihood of the training data given the transcription

• Maximum Mutual Information (discriminative)
- Objective function to maximize is the posterior probability of all training utterances being correctly recognized

• Advantages of discriminative training:
- Lower error rates
- Fewer parameters

• Disadvantages of discriminative training:
- Overtraining
- Sometimes computationally expensive

• Channel compensation – from the previous presentation
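The maximum likelihood and Maximum Mutual Information objectives described above can be written in one common form as follows. The notation is an assumption, not taken from the slides: O_r is the feature sequence of training utterance r, l_r its correct language label, λ the model parameters, P(l) a language prior, and κ an acoustic scaling factor that is often used in MMI training.

```latex
\begin{align}
F_{\mathrm{ML}}(\lambda)  &= \sum_{r=1}^{R} \log p_{\lambda}(O_r \mid l_r) \\
F_{\mathrm{MMI}}(\lambda) &= \sum_{r=1}^{R} \log
  \frac{p_{\lambda}(O_r \mid l_r)^{\kappa}\, P(l_r)}
       {\sum_{l} p_{\lambda}(O_r \mid l)^{\kappa}\, P(l)}
\end{align}
```

The denominator sums over all competing languages, which is what makes the MMI criterion discriminative: raising the likelihood of the correct model only helps if it also rises relative to the others.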



Highly overlapped distributions



Results on LRE 2007 (14 languages)

System / Equal Error Rate [%]            30 sec   10 sec   3 sec
GMM2048                                    8.03    12.89   21.77
GMM2048-eigchan                            2.76     7.38   17.14
GMM2048-chcf                               2.94     7.40   17.93
GMM2048-MMI-chcf (~3 MMI iterations)       2.41     7.02   16.90

The best acoustic system combines:
• Many Gaussians
• Eigen-channel compensation of features
• MMI

System / Equal Error Rate [%]            30 sec   10 sec   3 sec
GMM2048 ML                                 8.03    12.89   21.77
GMM256 ML                                   ~16
GMM256 MMI (~15 MMI iterations)            4.15     8.61   18.43
GMM256-MMI-chcf (~3 MMI iterations)        3.73     9.81   20.98
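The tables report Equal Error Rate, the operating point at which the miss rate equals the false-alarm rate. A minimal sketch of how it could be computed from detection scores follows; the threshold sweep and the function name are assumptions, and official evaluation tools may interpolate differently.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER in percent, taken at the threshold where miss and false-alarm rates are closest."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, None
    for th in thresholds:
        p_miss = np.mean(target_scores < th)       # target trials rejected
        p_fa = np.mean(nontarget_scores >= th)     # non-target trials accepted
        gap = abs(p_miss - p_fa)
        if gap < best_gap:
            best_gap, eer = gap, 0.5 * (p_miss + p_fa)
    return 100.0 * eer

# Perfectly separated scores give 0 %, identical score sets give 50 %
eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_chance = equal_error_rate([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
```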



Phonotactic approach

• Phoneme Recognition followed by Language Model (PRLM)

- good for longer speech segments
- robust against dialects in one language
- eliminates speech characteristics of speaker's native language



Phone recognizer

• 3 neural networks produce the phone posterior probabilities

• 310 ms long temporal trajectory around the current frame

• Investigation of different phone recognizers for LID => better phone recognizer ≈ better LID system



Phone recognition output

One best phone string



Phonotactic modeling - example

Trigram    German   English   Test
u n d        25        1        5
a n d         3       32        0
t h e         0       13        1
. . .       ...      ...      ...

• N-gram language models – discounting, backoff
• Support Vector Machines – vectors with counts
• PCA + LDA
• Neural Networks
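The trigram-count idea in the example above can be sketched as a small language-model classifier. Letters stand in for phone labels, the training strings are invented toy data, and add-alpha smoothing is used here as a simpler substitute for the discounting and backoff named on the slide.

```python
import math
from collections import Counter

def trigram_counts(phones):
    """Count every run of three consecutive phones."""
    return Counter(zip(phones, phones[1:], phones[2:]))

def train_trigram_model(phone_strings, vocab_size, alpha=1.0):
    """Add-alpha smoothed trigram model; returns a function giving log P(c | a, b)."""
    tri, bi = Counter(), Counter()
    for phones in phone_strings:
        for a, b, c in zip(phones, phones[1:], phones[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    def logprob(a, b, c):
        return math.log((tri[(a, b, c)] + alpha) / (bi[(a, b)] + alpha * vocab_size))
    return logprob

def score(phones, logprob):
    """Average trigram log-probability of a phone string under one language model."""
    trigrams = list(zip(phones, phones[1:], phones[2:]))
    return sum(logprob(*t) for t in trigrams) / max(len(trigrams), 1)

# Toy data: each character plays the role of one recognized phone
train = {
    "german": [list("undichundduund"), list("undwirundsie")],
    "english": [list("andtheandthethe"), list("theandyouand")],
}
vocab = {p for strings in train.values() for s in strings for p in s}
models = {lang: train_trigram_model(strings, len(vocab)) for lang, strings in train.items()}

test_phones = list("undichund")
best = max(models, key=lambda lang: score(test_phones, models[lang]))
```

Normalizing by the number of trigrams keeps scores comparable between short and long test segments.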



Phone recognition output

One best phone string

Phone lattice

[Figure: phone lattice with arcs labeled 0.6, 0.3, 0.1]



Results on LRE 2007 (14 languages)

System / Equal Error Rate [%]   30 sec   10 sec   3 sec
HU_LM string (4-gram)             6.35    13.86   27.12
HU_LM                             5.54    11.75   23.54
HU_SVM-3gram-counts               5.41    13.26   26.92

Conclusion:
• Build as good a phone recognizer as you can
• Gather as much data for each language as you can
• Different approaches to modeling counts do not seem to have a big influence on the results



Fusion - LRE 2007 (14 languages)

System / Equal Error Rate [%]                       30 sec   10 sec   3 sec
Acoustic - GMM2048-MMI-chcf (~3 MMI iterations)       2.41     7.02   16.90
Phonotactic - EN_TREE                                 3.54    10.68   22.66
Phonotactic - HU_TREE_A3E7M5S3G3_LFA                  4.52    10.35   23.66
Fusion – The best 3 systems                           1.28     4.63   13.53

Note:
• Fusion weights have to be trained on a separate set of files that is as close as possible to the target data
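The slides do not say which fusion method was used. One common choice is a linear combination of the system scores with weights trained by logistic regression on a held-out development set, sketched below. The function names, the learning rate and the synthetic scores are all assumptions for illustration.

```python
import numpy as np

def train_fusion(dev_scores, dev_labels, lr=0.1, n_iter=2000):
    """Fit fusion weights by logistic regression with plain gradient descent.

    dev_scores: (n_trials, n_systems) scores on a held-out development set.
    dev_labels: 1 for target trials, 0 for non-target trials.
    """
    X = np.asarray(dev_scores, dtype=float)
    y = np.asarray(dev_labels, dtype=float)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted target probability
        grad = p - y                             # gradient of the cross-entropy loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def fuse(scores, w, b):
    """Weighted sum of per-system scores plus a bias."""
    return np.asarray(scores, dtype=float) @ w + b

# Synthetic development set: three hypothetical systems, each noisy but informative
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
dev = (2 * y - 1)[:, None] * 1.0 + rng.normal(size=(1000, 3))
w, b = train_fusion(dev, y)
fused = fuse(dev, w, b)
```

In practice the weights would be applied to evaluation scores that were not seen during training, which is the point of the note above.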



Thanks for your attention, and I hope you enjoyed it ;)
