Language Identification
Oldřich Plchot, Pavel Matějka Speech@FIT, Brno University of Technology, Czech Republic
IKR, Brno, 2012
Language Identification IKR, Brno, 2012
Outline
• Why do we need LID?
• Evaluations
• Acoustic LID
• Phonotactic LID
• Fusion
• Conclusion
Why do we need language identification?
1) Route phone calls to human operators.
• Emergency lines (112, 155, 911)
• Call centers
• Fire brigade (150)
• Police (158)
Why do we need language identification?
2) Pre-select suitable recognition system.
[Figure: Language Identification selects the downstream system: Translate SPA, KWS CHN, Speech2Text ENG, Translate CZE, Translate VIE, Connect]
Why do we need language identification?
3) Security applications to narrow search space.
Two main approaches to LID
• Acoustic – Gaussian Mixture Model
• Phonotactic – Phoneme Recognition followed by Language Model
Acoustic approach
• Gaussian Mixture Model
- good for short speech segments and dialect recognition
- relies on the sounds
Spectral features - MFCC
[Figure: MFCC extraction pipeline]
• Framing: 20 ms window, 10 ms shift
• Short-time FFT
• Mel filter bank
• Log()
• Discrete Cosine Transform
• Output: one vector of cepstral coefficients per frame
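The pipeline above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' front end: the 8 kHz sample rate, the Hamming window, the 23 mel filters and the helper names (`mel_filterbank`, `mfcc`) are assumptions; only the 20 ms / 10 ms framing and the 7 output coefficients come from the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale (assumed design)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def mfcc(signal, sample_rate=8000, win_ms=20, shift_ms=10, n_filters=23, n_ceps=7):
    win = int(sample_rate * win_ms / 1000)      # 20 ms window
    shift = int(sample_rate * shift_ms / 1000)  # 10 ms shift
    n_fft = 1 << (win - 1).bit_length()         # next power of two
    n_frames = 1 + (len(signal) - win) // shift
    idx = np.arange(win)[None, :] + shift * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2              # short-time FFT
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_mel = np.log(power @ fbank.T + 1e-10)                    # mel filter bank + log
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * n + 1) / (2 * n_filters))
    return log_mel @ basis.T                                     # DCT-II, first n_ceps terms

# One second of synthetic audio stands in for real speech.
rng = np.random.default_rng(0)
t = np.arange(8000) / 8000.0
signal = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.normal(size=t.size)
feats = mfcc(signal)
```

Each row of `feats` is the cepstral vector for one 20 ms frame.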
Shifted delta cepstra
• Shifted Delta Cepstra capture information about how the speech evolves around the current frame (± 0.1 sec)
• Size of the final feature vector: 7 MFCC + 7 × 7 SDC = 56
Acoustic systems – GMM based
• Maximum likelihood (generative)
- Objective function to maximize is the likelihood of the training data given the transcription
• Maximum Mutual Information (discriminative)
- Objective function to maximize is the posterior probability of all training utterances being correctly recognized
• Advantages of discriminative training:
- Lower error rates
- Fewer parameters
• Disadvantages of discriminative training:
- Overtraining
- Sometimes computationally expensive
• Channel Compensation – from previous presentation
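The difference between the two objectives can be shown on a toy example. The sketch below assumes we already have a matrix of per-utterance log-likelihoods, one column per language model, and that the "transcription" of an utterance is simply its language label; the function names and the uniform language prior are illustrative assumptions, not part of the original system.

```python
import numpy as np

def ml_objective(loglik, labels):
    """Sum of log-likelihoods of each utterance under its own (correct) model."""
    return loglik[np.arange(len(labels)), labels].sum()

def mmi_objective(loglik, labels, log_priors=None):
    """Sum of log-posteriors of the correct language over all utterances."""
    n_utts, n_langs = loglik.shape
    if log_priors is None:
        log_priors = np.full(n_langs, -np.log(n_langs))    # assumed uniform prior
    joint = loglik + log_priors
    log_evidence = np.logaddexp.reduce(joint, axis=1)      # log sum over all languages
    return (joint[np.arange(n_utts), labels] - log_evidence).sum()

# Two utterances, two languages: rows are utterances, columns are models.
loglik = np.array([[-10.0, -12.0],
                   [-11.0,  -9.0]])
labels = np.array([0, 1])
ml = ml_objective(loglik, labels)
mmi = mmi_objective(loglik, labels)
```

ML only looks at the correct model's column, whereas MMI also depends on how well the competing models score, which is what makes it discriminative.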
Highly overlapped distributions
Results on LRE 2007 (14 languages)
Equal Error Rate [%]
System                                  30 sec   10 sec   3 sec
GMM2048                                 8.03     12.89    21.77
GMM2048-eigchan                         2.76     7.38     17.14
GMM2048-chcf                            2.94     7.40     17.93
GMM2048-MMI-chcf (~3 MMI iterations)    2.41     7.02     16.90
The best acoustic system combines:
• Many Gaussians
• Eigen-channel compensation of features
• MMI
Equal Error Rate [%]
System                                  30 sec   10 sec   3 sec
GMM2048 ML                              8.03     12.89    21.77
GMM256 ML                               ~16
GMM256 MMI (~15 MMI iterations)         4.15     8.61     18.43
GMM256-MMI-chcf (~3 MMI iterations)     3.73     9.81     20.98
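For readers unfamiliar with the metric in these tables: the Equal Error Rate is the operating point at which the miss rate on target trials equals the false-alarm rate on non-target trials. A minimal sketch of how it could be computed from detection scores follows; the function name and the toy scores are illustrative and are not taken from the LRE 2007 evaluation tooling.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER in percent, found by sweeping the threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss: a target trial scored below the threshold.
    p_miss = np.array([(target_scores < t).mean() for t in thresholds])
    # False alarm: a non-target trial scored at or above the threshold.
    p_fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(p_miss - p_fa))
    return 100.0 * (p_miss[i] + p_fa[i]) / 2.0

# Toy scores: one target and one non-target trial are on the wrong side.
eer = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 1.5])
```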
Phonotactic approach
• Phoneme Recognition followed by Language Model (PRLM)
- good for longer speech segments
- robust against dialects in one language
- eliminates speech characteristics of speaker's native language
Phone recognizer
• 3 neural networks to produce the phone posterior probability
• 310 ms long time trajectory around the current frame
• Investigation of different phone recognizers for LID => better phone recognizer ≈ better LID system
Phone recognition output
One best phone string
Phonotactic modeling - example
Trigram counts
Trigram   German   English   Test
u n d     25       1         5
a n d     3        32        0
t h e     0        13        1
...       ...      ...       ...
Ways to model the counts:
• N-gram language models – discounting, backoff
• Support Vector Machines – vectors with counts
• PCA + LDA
• Neural Networks
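The counts in the example can be turned into a toy classifier. The sketch below is a deliberate simplification: it treats each trigram as an independent event with add-one smoothing (an assumption chosen for brevity, whereas the slides mention discounting and backoff), and the helper names are hypothetical. It only uses the three trigrams shown in the example.

```python
import math

# Trigram counts from the example (the elided "..." rows are left out).
train_counts = {
    "German":  {"u n d": 25, "a n d": 3,  "t h e": 0},
    "English": {"u n d": 1,  "a n d": 32, "t h e": 13},
}
test_counts = {"u n d": 5, "a n d": 0, "t h e": 1}

def train_lm(counts, alpha=1.0):
    """Smoothed log-probability of each trigram (add-alpha, an assumed choice)."""
    total = sum(counts.values()) + alpha * len(counts)
    return {ngram: math.log((c + alpha) / total) for ngram, c in counts.items()}

def score(counts, lm):
    """Log-likelihood of the test trigram counts under one language model."""
    return sum(c * lm[ngram] for ngram, c in counts.items())

models = {lang: train_lm(c) for lang, c in train_counts.items()}
scores = {lang: score(test_counts, lm) for lang, lm in models.items()}
best = max(scores, key=scores.get)
```

With these counts the test utterance is dominated by "u n d", so the German model scores higher.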
Phone recognition output
One best phone string
Phone lattice (arc weights shown in the figure: 0.6, 0.3, 0.1)
Results on LRE 2007 (14 languages)
Conclusion:
• Build as good a phone recognizer as you can
• Gather as much data for each language as you can
• Different approaches to modeling counts do not seem to have a big influence on results
Equal Error Rate [%]
System                    30 sec   10 sec   3 sec
HU_LM string (4-gram)     6.35     13.86    27.12
HU_LM                     5.54     11.75    23.54
HU_SVM-3gram-counts       5.41     13.26    26.92
Fusion - LRE 2007 (14 languages)
Equal Error Rate [%]
System                                              30 sec   10 sec   3 sec
Acoustic - GMM2048-MMI-chcf (~3 MMI iterations)     2.41     7.02     16.90
Phonotactic - EN_TREE                               3.54     10.68    22.66
Phonotactic - HU_TREE_A3E7M5S3G3_LFA                4.52     10.35    23.66
Fusion – The best 3 systems                         1.28     4.63     13.53
Note:
• Fusion weights have to be trained on a separate set of files that is as close as possible to the target data
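One plausible way to train such weights is a linear fusion fitted by multiclass logistic regression on held-out scores. The slides do not say which fusion method was used, so the sketch below is an assumption throughout: the function names, the gradient-descent settings and the synthetic development data are all hypothetical.

```python
import numpy as np

def cross_entropy(w, scores, labels):
    """Mean negative log-posterior of the true language after fusion."""
    fused = np.tensordot(w, scores, axes=1)            # (n_utts, n_langs)
    fused = fused - fused.max(axis=1, keepdims=True)   # numerical stability
    log_post = fused - np.log(np.exp(fused).sum(axis=1, keepdims=True))
    return -log_post[np.arange(len(labels)), labels].mean()

def train_fusion_weights(scores, labels, steps=300, lr=0.1):
    """scores: (n_systems, n_utts, n_langs) from a held-out development set."""
    n_sys, n_utts, n_langs = scores.shape
    w = np.full(n_sys, 1.0 / n_sys)                    # start from equal weights
    onehot = np.eye(n_langs)[labels]
    for _ in range(steps):
        fused = np.tensordot(w, scores, axes=1)
        fused = fused - fused.max(axis=1, keepdims=True)
        post = np.exp(fused)
        post /= post.sum(axis=1, keepdims=True)
        # Gradient of the mean cross-entropy with respect to each system weight.
        grad = np.einsum("ul,sul->s", post - onehot, scores) / n_utts
        w -= lr * grad
    return w

# Synthetic development set: system 0 is more informative than system 1.
rng = np.random.default_rng(0)
n_utts, n_langs = 300, 3
labels = rng.integers(0, n_langs, n_utts)
onehot = np.eye(n_langs)[labels]
dev_scores = np.stack([
    2.0 * onehot + rng.normal(size=onehot.shape),
    0.5 * onehot + rng.normal(size=onehot.shape),
])
w0 = np.full(2, 0.5)
w = train_fusion_weights(dev_scores, labels)
loss_before = cross_entropy(w0, dev_scores, labels)
loss_after = cross_entropy(w, dev_scores, labels)
```

The trained weights would then be applied unchanged to the evaluation scores, which is why the development set has to resemble the target data.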
Thanks for your attention, and I hope you enjoyed it ;)