HYDRA: A Hybrid CPU/GPU-based Speech Recognition Engine for Real-Time LVCSR
Jungsuk Kim, Jike Chong, Ian Lane
A Hybrid GPU+CPU Speech Recognition Engine
• Robust and responsive speech recognition is crucial for intuitive voice and interactive multimodal systems
• Robust
  • Acoustic robustness: large acoustic models
  • Linguistic robustness: large vocabulary (1M+ words), large language models (>20 GB)
• Responsive
  • Low latency: faster-than-real-time search
• Current state-of-the-art speech recognition systems are optimized for either robustness or responsiveness
  • Robustness: 5-10x real-time, >95% accuracy
  • Responsiveness: real-time, 85% accuracy
How can we decode with large models in real-time?
• Use hybrid GPU/CPU architectures
• Perform “On-The-Fly Partial Hypothesis Rescoring”
[Figure: GPU/CPU Hybrid LVCSR Engine architecture. Voice input goes through feature extraction; the GPU performs Viterbi search (observation probability calculation and WFST search) over a recognition network (WFST) composed from the acoustic model, the pronunciation model, and a unigram language model; the CPU rescores hypotheses with an n-gram language model and performs backtrack to produce the output.]
On-The-Fly Partial Hypothesis Rescoring
Control and data flow for the proposed approach

Prepare Active Hypotheses Set
• Gather the active speech recognition hypotheses (word and phone sequences) from the previous frame.

Compute Observation Probabilities
• Compute the likelihoods of the phonetic models (Gaussian mixture models) for the current input feature.

On-The-Fly Partial Hypothesis Rescoring
• On the CPU, rescore the likelihoods of partial hypotheses using a higher-order n-gram language model stored in main memory.
• Partial hypothesis rescoring and the observation probability computation can be performed concurrently, as sketched below.

WFST Search
• Frame-synchronous Viterbi search is performed on the GPU using a WFST network composed with a unigram language model.
• N-best paths are maintained during decoding to ensure good hypotheses are not pruned early.
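The per-frame loop implied by these four steps can be sketched as follows. This is a minimal illustration only: the type and function names (ActiveSet, gpu_compute_observation_probs, and so on) are assumptions standing in for HYDRA's actual data structures, and a plain std::thread stands in for the engine's overlap of a GPU kernel launch with CPU work.

#include <thread>

// Hypothetical placeholder types; HYDRA's real structures are not shown on the poster.
struct Features  { float mfcc[39]; };  // one frame of 39-dimensional MFCC features
struct ActiveSet { /* active partial hypotheses carried over from the previous frame */ };
struct ObsProbs  { /* GMM likelihoods for the current frame */ };

void prepare_active_set(ActiveSet&) { /* gather hypotheses from frame t-1 */ }
void cpu_rescore_partial_hyps(ActiveSet&) { /* n-gram LM lookup in CPU main memory */ }
void gpu_compute_observation_probs(const Features&, ObsProbs&) { /* GMM scoring kernel */ }
void gpu_wfst_search(ActiveSet&, const ObsProbs&) { /* frame-synchronous Viterbi step */ }

void decode_frame(ActiveSet& active, const Features& feat) {
    // Step 1: prepare the active hypotheses set from the previous frame.
    prepare_active_set(active);

    // Steps 2 and 3 have no mutual data dependency: CPU rescoring of the
    // active hypotheses overlaps with GPU observation-probability scoring.
    ObsProbs probs;
    std::thread rescorer([&] { cpu_rescore_partial_hyps(active); });
    gpu_compute_observation_probs(feat, probs);
    rescorer.join();

    // Step 4: frame-synchronous Viterbi search on the GPU over the WFST
    // composed with the unigram LM, keeping N-best paths per state.
    gpu_wfst_search(active, probs);
}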
[Figure: Decoding process. Phase 0 (CPU): initialize data structures, prepare the active set, and run iteration control. Phase 1: the manycore GPU computes observation probabilities while the CPU performs the on-the-fly rescoring (LM lookup). Phase 2 (GPU): WFST search, saving the backtrack log. Finally the CPU collects the backtrack information, backtracks, and outputs the results. Data and control flow between CPU and GPU as reads (R) and writes (W).]

Experimental Evaluation
• Accuracy improves when a larger number of N-best hypotheses is maintained.
• The accuracy improvement converges for large N.
• 20x speed-up compared to standard WFST decoding on the CPU at a word accuracy of 93.80%.
• A maximum accuracy of 95.40% is achieved.
[Figure: Relationship between word accuracy and Real Time Factor (RTF). Word accuracy (85-95%) vs. RTF for Standard (bigram), Proposed (bigram, N=1), Proposed (bigram, N=3), and Proposed (trigram, N=3); at 93.8% word accuracy the RTF drops from 1.582 (standard) to 0.078 (proposed), a 20x speed-up.]

• Acoustic Model
  • SI-284 data set
  • 3,000 tied states
  • 16-mixture Gaussians
  • 39-dimensional MFCC features
• Language Model
  • Wall Street Journal 5k
  • 1-gram: 5k entries
  • 2-gram: 1.6M entries
  • 3-gram: 2.7M entries
• Evaluation Set
  • Nov. '92 ARPA WSJ test set
  • 330 sentences
• NVIDIA GTX 680
  • Kepler architecture
  • 1536 CUDA cores
[Figure: Relationship between word accuracy and beamwidth when maintaining different numbers of N-best hypotheses.]
Experimental Evaluation
[Figure: Relationship between vocabulary size and Real Time Factor (RTF). Network size (MB, up to roughly 1,200 MB) and RTF for vocabularies from 100k to 1M words; the RTF grows only slightly, from 0.10 at 100k words to 0.17 at 1M words.]
[Figure: Relationship between Word Error Rate (WER) and Real Time Factor (RTF). WER (%) vs. RTF (0-1) for the unigram WFST decoder rescored with 3-gram (UGM/3GM), 4-gram (UGM/4GM), and 5-gram (UGM/5GM) language models; each curve falls from roughly 16% WER at the tightest beam and converges to 9.15% (UGM/3GM), 8.77% (UGM/4GM), and 8.82% (UGM/5GM).]
• Acoustic Model
  • Full WSJ corpus
  • 10,000 tied states
  • 32-mixture Gaussians
  • 39-dimensional MFCC features
• Language Model
  • 1M-word vocabulary
  • 3-gram: 497.6M entries
  • 4-gram: 767.8M entries
  • 5-gram: 977.1M entries
• Evaluation Set
  • WSJ test set
  • 543 sentences
N-Best On-The-Fly Partial Hypothesis Rescoring
[Figure: (a) Standard one-best search vs. (b) proposed N-best search, on a WFST fragment with states 1-4 whose arcs i[1]:ε/w[1] and i[2]:ε/w[2] merge into a state that is left via i[3]:o[3]/w[3]. In (a) only the best incoming path survives the merge: α[g1'] = α[g1] + w[1,3]. In (b) N-best paths are kept through the merge and rescored when the word symbol o[3] is output: α[g1'] = α[g1] + w[1,3] + c[g1] and α[g2'] = α[g2] + w[2,3] + c[g2].]
Standard One-Best Search
• Keeps only the best hypothesis when multiple arcs meet at the same destination state.

Proposed N-Best Search with Rescoring
• Rescores a partial hypothesis using the likelihood difference between the larger n-gram and the unigram (c[.]) whenever the hypothesis outputs a word symbol, as sketched below.
• Maintaining N-best paths effectively allows multiple word hypotheses to be kept until rescoring can be applied.
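A minimal sketch of this rescoring step, assuming each partial hypothesis carries its Viterbi score and word history; the names (Hypothesis, ngram_logprob, unigram_logprob) are illustrative, not HYDRA's actual API:

#include <cstdint>
#include <vector>

struct Hypothesis {
    float alpha;                  // state likelihood α[g]
    std::vector<uint32_t> words;  // word history of this partial hypothesis
};

// Stand-ins for the real lookups: the large n-gram LM lives in CPU main
// memory, while only the unigram LM is compiled into the GPU's WFST.
float ngram_logprob(const std::vector<uint32_t>& /*history*/, uint32_t /*w*/) { return 0.0f; }
float unigram_logprob(uint32_t /*w*/) { return 0.0f; }

// When an arc emits word symbol w, correct the unigram score that is
// already folded into the WFST weights:
//   c[g]  = log P_ngram(w | history) - log P_unigram(w)
//   α[g'] = α[g] + arc_weight + c[g]
void rescore_on_word_output(Hypothesis& g, float arc_weight, uint32_t w) {
    const float c = ngram_logprob(g.words, w) - unigram_logprob(w);
    g.alpha += arc_weight + c;
    g.words.push_back(w);
}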
• A 1M-word vocabulary network can be decoded on a modern GPU.
• Network size does not significantly affect decoding speed.
• Decoding is 2.74x faster than real-time at a WER of 9.35%.
• A maximum accuracy of 91.23% is achieved.
Why “N-Best”? Early pruning: the hypothesis g2, which would win after rescoring, is pruned by one-best search before rescoring can be applied, since w[3] > w[2] > w[1], α[g1] ≈ α[g2] > α[g3] ≈ α[g4], and c[g1] >> c[g2].

Legend: α[.]: state likelihood; g: partial hypothesis; c[.]: language model likelihood difference; i[.]: input symbol; o[.]: output symbol; w[.]: arc weight.
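A small numeric illustration (values chosen for exposition, not taken from the poster), treating weights as costs where smaller is better: suppose α[g1] = α[g2] = 10, w[1] = 1, w[2] = 2, c[g1] = 5, and c[g2] = 1. One-best search keeps only g1 at the merge, since α[g1] + w[1] = 11 beats α[g2] + w[2] = 12, and g2 is discarded. After rescoring, however, α[g1] + w[1] + c[g1] = 16 while α[g2] + w[2] + c[g2] = 13, so g2 was actually the better hypothesis; N-best search with N ≥ 2 keeps both paths through the merge and recovers it.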
Load Balancing Between GPU and CPU using OpenMP
GPU and CPU Parallel Execution
• The language model lookup has no data dependency on the acoustic likelihood computation.
• The CPU function and the GPU kernel can therefore run in parallel.
• The language model runtime can be hidden behind the GPU runtime.

GPU and CPU Load Balancing using OpenMP
• With a small acoustic model, the language model lookup takes longer than the acoustic likelihood computation.
• The language model lookup for each hypothesis is independent.
• The language model lookup phase is therefore parallelized using OpenMP on the CPU to achieve better load balance, as sketched below.
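An illustrative sketch of the load-balancing idea: each hypothesis's LM lookup is independent, so the loop can be spread across CPU cores with OpenMP while the GPU acoustic-scoring kernel runs asynchronously. lookup_ngram() and the surrounding types are assumptions, not HYDRA's API.

#include <omp.h>
#include <vector>

struct Hyp { float alpha; float lm_correction; };

float lookup_ngram(const Hyp&) { return 0.0f; }  // stand-in for the real n-gram lookup

void rescore_all(std::vector<Hyp>& hyps) {
    // Each iteration touches only its own hypothesis, so a plain
    // parallel-for is race-free; dynamic scheduling absorbs the variance
    // in n-gram backoff depth across hypotheses.
    #pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < static_cast<long>(hyps.size()); ++i) {
        hyps[i].lm_correction = lookup_ngram(hyps[i]);
    }
}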
[Figure: Ratio of the processing time per phase for the KenLM, GraphLM, GraphLM (+parallel execution), and GraphLM (+OpenMP) configurations. Phases: data gathering, log-likelihood computation, language model lookup, graph traversal, and misc. (adaptive beam control, backtrack). The language model lookup dominates with KenLM (0.94) and shrinks to 0.16 once GraphLM is combined with parallel execution and OpenMP.]