



HYDRA A Hybrid CPU/GPU-based Speech Recognition Engine for Real-Time LVCSR

Jungsuk Kim, Jike Chong, Ian Lane

A Hybrid GPU+CPU Speech Recognition Engine

• For intuitive voice and interactive multimodal systems, robust and responsive speech recognition is crucial.

• Robust:
  • Acoustic robustness → large acoustic models
  • Linguistic robustness → large vocabulary (1M+ words), large language models (>20 GB)

• Responsive:
  • Low latency → faster-than-real-time search

• Current state-of-the-art speech recognition systems are optimized for either robustness or responsiveness:
  • Robustness: 5-10x real-time, >95% accuracy
  • Responsiveness: real-time, 85% accuracy

How can we decode with large models in real-time?
• Use hybrid GPU/CPU architectures
• Perform “On-The-Fly Partial Hypothesis Rescoring”

[Figure: GPU/CPU Hybrid LVCSR Engine architecture. GPU: a Recognition Network (WFST), composed from the acoustic model, pronunciation model, and unigram language model, drives Viterbi search (observation probability calculation and WFST search). CPU: feature extraction of the voice input, an n-gram language model for On-The-Fly Partial Hypothesis Rescoring, and backtrack to produce the output.]

On-The-Fly Partial Hypothesis Rescoring

Control and data flow for the proposed approach

Prepare Active Hypotheses Set
• Gather the active speech recognition hypotheses (word and phone sequences) from the previous frame.

Compute Observation Probabilities
• Compute the likelihood of the phonetic models (Gaussian Mixture Models) for the current input feature.

On-The-Fly Partial Hypothesis Rescoring
• On the CPU, rescore the likelihoods of partial hypotheses using a higher-order n-gram language model stored in main memory.
• Partial hypothesis rescoring and the observation probability computation can be performed concurrently, as sketched below.

WFST Search
• Frame-synchronous Viterbi search is performed on the GPU using a WFST network composed with a unigram language model.
• N-best paths are maintained during decoding to ensure good hypotheses are not pruned early.
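To make the concurrency concrete, here is a minimal, self-contained C++ sketch of the per-frame loop. It is not the engine's actual API: the names (Hypothesis, computeObservationProbabilities, rescorePartialHypotheses, wfstSearch) and all score arithmetic are hypothetical stand-ins for the GPU kernels and the CPU-side n-gram lookup.

```cpp
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Hypothesis { int state = 0; float score = 0.0f; };

// Stand-in for the GPU kernel (Phase 1): GMM log-likelihood per phonetic state.
void computeObservationProbabilities(const std::vector<float>& frame,
                                     std::vector<float>& obsLogProb) {
    for (size_t s = 0; s < obsLogProb.size(); ++s)
        obsLogProb[s] = -0.5f * frame[0];            // placeholder likelihood
}

// Stand-in for the CPU-side lookup: add the n-gram correction to each
// partial hypothesis that has just emitted a word.
void rescorePartialHypotheses(std::vector<Hypothesis>& active) {
    for (auto& h : active) h.score += -0.1f;         // placeholder correction
}

// Stand-in for Phase 2: one frame-synchronous Viterbi step over the WFST.
void wfstSearch(std::vector<Hypothesis>& active,
                const std::vector<float>& obsLogProb) {
    for (auto& h : active) h.score += obsLogProb[h.state];
}

int main() {
    std::vector<Hypothesis> active(8);
    std::vector<float> obsLogProb(16);
    for (int t = 0; t < 100; ++t) {
        std::vector<float> frame(39, 0.01f * t);     // one 39-dim MFCC frame
        // Rescoring has no data dependency on the observation probabilities,
        // so it runs on a CPU thread while Phase 1 runs (on the GPU in HYDRA).
        std::thread cpuRescore(rescorePartialHypotheses, std::ref(active));
        computeObservationProbabilities(frame, obsLogProb);
        cpuRescore.join();
        wfstSearch(active, obsLogProb);              // Phase 2
    }
    std::printf("final best score: %f\n", active[0].score);
    return 0;
}
```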

Experimental Evaluation

[Figure: Decoding process: control and data flow between the CPU and the manycore GPU. Phase 0 initializes data structures. Each iteration prepares the active set, runs Phase 1 (Compute Observation Probabilities on the GPU) concurrently with On-The-Fly Rescoring (LM lookup on the CPU), then Phase 2 (WFST Search) and saving of the backtrack log. Finally the CPU collects the backtrack info, backtracks, and outputs results. R/W markers in the figure indicate data and control reads/writes.]

• Accuracy improves when a larger number of N-best hypotheses is maintained.
• The accuracy improvement converges for large N.
• 20x speed-up over standard WFST decoding on the CPU at a word accuracy of 93.80%.
• A maximum accuracy of 95.40% is achieved.

[Chart: word accuracy [%] (85-95) vs. real time factor (0.04-0.40) for Standard (bigram), Proposed (bigram, N=1), Proposed (bigram, N=3), and Proposed (trigram, N=3). At 93.8% word accuracy the proposed engine runs at RTF 0.078 vs. 1.582 for the standard decoder, a 20x speed-up.]

• Acoustic Model
  • SI-284 data set
  • 3,000 tied states
  • 16-mixture Gaussians
  • 39-dimensional MFCC features

• Language Model
  • Wall Street Journal 5k
  • 1-gram: 5k entries
  • 2-gram: 1.6M entries
  • 3-gram: 2.7M entries

• Evaluation Set
  • Nov. '92 ARPA WSJ test set
  • 330 sentences

• NVIDIA GTX 680
  • Kepler architecture
  • 1536 CUDA cores

[Chart: relationship between word accuracy and beamwidth when maintaining different numbers of N-best hypotheses.]


Experimental Evaluation

Relationship between vocabulary size and real time factor (RTF)
[Chart: x-axis vocabulary size (100k-1M); left axis network size (MB, up to ~1,200); right axis RTF. RTF rises only from 0.10 to 0.17 as the vocabulary grows from 100k to 1M words.]

[Chart: word error rate (%) vs. real time factor (0-1.0) for UGM/3GM, UGM/4GM, and UGM/5GM rescoring configurations. WER falls from roughly 15.9-16.3% at low RTF to 8.8-9.2% at higher RTF.]

• Acoustic Model
  • Full WSJ corpus
  • 10,000 tied states
  • 32-mixture Gaussians
  • 39-dimensional MFCC features

• Language Model
  • 1M vocabulary
  • 3-gram: 497.6M entries
  • 4-gram: 767.8M entries
  • 5-gram: 977.1M entries

• Evaluation Set
  • WSJ test set
  • 543 sentences

N-Best On-The-Fly Partial Hypothesis Rescoring

[Figure: WFST fragment with states 1-4 and arcs i[1]:ε/w[1], i[2]:ε/w[2], i[3]:o[3]/w[3], shown for (a) standard one-best search and (b) proposed N-best search. In (a), only the best path survives at the shared state: α[g1'] = α[g1] + w[1,3]. In (b), both paths are kept and rescored when the word symbol is emitted: α[g'1] = α[g1] + w[1,3] + c[g1] and α[g'2] = α[g2] + w[2,3] + c[g2].]

Standard One-Best Search
• Choose only the best hypothesis when multiple arcs meet at the same destination state.

Proposed N-Best Search with Rescoring
• Rescore a partial hypothesis using the likelihood difference between the larger n-gram and the unigram (c[.]) when the hypothesis outputs a word symbol.
• Maintaining N-best paths effectively allows multiple word hypotheses to be kept until rescoring can be applied (see the worked sketch below).
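The arithmetic behind the rescoring step can be shown with a toy worked example. The correction c = log P_ngram(word | history) − log P_unigram(word) is added when a word symbol is emitted; the scores below are hypothetical, chosen only to show how the pre-rescoring winner can differ from the post-rescoring winner, which is exactly the early-pruning error that N-best retention avoids.

```cpp
#include <cstdio>

// alpha' = alpha + w + c, where c is the likelihood difference between the
// larger n-gram and the unigram LM for the word just emitted.
float rescore(float alpha, float arcWeight, float c) {
    return alpha + arcWeight + c;
}

int main() {
    // Hypothetical log-domain scores (higher is better).
    float alphaA = -10.0f, wA = -1.0f, cA = -3.0f;   // hypothesis gA
    float alphaB = -10.2f, wB = -1.0f, cB = -0.2f;   // hypothesis gB
    // Before rescoring, gA narrowly beats gB, so one-best search at the
    // shared destination state would keep gA and prune gB.
    std::printf("before: gA=%.2f  gB=%.2f\n", alphaA + wA, alphaB + wB);
    // After the n-gram correction, gB is the true best; keeping N-best
    // paths until rescoring prevents this early-pruning error.
    std::printf("after:  gA=%.2f  gB=%.2f\n",
                rescore(alphaA, wA, cA), rescore(alphaB, wB, cB));
    return 0;
}
```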

• A 1M-vocabulary network can be decoded on a modern GPU.
• Network size does not significantly affect decoding speed.
• 2.74x faster than real-time at a WER of 9.35%.
• A maximum accuracy of 91.23% is achieved.

Why “N-Best”? Early pruning: the best hypothesis g2 can be pruned before rescoring, given w[3] > w[2] > w[1], α[g1] ≈ α[g2] > α[g3] ≈ α[g4], and c[g1] >> c[g2].

Notation: α[.]: state likelihood; g: partial hypothesis; c[.]: language model likelihood difference; i[.]: input symbol; o[.]: output symbol; w[.]: arc weight.

Load Balancing Between GPU and CPU using OpenMP

GPU and CPU Parallel Execution
• Language model lookup has no data dependency on the acoustic likelihood computation.
• CPU functions and GPU kernels can therefore be executed in parallel.
• Language model runtime can be hidden behind GPU runtime.

GPU and CPU Load Balancing using OpenMP
• With a small acoustic model, language model lookup takes longer than the acoustic likelihood computation.
• The language model lookup for each hypothesis is independent.
• The language model lookup phase is therefore parallelized using OpenMP on the CPU to achieve better load balance (a sketch follows the chart below).

Ratio of the processing time per phase

[Chart: ratio of processing time per phase for KenLM, GraphLM, GraphLM (+parallel execution), and GraphLM (+OpenMP). Phases: data gathering, log-likelihood computation, language model lookup, graph traversal, and misc (adaptive beam control, backtrack). Language model lookup is the dominant phase for the KenLM baseline and shrinks markedly with GraphLM and OpenMP parallelization.]
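A minimal sketch of the OpenMP parallelization of the lookup phase follows. The lmLookup function and its values are hypothetical stand-ins for the engine's GraphLM n-gram query; the point is only that per-hypothesis lookups are independent, so a plain parallel-for suffices. Compile with -fopenmp; without it the pragma is ignored and the loop runs serially.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for an n-gram probability query against the LM
// stored in main memory (GraphLM in the real engine).
float lmLookup(int word) {
    return -0.5f * static_cast<float>(word % 7 + 1);
}

int main() {
    std::vector<int> activeWords(100000);
    for (std::size_t i = 0; i < activeWords.size(); ++i)
        activeWords[i] = static_cast<int>(i);
    std::vector<float> corrections(activeWords.size());

    // Each hypothesis' lookup touches only its own slot, so the loop
    // parallelizes cleanly across CPU threads.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(activeWords.size()); ++i)
        corrections[i] = lmLookup(activeWords[i]);

    std::printf("first correction: %f\n", corrections[0]);
    return 0;
}
```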