31
HYDRA : A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR Jungsuk Kim Jike Chong Ian Lane Electrical and Computer Engineering Carnegie Mellon University Silicon Valley March 21, 2013 @GTC2013 1

A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

HYDRA: ���A Hybrid CPU/GPU Speech Recognition

Engine for Real-Time LVCSR

Jungsuk  Kim        Jike  Chong        Ian  Lane    Electrical and Computer Engineering

Carnegie Mellon University Silicon Valley

March 21, 2013 @GTC2013 1

Page 2: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Overview  

2

•  Introduc5on    

•  Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFST)  

•  GPU-­‐Accelerated  Speech  Recogni5on  •  On-­‐The-­‐Fly  Hypothesis  Rescoring  •  Evalua5on  Results  •  Demonstra5on  Video  

Page 3: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Introduc5on  

3

•  Voice interfaces a core technology for User Interaction •  Mobile devices, Smart TVs, In-Vehicle Systems, …

•  For a captivating User Experience, Voice UI must be: •  Robust

•  Acoustic robustness Large Acoustic Models •  Linguistics robustness Large Vocabulary Recognition

•  Responsive •  Low latency Faster than real-time search

•  Adaptive •  User and Task adaptation

A New Speech Recognition Architecture Required

Page 4: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Introduc5on  

4

&

Speech recognition contains many highly

parallel tasks

GPU processors optimized for parallel

computing

HYDRA an ASR engine designed specifically

for GPUs + =

Page 5: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

r! eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

r! eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

A search space in speech recognition is represented as a WFST graph that includes acoustic, phonetic and linguistic constraints

Page 6: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

r! eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

r! eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Search is performed in 3 phases

Page 7: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

r! eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

r! eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 0: Prepare Active Set Gather active states (speech recognition hypotheses) from previous frame.

Page 8: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

r! eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

r! eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 1: Compute Observation Probability Compute likelihood of phonetic models (Gaussian Mixture Model) for current frame

Page 9: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

r! eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

r! eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 2: WFST Search Frame synchronous Viterbi search is performed on WFST network

Page 10: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 0: Prepare Active Set

r!

r!

Page 11: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 1: Compute Observation Probability

r!

r!

Page 12: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

eh! k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

eh! k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 2: WFST Search

r!

r!

Page 13: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 0: Prepare Active Set

r! eh!

r! eh!

Page 14: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 1: Compute Observation Probability

r! eh!

r! eh!

Page 15: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

k! ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

k! b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Phase 2: WFST Search

r! eh!

r! eh!

Page 16: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

b! iy! ch!

Beach Nice

n! ay! s!a!

a

Phase 0: Prepare Active Set

r! eh! k!

r! eh! k!

Page 17: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

b! iy! ch!

Beach Nice

n! ay! s!a!

a

Phase 1: Compute Observation Probability

r! eh! k!

r! eh! k!

Page 18: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

ax! g! n! ay! z! s! p! iy! ch!

Speech Recognize

b! iy! ch!

Beach Wreck Nice

n! ay! s!a!

a

Backtrack from Viterbi Search is Speech Recognition Hypothesis

r! eh! k!

r! eh! k!

Page 19: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Speech  Recogni5on  with    Weighted  Finite  State  Transducers  (WFSTs)  

Initialize data structures"

CPU!

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!W

R W!

R!

R W

Data! Control!

R!

RW

R!

WW

RW

R

Prepare ActiveSet!

Iteration Control!

Decoding Process •  Phase  0:  Prepare  Ac5ve  Set  

•  Gather  ac2ve  speech  recogni2on  hypotheses  from  previous  frame.  

•  Phase  1:  Compute  Observa5on  Probability  •  Compute  likelihood  of  phone2c  models  (Gaussian  

Mixture  Model)  for  current  input  features.  

•  Phase  2:  WFST  Search  •  Frame  synchronous  Viterbi  search  is  performed  on  

WFST  network.  

19

Page 20: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Concurrency  Opportuni5es  

20

1

2

3

4

5

6

Thread block 1

Thread block 2

6

2

3

4

•  Phase  0:  Prepare  Ac5ve  Set  •  Communica6on-­‐intensive  phase  

•  Phase  1:  Compute  Observa5on  Probability  •  1000~2000  GMM  clusters  are  ac2vated  per  step.  •  Computa6on-­‐intensive  phase  

•  Phase  2:  WFST  Search  •  10,000s  par2al  hypotheses  tracked  per  step.  •  Handling  irregular  graph  structure  with  data  parallel  

opera2on.  •  Conflict-­‐free  reduc6on  in  graph  traversal  to  resolve  write-­‐

conflict.    •  Communica6on-­‐intensive  phase  

Page 21: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

GPU-­‐Accelerated  Speech  Recogni5on  

21

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Paul R. Dixon [ICASSP 2009]

Approach  1:  •  Conduct  compute  intensive  phase  only  on  GPU  •  Incurs  high  data-­‐copying  overheads  between  the  

CPU  and  the  GPU.  •  Less  scalable  as  transfer  become  a  sequen2al  

boRleneck  in  algorithm.  

Faster than CPU but less scalable

Page 22: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

GPU-­‐Accelerated  Speech  Recogni5on  

22

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Paul R. Dixon [ICASSP 2009]

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Jike Chong [INTERSPEECH 2009]

Collect Backtrack Info"

Page 23: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

GPU-­‐Accelerated  Speech  Recogni5on  

23

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Jike Chong [INTERSPEECH 2009]

Collect Backtrack Info"

Approach  2:  •  Both  phases  are  on  GPU.  •  Scalable  algorithm  •  Only  suitable  for  small  models  (vocab  /  linguis2c  

context)  due  to  limited  GPU  memory  (1~6GB)  

Fast and Scalable but Limited

Page 24: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

GPU-­‐Accelerated  Speech  Recogni5on  

24

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Paul R. Dixon [ICASSP 2009]

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Jike Chong [INTERSPEECH 2009]

Collect Backtrack Info"

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Proposed

Collect Backtrack Info"

On-The-Fly Rescoring!

(LM Lookup) !

Page 25: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

GPU-­‐Accelerated  Speech  Recogni5on  

25

Initialize data structures"

Backtrack"

Output Results"

Phase 0!""

Phase 1 !Compute

Observation Probabilities!

Phase 2!WFST Search!

Save !Backtrack Log!

Prepare ActiveSet!

Iteration Control!

CPU GPU

Proposed

Collect Backtrack Info"

On-The-Fly Rescoring!

(LM Lookup) !

Vocab. size 5k 64k 1M

unigram 2 12 92 bigram 114 1,880 N/A* trigram 676 4,644* N/A*

WFST network size (MB) (* Unable to decode)

Proposed  Approach:  •  Both  phases  are  on  GPU  using  unigram  WFST  •  Rescore  hypothesis  “On-­‐The-­‐Fly”  using  larger  

language  model  on  CPU  

Fast, Scalable, Large Models

Page 26: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

On-­‐The-­‐Fly  Hypothesis  Rescoring  

26

ax! g! n! ay! z!

Speech Recognize

Wreck Nice

n! ay! s!a!

a

r! eh! k!

r! eh! k!

<s> s! p! iy! ch!

b! iy! ch!

Beach

P(Recognize|<s>)

At word boundary update graph weight with linguistic context from CPU

Page 27: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

On-­‐The-­‐Fly  Hypothesis  Rescoring  

27

ax! g! n! ay! z!

Speech Recognize

Wreck Nice

n! ay! s!a!

a

r! eh! k!

r! eh! k!

<s> s! p! iy! ch!

b! iy! ch!

Beach

P(Speech|<s>,Recognize)

At word boundary update graph weight with linguistic context from CPU

Page 28: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Evalua5on  Results  (5k  vocab.)  

28

•  20X  speed-­‐up  compared  to  standard  WFST  decoding  on  CPU  at  word  accuracy  93.80%  

•  95.4%  maximum  accuracy  is  achieved  

85

86

87

88

89

90

91

92

93

94

95

0.04 0.40

Wor

d A

ccur

acy

[%]

Real time Factor

STANDARD (2GRAM) PROPOSED (3GRAM, N=3)

93.8%

0.078 1.582 20X

Page 29: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Evalua5on  Results  (1M  vocab.)  

29

•  2.74X  faster  than  real-­‐2me  when  the  WER  is  9.35%  

0"

0.1"

0.2"

0.3"

0.4"

0.5"

0.6"

TRIGRAM" 40GRAM" 50GRAM" 40GRAM"

GPU" CPU"

REAL%TIM

E%FAC

TOR%

9.61%

9.35%

9.38%

0.36

Page 30: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Evalua5on  Results  (1M  vocab.)  

30

•  2.74X  faster  than  real-­‐2me  when  the  WER  is  9.35%  

•  10X  faster  than  CPU  decoder.  

0"

0.5"

1"

1.5"

2"

2.5"

3"

3.5"

4"

4.5"

TRIGRAM" 4/GRAM" 5/GRAM" 4/GRAM"

GPU" CPU"

REAL%TIM

E%FAC

TOR%

9.61% 9.35% 9.38%

9.59%

Real-Time 0.36

4.01

Page 31: A Hybrid CPU/GPU Speech Recognition Engine for Real ......HYDRA: ! A Hybrid CPU/GPU Speech Recognition Engine for Real-Time LVCSR JungsukKim!!!!Jike!Chong!!!!Ian!Lane!! Electrical

Carnegie Mellon University

Q&A  

31

Thank you for your attention.