26
All rights reserved. Copyright ©2017- Yoshikazu Miyanaga autonomous Speech Recognition with Noise robust Speech Features Yoshikazu Miyanaga Hokkaido University Laboratory of Information Communication Networks Graduate School of Information Science and Technology Sapporo 060-0814, Hokkaido Japan

autonomous Speech Recognition with Noise robust Speech ... · PDF fileROBI -2014- Producer & Sales ... et. al., “Compensation for the effect of communication channel in auditory-like

Embed Size (px)

Citation preview

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

autonomous Speech Recognition with

Noise robust Speech Features

Yoshikazu Miyanaga

Hokkaido University

Laboratory of Information Communication Networks

Graduate School of Information Science and Technology

Sapporo 060-0814, Hokkaido Japan

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

BASIC AUTONOMOUS SPEECH RECOGNITION SYSTEM

Part 1

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Conditions for Speech Recognition

Short Isolated Speech: words, phrase (<2sec)

Continuous Speech: sentences (>2sec)

Attached Microphone (several cm – 10cm)

Remote Microphone(10cm – 5m)

Silent Room(>20dB SNR)

Living Room(20~10dB SNR)

Noisy Room & Outside

(<10dB SNR)

Long Distance Microphone(>5m)

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Cloud ASR

Continuous Sentence Speech

Attached Mic(<10cm)

Silent Room(>20dB)

Attached Mic(<10cm)

Short Sentence Speech

Living Room(20~10dB)

Short Isolated Speech: (<2sec)

Remote Mic: (<5m)

Living Room(20~10dB)

Array Microphone

Continuous Speech Recognition over Internet

Language Model with small Ontology

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Autonomous ASR

Short Isolated Speech: words, phrase (<2sec)

Remote Mic:(10cm – 5m)

Long Distance Mic:(>5m)

Silent Room(>20dB)

Living Room(20~10dB)

Noisy Room: exhibition(<10dB)

Attached Mic(several cm – 10cm)

Isolated Speech Recognition using own SW/HW

Our Target

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Voice Activity Detection

Automatic Speech Detection

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Autonomous Speech Recognition

Automatic Speech

Recognition

Candidates of Recognition Results(1) Good Morning

(2) See you

(3) How are you ?

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Automatic Speech Selection

Automatic Speech Rejection

Recognition Result: Good Morning

Candidates of Recognition Results(1) Good Morning

(2) See you

(3) How are you ?

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

ROBI -2014-

Producer & Sales Companyby Deagostini Japan, and Raytron Inc, JP

Design & Robot Controller by T.Takahashi, Robo-Garage Ltd

Autonomous ASRby Miyanaga Lab, HU

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

NOISE ROBUST SYSTEMS

Part 2

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Noise !

Clean

/h A ch I N O h E /

SNR = 10dB

SNR = 0dB

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Running Spectrum

Running spectra are obtained by accumulating short-time

spectrum

0 2000 4000 6000 8000 10000 12000

-0.1

0

0.1

0.2

DFT

Frame NumberFrequency

Running spectrum: time trajectory

of frequency

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Modulation Spectrum

Modulation Spectrum

Running Spectrum

Frame NumberFrequency

DFT on each frequencyFrequency Modulation

frequency

CMS, RASTA and RSF focuses on modulation spectra.

Modulation spectrum: spectrum versus time

trajectory of frequency.

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Clean Noisy (white noise at 5 dB SNR)

Speech components are dominant around 4

Hz in modulation spectrum.

Lower modulation frequency components can be assumed as

noise because of little changes in noise components.

Mod-F of Clean and Noisy Speech

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Speech components are dominant around 4

Hz in modulation spectrum.

Modulation Frequency [Hz]

Modulation Spectrum

Noise ComponentsSpeech Components

Unnecessary Part

Fre

qu

ency

(H

z)

Filtering over Running Spectrum

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

RASTA (1991) and RSF (2002)

RSF (Running Spectrum Filtering) enhances perceptual auditory components.

decreases noise components relatively by band-pass filtering in cepstral sequences.

),()(),(~

0

kinCihknCQ

i

Coefficients in FIR Filter

Modulation Frequency

RSF

RASTA(IIR)

N.Wada, Y.Miyanaga, et. al., "A Study about the Extract of Robust Speech Characteristics on Speech Recognition System", IEICE Technical Report, DSP2002-33, pp.19-22, May 2002.

H. Hermansky, et. al., “Compensation for the effect of communication channel in auditory-like analysis ofspeech (RASTA-PLP),” Proceedings of European Conference on Speech Technology, 1991, pp. 1367–1370.

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

DRA

DRA (Dynamic Range Adjustment)

normalizes amplitude of cepstral vectors in time domain (use of maximum value during utterance).

suppresses dynamic range distortions caused by additive noise.

k

knCknC

),(~

),(

|),(~

|max1

knCTk

k

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

RSF/DRA (2002)

10 20 30 40 50 60 70 80 90 100-3

-2

-1

0

1

2

3

RSF/DRA processing

10 20 30 40 50 60 70 80 90 100

-3

-2

-1

0

1

2

Baseline

10 20 30 40 50 60 70 80 90 100-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Clean

Noisy

Comparison in cepstral time-trajectories at 4th order

S.Yoshizawa, Y.Miyanaga et. al., "Hardware Implementation of a Noise Robust Speech Recognition System Using RSF/DRA Technique", IEICE Technical Report, CAS2003-42, VLD2003-52, DSP2003-72, pp.127-132, June 2003.

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

RSA (Running Spectrum Analysis)

Speech components in 0.5 – 7 Hz of the Modulation

Spectrum Domain are directly selected by DFT/FFT

operation.

Modulation Frequency [Hz]

Modulation Spectrum

Noise ComponentsSpeech Components

Unnecessary Part

Fre

qu

ency

(H

z)

Modulation Spectrum Domain Operation

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Conditions of Robust-ASR

ASR for Similar Japanese Pronunciation Phrases under Low SNR ( 10dB, 15dB)

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

ASR Results using RSA

G. M

ufu

ngulw

a, Y.M

iyanaga, e

t. al., "R

obust S

peech

Reco

gnitio

n

for S

imila

r Japanese

Pro

nuncia

tion P

hra

ses U

nder N

oisy

Conditio

ns", IS

SCS2017 IE

EE, to

be p

rese

nte

d, Ia

si, Rom

ania

, Ju

ly 2

017.

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Robust ASR

Improvement (%) on ASR Accuracy on NSP and SP

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

High Speed Eco ASR HW System

Definition of Real-time

180ms for speechprocessing

Selection of LSI Design Technology

90nm, 65nm

S.Yoshizawa, Y.Miyanaga et. al., "A VLSI Implementation of a Word Recognition System for Low-Power Design", IEICE Technical Report, CAS2002-28, VLD2002-42, DSP2002-68, pp.13-18, June 2002.

Design of Green LSIlower clock, sub-threshold,

parallel/pipeline, dynamic architecture

HU Robust ASR v1

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Current HU Robust ASR v4 (2014)

HU-ASR Board

PC Interface with HU-ASR Board

55mm×44 mm

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Robot Implementation

Autonomous Speech Recognition

Speech Synthesis

Quick Response

Control to Consumer Electronics and Machines

welfare

speech therapy

All rights reserved. Copyright ©2017- Yoshikazu Miyanaga

Summary

Autonomous ASR

Integrated Architecture of Speech Detection, Robust Speech Analysis, Speech Recognition, Speech Selection

Higher Speed Processing than DSP and SoftwareSuperior in Energy Saving than DSP Solutions