ZRE 2009 / 10 introductory talk Honza Černocký Speech@FIT, Brno University of Technology, Czech Republic ZRE 8.2.2010


ZRE 2009 / 10 introductory talk

Honza Černocký

Speech@FIT, Brno University of Technology, Czech Republic

ZRE 8.2.2010


Agenda

• Where we are and who we are
• Needle in a haystack
• Simple example - Gender ID
• Speaker recognition
• Language identification
• Keyword spotting
• CZ projects


Where is Brno ?


The place

• Brno University of Technology – 2nd largest technical university in the Czech Republic (~2500 staff, ~18000 students).

• Faculty of Information Technology (FIT) – its youngest faculty (created in January 2002).

• Reconstruction of the campus finished in Nov 2007 – now a beautiful place marrying an old Carthusian monastery and modern buildings.


Department of Computer Graphics and Multimedia

• Video/image processing
• Speech processing
• Knowledge engineering and natural language processing
• Medical visualization and 3D modeling

http://www.fit.vutbr.cz/units/UPGM/

[Figure: automatic video editing diagram. Blocks: Setup (desired information, editing properties), Scenario, Features extraction, Video editing algorithm (rules), Camera selection, Video editor. Input: input video streams. Output: output video stream.]


Speech@FIT

• University research group established in 1997

• 20 people in 2009 (faculty, researchers, students, support staff).

• Also provides education within the Department of Computer Graphics and Multimedia.

• Cooperating with EU and US universities and companies.

• Supported by EC, US and national projects

Speech@FIT’s goal: high-profile research in speech theory and algorithms


Key people

Directors:
• Dr. Jan “Honza” Černocký – executive direction
• Prof. Hynek Heřmanský (Johns Hopkins University, USA) – advisor and guru
• Dr. Lukáš Burget – scientific director

Sub-group leaders:
• Petr Schwarz – phonemes, implementation
• Pavel “Pája” Matějka – SpeakerID, LanguageID
• Pavel Smrž – NLP and semantic Web


The steel and soft …

Steel
• 3 IBM Blade centers with 42 IBM Blade servers, each with 2 dual-core CPUs
• Another ~120 computers in classrooms
• >16 TB of disk space
• Professional and friendly administration

Soft
• Common: HTK, Matlab, QuickNet, SGE
• Own SW: STK, BS-CORE, BS-API


Speech@FIT funding

• Faculty (faculty members and faculty-wide research funds)
• EU projects (FP[4567])
  • Past: SpeechDat-E, SpeeCon, M4, AMI, CareTaker.
  • Running: AMIDA, MOBIO, weKnowIt.
• US funding – Air Force’s EOARD
• Local funding agencies – Grant Agency of Czech Republic, Ministry of Education, Ministry of Trade and Commerce
• Czech “force” ministries – Defense, Interior
• Industrial contracts
• Spin-off – Phonexia, Ltd.


Phonexia Ltd.

• Company created in 2006 by 6 Speech@FIT members

• Closely cooperating with the research group

• Key people
  • Dr. Pavel Matějka, CEO
  • Dr. Petr Schwarz, CTO
  • Igor Szöke, CFO
  • Dr. Lukáš Burget, research coordinator
  • Dr. Jan Černocký, university relations
  • Tomáš Kašpárek, hardware architect

Phonexia’s goal: bringing mature technologies to the market, especially in the security/defense sector


Needle in a haystack

• Speech is the most important modality of human-human communication (~80% of information) … criminals and terrorists also communicate by speech.
• Speech is easy to acquire in both civilian and intelligence/defense scenarios.
• It is more difficult to find what we are looking for.
• Typically done by human experts, but always count on:
  • Limited personnel
  • Limited budget
  • Not enough languages spoken
  • Insufficient security clearances

Speech processing technologies are not almighty, but they can help to narrow the search space.


“Speech recognition”

GOAL: Automatically extract information transmitted in speech signal

The input speech is processed by:
• Speaker recognition → speaker name (John Doe)
• Gender recognition → gender (male or female)
• Language recognition → language (English/German/??)
• Speech recognition → what was said (“Hallo Crete!”)
• Keyword spotting → “Crete” spotted


Focus on evaluations

• “I'm better than the other guys” – not relevant unless everyone uses the same data and evaluation metrics.
• NIST – US Government Agency, http://www.nist.gov/speech
  • Regular benchmark campaigns – evaluations – of speech technologies.
  • All participants have the same data and the same limited time to process them and send results to NIST => objective comparison.
  • The results and details of systems are discussed at NIST workshops.
• Speech@FIT participates extensively in NIST evaluations:
  • Transcription 2005, 2006, 2007, 2009
  • Language ID 2003, 2005, 2007, 2009
  • Speaker Verification 1998, 1999, 2006, 2008
  • Spoken term detection 2006
• Why are we doing this?
  • We believe that evaluations are really advancing the state of the art
  • We do not want to waste our time on useless work …


What are we really doing?

Following the recipe from any pattern-recognition book:


And what is the result?

Something you’ve probably already seen:

[Figure: block diagram. input → Feature extraction → Evaluation of probabilities or likelihoods (using Models) → “Decoding” → decision]


The simplest example … GID

Gender Identification
• The easiest speech application to deploy …
• … and the most accurate (>96% on challenging channels)
• Limits search space by 50%


So how is Gender-ID done?

[Figure: block diagram. input → MFCC → Evaluation of GMM likelihoods (Gaussian Mixture models – boys, girls) → Decision: male/female]


Features – Mel Frequency Cepstral Coefficients

• The signal is not stationary

• And the hearing is not linear
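
A minimal sketch of the two points above, not taken from the talk: short overlapping frames let us treat the signal as locally stationary, and the mel scale warps frequency roughly the way hearing does. The 16 kHz sampling rate, the 25 ms / 10 ms framing and the 2595·log10(1 + f/700) mel formula are common textbook choices assumed here.

```python
import numpy as np

def frame_signal(x, frame_len, frame_shift):
    """Cut a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    idx = np.arange(frame_len)[None, :] + frame_shift * np.arange(n_frames)[:, None]
    return x[idx]

def hz_to_mel(f_hz):
    """Common mel-scale approximation (an assumed formula, see lead-in)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

fs = 16000                                              # assumed sampling rate
signal = np.random.default_rng(0).standard_normal(fs)   # 1 s of noise as a stand-in
frames = frame_signal(signal, frame_len=400, frame_shift=160)  # 25 ms / 10 ms

# Equal steps in Hz shrink on the mel scale as frequency grows.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(4100) - hz_to_mel(4000)
```

A full MFCC front-end would follow with windowing, a power spectrum, a mel filter bank, a logarithm and a DCT on each frame.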


The evaluation of likelihoods: GMM
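
As a hedged illustration of what "evaluating GMM likelihoods" means, the sketch below computes per-frame log-likelihoods of a diagonal-covariance Gaussian mixture. The function name, array shapes and toy parameters are assumptions made for this example, not the group's implementation.

```python
import numpy as np

def gmm_frame_loglik(X, weights, means, variances):
    """Per-frame log p(x | GMM) for a diagonal-covariance mixture.

    X: (T, D) features; weights: (C,); means, variances: (C, D).
    """
    diff = X[:, None, :] - means[None, :, :]                      # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))  # (T, C)
    log_weighted = np.log(weights) + log_gauss
    # Log-sum-exp over components, done stably.
    m = log_weighted.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_weighted - m).sum(axis=1, keepdims=True)))[:, 0]

# Toy model: two identical unit-variance components in one dimension.
X = np.array([[0.0], [1.0]])
ll = gmm_frame_loglik(X,
                      weights=np.array([0.5, 0.5]),
                      means=np.array([[0.0], [0.0]]),
                      variances=np.array([[1.0], [1.0]]))
utterance_loglik = ll.sum()   # frames are treated as independent
```

In the gender-ID setting, the same features would be scored against one such model per class, and the two utterance log-likelihoods compared.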


The decision – Bayes rule.

GID DEMO
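
The Bayes-rule decision can be sketched as follows, assuming we already have one utterance log-likelihood per gender model. The log-likelihood values and the equal priors are made up for illustration.

```python
import numpy as np

def gender_posteriors(loglik_male, loglik_female, prior_male=0.5):
    """Posterior P(class | utterance) from class log-likelihoods and priors."""
    log_joint = np.array([loglik_male + np.log(prior_male),
                          loglik_female + np.log(1.0 - prior_male)])
    log_joint -= log_joint.max()          # avoid underflow before exponentiating
    post = np.exp(log_joint)
    post /= post.sum()
    return {"male": float(post[0]), "female": float(post[1])}

# Hypothetical utterance log-likelihoods from the two GMMs.
post = gender_posteriors(loglik_male=-4210.0, loglik_female=-4215.0)
decision = max(post, key=post.get)
```

Working in the log domain matters here: utterance likelihoods are products over hundreds of frames and would underflow if exponentiated directly.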


Speaker recognition

• Speaker recognition aims at recognizing "who said it".

• In speaker identification, the task is to assign a speech signal to one out of N speakers.

• In speaker verification, the claimed identity is known and the question to be answered is "was the speaker really Mr. XYZ or an impostor?"

[Figure: speaker verification block diagram. Labels: Front-end processing, Target model, Background model, Adapt, Score normalization]
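
A minimal sketch of scoring a verification trial with a target model and a background model: the score is the average log-likelihood ratio, compared with a threshold. Single diagonal Gaussians stand in for the real models, and all means, variances and the threshold are hypothetical values chosen for the example.

```python
import numpy as np

def gauss_loglik(X, mean, var):
    """Per-frame log-likelihood under a diagonal Gaussian."""
    return -0.5 * np.sum((X - mean) ** 2 / var + np.log(2 * np.pi * var), axis=1)

def verification_score(X, target, background):
    """Average log-likelihood ratio between target and background model."""
    return float(np.mean(gauss_loglik(X, *target) - gauss_loglik(X, *background)))

rng = np.random.default_rng(0)
background = (np.zeros(2), np.ones(2) * 4.0)      # broad "everybody" model
target = (np.array([1.5, -1.0]), np.ones(2))      # hypothetical claimed speaker
genuine = rng.normal(target[0], 1.0, size=(500, 2))
impostor = rng.normal([-2.0, 2.0], 1.0, size=(500, 2))

threshold = 0.0   # assumption; in practice tuned on development data
accept_genuine = verification_score(genuine, target, background) > threshold
accept_impostor = verification_score(impostor, target, background) > threshold
```

Score normalization, shown as a separate block in the figure, would rescale this raw score before thresholding; it is left out of the sketch.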


Bad session variability

Example: single Gaussian model with 2D features

[Figure: UBM and target speaker model in a 2D feature space, with the directions of high inter-session variability and high speaker variability marked]


And what to do about it

For recognition, move both models along the high inter-session variability direction(s) to fit the test data well.

[Figure: UBM, target speaker model and test data, with the directions of high inter-session variability and high inter-speaker variability marked]
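
The idea of moving both models along the inter-session direction before scoring can be sketched in the same 2D, single-Gaussian setting. This is a toy illustration under strong assumptions (unit covariance, one known session direction `u`, hypothetical means), not the system described in the talk.

```python
import numpy as np

def shift_along(mean, u, X):
    """Move a model mean along direction u to best fit data X
    (maximum-likelihood shift for a unit-covariance Gaussian)."""
    x = np.dot(u, X.mean(axis=0) - mean) / np.dot(u, u)
    return mean + x * u

def loglik(X, mean):
    """Per-frame log-likelihood under a unit-covariance Gaussian."""
    return -0.5 * np.sum((X - mean) ** 2 + np.log(2 * np.pi), axis=1)

def compensated_score(X, target_mean, ubm_mean, u):
    """Log-likelihood ratio after shifting both models along u."""
    t = shift_along(target_mean, u, X)
    b = shift_along(ubm_mean, u, X)
    return float(np.mean(loglik(X, t) - loglik(X, b)))

u = np.array([1.0, 0.0])              # assumed inter-session direction
ubm_mean = np.array([0.0, 0.0])
target_mean = np.array([1.0, 2.0])    # hypothetical target speaker
rng = np.random.default_rng(1)
# An impostor (close to the UBM) recorded in a session shifted along u.
test = rng.normal(ubm_mean + 5.0 * u, 1.0, size=(400, 2))

raw = float(np.mean(loglik(test, target_mean) - loglik(test, ubm_mean)))
comp = compensated_score(test, target_mean, ubm_mean, u)
```

In this toy case the raw score is positive because the session shift happens to push the data toward the target model, while the compensated score is negative, so the impostor is rejected.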


Research achievements

Key thing:
• Joint Factor Analysis (JFA) decomposes models into channel and speaker sub-spaces.
• Coping with unwanted variability
• At the same time, compact representation of speakers allowing for extremely fast scoring of speech files.
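
For orientation, the decomposition is usually written as below. This is the commonly cited JFA formulation, stated from general knowledge rather than from the slide, so the symbols are assumptions: M is the speaker- and session-dependent supervector of GMM means, m the UBM mean supervector, V spans the speaker sub-space, U the channel sub-space, and D is a diagonal residual matrix.

```latex
\[
  \mathbf{M} = \mathbf{m} + \mathbf{V}\mathbf{y} + \mathbf{U}\mathbf{x} + \mathbf{D}\mathbf{z}
\]
```

Here y and z are speaker factors and x are channel factors that change from session to session. The low-dimensional vector y is the kind of compact speaker representation that makes fast scoring possible.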

Speaker search DEMO

[Figure, left: NIST SRE 2006 results – BUT, STBU consortium]

[Figure, right: NIST SRE 2008 results – confirming leading position]


The goal of language ID

• Determine the language of a speech segment

LID


Two main approaches to LID

• Acoustic – Gaussian Mixture Model

• Phonotactic – Phone Recognition followed by Language Model


Acoustics

• Good for short speech segments and dialect recognition
• Relies on the sounds
• Done by discriminatively trained GMMs with channel compensation


Phonotactic approach

• Good for longer speech segments
• Robust against dialects in one language
• Eliminates speech characteristics of speaker's native language
• Based on high-quality NN-based phone recognizer … producing strings or lattices


Phonotactic modeling - example

Trigram   German   English   Test
u n d     25       1         5
a n d     3        32        0
t h e     0        13        1
. . .     . . .    . . .     . . .

• N-gram language models – discounting, backoff
• Binary decision trees – adaptation from UBM
• Support Vector Machines – vectors with counts
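
A minimal sketch of the N-gram variant, using the counts from the example above: each language gets a smoothed model estimated from its training counts, and the test counts are scored against both. Add-one smoothing is a deliberate simplification standing in for the discounting and backoff named on the slide.

```python
import math

# Trigram counts from the example above (phone strings from the recognizer).
counts = {
    "German":  {"u n d": 25, "a n d": 3,  "t h e": 0},
    "English": {"u n d": 1,  "a n d": 32, "t h e": 13},
}
test_counts = {"u n d": 5, "a n d": 0, "t h e": 1}

def log_likelihood(test, train, alpha=1.0):
    """Score test n-gram counts under an add-alpha smoothed model of train counts."""
    total = sum(train.values()) + alpha * len(train)
    return sum(n * math.log((train[g] + alpha) / total) for g, n in test.items())

scores = {lang: log_likelihood(test_counts, c) for lang, c in counts.items()}
best = max(scores, key=scores.get)
```

With these counts the test segment scores higher under the German model, driven by the five occurrences of "u n d". The SVM variant would instead feed the (normalized) count vectors directly to a classifier.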


Research achievements

NIST evaluation results:
• LRE 2005 – Speech@FIT the best in 2 out of 3 categories
• LRE 2007 – confirmation of the leading position.
• LRE 2009 – a bit of bad luck but very good post-submission system

Example outputs (decision T/F and score for each language, one column per example):

Language   Example 1   Example 2   Example 3   Example 4
ara        F 0.0       F 0.0       F 0.0       T 42.9
eng        F 0.0       T 93.3      F 15.1      F 1.7
far        F 0.0       F 0.0       F 0.0       F 12.9
fre        T 99.9      F 0.3       F 0.0       F 0.0
ger        F 0.0       F 4.9       T 84.7      F 0.0
hin        F 0.0       F 0.0       F 0.0       F 11.2
jap        F 0.0       F 0.0       F 0.0       F 0.9
kor        F 0.0       F 0.0       F 0.0       F 22.2
man        F 0.0       F 1.3       F 0.0       F 0.0
spa        F 0.0       F 0.0       F 0.0       F 0.1
tam        F 0.0       F 0.0       F 0.0       F 7.4
vie        F 0.0       F 0.1       F 0.0       F 0.1

Key things:
• Discriminative modeling
• Channel compensation
• Gathering training data from public sources

Web demo: http://speech.fit.vutbr.cz/lid-demo/


Keyword spotting

• What? Which recording and when? Confidence?
• Comparing keyword model output with an anti-model.

Technical approaches
• Acoustic keyword spotting
• Searching in an output of Large Vocabulary Continuous Speech Recognizer (LVCSR)
• Searching in an output of LVCSR completed with sub-word units.

The choices:
• What is the needed tradeoff between speed and accuracy?
• How to cope with the “devil” of keyword spotting: Out of Vocabulary (OOV) words


Acoustic KWS

• No problem with OOVs
• Indexing not possible – need to go through everything
• Down to 0.01xRT
• Does not have the strength of LM – problem with short words and sub-words.

• Model of a word against a background model.

• No language model
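
A rough sketch of scoring a word model against a background model: given per-frame log-likelihoods from both, a sliding window reports where the average log-likelihood ratio exceeds a threshold. Real acoustic KWS would decode with a keyword HMM rather than a fixed window; the window length, threshold and log-likelihood values here are all hypothetical.

```python
import numpy as np

def spot_keyword(kw_loglik, bg_loglik, win, threshold):
    """Return (start_frame, score) for windows whose average
    keyword-vs-background log-likelihood ratio exceeds threshold."""
    llr = np.asarray(kw_loglik) - np.asarray(bg_loglik)
    csum = np.concatenate(([0.0], np.cumsum(llr)))
    scores = (csum[win:] - csum[:-win]) / win      # mean LLR of every window
    return [(int(t), float(s)) for t, s in enumerate(scores) if s > threshold]

# Hypothetical per-frame log-likelihoods: the keyword model fits frames 40..59.
bg = np.full(100, -5.0)
kw = np.full(100, -8.0)
kw[40:60] = -3.0
hits = spot_keyword(kw, bg, win=20, threshold=1.5)
best_start, best_score = max(hits, key=lambda h: h[1])
```

Every frame of every recording has to pass through this scoring, which is why indexing is not possible with the purely acoustic approach.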


Searching in the output of LVCSR

• Speed of search
• More precise on frequent words.
• Limited by LVCSR vocabulary – OOV
• LVCSR is more complex and slower.

• LVCSR, then search in 1-best or lattice.
• Indexing possible
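
To show why indexing makes search fast, here is a small sketch of an inverted index over 1-best recognizer output. The transcript format (word, start time, confidence) and the recording names are assumptions for illustration; a lattice index would store alternative hypotheses as well.

```python
from collections import defaultdict

def build_index(transcripts):
    """transcripts: {recording_id: [(word, start_sec, confidence), ...]}"""
    index = defaultdict(list)
    for rec, words in transcripts.items():
        for word, start, conf in words:
            index[word.lower()].append((rec, start, conf))
    return index

def search(index, keyword, min_conf=0.0):
    """Return (recording, time, confidence) hits, most confident first."""
    hits = [h for h in index.get(keyword.lower(), []) if h[2] >= min_conf]
    return sorted(hits, key=lambda h: -h[2])

# Hypothetical 1-best recognizer output for two recordings.
transcripts = {
    "rec1": [("hallo", 0.2, 0.91), ("crete", 0.7, 0.84)],
    "rec2": [("meet", 1.0, 0.77), ("in", 1.3, 0.95), ("crete", 1.5, 0.62)],
}
index = build_index(transcripts)
hits = search(index, "Crete", min_conf=0.7)
oov = search(index, "Brno")   # never recognized, so nothing to find
```

The lookup answers "which recording, when, and with what confidence" without touching the audio again, but a word outside the recognizer's vocabulary can never appear in the index.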


Searching in the output of LVCSR + sub-words

• Speed of search preserved
• Precision on frequent words preserved.
• Allows searching for OOVs without additional processing of all data.
• LVCSR and indexing are more complex.

• LVCSR with words and sub-word units.

• Indexing of both words and sub-word units
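
A sketch of the sub-word side of such an index: phone trigrams are indexed by position, and an OOV query is converted to a phone sequence and matched against them. The phone strings and query pronunciations are invented, and exact matching is a simplification; a real system would search lattices and score approximate matches.

```python
from collections import defaultdict

def phone_ngrams(phones, n=3):
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]

def build_subword_index(phone_strings, n=3):
    """phone_strings: {recording_id: [phone, ...]} from a phone recognizer."""
    index = defaultdict(set)
    for rec, phones in phone_strings.items():
        for i, gram in enumerate(phone_ngrams(phones, n)):
            index[gram].add((rec, i))
    return index

def search_oov(index, query_phones, n=3):
    """Find positions where all n-grams of the query occur consecutively."""
    grams = phone_ngrams(query_phones, n)
    if not grams:
        return []
    hits = []
    for rec, pos in index.get(grams[0], ()):
        if all((rec, pos + k) in index.get(g, ()) for k, g in enumerate(grams)):
            hits.append((rec, pos))
    return sorted(hits)

# Hypothetical phone recognizer output.
phone_strings = {
    "rec1": "h a l o k r e t e".split(),
    "rec2": "b r n o j e k r a s n e".split(),
}
index = build_subword_index(phone_strings)
hits = search_oov(index, "b r n o".split())
```

Because the phone index is built once, a new OOV keyword only costs a lookup rather than another pass over all the data.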


Research achievements

Key things:
• Expertise with acoustic, word and sub-word recognition
• Excellent front-ends – LVCSR and phone recognizer.
• Speech indexing and search
• Normalization of scores.

DEMO – Russian acoustic KWS

NIST STD 2006 – English
MV Task 2008 – Czech
