
Non-linear speech processing: overview of COST-277 current research

1

Nonlinear speech processing (NOLISP)

Overview of COST-277 current research

Marcos Faúndez-Zanuy

([email protected])

COST-277 Chairman


2

OUTLINE

1. Overview: what does “nonlinear” mean?

2. Organization of COST-277

3. Activity report, June 2001 – June 2003







4

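What does “non-linear” mean? (Strict sense)

A system f is non-linear when the superposition principle does not hold. For a linear system:

Given f(x1) = y1 and f(x2) = y2 =>

f(a·x1) = a·y1 and f(x1 + x2) = y1 + y2

A non-linear system violates at least one of these two equalities.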
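The superposition test can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `uniform_quantizer` and `obeys_superposition` and the sample values are hypothetical, chosen only for illustration. It compares a pure gain (linear) with a rounding quantizer (non-linear).

```python
import numpy as np

def uniform_quantizer(x, step=1.0):
    """Uniform quantizer (illustrative): round to the nearest multiple of `step`."""
    return step * np.round(np.asarray(x, dtype=float) / step)

def obeys_superposition(f, x1, x2, a=0.5, tol=1e-9):
    """Check homogeneity f(a*x1) = a*f(x1) and additivity f(x1+x2) = f(x1)+f(x2)."""
    homogeneous = np.allclose(f(a * x1), a * f(x1), atol=tol)
    additive = np.allclose(f(x1 + x2), f(x1) + f(x2), atol=tol)
    return bool(homogeneous and additive)

# Arbitrary test inputs (assumed values, not from the slides).
x1 = np.array([0.4, 1.2, -0.7])
x2 = np.array([0.3, 0.4, 0.4])

gain = lambda x: 3.0 * x                      # a linear system
linear_ok = obeys_superposition(gain, x1, x2)
quantizer_ok = obeys_superposition(uniform_quantizer, x1, x2)
```

A check on a handful of inputs can only disprove linearity, never prove it; it is used here because a single counterexample is enough to show that the quantizer is non-linear.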


5

What does “non-linear” mean? Strict sense: almost “everything” is nonlinear

Acquisition: quantizer (linear, A-law, etc.)

Parameterization: cepstrum

Models: HMM, VQ

[Figure: input-output characteristic of a uniform 3-bit quantizer; both axes run from -8 to 8 and the eight output levels are labelled -4 to 3]

$\mathrm{cepstrum}(x[n]) = F^{-1}\{\log |F\{x[n]\}|\}$
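The cepstrum is a standard example of a non-linear parameterization, because of the logarithm between the two Fourier transforms. A minimal sketch of the real cepstrum, assuming NumPy; the function name `real_cepstrum`, the frame length of 256 and the random test frame are assumptions made for illustration:

```python
import numpy as np

def real_cepstrum(frame, eps=1e-12):
    """Inverse FFT of the log magnitude spectrum of one frame."""
    spectrum = np.fft.fft(frame)
    log_magnitude = np.log(np.abs(spectrum) + eps)  # eps avoids log(0)
    return np.fft.ifft(log_magnitude).real

# Synthetic frame (assumed data): windowed white noise.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256) * np.hamming(256)

c = real_cepstrum(frame)
c_scaled = real_cepstrum(2.0 * frame)
```

Doubling the gain of the frame only shifts the first coefficient by log 2; all other coefficients are unchanged. A multiplicative change in the signal becomes an additive change in the cepstral domain.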


6

Non-linearities are always present

Nonlinearities of the systems that generate the signal and/or noise

Nonlinearities of the signal acquisition system

Nonlinearities of the transmission channel

Nonlinearities of the human perception mechanism


7

Classical approach

Wide sense: linear speech processing

The speech signal model consists of a pulse/noise source and a linear filter, where both change their characteristics on a frame-by-frame basis.

This approach neglects structure known to be present in the speech signal.


8

Evidence of nonlinearities

Residue comparison

Correlation dimension

Higher order statistics

Probability density functions


9

Example: Linear vs NL


10

Drawbacks with NOLISP approaches

A lack of a unifying theory of the different nonlinear processing tools (nnets, homomorphic, polynomial, morphological, ordered statistics filters, and so on)

High computational burden

Well-known analysis tools are not applicable

Usually, a closed-form formulation does not exist, and iterative methods (with local minima problems) must be used.


11

What are we mainly looking for?

The replacement of the linear filter (or parts thereof) with nonlinear operators (models) should enable us to obtain an accurate description of the speech signal with a lower number of parameters. This in turn should lead to better performance of practical speech processing applications.







13

What is COST?

Intergovernmental cooperation
– Created in 1971
– 17 scientific and technical domains

Participation
– 33 COST countries
– European Commission
– International organisations
– Organizations from non-COST countries on a mutual benefit basis

COST Actions
– Concerted Actions of nationally funded R&D


14

COST TIST: Telecommunications, Information Science and Technologies


15

COST Countries

The fifteen EU Member States

The EFTA Member States

Iceland

Norway

Switzerland

Central and Eastern countries

Estonia

Latvia

Lithuania

Poland

Czech Republic

Slovakia

Slovenia

Croatia

Romania

Bulgaria

Other countries

Cyprus

Malta

Turkey

Hungary


16

Evolution of COST Actions

[Figure: number of total and starting COST Actions per year, 1980 to 2000; vertical axis from 0 to 250]


17

WHAT IS A COST ACTION?

Concerted Action

Pan-European

“Non-competitive” research

R&D financed nationally

Flexibility

Bottom-up

À la carte participation

Commission funds only coordination activities


18

COST Senior Officials (CSO)

Responsible for the overall strategy of COST

Decides on the launching of each individual COST Action

Approves participation of institutes from non-COST countries

Approves prolongation of COST Actions


19

COST Technical Committee (TC)

Selection of new COST Actions

Monitoring of ongoing COST Actions

Evaluation of completed COST Actions

Dissemination and Valorisation of COST activities

Provides advice to the EC on budget planning


20

Management Committee (MC)

Supervises and coordinates the implementation of the Action

Composed of:
– A maximum of two representatives of each signatory country; they ensure the scientific coordination at national level
– One representative of any non-COST institution admitted to participate
– The Scientific Secretary
– Representatives of the Commission services

Each signatory has one vote


21

Working Group (WG)

Small number of researchers per working group

Working group members may be:

– Management Committee members

– Other scientists from the signatory countries


22

COST TIST

~28 Actions, ~2000 organisations

Covering basic research on
– Antennas and Radio Propagation
– Satellite Technologies and Services
– Mobile Technologies and Services
– Optical Networking Components and Services
– Internet & Multimedia Network Services
– Speech Technologies
– Information and Computer Science

Strong Relationship with IST Program


23

Evolution of COST TIST Actions

[Figure: number of total and starting COST TIST Actions for 1996, 1998 and 2000; vertical axis from 0 to 30]


24

COST TIST Research Domains & Actions

Special Needs & User Requirements: COST 219bis, 269

Antennas/Radio Propagation: COST 244bis, 255, 260, 261, 271

Mobile & Personal Comm.: COST 259, 273

Satellite Tech. & Services: COST 272

Optical Networking: COST 265, 266, 267, 268, 270

New Internet & Multimedia Services: COST 211 Quad, 256, 257, 263, 264, 269, 275, 279

Speech Technologies: COST 258, 277, 278

Information & Computer Science: COST 274, 276


25

Other COST Actions in Speech Technologies

COST 275: Biometrics-Based Recognition of People over the Internet
– Involves the use of both voice and face recognition for user authentication over the Internet

COST 278: Spoken Language Interaction in Telecommunications
– Improve knowledge regarding issues and problems related to spoken language interaction, including robustness and multi-lingual aspects
– Human-computer interaction using spoken language in multi-modal context, including dialogue theories and application evaluation


26

Relationship between COST Actions 275, 277 and 278

[Figure: diagram relating the three Actions across three levels: Application Fields, Interface Components and Generic Functions]

Actions:
– 275: Biometrics-Based Recognition of People over the Internet
– 277: Non-linear Speech Processing
– 278: Spoken Language Interaction in Telecommunication

Blocks shown: Speaker Recognition; Speech Recognition; Natural Language Processing; Multi Modality & Data Fusion; Speech Analysis & Coding; Image Analysis & Graphics; Speech Synthesis; Dialogue


27

GRANT CONTRACTS

COST TIST support is provided through annual Grant Contracts with the coordinating organisation

The contract covers costs for:
– Secretariat (manpower to cover administration)
– Meetings (WG and MC)
– Seminars and workshops
– Short Term Scientific Missions
– Publications


28

SECRETARIAT

Contract management, payments

Reimbursement of meetings

Rebuilding of the WWW site
– Repository of official documents
– TC and Action activities and events

Enhancing dissemination
– Newsletter
– Central index and storage of reports for retrieval

Links with EC (IST) and national programmes


29

Overview: COST-277

[Figure: diagram linking human speech, synthetic speech, coded speech and written speech through discrete models, by means of analysis, synthesis, recognition and coding; the conversions are labelled TtS, StT, StC and CtS. © ukl 2002]


30

Organization

Chair: Marcos Faúndez

Vice-Chair: Gernot Kubin

Secretary: Stephen McLaughlin
– WG1: Bastiaan Kleijn
– WG2: Bojan Petek
– WG3: Stephen McLaughlin
– WG4: Gerard Chollet


31

Countries

Austria Belgium Czech Republic France Germany Greece Ireland Italy Lithuania Portugal Slovakia Slovenia Spain Sweden Switzerland UK

Canada


32

Dissemination of info

e-mail distribution list:

[email protected]

Subscribe/unsubscribe [email protected]

Website:

http://www.ee.ed.ac.uk/cost277/


33

Future Meetings of the management committee


34

Publications and reports

International Journal of Control and Intelligent Systems, special issue on Non-linear Speech Processing Techniques and Applications, ACTA Press. Invited editor: A. Hussain (COST-277 MC member)

Special sessions in EUSIPCO’02, IWANN’01, IWANN’03, EUSIPCO’04 (TBC)


35

COST Actions in Speech Technologies

COST 275: Biometrics-Based Recognition of People over the Internet
– Involves the use of both voice and face recognition for user authentication over the Internet

COST 277: Nonlinear speech processing

COST 278: Spoken Language Interaction in Telecommunications
– Improve knowledge regarding issues and problems related to spoken language interaction, including robustness and multi-lingual aspects
– Human-computer interaction using spoken language in multi-modal context, including dialogue theories and application evaluation




37

COST-277: A different approach

“The four classical areas of speech processing:

Speech Recognition (Speech-to-Text, StT)

Speech Synthesis (Text-to-Speech, TtS and Code-to-Speech, CtS)

Speech Coding (Speech-to-Code, StC with CtS) and

Speaker Verification and Identification (SV)

have all developed their own methodology almost independently from the neighboring areas. This has led to a plurality of tools and methods that are hard to integrate into any small multifunctional speech processing system (a mobile phone performing speaker verification and continuous speech recognition in addition to speech coding should have many separate processes running in parallel).”


38

Relations between different fields

[Figure: diagram linking human speech, synthetic speech, coded speech and written speech through discrete models, by means of analysis, synthesis, recognition and coding; the conversions are labelled TtS, StT, StC and CtS. © ukl 2002]


39

COST 277: Non-linear speech processing

PROGRESS REPORT

Period: from (June-2001) to (June-2003)

40

Speech coding

LINEAR PREDICTION

Scalar linear prediction: AR modeling of order P:

$x[n] = \sum_{i=1}^{P} a_i\, x[n-i] + e[n]$

where $a_i$ are the scalar prediction coefficients, obtained with the Levinson-Durbin recursion.

Vectorial linear prediction: AR-vector modeling of order P:

$\vec{x}[n] = \sum_{i=1}^{P} A_i\, \vec{x}[n-i] + \vec{e}[n]$

where $\{A_i\}_{i=1,\dots,P}$ are $m \times m$ matrices.
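For the scalar case, the Levinson-Durbin recursion can be sketched as follows. This is a minimal illustration assuming NumPy and the sign convention x[n] ≈ Σ a_i x[n-i]; the function names and the synthetic AR(2) test signal (coefficients 1.3 and -0.6) are assumptions, not values from the slides.

```python
import numpy as np

def autocorrelation(x, order):
    """Autocorrelation lags r[0..order] of a frame."""
    x = np.asarray(x, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the normal equations sum_i a_i r[|i-j|] = r[j], j = 1..order.

    Returns (a, err): a[i-1] holds a_i, err is the final prediction error power.
    """
    a = np.zeros(order)
    err = r[0]
    for m in range(order):
        # Reflection coefficient for model order m + 1.
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err
        a_prev = a[:m].copy()
        a[:m] = a_prev - k * a_prev[::-1]   # update lower-order coefficients
        a[m] = k
        err *= 1.0 - k * k                  # error power shrinks at each step
    return a, err

# Synthetic AR(2) signal (assumed test data).
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
x = np.zeros(5000)
for n in range(2, 5000):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]

r = autocorrelation(x, 2)
a, err = levinson_durbin(r, 2)   # a should be close to [1.3, -0.6]
```

The recursion needs on the order of P² operations, against P³ for a general linear solver, because it exploits the Toeplitz structure of the autocorrelation matrix.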

41

NL SCALAR PREDICTION WITH NNET

[Figure: neural net with an input layer, a hidden layer and an output layer]

Inputs: x[n-p], x[n-p+1], ..., x[n-1]

Output: x[n]
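The structure of such a predictor can be sketched as a one-hidden-layer perceptron that maps the p previous samples to an estimate of x[n]. This is only an illustration assuming NumPy: the layer sizes, the tanh activation, the random untrained weights and the names `make_frames` and `mlp_predict` are all hypothetical, and no training procedure is shown.

```python
import numpy as np

def make_frames(x, p):
    """Build inputs [x[n-p], ..., x[n-1]] and targets x[n] for n = p..len(x)-1."""
    x = np.asarray(x, dtype=float)
    X = np.array([x[n - p:n] for n in range(p, len(x))])
    y = x[p:]
    return X, y

def mlp_predict(X, W1, b1, w2, b2):
    """Forward pass: tanh hidden layer followed by a linear output neuron."""
    hidden = np.tanh(X @ W1 + b1)
    return hidden @ w2 + b2

# Assumed sizes: prediction order 10, 4 hidden neurons.
p, hidden_units = 10, 4
rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((p, hidden_units))
b1 = np.zeros(hidden_units)
w2 = 0.1 * rng.standard_normal(hidden_units)
b2 = 0.0

x = np.sin(0.05 * np.arange(200))        # toy signal standing in for speech
X, y = make_frames(x, p)
y_hat = mlp_predict(X, W1, b1, w2, b2)   # untrained prediction of x[n]
residual = y - y_hat                     # prediction error
```

Replacing the tanh with the identity reduces this network to an ordinary linear predictor, which is one way to see the neural net as a generalization of linear prediction.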

42

NL VECTORIAL PREDICTION WITH NNET

[Figure: neural net with an input layer, a hidden layer and an output layer with several outputs]

Inputs: x[n-p], x[n-p+1], ..., x[n-1]

Outputs: x[n], x[n+1]

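43

ADPCM NNET PREDICTION

[Figure: ADPCM coder block diagram. The prediction x̃[n] is subtracted from the input x[n] to form the difference d[n], which the quantizer Q encodes as the code c[n]. The inverse quantizer Q⁻¹ recovers d̃[n], which is used to build the reconstruction x̂[n]. The predictors MLP1, MLP2, ..., MLPN output x̃1[n], ..., x̃N[n], which a block labelled COM. merges into x̃[n].]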
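The ADPCM loop itself does not depend on the kind of predictor. The sketch below, assuming NumPy, shows a generic backward-adaptive loop in which the predictor is a pluggable function. A trivial "previous sample" predictor stands in for the MLPs, and the step size, bit count and function names are assumptions made for illustration.

```python
import numpy as np

def quantize(d, step, bits):
    """Q: map the prediction error d to an integer code, clipped to 2**bits levels."""
    levels = 2 ** bits
    c = int(np.floor(d / step)) + levels // 2
    return min(max(c, 0), levels - 1)

def dequantize(c, step, bits):
    """Inverse quantizer: return the mid-point of the cell indexed by c."""
    levels = 2 ** bits
    return (c - levels // 2 + 0.5) * step

def adpcm_encode(x, predictor, p, step, bits):
    """Encode x; the predictor only sees past reconstructed samples."""
    history = np.zeros(p)                 # oldest sample first
    codes, recon = [], []
    for sample in x:
        pred = predictor(history)
        c = quantize(sample - pred, step, bits)
        x_hat = pred + dequantize(c, step, bits)
        codes.append(c)
        recon.append(x_hat)
        history = np.append(history[1:], x_hat)
    return np.array(codes), np.array(recon)

def adpcm_decode(codes, predictor, p, step, bits):
    """Decoder: same predictor and history update, driven by the codes alone."""
    history = np.zeros(p)
    out = []
    for c in codes:
        x_hat = predictor(history) + dequantize(c, step, bits)
        out.append(x_hat)
        history = np.append(history[1:], x_hat)
    return np.array(out)

last_sample = lambda h: h[-1]             # stand-in predictor (assumption)
x = np.sin(0.1 * np.arange(100))
codes, recon = adpcm_encode(x, last_sample, p=2, step=0.05, bits=4)
decoded = adpcm_decode(codes, last_sample, p=2, step=0.05, bits=4)
```

Because the predictor is fed reconstructed samples rather than the original ones, the decoder can reproduce the encoder's predictions exactly. Any non-linear predictor with the same interface could replace `last_sample`.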

44

VECTORIAL NL-ADPCM RESULTS

[Figure: SEGSNR (axis from 6 to 26) versus bits per sample (1 to 4) for the 1D, 2D, 3D, 4D and 5D cases]
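SEGSNR is taken here to mean the segmental signal-to-noise ratio: the average over frames of the per-frame SNR in dB. A minimal sketch assuming NumPy; the frame length of 160 samples and the function name `segsnr_db` are assumptions, and the slides do not state the exact variant behind the plotted results.

```python
import numpy as np

def segsnr_db(x, x_hat, frame_len=160, eps=1e-12):
    """Mean over full frames of 10*log10(signal energy / error energy)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    n_frames = len(x) // frame_len
    snrs = []
    for k in range(n_frames):
        s = slice(k * frame_len, (k + 1) * frame_len)
        signal = np.sum(x[s] ** 2)
        noise = np.sum((x[s] - x_hat[s]) ** 2)
        snrs.append(10.0 * np.log10((signal + eps) / (noise + eps)))
    return float(np.mean(snrs))

# Toy check: an error of 10% of the signal amplitude gives 20 dB in every frame.
x = np.sin(0.1 * np.arange(1600))
value = segsnr_db(x, 1.1 * x)
```

Averaging in the log domain gives quiet frames the same weight as loud ones, which is why this measure is often preferred to a global SNR for speech coders.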


45

Very low bit rate speech coder

Demonstration


46

Broadcast news audio segmentation, classification, clustering and speech recognition

Demonstration

Available at http://193.126.86.80


47

SPEAKER RECOGNITION

Current systems rely on low-level information in speech.
– Short time extent analysis windows (20-30 ms)
– Spectral energy based (MFCC)

Another possibility: high-level information
– Speaking rate
– Pitch patterns
– Word/phrase usage
– Idiosyncratic pronunciation


48

SPEAKER RECOGNITION: Possibilities of NOLISP

Low-level information:
– Non-linear predictive models instead of LPCC
– Parameters: Fractal, Lyapunov exponents, correlation dimension, etc.

High-level information:
– To take advantage of the other working groups. For instance, intonation is fundamental in speech synthesis and useful for speaker recognition.


49

Why use NL models?

When listening to the residual signal of an LPC analysis, it is possible to identify who is speaking.
– Usually the residual signal is discarded.
– NL models offer a better fit and a whiter residual signal.

NL models can offer an improvement in coding and synthesis, so there is room for speaker recognition improvement.


50

BANDWIDTH EXTENSION: An example of NL processing

A speech signal that has passed through the public switched telephone network (PSTN) generally has a limited frequency range, between 0.3 and 3.4 kHz.

Bandwidth extension algorithms aim at recovering the lost low (0–0.3 kHz) and/or high (3.4–8 kHz) frequency band given the narrow-band speech signal.


51

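SPECTRAL BAND REPLICATION

[Figure: spectra from the initial to the final signal, drawn on frequency axes marked 0, fs/8, fs/4 and fs/2, with a low-pass filter (LPF) stage; a further axis, f [kHz], is marked at 5 and 10]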


52

BANDWIDTH EXTENSION

Databases:
– Original fullband: [0.3, 7] kHz
– Narrow band: [0.3, 3.4] kHz
– Bandwidth extended: [0.3, 7] kHz

[Figure: block diagram with an LPF block and a Bandwidth extension block]
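A narrow-band version of a fullband signal can be simulated by discarding everything outside [0.3, 3.4] kHz. The sketch below assumes NumPy and uses an ideal (brick-wall) FFT mask purely for illustration; it is not the LPF used to build these databases. The 16 kHz sampling rate and the two-tone test signal are likewise assumptions.

```python
import numpy as np

def band_limit(x, fs, f_low, f_high):
    """Zero all FFT bins outside [f_low, f_high] Hz (ideal band-pass)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    mask = (freqs >= f_low) & (freqs <= f_high)
    return np.fft.irfft(spectrum * mask, n=len(x))

fs = 16000                                   # assumed sampling rate in Hz
t = np.arange(fs) / fs                       # one second of signal
in_band = np.sin(2 * np.pi * 1000 * t)       # 1 kHz tone: survives
out_of_band = np.sin(2 * np.pi * 5000 * t)   # 5 kHz tone: removed
fullband = in_band + out_of_band

narrowband = band_limit(fullband, fs, 300.0, 3400.0)
```

A bandwidth extension algorithm would then take `narrowband` as input and try to regenerate content such as the 5 kHz component, which a linear time-invariant filter cannot do because it cannot create frequencies that are absent from its input.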


53

MIC database: DCF for several MELCEPS-l

[Figure: DCF (axis from 0.02 to 0.09) versus l (8 to 26) with MELCEPS parameterization, for three bandwidths: [0, 8] kHz, [0.3, 3.4] kHz, and [0.3, 8] kHz BWext]
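DCF is taken here to be the detection cost function commonly used to score speaker verification systems: a weighted sum of the miss and false-alarm probabilities. A minimal sketch assuming NumPy. The cost weights and target prior below follow a common NIST-style setting and are assumptions, since the slides do not state the values behind the plot; the scores are made up for illustration.

```python
import numpy as np

def error_rates(target_scores, impostor_scores, threshold):
    """Miss rate: targets rejected. False-alarm rate: impostors accepted."""
    p_miss = float(np.mean(np.asarray(target_scores) < threshold))
    p_fa = float(np.mean(np.asarray(impostor_scores) >= threshold))
    return p_miss, p_fa

def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """DCF = C_miss * P_miss * P_target + C_fa * P_fa * (1 - P_target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Made-up verification scores (higher means "more likely the claimed speaker").
target = [2.1, 1.7, 0.4, 3.0]
impostor = [-1.0, 0.2, 0.9, -0.5, -2.0]

p_miss, p_fa = error_rates(target, impostor, threshold=0.5)
dcf = detection_cost(p_miss, p_fa)
```

Lower values are better, so in a plot of DCF against the number of coefficients the best configuration is the minimum of each curve.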


54

Bandwidth extension

For human beings, it is easier to recognize using full-band signals.

No new information is added.

Experimental results reveal that:
– The bandwidth extension algorithm does not introduce any damaging artifacts.
– With MELCEPS parameterization, the results are better than using the narrow band signal.