SecurePhone Workshop - 24/25 June 2004

Speaking Faces Verification

Kevin McTait, Raphaël Blouet, Gérard Chollet, Silvia Colón, Guido Aversano



Outline

- Speaking faces verification problem

- State of the art in speaking faces verification

- Choice of system architecture

- Fusion of audio and visual modalities

- Initial results using BANCA database (Becars: voice only system)


Problem definition

- Detection and tracking of lips in the video sequence:
  - Locate head/face in image frame
  - Locate mouth/lips area (Region of Interest)
  - Determine/calculate lip contour coordinates and intensity parameters (visual feature extraction)
  - Other parameters: visible teeth, tongue, jaw movement, eyebrows, cheeks etc.
- Modelling parameters:
  - Model deformation of lip (or other) parameters over time: HMMs, GMMs…
  - Fusion of visual and acoustic parameters/models
- Calculate likelihood of model relative to client/world model in order to accept/reject
- Augment in-house speaker verification system (Becars) with visual parameters
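
The accept/reject step can be illustrated with a log-likelihood ratio between a client model and a world model. This is a minimal sketch, not the Becars implementation: it assumes each model is a single diagonal-covariance Gaussian (a real system would typically use GMMs or HMMs), and the function names, toy model values and threshold are hypothetical.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def llr_score(frames, client, world):
    """Average per-frame log-likelihood ratio: client model vs world model."""
    total = 0.0
    for x in frames:
        total += diag_gauss_loglik(x, *client) - diag_gauss_loglik(x, *world)
    return total / len(frames)

def verify(frames, client, world, threshold=0.0):
    """Accept the claimed identity when the score exceeds the threshold."""
    return llr_score(frames, client, world) > threshold

# Toy models as (mean, variance) per dimension; values are illustrative only.
client_model = ([1.0, 2.0], [0.5, 0.5])
world_model = ([0.0, 0.0], [2.0, 2.0])

genuine = [[1.1, 1.9], [0.9, 2.2], [1.0, 2.0]]
impostor = [[-0.5, 0.1], [0.2, -0.3], [-1.0, 0.4]]

print(verify(genuine, client_model, world_model))   # True
print(verify(impostor, client_model, world_model))  # False
```

Normalising by the number of frames keeps the threshold comparable across utterances of different lengths.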


Limitations

- Limited device (storage and CPU processing power)
- Subject variability (aging, beard, glasses…), pose, illumination
- Low complexity algorithms:
  - Subspace transforms, learning methods
  - Image based approaches, hue colouration/chromaticity clues
  - Model based approaches


Active Shape Models

- Identification: based on spatio-temporal analysis of the video sequence
- Person represented by a deformable parametric model of the visible speech articulators (usually lips) with their temporal characteristics
- Active Shape Model consists of shape parameters (lip contours) and greyscale/colour intensity (for illumination)
- Model trained on a training set using PCA to recover the principal modes of deformation of the model
- Model used to track lips over time; model parameters recovered from lip tracking results
- Shape and intensity modelled by GMMs, temporal dependencies (state transition probabilities) by HMMs
- Verification: using a Viterbi algorithm, if the estimated likelihood of the client's model generating the observed sequence of features is above a threshold, then accept, else reject
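
The verification step can be sketched as a Viterbi best-path score compared against a threshold. The code below is an illustration under stated assumptions, not the authors' system: it uses a hypothetical two-state left-to-right HMM over a one-dimensional lip feature, with a single Gaussian per state instead of a GMM, and the threshold value is arbitrary.

```python
import math

NEG_INF = float("-inf")

def viterbi_log_score(obs_loglik, log_init, log_trans):
    """Best-path log-likelihood of an observation sequence under an HMM.

    obs_loglik[t][j]: log emission probability of frame t in state j
    log_init[j]:      log initial probability of state j
    log_trans[i][j]:  log transition probability from state i to state j
    """
    n_states = len(log_init)
    delta = [log_init[j] + obs_loglik[0][j] for j in range(n_states)]
    for t in range(1, len(obs_loglik)):
        delta = [
            max(delta[i] + log_trans[i][j] for i in range(n_states))
            + obs_loglik[t][j]
            for j in range(n_states)
        ]
    return max(delta)

def gauss_logpdf(x, mean, var):
    """Log-density of a scalar under a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical client model: state 0 then state 1, one Gaussian per state.
means, variances = [0.0, 3.0], [1.0, 1.0]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]

def score(sequence):
    """Per-frame Viterbi log-likelihood of a feature sequence."""
    obs = [[gauss_logpdf(x, m, v) for m, v in zip(means, variances)]
           for x in sequence]
    return viterbi_log_score(obs, log_init, log_trans) / len(sequence)

threshold = -2.0  # illustrative value, would be tuned on development data
print(score([0.1, -0.2, 2.9, 3.1]) > threshold)  # True: matches the model
print(score([3.0, 3.0, 0.0, 0.0]) > threshold)   # False: wrong temporal order
```

The second sequence contains the same kind of values as the first but in the wrong order, so only the temporal (HMM) part of the model rejects it.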


Active Shape Models

- Robust detection, tracking & parameterisation of visual features
- Statistical, avoids use of constraints, thresholds, penalties
- Model only allowed to deform to shapes similar to those seen in training set (trained using PCA)
- Represent object by set of labelled points representing contours, height, width, area etc.
- Model consists of 5 Bézier curves (B-spline functions), each defined by two end points P0 and P2 and one control point P1:

P(t) = θ0(t)P0 + θ1(t)P1 + θ2(t)P2

[Figures: point distribution model; shape approximation]
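
To make the curve equation concrete, here is a small sketch that evaluates one such curve. The slide does not define the blending functions θ0, θ1, θ2; the code assumes they are the quadratic Bernstein polynomials, which is the usual choice for a curve with two end points and one control point. The lip coordinates are made up for illustration.

```python
def quadratic_bezier(p0, p1, p2, t):
    """Point on a quadratic Bézier curve at parameter t in [0, 1].

    Assumes Bernstein blending functions:
    theta0 = (1 - t)^2, theta1 = 2 t (1 - t), theta2 = t^2.
    """
    b0 = (1.0 - t) ** 2        # weight of end point P0
    b1 = 2.0 * t * (1.0 - t)   # weight of control point P1
    b2 = t ** 2                # weight of end point P2
    return (
        b0 * p0[0] + b1 * p1[0] + b2 * p2[0],
        b0 * p0[1] + b1 * p1[1] + b2 * p2[1],
    )

def sample_curve(p0, p1, p2, n=5):
    """n evenly spaced samples along the curve, end points included."""
    return [quadratic_bezier(p0, p1, p2, i / (n - 1)) for i in range(n)]

# Hypothetical upper-lip segment: mouth corners at (0, 0) and (4, 0),
# control point pulled upwards to (2, 2).
upper_lip = sample_curve((0.0, 0.0), (2.0, 2.0), (4.0, 0.0), n=5)
print(upper_lip[0])  # (0.0, 0.0), the first end point
print(upper_lip[2])  # (2.0, 1.0), the mid-point of the curve
print(upper_lip[4])  # (4.0, 0.0), the second end point
```

The curve passes through both end points but only approaches the control point, so moving P1 changes the bulge of the lip contour without moving the mouth corners.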


Spatio-temporal model

- Visual observation of speaker: O = o1, o2, …, oT
- Assumption: feature vectors follow a normal distribution as in the acoustic domain, modelled by GMMs
- Assumption: temporal changes are piece-wise stationary and follow a first order Markov process
- Each state in the HMM represents several consecutive feature vectors
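
As an illustration of the GMM assumption, the sketch below computes the log-likelihood of a visual feature vector under a Gaussian mixture. It assumes diagonal covariances, which the slides do not specify, and all names and parameter values are hypothetical.

```python
import math

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    component_ll = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        component_ll.append(ll)
    # log-sum-exp over components for numerical stability
    top = max(component_ll)
    return top + math.log(sum(math.exp(c - top) for c in component_ll))

def segment_loglik(frames, weights, means, variances):
    """Log-likelihood of consecutive frames assigned to one HMM state.

    Frames are treated as independent given the state, which is the
    piece-wise stationarity assumption stated above.
    """
    return sum(gmm_loglik(x, weights, means, variances) for x in frames)

# Toy two-component mixture over 2-D features (illustrative values only).
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

near = gmm_loglik([0.1, -0.1], weights, means, variances)
far = gmm_loglik([10.0, 10.0], weights, means, variances)
print(near > far)  # True: a vector near a component mean is more likely
```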


ASM: Training


ASM: Tracking


ASM: Lip Tracking Examples


Image Based Approach

- Hue and saturation levels to find lip region (ROI)
- Eliminate outliers (red blobs) by constraints (geometric, gradient, saturation)
- Motion constraints: difference image (1d), the pixelwise absolute difference between two adjacent frames
- a) greyscale image
- b) hue image
- c) binary hue/saturation thresholding
- d) accumulated difference image
- e) binary image after thresholding
- f) combined binary image: c AND e
- Find largest connected region
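
The stages listed above can be sketched on tiny synthetic frames. This is a rough illustration, not the authors' implementation: images are plain nested lists, the hue and saturation thresholds are guesses, and 4-connectivity is assumed for the region search.

```python
import colorsys
from collections import deque

def hue_sat_mask(frame, hue_band=0.05, sat_min=0.4):
    """Binary mask of reddish, saturated pixels; frame holds (r, g, b) in 0..1."""
    mask = []
    for row in frame:
        out = []
        for r, g, b in row:
            h, s, _ = colorsys.rgb_to_hsv(r, g, b)
            reddish = h <= hue_band or h >= 1.0 - hue_band  # hue wraps around
            out.append(1 if reddish and s >= sat_min else 0)
        mask.append(out)
    return mask

def accumulated_difference(grey_frames):
    """Sum of pixelwise absolute differences between adjacent greyscale frames."""
    h, w = len(grey_frames[0]), len(grey_frames[0][0])
    acc = [[0.0] * w for _ in range(h)]
    for prev, cur in zip(grey_frames, grey_frames[1:]):
        for y in range(h):
            for x in range(w):
                acc[y][x] += abs(cur[y][x] - prev[y][x])
    return acc

def threshold(image, level):
    """Binary image: 1 where the value exceeds level, else 0."""
    return [[1 if v > level else 0 for v in row] for row in image]

def logical_and(a, b):
    """Pixelwise AND of two binary images of equal size."""
    return [[p & q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def largest_connected_region(mask):
    """Pixels (y, x) of the largest 4-connected region of ones."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            region, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best

SKIN, LIP = (0.8, 0.6, 0.5), (0.8, 0.2, 0.2)
colour = [[SKIN] * 4, [LIP, SKIN, LIP, LIP], [SKIN] * 4]  # (1, 0) is a static red blob
grey = [
    [[0.5] * 4, [0.5, 0.5, 0.5, 0.5], [0.5] * 4],
    [[0.5] * 4, [0.5, 0.5, 0.9, 0.8], [0.5] * 4],  # only the lips move
]
colour_mask = hue_sat_mask(colour)                           # stage c
motion_mask = threshold(accumulated_difference(grey), 0.1)   # stages d, e
lips = largest_connected_region(logical_and(colour_mask, motion_mask))  # stage f
print(sorted(lips))  # [(1, 2), (1, 3)]
```

The red blob at (1, 0) passes the colour test but does not move, so the AND with the motion mask removes it before the region search.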


Image Based Approach (2)

- Derive lip dimensions using colour and edge information
- Markov random field framework to combine the two sources of information and segment lips from background
- Implementation close to completion


Other Approaches

- Deformable template/model/contour based:
  - Geometric shapes, shape models, eigenvectors, appearance models; deform in order to minimise an energy/distance function relating template parameters and image; template matching (correlation), best fit template, active shape models, active appearance models; model fitting problem
- Learning based approach:
  - MLP, SVMs…
- Knowledge based approach:
  - Subject rules or information to find and extract features, eye/nose detection symmetry
- Visual motion analysis:
  - Motion analysis techniques, motion cues, difference images after thresholding and filtering
  - Optical flow, filter tracking (computationally expensive)
- Hue and saturation thresholding:
  - Intensity of ruddy areas, problem of removing outliers
- Image subspace transforms:
  - DCT, PCA, Discrete Wavelet, KLT (DWT + PCA analysis of ROI), FFT


Fusion of audio-visual information

- Instance of general classifier problem (bimodal classifier)
- 2 observation streams: audio + video, providing information about hidden class labels
- Typically each observation stream is used to train a single modality classifier
- Aim: combine both streams to produce a bimodal classifier that recognises pertinent classes with a higher level of accuracy
- 2 general types/levels of fusion:
  - Feature fusion
  - Decision fusion


Feature Fusion

- Feature fusion: HMM classifier, concatenated feature vector of audio and visual parameters (time synchronous features, possibly including upsampling)
- Generation process of feature vector
- Using single stream HMM with emission (class conditional observation) probabilities given by a Gaussian distribution
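
Building the concatenated vector can be sketched as follows. This is an assumption-laden illustration: the visual stream is upsampled to the audio frame count by linear interpolation (the slides mention upsampling but not the method), and the feature values and dimensions are invented.

```python
def upsample_linear(frames, n_out):
    """Linearly interpolate a sequence of feature vectors to n_out frames."""
    n_in = len(frames)
    if n_in == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for k in range(n_out):
        pos = k * (n_in - 1) / (n_out - 1)  # fractional index into the input
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[lo], frames[hi])])
    return out

def fuse_features(audio, visual):
    """Concatenate time-aligned audio and visual vectors frame by frame."""
    visual_up = upsample_linear(visual, len(audio))
    return [list(a) + list(v) for a, v in zip(audio, visual_up)]

# Toy streams: 5 audio frames (2-D) against 3 video frames (1-D).
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
visual = [[0.0], [1.0], [2.0]]

fused = fuse_features(audio, visual)
print(fused[1])    # [0.3, 0.4, 0.5]: second audio frame plus interpolated lip value
print(len(fused))  # 5: one fused vector per audio frame
```

Each fused vector would then be scored by the single-stream HMM exactly as an audio-only vector would be, only with a higher dimension.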


Decision Fusion

- State synchronous decision fusion
- Captures reliability of each stream
- HMM state level: combine single modality HMM classifier outputs
- Class conditional log-likelihoods from the 2 classifiers linearly combined with appropriate weights
- Various levels: state (phone, syllable, word…)
- Multi-stream HMM classifier, state emission probabilities
- Product HMMs, factorial HMMs…
- Other classifiers (SVMs, Bayesian classifiers, MLP…)
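
The weighted linear combination of log-likelihoods can be sketched in a few lines. The constraint that the two weights sum to one is a common convention assumed here, not something the slides state, and the scores, weights and threshold are illustrative.

```python
def fuse_log_likelihoods(ll_audio, ll_video, audio_weight):
    """Weighted linear combination of the two streams' log-likelihoods.

    audio_weight reflects how much the audio stream is trusted;
    the video stream receives the remaining weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * ll_audio + (1.0 - audio_weight) * ll_video

def fused_decision(ll_audio, ll_video, audio_weight=0.7, threshold=0.0):
    """Accept when the fused score exceeds the threshold."""
    return fuse_log_likelihoods(ll_audio, ll_video, audio_weight) > threshold

# Toy log-likelihood ratios: audio favours the client, video slightly does not.
audio_llr, video_llr = 1.2, -0.4

print(fused_decision(audio_llr, video_llr, audio_weight=0.7))  # True
print(fused_decision(audio_llr, video_llr, audio_weight=0.2))  # False
```

Lowering the audio weight, for example in noisy acoustic conditions, lets the video stream dominate the decision, which is how the weights capture the reliability of each stream.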


BANCA: results