Macquarie RT05s Speaker Diarisation System

Steve Cassidy

Centre for Language Technology, Macquarie University

Sydney

2©19 Apr 2023 Macquarie University

System Goals

• Develop a simple end-to-end system for the SPKR task

• Platform for experimentation

• Improve on RT04s system


Overall Results

[Figure: bar chart, vertical axis 0 to 90; bars labelled AMI, AMI, CMU, CMU, ICSI, ICSI, NIST, NIST, VT, VT]

System Overview

Feature Extraction → SAD → Segmentation → Turn Clustering → Speaker ID

• Single Distant Microphone
• Implemented in C and Tcl
• Runs in around 6x real time on single AMD64
• Developed with RT04 devtest data
– No AMI or VT data seen before eval

Feature Extraction

• 26 coefficients:
– 12 MFCC
– RMS Energy
– Delta Coefficients

• 10ms frame rate, 25.6ms window

• Mean subtraction based on mean of first 60 seconds of file

• Uses the KTH Snack toolkit
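The mean subtraction and delta steps above can be sketched in a few lines. This is only an illustration in plain Python, not the Snack-based implementation the system uses: the function names are hypothetical, and the centred-difference delta formula is an assumption, since the slides do not say how the deltas are computed.

```python
def subtract_initial_mean(frames, frame_rate_hz=100, seconds=60.0):
    """Subtract the per-coefficient mean of the first `seconds` of the file.

    `frames` is a list of feature vectors; 100 frames/s matches a 10ms step.
    """
    n = min(len(frames), int(frame_rate_hz * seconds))
    if n == 0:
        return []
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames[:n]) / n for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]


def append_deltas(frames):
    """Append centred-difference deltas (an assumed formula) to every frame."""
    out = []
    last = len(frames) - 1
    for t, frame in enumerate(frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        delta = [(nxt[k] - prev[k]) / 2.0 for k in range(len(frame))]
        out.append(list(frame) + delta)
    return out
```

With 13 static coefficients per frame (12 MFCC plus RMS energy), `append_deltas` yields the 26-dimensional vectors listed above.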

Speech Activity Detection

• Goal: find obvious regions of non-speech for gross segmentation of recording

• GMMs for speech and non-speech
– Speech model: 32 mixtures
– Non-speech model: 8 mixtures

• Trained on RT04s devtest data set
– Reference labels generated from speaker labelling
– Ignored silence regions < 0.3s
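As a rough sketch of how two such models can be compared, the code below scores feature vectors against a speech GMM and a non-speech GMM and picks the more likely label. Diagonal covariances, the tuple layout of a model, and both function names are assumptions made for illustration; the slides do not describe the model internals or the training procedure.

```python
import math


def gmm_log_likelihood(frame, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    `gmm` is a list of (weight, means, variances) tuples, one per mixture.
    """
    component_logs = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for x, mu, var in zip(frame, means, variances):
            log_p += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        component_logs.append(log_p)
    top = max(component_logs)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(v - top) for v in component_logs))


def classify_frames(frames, speech_gmm, nonspeech_gmm):
    """Label a block of frames by comparing summed log-likelihoods."""
    speech = sum(gmm_log_likelihood(f, speech_gmm) for f in frames)
    nonspeech = sum(gmm_log_likelihood(f, nonspeech_gmm) for f in frames)
    return "speech" if speech > nonspeech else "nonspeech"
```

In this layout the speech model would hold 32 tuples and the non-speech model 8.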

Speech Activity Detection

• Evaluate frame classification error (%):

Dataset        NSPER   SPER
RT04s unseen   32      19
RT05s          47      15
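The slides do not define NSPER and SPER. The sketch below assumes they are the percentage of reference non-speech frames labelled as speech and the percentage of reference speech frames labelled as non-speech, respectively; the function name and label strings are hypothetical.

```python
def frame_error_rates(reference, hypothesis):
    """Return (NSPER, SPER) in percent from parallel per-frame label lists.

    Assumed definitions: NSPER is the share of reference non-speech frames
    that were misclassified, SPER the share of reference speech frames.
    """
    ns_total = ns_wrong = sp_total = sp_wrong = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref == "nonspeech":
            ns_total += 1
            ns_wrong += hyp != ref
        else:
            sp_total += 1
            sp_wrong += hyp != ref
    nsper = 100.0 * ns_wrong / ns_total if ns_total else 0.0
    sper = 100.0 * sp_wrong / sp_total if sp_total else 0.0
    return nsper, sper
```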

Speech Activity Detection

• SAD is performed by classifying successive windows of 10 frames using the GMM models

• Consecutive regions are merged and labelled

• Non-speech < 0.35s merged with following segment

• Speech < 0.15s merged with following non-speech
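The merging rules above can be sketched as follows, assuming each 10-frame window has already been labelled and spans 0.1s at the 10ms frame step. The function names are hypothetical, and the single left-to-right pass is one possible reading of the rules rather than the system's actual procedure.

```python
def merge_windows(labels):
    """Collapse runs of identical window labels into [label, start, end] regions.

    `start` and `end` are window indices, with `end` exclusive.
    """
    regions = []
    for i, label in enumerate(labels):
        if regions and regions[-1][0] == label:
            regions[-1][2] = i + 1
        else:
            regions.append([label, i, i + 1])
    return regions


def absorb_short_regions(regions, window_s=0.1, min_nonspeech=0.35, min_speech=0.15):
    """Give each too-short region the label of the region that follows it."""
    out = []
    for i, (label, start, end) in enumerate(regions):
        limit = min_nonspeech if label == "nonspeech" else min_speech
        if (end - start) * window_s < limit and i + 1 < len(regions):
            label = regions[i + 1][0]  # merge with the following segment
        if out and out[-1][0] == label:
            out[-1][2] = end  # extend the previous region with the same label
        else:
            out.append([label, start, end])
    return out
```

For example, a 0.2s pause between two stretches of speech is absorbed, leaving one continuous speech region.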

Speech Activity Detection

• Evaluation
– Frame classification error
– Boundaries missed (nothing within 0.5s)
– Boundaries inserted inside real segments

Meeting      Frame Error %      Boundary Error %    # Auto
             NSPER    SPER      Miss     FP
CMU 1415     89       7         91       77         45
ICSI 1100    99       4         85       88         99
NIST 0939    71       9         83       84         97
AMI 1206     43       18        25       79         348
VT 1430      100      0         99       50         2
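A boundary evaluation along these lines could be computed as below. The 0.5s tolerance comes from the slide; treating a false positive as an automatic boundary with no reference boundary within the same tolerance, and the choice of denominators, are assumptions, as is the function name.

```python
def boundary_errors(reference, automatic, tolerance=0.5):
    """Return (miss %, false-positive %) for two lists of boundary times (s).

    A reference boundary is missed when no automatic boundary lies within
    `tolerance`; an automatic boundary is counted as inserted when no
    reference boundary lies within `tolerance` (assumed definition).
    """
    def has_match(t, others):
        return any(abs(t - o) <= tolerance for o in others)

    missed = sum(1 for r in reference if not has_match(r, automatic))
    inserted = sum(1 for a in automatic if not has_match(a, reference))
    miss_pct = 100.0 * missed / len(reference) if reference else 0.0
    fp_pct = 100.0 * inserted / len(automatic) if automatic else 0.0
    return miss_pct, fp_pct
```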

Turn Segmentation

• Speech regions are segmented using the Bayesian Information Criterion (BIC)

• Compare the fit of a single Gaussian model of the sequence with a pair of models, one on each side of the break

• Fixed windows of 200 frames advanced over speech region

• Peaks in delta BIC curve indicate change points
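A minimal sketch of the delta BIC computation follows. It assumes diagonal-covariance Gaussians, a break at the centre of each 200-frame window, a penalty weight of 1.0 and a window step of 10 frames; none of these details are given in the slides, and the function names are hypothetical.

```python
import math


def log_det_diag(frames):
    """Log-determinant of the diagonal covariance estimated from `frames`."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for k in range(dim):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(max(var, 1e-6))  # floor avoids log(0)
    return total


def delta_bic(frames, split, penalty_weight=1.0):
    """Positive values favour two Gaussians (a change at `split`) over one."""
    left, right = frames[:split], frames[split:]
    n, dim = len(frames), len(frames[0])
    gain = 0.5 * (n * log_det_diag(frames)
                  - len(left) * log_det_diag(left)
                  - len(right) * log_det_diag(right))
    extra_params = 2 * dim  # the second model adds one mean and one variance vector
    return gain - 0.5 * penalty_weight * extra_params * math.log(n)


def delta_bic_curve(frames, window=200, step=10):
    """(centre frame, delta BIC) for each fixed window advanced over the region."""
    half = window // 2
    return [(start + half, delta_bic(frames[start:start + window], half))
            for start in range(0, len(frames) - window + 1, step)]
```

Change points would then be taken at the peaks of the curve returned by `delta_bic_curve`.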

Turn Segmentation

[Figure: bar chart of % Error (axis 0 to 100), FP and Miss, for CMU/98, ICSI/198, NIST/257, AMI/427 and VT/168]

Turn Clustering

• Given a set of speaker turns, find natural clusters

• Number of clusters unknown
• Requires:
– Distance metric on speaker turns
– Clustering algorithm
– Cluster evaluation metric


Speaker Similarity

• Mean + variance of feature vectors
• K-L distance metric
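One way to turn per-turn means and variances into a distance is the symmetric Kullback-Leibler divergence between two diagonal Gaussians, sketched below. Whether the system uses the symmetric form or diagonal covariances is not stated, so both are assumptions, and the function names are hypothetical.

```python
def turn_statistics(frames):
    """Per-dimension mean and variance of the feature vectors in one turn."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    variances = [max(sum((f[k] - means[k]) ** 2 for f in frames) / n, 1e-6)
                 for k in range(dim)]
    return means, variances


def symmetric_kl(stats_a, stats_b):
    """KL(a||b) + KL(b||a) for two diagonal Gaussians given as (means, variances)."""
    (mean_a, var_a), (mean_b, var_b) = stats_a, stats_b
    total = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        total += 0.5 * (va / vb + vb / va - 2.0
                        + (ma - mb) ** 2 * (1.0 / va + 1.0 / vb))
    return total
```

The log-variance terms of the two one-sided divergences cancel in the sum, which is why no logarithm appears.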

Turn Clustering

• Implementation:
– Select segments longer than 1.5s for clustering
– KL distance on mean/variance of features
– Hierarchical clustering
– Select labellings for 2, 3…N speakers
– Cluster evaluation performed after speaker ID
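The clustering step can be illustrated with a small agglomerative procedure that records one labelling for each candidate speaker count. Average linkage is an assumption (the slides say only "hierarchical clustering"), and the function name and return layout are hypothetical.

```python
def hierarchical_labellings(distance, max_speakers):
    """Agglomerative clustering over a symmetric distance matrix.

    Returns {k: labels} for k = 2..max_speakers, where labels[i] is the
    cluster index of turn i at the point where k clusters remain.
    """
    clusters = [[i] for i in range(len(distance))]
    labellings = {}

    def record():
        k = len(clusters)
        if 2 <= k <= max_speakers:
            labels = [0] * len(distance)
            for index, members in enumerate(clusters):
                for m in members:
                    labels[m] = index
            labellings[k] = labels

    def linkage(a, b):
        # Average pairwise distance between two clusters (assumed linkage).
        return sum(distance[i][j] for i in a for j in b) / (len(a) * len(b))

    record()
    while len(clusters) > 2:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)  # merge the closest pair of clusters
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        record()
    return labellings
```

Each recorded labelling can then be passed on to the speaker ID stage, with the choice among them deferred until after that stage.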

Speaker ID

• Use cluster labelled turns to train speaker models
– 32 mixture GMM

• Now classify and re-label all speaker turns

• Potentially correct poor clustering decisions

• Very small amounts of data to support models
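The train-then-relabel loop can be sketched as below. To keep the example short, a single diagonal Gaussian per speaker stands in for the 32 mixture GMM; that substitution, the variance floor and all function names are assumptions made for illustration only.

```python
import math


def fit_gaussian(frames):
    """Fit one diagonal Gaussian (a stand-in for the 32 mixture GMM)."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    variances = [max(sum((f[k] - means[k]) ** 2 for f in frames) / n, 1e-3)
                 for k in range(dim)]
    return means, variances


def log_likelihood(frames, model):
    """Total log-likelihood of a turn's frames under one speaker model."""
    means, variances = model
    total = 0.0
    for frame in frames:
        for x, mu, var in zip(frame, means, variances):
            total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total


def relabel_turns(turns, labels):
    """Train one model per cluster label, then reassign every turn."""
    pooled = {}
    for frames, label in zip(turns, labels):
        pooled.setdefault(label, []).extend(frames)
    models = {label: fit_gaussian(frames) for label, frames in pooled.items()}
    return [max(models, key=lambda lab: log_likelihood(frames, models[lab]))
            for frames in turns]
```

A turn that clustering placed with the wrong speaker can move to a better-matching model in the relabelling pass, which is the correction effect noted above.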


Overall Results

[Figure: bar chart, vertical axis 0 to 90; bars labelled AMI, AMI, CMU, CMU, ICSI, ICSI, NIST, NIST, VT, VT]


What Didn’t Work

• Inter-channel phase and level differences

• Exemplar speaker models
• SVD based turn clustering
– Find similar groups by factoring the distance matrix
– One product of SVD is a number of clusters