
Eötvös Loránd University

Faculty of Arts

DOCTORAL DISSERTATION – THESIS

Doctoral School of Linguistics

Professor Dr. Vilmos Bárdosi CSc

Applied Linguistics

Professor Dr. Mária Gósy DSc

AUTOMATIC SPEAKER DIARIZATION IN HUNGARIAN SPONTANEOUS CONVERSATIONS

by

ANDRÁS BEKE

Supervisor

Professor Dr. Mária Gósy DSc

Budapest, 2014


1. INTRODUCTION

In the Conversation Analysis framework, conversation is built up in a structured way (Garfinkel 1967; Goffman 1983; Schegloff 1992; Sacks et al. 1974; Sacks 1992; Iványi 2001; Stokoe 2006). According to this theory, if conversation is systematic, it can be modeled automatically. Therefore, the examination of

conversation is very important not only for linguistics but for speech technology as

well. In human-machine communication several processes have been modeled by applying

speech technology, such as speech decoding (speech recognition), speech production

(speech synthesis) or speaker identification based on voice (speaker recognition)

(Németh–Olaszy 2010). These processes are linked together in conversation, where the operations of speech decoding and speech production are circularly interleaved. This circulation is caused by speaker changes. Automatic detection of speaker changes

is therefore very important. Speaker diarization is the process of partitioning an input

audio stream into homogeneous segments according to the speaker's identity. It can

enhance the readability of an automatic speech transcription by structuring the audio

stream into speaker turns and, when used together with speaker recognition systems,

by providing the speaker’s true identity. It is used to answer the question "who spoke

when?" (Jin et al. 2004). Speaker diarization is a combination of speaker segmentation and speaker clustering.

The first aims at finding speaker change points in an audio stream. The second aims at

grouping together speech segments on the basis of speaker characteristics (Jin et al.

2004; Kotti et al. 2008). To improve speaker diarization, voice activity detection and overlapping speech detection can be used as well. In the literature, extensive research on speaker diarization has been described, but principally

for English (Tritschler–Gopinath 1999; Sivakumaran–Fortuna–Ariyaeeinia 2001; Lu–

Zhang 2002a; Cettolo–Vescovi 2003; Shih-Sian Cheng et al. 2010; Vescovi–Cettolo–

Rizzi 2003). However, for the Hungarian language, no work is known which

addresses the field of speaker diarization. The aim of this PhD thesis is to develop a

speaker diarization system for the Hungarian language. The main focus is to create

algorithms for speaker diarization (speaker segmentation, speaker clustering,

overlapping speech detection) and to implement and enhance some already existing

algorithms in speaker diarization (voice activity detection, speaker recognition),

focusing on Hungarian conversations. In addition, we introduce the main applications implementing speaker diarization on the one hand, and we present the scientific areas connected to speaker diarization on the other hand. The aim of this thesis is to build a speaker diarization system which can automatically

mark speaker changes based on acoustic information in Hungarian spontaneous

conversation. The approach is mainly based on unsupervised methods.


2. ORGANIZATION OF THIS THESIS

The dissertation contains 11 chapters. The first chapter provides an introduction,

where the theoretical and practical background of the introduced scientific areas is

presented focusing on speaker diarization. Spontaneous speech will be described by

means of analyzing both the speech production and speech perception processes. In

this chapter we present the main theory of conversation analysis, discourse analysis

and speech accommodation. The building blocks of conversations, the turns, and their possible markers (discourse markers) are also presented in the first chapter. The last part of the first chapter addresses overlapping speech in spontaneous

conversations. Chapter 2 reviews some approaches previously used in speaker diarization.

Chapter 3 describes the goals of the thesis, the research questions and the hypotheses. The material, the subjects of the research and the evaluation of speaker diarization (DER: Diarization Error Rate) are presented in Chapter 4.

The main algorithm for the speaker diarization system and results are presented in

Chapter 5. In this Chapter the development of our two-pass speaker diarization system

is explored. The preprocessing step relies on VAD (voice activity detection) and

proposes modifications in the algorithm, which are presented in Chapter 5. Chapter 5

describes the examination of speaker specific acoustic features for speaker

recognition. The overlapping speech detection algorithm is also described in Chapter 5. Finally, Chapter 6 concludes this thesis with a summary of its contributions (Chapter 7). Chapter 8 provides an outlook on possible directions of future work in the field. The theses of this dissertation are presented in Chapter 9. Chapter 10 lists the references. The last chapter (11) presents the abbreviations.

3. MATERIAL AND SUBJECTS

In this thesis, 100 spontaneous conversations (total duration: 55 hours) were

selected from the BEA database (Gósy 2012), recorded in a laboratory environment.

In each case, three persons were involved in the conversations. Two of them were

permanent (2 females, average age 32 years old). The third subject (the interviewee) was

one out of the 43 male and 67 female (average age 35 years old) speakers. The speech data were collected using single-channel recording at 44 kHz sampling frequency, linearly encoded at 16 bits per sample. For final processing, the recordings were resampled to 16 kHz. The manual annotation includes the following levels: (i) Silence: all silent parts longer than 100 ms were marked. Naturally, silences caused by articulation (such as VOT) were not marked even if longer than this threshold.


(ii) Speaker change point: in the continuous signal we manually marked the transition points where speakers changed. Backchannels were not marked as turns; only actual speaker changes were.

(iii) Overlapping speech: regions in the audio file were marked where more than one speaker spoke at the same time. If such a region was not longer than 50 ms, it was not marked, because such a short speech interval is regarded as not automatically detectable.

Methods employed in particular tasks are described at the beginning of the chapter,

before the presentation of the results.

4. RESULTS

4.1 Speech detection

Voice activity detection (also known as speech activity detection or speech

detection) plays an important role in speaker diarization because this algorithm can

improve the result of speaker diarization. The VAD is a technique used in speech

processing in which the presence or absence of human speech is detected. As the aim of this thesis was not to develop a new VAD algorithm, we chose and modified a very simple and fast pause segmentation algorithm created by Giannakopoulos and implemented in MATLAB. To segment speech and non-speech

regions in a waveform, the short-term energy and spectral centroid features were

extracted and used to make a decision using an adaptive threshold. Our proposed

method differs from the original algorithm in that it defines the two centers of the feature distributions (one for speech, the other for non-speech) using the k-means unsupervised learning method (Ying et al. 2011). The aim of this task was to test whether our modified VAD yields better results. For evaluation, the VAD detection error rate (DER) was used. To compare the results given by the basic and the modified method, the Wilcoxon test completed by Monte Carlo simulation was applied.
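For illustration, the following is a minimal Python sketch of this modified VAD; the scikit-learn KMeans implementation stands in for the k-means step, and the input file name, frame length and hop size are illustrative assumptions rather than the settings used in the thesis.

```python
import numpy as np
from scipy.io import wavfile
from sklearn.cluster import KMeans

def frame_features(signal, sr, frame_len=0.025, hop=0.010):
    """Short-term energy and spectral centroid for each analysis frame."""
    n, h = int(frame_len * sr), int(hop * sr)
    feats = []
    for start in range(0, len(signal) - n, h):
        frame = signal[start:start + n].astype(float)
        energy = np.sum(frame ** 2) / n
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(n, 1.0 / sr)
        centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-10)
        feats.append((energy, centroid))
    return np.array(feats)

def vad_kmeans(feats):
    """For each feature, find the speech and non-speech centres with 2-class
    k-means and threshold at their midpoint; a frame counts as speech when
    both features exceed their thresholds."""
    decisions = []
    for d in range(feats.shape[1]):
        x = feats[:, d].reshape(-1, 1)
        centres = np.sort(KMeans(n_clusters=2, n_init=10).fit(x).cluster_centers_.ravel())
        decisions.append(x.ravel() > centres.mean())
    return np.logical_and(*decisions)

sr, signal = wavfile.read("conversation.wav")       # hypothetical mono recording
speech_mask = vad_kmeans(frame_features(signal, sr))
```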

The corpus contains 49 hours of speech and 6 hours of silence in total; 10.9% of the total duration of the corpus is silence. Five hours of spontaneous conversation were used to develop our VAD algorithm (to define the optimal thresholds), and 39 hours of conversation were used to evaluate the system.

In the first experiment we focused on the impact of the window size used to apply the threshold to the audio file; this window length ranged between 25 ms and 250 ms. At the same time we tested whether our modified algorithm performs better than the basic method. Using the VAD algorithm with a 250 ms long window gave a 9.51% DER. Using the k-means clustering method in our VAD algorithm, the result was further improved, although this improvement was not significant.

We studied the effects of additive white noise on speech signals for voice activity

detection. Various levels of white noise were added to the speech signal so that the signal-to-noise ratio (SNR) of the recordings decreased in 5 dB steps. The original SNR value of the recordings was 25 dB. The results showed that with


increasing noise the DER value increased; for speech at 10 dB SNR the DER value was 34.72%.
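A simple way to produce such degraded versions is sketched below, assuming the `signal` array from the VAD sketch above; the exact noise-mixing procedure used in the thesis may differ.

```python
import numpy as np

def add_white_noise(signal, snr_db):
    """Return a copy of the signal with additive white Gaussian noise at the
    requested signal-to-noise ratio (in dB)."""
    signal = signal.astype(float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# degrade the recording in 5 dB steps, as in the experiment above
noisy_versions = {snr: add_white_noise(signal, snr) for snr in range(20, 5, -5)}
```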

We proposed a VAD algorithm based on the k-means unsupervised learning method to segment the audio signal into speech and non-speech segments. We demonstrated that it can be useful for robust speech detection in spontaneous conversations. The proposed VAD was characterized by 90.49% accuracy in good quality recordings.

4.2 Overlapping speech detection

The ratio of overlapping speech is high in spontaneous conversations (Gráczi–Bata

2010). Beattie analyzed such conversations and showed that 13% of the speech was

overlapping in English conversations, but this ratio increased when more than two

participants were involved in the conversations (31% of the total duration of

conversations) (1983, referring to Levelt 1989). According to Cetin and Shriberg (2006) 10-13% of total duration of conversations

is overlapping speech. In Hungarian literature Markó (2006) described that 6%

overlapping speech occurred in four-participant Hungarian conversations. Bata (2009)

claimed that this ratio is 1.7-3% in two-participant Hungarian spontaneous speech.

Several studies showed that overlapping speech was responsible for most of the errors

in speaker diarization. According to Wooters and Huijberts (2007), 17% of the DER is

false rejection caused by overlapping speech in speaker diarization. Some studies have

been reported on the effects of overlapping speech in meetings (e.g., Boakye et al. 2008a,b,c; Boakye 2008; Boakye 2011; Trueba-Hornero 2008), but work on systems for identifying overlapped speech and mitigating its effects in speaker diarization appears to be largely absent from the literature. As overlapped speech is now a major obstacle in

improving the performance of speaker diarization systems, efforts in overlap detection

are of increasing interest and importance.

In this Chapter we presented the measures by which the overlapping speech segments in Hungarian spontaneous speech were automatically classified. To detect overlapping speech, we used an ANN/SVM system (Artificial Neural Network and Support Vector Machine). In the first step, four acoustic features were extracted from speech: i) FFT spectrum (SP); ii) mel-frequency cepstral coefficients (MFCC); iii) log mel-scale filterbank coefficients (MSFC); iv) subband energy (SBE). In order to gain a better representation of the features we used restricted Boltzmann machines (RBM).

Restricted Boltzmann machines have been used successfully to extract features from raw data. They are especially useful when they are stacked to form a Deep Belief Network (DBN), which is able to extract feature hierarchies. The output of the DBN was the input to the Support Vector Machine (LS-SVM: Least Squares SVM with an RBF kernel function), which carried out the classification. We used three restricted Boltzmann machines in the DBN (hidden layers H1, H2, H3).

The corpus contained 8,056 time intervals where more than one speaker talked at the

same time. This means that the corpus contained 7 hours of overlapping speech,

representing 12% of the total duration of the corpus. To train the SVM, 5,371 training samples were used, and to test it, 2,386 testing samples were used.
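As an illustrative sketch of stacked-RBM feature extraction followed by SVM classification, the scikit-learn pipeline below uses BernoulliRBM layers in place of the DBN and an RBF-kernel SVC in place of the LS-SVM; the file names, the sizes of H1 and H2 and the training parameters are assumptions (only the 500-unit H3 reflects the result reported in this Chapter).

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# X: frame-level acoustic features (e.g., MSFC), y: 1 = overlap, 0 = single speaker
X_train, y_train = np.load("msfc_train.npy"), np.load("labels_train.npy")  # hypothetical files
X_test, y_test = np.load("msfc_test.npy"), np.load("labels_test.npy")

# Three stacked RBMs (H1, H2, H3) as unsupervised feature extractors,
# followed by an RBF-kernel SVM that makes the overlap/non-overlap decision.
model = Pipeline([
    ("scale", MinMaxScaler()),                 # RBMs expect inputs in [0, 1]
    ("h1", BernoulliRBM(n_components=1000, learning_rate=0.05, n_iter=20)),
    ("h2", BernoulliRBM(n_components=1000, learning_rate=0.05, n_iter=20)),
    ("h3", BernoulliRBM(n_components=500, learning_rate=0.05, n_iter=20)),
    ("svm", SVC(kernel="rbf", C=1.0, gamma="scale")),
])

model.fit(X_train, y_train)
print("overlap detection accuracy:", model.score(X_test, y_test))
```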


The results showed that of the four acoustic features, the MSFC gave the lowest EER (Equal Error Rate). In this case, the average EER was 47.49%. The second-best feature was the MFCC, with an average EER of 50.84%. Thus MSFC gave better results than MFCC, an improvement of 3.35 percentage points. This improvement is statistically confirmed (Wilcoxon test: Z = -2.211; p = 0.023).

We studied the effect of the number of units in the third hidden layer of the DBN. The best results were obtained using 500 neurons in the third layer of the DBN, using the MSFC acoustic features (Figure 1).

Figure 1
The EER value depending on the number of neurons in H3

The statistical analysis confirmed that MSFC features work better than the other three features independently of the number of neurons in H3 (MSFC-MFCC: Z = -2.201, p = 0.028; MSFC-SP: Z = -2.201, p = 0.028; MSFC-SBE: Z = -2.201, p = 0.028). In general, the results show that with 500 to 800 neurons the results are worse than

under 500 neurons.

We analyzed the types of errors. First of all, most of the errors stemmed from the manual annotation. In other cases we found that backchannels and laughter caused a lot of errors as well.

This Chapter reports an investigation into the use of a Deep Belief Network and

Support Vector Machine for the classification of overlapping speech in spontaneous

speech. The DBN/SVM approach gives overlap detection results comparable to those published in the literature on overlap detection systems.

4.3 Speaker recognition for speaker diarization

From a general point of view, speaker recognition algorithms are a very useful part of

many speech technology applications, for example: speaker indexing and rich

transcription, and automatic speech recognition. The result of speaker recognition can be used in modules of speaker-based algorithms such as speaker diarization (Campbell 1997).


In the Hungarian literature there are several studies which examined the possibility of speaker identification. Still, only a few works exist on automatic speaker

recognition in Hungarian speech (Fék 1997). The aim of this Chapter was to examine which spectral region is speaker specific,

and to build an automatic speaker recognition system based on the GMM and GMM-UBM methods. The system is built around the likelihood ratio test for verification, using simple but effective GMMs as likelihood functions and a universal background model (UBM) for normalizing the scores (Higgins et al. 1991; Rosenberg et al. 1992; Reynolds 1995; Matsui–Furui 1995; Reynolds 1997). Each speaker was modelled by GMMs with various numbers of mixtures. A 25 s long interval was used from each of the 80 speakers

to train the model and 13 s to test it. For UBM training we selected 20 speakers (10 male, 10 female) from the corpus

and modelled them with various numbers of mixtures (2-512 mixtures).

Spectral information is used for the classification; it is calculated over three different subbands, each represented by three MFCCs. The front-end processing is done in four different ways: i) MFC for the spectral full-band; ii) MFC for the spectral subband between 1.5 and 2.5 kHz; iii) MFC for the spectral subband between 2.5 and 3.5 kHz; iv) MFC for the spectral subband between 3.5 and 4.5 kHz. To compute subband

MFCC, we employed HCopy implemented in MATLAB: it bandlimits the signal

between i) 1.5, ii) 2.5, iii) 3.5 kHz to i) 2.5, ii) 3.5, iii) 4.5 kHz, and distributes the

four filterbank channels equally on the mel scale such that the lower cutoff of the first

filter is at i) 1.5, ii) 2.5, iii) 3.5 kHz and the upper cutoff of the fourth filter is at i) 2.5,

ii) 3.5, iii) 4.5 kHz. Three cepstral coefficients are then calculated from the four values

using the Discrete Cosine Transform (DCT).
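The following is a small, self-contained Python sketch of this subband front-end for the 2.5–3.5 kHz case; it is an illustrative reimplementation (the thesis used HCopy-style processing in MATLAB), and windowing, pre-emphasis and frame handling are omitted.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def subband_mfcc(frame, sr, lo=2500.0, hi=3500.0, n_filters=4, n_ceps=3):
    """Three cepstral coefficients from four mel filters placed between lo and hi Hz."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # filter edges equally spaced on the mel scale between the band limits
    edges = mel_to_hz(np.linspace(hz_to_mel(lo), hz_to_mel(hi), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        weights = np.clip(np.minimum((freqs - left) / (centre - left),
                                     (right - freqs) / (right - centre)), 0.0, None)
        energies[i] = np.log(np.dot(weights, spectrum) + 1e-10)
    return dct(energies, norm="ortho")[:n_ceps]   # DCT of the 4 log filterbank energies
```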

To measure the performance of the speaker recognition, the accuracy (true positive rate) is calculated.

The results showed that the best performance is yielded by using the spectral sub-band

MFC between 2.5 and 3.5 kHz. This result was significantly better than using MFC(1.5–2.5) (Z = -2.201; p = 0.028) and using MFC(3.5–4.5) (Z = -2.201; p = 0.028). Although the performance of MFC(2.5–3.5) was better than that of MFC(fullband), this difference was not significant. These results are corroborated by previous studies which have shown that the spectral subband between 2500 Hz and 3500 Hz carries speaker-specific information (Furui 1986;

Parthasarathi et al. 2013).

We examined the effect of the number of mixtures in the GMM-UBM system. The best result was obtained by using a 256-mixture GMM-UBM and the MFC(2.5–3.5) acoustic feature. In this case the accuracy was 79.76%.

In this Chapter we studied various strategies to represent the speaker-specific

acoustic features to improve the performance of the GMM-UBM speaker recognition

system. We proved experimentally that the subband from 2.5 kHz to 3.5 kHz carries speaker-specific information in the spectrum. This result can be used for

improving the performance of speaker diarization.
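As an illustration of the GMM-UBM verification scheme described in this Chapter, the sketch below scores a test utterance with the log-likelihood ratio between a speaker GMM and the UBM using scikit-learn; the input files are hypothetical, the speaker model is trained directly rather than MAP-adapted, and the frames are assumed to be the subband MFCC features discussed above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X_ubm: pooled subband-MFCC frames from the 20 background speakers
# X_spk: 25 s of subband-MFCC frames from the target speaker
# X_test: 13 s of subband-MFCC frames from an unknown speaker
X_ubm, X_spk, X_test = (np.load(f) for f in
                        ("ubm_frames.npy", "speaker_frames.npy", "test_frames.npy"))  # hypothetical

n_mix = 256                                   # best-performing mixture count reported above
ubm = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(X_ubm)
spk = GaussianMixture(n_components=n_mix, covariance_type="diag").fit(X_spk)

# Average frame log-likelihood ratio: positive values favour the target speaker.
llr = spk.score(X_test) - ubm.score(X_test)   # .score() returns the mean log-likelihood
print("log-likelihood ratio:", llr)
```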


4.4 Speaker diarization

Previous speaker diarization systems have been set up on radio broadcast material in different languages, which can largely be considered half-spontaneous, as the show participants are familiar with the topic in advance. The BEA database contains more spontaneity than such corpora, because during the recording speech planning and speech production take place at the same time. Therefore, this work can also be considered a pioneering one, because speaker diarization of such conversations has not been developed so far.

Our corpus contained 7,827 turns from 100 conversations. One conversation

contained 78 turns, on average (std.: 41 turns; max.: 240 turns; min.: 11 turns). The male speakers produced 79 turns (std.: 45 turns), while females produced 65

turns (std.: 37 turns), on average. There were no statistically significant differences in

the number of turns between males and females (one-way ANOVA). We examined the ratio of the total duration of the recordings and the time a given

speaker spoke in the recording. The total speaking time of all of the interviewees was

40.3% of the total duration. Females talked more (37% of the total duration) than

males (42% of the total duration), but this difference was not significant (one-way

ANOVA). The total speaking time of the interviewer was 33.9% of the total duration. The

second participant talked the least, only 18.3% of the total duration (repeated-measures ANOVA: interviewer*second participant: F(2, 200) = 39.833; p < 0.001; interviewees*second participant: F(2, 200) = 39.833; p < 0.001) (Figure 2). These ratios indicate that during the conversation the roles were not fixed: the second participant often stayed in the background (there could be several reasons for this, for example the degree of familiarity).

Figure 2

The ratio of total duration of the recording and the speaker’s speaking time in the recording


We calculated how many turns were produced by each speaker per minute. The interviewees produced 1.38 turns per minute, on average. The interviewer produced 1.15 turns per minute, on average. The second participant produced 0.78 turns per minute, on average.

Additionally, we examined whether the speaking time and the turns-per-minute rate

were correlated. In the case of the interviewees, no tendency could be established; we could not tell whether those who talk a lot also take the floor more often. In the case of the interviewer, a moderate positive relationship between the speaking time and the turns-per-minute rate could be measured (Pearson correlation: r = 0.424, p < 0.001). The same trend was seen in the case of the second participant (Pearson correlation: r = 0.441, p < 0.001).

This supports that the subjects in the conversations did not have to seek the possibility

for taking the floor since the basic situation is that they are expected to speak. On the

contrary, the interviewer and the third participant of the conversation have to seek the possibility to take the floor as frequently as they can in order to produce longer

utterances.

In this thesis we built an unsupervised, BIC-based (Bayesian Information Criterion) speaker segmentation algorithm. The BIC is probably the most extensively used segmentation and clustering metric, justified by its simplicity and effectiveness. It uses a likelihood criterion penalized by the model complexity (the number of free parameters in the model), introduced by Schwarz (1971, 1978) as a model selection criterion. For the task of speaker segmentation, the technique was first used by Chen and Gopalakrishnan (Chen–Gopalakrishnan 1998), where a single full-covariance Gaussian was used for each of the models. Although not present in the original formulation, the λ parameter was introduced to adjust the effect of the penalty term on the comparison, which constitutes a hidden threshold on the BIC difference. This threshold needs to be tuned to the data, and therefore its correct setting has been the subject of constant study. Several researchers propose ways to automatically select λ

(Tritschler–Gopinath 1999; Delacourt–Wellekens 2000; Delacourt–Kryze–Wellekens

1999; Mori–Nakagawa 2001; Lopez–Ellis 2000; Vandecatseye et al. 2004).

A false alarm compensation (FAC) step was implemented to try to reduce the number

of incorrectly detected speaker changes. The symmetric Kullback-Leibler (KL2)

distance was measured between the data D1 and D2 centered around each proposed

change point, returned from the initial speaker segmentation (Siegler et al. 1997, Hung

et al. 2000; Adami et al. 2002). In this thesis the KL2 distance is used as a post-processing step with the purpose of reducing the number of false change points introduced by the BIC algorithm (Ida 2011).
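To make the segmentation criterion concrete, the following is a minimal sketch of the ΔBIC change-point test and the symmetric KL2 distance used for false alarm compensation, assuming MFCC-like frame vectors as input; the window length, step size, λ value and KL2 threshold are illustrative, not the tuned settings of the thesis.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """ΔBIC between the 'one Gaussian' and 'two Gaussians' hypotheses for the
    frames left (x) and right (y) of a candidate change point; positive values
    favour a speaker change. lam is the penalty weight λ."""
    z = np.vstack([x, y])
    n, d = z.shape
    logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y)) - lam * penalty

def kl2_distance(x, y):
    """Symmetric Kullback-Leibler (KL2) distance between single Gaussians fitted
    to x and y, used here to discard likely false change points."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    cx, cy = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    icx, icy = np.linalg.inv(cx), np.linalg.inv(cy)
    diff = mx - my
    return 0.5 * (np.trace(icy @ cx) + np.trace(icx @ cy) - 2 * len(mx)
                  + diff @ (icx + icy) @ diff)

def detect_changes(frames, win=200, step=10, lam=1.0, kl_threshold=5.0):
    """Slide a window over the frame sequence, propose BIC change points and
    keep only those whose KL2 distance exceeds a (tunable) threshold."""
    changes = []
    for t in range(win, len(frames) - win, step):
        left, right = frames[t - win:t], frames[t:t + win]
        if delta_bic(left, right, lam) > 0 and kl2_distance(left, right) > kl_threshold:
            changes.append(t)
    return changes
```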

The purpose of this stage is to associate or cluster segments from the same speaker together. The clustering ideally produces one cluster for each speaker in the audio

with all segments from a given speaker in a single cluster. The predominant approach


used in diarization systems is hierarchical, agglomerative clustering with a BIC-based stopping criterion (Chen–Gopalakrishnan 1998).

In our work the clusters are represented by GMM supervectors, with means adapted by MAP (maximum a posteriori) from the UBM (Reynolds et al. 2000). The GMM-UBM consists of 256 mixture components. The distance metric

between clusters was computed in the BIC clustering framework. To reduce the high dimensionality of the data, PCA (Principal Component Analysis) was used.
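A minimal sketch of this clustering stage is given below using scikit-learn; the supervector file is hypothetical, and a fixed number of clusters replaces the BIC-based stopping criterion for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

# supervectors: one stacked vector of MAP-adapted GMM means per segment
# (shape: n_segments x (n_mixtures * n_features)); hypothetical input file
supervectors = np.load("segment_supervectors.npy")

# Reduce the very high-dimensional supervectors before clustering.
reduced = PCA(n_components=50).fit_transform(supervectors)

# Agglomerative clustering; here the number of clusters is fixed for
# illustration, whereas the thesis uses a BIC-based stopping criterion.
labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(reduced)
print(labels)   # cluster index (speaker) for each segment
```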

A system hypothesizes a set of speaker segments, each of which consists of a (relative) speaker-id label such as 'spkr1', 'spkr2' or 'spkr3' and the corresponding

start and end times. This is then scored against reference ‘ground-truth’ speaker

segmentation which is generated using the rules given in (Fiscus et al. 2004). Since

the hypothesis speaker labels are relative, they must be matched appropriately to the

true speaker names in the reference. To accomplish this, a one-to-one mapping of the

reference speaker IDs to the hypothesis speaker IDs is performed so as to maximize

the total overlap of the reference and (corresponding) mapped hypothesis speakers.

Speaker diarization performance is then expressed in terms of the miss (speaker in

reference but not in hypothesis), false alarm (speaker in hypothesis but not in

reference), and speaker-error (mapped reference speaker is not the same as the

hypothesized speaker) rates. The overall diarization error (DER) is the sum of these

three components. A complete description of the evaluation measure and scoring

software implementing it can be found at http://nist.gov/speech/tests/rt/rt2004/fall. It should be noted that this measure is time-weighted, so the DER is primarily

driven by (relatively few) loquacious speakers and it is therefore more important to get

the main speakers complete and correct than to accurately find speakers who do not

speak much.
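For clarity, the following is a simplified, frame-based sketch of this scoring procedure (single-speaker labels only, no scoring collar, no overlap handling), assuming numpy and scipy; the NIST scoring software referenced above remains the reference implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def diarization_error_rate(ref, hyp):
    """Frame-based DER: ref and hyp are arrays of per-frame labels,
    where 0 means non-speech and a positive integer identifies a speaker."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    miss = np.sum((ref > 0) & (hyp == 0))            # speech missed by the system
    false_alarm = np.sum((ref == 0) & (hyp > 0))     # non-speech labelled as speech
    # one-to-one speaker mapping that maximizes reference/hypothesis overlap
    r_ids, h_ids = np.unique(ref[ref > 0]), np.unique(hyp[hyp > 0])
    overlap = np.array([[np.sum((ref == r) & (hyp == h)) for h in h_ids] for r in r_ids])
    rows, cols = linear_sum_assignment(-overlap)
    mapping = {h_ids[c]: r_ids[r] for r, c in zip(rows, cols)}
    mapped = np.array([mapping.get(h, -1) for h in hyp])
    speaker_error = np.sum((ref > 0) & (hyp > 0) & (mapped != ref))
    return (miss + false_alarm + speaker_error) / np.sum(ref > 0)
```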

From the BEA database, 12 conversations were selected randomly. The total duration of the conversations was 2.8 hours, which contained 480 speaker changes.

In this thesis we built a standard BIC-based speaker segmentation system to compare with the

proposed system. The standard BIC-based speaker segmentation used standard MFCCs, and the λ parameter value was 0. In this standard system we did not use any speech detection or overlap detection algorithm. Using the standard BIC-based speaker segmentation, the best DER (Diarization Error Rate) value was 39.43%.

To improve this result we used MFC for the spectral subband between 2.5 and 3.5

kHz and energy, along with deltas, in the standard BIC-based speaker segmentation. These features were mean and variance normalized. In Chapter 5 we demonstrated that using MFC(2.5–3.5) features the results could be improved. The results showed that when the BIC-based segmentation included MFC(2.5–3.5), the proposed method achieved an absolute DER reduction of about 0.87 percentage points (from 39.43% to 38.56%), which is a statistically

significant improvement (Wilcoxon test: Z = -2.824; p = 0.005).

The performance of BIC-based speaker segmentation depends on the penalty factor λ. We tested our speaker segmentation system using various values of λ (from 0 to 4). The best DER was obtained when the penalty factor was 1; in this case the DER was 35.73%.

The use of a speech/non-speech detector in speaker diarization is important to

ensure that acoustic models for each of the clusters correctly represent the speech data


and are not “contaminated” by non-speech information. By adding the speech/non-speech detector proposed for spontaneous speech, not only are the non-speech errors improved, but the speaker error is also reduced, due to a reduction in clustering errors as noted above. The results showed that when the BIC-based segmentation included VAD, the proposed method achieved an absolute DER reduction of about 4.5 percentage points (from 35.73% to 31.21%), which is a statistically significant improvement (Wilcoxon

test: Z = -3.059; p < 0.001).

State-of-the-art speaker diarization systems for meetings are now at a point where

overlapping speech contributes significantly to the errors made by the system.

However, little if any work has yet been done on detecting overlapping speech. We presented our initial work toward developing an overlap detection system for improved speaker diarization. We demonstrated an absolute improvement of about 2.5 percentage points in DER over the baseline diarization system (from 31.21% to 28.71%), which is a statistically significant improvement (Wilcoxon test: Z = -3.06; p = 0.002).

In this Chapter we presented the structure and the performance of the baseline

BIC-based speaker diarization system and our system. Chapter 5 has shown that the spectral subband ranging from 2500 Hz to 3500 Hz carries speaker-specific information. We exploited this for speaker change detection (SCD) by representing the subband using three MFCCs. We confirmed experimentally that this spectral subband can play an important role not only in speaker recognition but in speaker diarization as well. We presented how the optimal value of λ can be determined in the BIC. We proved experimentally that implementing the VAD algorithm in the speaker diarization system improves the results. Experimental results revealed that adding our overlapping speech detection method to the speaker diarization system reduces the diarization error rate by almost 2.5 percentage points.

Generally, in this thesis, the best result was yielded by the BIC-based method where the penalty value was 1, the features were MFC(2.5–3.5), and the system contained the VAD and overlap detection algorithms as well. In this case the average DER value was 28.71%.

5. CONCLUSIONS

This PhD thesis addressed the topic of speaker diarization for spontaneous

conversations. While answering the question “Who spoke when?”, the presented

speaker diarization system is able to process spontaneous speech and determine the

optimum output without any prior knowledge about the number of speakers or their

identities. Our speaker diarization system is based on unsupervised learning methods which can easily be adapted to other speech corpora.

The presented BIC-based system uses as its baseline the speaker diarization technology developed for broadcast news, and adapts it to spontaneous speech by developing new algorithms and improving existing ones to use speaker-specific features and to implement VAD and overlapping speech detection algorithms.


The results of this thesis consist of two main parts. In the first, we described three experiments: i) adapting and modifying voice activity detection; ii) developing overlapping speech detection; iii) evaluating speaker-specific features. In the second part, we confirmed experimentally that adding the algorithms presented in the first part to the baseline BIC-based speaker diarization system improves the results.

In the area of discourse modeling, speaker diarization could benefit from research

aiming at modeling the turn-taking between the speakers. Using information at a

higher level than simple acoustics, the transition probabilities between speakers could

be appropriately set to help the decoding.

6. REFERENCES

Bata Sarolta 2009. Beszélőváltások a beszédpartnerek személyes kapcsolatának

függvényében. In: Beszédkutatás 2009. 107–120. Beattie, G. W. 1982. Turn-taking and interruption in political interviews: Margaret

Thatcher and Jim Callaghan compared and contrasted. Semiotica 39: 93–114. Beke András 2008. Az alapfrekvencia-eloszlás modellezése a beszélőfelismeréshez.

Alkalmazott Nyelvtudomány 2008/1–2: 121–132. Boakye K. 2008. Audio Segmentation for Meetings Speech Processing. Ph.D.

dissertation, University of California at Berkeley, 2008.

Boakye K. – Vinyals O. – Friedland G. 2008a. Two's a crowd: Improving speaker

diarization by automatically identifying and excluding overlapped speech. In Proc.

Interspeech 2008. 32–35. Boakye K. –Trueba-Hornero B. –Vinyals O. –Friedland G. 2008b. Overlapped speech

detection for improved speaker diarization in multiparty meetings. Proc. ICASSP.

4353–4356, 2008.

Boakye, K. – Trueba-Hornero, B. – Vinyals, O. – Friedland, G. 2008c. Overlapped

speech detection for improved speaker diarization in multiparty meetings. In:

Proceeding of IEEE International Conference on Acoustics, Speech and Signal

Processing. Las Vegas, Nevada. 4353–4356. Boakye, K. – Vinyals, O. – Friedland, G. 2011. Improved Overlapped Speech

Handling for Speaker Diarization. In: Proceeding of INTERSPEECH 2011.

Firenze, Olaszország. 941–944. Bőhm Tamás 2006. A glottalizáció szerepe a beszélő személy felismerésében.

Beszédkutatás 2006. 197–207. Campbell, J. P. 1997. Speaker Recognition: A Tutorial. In: Proceedings of the the

Institute of Electrical and Electronic Engineers, Vol. 85, No. 9. 1437–1462. Çetin, Ö. – Shriberg, E. 2006. Analysis of overlaps in meetings by dialog factors, hot

spots, speakers, and collection site: insights for automatic speech recognition. In:

Proceedings of INTERSPEECH 2006. 293–296. Cettolo, M. – Vescovi, M. 2003. Efficient audio segmentation algorithms based on the

BIC. In: Proceedings of IEEE International Conference on Acoustics, Speech and

Signal Processing.


Chen, S. S. – Gopalakrishnan, P. 1998. Clustering via the Bayesian information

criterion with applications in speech recognition. In: Proceedings of IEEE

International Conference on Acoustics, Speech and Signal Processing. Seattle,

USA, 645–648. Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector

machines. ACM Transactions on Intelligent Systems and Technology. Retrieved: 2013.06.05. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Cho, Y.D.; Kondoz, A. (2001). Analysis and improvement of a statistical model-based

voice activity detector, IEEE Signal Processing Letters, vol. 8, no. 10, pp. 276–

278. Daniel P. W. Ellis 2005. PLP and RASTA and MFCC, and inversion in Matlab,

http://www.ee.columbia.edu/~dpwe/resources/matlab/rastamat/ Delacourt, P. – Kryze, D. – Wellekens, C. J. 1999. Detection of speaker changes in an

audio document. In: Proceedings of Eurospeech 1999. 1195–1198. Delacourt, P. – Wellekens, C. J. 2000, DISTBIC: A speaker-based segmentation for

audio data indexing. Speech Communication: Special Issue in Accessing

Information in Spoken Audio 32: 111–126. Fék Márk 1997. Beszélőfelismerés neurális hálózatokkal és vektorkvantálóval. OTDK

konferencia. Szeged 1997. Fiscus, J. G. – Garofolo, J. S. – Le, A. – Martin, A. F. – Pallett, D. S. – Przybocki, M.

A. – Sanders, G. 2004. Results of the fall 2004 STT and MDE evaluation. In:

Proceedings of the Fall 2004 Rich Transcription Workshop (RT-04), Palisades, NY,

Nov. 2004. Furui S. 1986. Research on individuality features in speech waves and automatic

speaker recognition techniques. Speech Communication, vol. 5, 183–197. Garfinkel, H. 1967. Studies in Ethnomethodology. Prentice Hall, Englewood Cliffs,

NJ. Giannakopoulos, T. 2009. Study and application of acoustic information for the

detection of harmful content, and fusion with visual information. Department of

Informatics and Telecommunications, University of Athens, Greece, PhD thesis. Goffman, E. 1983. The Interaction Order. American Sociological Review 48: 1–17. Gósy Mária 2012. Multifunkcionális beszélt nyelvi adatbázis – BEA. In Prószéky

Gábor – Váradi Tamás (szerk.): Általános Nyelvészeti Tanulmányok XXIV.

Nyelvtechnológiai kutatások. Akadémiai Kiadó, Budapest, 329–349. Gráczi Tekla Etelka – Bata Sarolta 2010. Megszólalási formák és funkciók az

összeszokottság függvényében. In: Gecső Tamás – Sárdi Csilla (szerk.) Új

módszerek az alkalmazott nyelvészeti kutatásban. Kodolányi János Főiskola, Tinta

Könyvkiadó, Székesfehérvár, Budapest. 28–32. Higgins, A. L. – Bahler, L. – Porter, J. 1991. Speaker verification using randomized

phrase prompting. Digital Signal Processing 1/2: 89–106. Hung, J. – Wang, H. – Lee, L. 2000. Automatic metric based speech segmentation for

broadcast news via principal component analysis. In: Proceedings of the

International Conference on Speech and Language Processing, Beijing, China.


Ida, O. 2011. Indexing of Audio Databases : Event Log of Broadcast News. PhD

thesis. Norwegian University of Science and Technology, Department of

Electronics and Telecommunications. Iványi Zsuzsanna 2001. A nyelvészeti konverzációelemzés. Magyar Nyelvőr 125.

74−93. Jin, Q. – Laskowski, K. – Schultz, T. – Waibel, A. 2004. Speaker segmentation and

clustering in meetings. In: Proceedings of NIST 2004 Spring Rich Transcrition

Evaluation Workshop, Montreal, Canada. 112–117. Markó Alexandra 2006. Beszélőváltások a társalgásban.

http://fonetika.nytud.hu/letolt/ma_2.pdf (Retrieved: 1 October 2011.) Matsui, T. – Furui, S. 1995. Likelihood normalization for speaker verification using a

phoneme- and speaker-independent model. Speech Communication 17: 109–116. Matza, A.– Bistritz, Y. 2011. Skew Gaussian mixture models for speaker recognition.

Presentation. In: Proceedings of the 12th

Annual Conference of the International

Speech Communication Association (INTERSPEECH) 2011. 28–31. Mohamed, G. Hinton, and G. Penn, 2012. Understanding how deep belief networks

perform acoustic modelling. In: Proc. ICASSP, pp. 4273-4276, 2012.

Mori, K. – Nakagawa, S. 2001. Speaker change detection and speaker clustering using

VQ distortion for broadcast news speech recognition. In: Proceedings of IEEE

International Conference on Acoustics, Speech and Signal Processing, Salt Lake

City, USA, 413–416.

Németh Géza – Olaszy Gábor (szerk.) 2010. A magyar beszéd. Beszédkutatás,

beszédtechnológia, beszédinformációs rendszerek (8–12. fejezet). Akadémiai

Kiadó, Budapest. Nikléczy Péter – Gósy Mária 2008. A személyazonosítás lehetősége a beszédanyag

időtartamának függvényében. Beszédkutatás 2008. 172–181. Nikléczy Péter 2003. A zönge periódusidejének funkciója a hangszínezetben.

Beszédkutatás 2003. 101–113. Parthasarathi, S. H. K. – Bourlard, H. – Gatica-Perez, D. 2013. Wordless Sounds:

Robust Speaker Diarization Using Privacy-Preserving Audio Representations.

IEEE Transactions on Audio, Speech & Language Processing 21(1), 83–96. Reynolds, D. A. – Quatieri, T. F. – Dunn, R. 2000. Speaker verification using adapted

Gaussian mixture models. Digital Signal Processing 10/1–3: 19–41. Reynolds, D. A. 1995. Speaker identification and verification using Gaussian mixture

speaker models. Speech Communication 17: 91–108. Reynolds, D. A. 1997. Comparison of background normalization methods for text-

independent speaker verification, In: Proceedings of 5th European Conference on

Speech Communication and Technology (Eurospeech). 963–966. Rosenberg, A. E. – DeLong, J. – Lee, C.-H. – Juang, B.-H. –Soong, F. K. 1992. The

use of cohort normalized scores for speaker verification. In: Proceedings of

International Conference on Spoken Language Processing. 599–602. Sacks, H. – Schegloff, E. A. – Jefferson, G. 1974. A simplest systematics for the

organization of turntaking for conversation. Language 50: 696–735.


Sacks, H. 1992. Lectures on Conversation. Blackwell, Oxford. Schegloff, E. 1992. Introduction. In: Sacks, H. Lectures on Conversation. Vol.1.

Blackwell, Oxford. 9–12. Schwarz, G. 1971. A sequential student test. The Annals of Statistics 42/3: 1003–1009. Shih-Sian Cheng – Hsin-Min Wang – Hsin-Chia Fu 2010. BIC-Based Speaker

Segmentation Using Divide-and-Conquer Strategies With Application to Speaker

Diarization. IEEE Transactions on Audio, Speech, and Language Processing, vol.

18, no. 1, pp. 141–157, Jan. 2010. Siegler, M. A. – Jain, U. – Raj, B. – Stern, R. M. 1997. Automatic segmentation,

classification and clustering of broadcast news audio. In: Proceedings of DARPA

Speech Recognition Workshop, 97–99. Sivakumaran, P. – Fortuna, J. – Ariyaeeinia, A. 2001. On the use of the Bayesian

information criterion in multiple speaker detection. In: Proceedings of Eurospeech

2001, Scandinavia. Stokoe, E. 2006. On ethnomethodology, feminism, and the analysis of categorial

reference to gender in talk-in-interaction. Sociological Review 54: 467–94. Tritschler, A. – Gopinath, R. 1999. Improved speaker segmentation and segments

clustering using the bayesian information criterion. In: Proceedings of Eurospeech

1999. 679–682. Trueba-Hornero B. 2008 Handling overlapped speech in speaker diarization. Master's

thesis, Universitat Politecnica de Catalunya, May 2008.

Vandecatseye, A. – Martens, J.-P. 2003. A fast, accurate and stream-based speaker

segmentation and clustering algorithm. In: Proceedings of Eurospeech 2003,

Geneva, Switzerland, 941–944. Vescovi, M. – Cettolo, M. – Rizzi, R. 2003. A DP algorithm for speaker change

detection. In: Proceedings of Eurospeech 2003. Wooters C. – Huijberts M. 2007. The ICSI RT07s speaker diarization system. In

Proceedings of the Rich Transcription 2007 Meeting Recognition Evaluation

Workshop, 2007. Baltimore, MD.

Ying, D. – Yan, Y. – Dang, J. – Soong, F. 2011. Voice Activity Detection Based On

An Unsupervised Learning Framework. In: IEEE Transactions on Audio, Speech

and Language Processing 19/8: 2624–2633.


7. PUBLICATIONS BY THE CANDIDATE ON THE TOPIC OF

THE THESIS

Beke A, Gósy M 2014. Phonetic analysis and automatic prediction of vowel duration

in Hungarian spontaneous speech. INTERNATIONAL JOURNAL OF

INTELLIGENT DECISION TECHNOLOGIES 10: 57-66.

Váradi V, Beke A 2013. Az artikulációs tempó variabilitása a felolvasásban.

BESZÉDKUTATÁS 21: 26–42.

Szaszák Gy, Beke A. 2013. Using phonological phrase segmentation to improve

automatic keyword spotting for the highly agglutinating Hungarian language. In:

14th Annual Conference of the International Speech Communication Association.

Lyon, Franciaország, 2013.08.25–2013.08.29.

Neuberger T, Beke A 2013. Automatic laughter detection in spontaneous speech using

GMM-SVM method. In: Habernal I, Matousek V (szerk.) Text, Speech, and

Dialogue: 16th International conference, TSD 2013, Pilsen, Czech Republic,

September 1–5, 2013. Proceedings. Berlin; Heidelberg: Springer Verlag, 2013.

113–120.

Gósy M, Bóna J, Beke A, Horváth V 2013. A kitöltött szünetek fonetikai sajátosságai

az életkor függvényében. BESZÉDKUTATÁS 21: 121–143.

Beke A, Szaszák Gy, Váradi V 2013. Automatic phrase segmentation and clustering in

spontaneous speech In: IEEE 4th International Conference on Cognitive

Infocommunications, CogInfoCom 2013, December 2–5, 2013

Szaszák, György and Beke, András: Exploiting Prosody for Syntactic Analysis in

Automatic Speech Understanding, Journal of Language Modelling, 0(1) 143–172.

(2012)

Szaszák Gy, Beke A 2012. Statisztikai módszerek alkalmazása a szintaktikai szerkezet

és a beszédjel prozódiai szerkezetének feltérképezéséhez olvasott és spontán

beszédben In: Gósy M (szerk.)

Beszéd, adatbázis, kutatások. Budapest: Akadémiai Kiadó, 2012. 236–250.

Szaszák Gy, Beke A 2012. Automatic prosodic and syntactic analysis from speech in

Cognitive Infocommunication. In: IEEE (szerk.)3rd IEEE International

Conference on Cognitive Infocommunications. CogInfoCom 2012. Proceedings.

Kosice, Szlovákia, 2012.12.02–2012.12.05.

Gósy M, Gyarmathy D, Horváth V, Gráczi TE, Beke A, Neuberger T, Nikléczy P

2012. BEA: Beszélt nyelvi adatbázis. In: Gósy M (szerk.) Beszéd, adatbázis,

kutatások. Budapest: Akadémiai Kiadó, 2012. 9–24.

Beke A, Szaszák Gy 2012. Unsupervised clustering of prosodic patterns in

spontaneous speech In: Sojka P, Horák A, Kopeček I, Pala K (szerk.) Text, Speech

and Dialogue: 15th International Conference, TSD 2012, Brno, Czech Republic,

September 3–7, 2012. Proceedings. Berlin: Springer, 2012. 648–655.

Beke A, Gósy M, Horváth V 2012. Gyakorisági vizsgálatok spontán beszédben.

BESZÉDKUTATÁS 20: 260–277.

Beke A, Gósy M 2012. Characteristic and spectral features used in automatic

prediction of vowel duration in spontaneous speech. In: IEEE (szerk.) 3rd IEEE


International Conference on Cognitive Infocommunications. CogInfoCom 2012.

Proceedings. Kosice, Szlovákia, 2012.12.02–2012.12.05.

Beke A 2012. Beszélőfelismerés kevert Gauss–modellekkel. In: Markó Alexandra

(szerk.) Beszédtudomány: Az anyanyelv–elsajátítástól a zöngekezdési időig.

Budapest: ELTE és MTA Nyelvtudományi Intézete, 2012. 335–352.

Beke A 2012. Beszéddetektálás spontán beszédben a beszélőváltás–detektáláshoz. In:

Váradi T (szerk.) VI. Alkalmazott Nyelvészeti Doktoranduszkonferencia:

Budapest, 2012. 02. 03. Budapest: MTA Nyelvtudományi Intézet, 2012. 14–23.

Beke A 2012. Az egyszerre beszélések automatikus osztályozása spontán magyar

társalgásokban. In: Bárdosi Vilmos (szerk.) Tanulmányok: Nyelvtudományi

Doktori Iskola, Budapest: ELTE BTK, 2012. 23–39.

Szaszák Gy, Nagy K, Beke A 2011. Analysing the correspondence between automatic

prosodic segmentation and syntactic structure. In: Piero Cosi, Renato De Mori,

Giuseppe Di Fabbrizio, Roberto Pieraccini (szerk.) Interspeech 2011, 12th Annual

Conference of the International Speech Communication Association. Firenze,

Olaszország, 2011.08.27–2011.08.31.

Gósy M, Beke A, Horváth V 2011. Temporális variabilitás a spontán beszédben.

BESZÉDKUTATÁS 19: 5–30.

Beke A 2011. Szókezdetek automatikus osztályozása spontán beszédben. MAGYAR

NYELVŐR 135: 226–241.

Beke A, Szaszák Gy 2010. Szótagok automatikus osztályozása spontán beszédben

spektrális és prozódiai jellemzők alapján. In: Tanács Attila, Vincze Veronika

(szerk.) VII. Magyar Számítógépes Nyelvészeti Konferencia: MSZNY 2010,

Szeged, Szegedi Tudományegyetem, 2010. 236–249.

Beke A, Szaszák Gy 2010. Kísérlet a szintaktikai szerkezet részleges automatikus

feltérképezésére a prozódiai szerkezet alapján. In: Tanács Attila, Vincze Veronika

(szerk.) VII. Magyar Számítógépes Nyelvészeti Konferencia: MSZNY 2010,

Szeged, Szegedi Tudományegyetem, 2010. 178–190.

Beke A, Szaszák Gy 2010. Automatic recognition of schwa variants in spontaneous

Hungarian speech. ACTA LINGUISTICA HUNGARICA 57:(2–3) 329–353.

Beke A, Szaszák GY 2009. A sávvariációk automatikus felismerése magyar nyelvű

spontán beszédben. BESZÉDKUTATÁS 17: 148–169.

Beke A 2009. A beszélő hangtartományának vizsgálata: Néhány statisztikai jellemző

az alapfrekvencia–eloszlásról. In: Keszler Borbála, Tátrai Szilárd (szerk.)

Diskurzus a grammatikában – grammatika a diskurzusban Budapest: Tinta

Könyvkiadó, 2009. 83–91.

Beke A 2008. Az alapfrekvencia–eloszlás modellezése a beszélőfelismeréshez.

ALKALMAZOTT NYELVTUDOMÁNY 8:(1–2) 121–133.

Beke A 2008. A felolvasás és a spontán beszéd alaphangszerkezeteinek vizsgálata.

BESZÉDKUTATÁS 16: 93–107.


8. TALKS GIVEN BY THE CANDIDATE ON THE TOPIC OF

THE THESIS

April 2007. A kérdés–válasz prozódiája számítógépes vizsgálattal. OTDK,

Székesfehérvár.

April 2007. A kérdés–válasz dallamszerkezetének fonetikai vizsgálata magyar nyelvű

társalgásokban. FÉLÚTON Konferencia, Budapest.

April 2008. A beszélő személy felismerése: az automatikus formánslekérdezés

eredményei. FÉLÚTON Konferencia, Budapest.

November 2008. A beszélő hangtartományának vizsgálata (néhány statisztikai

jellemző az alapfrekvencia–eloszlásról).: Diskurzus a grammatikában –

grammatika a diskurzusban (Új nézőpontok a magyar nyelv leírásában 2.)

Konferencia, Budapest, 2008. nov. 11–12.

November 2009. A svávariációk automatikus felismerése magyar nyelvű spontán

beszédben. Beszédkutatás 2009. Budapest. 2009. (with: Szaszák György)

November 2010. Temporális variabilitás a spontán beszédben. Kultúra és nyelv,

kulturális nyelvészet – Új nézőpontok a magyar nyelv leírásában 3. ELTE BTK:

Budapest (with: Horváth Viktória és Gósy Mária).

December 2010. Szótagok automatikus osztályozása spontán beszédben spektrális és

prozódiai jellemzők alapján. VII. Magyar Számítógépes Nyelvészeti Konferencia

(with: Szaszák György)

May 2011. Figyi, ki beszél most? A beszélők automatikus osztályozása a spontán

társalgásokban. XIII. Balatonalmádi Pszicholingvisztikai Nyári Egyetem.

May 2011. A hezitációs jelenségek gépi osztályozása a spontán beszédben. XIII.

Balatonalmádi Pszicholingvisztikai Nyári Egyetem. (with: Horváth Viktória)

August 2011. Analysing the correspondence between automatic prosodic

segmentation and syntactic structure. Interspeech 2011, Firenze, Olaszország (with:

Szaszák György)

October 2011. Gyakorisági mutatók a spontán beszédben. Beszédkutatás konferencia.

(with: Gósy Mária and Horváth Viktória)

October 2011. Kísérlet a szintaktikai szerkezet részleges automatikus feltérképezésére

a prozódiai szerkezet alapján. Beszédkutatás konferencia. (with: Szaszák György)

October 2011. Az ismétlések automatikus osztályozása a spontán beszédben.

Beszédkutatás konferencia. (with: Gyarmathy Dorottya)

November 2011. Beszélők szegmentálása és osztályozása társalgásban. A Magyar

Tudomány Ünnepe 2011 Beszédadatbázisok a kutatásban és az alkalmazásban.

MTA NYTUD. Budapest.

December 2011. A szintaktikai szerkezet automatikus feltérképezése a beszédjel

prozódiai elemzése alapján. VIII. Magyar Számítógépes Nyelvészeti Konferencia

(with: Szaszák György)

March 2012. Toward Exploring the Prosodic Structure of Spontaneous Speech by

Focusing on Automatic Modelling, IAST Workshop, Dublin (with: Szaszák

György)

April 2012. A beszélő személy gépi felismerése. Fonetikanap, ELTE BTK, Budapest.


September 2012. Unsupervised Clustering of Prosodic Patterns in Spontaneous

Speech. TSD, Brno, (with: Szaszák György)

September 2012. A szintaxis és a prozódia kapcsolata, BEA Workshop, MTA

Nyelvtudományi Intézet, Budapest (with: Szaszák György)

December 2012. Automatic prosodic and syntactic analysis from speech in Cognitive

Infocommunication, CogInfoCom konferencia, Kassa (with: Szaszák György)

December 2012. Characteristic and spectral features used in automatic prediction of

vowel duration in spontaneous speech, CogInfoCom konferencia, Kassa (with:

Gósy Mária)

March 2013. Automatic identification of discourse markers in spontaneous speech for

speaker diarization. SJUSK 2012. Copenhagen

March 2013. Automatic laughter detection in Hungarian spontaneous speech using

GMM/ANN hybrid method. SJUSK 2012. Copenhagen (with: Neuberger Tilda)

March 2013. Automatic classification of repeated words in Hungarian spontaneous

speech. ExAPP 2013. Copenhagen (with: Gyarmathy Dorottya)

August 2013. Using Phonological Phrase Segmentation to Improve Automatic

Keyword Spotting for the Highly Agglutinating Hungarian Language.

INTERSPEECH 2013, Lyon, Franciaország (with: Szaszák György)

December 2013. Temporal variability in spontaneous Hungarian speech. 6th Language

& Technology Conference: Human Language Technologies as a Challenge for

Computer Science and Linguistics (with: Gósy Mária, Horváth Viktória)

September 2013. Automatic Laughter Detection in Spontaneous Speech Using GMM–

SVM Method.Text, Speech, and Dialogue – 16th International Conference, TSD

2013, Pilsen, Czech Republic (with: Neuberger Tilda)

September 2013. A Logistic Regression Approach for the Improvement of Keyword

Spotting based on Phonological Phrasing. Text, Speech, and Dialogue – 16th

International Conference, TSD 2013, Pilsen, Czech Republic (with: Szaszák

György)